Andras Schmelczer f5e6894c0f Add postcode boundary calculation		2026-02-07 21:23:05 +00:00
__init__.py	Add postcode boundary calculation	2026-02-07 21:23:05 +00:00
__main__.py	Add postcode boundary calculation	2026-02-07 21:23:05 +00:00
inspire.py	Add postcode boundary calculation	2026-02-07 21:23:05 +00:00
memory.py	Add postcode boundary calculation	2026-02-07 21:23:05 +00:00
oa_boundaries.py	Add postcode boundary calculation	2026-02-07 21:23:05 +00:00
output.py	Add postcode boundary calculation	2026-02-07 21:23:05 +00:00
process_oa.py	Add postcode boundary calculation	2026-02-07 21:23:05 +00:00
README.md	Add postcode boundary calculation	2026-02-07 21:23:05 +00:00
test_postcode_boundaries.py	Add postcode boundary calculation	2026-02-07 21:23:05 +00:00
uprn.py	Add postcode boundary calculation	2026-02-07 21:23:05 +00:00
voronoi.py	Add postcode boundary calculation	2026-02-07 21:23:05 +00:00

README.md

postcode_boundaries

Synthesizes postcode boundary polygons for England and Wales from three datasets. UK postcodes don't have official boundary polygons — Royal Mail defines postcodes as sets of delivery addresses, not geographic areas. This pipeline constructs a plausible polygon for every postcode by combining Output Area boundaries, UPRN point locations, and INSPIRE cadastral parcels.

The three input datasets

1. Output Area (OA) boundaries — ONS Census Output Areas are the smallest geographic unit in the UK census (~125 households each). They tile all of England and Wales with no gaps or overlaps. Stored in a GeoPackage in British National Grid (EPSG:27700, meters). ~190,000 OAs.

2. UPRN lookup — Every Unique Property Reference Number in England and Wales, with its grid coordinates (easting/northing in BNG), its postcode (PCDS), and its OA code (OA21CD). ~37 million rows. This is the critical bridge: it tells you which postcodes exist inside each OA, and where each address physically sits.

3. INSPIRE Index Polygons — Land Registry cadastral parcels covering most of England and Wales. Each ZIP contains a GML file with polygon coordinate lists representing individual land parcels (buildings, plots of land). ~24 million polygons. These give fine-grained building/plot outlines that are much more precise than anything you could derive from point locations alone.

The four phases

Phase 1: Loading data

OA boundaries (oa_boundaries.py): Opens the GeoPackage via SQLite and reads every row from OA_2021_EW_BGC_V2. Each row's SHAPE column is a GeoPackage binary blob: a fixed 8-byte header, then a variable-size envelope (bounding box), then WKB geometry. parse_gpkg_geometry reads the flags byte (byte 3, zero-indexed) to extract the envelope type (0-4), looks up the envelope size, skips past the header and envelope, and hands the remaining WKB bytes to Shapely. Single-polygon MultiPolygons are unwrapped. Result: dict[oa_code, Polygon], all in BNG.
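A minimal sketch of the header-skipping step, using only the standard library. The function name wkb_offset and the sample blob are illustrative assumptions, not the code in oa_boundaries.py; the envelope sizes follow the GeoPackage binary layout, where the envelope indicator occupies bits 1-3 of the flags byte.

```python
import struct

# Envelope indicator -> envelope size in bytes (GeoPackage binary header).
ENVELOPE_SIZES = {0: 0, 1: 32, 2: 48, 3: 48, 4: 64}


def wkb_offset(blob: bytes) -> int:
    """Return the byte offset at which the WKB payload starts."""
    if blob[:2] != b"GP":
        raise ValueError("not a GeoPackage geometry blob")
    flags = blob[3]                       # byte 3 is the flags byte
    envelope_code = (flags >> 1) & 0b111  # bits 1-3 hold the envelope type
    if envelope_code not in ENVELOPE_SIZES:
        raise ValueError(f"invalid envelope code {envelope_code}")
    return 8 + ENVELOPE_SIZES[envelope_code]  # 8-byte header + envelope


# Hypothetical blob: header, an XY envelope (code 1), then a WKB point.
header = b"GP" + bytes([0, 0b00000011]) + struct.pack("<i", 27700)
envelope = struct.pack("<4d", 1.0, 1.0, 2.0, 2.0)
wkb_point = b"\x01" + struct.pack("<I", 1) + struct.pack("<2d", 1.0, 2.0)
blob = header + envelope + wkb_point

wkb = blob[wkb_offset(blob):]  # these are the bytes handed to Shapely
```

In the pipeline the resulting bytes would then go to Shapely's WKB loader; that call is left out here so the sketch has no third-party dependencies.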

UPRNs (uprn.py): The raw parquet has far more columns than needed. The lazy scan selects only four columns, filters out Scotland (OA codes starting with S), drops nulls and blank postcodes (stripping whitespace first), then sorts by OA code. The sort uses sink_parquet to write to a temp file — this avoids polars doubling memory from an in-memory sort on ~37M rows.

After reading the sorted file back, it builds an offset dictionary. Rather than grouping into Python lists (which would create 37M Python string objects), it detects group boundaries by comparing each row's OA code to the previous row's. The result is offsets[oa_code] = (start_row, end_row) — a simple slice into the DataFrame. The OA column is then dropped since it's no longer needed, saving ~400MB.

get_oa_uprns later retrieves a single OA's data by slicing df[start:end] and extracting the coordinates and postcodes.
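The offset dictionary can be sketched with plain Python lists standing in for the sorted Polars columns. The function name build_offsets and the sample codes are illustrative assumptions; the real implementation works on DataFrame columns rather than lists.

```python
def build_offsets(sorted_codes):
    """Map each OA code to its (start_row, end_row) run in a sorted column."""
    offsets = {}
    start = 0
    for i in range(1, len(sorted_codes) + 1):
        # A run ends at the end of the column or where the code changes.
        if i == len(sorted_codes) or sorted_codes[i] != sorted_codes[start]:
            offsets[sorted_codes[start]] = (start, i)
            start = i
    return offsets


# Hypothetical rows, already sorted by OA code.
oa_codes = ["E00000001"] * 3 + ["E00000002"] + ["W00000001"] * 2
postcodes = ["AB1 2CD", "AB1 2CD", "AB1 2CE", "AB1 3XY", "CF1 1AA", "CF1 1AB"]

offsets = build_offsets(oa_codes)
start, end = offsets["E00000001"]
oa_postcodes = postcodes[start:end]  # same idea as slicing df[start:end]
```

Because the rows are sorted, each OA occupies one contiguous run, so a pair of integers per OA is enough to recover its rows.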

Phase 2: INSPIRE data

INSPIRE comes as ~350 ZIP files, each containing a GML file with thousands of PREDEFINED elements. Each element has a posList — a flat string of coordinate pairs.

Parsing (inspire.py:parse_inspire_zip): Uses iterparse for streaming XML parsing (constant memory per ZIP). For each PREDEFINED element, extracts the posList text, splits into floats, reshapes to Nx2. Calls elem.clear() after each element to free XML nodes immediately.
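A rough sketch of the streaming parse with the standard library's iterparse. The sample document, its namespace URIs, and the function name iter_polygons are assumptions made for illustration; real INSPIRE GML is nested more deeply, and the pipeline reshapes into numpy arrays rather than lists of tuples.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical two-parcel snippet; namespaces and nesting are assumptions.
SAMPLE_GML = b"""<root xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:LR="urn:example:lr">
  <LR:PREDEFINED>
    <gml:posList>0 0 10 0 10 10 0 0</gml:posList>
  </LR:PREDEFINED>
  <LR:PREDEFINED>
    <gml:posList>20.5 20 30 20 30 30.25 20.5 20</gml:posList>
  </LR:PREDEFINED>
</root>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def iter_polygons(stream):
    """Yield one list of (easting, northing) pairs per posList."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "PREDEFINED":
            continue
        for node in elem.iter():
            if local_name(node.tag) == "posList" and node.text:
                values = [float(v) for v in node.text.split()]
                # Pair up the flat list: the Nx2 reshape done by hand.
                yield list(zip(values[0::2], values[1::2]))
        elem.clear()  # drop the element's children once it has been read


polygons = list(iter_polygons(io.BytesIO(SAMPLE_GML)))
```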

Caching (inspire.py:cache_inspire): Parsing 350 ZIPs takes a while, so results are cached as three files:

inspire_coords.bin — flat binary dump of all float64 coordinate pairs, streamed to disk as each ZIP is parsed
inspire_bboxes.npy — (N, 4) array of [min_e, min_n, max_e, max_n] per polygon
inspire_offsets.npy — (N, 2) array of [byte_offset_into_coords_bin, n_points]

Pre-allocates numpy arrays at 25M capacity and grows by 1.5x if needed (using in-place resize with refcheck=False). This avoids Python list overhead for 24M polygons. The coords file is written sequentially — each polygon's raw bytes are appended, and its byte offset is recorded.
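The pre-allocate-and-grow pattern might look roughly like this. BBoxBuffer is a hypothetical helper written for this example; only the 1.5x growth factor and the resize(refcheck=False) call are taken from the description above.

```python
import numpy as np


class BBoxBuffer:
    """Append-only (N, 4) float64 buffer that grows by 1.5x when full."""

    def __init__(self, capacity: int):
        self.data = np.zeros((capacity, 4), dtype=np.float64)
        self.count = 0

    def append(self, bbox) -> None:
        if self.count == len(self.data):
            new_capacity = max(self.count + 1, int(len(self.data) * 1.5))
            # In-place resize keeps existing rows (row-major layout) and
            # zero-fills the new ones. refcheck=False skips numpy's
            # reference-count check, so no views of the old buffer may be
            # in use at this point.
            self.data.resize((new_capacity, 4), refcheck=False)
        self.data[self.count] = bbox
        self.count += 1

    def finish(self) -> np.ndarray:
        """Return only the rows that were actually filled."""
        return self.data[: self.count]


buf = BBoxBuffer(capacity=2)  # tiny capacity to force growth in the demo
for i in range(5):
    buf.append((i, i, i + 1.0, i + 1.0))
bboxes = buf.finish()
```

Starting from a capacity close to the expected polygon count means the resize branch should rarely run in practice.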

Loading (inspire.py:load_inspire): Bboxes and offsets are loaded into RAM (~1.1GB). Coords are memory-mapped — the OS pages them in on demand from the ~3GB file, never loading the whole thing.
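As an illustration of how the coords file and the offsets array fit together, the sketch below writes two polygons sequentially and reads one back through a memory map. The function names and the temporary file are assumptions for the example, not the layout of inspire.py.

```python
import os
import tempfile

import numpy as np


def write_coords(path, polygons):
    """Append each polygon's float64 bytes; record (byte_offset, n_points)."""
    offsets = np.empty((len(polygons), 2), dtype=np.int64)
    byte_offset = 0
    with open(path, "wb") as fh:
        for i, coords in enumerate(polygons):
            arr = np.ascontiguousarray(coords, dtype=np.float64)
            fh.write(arr.tobytes())
            offsets[i] = (byte_offset, len(arr))
            byte_offset += arr.nbytes
    return offsets


def read_polygon(coords_mm, offsets, index):
    """Slice one polygon's coordinates out of the memory-mapped file."""
    byte_offset, n_points = offsets[index]
    start = byte_offset // 8  # float64 items are 8 bytes each
    flat = coords_mm[start : start + 2 * n_points]
    return np.asarray(flat).reshape(n_points, 2)


polygons = [
    np.array([[0, 0], [1, 0], [1, 1]], dtype=np.float64),
    np.array([[5, 5], [6, 5], [6, 6], [5, 6]], dtype=np.float64),
]
path = os.path.join(tempfile.mkdtemp(), "coords.bin")
offsets = write_coords(path, polygons)

# mode="r" maps the file read-only; pages are loaded only when touched.
coords_mm = np.memmap(path, dtype=np.float64, mode="r")
second = read_polygon(coords_mm, offsets, 1)
```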

Candidate retrieval (inspire.py:get_inspire_candidates): Given an OA's bounding box, performs a vectorized numpy overlap test against all 24M INSPIRE bboxes — four comparisons broadcast across the entire array. Typically matches 10-500 parcels per OA. Only those matches are materialized as Shapely Polygon objects by reading their coordinate slice from the memory-mapped file. Invalid polygons are repaired with make_valid.
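The four-comparison overlap test can be sketched as follows. The function name bbox_candidates and the sample boxes are illustrative; the column order follows the [min_e, min_n, max_e, max_n] layout of the cached bboxes.

```python
import numpy as np


def bbox_candidates(bboxes: np.ndarray, query) -> np.ndarray:
    """Return indices of rows whose bbox overlaps the query bbox."""
    min_e, min_n, max_e, max_n = query
    mask = (
        (bboxes[:, 0] <= max_e)    # parcel starts before the query ends (E)
        & (bboxes[:, 2] >= min_e)  # parcel ends after the query starts (E)
        & (bboxes[:, 1] <= max_n)  # same two tests on the northing axis
        & (bboxes[:, 3] >= min_n)
    )
    return np.flatnonzero(mask)


# Hypothetical parcel bboxes and one OA bbox.
parcel_bboxes = np.array(
    [
        [0, 0, 10, 10],
        [20, 20, 30, 30],
        [5, 5, 25, 25],
        [100, 100, 110, 110],
    ],
    dtype=np.float64,
)
hits = bbox_candidates(parcel_bboxes, (8, 8, 22, 22))
```

A bbox hit only says the rectangles overlap, so the matched parcels still need the exact geometry tests in Phase 3.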

Phase 3: Processing OAs

The main loop in __main__.py iterates through every OA that has both a boundary polygon and UPRNs. For each OA, it retrieves the OA's UPRN points and postcodes.

Fast path: If every UPRN in the OA shares the same postcode, the entire OA polygon is assigned to that postcode. No geometry computation needed. This covers the majority of OAs (~70-80%).

Slow path (process_oa.py): For multi-postcode OAs, the algorithm has three stages:

Stage A: INSPIRE-based claiming

Build an STRtree spatial index over the INSPIRE candidate polygons. Convert all UPRN points to Shapely Point objects and batch-query the tree with predicate="intersects". This returns pairs of (point_index, candidate_index) — which UPRNs fall inside which parcels.

For each INSPIRE parcel that contains at least one UPRN, run a majority vote: whichever postcode has the most UPRNs inside that parcel wins the parcel. Accumulate winning parcels per postcode, union them, and clip to the OA boundary. The result is claimed[postcode] = polygon_within_oa.
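The majority vote itself needs no geometry once the (point_index, candidate_index) pairs are known. A small sketch, where vote_parcels and the sample data are assumptions, and where ties go to whichever postcode was counted first (the source does not say how ties are broken):

```python
from collections import Counter, defaultdict


def vote_parcels(pairs, postcodes):
    """Return {candidate_index: winning postcode} by majority of UPRNs."""
    votes = defaultdict(Counter)
    for point_index, candidate_index in pairs:
        votes[candidate_index][postcodes[point_index]] += 1
    # most_common(1) returns the highest count; ties keep first-seen order.
    return {
        candidate: counter.most_common(1)[0][0]
        for candidate, counter in votes.items()
    }


# Hypothetical OA with five UPRNs and two parcels.
uprn_postcodes = ["SW1A 1AA", "SW1A 1AA", "SW1A 2AA", "SW1A 2AA", "SW1A 2AA"]
pairs = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1)]
winners = vote_parcels(pairs, uprn_postcodes)
```

Parcel 0 holds two SW1A 1AA addresses and one SW1A 2AA address, so it goes to SW1A 1AA; parcel 1 holds only SW1A 2AA addresses.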

Then resolve overlaps: INSPIRE parcels can overlap geographically (digitization overlaps), so two postcodes might claim the same square meters. Walk through the claimed dict in insertion order, subtracting the running union of earlier claims from each subsequent postcode's geometry. Earlier entries therefore take priority: the postcode with the most winning parcels appears first and keeps any contested area.
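The running-union subtraction can be illustrated without Shapely by letting sets of unit grid cells stand in for polygons. This is a stand-in for the real geometry operations (difference and union), and resolve_overlaps is a name chosen for the example.

```python
def resolve_overlaps(claimed):
    """Give each contested cell to the first postcode that claimed it."""
    taken = set()   # running union of everything assigned so far
    resolved = {}
    for postcode, cells in claimed.items():  # dicts keep insertion order
        remaining = cells - taken            # subtract earlier claims
        if remaining:
            resolved[postcode] = remaining
        taken |= remaining
    return resolved


# Hypothetical claims: cell (1, 0) is claimed by both postcodes.
claimed = {
    "SW1A 1AA": {(0, 0), (1, 0)},
    "SW1A 2AA": {(1, 0), (2, 0)},
}
resolved = resolve_overlaps(claimed)
```

After this pass no cell belongs to two postcodes, which is the property the geometric version relies on.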

Stage B: Voronoi distribution of remaining area

Subtract all claimed area from the OA polygon to get remaining. If remaining area > 0.01 sqm, pass ALL UPRN points (not just unclaimed ones) and the remaining polygon to compute_voronoi_regions.

The Voronoi computation (voronoi.py):

Converts coordinates to float64 (since BNG grid refs are integers)
Deduplicates points, keeping one per (coordinate, postcode) pair. When multiple postcodes share the same coordinate (e.g. a block of flats straddling a postcode boundary), each postcode gets its own point with a tiny 0.01m jitter so Voronoi can distinguish them
Adds 4 dummy points far outside the real points (10x the spatial extent). This guarantees every real point gets a bounded Voronoi region (otherwise edge points get infinite regions) and also prevents collinearity from crashing scipy
Runs scipy.spatial.Voronoi on all points
For each real point's Voronoi cell, constructs the polygon from the Voronoi vertices, clips to the boundary, groups by postcode
Unions per-postcode fragments
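The point preparation in the second and third steps might be sketched as below. The jitter direction (along the easting axis) and the square placement of the dummy points are assumptions; the source only states a 0.01m jitter and a distance of 10x the spatial extent. Both function names are hypothetical.

```python
def dedupe_with_jitter(points, postcodes, jitter=0.01):
    """Keep one point per (coordinate, postcode); offset shared coordinates."""
    seen = set()
    per_coord = {}  # how many postcodes already sit on each coordinate
    out_points, out_postcodes = [], []
    for (e, n), pc in zip(points, postcodes):
        if (e, n, pc) in seen:
            continue
        seen.add((e, n, pc))
        k = per_coord.get((e, n), 0)
        per_coord[(e, n)] = k + 1
        # The k-th postcode on a coordinate is shifted k * jitter metres east.
        out_points.append((float(e) + k * jitter, float(n)))
        out_postcodes.append(pc)
    return out_points, out_postcodes


def dummy_points(points, factor=10.0):
    """Four corner points far outside the data so real cells are bounded."""
    es = [p[0] for p in points]
    ns = [p[1] for p in points]
    cx, cy = (min(es) + max(es)) / 2, (min(ns) + max(ns)) / 2
    span = max(max(es) - min(es), max(ns) - min(ns), 1.0)
    d = factor * span
    return [(cx - d, cy - d), (cx + d, cy - d), (cx + d, cy + d), (cx - d, cy + d)]


# Hypothetical UPRNs: three share a coordinate, two of them share a postcode.
raw_points = [(100, 200), (100, 200), (100, 200), (150, 200)]
raw_postcodes = ["AB1 2CD", "AB1 2CD", "AB1 2CE", "AB1 2CD"]
pts, pcs = dedupe_with_jitter(raw_points, raw_postcodes)
corners = dummy_points(pts)
```

The real points plus the four corners would then be passed to scipy.spatial.Voronoi, with the corner cells discarded afterwards.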

The effect: every unclaimed patch of the OA is assigned to the nearest postcode by straight-line distance, because a Voronoi cell is exactly the set of locations closer to its generator point than to any other generator.
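That nearest-generator property can be checked with a brute-force lookup. This is not the scipy-based computation in voronoi.py, only an equivalent way to ask which postcode a given location would fall under; nearest_postcode and the sample points are illustrative.

```python
import math


def nearest_postcode(location, points, postcodes):
    """Return the postcode of the generator point closest to location."""
    best = min(range(len(points)), key=lambda i: math.dist(location, points[i]))
    return postcodes[best]


# Hypothetical generators: two UPRNs 10 m apart with different postcodes.
points = [(0.0, 0.0), (10.0, 0.0)]
postcodes = ["AB1 2CD", "AB1 2CE"]

west = nearest_postcode((2.0, 1.0), points, postcodes)
east = nearest_postcode((8.0, 5.0), points, postcodes)
```

The boundary between the two cells is the perpendicular bisector at easting 5, so any location west of it resolves to the first postcode and any location east of it to the second.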

Stage C: Combine

Each postcode gets its INSPIRE-claimed polygon (if any) plus its Voronoi share (if any). These are unioned together, validated, and stripped of any non-polygonal geometry debris from make_valid.

The output of process_oa is list[(postcode, polygon)] — the per-OA fragments. A single postcode that spans two OAs produces two separate fragments (one from each OA's processing).

Phase 4: Merging and writing

Fragment merging (output.py:merge_fragments): Groups all fragments by postcode and unions them. If the result is a MultiPolygon (meaning the postcode has disconnected pieces, either from spanning OAs with a gap or from algorithm artifacts), applies a 1m buffer-then-unbuffer to close tiny gaps from floating-point mismatches at OA boundary edges. If the result is still a MultiPolygon after that, keeps only the largest polygon: the pipeline treats postcodes as contiguous delivery areas, so detached fragments are assumed to be artifacts.

GeoJSON output (output.py:write_district_geojson): Groups postcodes by district (the outward code, e.g. SW1A from SW1A 1AA). For each district, converts every postcode polygon from BNG to WGS84 using pyproj, simplifies with 1m tolerance (Douglas-Peucker), rounds coordinates to 6 decimal places (~0.1m precision), and writes a single {district}.geojson FeatureCollection. Each Feature has postcodes (formatted like "SW1A 1AA") and mapit_code (no space: "SW1A1AA") in its properties.
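The grouping and property formatting could be sketched like this, with the reprojection and simplification steps left out. The helper names and the sample ring are assumptions; the property names postcodes and mapit_code and the 6-decimal rounding come from the description above.

```python
import json
from collections import defaultdict


def district_of(postcode: str) -> str:
    """Outward code: the part before the space, e.g. 'SW1A'."""
    return postcode.split(" ")[0]


def make_feature(postcode, ring_lonlat):
    """Build one GeoJSON Feature from an already-reprojected outer ring."""
    return {
        "type": "Feature",
        "properties": {
            "postcodes": postcode,
            "mapit_code": postcode.replace(" ", ""),
        },
        "geometry": {
            "type": "Polygon",
            "coordinates": [
                [[round(lon, 6), round(lat, 6)] for lon, lat in ring_lonlat]
            ],
        },
    }


# Hypothetical WGS84 ring (lon, lat), closed by repeating the first vertex.
ring = [
    (-0.14158812, 51.50101234),
    (-0.14100049, 51.50101234),
    (-0.14100049, 51.50150951),
    (-0.14158812, 51.50101234),
]
by_district = defaultdict(list)
for pc in ["SW1A 1AA", "SW1A 2AA", "E1 6AN"]:
    by_district[district_of(pc)].append(make_feature(pc, ring))

collections = {
    district: json.dumps({"type": "FeatureCollection", "features": features})
    for district, features in by_district.items()
}
```

Each value in collections corresponds to the contents of one {district}.geojson file.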

Memory architecture

The pipeline is designed to run in <12GB:

Dataset	Representation	Memory
OA boundaries	Python dict of Shapely objects	~2GB
UPRNs	Polars DataFrame (Arrow columnar) + offset dict	~1.5GB
INSPIRE bboxes	numpy float64 (N,4)	~777MB
INSPIRE offsets	numpy int64 (N,2)	~290MB
INSPIRE coords	memory-mapped file	~0MB resident
Fragments	Python list of (str, Shapely)	grows during processing

Key design choices:

INSPIRE coords are memory-mapped, not loaded: the OS pages in only the 10-500 polygons typically needed per OA
UPRNs sorted + offset dict avoids per-OA groupby allocation
sink_parquet for the sort avoids doubling memory
release_memory() calls gc.collect() + glibc malloc_trim(0) to return freed pages to the OS between phases
All three large datasets are explicitly deleted before Phase 4
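A helper along the lines of release_memory() might look like the sketch below. The fallback behaviour and the boolean return value are assumptions added for illustration; malloc_trim is a glibc extension, so the call is guarded for platforms where it is unavailable.

```python
import ctypes
import gc


def release_memory() -> bool:
    """Collect garbage, then ask glibc to return free heap pages to the OS.

    Returns True if malloc_trim was called, False if it is not available
    (for example on macOS, Windows, or musl-based Linux).
    """
    gc.collect()
    try:
        libc = ctypes.CDLL("libc.so.6")
        libc.malloc_trim(0)  # 0 = keep no extra padding at the heap top
    except (OSError, AttributeError):
        return False
    return True


# Typical use between phases: drop the references first, then trim.
big = [bytes(1024) for _ in range(10_000)]
del big
trimmed = release_memory()
```

Deleting the Python references has to come first, because malloc_trim can only hand back pages that the allocator already considers free.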

Key invariants

Every square meter of every processed OA (one with both a boundary and UPRNs) is assigned to exactly one postcode in Phase 3: INSPIRE claiming plus Voronoi fills the entire OA, and overlap resolution ensures no double-counting
Every postcode that appears in the filtered UPRN data gets a polygon, including postcodes whose UPRNs all share coordinates with another postcode's UPRNs (handled by jitter); a postcode with zero UPRNs never enters the pipeline
Postcode polygons never extend outside their OA(s) — all geometry is clipped to OA boundaries
Output is always single Polygon, never MultiPolygon — the largest-polygon extraction in both merge_fragments and to_wgs84_geojson ensures this

Module structure

postcode_boundaries/
  __init__.py         — Package docstring
  __main__.py         — CLI entry point, four-phase orchestration
  memory.py           — release_memory() glibc malloc_trim helper
  oa_boundaries.py    — GeoPackage parsing, OA boundary loading
  uprn.py             — UPRN loading (sorted DataFrame + offset dict), per-OA access
  inspire.py          — INSPIRE GML parsing, caching, loading, bbox candidate retrieval
  voronoi.py          — Voronoi region computation clipped to boundary
  process_oa.py       — Per-OA processing (INSPIRE assignment + Voronoi fallback)
  output.py           — BNG to WGS84 transform, fragment merging, GeoJSON writing

Invoked as:

uv run python -m pipeline.transform.postcode_boundaries \
  --uprn data/uprn_lookup.parquet \
  --oa-boundaries data/oa_boundaries.gpkg \
  --inspire data/inspire/ \
  --output data/postcode_boundaries/