# postcode_boundaries

Synthesizes postcode boundary polygons for England and Wales from three datasets. UK postcodes don't have official boundary polygons — Royal Mail defines postcodes as sets of delivery addresses, not geographic areas. This pipeline constructs a plausible polygon for every postcode by combining Output Area boundaries, UPRN point locations, and INSPIRE cadastral parcels.

## The three input datasets

**1. Output Area (OA) boundaries** — ONS Census Output Areas are the smallest geographic unit in the UK census (~125 households each). They tile all of England and Wales with no gaps or overlaps. Stored in a GeoPackage in British National Grid (EPSG:27700, meters). ~190,000 OAs.

**2. UPRN lookup** — Every Unique Property Reference Number in England and Wales, with its grid coordinates (easting/northing in BNG), its postcode (`PCDS`), and its OA code (`OA21CD`). ~37 million rows. This is the critical bridge: it tells you which postcodes exist inside each OA, and where each address physically sits.

**3. INSPIRE Index Polygons** — Land Registry cadastral parcels covering most of England and Wales. Each ZIP contains a GML file with polygon coordinate lists representing individual land parcels (buildings, plots of land). ~24 million polygons. These give fine-grained building/plot outlines that are much more precise than anything you could derive from point locations alone.

## The four phases

### Phase 1: Loading data

**OA boundaries** (`oa_boundaries.py`): Opens the GeoPackage via SQLite and reads every row from `OA_2021_EW_BGC_V2`. Each row's `SHAPE` column is a GeoPackage binary blob: a standard 8-byte header, then a variable-size envelope (bounding box), then WKB geometry. `parse_gpkg_geometry` reads the flags byte (byte 3, zero-indexed) to extract the envelope type (0-4), looks up the envelope size, skips past the header and envelope, and hands the remaining WKB bytes to Shapely. Single-polygon MultiPolygons are unwrapped. Result: `dict[oa_code, Polygon]`, all in BNG.
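The header handling can be sketched with the standard library alone. This is an illustration rather than the module's actual code: `split_gpkg_blob` is a hypothetical name, and it stops at the WKB bytes instead of handing them to Shapely. The envelope sizes follow the GeoPackage binary geometry layout, where the envelope type sits in bits 1-3 of the flags byte.

```python
import struct

# Envelope type (bits 1-3 of the flags byte) -> envelope size in bytes.
ENVELOPE_BYTES = {0: 0, 1: 32, 2: 48, 3: 48, 4: 64}


def split_gpkg_blob(blob: bytes) -> tuple[int, bytes]:
    """Return (srs_id, wkb_bytes) from a GeoPackage geometry blob.

    Hypothetical helper for illustration only.
    """
    if blob[:2] != b"GP":
        raise ValueError("not a GeoPackage geometry blob")
    flags = blob[3]
    little_endian = flags & 0x01          # bit 0: byte order of the header ints
    envelope_type = (flags >> 1) & 0x07   # bits 1-3: envelope indicator
    srs_id = struct.unpack_from("<i" if little_endian else ">i", blob, 4)[0]
    wkb_start = 8 + ENVELOPE_BYTES[envelope_type]
    return srs_id, blob[wkb_start:]


# Hand-built blob: little-endian, envelope type 1 (four doubles), SRS 27700.
example = (
    b"GP" + bytes([0, 0b00000011]) + struct.pack("<i", 27700)
    + struct.pack("<4d", 0.0, 0.0, 1.0, 1.0) + b"WKB-PAYLOAD"
)
srs_id, wkb = split_gpkg_blob(example)
```

In the real pipeline the returned bytes would go to a WKB reader such as `shapely.wkb.loads`.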

**UPRNs** (`uprn.py`): The raw parquet has far more columns than needed. The lazy scan selects only four columns, filters out Scotland (OA codes starting with `S`), drops nulls and blank postcodes (stripping whitespace first), then sorts by OA code. The sorted result is written to a temp file with `sink_parquet`, which keeps Polars from doubling memory with an in-memory sort of ~37M rows.

After reading the sorted file back, it builds an offset dictionary. Rather than grouping into Python lists (which would create 37M Python string objects), it detects group boundaries by comparing each row's OA code to the previous row's. The result is `offsets[oa_code] = (start_row, end_row)` — a simple slice into the DataFrame. The OA column is then dropped since it's no longer needed, saving ~400MB.

`get_oa_uprns` later retrieves a single OA's data by slicing `df[start:end]` and extracting the coordinates and postcodes.
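A minimal sketch of the boundary-detection idea, using a plain Python list in place of the sorted DataFrame column. `build_offsets` is a hypothetical name and the OA codes are made up.

```python
def build_offsets(sorted_oa_codes):
    """Map each OA code to its (start_row, end_row) slice in a sorted column."""
    offsets = {}
    start = 0
    total = len(sorted_oa_codes)
    for row in range(1, total + 1):
        # A group ends at the end of the data or where the code changes.
        if row == total or sorted_oa_codes[row] != sorted_oa_codes[start]:
            offsets[sorted_oa_codes[start]] = (start, row)
            start = row
    return offsets


oa_codes = ["E001", "E001", "E002", "W001", "W001", "W001"]
postcodes = ["AA1 1AA", "AA1 1AB", "AA1 1AB", "BB2 2BA", "BB2 2BA", "BB2 2BB"]

offsets = build_offsets(oa_codes)
start, end = offsets["W001"]
w001_postcodes = postcodes[start:end]  # same slicing idea as df[start:end]
```

Once the offsets exist, the code column itself carries no extra information, which is why it can be dropped.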

### Phase 2: INSPIRE data

INSPIRE comes as ~350 ZIP files, each containing a GML file with thousands of `PREDEFINED` elements. Each element has a `posList` — a flat string of coordinate pairs.

**Parsing** (`inspire.py:parse_inspire_zip`): Uses `iterparse` for streaming XML parsing (constant memory per ZIP). For each `PREDEFINED` element, extracts the `posList` text, splits into floats, reshapes to Nx2. Calls `elem.clear()` after each element to free XML nodes immediately.
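The streaming pattern might look roughly like the sketch below. It is not the module's code: `iter_parcel_coords` is a hypothetical name, elements are matched by local tag name so no particular namespace URI is assumed, the sample document uses a placeholder namespace, and coordinates are returned as lists of tuples rather than reshaped numpy arrays.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def iter_parcel_coords(source):
    """Yield one list of (easting, northing) pairs per PREDEFINED element."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "PREDEFINED":
            continue
        for child in elem.iter():
            if local_name(child.tag) == "posList" and child.text:
                values = [float(v) for v in child.text.split()]
                # Pair up the flat list: e0 n0 e1 n1 ... -> (e0, n0), (e1, n1)
                yield list(zip(values[0::2], values[1::2]))
                break
        elem.clear()  # free this parcel's XML nodes before moving on


# Tiny stand-in document; the namespace URI is a placeholder.
SAMPLE = b"""<root xmlns:x="urn:example:placeholder">
  <x:PREDEFINED><x:posList>0 0 10 0 10 10 0 0</x:posList></x:PREDEFINED>
  <x:PREDEFINED><x:posList>5.5 5 6 5 6 6 5.5 5</x:posList></x:PREDEFINED>
</root>"""

parcels = list(iter_parcel_coords(io.BytesIO(SAMPLE)))
```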

**Caching** (`inspire.py:cache_inspire`): Parsing 350 ZIPs takes a while, so results are cached as three files:
- `inspire_coords.bin` — flat binary dump of all float64 coordinate pairs, streamed to disk as each ZIP is parsed
- `inspire_bboxes.npy` — (N, 4) array of `[min_e, min_n, max_e, max_n]` per polygon
- `inspire_offsets.npy` — (N, 2) array of `[byte_offset_into_coords_bin, n_points]`

Pre-allocates numpy arrays at 25M capacity and grows by 1.5x if needed (using in-place `resize` with `refcheck=False`). This avoids Python list overhead for 24M polygons. The coords file is written sequentially — each polygon's raw bytes are appended, and its byte offset is recorded.
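The grow-in-place step can be illustrated as follows. `ensure_capacity` is a hypothetical helper, and the `+ 1` is an assumption added so that very small arrays still make progress. `ndarray.resize` zero-fills the new rows and, for a C-ordered array whose column count is unchanged, leaves the existing rows where they were.

```python
import numpy as np


def ensure_capacity(arr: np.ndarray, needed_rows: int) -> np.ndarray:
    """Grow arr in place in 1.5x steps until it has at least needed_rows rows."""
    capacity = arr.shape[0]
    if needed_rows <= capacity:
        return arr
    while capacity < needed_rows:
        capacity = int(capacity * 1.5) + 1
    # refcheck=False skips the reference-count check, so the resize works
    # even though other names still point at this array.
    arr.resize((capacity,) + arr.shape[1:], refcheck=False)
    return arr


bboxes = np.full((4, 4), 7.0)   # tiny stand-in for the 25M-row allocation
ensure_capacity(bboxes, 5)      # one more polygon than fits
```

In-place resizing only works on an array that owns its data, which holds for arrays created with `np.zeros` or `np.full`.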

**Loading** (`inspire.py:load_inspire`): Bboxes and offsets are loaded into RAM (~1.1GB). Coords are memory-mapped — the OS pages them in on demand from the ~3GB file, never loading the whole thing.
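To make the file layout concrete, here is a small round trip through a coords file and its offsets array. The helper names are hypothetical and the polygons are toy data; only the layout (raw float64 pairs plus `[byte_offset, n_points]` rows) comes from the description above.

```python
import os
import tempfile

import numpy as np


def write_coords(path, polygons):
    """Append each polygon's float64 pairs to path; return (N, 2) offsets."""
    offsets = np.zeros((len(polygons), 2), dtype=np.int64)
    with open(path, "wb") as fh:
        for i, ring in enumerate(polygons):
            data = np.asarray(ring, dtype=np.float64)
            offsets[i] = (fh.tell(), len(data))  # byte offset, point count
            fh.write(data.tobytes())
    return offsets


def read_polygon(coords, offsets, i):
    """Slice polygon i out of a flat float64 view of the coords file."""
    byte_offset, n_points = offsets[i]
    start = byte_offset // 8  # 8 bytes per float64
    return np.asarray(coords[start:start + 2 * n_points]).reshape(-1, 2)


polygons = [
    [(0, 0), (1, 0), (1, 1), (0, 0)],
    [(5, 5), (6, 5), (6, 6), (5, 6), (5, 5)],
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "inspire_coords.bin")
    offsets = write_coords(path, polygons)
    coords = np.memmap(path, dtype=np.float64, mode="r")  # nothing read yet
    second = read_polygon(coords, offsets, 1).tolist()
    del coords  # drop the mapping before the temp directory is removed
```

Only the pages backing the requested slice need to be read from disk, which is what keeps resident memory low.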

**Candidate retrieval** (`inspire.py:get_inspire_candidates`): Given an OA's bounding box, performs a vectorized numpy overlap test against all 24M INSPIRE bboxes — four comparisons broadcast across the entire array. Typically matches 10-500 parcels per OA. Only those matches are materialized as Shapely Polygon objects by reading their coordinate slice from the memory-mapped file. Invalid polygons are repaired with `make_valid`.
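The four-comparison overlap test can be sketched like this, assuming the `[min_e, min_n, max_e, max_n]` column order of the bbox cache. `overlapping_bbox_indices` is a hypothetical name, and counting boxes that merely touch as overlapping is an assumption.

```python
import numpy as np


def overlapping_bbox_indices(bboxes: np.ndarray, query) -> np.ndarray:
    """Indices of rows in bboxes (N, 4) whose box overlaps the query box."""
    min_e, min_n, max_e, max_n = query
    mask = (
        (bboxes[:, 0] <= max_e)    # parcel starts before the query ends (east)
        & (bboxes[:, 2] >= min_e)  # parcel ends after the query starts (east)
        & (bboxes[:, 1] <= max_n)  # same two checks on the north axis
        & (bboxes[:, 3] >= min_n)
    )
    return np.flatnonzero(mask)


bboxes = np.array([
    [0, 0, 10, 10],
    [20, 20, 30, 30],
    [5, 5, 25, 25],
    [100, 100, 110, 110],
], dtype=np.float64)

hits = overlapping_bbox_indices(bboxes, (8, 8, 22, 22))
```

Each comparison runs over one whole column at once, so the cost is a handful of linear passes over the array with no per-polygon Python loop.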

### Phase 3: Processing OAs

The main loop in `__main__.py` iterates through every OA that has both a boundary polygon and UPRNs. For each OA, it retrieves the OA's UPRN points and postcodes.

**Fast path**: If every UPRN in the OA shares the same postcode, the entire OA polygon is assigned to that postcode. No geometry computation needed. This covers the majority of OAs (~70-80%).

**Slow path** (`process_oa.py`): For multi-postcode OAs, the algorithm has three stages:

#### Stage A: INSPIRE-based claiming

Build an STRtree spatial index over the INSPIRE candidate polygons. Convert all UPRN points to Shapely Point objects and batch-query the tree with `predicate="intersects"`. This returns pairs of (point_index, candidate_index) — which UPRNs fall inside which parcels.

For each INSPIRE parcel that contains at least one UPRN, run a majority vote: whichever postcode has the most UPRNs inside that parcel wins the parcel. Accumulate winning parcels per postcode, union them, and clip to the OA boundary. The result is `claimed[postcode] = polygon_within_oa`.
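The vote itself needs no geometry. The sketch below takes the (point_index, candidate_index) pairs as plain tuples; `vote_parcels` is a hypothetical name, the postcodes are made up, and the tie-break (the first postcode seen wins a tie) is an assumption, since the tie rule is not stated here.

```python
from collections import Counter, defaultdict


def vote_parcels(pairs, postcodes):
    """Return {candidate_index: winning_postcode} by majority of contained UPRNs."""
    votes = defaultdict(Counter)
    for point_idx, parcel_idx in pairs:
        votes[parcel_idx][postcodes[point_idx]] += 1
    # most_common(1) keeps first-seen order among equal counts.
    return {
        parcel: counts.most_common(1)[0][0]
        for parcel, counts in votes.items()
    }


postcodes = ["ZZ1 1AA", "ZZ1 1AA", "ZZ1 1AB", "ZZ1 1AB", "ZZ1 1AB"]
pairs = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1)]  # (point, parcel)

winners = vote_parcels(pairs, postcodes)
```

Parcels that contain no UPRN never appear in `pairs`, so they are left unclaimed for the later stages.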

Then resolve overlaps. INSPIRE parcels can overlap each other (digitization overlaps), so two postcodes might claim the same square meters. Walk through the `claimed` dict in insertion order, subtracting the running union of everything already assigned from each subsequent postcode's geometry. The postcode with the most parcel wins appears first in the dict, so it keeps any contested area.
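The running-union subtraction can be shown with Python sets standing in for polygons: set difference plays the role of a geometry `difference` and `|=` the role of `union`. This is a stand-in for illustration, not Shapely code. `resolve_overlaps` is a hypothetical name, and dropping postcodes left with nothing is an assumption.

```python
def resolve_overlaps(claimed):
    """Give each contested cell to the earliest claimant in dict order."""
    taken = set()     # running union of everything already assigned
    resolved = {}
    for postcode, area in claimed.items():
        remainder = area - taken
        if remainder:
            resolved[postcode] = remainder
        taken |= area
    return resolved


# Each integer is one grid cell; cells 2 and 3 are claimed more than once.
claimed = {"A": {1, 2, 3}, "B": {3, 4}, "C": {2, 3}}
resolved = resolve_overlaps(claimed)
```

The result depends entirely on dict order, which is why the ordering of `claimed` matters.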

#### Stage B: Voronoi distribution of remaining area

Subtract all claimed area from the OA polygon to get `remaining`. If remaining area > 0.01 sqm, pass ALL UPRN points (not just unclaimed ones) and the remaining polygon to `compute_voronoi_regions`.

The Voronoi computation (`voronoi.py`):
1. Converts coordinates to float64 (since BNG grid refs are integers)
2. Deduplicates points, keeping one per (coordinate, postcode) pair. When multiple postcodes share the same coordinate (e.g. a block of flats straddling a postcode boundary), each postcode gets its own point with a tiny 0.01m jitter so Voronoi can distinguish them
3. Adds 4 dummy points far outside the real points (10x the spatial extent). This guarantees every real point gets a bounded Voronoi region (otherwise edge points get infinite regions) and also prevents collinearity from crashing scipy
4. Runs `scipy.spatial.Voronoi` on all points
5. For each real point's Voronoi cell, constructs the polygon from the Voronoi vertices, clips to the boundary, groups by postcode
6. Unions per-postcode fragments

The effect: every unclaimed patch of the OA is assigned to the postcode of its nearest UPRN by straight-line distance, because a Voronoi cell is by definition the set of locations closer to its generator point than to any other generator.
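Steps 1 to 3 can be sketched without any geometry library. Several details here are assumptions made for illustration: the jitter is applied along the easting axis in multiples of 0.01m, the four dummy points sit at the corners of a square around the data, and `prepare_generators` is a hypothetical name.

```python
def prepare_generators(coords, postcodes, jitter=0.01, margin=10.0):
    """Deduplicate, jitter co-located postcodes, and append 4 far dummy points.

    Returns (points, labels); the first len(labels) points are real ones.
    """
    seen = set()
    used_at = {}              # coordinate -> number of postcodes placed there
    points, labels = [], []
    for (e, n), postcode in zip(coords, postcodes):
        e, n = float(e), float(n)          # BNG grid refs arrive as integers
        if (e, n, postcode) in seen:
            continue                        # one point per (coordinate, postcode)
        seen.add((e, n, postcode))
        k = used_at.get((e, n), 0)
        used_at[(e, n)] = k + 1
        points.append((e + k * jitter, n))  # k == 0 means no jitter
        labels.append(postcode)

    es = [p[0] for p in points]
    ns = [p[1] for p in points]
    centre_e = (min(es) + max(es)) / 2
    centre_n = (min(ns) + max(ns)) / 2
    # Floor the extent at 1m so a single point still gets distant dummies.
    reach = margin * max(max(es) - min(es), max(ns) - min(ns), 1.0)
    dummies = [
        (centre_e - reach, centre_n - reach),
        (centre_e + reach, centre_n - reach),
        (centre_e + reach, centre_n + reach),
        (centre_e - reach, centre_n + reach),
    ]
    return points + dummies, labels


coords = [(100, 200), (100, 200), (100, 200), (110, 200)]
postcodes = ["P1", "P1", "P2", "P1"]
all_points, labels = prepare_generators(coords, postcodes)
```

The combined point list is what would be handed to `scipy.spatial.Voronoi`. Because the real points come first, the cell for real point `i` can be looked up by index and the dummy cells ignored.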

#### Stage C: Combine

Each postcode gets its INSPIRE-claimed polygon (if any) plus its Voronoi share (if any). These are unioned together, validated, and stripped of any non-polygonal geometry debris from `make_valid`.

The output of `process_oa` is `list[(postcode, polygon)]` — the per-OA fragments. A single postcode that spans two OAs produces two separate fragments (one from each OA's processing).

### Phase 4: Merging and writing

**Fragment merging** (`output.py:merge_fragments`): Groups all fragments by postcode, unions them. If the result is a MultiPolygon (meaning the postcode has disconnected pieces — either from spanning OAs with a gap, or algorithm artifacts), applies a 1m buffer-then-unbuffer to close tiny gaps from floating-point mismatches at OA boundary edges. If still a MultiPolygon after that, keeps only the largest polygon — postcodes are contiguous delivery routes, so detached fragments are artifacts.

**GeoJSON output** (`output.py:write_district_geojson`): Groups postcodes by district (the outward code, e.g. `SW1A` from `SW1A 1AA`). For each district, converts every postcode polygon from BNG to WGS84 using pyproj, simplifies with 1m tolerance (Douglas-Peucker), rounds coordinates to 6 decimal places (~0.1m precision), and writes a single `{district}.geojson` FeatureCollection. Each Feature has `postcodes` (formatted like `"SW1A 1AA"`) and `mapit_code` (no space: `"SW1A1AA"`) in its properties.
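The district grouping and the two property values are plain string handling. The sketch below assumes postcodes arrive with a single space between outward and inward code, as in the `"SW1A 1AA"` example; the function names are hypothetical.

```python
from collections import defaultdict


def district_of(postcode: str) -> str:
    """Outward code, e.g. 'SW1A' from 'SW1A 1AA'."""
    return postcode.split()[0]


def feature_properties(postcode: str) -> dict:
    """Properties attached to each GeoJSON Feature."""
    return {"postcodes": postcode, "mapit_code": postcode.replace(" ", "")}


def group_by_district(postcodes):
    """Group postcodes so each district can be written to its own file."""
    groups = defaultdict(list)
    for postcode in postcodes:
        groups[district_of(postcode)].append(postcode)
    return dict(groups)


groups = group_by_district(["SW1A 1AA", "SW1A 2AA", "E1 6AN"])
props = feature_properties("SW1A 1AA")
```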

## Memory architecture

The pipeline is designed to run in <12GB:

| Dataset | Representation | Memory |
|---------|---------------|--------|
| OA boundaries | Python dict of Shapely objects | ~2GB |
| UPRNs | Polars DataFrame (Arrow columnar) + offset dict | ~1.5GB |
| INSPIRE bboxes | numpy float64 (N,4) | ~777MB |
| INSPIRE offsets | numpy int64 (N,2) | ~290MB |
| INSPIRE coords | memory-mapped file | ~0MB resident |
| Fragments | Python list of (str, Shapely) | grows during processing |

Key design choices:
- INSPIRE coords are memory-mapped, not loaded, so the OS pages in only the ~10-500 polygons needed per OA
- UPRNs sorted + offset dict avoids per-OA groupby allocation
- `sink_parquet` for the sort avoids doubling memory
- `release_memory()` calls `gc.collect()` + glibc `malloc_trim(0)` to return freed pages to the OS between phases
- All three large datasets are explicitly deleted before Phase 4
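A helper along the lines of `release_memory()` could look like the sketch below. It is an illustration, not the contents of `memory.py`: the library name `libc.so.6` and the silent fallback on platforms without glibc are assumptions.

```python
import ctypes
import gc


def release_memory() -> bool:
    """Collect garbage, then ask glibc to return free heap pages to the OS.

    Returns True if malloc_trim reports that memory was released, and
    False otherwise, including on platforms without glibc.
    """
    gc.collect()
    try:
        libc = ctypes.CDLL("libc.so.6")
        libc.malloc_trim.argtypes = [ctypes.c_size_t]
        libc.malloc_trim.restype = ctypes.c_int
        return bool(libc.malloc_trim(0))
    except (OSError, AttributeError):
        # Library or symbol not available (e.g. macOS, Windows, musl).
        return False


released = release_memory()
```

Calling it right after the large objects are deleted is what lets the process's resident size actually drop, since freed heap memory is not always handed back to the OS without `malloc_trim`.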

## Key invariants

1. **Every square meter of every processed OA is assigned to exactly one postcode**: INSPIRE claiming plus Voronoi fills the entire OA, and overlap resolution ensures no double-counting. Only OAs that have both a boundary polygon and UPRNs are processed
2. **Every postcode that exists in the UPRN data gets a polygon**: a postcode whose UPRNs all share coordinates with another postcode's UPRNs is still kept distinct by the jitter step. A postcode with no UPRNs left after filtering gets no polygon
3. **Postcode polygons never extend outside their OA(s)**: all geometry is clipped to OA boundaries
4. **Output is always a single Polygon, never a MultiPolygon**: the largest-polygon extraction in both `merge_fragments` and `to_wgs84_geojson` ensures this

## Module structure

```
postcode_boundaries/
  __init__.py         — Package docstring
  __main__.py         — CLI entry point, four-phase orchestration
  memory.py           — release_memory() glibc malloc_trim helper
  oa_boundaries.py    — GeoPackage parsing, OA boundary loading
  uprn.py             — UPRN loading (sorted DataFrame + offset dict), per-OA access
  inspire.py          — INSPIRE GML parsing, caching, loading, bbox candidate retrieval
  voronoi.py          — Voronoi region computation clipped to boundary
  process_oa.py       — Per-OA processing (INSPIRE assignment + Voronoi fallback)
  output.py           — BNG to WGS84 transform, fragment merging, GeoJSON writing
```

Invoked as:
```bash
uv run python -m pipeline.transform.postcode_boundaries \
  --uprn data/uprn_lookup.parquet \
  --oa-boundaries data/oa_boundaries.gpkg \
  --inspire data/inspire/ \
  --output data/postcode_boundaries/
```