2026-06-02 08:21:47 +01:00 · 2026-06-02 08:21:47 +01:00 · a04ac2d857
commit a04ac2d857
parent f99bd4e5c9
16 changed files with 132 additions and 74 deletions
--- a/pipeline/transform/postcode_boundaries/README.md
+++ b/pipeline/transform/postcode_boundaries/README.md
@@ -37,11 +37,11 @@ Pre-allocates numpy arrays at 25M capacity and grows by 1.5x if needed (using in

 **Loading** (`inspire.py:load_inspire`): Bboxes and offsets are loaded into RAM (~1.1GB). Coords are memory-mapped — the OS pages them in on demand from the ~3GB file, never loading the whole thing.

-**Candidate retrieval** (`inspire.py:get_inspire_candidates`): Given an OA's bounding box, performs a vectorized numpy overlap test against all 24M INSPIRE bboxes — four comparisons broadcast across the entire array. Typically matches 10-500 parcels per OA. Only those matches are materialized as Shapely Polygon objects by reading their coordinate slice from the memory-mapped file. Invalid polygons are repaired with `make_valid`.
+**Candidate retrieval** (`inspire.py:InspireIndex`): A uniform 1km grid index is built once over the 24M parcel bboxes (`build_inspire_index`). Each OA lookup (`InspireIndex.candidates`) gathers parcels from the cells its bounding box covers plus a small overflow list of parcels larger than one cell, then applies the exact bbox overlap test — O(cells + candidates) instead of an O(24M) scan per OA (the old linear scan was ~4h of the run on its own). The candidate set and order are identical to the scan. Typically matches 10-500 parcels per OA, materialized as Shapely Polygon objects by reading their coordinate slice from the memory-mapped file; invalid polygons are repaired with `make_valid`.

 ### Phase 3: Processing OAs

-The main loop in `__main__.py` iterates through every OA that has both a boundary polygon and UPRNs. For each OA, it retrieves the OA's UPRN points and postcodes.
+The main loop in `__main__.py` (`_process_oas`) iterates through every OA that has both a boundary polygon and UPRNs. For each OA, it retrieves the OA's UPRN points and postcodes. OAs are independent, so the loop fans out across CPU cores with a `fork` process pool (`--workers`, default all CPUs): workers share the big read-only inputs (INSPIRE arrays + coords mmap, UPRN arrays, OA geometries) copy-on-write and return WKB-encoded fragments. Workers slice the UPRN data from plain numpy/Arrow arrays (`extract_uprn_arrays`) rather than polars, avoiding the fork-after-threads hazard of polars' thread pool. Fragment order doesn't affect the output (`merge_fragments` unions per postcode), so the parallel result is identical to single-process.

 **Fast path**: If every UPRN in the OA shares the same postcode, the entire OA polygon is assigned to that postcode. No geometry computation needed. This covers the majority of OAs (~70-80%).
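The grid-based candidate retrieval described in the hunk above can be illustrated with a minimal sketch. This is not the code from `inspire.py`: `GridIndex`, `linear_scan`, and `overlaps` are hypothetical names, the sketch uses plain Python lists and a dict rather than numpy arrays, and it assumes bboxes are `(minx, miny, maxx, maxy)` tuples in metres. It only shows the idea: bucket small bboxes by the 1km cells they touch, keep oversized ones in an overflow list, and finish every lookup with the exact overlap test so the result matches a full scan.

```python
from collections import defaultdict
from math import floor


def overlaps(a, b):
    """Exact bbox overlap test: four comparisons on (minx, miny, maxx, maxy)."""
    return a[0] <= b[2] and a[2] >= b[0] and a[1] <= b[3] and a[3] >= b[1]


def linear_scan(bboxes, query):
    """Baseline: test every bbox, returning matching indices in ascending order."""
    return [i for i, box in enumerate(bboxes) if overlaps(box, query)]


class GridIndex:
    """Hypothetical uniform-grid index over bboxes (illustrative only)."""

    def __init__(self, bboxes, cell=1000.0):
        self.bboxes = bboxes
        self.cell = cell                  # cell size in metres (1 km assumed)
        self.cells = defaultdict(list)    # (cx, cy) -> indices of bboxes touching that cell
        self.overflow = []                # bboxes larger than one cell in either dimension
        for i, (minx, miny, maxx, maxy) in enumerate(bboxes):
            if maxx - minx > cell or maxy - miny > cell:
                self.overflow.append(i)
                continue
            for key in self._covered(minx, miny, maxx, maxy):
                self.cells[key].append(i)

    def _covered(self, minx, miny, maxx, maxy):
        """Yield every grid cell that the given bbox touches."""
        c = self.cell
        for cx in range(floor(minx / c), floor(maxx / c) + 1):
            for cy in range(floor(miny / c), floor(maxy / c) + 1):
                yield cx, cy

    def candidates(self, query):
        """Gather from covered cells plus overflow, then apply the exact test."""
        found = set(self.overflow)
        for key in self._covered(*query):
            found.update(self.cells.get(key, ()))
        # Sorting by index reproduces the order a linear scan would return.
        return [i for i in sorted(found) if overlaps(self.bboxes[i], query)]


bboxes = [
    (0.0, 0.0, 10.0, 10.0),
    (1500.0, 1500.0, 1600.0, 1600.0),
    (990.0, 990.0, 1010.0, 1010.0),
    (0.0, 0.0, 5000.0, 5000.0),      # larger than one cell: goes to overflow
    (3000.0, 3000.0, 3010.0, 3010.0),
]
index = GridIndex(bboxes)
query = (900.0, 900.0, 1100.0, 1100.0)
print(index.candidates(query))  # [2, 3]
```

Because the final filter is the same exact overlap test the scan uses, the grid only narrows which bboxes get tested; it cannot change which ones match. Deduplicating through a set matters because a bbox that straddles a cell boundary is stored in several cells.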