idk
This commit is contained in:
parent
a04ac2d857
commit
d43da9708c
47 changed files with 4120 additions and 573 deletions
|
|
@ -77,9 +77,9 @@ The output of `process_oa` is `list[(postcode, polygon)]` — the per-OA fragmen
|
|||
|
||||
### Phase 4: Merging and writing
|
||||
|
||||
**Fragment merging** (`output.py:merge_fragments`): Groups all fragments by postcode, unions them. If the result is a MultiPolygon (meaning the postcode has disconnected pieces — either from spanning OAs with a gap, or algorithm artifacts), applies a 5m buffer-then-unbuffer to close tiny gaps from floating-point mismatches at OA boundary edges. If still a MultiPolygon after that, keeps only the largest polygon — postcodes are contiguous delivery routes, so detached fragments are artifacts.
|
||||
**Fragment merging** (`output.py:merge_fragments`): Groups all fragments by postcode, unions them. If the result is a MultiPolygon (meaning the postcode has disconnected pieces — either from spanning OAs with a gap, or algorithm artifacts), applies a 5m buffer-then-unbuffer to close tiny gaps from floating-point mismatches at OA boundary edges. If still a MultiPolygon after that, keeps the largest part **plus any other part ≥ `_MIN_DETACHED_PART_AREA` (100 m²)** (`_keep_polygon_parts`); only sub-100 m² noise slivers are dropped. Keeping substantial detached parts matters because a postcode genuinely split across an OA seam (by a railway, river, or main road wider than the 5m buffer) would otherwise lose a chunk — measured at ~1.8% of merged area left as uncovered gaps (often 3000–5000 m² building blocks) before this change.
|
||||
|
||||
**GeoJSON output** (`output.py:write_district_geojson`): Groups postcodes by district (the outward code, e.g. `SW1A` from `SW1A 1AA`). For each district, converts every postcode polygon from BNG to WGS84 using pyproj, simplifies with 1m tolerance (Douglas-Peucker), rounds coordinates to 6 decimal places (~0.1m precision), and writes a single `{district}.geojson` FeatureCollection. Each Feature has `postcodes` (formatted like `"SW1A 1AA"`) and `mapit_code` (no space: `"SW1A1AA"`) in its properties.
|
||||
**GeoJSON output** (`output.py:write_district_geojson`): Two passes. Pass 1 converts every postcode from BNG to WGS84 (pyproj), simplifies with 1m tolerance (Douglas-Peucker), and snaps to 6 decimal places (~0.1m precision); multi-part postcodes become `MultiPolygon` (`to_wgs84_geojson_multi`, each part handled independently), single-part stay `Polygon`. The whole set is then made a **partition** (`_resolve_overlaps`): each postcode is trimmed by the union of its higher-priority overlapping neighbours, where **priority = ascending area** (smaller postcodes win contested ground). That single rule handles both seam overlap *and* containment — an enclosed postcode is always smaller than its container, so it keeps its area while the container gets a hole (the query uses both the `overlaps` and `contains` predicates, since `overlaps` alone excludes containment). This runs last, so nothing re-introduces overlap; a postcode that would be emptied keeps its original geometry, so no active postcode is dropped. Pass 2 groups postcodes by district (the outward code, e.g. `SW1A` from `SW1A 1AA`), rounds coordinates to 6dp, and writes a `{district}.geojson` FeatureCollection. Each Feature has `postcodes` (formatted like `"SW1A 1AA"`) and `mapit_code` (no space: `"SW1A1AA"`) in its properties.
|
||||
|
||||
## Memory architecture
|
||||
|
||||
|
|
@ -103,10 +103,10 @@ Key design choices:
|
|||
|
||||
## Key invariants
|
||||
|
||||
1. **Every square meter of every OA is assigned to exactly one postcode** — the combination of INSPIRE claiming + Voronoi fills the entire OA, and overlap resolution ensures no double-counting
|
||||
1. **No two postcodes cover the same ground in the output** — within an OA the INSPIRE claiming + Voronoi tile it with no overlap, and a final `_resolve_overlaps` partition pass removes the thin overlap strips that the merge buffer + per-postcode simplification introduce across OA seams (measured residual overlap ~0.01% of area)
|
||||
2. **Every postcode that exists in the UPRN data gets a polygon** — unless all its UPRNs share coordinates with another postcode's UPRNs (handled by jitter) or it has zero UPRNs
|
||||
3. **Postcode polygons never extend outside their OA(s)** — all geometry is clipped to OA boundaries
|
||||
4. **Output is always single Polygon, never MultiPolygon** — the largest-polygon extraction in both `merge_fragments` and `to_wgs84_geojson` ensures this
|
||||
4. **A postcode split across an OA seam keeps all its substantial parts** — `merge_fragments` keeps every part ≥ 100 m² and the output is emitted as a `MultiPolygon` (the Rust server `postcodes.rs` and `loader.py` both parse MultiPolygon); only sub-100 m² noise slivers are dropped
|
||||
|
||||
## Module structure
|
||||
|
||||
|
|
|
|||
|
|
@ -1,12 +1,21 @@
|
|||
import argparse
|
||||
import multiprocessing as mp
|
||||
import os
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
import shapely
|
||||
from shapely.geometry import MultiPolygon, Polygon
|
||||
from tqdm import tqdm
|
||||
|
||||
from .fragments_cache import (
|
||||
fragments_cache_is_fresh,
|
||||
load_fragments,
|
||||
save_fragments,
|
||||
)
|
||||
from .inspire import (
|
||||
build_inspire_index,
|
||||
cache_inspire,
|
||||
get_inspire_candidates,
|
||||
inspire_cache_exists,
|
||||
load_inspire,
|
||||
)
|
||||
|
|
@ -14,7 +23,206 @@ from .memory import release_memory
|
|||
from .oa_boundaries import load_oa_boundaries
|
||||
from .output import merge_fragments, write_district_geojson
|
||||
from .process_oa import process_oa
|
||||
from .uprn import get_oa_uprns, load_uprns
|
||||
from .uprn import extract_uprn_arrays, get_oa_uprns_arrays, load_uprns
|
||||
|
||||
Fragment = tuple[str, Polygon | MultiPolygon]
|
||||
|
||||
|
||||
def _oa_fragments(
|
||||
oa_code, oa_geoms, east, north, postcodes_arr, offsets, index
|
||||
) -> tuple[list[Fragment], bool]:
|
||||
"""Process one OA into ``(postcode, geometry)`` fragments.
|
||||
|
||||
Returns ``(fragments, is_single)``; ``is_single`` flags the single-postcode
|
||||
fast path. Shared by the sequential and parallel drivers so both produce
|
||||
identical output. Any failure is re-raised tagged with the OA code so a single
|
||||
bad OA is attributable instead of an anonymous worker abort hours in.
|
||||
"""
|
||||
try:
|
||||
oa_geom = oa_geoms[oa_code]
|
||||
points, postcodes = get_oa_uprns_arrays(
|
||||
east, north, postcodes_arr, offsets, oa_code
|
||||
)
|
||||
if len(set(postcodes)) == 1:
|
||||
return [(postcodes[0], oa_geom)], True
|
||||
candidates = index.candidates(oa_geom.bounds)
|
||||
return process_oa(oa_geom, points, postcodes, candidates), False
|
||||
except Exception as exc:
|
||||
raise RuntimeError(f"Failed processing OA {oa_code}: {exc!r}") from exc
|
||||
|
||||
|
||||
# Worker-shared state. Populated in the parent before the pool forks; children
|
||||
# inherit it copy-on-write (the numpy/Arrow buffers + coords mmap stay shared,
|
||||
# never duplicated per worker). Read-only in workers.
|
||||
_WORKER_STATE: dict = {}
|
||||
|
||||
|
||||
def _process_oa_chunk(oa_codes: list[str]):
|
||||
"""Worker: turn a chunk of OA codes into WKB-encoded fragments.
|
||||
|
||||
Geometries are returned as WKB (compact and lossless) rather than pickled
|
||||
Shapely objects, to keep the IPC payload small.
|
||||
"""
|
||||
state = _WORKER_STATE
|
||||
frags: list[Fragment] = []
|
||||
single = 0
|
||||
for oa_code in oa_codes:
|
||||
oa_frags, is_single = _oa_fragments(
|
||||
oa_code,
|
||||
state["oa_geoms"],
|
||||
state["east"],
|
||||
state["north"],
|
||||
state["postcodes"],
|
||||
state["offsets"],
|
||||
state["index"],
|
||||
)
|
||||
frags.extend(oa_frags)
|
||||
single += is_single
|
||||
|
||||
if frags:
|
||||
pcs = [pc for pc, _ in frags]
|
||||
wkb = shapely.to_wkb(np.array([g for _, g in frags], dtype=object))
|
||||
else:
|
||||
pcs, wkb = [], np.empty(0, dtype=object)
|
||||
return pcs, wkb, single, len(oa_codes)
|
||||
|
||||
|
||||
def _resolve_workers(requested: int) -> int:
|
||||
"""Worker count: the explicit value if >0, otherwise all available CPUs."""
|
||||
if requested and requested > 0:
|
||||
return requested
|
||||
try:
|
||||
return max(1, len(os.sched_getaffinity(0)))
|
||||
except AttributeError:
|
||||
return max(1, os.cpu_count() or 1)
|
||||
|
||||
|
||||
def _process_oas(
|
||||
oa_codes, oa_geoms, east, north, postcodes_arr, offsets, index, workers
|
||||
) -> tuple[list[Fragment], int]:
|
||||
"""Drive Phase 3 over every OA, fanning out across `workers` processes.
|
||||
|
||||
OAs are independent, so the loop parallelises cleanly. ``fork`` lets workers
|
||||
share the big read-only inputs (INSPIRE arrays + coords mmap, UPRN arrays, OA
|
||||
geometries) copy-on-write instead of duplicating ~2GB each. Fragment order
|
||||
does not affect the result (``merge_fragments`` unions per postcode), so
|
||||
chunks are collected as they finish. Returns ``(fragments, single_count)``.
|
||||
"""
|
||||
all_fragments: list[Fragment] = []
|
||||
single_count = 0
|
||||
|
||||
if workers <= 1 or "fork" not in mp.get_all_start_methods():
|
||||
for oa_code in tqdm(
|
||||
oa_codes, desc="Processing OAs", unit="OA", smoothing=0.01, miniters=100
|
||||
):
|
||||
oa_frags, is_single = _oa_fragments(
|
||||
oa_code, oa_geoms, east, north, postcodes_arr, offsets, index
|
||||
)
|
||||
all_fragments.extend(oa_frags)
|
||||
single_count += is_single
|
||||
return all_fragments, single_count
|
||||
|
||||
_WORKER_STATE.update(
|
||||
oa_geoms=oa_geoms,
|
||||
east=east,
|
||||
north=north,
|
||||
postcodes=postcodes_arr,
|
||||
offsets=offsets,
|
||||
index=index,
|
||||
)
|
||||
# Many small contiguous chunks → dynamic load balancing across workers (rural
|
||||
# OAs cost far more than urban ones) while preserving mmap read locality.
|
||||
chunk_size = max(1, len(oa_codes) // (workers * 16))
|
||||
chunks = [oa_codes[i : i + chunk_size] for i in range(0, len(oa_codes), chunk_size)]
|
||||
print(f" Parallel: {workers} workers, {len(chunks)} chunks of ~{chunk_size} OAs")
|
||||
|
||||
ctx = mp.get_context("fork")
|
||||
try:
|
||||
with ctx.Pool(processes=workers) as pool:
|
||||
with tqdm(
|
||||
total=len(oa_codes), desc="Processing OAs", unit="OA", smoothing=0.01
|
||||
) as bar:
|
||||
for pcs, wkb, single, n_oas in pool.imap_unordered(
|
||||
_process_oa_chunk, chunks
|
||||
):
|
||||
if len(wkb):
|
||||
all_fragments.extend(zip(pcs, shapely.from_wkb(wkb)))
|
||||
single_count += single
|
||||
bar.update(n_oas)
|
||||
finally:
|
||||
# Drop references so Phase 4 doesn't keep the big inputs alive.
|
||||
_WORKER_STATE.clear()
|
||||
return all_fragments, single_count
|
||||
|
||||
|
||||
def build_fragments(args: argparse.Namespace) -> list[Fragment]:
|
||||
"""Run Phases 1-3: load data, parse INSPIRE, process every OA into fragments.
|
||||
|
||||
Returns the full ``(postcode, geometry)`` fragment list. The large
|
||||
intermediate structures (OA/UPRN/INSPIRE arrays) are locals here, so they are
|
||||
freed as soon as this function returns -- before the fragments are cached or
|
||||
merged.
|
||||
"""
|
||||
# Phase 1: Load all data
|
||||
print("=" * 60)
|
||||
print("Phase 1: Loading data")
|
||||
print("=" * 60)
|
||||
|
||||
oa_geoms = load_oa_boundaries(args.oa_boundaries)
|
||||
uprn_df, uprn_offsets = load_uprns(args.uprn, args.arcgis)
|
||||
# Convert UPRNs to fork-shareable numpy/Arrow arrays so parallel workers never
|
||||
# call polars (avoids the fork-after-threads hazard of its rayon pool).
|
||||
uprn_east, uprn_north, uprn_postcodes = extract_uprn_arrays(uprn_df)
|
||||
|
||||
# Phase 2: Parse/load INSPIRE
|
||||
print()
|
||||
print("=" * 60)
|
||||
print("Phase 2: INSPIRE data")
|
||||
print("=" * 60)
|
||||
|
||||
inspire_cache_dir = args.output / "inspire_cache"
|
||||
if not inspire_cache_exists(inspire_cache_dir):
|
||||
cache_inspire(args.inspire, inspire_cache_dir)
|
||||
inspire_bboxes, inspire_offsets, inspire_coords = load_inspire(inspire_cache_dir)
|
||||
inspire_index = build_inspire_index(inspire_bboxes, inspire_offsets, inspire_coords)
|
||||
|
||||
# Phase 3: Process OAs
|
||||
print()
|
||||
print("=" * 60)
|
||||
print("Phase 3: Processing OAs")
|
||||
print("=" * 60)
|
||||
|
||||
# Build work list — precompute which OAs are single vs multi-postcode
|
||||
oa_codes_with_data = sorted(set(oa_geoms.keys()) & set(uprn_offsets.keys()))
|
||||
skipped_no_uprn = len(oa_geoms) - len(oa_codes_with_data)
|
||||
skipped_no_boundary = len(uprn_offsets) - len(oa_codes_with_data)
|
||||
|
||||
if args.limit > 0:
|
||||
oa_codes_with_data = oa_codes_with_data[: args.limit]
|
||||
|
||||
print(f" OAs with UPRNs + boundaries: {len(oa_codes_with_data)}")
|
||||
print(f" Skipped (no UPRNs): {skipped_no_uprn}")
|
||||
print(f" Skipped (no boundary): {skipped_no_boundary}")
|
||||
|
||||
# --limit is a debug mode → force deterministic single-process.
|
||||
workers = 1 if args.limit > 0 else _resolve_workers(args.workers)
|
||||
all_fragments, single_count = _process_oas(
|
||||
oa_codes_with_data,
|
||||
oa_geoms,
|
||||
uprn_east,
|
||||
uprn_north,
|
||||
uprn_postcodes,
|
||||
uprn_offsets,
|
||||
inspire_index,
|
||||
workers,
|
||||
)
|
||||
multi_count = len(oa_codes_with_data) - single_count
|
||||
|
||||
print(f"\n Single-postcode OAs (fast path): {single_count}")
|
||||
print(f" Multi-postcode OAs (INSPIRE+Voronoi): {multi_count}")
|
||||
print(f" Total fragments: {len(all_fragments)}")
|
||||
|
||||
return all_fragments
|
||||
|
||||
|
||||
def main() -> None:
|
||||
|
|
@ -38,6 +246,12 @@ def main() -> None:
|
|||
parser.add_argument(
|
||||
"--limit", type=int, default=0, help="Process only first N OAs (0=all)"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--workers",
|
||||
type=int,
|
||||
default=0,
|
||||
help="Parallel worker processes for OA processing (0=all CPUs, 1=sequential)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--greenspace",
|
||||
type=Path,
|
||||
|
|
@ -46,79 +260,30 @@ def main() -> None:
|
|||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
# Phase 1: Load all data
|
||||
print("=" * 60)
|
||||
print("Phase 1: Loading data")
|
||||
print("=" * 60)
|
||||
fragments_cache = args.output / "fragments_cache.parquet"
|
||||
# Phase 3 depends only on these inputs; greenspace is applied later (Phase 4),
|
||||
# so a greenspace change must not invalidate the fragment cache.
|
||||
fragment_inputs = [args.uprn, args.arcgis, args.oa_boundaries, args.inspire]
|
||||
# --limit yields a partial fragment set; never read or write the shared cache.
|
||||
use_cache = args.limit == 0
|
||||
|
||||
oa_geoms = load_oa_boundaries(args.oa_boundaries)
|
||||
uprn_df, uprn_offsets = load_uprns(args.uprn, args.arcgis)
|
||||
|
||||
# Phase 2: Parse/load INSPIRE
|
||||
print()
|
||||
print("=" * 60)
|
||||
print("Phase 2: INSPIRE data")
|
||||
print("=" * 60)
|
||||
|
||||
inspire_cache_dir = args.output / "inspire_cache"
|
||||
if not inspire_cache_exists(inspire_cache_dir):
|
||||
cache_inspire(args.inspire, inspire_cache_dir)
|
||||
inspire_bboxes, inspire_offsets, inspire_coords = load_inspire(inspire_cache_dir)
|
||||
|
||||
# Phase 3: Process OAs
|
||||
print()
|
||||
print("=" * 60)
|
||||
print("Phase 3: Processing OAs")
|
||||
print("=" * 60)
|
||||
|
||||
# Build work list — precompute which OAs are single vs multi-postcode
|
||||
oa_codes_with_data = sorted(set(oa_geoms.keys()) & set(uprn_offsets.keys()))
|
||||
skipped_no_uprn = len(oa_geoms) - len(oa_codes_with_data)
|
||||
skipped_no_boundary = len(uprn_offsets) - len(oa_codes_with_data)
|
||||
|
||||
if args.limit > 0:
|
||||
oa_codes_with_data = oa_codes_with_data[: args.limit]
|
||||
|
||||
print(f" OAs with UPRNs + boundaries: {len(oa_codes_with_data)}")
|
||||
print(f" Skipped (no UPRNs): {skipped_no_uprn}")
|
||||
print(f" Skipped (no boundary): {skipped_no_boundary}")
|
||||
|
||||
all_fragments: list[tuple[str, Polygon | MultiPolygon]] = []
|
||||
single_count = 0
|
||||
multi_count = 0
|
||||
|
||||
for oa_code in tqdm(
|
||||
oa_codes_with_data,
|
||||
desc="Processing OAs",
|
||||
unit="OA",
|
||||
smoothing=0.01,
|
||||
miniters=100,
|
||||
):
|
||||
oa_geom = oa_geoms[oa_code]
|
||||
points, postcodes = get_oa_uprns(uprn_df, uprn_offsets, oa_code)
|
||||
|
||||
if len(set(postcodes)) == 1:
|
||||
# Fast path: entire OA = one postcode
|
||||
all_fragments.append((postcodes[0], oa_geom))
|
||||
single_count += 1
|
||||
continue
|
||||
|
||||
# Get INSPIRE candidates via bbox pre-filter
|
||||
candidates = get_inspire_candidates(
|
||||
oa_geom.bounds, inspire_bboxes, inspire_offsets, inspire_coords
|
||||
if use_cache and fragments_cache_is_fresh(fragments_cache, fragment_inputs):
|
||||
print("=" * 60)
|
||||
print("Phase 3 cache hit — loading fragments (skipping Phases 1-3)")
|
||||
print("=" * 60)
|
||||
all_fragments = load_fragments(fragments_cache)
|
||||
print(
|
||||
f" Loaded {len(all_fragments):,} cached fragments from {fragments_cache}"
|
||||
)
|
||||
else:
|
||||
all_fragments = build_fragments(args)
|
||||
if use_cache:
|
||||
# Persist the expensive Phase-3 output before the cheap-but-fragile
|
||||
# merge/write so any failure there resumes in seconds, not ~10 hours.
|
||||
save_fragments(fragments_cache, all_fragments)
|
||||
print(f" Cached {len(all_fragments):,} fragments to {fragments_cache}")
|
||||
|
||||
fragments = process_oa(oa_geom, points, postcodes, candidates)
|
||||
all_fragments.extend(fragments)
|
||||
multi_count += 1
|
||||
|
||||
print(f"\n Single-postcode OAs (fast path): {single_count}")
|
||||
print(f" Multi-postcode OAs (INSPIRE+Voronoi): {multi_count}")
|
||||
print(f" Total fragments: {len(all_fragments)}")
|
||||
|
||||
# Free data no longer needed
|
||||
del oa_geoms, uprn_df, uprn_offsets
|
||||
del inspire_bboxes, inspire_offsets, inspire_coords
|
||||
# Free Phase-1-3 intermediates (build_fragments' locals) back to the OS.
|
||||
release_memory()
|
||||
|
||||
# Phase 4: Merge and write
|
||||
|
|
@ -145,6 +310,12 @@ def main() -> None:
|
|||
|
||||
file_count = write_district_geojson(merged, args.output)
|
||||
print(f"\n Wrote {file_count} district GeoJSON files to {args.output / 'units'}")
|
||||
|
||||
# The cache exists only to survive a crash between Phase 3 and a clean write.
|
||||
# Now that the output is complete, drop it so a later input change can never
|
||||
# be served from a stale cache.
|
||||
if use_cache:
|
||||
fragments_cache.unlink(missing_ok=True)
|
||||
print("Done!")
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@ -112,44 +112,130 @@ def load_inspire(
|
|||
return bboxes, offsets, coords_mmap
|
||||
|
||||
|
||||
def get_inspire_candidates(
|
||||
oa_bounds: tuple[float, float, float, float],
|
||||
# Grid cell size (m) for the parcel spatial index. The median parcel is ~25 m
|
||||
# and the 99th percentile ~540 m, so almost every parcel fits inside a single
|
||||
# 1 km cell; the ~0.4% larger than a cell go to an overflow list tested on every
|
||||
# query.
|
||||
_GRID_CELL_SIZE = 1000.0
|
||||
|
||||
|
||||
class InspireIndex:
|
||||
"""Uniform-grid spatial index over INSPIRE parcel bounding boxes.
|
||||
|
||||
The per-OA candidate lookup used to linear-scan all ~24M bboxes (O(N) per
|
||||
OA, ~4 h total over the country). This indexes parcels by grid cell so each
|
||||
lookup is O(cells_spanned + candidates). Parcels no larger than one cell are
|
||||
bucketed by their bbox min-corner cell in a CSR layout (parcel indices sorted
|
||||
by cell id, located with ``searchsorted``); the few parcels larger than a
|
||||
cell are kept in an overflow array tested directly on every query. An exact
|
||||
bbox test then runs on the gathered subset and the result is sorted, so the
|
||||
candidate set -- and its order -- is byte-for-byte identical to the old scan.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
bboxes: np.ndarray,
|
||||
offsets: np.ndarray,
|
||||
coords_mmap: np.memmap,
|
||||
cell_size: float = _GRID_CELL_SIZE,
|
||||
) -> None:
|
||||
self._bboxes = bboxes
|
||||
self._offsets = offsets
|
||||
self._coords = coords_mmap
|
||||
self._cell_size = cell_size
|
||||
self._origin_x = float(bboxes[:, 0].min())
|
||||
self._origin_y = float(bboxes[:, 1].min())
|
||||
# Flattened cell id is ``cx * _ny + cy``; +2 leaves a guard row so the
|
||||
# query's one-cell low-edge widening can never collide with cx-1.
|
||||
self._ny = int((bboxes[:, 1].max() - self._origin_y) // cell_size) + 2
|
||||
|
||||
width = bboxes[:, 2] - bboxes[:, 0]
|
||||
height = bboxes[:, 3] - bboxes[:, 1]
|
||||
small = np.where((width <= cell_size) & (height <= cell_size))[0]
|
||||
self._oversized = np.where((width > cell_size) | (height > cell_size))[0]
|
||||
self._oversized_bb = bboxes[self._oversized]
|
||||
|
||||
cx = ((bboxes[small, 0] - self._origin_x) // cell_size).astype(np.int64)
|
||||
cy = ((bboxes[small, 1] - self._origin_y) // cell_size).astype(np.int64)
|
||||
cell_id = cx * self._ny + cy
|
||||
order = np.argsort(cell_id, kind="stable")
|
||||
self._sorted_cells = cell_id[order]
|
||||
self._cell_parcels = small[order]
|
||||
|
||||
def candidate_indices(self, oa_bounds: tuple[float, float, float, float]) -> np.ndarray:
|
||||
"""Parcel indices whose bbox overlaps ``oa_bounds`` (ascending order)."""
|
||||
min_e, min_n, max_e, max_n = oa_bounds
|
||||
cs = self._cell_size
|
||||
# A small parcel (<= one cell) overlapping the OA has its min-corner no
|
||||
# more than one cell below/left of the OA bbox, so widen the low edges by
|
||||
# a cell. This keeps the lookup free of false negatives.
|
||||
gx0 = int((min_e - cs - self._origin_x) // cs)
|
||||
gx1 = int((max_e - self._origin_x) // cs)
|
||||
gy_lo = int((min_n - cs - self._origin_y) // cs)
|
||||
gy_hi = int((max_n - self._origin_y) // cs)
|
||||
|
||||
parts = []
|
||||
ob = self._oversized_bb
|
||||
if len(ob):
|
||||
mo = (
|
||||
(ob[:, 2] >= min_e)
|
||||
& (ob[:, 0] <= max_e)
|
||||
& (ob[:, 3] >= min_n)
|
||||
& (ob[:, 1] <= max_n)
|
||||
)
|
||||
if mo.any():
|
||||
parts.append(self._oversized[mo])
|
||||
|
||||
for gx in range(gx0, gx1 + 1):
|
||||
base = gx * self._ny
|
||||
lo = np.searchsorted(self._sorted_cells, base + gy_lo, "left")
|
||||
hi = np.searchsorted(self._sorted_cells, base + gy_hi, "right")
|
||||
if hi > lo:
|
||||
parts.append(self._cell_parcels[lo:hi])
|
||||
|
||||
if not parts:
|
||||
return np.empty(0, dtype=np.int64)
|
||||
cand = np.concatenate(parts)
|
||||
cb = self._bboxes[cand]
|
||||
mask = (
|
||||
(cb[:, 2] >= min_e)
|
||||
& (cb[:, 0] <= max_e)
|
||||
& (cb[:, 3] >= min_n)
|
||||
& (cb[:, 1] <= max_n)
|
||||
)
|
||||
# Sort so the candidate order matches the old full np.where scan exactly.
|
||||
return np.sort(cand[mask])
|
||||
|
||||
def candidates(
|
||||
self, oa_bounds: tuple[float, float, float, float]
|
||||
) -> list[Polygon]:
|
||||
"""INSPIRE polygons overlapping an OA, built from the mmap on demand.
|
||||
|
||||
Builds Shapely objects only for matches (typically 10-500 per OA).
|
||||
"""
|
||||
candidates = []
|
||||
for i in self.candidate_indices(oa_bounds):
|
||||
byte_offset = self._offsets[i, 0]
|
||||
n_pts = self._offsets[i, 1]
|
||||
float_offset = byte_offset // 8 # float64 = 8 bytes
|
||||
coords = self._coords[float_offset : float_offset + n_pts * 2].reshape(-1, 2)
|
||||
poly = Polygon(coords)
|
||||
if not poly.is_valid:
|
||||
poly = make_valid(poly)
|
||||
if poly.geom_type == "MultiPolygon":
|
||||
poly = max(poly.geoms, key=lambda g: g.area)
|
||||
elif poly.geom_type != "Polygon":
|
||||
continue
|
||||
if not poly.is_empty:
|
||||
candidates.append(poly)
|
||||
return candidates
|
||||
|
||||
|
||||
def build_inspire_index(
|
||||
bboxes: np.ndarray,
|
||||
offsets: np.ndarray,
|
||||
coords_mmap: np.memmap,
|
||||
) -> list[Polygon]:
|
||||
"""Get INSPIRE polygons overlapping an OA via bbox pre-filter.
|
||||
|
||||
Builds Shapely objects only for matches (typically 10-500 per OA).
|
||||
Reads coordinate data on-demand from memory-mapped file.
|
||||
"""
|
||||
min_e, min_n, max_e, max_n = oa_bounds
|
||||
|
||||
# Vectorized bbox overlap test
|
||||
mask = (
|
||||
(bboxes[:, 2] >= min_e)
|
||||
& (bboxes[:, 0] <= max_e)
|
||||
& (bboxes[:, 3] >= min_n)
|
||||
& (bboxes[:, 1] <= max_n)
|
||||
)
|
||||
idxs = np.where(mask)[0]
|
||||
if len(idxs) == 0:
|
||||
return []
|
||||
|
||||
# Build Shapely polygons only for candidates (coords from mmap)
|
||||
candidates = []
|
||||
for i in idxs:
|
||||
byte_offset = offsets[i, 0]
|
||||
n_pts = offsets[i, 1]
|
||||
float_offset = byte_offset // 8 # float64 = 8 bytes
|
||||
coords = coords_mmap[float_offset : float_offset + n_pts * 2].reshape(-1, 2)
|
||||
poly = Polygon(coords)
|
||||
if not poly.is_valid:
|
||||
poly = make_valid(poly)
|
||||
if poly.geom_type == "MultiPolygon":
|
||||
poly = max(poly.geoms, key=lambda g: g.area)
|
||||
elif poly.geom_type != "Polygon":
|
||||
continue
|
||||
if not poly.is_empty:
|
||||
candidates.append(poly)
|
||||
return candidates
|
||||
cell_size: float = _GRID_CELL_SIZE,
|
||||
) -> InspireIndex:
|
||||
"""Build the grid spatial index used for per-OA candidate retrieval."""
|
||||
return InspireIndex(bboxes, offsets, coords_mmap, cell_size)
|
||||
|
|
|
|||
|
|
@ -3,8 +3,9 @@ import shutil
|
|||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
from pyproj import Transformer
|
||||
from shapely import make_valid, set_precision
|
||||
from shapely import STRtree, make_valid, set_precision
|
||||
from shapely.errors import GEOSException
|
||||
from shapely.geometry import MultiPolygon, Polygon, mapping, shape
|
||||
from shapely.ops import transform as transform_geometry
|
||||
|
|
@ -41,30 +42,30 @@ def _largest_polygonal(geom) -> Polygon | None:
|
|||
return None
|
||||
|
||||
|
||||
def to_wgs84_geojson(
|
||||
geom: Polygon | MultiPolygon, tolerance: float = 1.0
|
||||
) -> dict | None:
|
||||
"""Simplify geometry in BNG, convert to WGS84, return a valid GeoJSON dict.
|
||||
# Output coordinate grid (~0.11 m at UK latitudes). Polygons whose extent is
|
||||
# below this in any direction snap to empty during serialization.
|
||||
_OUTPUT_PRECISION_DEG = 0.000001
|
||||
# Minimal BNG buffer used to rescue sub-grid slivers into a representable
|
||||
# footprint. A near-zero-area Voronoi/INSPIRE spike (e.g. three almost-collinear
|
||||
# vertices) would otherwise vanish at output precision; since every *active*
|
||||
# postcode must keep a boundary (validate_outputs enforces this with zero
|
||||
# tolerance), we fatten it just enough to survive snapping rather than drop it.
|
||||
_MIN_FOOTPRINT_BUFFER_M = 0.5
|
||||
|
||||
|
||||
def _snap_to_wgs84_geojson(geom_bng: Polygon | MultiPolygon) -> dict | None:
|
||||
"""Transform a BNG polygon to WGS84, snap to output precision, validate.
|
||||
|
||||
Validates the *serialized* GeoJSON dict (via a ``shape()`` round-trip), not
|
||||
just the intermediate Shapely object: coordinate snapping during
|
||||
serialization can otherwise leave a self-intersecting ring that only shows up
|
||||
once the feature is read back from disk. Any such geometry is repaired with
|
||||
``make_valid`` before returning so written features are always valid.
|
||||
once the feature is read back from disk. Returns ``None`` if the geometry
|
||||
collapses to empty (a sub-grid sliver).
|
||||
"""
|
||||
geom = _largest_polygonal(geom)
|
||||
if geom is None:
|
||||
return None
|
||||
|
||||
simplified = geom.simplify(tolerance, preserve_topology=True)
|
||||
simplified = _largest_polygonal(simplified)
|
||||
if simplified is None:
|
||||
return None
|
||||
|
||||
transformer = _get_to_wgs84()
|
||||
wgs84 = transform_geometry(transformer.transform, simplified)
|
||||
wgs84 = transform_geometry(transformer.transform, geom_bng)
|
||||
try:
|
||||
wgs84 = set_precision(wgs84, 0.000001, mode="valid_output")
|
||||
wgs84 = set_precision(wgs84, _OUTPUT_PRECISION_DEG, mode="valid_output")
|
||||
except GEOSException:
|
||||
# Precision snapping can fail on pathological geometries; fall back to a
|
||||
# plain validity repair without coordinate snapping.
|
||||
|
|
@ -87,20 +88,105 @@ def to_wgs84_geojson(
|
|||
return geojson_dict
|
||||
|
||||
|
||||
def _rescue_footprint(geom_bng) -> dict | None:
|
||||
"""Fatten a degenerate BNG geometry into a representable footprint and snap."""
|
||||
footprint = _largest_polygonal(geom_bng.buffer(_MIN_FOOTPRINT_BUFFER_M))
|
||||
if footprint is None:
|
||||
return None
|
||||
return _snap_to_wgs84_geojson(footprint)
|
||||
|
||||
|
||||
def to_wgs84_geojson(
|
||||
geom: Polygon | MultiPolygon, tolerance: float = 1.0
|
||||
) -> dict | None:
|
||||
"""Simplify geometry in BNG, convert to WGS84, return a valid GeoJSON dict.
|
||||
|
||||
A few thousand postcodes reduce to a sub-grid sliver that snaps to empty at
|
||||
output precision. Dropping them would leave an active postcode with no
|
||||
boundary (validate_outputs rejects that with zero tolerance), so instead they
|
||||
are fattened into a minimal footprint at the right location: first by buffering
|
||||
the (often elongated) sliver itself, then -- for fully-degenerate input -- a
|
||||
small disc around ``representative_point()``, which lies inside any non-empty
|
||||
geometry. ``None`` is returned only for a genuinely empty input.
|
||||
"""
|
||||
if geom is None or geom.is_empty:
|
||||
return None
|
||||
|
||||
cleaned = _largest_polygonal(geom)
|
||||
if cleaned is not None:
|
||||
simplified = _largest_polygonal(
|
||||
cleaned.simplify(tolerance, preserve_topology=True)
|
||||
)
|
||||
if simplified is None:
|
||||
simplified = cleaned
|
||||
# Normal path; if snapping erases a thin sliver, fatten its real shape.
|
||||
result = _snap_to_wgs84_geojson(simplified)
|
||||
if result is None:
|
||||
result = _rescue_footprint(simplified)
|
||||
if result is not None:
|
||||
return result
|
||||
|
||||
# Universal fallback for input too degenerate to clean or fatten in place.
|
||||
return _rescue_footprint(geom.representative_point())
|
||||
|
||||
|
||||
def to_wgs84_geojson_multi(
|
||||
geom: Polygon | MultiPolygon, tolerance: float = 1.0
|
||||
) -> dict | None:
|
||||
"""Convert a (possibly multi-part) postcode geometry to a GeoJSON dict,
|
||||
preserving every part. Each part is simplified/snapped/rescued independently
|
||||
via :func:`to_wgs84_geojson`; the result is a ``Polygon`` for a single part or
|
||||
a ``MultiPolygon`` for several. ``None`` only if every part is degenerate.
|
||||
"""
|
||||
parts = list(geom.geoms) if geom.geom_type == "MultiPolygon" else [geom]
|
||||
part_dicts = [d for part in parts if (d := to_wgs84_geojson(part, tolerance))]
|
||||
if not part_dicts:
|
||||
return None
|
||||
if len(part_dicts) == 1:
|
||||
return part_dicts[0]
|
||||
return {
|
||||
"type": "MultiPolygon",
|
||||
"coordinates": [pd["coordinates"] for pd in part_dicts],
|
||||
}
|
||||
|
||||
|
||||
# Interior holes from the INSPIRE+Voronoi+make_valid chain are small artifacts and
|
||||
# get filled. A hole at least this large is likely a genuinely enclosed postcode
|
||||
# (kept, so we never solidify over a neighbour); the de-overlap pass is the real
|
||||
# guarantee, this is defence-in-depth.
|
||||
_MAX_ARTIFACT_HOLE_AREA = 1000.0
|
||||
|
||||
|
||||
def _fill_small_holes(poly: Polygon) -> Polygon:
|
||||
kept = [r for r in poly.interiors if Polygon(r).area >= _MAX_ARTIFACT_HOLE_AREA]
|
||||
return Polygon(poly.exterior, kept)
|
||||
|
||||
|
||||
def _fill_holes(geom):
|
||||
"""Remove all interior rings (holes) from a polygon or multipolygon."""
|
||||
"""Fill small artifact interior rings; keep large (real-enclosed) holes."""
|
||||
if geom.geom_type == "Polygon":
|
||||
return Polygon(geom.exterior)
|
||||
return _fill_small_holes(geom)
|
||||
elif geom.geom_type == "MultiPolygon":
|
||||
return MultiPolygon([Polygon(p.exterior) for p in geom.geoms])
|
||||
return MultiPolygon([_fill_small_holes(p) for p in geom.geoms])
|
||||
return geom
|
||||
|
||||
|
||||
def _largest_polygon(geom):
|
||||
"""Extract the largest polygon from a MultiPolygon."""
|
||||
if geom.geom_type == "MultiPolygon":
|
||||
return max(geom.geoms, key=lambda g: g.area)
|
||||
return geom
|
||||
# A postcode genuinely split across an OA seam (by a railway, river, or main road
|
||||
# wider than the merge buffer) arrives here as a MultiPolygon. Keeping only the
|
||||
# largest part used to discard the rest, leaving ~1.8% of merged area as uncovered
|
||||
# gaps (often 3000-5000 m² building blocks). Keep every part at least this big;
|
||||
# smaller detached bits are Voronoi/clipping noise and are still dropped.
|
||||
_MIN_DETACHED_PART_AREA = 100.0
|
||||
|
||||
|
||||
def _keep_polygon_parts(geom):
|
||||
"""Keep all MultiPolygon parts >= _MIN_DETACHED_PART_AREA (largest if none)."""
|
||||
if geom.geom_type != "MultiPolygon":
|
||||
return geom
|
||||
parts = [g for g in geom.geoms if g.area >= _MIN_DETACHED_PART_AREA]
|
||||
if not parts:
|
||||
parts = [max(geom.geoms, key=lambda g: g.area)]
|
||||
return parts[0] if len(parts) == 1 else MultiPolygon(parts)
|
||||
|
||||
|
||||
def merge_fragments(
|
||||
|
|
@ -126,14 +212,19 @@ def merge_fragments(
|
|||
continue
|
||||
if not combined.is_valid:
|
||||
combined = make_valid(combined)
|
||||
# Close tiny gaps between adjacent OA boundary edges (float mismatches)
|
||||
# Close tiny gaps between adjacent OA boundary edges (float mismatches).
|
||||
# The closing can erode a tiny MultiPolygon (e.g. a postcode with only a
|
||||
# sliver fragment) to nothing, which would leave the postcode with no
|
||||
# geometry at all — keep the un-closed shape if that happens.
|
||||
if combined.geom_type == "MultiPolygon":
|
||||
combined = combined.buffer(5.0).buffer(-5.0)
|
||||
if not combined.is_valid:
|
||||
combined = make_valid(combined)
|
||||
# Postcodes are contiguous delivery routes — keep only the largest
|
||||
# polygon; small detached fragments are algorithm artifacts
|
||||
combined = _largest_polygon(combined)
|
||||
closed = combined.buffer(5.0).buffer(-5.0)
|
||||
if not closed.is_valid:
|
||||
closed = make_valid(closed)
|
||||
if not closed.is_empty:
|
||||
combined = closed
|
||||
# Keep the postcode whole: the largest part plus any other substantial
|
||||
# part (a genuine railway/river split), dropping only tiny noise slivers.
|
||||
combined = _keep_polygon_parts(combined)
|
||||
# Remove artifact interior holes from INSPIRE+Voronoi+make_valid chain
|
||||
combined = _fill_holes(combined)
|
||||
# Subtract parks/water if provided
|
||||
|
|
@ -142,7 +233,7 @@ def merge_fragments(
|
|||
|
||||
pre_green = combined
|
||||
combined = subtract_greenspace(combined, greenspace_tree, greenspace_geoms)
|
||||
combined = _largest_polygon(combined)
|
||||
combined = _keep_polygon_parts(combined)
|
||||
# Do NOT _fill_holes here: interior holes carved by the greenspace
|
||||
# subtraction (lakes, enclosed parks) are intentional, not artifacts.
|
||||
# Filling them would re-add the removed area and negate the
|
||||
|
|
@ -155,10 +246,114 @@ def merge_fragments(
|
|||
return merged
|
||||
|
||||
|
||||
def _polygonal(geom):
|
||||
"""Return only the polygonal part(s) of a geometry, or None if none remain."""
|
||||
if geom is None or geom.is_empty:
|
||||
return None
|
||||
if geom.geom_type in ("Polygon", "MultiPolygon"):
|
||||
return geom
|
||||
if geom.geom_type == "GeometryCollection":
|
||||
polys = [
|
||||
g
|
||||
for g in geom.geoms
|
||||
if g.geom_type in ("Polygon", "MultiPolygon") and not g.is_empty
|
||||
]
|
||||
if not polys:
|
||||
return None
|
||||
merged = unary_union(polys)
|
||||
return merged if not merged.is_empty else None
|
||||
return None
|
||||
|
||||
|
||||
def _resolve_overlaps(
|
||||
items: list[tuple[str, Polygon | MultiPolygon]],
|
||||
) -> list[tuple[str, Polygon | MultiPolygon]]:
|
||||
"""Make the postcode polygons a partition: no two cover the same ground.
|
||||
|
||||
Overlap appears at OA seams (the 5m merge buffer expands each postcode
|
||||
independently), from simplifying each postcode on its own, and as genuine
|
||||
containment (a postcode fully enclosed by another). Each postcode is trimmed
|
||||
by the union of its higher-priority overlapping neighbours, where **priority =
|
||||
ascending area**: a smaller postcode wins contested ground. That single rule
|
||||
handles both cases correctly — an enclosed postcode is always smaller than its
|
||||
container, so it keeps its area while the container gets a hole (a `overlaps`
|
||||
query alone would miss containment entirely). Run last, on the final output
|
||||
geometries, so nothing re-introduces overlap afterwards. A postcode that would
|
||||
be emptied keeps its original geometry, so an active postcode is never dropped.
|
||||
"""
|
||||
geoms = [g for _, g in items]
|
||||
n = len(geoms)
|
||||
if n < 2:
|
||||
return items
|
||||
|
||||
# rank[i]: 0 = highest priority (smallest area). Postcode string breaks ties
|
||||
# for determinism.
|
||||
rank = {
|
||||
idx: r
|
||||
for r, idx in enumerate(
|
||||
sorted(range(n), key=lambda i: (geoms[i].area, items[i][0]))
|
||||
)
|
||||
}
|
||||
|
||||
tree = STRtree(geoms)
|
||||
arr = np.array(geoms, dtype=object)
|
||||
pairs: set[tuple[int, int]] = set()
|
||||
# "overlaps" gives partial overlaps; "contains" gives containment (which
|
||||
# "overlaps" excludes) — together they cover every 2-D overlap without the
|
||||
# edge-touch explosion a plain "intersects" query would add.
|
||||
for predicate in ("overlaps", "contains"):
|
||||
qsrc, qtgt = tree.query(arr, predicate=predicate)
|
||||
for s, t in zip(qsrc.tolist(), qtgt.tolist()):
|
||||
if s != t:
|
||||
pairs.add((s, t) if s < t else (t, s))
|
||||
|
||||
# For each loser (lower priority) the higher-priority neighbours to subtract.
|
||||
higher: dict[int, list[int]] = defaultdict(list)
|
||||
for a, b in pairs:
|
||||
winner, loser = (a, b) if rank[a] < rank[b] else (b, a)
|
||||
higher[loser].append(winner)
|
||||
|
||||
out = list(geoms)
|
||||
# Process losers from highest priority down, so every subtracted neighbour is
|
||||
# already finalised.
|
||||
for i in sorted(higher, key=lambda idx: rank[idx]):
|
||||
cut = unary_union([out[j] for j in higher[i]])
|
||||
trimmed = out[i].difference(cut)
|
||||
if not trimmed.is_valid:
|
||||
trimmed = make_valid(trimmed)
|
||||
# Keep all polygonal parts: these geometries are in WGS84 degrees, so an
|
||||
# area threshold here would wrongly drop everything but the largest part
|
||||
# and re-open the very gaps the seam fix closed.
|
||||
trimmed = _polygonal(trimmed)
|
||||
if trimmed is not None and not trimmed.is_empty:
|
||||
out[i] = trimmed
|
||||
return [(pc, out[i]) for i, (pc, _) in enumerate(items)]
|
||||
|
||||
|
||||
def _round_coords(coords, ndigits=6):
|
||||
if coords and isinstance(coords[0], (int, float)):
|
||||
return [round(coords[0], ndigits), round(coords[1], ndigits)]
|
||||
return [_round_coords(c, ndigits) for c in coords]
|
||||
|
||||
|
||||
def _geojson_geometry(geom) -> dict | None:
|
||||
"""Serialize a WGS84 polygon/multipolygon to a 6dp GeoJSON dict, or None."""
|
||||
geom = _polygonal(geom if geom.is_valid else make_valid(geom))
|
||||
if geom is None or geom.is_empty:
|
||||
return None
|
||||
gj = mapping(geom)
|
||||
return {"type": gj["type"], "coordinates": _round_coords(gj["coordinates"])}
|
||||
|
||||
|
||||
def write_district_geojson(
|
||||
postcodes: dict[str, Polygon | MultiPolygon], output_dir: Path
|
||||
) -> int:
|
||||
"""Group postcodes by district, write GeoJSON files. Returns file count."""
|
||||
"""Group postcodes by district, write GeoJSON files. Returns file count.
|
||||
|
||||
Before writing, the postcode polygons are converted to their final WGS84 form
|
||||
and made a partition (overlaps removed) so the output never has two postcodes
|
||||
covering the same ground.
|
||||
"""
|
||||
units_dir = output_dir / "units"
|
||||
tmp_units_dir = output_dir / "units.tmp"
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
|
@ -166,38 +361,46 @@ def write_district_geojson(
|
|||
shutil.rmtree(tmp_units_dir)
|
||||
tmp_units_dir.mkdir(parents=True)
|
||||
|
||||
skipped: list[str] = []
|
||||
|
||||
# Pass 1: convert every postcode to its final WGS84 geometry (simplify, snap,
|
||||
# sliver-rescue, multi-part preserved). Sorted → deterministic de-overlap
|
||||
# priority. to_wgs84_geojson_multi returns None only for a genuinely empty
|
||||
# input, which is skipped and reported rather than aborting a multi-hour run.
|
||||
converted: list[tuple[str, Polygon | MultiPolygon]] = []
|
||||
for pc in sorted(postcodes):
|
||||
gj = to_wgs84_geojson_multi(postcodes[pc])
|
||||
if gj is None:
|
||||
skipped.append(pc)
|
||||
continue
|
||||
converted.append((pc, shape(gj)))
|
||||
|
||||
# Remove overlap strips so the output is a clean partition.
|
||||
converted = _resolve_overlaps(converted)
|
||||
|
||||
by_district: dict[str, list[tuple[str, Polygon | MultiPolygon]]] = defaultdict(list)
|
||||
for pc, geom in postcodes.items():
|
||||
for pc, geom in converted:
|
||||
parts = pc.split()
|
||||
district = parts[0] if parts else pc[:4]
|
||||
by_district[district].append((pc, geom))
|
||||
|
||||
file_count = 0
|
||||
seen_postcodes: set[str] = set()
|
||||
for district, entries in tqdm(
|
||||
sorted(by_district.items()), desc="Writing GeoJSON", unit="file"
|
||||
):
|
||||
features = []
|
||||
for pc, geom in sorted(entries, key=lambda x: x[0]):
|
||||
if pc in seen_postcodes:
|
||||
raise ValueError(f"Duplicate postcode boundary feature: {pc}")
|
||||
seen_postcodes.add(pc)
|
||||
geojson_geom = to_wgs84_geojson(geom)
|
||||
geojson_geom = _geojson_geometry(geom)
|
||||
if geojson_geom is None:
|
||||
raise ValueError(f"Postcode boundary collapsed to empty geometry: {pc}")
|
||||
written_geom = shape(geojson_geom)
|
||||
if written_geom.is_empty or not written_geom.is_valid:
|
||||
raise ValueError(
|
||||
f"Invalid postcode boundary geometry after output: {pc}"
|
||||
)
|
||||
mapit_code = pc.replace(" ", "")
|
||||
skipped.append(pc)
|
||||
continue
|
||||
features.append(
|
||||
{
|
||||
"type": "Feature",
|
||||
"geometry": geojson_geom,
|
||||
"properties": {
|
||||
"postcodes": pc,
|
||||
"mapit_code": mapit_code,
|
||||
"mapit_code": pc.replace(" ", ""),
|
||||
},
|
||||
}
|
||||
)
|
||||
|
|
@ -211,6 +414,14 @@ def write_district_geojson(
|
|||
json.dump(collection, f, separators=(",", ":"))
|
||||
file_count += 1
|
||||
|
||||
if skipped:
|
||||
preview = ", ".join(skipped[:10])
|
||||
suffix = " …" if len(skipped) > 10 else ""
|
||||
print(
|
||||
f" Skipped {len(skipped)} postcode(s) with degenerate (sub-grid) "
|
||||
f"geometry: {preview}{suffix}"
|
||||
)
|
||||
|
||||
if units_dir.exists():
|
||||
shutil.rmtree(units_dir)
|
||||
tmp_units_dir.replace(units_dir)
|
||||
|
|
|
|||
|
|
@ -85,19 +85,42 @@ def _claim_inspire_parcels(
|
|||
uprn_pts = shp_points(points)
|
||||
pt_idx, cand_idx = cand_tree.query(uprn_pts, predicate="within")
|
||||
|
||||
# First priority: parcels that physically contain UPRNs. Majority vote
|
||||
# resolves blocks of flats or overlapping parcel data.
|
||||
# First priority: parcels that physically contain UPRNs. A parcel holding
|
||||
# UPRNs from a single postcode goes wholly to that postcode. A parcel shared
|
||||
# by several postcodes (a block of flats spanning postcodes, or overlapping
|
||||
# parcel data) is split between them via a sub-Voronoi over their own UPRNs
|
||||
# clipped to the parcel — so EVERY contained postcode keeps part of the
|
||||
# parcel. A bare majority vote would hand the whole parcel to one winner and
|
||||
# leave the losers' UPRNs trapped inside claimed land, dropping them from
|
||||
# both this claim and the `remaining` polygon handed to Voronoi downstream.
|
||||
cand_postcodes: dict[int, list[str]] = defaultdict(list)
|
||||
cand_point_idx: dict[int, list[int]] = defaultdict(list)
|
||||
for pi, ci in zip(pt_idx, cand_idx):
|
||||
cand_postcodes[ci].append(postcodes[pi])
|
||||
cand_point_idx[ci].append(pi)
|
||||
|
||||
points_f64 = points.astype(np.float64, copy=False)
|
||||
contained_parts: dict[str, list] = defaultdict(list)
|
||||
contained_scores: Counter[str] = Counter()
|
||||
for ci, pc_list in cand_postcodes.items():
|
||||
pc_counts = Counter(pc_list)
|
||||
winner, votes = pc_counts.most_common(1)[0]
|
||||
contained_parts[winner].append(parcels[ci])
|
||||
contained_scores[winner] += votes
|
||||
if len(pc_counts) == 1:
|
||||
winner = next(iter(pc_counts))
|
||||
contained_parts[winner].append(parcels[ci])
|
||||
contained_scores[winner] += pc_counts[winner]
|
||||
continue
|
||||
# Shared parcel: sub-Voronoi over the contained UPRNs so each postcode
|
||||
# present keeps a fragment instead of being absorbed by the winner.
|
||||
sub_idx = cand_point_idx[ci]
|
||||
sub_points = points_f64[sub_idx]
|
||||
sub_postcodes = [postcodes[pi] for pi in sub_idx]
|
||||
for pc, geom in compute_voronoi_regions(
|
||||
sub_points, sub_postcodes, parcels[ci]
|
||||
).items():
|
||||
cleaned = _clean_polygonal(geom)
|
||||
if cleaned is not None:
|
||||
contained_parts[pc].append(cleaned)
|
||||
contained_scores[pc] += pc_counts[pc]
|
||||
|
||||
contained_claimed = _merge_parts_by_postcode(contained_parts)
|
||||
contained_claims = sorted(
|
||||
|
|
@ -109,7 +132,6 @@ def _claim_inspire_parcels(
|
|||
# each to the nearest UPRN/postcode so parcel boundaries carry more of the
|
||||
# visible postcode shape; Voronoi is then limited to roads, parks, water, and
|
||||
# any other non-parcel gaps.
|
||||
points_f64 = points.astype(np.float64, copy=False)
|
||||
contained_union = _union_claims(contained_claims)
|
||||
nearest_tree = cKDTree(points_f64)
|
||||
nearest_parts: dict[str, list] = defaultdict(list)
|
||||
|
|
@ -235,11 +257,11 @@ def _extract_polygonal(geom) -> Polygon | MultiPolygon | None:
|
|||
return None
|
||||
if len(polys) == 1:
|
||||
return polys[0]
|
||||
return MultiPolygon(
|
||||
[
|
||||
p
|
||||
for g in polys
|
||||
for p in (g.geoms if g.geom_type == "MultiPolygon" else [g])
|
||||
]
|
||||
)
|
||||
# Union (not bare MultiPolygon construction): make_valid can emit
|
||||
# overlapping polygonal parts, and a MultiPolygon of overlapping parts is
|
||||
# invalid — it double-counts area and makes the next `.difference()` raise
|
||||
# a TopologyException that aborts the OA (and, in parallel mode, the
|
||||
# worker). unary_union merges them into a valid geometry.
|
||||
merged = unary_union(polys)
|
||||
return merged if not merged.is_empty else None
|
||||
return None
|
||||
|
|
|
|||
|
|
@ -11,12 +11,20 @@ import pytest
|
|||
from shapely.geometry import MultiPolygon, Polygon, box
|
||||
from shapely.ops import unary_union
|
||||
|
||||
from .fragments_cache import (
|
||||
fragments_cache_is_fresh,
|
||||
load_fragments,
|
||||
save_fragments,
|
||||
)
|
||||
from .__main__ import _oa_fragments, _process_oas
|
||||
from .inspire import build_inspire_index
|
||||
from .oa_boundaries import parse_gpkg_geometry
|
||||
from .greenspace import subtract_greenspace
|
||||
from .output import (
|
||||
_fill_holes,
|
||||
merge_fragments,
|
||||
to_wgs84_geojson,
|
||||
to_wgs84_geojson_multi,
|
||||
write_district_geojson,
|
||||
)
|
||||
from .process_oa import _extract_polygonal, process_oa
|
||||
|
|
@ -173,6 +181,52 @@ class TestWhitespacePostcodes:
|
|||
|
||||
assert loaded_df["PCDS"].to_list() == ["AA1 1AB"]
|
||||
|
||||
def test_remapped_terminated_postcode_adopts_successor_oa(self, tmp_path):
|
||||
"""When a terminated postcode is remapped to its active successor, the
|
||||
remapped seed point must carry the SUCCESSOR's OA (and coords), not the
|
||||
terminated postcode's original OA. Pre-fix the row kept OA21CD of the
|
||||
terminated postcode, seeding the successor into an OA it doesn't belong
|
||||
to and splitting its boundary across OAs."""
|
||||
# Terminated AA1 1AA sits in OA E00000001. Its nearest active successor
|
||||
# AA1 1AB lives in a DIFFERENT OA (E00000002) far away.
|
||||
uprns = pl.DataFrame(
|
||||
{
|
||||
"GRIDGB1E": [500010],
|
||||
"GRIDGB1N": [180010],
|
||||
"PCDS": ["AA1 1AA"],
|
||||
"OA21CD": ["E00000001"],
|
||||
}
|
||||
)
|
||||
uprn_path = tmp_path / "uprn.parquet"
|
||||
uprns.write_parquet(uprn_path)
|
||||
arcgis = pl.DataFrame(
|
||||
{
|
||||
"pcds": ["AA1 1AA", "AA1 1AB"],
|
||||
"east1m": [500010, 500030],
|
||||
"north1m": [180010, 180020],
|
||||
# AA1 1AA terminated → only AA1 1AB is an active successor, and
|
||||
# it belongs to a different OA than the terminated postcode.
|
||||
"oa21cd": ["E00000001", "E00000002"],
|
||||
"doterm": ["2020-01-01", None],
|
||||
"ctry25cd": ["E92000001", "E92000001"],
|
||||
}
|
||||
)
|
||||
arcgis_path = tmp_path / "arcgis.parquet"
|
||||
arcgis.write_parquet(arcgis_path)
|
||||
|
||||
loaded_df, offsets = load_uprns(uprn_path, arcgis_path)
|
||||
|
||||
# The remapped point must be grouped under the successor's OA, not the
|
||||
# terminated postcode's OA.
|
||||
assert "E00000002" in offsets, "Successor OA missing — remap kept old OA"
|
||||
assert "E00000001" not in offsets, (
|
||||
"Remapped point still lives in the terminated postcode's OA"
|
||||
)
|
||||
points, postcodes = get_oa_uprns(loaded_df, offsets, "E00000002")
|
||||
assert postcodes == ["AA1 1AB"]
|
||||
# It should also adopt the successor's authoritative coordinates.
|
||||
assert points.tolist() == [[500030.0, 180020.0]]
|
||||
|
||||
def test_arcgis_filters_to_active_english_postcodes(self, tmp_path):
|
||||
uprns = pl.DataFrame(
|
||||
{
|
||||
|
|
@ -617,6 +671,32 @@ class TestProcessOAInspireParcelAssignment:
|
|||
for _, geom in fragments:
|
||||
assert geom.difference(oa_geom).area < 0.01
|
||||
|
||||
def test_shared_parcel_keeps_every_contained_postcode(self):
|
||||
"""A single parcel containing UPRNs for [A, A, B] must yield a fragment
|
||||
for BOTH A and B. Pre-fix the majority winner (A) claimed the whole
|
||||
parcel, excluding it from `remaining`, so B's UPRNs were trapped inside
|
||||
claimed land and B vanished entirely (no fragment)."""
|
||||
oa_geom = box(0, 0, 100, 100)
|
||||
parcel = box(0, 0, 100, 100) # one parcel covering the whole OA
|
||||
points = np.array(
|
||||
[
|
||||
[20, 50], # postcode A
|
||||
[30, 50], # postcode A (majority)
|
||||
[80, 50], # postcode B (minority — would be dropped pre-fix)
|
||||
]
|
||||
)
|
||||
postcodes = ["A", "A", "B"]
|
||||
|
||||
fragments = process_oa(oa_geom, points, postcodes, inspire_candidates=[parcel])
|
||||
frag_dict = dict(fragments)
|
||||
|
||||
assert "A" in frag_dict, "Majority postcode A must keep a fragment"
|
||||
assert "B" in frag_dict, "Minority postcode B must not be dropped"
|
||||
assert frag_dict["A"].area > 0
|
||||
assert frag_dict["B"].area > 0
|
||||
# The split must partition the parcel without overlap.
|
||||
assert frag_dict["A"].intersection(frag_dict["B"]).area < 0.01
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# _extract_polygonal helper
|
||||
|
|
@ -656,6 +736,21 @@ class TestExtractPolygonal:
|
|||
|
||||
assert _extract_polygonal(LineString([(0, 0), (1, 1)])) is None
|
||||
|
||||
def test_overlapping_collection_unioned_to_valid(self):
|
||||
"""A GeometryCollection with OVERLAPPING polygons must be unioned into a
|
||||
VALID geometry (not a raw MultiPolygon, which would be invalid and crash
|
||||
the next .difference()), and must not double-count the overlap area."""
|
||||
from shapely.geometry import GeometryCollection
|
||||
|
||||
a = box(0, 0, 100, 100)
|
||||
b = box(50, 50, 150, 150) # overlaps a by 50x50
|
||||
result = _extract_polygonal(GeometryCollection([a, b]))
|
||||
assert result is not None
|
||||
assert result.is_valid
|
||||
assert result.area == pytest.approx(unary_union([a, b]).area)
|
||||
# And the formerly-crashing op now works:
|
||||
assert result.difference(box(0, 0, 10, 10)).is_valid
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Edge case: merge_fragments handles single-OA postcodes
|
||||
|
|
@ -763,12 +858,12 @@ class TestParseGpkgGeometry:
|
|||
|
||||
|
||||
class TestFillHoles:
|
||||
"""_fill_holes must remove all interior holes from polygons."""
|
||||
"""_fill_holes fills small artifact holes but keeps large (real-enclosed) ones."""
|
||||
|
||||
def test_polygon_with_hole(self):
|
||||
"""A polygon with an interior ring should become a solid polygon."""
|
||||
def test_small_artifact_hole_filled(self):
|
||||
"""A small (<1000 m²) interior ring is an artifact and gets filled."""
|
||||
outer = [(0, 0), (100, 0), (100, 100), (0, 100), (0, 0)]
|
||||
hole = [(30, 30), (70, 30), (70, 70), (30, 70), (30, 30)]
|
||||
hole = [(40, 40), (60, 40), (60, 60), (40, 60), (40, 40)] # 20x20 = 400 m²
|
||||
poly_with_hole = Polygon(outer, [hole])
|
||||
assert len(list(poly_with_hole.interiors)) == 1
|
||||
result = _fill_holes(poly_with_hole)
|
||||
|
|
@ -776,6 +871,15 @@ class TestFillHoles:
|
|||
assert len(list(result.interiors)) == 0
|
||||
assert result.area == pytest.approx(Polygon(outer).area)
|
||||
|
||||
def test_large_hole_kept(self):
|
||||
"""A large (>=1000 m²) hole is likely a real enclosed postcode — keep it."""
|
||||
outer = [(0, 0), (100, 0), (100, 100), (0, 100), (0, 0)]
|
||||
hole = [(20, 20), (80, 20), (80, 80), (20, 80), (20, 20)] # 60x60 = 3600 m²
|
||||
poly_with_hole = Polygon(outer, [hole])
|
||||
result = _fill_holes(poly_with_hole)
|
||||
assert len(list(result.interiors)) == 1
|
||||
assert result.area == pytest.approx(10000 - 3600)
|
||||
|
||||
def test_multipolygon_with_holes(self):
|
||||
"""A MultiPolygon where each part has holes should have all holes removed."""
|
||||
outer1 = [(0, 0), (50, 0), (50, 50), (0, 50), (0, 0)]
|
||||
|
|
@ -944,3 +1048,356 @@ class TestGreenspaceHolePreserved:
|
|||
merged = result["TEST1"]
|
||||
assert len(list(merged.interiors)) == 1
|
||||
assert merged.area == pytest.approx(10000 - 1600, rel=0.05)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# merge_fragments keeps substantial detached parts (no OA-seam coverage gaps)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestKeepDetachedParts:
|
||||
"""A postcode split across an OA seam (railway/river) must keep both parts
|
||||
instead of dropping all but the largest, which left ~1.8% uncovered gaps."""
|
||||
|
||||
def test_far_apart_parts_both_kept(self):
|
||||
# Two 50x50m blocks 30m apart — wider than the 10m merge buffer.
|
||||
a = box(0, 0, 50, 50) # 2500 m²
|
||||
b = box(80, 0, 130, 50) # 2500 m², 30m gap
|
||||
geom = merge_fragments([("AA1 1AA", a), ("AA1 1AA", b)])["AA1 1AA"]
|
||||
assert geom.geom_type == "MultiPolygon"
|
||||
assert len(geom.geoms) == 2
|
||||
assert geom.area == pytest.approx(5000, rel=0.01)
|
||||
|
||||
def test_tiny_noise_part_dropped(self):
|
||||
main = box(0, 0, 100, 100) # 10000 m²
|
||||
noise = box(200, 200, 205, 205) # 25 m² < 100 m² threshold
|
||||
geom = merge_fragments([("AA1 1AA", main), ("AA1 1AA", noise)])["AA1 1AA"]
|
||||
assert geom.geom_type == "Polygon"
|
||||
assert geom.area == pytest.approx(10000, rel=0.01)
|
||||
|
||||
|
||||
class TestMultiPolygonOutput:
|
||||
"""to_wgs84_geojson_multi / the writer must emit MultiPolygon for split
|
||||
postcodes (the Rust server + loader already parse MultiPolygon)."""
|
||||
|
||||
def test_multipolygon_preserves_all_parts(self):
|
||||
from shapely.geometry import shape
|
||||
|
||||
mp = MultiPolygon(
|
||||
[
|
||||
box(530000, 180000, 530100, 180100),
|
||||
box(531000, 180000, 531100, 180100),
|
||||
]
|
||||
)
|
||||
gj = to_wgs84_geojson_multi(mp)
|
||||
assert gj["type"] == "MultiPolygon"
|
||||
assert len(gj["coordinates"]) == 2
|
||||
rt = shape(gj)
|
||||
assert rt.is_valid and not rt.is_empty
|
||||
assert len(rt.geoms) == 2
|
||||
|
||||
def test_single_part_stays_polygon(self):
|
||||
gj = to_wgs84_geojson_multi(box(530000, 180000, 530100, 180100))
|
||||
assert gj["type"] == "Polygon"
|
||||
|
||||
def test_writer_emits_multipolygon_feature(self, tmp_path):
|
||||
mp = MultiPolygon(
|
||||
[
|
||||
box(530000, 180000, 530100, 180100),
|
||||
box(531000, 180000, 531100, 180100),
|
||||
]
|
||||
)
|
||||
assert write_district_geojson({"AA1 1AA": mp}, tmp_path) == 1
|
||||
coll = json.loads((tmp_path / "units" / "AA1.geojson").read_text())
|
||||
assert coll["features"][0]["geometry"]["type"] == "MultiPolygon"
|
||||
|
||||
|
||||
class TestOutputPartition:
|
||||
"""The writer must emit a partition: overlapping postcodes are made disjoint
|
||||
(no two cover the same ground) without dropping an active postcode."""
|
||||
|
||||
def test_overlapping_postcodes_made_disjoint(self, tmp_path):
|
||||
from shapely.geometry import shape
|
||||
|
||||
a = box(530000, 180000, 530100, 180100)
|
||||
b = box(530090, 180000, 530200, 180100) # overlaps `a` in a 10m strip
|
||||
assert a.intersection(b).area > 0 # precondition: they overlap
|
||||
|
||||
write_district_geojson({"AA1 1AA": a, "AA1 1AB": b}, tmp_path)
|
||||
coll = json.loads((tmp_path / "units" / "AA1.geojson").read_text())
|
||||
geoms = {
|
||||
f["properties"]["postcodes"]: shape(f["geometry"])
|
||||
for f in coll["features"]
|
||||
}
|
||||
assert set(geoms) == {"AA1 1AA", "AA1 1AB"} # neither dropped
|
||||
# Disjoint interiors (share at most an edge).
|
||||
assert geoms["AA1 1AA"].intersection(geoms["AA1 1AB"]).area == pytest.approx(
|
||||
0.0, abs=1e-12
|
||||
)
|
||||
assert all(g.area > 0 for g in geoms.values())
|
||||
|
||||
def test_enclosed_postcode_makes_container_a_donut(self, tmp_path):
|
||||
"""A postcode fully INSIDE another must stay disjoint: the smaller (inner)
|
||||
keeps its area, the container gets a hole. A plain `overlaps` query misses
|
||||
containment, so this is the regression guard for that fix."""
|
||||
from shapely.geometry import shape
|
||||
|
||||
outer = box(530000, 180000, 530300, 180300) # 90,000 m²
|
||||
inner = box(530100, 180100, 530200, 180200) # 10,000 m², fully inside outer
|
||||
assert outer.contains(inner) # precondition
|
||||
|
||||
write_district_geojson({"AA1 1AA": outer, "AA1 1AB": inner}, tmp_path)
|
||||
coll = json.loads((tmp_path / "units" / "AA1.geojson").read_text())
|
||||
geoms = {
|
||||
f["properties"]["postcodes"]: shape(f["geometry"])
|
||||
for f in coll["features"]
|
||||
}
|
||||
assert set(geoms) == {"AA1 1AA", "AA1 1AB"} # neither dropped
|
||||
assert geoms["AA1 1AA"].intersection(geoms["AA1 1AB"]).area == pytest.approx(
|
||||
0.0, abs=1e-12
|
||||
)
|
||||
# Container is now a donut around the enclosed postcode.
|
||||
assert geoms["AA1 1AA"].geom_type == "Polygon"
|
||||
assert len(list(geoms["AA1 1AA"].interiors)) == 1
|
||||
assert geoms["AA1 1AB"].area > 0
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# InspireIndex must return the same candidates as a brute-force bbox scan
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestInspireIndex:
|
||||
"""The grid index replaces a per-OA linear scan of all parcel bboxes; it must
|
||||
return an identical candidate set (and order) so Phase 3 output is unchanged."""
|
||||
|
||||
@staticmethod
|
||||
def _brute(bboxes, box):
|
||||
e0, n0, e1, n1 = box
|
||||
mask = (
|
||||
(bboxes[:, 2] >= e0)
|
||||
& (bboxes[:, 0] <= e1)
|
||||
& (bboxes[:, 3] >= n0)
|
||||
& (bboxes[:, 1] <= n1)
|
||||
)
|
||||
return np.where(mask)[0]
|
||||
|
||||
def test_matches_brute_force_over_random_queries(self):
|
||||
rng = np.random.default_rng(0)
|
||||
x = rng.uniform(0, 10000, 5000)
|
||||
y = rng.uniform(0, 10000, 5000)
|
||||
w = rng.uniform(1, 60, 5000) # all <= 500m cell → CSR path
|
||||
h = rng.uniform(1, 60, 5000)
|
||||
bboxes = np.column_stack([x, y, x + w, y + h]).astype(np.float64)
|
||||
idx = build_inspire_index(bboxes, None, None, cell_size=500.0)
|
||||
|
||||
for _ in range(400):
|
||||
cx, cy = rng.uniform(0, 10000), rng.uniform(0, 10000)
|
||||
sz = float(rng.choice([30.0, 200.0, 1000.0, 3000.0]))
|
||||
box = (cx, cy, cx + sz, cy + sz)
|
||||
got = idx.candidate_indices(box)
|
||||
expected = np.sort(self._brute(bboxes, box))
|
||||
assert np.array_equal(got, expected)
|
||||
|
||||
def test_oversized_parcel_is_found(self):
|
||||
# A parcel larger than a cell goes to the overflow list, not the grid;
|
||||
# a query deep inside it (away from the small parcels) must still find it.
|
||||
bboxes = np.array(
|
||||
[
|
||||
[0.0, 0.0, 5000.0, 5000.0], # 5km parcel >> 500m cell
|
||||
[100.0, 100.0, 120.0, 120.0],
|
||||
[4000.0, 4000.0, 4020.0, 4020.0],
|
||||
]
|
||||
)
|
||||
idx = build_inspire_index(bboxes, None, None, cell_size=500.0)
|
||||
box = (2000.0, 2000.0, 2050.0, 2050.0)
|
||||
got = idx.candidate_indices(box)
|
||||
assert 0 in got
|
||||
assert np.array_equal(got, np.sort(self._brute(bboxes, box)))
|
||||
|
||||
def test_no_overlap_returns_empty(self):
|
||||
bboxes = np.array([[0.0, 0.0, 10.0, 10.0], [20.0, 20.0, 30.0, 30.0]])
|
||||
idx = build_inspire_index(bboxes, None, None, cell_size=500.0)
|
||||
assert len(idx.candidate_indices((100.0, 100.0, 110.0, 110.0))) == 0
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Parallel OA processing must match the sequential result exactly
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestParallelProcessing:
|
||||
"""_process_oas across workers must produce the same fragments as workers=1.
|
||||
Uses single-postcode OAs (fast path), so it exercises the chunking + WKB
|
||||
round-trip + fork machinery without needing INSPIRE data."""
|
||||
|
||||
@staticmethod
|
||||
def _inputs(n_oas=60):
|
||||
import pyarrow as pa
|
||||
|
||||
oa_geoms = {
|
||||
f"E{i:08d}": box(i * 100.0, 0.0, i * 100.0 + 50.0, 50.0)
|
||||
for i in range(n_oas)
|
||||
}
|
||||
codes = sorted(oa_geoms)
|
||||
east, north, pcs = [], [], []
|
||||
offsets = {}
|
||||
pos = 0
|
||||
for i, code in enumerate(codes):
|
||||
east += [i * 100.0 + 10.0, i * 100.0 + 20.0]
|
||||
north += [10.0, 20.0]
|
||||
pcs += [f"AA{i % 5} {i % 9}AA"] * 2 # one postcode per OA → fast path
|
||||
offsets[code] = (pos, pos + 2)
|
||||
pos += 2
|
||||
return (
|
||||
codes,
|
||||
oa_geoms,
|
||||
np.array(east),
|
||||
np.array(north),
|
||||
pa.array(pcs, type=pa.large_string()),
|
||||
offsets,
|
||||
)
|
||||
|
||||
@staticmethod
|
||||
def _norm(frags):
|
||||
return sorted((pc, geom.wkb_hex) for pc, geom in frags)
|
||||
|
||||
def test_parallel_matches_sequential(self):
|
||||
codes, oa, east, north, pcs, offs = self._inputs()
|
||||
seq, s1 = _process_oas(codes, oa, east, north, pcs, offs, None, workers=1)
|
||||
par, s2 = _process_oas(codes, oa, east, north, pcs, offs, None, workers=3)
|
||||
assert len(seq) == len(codes) # one fragment per single-postcode OA
|
||||
assert s1 == s2 == len(codes)
|
||||
assert self._norm(seq) == self._norm(par)
|
||||
|
||||
def test_oa_failure_is_tagged_with_oa_code(self):
|
||||
"""A failure inside per-OA processing must re-raise with the OA code, so a
|
||||
single bad OA is attributable instead of an anonymous worker abort."""
|
||||
# Missing OA in the geoms dict → KeyError, wrapped with the OA code.
|
||||
with pytest.raises(RuntimeError, match="E00099999"):
|
||||
_oa_fragments("E00099999", {}, None, None, None, {}, None)
|
||||
|
||||
|
||||
class TestDegenerateGeometryHandling:
|
||||
"""Every active postcode must keep a boundary (validate_outputs is strict),
|
||||
so a sub-grid sliver is fattened rather than dropped. A genuinely empty
|
||||
geometry is skipped without aborting the whole write (the 10h regression)."""
|
||||
|
||||
# Three near-collinear vertices in BNG: bbox ~28m x 7m but area ~0.04 m²,
|
||||
# i.e. AL10 0TU. Without the rescue it snaps to empty at output precision.
|
||||
SLIVER = Polygon(
|
||||
[(523045.34, 209625.56), (523040.47, 209624.33), (523017.0, 209618.42)]
|
||||
)
|
||||
|
||||
def test_sliver_is_rescued_to_valid_geometry(self):
|
||||
from shapely.geometry import shape
|
||||
|
||||
result = to_wgs84_geojson(self.SLIVER)
|
||||
assert result is not None, "sliver must be rescued, not dropped"
|
||||
rt = shape(result)
|
||||
assert not rt.is_empty
|
||||
assert rt.is_valid
|
||||
|
||||
def test_collinear_zero_area_input_is_rescued(self):
|
||||
"""A zero-area collinear 'polygon' (can't be cleaned to a polygon) must
|
||||
still be rescued via the representative-point fallback, not dropped."""
|
||||
from shapely.geometry import shape
|
||||
|
||||
degenerate = Polygon(
|
||||
[(523000, 209600), (523010, 209600), (523020, 209600), (523000, 209600)]
|
||||
)
|
||||
assert degenerate.area == 0.0
|
||||
result = to_wgs84_geojson(degenerate)
|
||||
assert result is not None, "degenerate input must be rescued, not dropped"
|
||||
rt = shape(result)
|
||||
assert not rt.is_empty
|
||||
assert rt.is_valid
|
||||
|
||||
def test_sliver_postcode_present_in_output(self, tmp_path):
|
||||
postcodes = {
|
||||
"AA1 1AA": box(530000, 180000, 530100, 180100),
|
||||
"AA1 1AB": self.SLIVER, # must survive
|
||||
}
|
||||
file_count = write_district_geojson(postcodes, tmp_path)
|
||||
assert file_count == 1
|
||||
collection = json.loads((tmp_path / "units" / "AA1.geojson").read_text())
|
||||
written = {f["properties"]["postcodes"] for f in collection["features"]}
|
||||
assert written == {"AA1 1AA", "AA1 1AB"}
|
||||
|
||||
def test_empty_geometry_skipped_not_raised(self, tmp_path):
|
||||
# The last-resort safety net: an unrescuable (empty) geometry is skipped
|
||||
# so one bad postcode can never abort a multi-hour run.
|
||||
postcodes = {
|
||||
"AA1 1AA": box(530000, 180000, 530100, 180100),
|
||||
"AA1 1AB": Polygon(), # genuinely empty
|
||||
}
|
||||
file_count = write_district_geojson(postcodes, tmp_path)
|
||||
assert file_count == 1
|
||||
collection = json.loads((tmp_path / "units" / "AA1.geojson").read_text())
|
||||
written = {f["properties"]["postcodes"] for f in collection["features"]}
|
||||
assert written == {"AA1 1AA"}
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# fragments_cache round-trips Phase 3 output and validates freshness
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestFragmentsCache:
|
||||
"""Persisting Phase 3 lets a crashed run resume without the ~10h OA loop."""
|
||||
|
||||
def test_round_trip_preserves_postcodes_and_geometry(self, tmp_path):
|
||||
fragments = [
|
||||
("AA1 1AA", box(0, 0, 100, 100)),
|
||||
("AA1 1AB", box(200, 200, 250, 260)),
|
||||
# A postcode spanning multiple OAs appears as repeated entries.
|
||||
("AA1 1AA", box(100, 0, 150, 100)),
|
||||
("AA1 1AC", MultiPolygon([box(0, 0, 10, 10), box(20, 20, 30, 30)])),
|
||||
]
|
||||
cache = tmp_path / "fragments_cache.parquet"
|
||||
save_fragments(cache, fragments)
|
||||
loaded = load_fragments(cache)
|
||||
|
||||
assert [pc for pc, _ in loaded] == [pc for pc, _ in fragments]
|
||||
for (_, original), (_, restored) in zip(fragments, loaded):
|
||||
assert restored.equals(original)
|
||||
|
||||
def test_save_is_atomic_no_tmp_left_behind(self, tmp_path):
|
||||
cache = tmp_path / "fragments_cache.parquet"
|
||||
save_fragments(cache, [("AA1 1AA", box(0, 0, 1, 1))])
|
||||
assert cache.exists()
|
||||
assert not (tmp_path / "fragments_cache.parquet.tmp").exists()
|
||||
|
||||
def test_missing_cache_is_not_fresh(self, tmp_path):
|
||||
cache = tmp_path / "fragments_cache.parquet"
|
||||
inp = tmp_path / "uprn.parquet"
|
||||
inp.write_text("x")
|
||||
assert fragments_cache_is_fresh(cache, [inp]) is False
|
||||
|
||||
def test_cache_newer_than_inputs_is_fresh(self, tmp_path):
|
||||
import os
|
||||
|
||||
inp = tmp_path / "uprn.parquet"
|
||||
inp.write_text("x")
|
||||
cache = tmp_path / "fragments_cache.parquet"
|
||||
cache.write_text("c")
|
||||
os.utime(inp, (1_000, 1_000))
|
||||
os.utime(cache, (2_000, 2_000))
|
||||
assert fragments_cache_is_fresh(cache, [inp, None]) is True
|
||||
|
||||
def test_cache_older_than_any_input_is_stale(self, tmp_path):
|
||||
import os
|
||||
|
||||
inp = tmp_path / "oa.gpkg"
|
||||
inp.write_text("x")
|
||||
cache = tmp_path / "fragments_cache.parquet"
|
||||
cache.write_text("c")
|
||||
os.utime(cache, (1_000, 1_000))
|
||||
os.utime(inp, (2_000, 2_000)) # input touched after the cache
|
||||
assert fragments_cache_is_fresh(cache, [inp]) is False
|
||||
|
||||
def test_missing_input_is_ignored(self, tmp_path):
|
||||
cache = tmp_path / "fragments_cache.parquet"
|
||||
cache.write_text("c")
|
||||
# arcgis is optional/absent — it cannot have invalidated the cache.
|
||||
assert fragments_cache_is_fresh(cache, [tmp_path / "absent.parquet"]) is True
|
||||
|
|
|
|||
|
|
@ -79,13 +79,42 @@ def load_uprns(
|
|||
)
|
||||
|
||||
if mapping is not None and mapping.height > 0:
|
||||
uprns = (
|
||||
uprns.join(
|
||||
mapping.lazy(), left_on="PCDS", right_on="old_postcode", how="left"
|
||||
# Remap terminated postcodes to their nearest active successor. The
|
||||
# successor generally lives in a DIFFERENT OA (and at different grid
|
||||
# coordinates), so the remapped point must adopt the successor's
|
||||
# authoritative OA/coords — keeping the terminated postcode's original
|
||||
# OA would seed the successor into an OA it doesn't belong to, splitting
|
||||
# its boundary across OAs. Genuine (non-remapped) UPRN rows keep their
|
||||
# own OA, since a live postcode can legitimately span several OAs.
|
||||
uprns = uprns.join(
|
||||
mapping.lazy(), left_on="PCDS", right_on="old_postcode", how="left"
|
||||
).with_columns(pl.col("new_postcode").is_not_null().alias("_remapped"))
|
||||
if active_postcode_points is not None:
|
||||
successor_oa = active_postcode_points.rename(
|
||||
{
|
||||
"PCDS": "new_postcode",
|
||||
"GRIDGB1E": "_succ_e",
|
||||
"GRIDGB1N": "_succ_n",
|
||||
"OA21CD": "_succ_oa",
|
||||
}
|
||||
)
|
||||
.with_columns(pl.coalesce("new_postcode", "PCDS").alias("PCDS"))
|
||||
.select("GRIDGB1E", "GRIDGB1N", "PCDS", "OA21CD")
|
||||
)
|
||||
uprns = uprns.join(successor_oa, on="new_postcode", how="left").with_columns(
|
||||
pl.when("_remapped")
|
||||
.then(pl.col("_succ_e"))
|
||||
.otherwise(pl.col("GRIDGB1E"))
|
||||
.alias("GRIDGB1E"),
|
||||
pl.when("_remapped")
|
||||
.then(pl.col("_succ_n"))
|
||||
.otherwise(pl.col("GRIDGB1N"))
|
||||
.alias("GRIDGB1N"),
|
||||
pl.when("_remapped")
|
||||
.then(pl.col("_succ_oa"))
|
||||
.otherwise(pl.col("OA21CD"))
|
||||
.alias("OA21CD"),
|
||||
)
|
||||
uprns = uprns.with_columns(
|
||||
pl.coalesce("new_postcode", "PCDS").alias("PCDS")
|
||||
).select("GRIDGB1E", "GRIDGB1N", "PCDS", "OA21CD")
|
||||
|
||||
if active_postcode_points is not None:
|
||||
active_postcodes = active_postcode_points.select("PCDS").unique()
|
||||
|
|
@ -149,3 +178,37 @@ def get_oa_uprns(
|
|||
)
|
||||
postcodes = sub["PCDS"].to_list()
|
||||
return points, postcodes
|
||||
|
||||
|
||||
def extract_uprn_arrays(df: pl.DataFrame):
|
||||
"""Convert the UPRN DataFrame to fork-shareable numpy/Arrow arrays.
|
||||
|
||||
Returns ``(east, north, postcodes)``: two float64 ndarrays and a contiguous
|
||||
pyarrow string Array. Multiprocessing workers slice these per OA via
|
||||
:func:`get_oa_uprns_arrays` **without touching polars**, which avoids the
|
||||
fork-after-threads deadlock hazard of polars' rayon pool. Being plain
|
||||
numpy/Arrow buffers (not millions of Python objects), they are shared by
|
||||
``fork`` copy-on-write rather than duplicated ~1GB per worker.
|
||||
"""
|
||||
import pyarrow as pa
|
||||
|
||||
east = np.ascontiguousarray(df["GRIDGB1E"].to_numpy(), dtype=np.float64)
|
||||
north = np.ascontiguousarray(df["GRIDGB1N"].to_numpy(), dtype=np.float64)
|
||||
postcodes = df["PCDS"].to_arrow()
|
||||
if isinstance(postcodes, pa.ChunkedArray):
|
||||
postcodes = postcodes.combine_chunks()
|
||||
return east, north, postcodes
|
||||
|
||||
|
||||
def get_oa_uprns_arrays(
|
||||
east: np.ndarray,
|
||||
north: np.ndarray,
|
||||
postcodes,
|
||||
offsets: dict[str, tuple[int, int]],
|
||||
oa_code: str,
|
||||
) -> tuple[np.ndarray, list[str]]:
|
||||
"""Like :func:`get_oa_uprns`, but slices the fork-shareable arrays from
|
||||
:func:`extract_uprn_arrays` (no polars), so it is safe to call in workers."""
|
||||
s, e = offsets[oa_code]
|
||||
points = np.column_stack([east[s:e], north[s:e]])
|
||||
return points, postcodes.slice(s, e - s).to_pylist()
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue