Good changes

2026-03-11 20:44:34 +00:00 · 791bc6976b
commit 791bc6976b
parent 80a5a2a774
24 changed files with 890 additions and 312 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -17,8 +17,9 @@ All commands use [Task](https://taskfile.dev) runner. Python uses `uv run`. Fron
 task dev:server           # Rust backend on :8001 (cargo run --release)
 task dev:frontend         # Webpack dev server on :3001 (proxies /api to :8001)

-# Data pipeline
-task prepare              # Build wide.parquet from all pre-downloaded sources
+# Data pipeline (uses Make, not Task — see Makefile.data)
+make -f Makefile.data prepare   # Build properties.parquet (merge + price estimation)
+make -f Makefile.data merge     # Just the merge step (no price estimation)

 # Assets
 task download:map-assets  # Download font glyphs + twemoji PNGs into frontend/public/assets/
@@ -55,28 +56,30 @@ uv run pytest pipeline/utils/test_haversine.py -k "test_name"  # Single test
 ```
 Raw sources → [Download scripts] → data/*.parquet
  → [Fuzzy join EPC ↔ Price-Paid] → epc_pp.parquet
-  → [Merge all datasets] → wide.parquet
+  → [Merge all datasets] → properties.parquet
+  → [Price estimation] → properties.parquet (augmented with estimated prices)
  → [Rust server loads into memory + precomputes H3 + spatial grid]
  → [Frontend renders deck.gl H3HexagonLayer over MapLibre GL]
 ```

 ### Data Pipeline (`pipeline/`)

-Python + Polars. Two phases:
+Python + Polars. Orchestrated by `Makefile.data` (Make DAG with sentinel files like `.merge_done`, `.prices_done`). Two phases:

 1. **Download** (`pipeline/download/`) — Each script fetches one raw dataset into `data/`
 2. **Transform** (`pipeline/transform/`) — Joins and derives features:
   - `join_epc_pp.py` — Fuzzy-joins EPC ↔ price-paid by address within postcode buckets
-   - `merge.py` — **Main pipeline**: joins all datasets → `wide.parquet` with human-readable column names
+   - `merge.py` — **Main pipeline**: joins all datasets → `properties.parquet` with human-readable column names
+   - `price_estimation/` — Post-merge step: adds "Estimated current price" and "Est. price per sqm" columns to `properties.parquet`. Uses repeat-sales price index + kNN spatial blending. Requires `price_index.parquet` (built by `price_estimation/index.py`). Run via `make -f Makefile.data prepare` (the `merge` target alone skips this).
   - `transform_poi.py` — Filters POIs, maps to friendly names + emoji (exhaustive category validation)
   - `poi_proximity.py` — Counts POIs within 2km per postcode using 0.05° spatial grid
   - `crime.py` — Aggregates crime CSVs into yearly averages by LSOA

-**Critical: column renaming in `merge.py`** — The pipeline renames columns from snake_case to human-readable names before writing `wide.parquet`. The Rust server and frontend use **only** these human-readable names — there are no fallbacks to snake_case. Key renames:
+**Critical: column renaming in `merge.py`** — The pipeline renames columns from snake_case to human-readable names before writing `properties.parquet`. The Rust server and frontend use **only** these human-readable names — there are no fallbacks to snake_case. Key renames:
 - `pp_address` → `Address per Property Register`
 - `postcode` → `Postcode`
 - `latest_price` → `Last known price`
-- `duration` → `Leashold/Freehold`
+- `duration` → `Leasehold/Freehold`
 - `total_floor_area` → `Total floor area (sqm)`
 - `current_energy_rating` → `Current energy rating`

@@ -321,7 +324,7 @@ Follow these conventions in all Rust code:
 - **POI transform validation**: Fails if any OSM category is unmapped — guarantees exhaustive coverage
 - **Fuzzy join**: Groups by postcode, uses `thefuzz.token_sort_ratio` with numeric token compatibility, greedy assignment from highest score
 - **Filter parsing is strict**: `parse_filters()` returns `Result` — malformed entries, unknown feature names, and unparseable numbers all return 400 Bad Request. No silent skipping of invalid filters.
-- **Data loading is strict**: `extract_string_col` and `lookup_enum_value` take a single column name (no fallback names). H3 precomputation panics on invalid coordinates. Required parquet columns must exist at startup.
+- **Data loading is strict**: `extract_string_col` and `lookup_enum_value` take a single column name (no fallback names). H3 precomputation panics on invalid coordinates. All configured features (defined in `features.rs`) must exist in at least one data source — the server panics at startup if any are missing (no NaN placeholders). This means all pipeline steps must be complete before starting the server. Polars `diagonal: true` concat fills nulls for features that exist in some but not all sources (e.g. "Listing date" from listings only).
 - **Travel time is strict**: `mode` param is required (400) when `destination` is set — no silent default to "car". R5 failures return 502 Bad Gateway, not silent omission. `r5_url` is `Option<String>` — returns 503 if travel time requested without R5 configured.
 - **Filter bounds format**: `south,west,north,east` (not standard bbox order)
 - **Server-side AABB filtering**: Both `/api/hexagons` and `/api/postcodes` filter results by bounding-box intersection with query bounds. Hexagons use `h3_cell_bounds()` (h3o returns degrees, not radians). Postcodes compute polygon AABB from vertices. See `bounds_intersect()` in `parsing/bounds.rs`.
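
The last two context lines describe a non-standard bounds order (`south,west,north,east`) and an axis-aligned bounding-box intersection test. The following is a minimal Python sketch of that idea, not the Rust code in `parsing/bounds.rs`: the `Bounds` type, `parse_bounds`, and `polygon_aabb` are hypothetical names introduced for illustration, `bounds_intersect` only borrows the name of the function the diff mentions, and the sketch assumes boxes that do not cross the antimeridian.

```python
from typing import Iterable, NamedTuple, Tuple


class Bounds(NamedTuple):
    """Axis-aligned box in degrees, in the order the diff documents."""
    south: float
    west: float
    north: float
    east: float


def parse_bounds(raw: str) -> Bounds:
    """Parse 'south,west,north,east'; raise ValueError on malformed input."""
    parts = raw.split(",")
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated values, got {len(parts)}")
    south, west, north, east = (float(p) for p in parts)
    if south > north or west > east:
        raise ValueError("south/west must not exceed north/east")
    return Bounds(south, west, north, east)


def polygon_aabb(vertices: Iterable[Tuple[float, float]]) -> Bounds:
    """Compute the bounding box of (lat, lng) vertices."""
    points = list(vertices)
    lats = [lat for lat, _ in points]
    lngs = [lng for _, lng in points]
    return Bounds(min(lats), min(lngs), max(lats), max(lngs))


def bounds_intersect(a: Bounds, b: Bounds) -> bool:
    """Two boxes overlap when their ranges overlap on both axes."""
    return (
        a.south <= b.north
        and b.south <= a.north
        and a.west <= b.east
        and b.west <= a.east
    )
```

A query would be parsed once with `parse_bounds`, and each candidate shape kept only if `bounds_intersect(query, polygon_aabb(shape))` is true. Raising on malformed input mirrors the strict-parsing convention described in the surrounding lines, where invalid input is rejected rather than skipped.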