Good changes

Andras Schmelczer 2026-03-11 20:44:34 +00:00
parent 80a5a2a774
commit 791bc6976b
24 changed files with 890 additions and 312 deletions


@@ -17,8 +17,9 @@ All commands use [Task](https://taskfile.dev) runner. Python uses `uv run`. Fron
task dev:server # Rust backend on :8001 (cargo run --release)
task dev:frontend # Webpack dev server on :3001 (proxies /api to :8001)
# Data pipeline
task prepare # Build wide.parquet from all pre-downloaded sources
# Data pipeline (uses Make, not Task — see Makefile.data)
make -f Makefile.data prepare # Build properties.parquet (merge + price estimation)
make -f Makefile.data merge # Just the merge step (no price estimation)
# Assets
task download:map-assets # Download font glyphs + twemoji PNGs into frontend/public/assets/
@@ -55,28 +56,30 @@ uv run pytest pipeline/utils/test_haversine.py -k "test_name" # Single test
```
Raw sources → [Download scripts] → data/*.parquet
→ [Fuzzy join EPC ↔ Price-Paid] → epc_pp.parquet
→ [Merge all datasets] → wide.parquet
→ [Merge all datasets] → properties.parquet
→ [Price estimation] → properties.parquet (augmented with estimated prices)
→ [Rust server loads into memory + precomputes H3 + spatial grid]
→ [Frontend renders deck.gl H3HexagonLayer over MapLibre GL]
```
### Data Pipeline (`pipeline/`)
Python + Polars. Two phases:
Python + Polars. Orchestrated by `Makefile.data` (Make DAG with sentinel files like `.merge_done`, `.prices_done`). Two phases:
1. **Download** (`pipeline/download/`) — Each script fetches one raw dataset into `data/`
2. **Transform** (`pipeline/transform/`) — Joins and derives features:
- `join_epc_pp.py` — Fuzzy-joins EPC ↔ price-paid by address within postcode buckets
- `merge.py` **Main pipeline**: joins all datasets → `wide.parquet` with human-readable column names
- `merge.py` **Main pipeline**: joins all datasets → `properties.parquet` with human-readable column names
- `price_estimation/` — Post-merge step: adds "Estimated current price" and "Est. price per sqm" columns to `properties.parquet`. Uses repeat-sales price index + kNN spatial blending. Requires `price_index.parquet` (built by `price_estimation/index.py`). Run via `make -f Makefile.data prepare` (the `merge` target alone skips this).
- `transform_poi.py` — Filters POIs, maps to friendly names + emoji (exhaustive category validation)
- `poi_proximity.py` — Counts POIs within 2km per postcode using 0.05° spatial grid
- `crime.py` — Aggregates crime CSVs into yearly averages by LSOA
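The grid-based proximity count described for `poi_proximity.py` can be sketched as follows. This is a minimal standard-library illustration, not the project's Polars implementation: the function names, the `(lat, lon)` tuple layout, and the 3×3 neighbourhood search are assumptions. The neighbourhood is only sufficient while one 0.05° cell spans at least 2 km in both directions, which holds at UK latitudes.

```python
import math
from collections import defaultdict

CELL_DEG = 0.05             # grid cell size in degrees (from the description above)
RADIUS_KM = 2.0             # search radius (from the description above)
EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def cell_of(lat, lon):
    """Map a coordinate to its (row, col) grid cell."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def build_grid(pois):
    """Bucket POIs (hypothetical (lat, lon) tuples) by grid cell."""
    grid = defaultdict(list)
    for lat, lon in pois:
        grid[cell_of(lat, lon)].append((lat, lon))
    return grid


def count_within_radius(grid, lat, lon):
    """Count POIs within RADIUS_KM, checking only the 3x3 block of nearby cells."""
    row, col = cell_of(lat, lon)
    count = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            for plat, plon in grid.get((row + dr, col + dc), ()):
                if haversine_km(lat, lon, plat, plon) <= RADIUS_KM:
                    count += 1
    return count
```

The grid keeps the exact distance check confined to a handful of candidate POIs per postcode instead of comparing every postcode against every POI.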
**Critical: column renaming in `merge.py`** — The pipeline renames columns from snake_case to human-readable names before writing `wide.parquet`. The Rust server and frontend use **only** these human-readable names — there are no fallbacks to snake_case. Key renames:
**Critical: column renaming in `merge.py`** — The pipeline renames columns from snake_case to human-readable names before writing `properties.parquet`. The Rust server and frontend use **only** these human-readable names — there are no fallbacks to snake_case. Key renames:
- `pp_address` → `Address per Property Register`
- `postcode` → `Postcode`
- `latest_price` → `Last known price`
- `duration` → `Leashold/Freehold`
- `duration` → `Leasehold/Freehold`
- `total_floor_area` → `Total floor area (sqm)`
- `current_energy_rating` → `Current energy rating`
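The rename step can be illustrated with a small mapping. This is a sketch only: `merge.py` works on Polars frames, whereas this version uses plain dicts so it runs without dependencies. The `RENAMES` and `rename_columns` names are hypothetical, and raising on a missing source column is an assumption chosen to match the "no fallbacks" rule.

```python
# Mapping taken from the key renames listed above (new-side spelling).
RENAMES = {
    "pp_address": "Address per Property Register",
    "postcode": "Postcode",
    "latest_price": "Last known price",
    "duration": "Leasehold/Freehold",
    "total_floor_area": "Total floor area (sqm)",
    "current_energy_rating": "Current energy rating",
}


def rename_columns(row, renames=RENAMES):
    """Return a copy of `row` with snake_case keys replaced by display names.

    Raises KeyError if an expected source column is absent (assumed behaviour),
    so a typo fails loudly instead of leaving a snake_case column behind.
    """
    missing = [col for col in renames if col not in row]
    if missing:
        raise KeyError(f"missing source columns: {missing}")
    # Columns without an entry in `renames` pass through unchanged.
    return {renames.get(key, key): value for key, value in row.items()}
```

Because downstream consumers look columns up by the display name only, a single source of truth for the mapping avoids the kind of drift that the `Leasehold/Freehold` spelling fix corrects.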
@@ -321,7 +324,7 @@ Follow these conventions in all Rust code:
- **POI transform validation**: Fails if any OSM category is unmapped — guarantees exhaustive coverage
- **Fuzzy join**: Groups by postcode, uses `thefuzz.token_sort_ratio` with numeric token compatibility, greedy assignment from highest score
- **Filter parsing is strict**: `parse_filters()` returns `Result` — malformed entries, unknown feature names, and unparseable numbers all return 400 Bad Request. No silent skipping of invalid filters.
- **Data loading is strict**: `extract_string_col` and `lookup_enum_value` take a single column name (no fallback names). H3 precomputation panics on invalid coordinates. Required parquet columns must exist at startup.
- **Data loading is strict**: `extract_string_col` and `lookup_enum_value` take a single column name (no fallback names). H3 precomputation panics on invalid coordinates. All configured features (defined in `features.rs`) must exist in at least one data source — the server panics at startup if any are missing (no NaN placeholders). This means all pipeline steps must be complete before starting the server. Polars `diagonal: true` concat fills nulls for features that exist in some but not all sources (e.g. "Listing date" from listings only).
- **Travel time is strict**: `mode` param is required (400) when `destination` is set — no silent default to "car". R5 failures return 502 Bad Gateway, not silent omission. `r5_url` is `Option<String>` — returns 503 if travel time requested without R5 configured.
- **Filter bounds format**: `south,west,north,east` (not standard bbox order)
- **Server-side AABB filtering**: Both `/api/hexagons` and `/api/postcodes` filter results by bounding-box intersection with query bounds. Hexagons use `h3_cell_bounds()` (h3o returns degrees, not radians). Postcodes compute polygon AABB from vertices. See `bounds_intersect()` in `parsing/bounds.rs`.
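The `south,west,north,east` bounds format and the bounding-box intersection test can be sketched in a few lines. The real code is Rust (`bounds_intersect()` in `parsing/bounds.rs`); the Python below is an illustrative stand-in. The `Bounds` type, `parse_bounds`, and `polygon_aabb` are hypothetical names, and the sketch assumes boxes do not cross the antimeridian.

```python
import math
from typing import NamedTuple


class Bounds(NamedTuple):
    south: float
    west: float
    north: float
    east: float


def parse_bounds(raw: str) -> Bounds:
    """Parse 'south,west,north,east' strictly; any malformed input raises ValueError."""
    parts = raw.split(",")
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated numbers, got {len(parts)}")
    south, west, north, east = (float(p) for p in parts)  # float() raises ValueError on junk
    if not all(math.isfinite(v) for v in (south, west, north, east)):
        raise ValueError("bounds must be finite numbers")
    if south > north or west > east:
        raise ValueError("bounds must satisfy south <= north and west <= east")
    return Bounds(south, west, north, east)


def polygon_aabb(vertices) -> Bounds:
    """Axis-aligned bounding box of a polygon given as (lat, lon) vertices."""
    lats = [lat for lat, _ in vertices]
    lons = [lon for _, lon in vertices]
    return Bounds(min(lats), min(lons), max(lats), max(lons))


def bounds_intersect(a: Bounds, b: Bounds) -> bool:
    """True if the two boxes overlap or touch on both axes."""
    return (
        a.south <= b.north and b.south <= a.north
        and a.west <= b.east and b.west <= a.east
    )
```

For example, `bounds_intersect(parse_bounds("51.4,-0.2,51.6,0.1"), polygon_aabb([(51.5, 0.0), (51.7, 0.0), (51.7, 0.3)]))` evaluates to `True`, so that polygon would be kept in the response under this sketch.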