Good changes
parent 80a5a2a774
commit 791bc6976b
24 changed files with 890 additions and 312 deletions
19
CLAUDE.md
@@ -17,8 +17,9 @@ All commands use [Task](https://taskfile.dev) runner. Python uses `uv run`. Fron
 task dev:server # Rust backend on :8001 (cargo run --release)
 task dev:frontend # Webpack dev server on :3001 (proxies /api to :8001)

-# Data pipeline
-task prepare # Build wide.parquet from all pre-downloaded sources
+# Data pipeline (uses Make, not Task — see Makefile.data)
+make -f Makefile.data prepare # Build properties.parquet (merge + price estimation)
+make -f Makefile.data merge # Just the merge step (no price estimation)

 # Assets
 task download:map-assets # Download font glyphs + twemoji PNGs into frontend/public/assets/
@@ -55,28 +56,30 @@ uv run pytest pipeline/utils/test_haversine.py -k "test_name" # Single test
 ```
 Raw sources → [Download scripts] → data/*.parquet
 → [Fuzzy join EPC ↔ Price-Paid] → epc_pp.parquet
-→ [Merge all datasets] → wide.parquet
+→ [Merge all datasets] → properties.parquet
+→ [Price estimation] → properties.parquet (augmented with estimated prices)
 → [Rust server loads into memory + precomputes H3 + spatial grid]
 → [Frontend renders deck.gl H3HexagonLayer over MapLibre GL]
 ```

 ### Data Pipeline (`pipeline/`)

-Python + Polars. Two phases:
+Python + Polars. Orchestrated by `Makefile.data` (Make DAG with sentinel files like `.merge_done`, `.prices_done`). Two phases:

 1. **Download** (`pipeline/download/`) — Each script fetches one raw dataset into `data/`
 2. **Transform** (`pipeline/transform/`) — Joins and derives features:
 - `join_epc_pp.py` — Fuzzy-joins EPC ↔ price-paid by address within postcode buckets
-- `merge.py` — **Main pipeline**: joins all datasets → `wide.parquet` with human-readable column names
+- `merge.py` — **Main pipeline**: joins all datasets → `properties.parquet` with human-readable column names
+- `price_estimation/` — Post-merge step: adds "Estimated current price" and "Est. price per sqm" columns to `properties.parquet`. Uses repeat-sales price index + kNN spatial blending. Requires `price_index.parquet` (built by `price_estimation/index.py`). Run via `make -f Makefile.data prepare` (the `merge` target alone skips this).
 - `transform_poi.py` — Filters POIs, maps to friendly names + emoji (exhaustive category validation)
 - `poi_proximity.py` — Counts POIs within 2km per postcode using 0.05° spatial grid
 - `crime.py` — Aggregates crime CSVs into yearly averages by LSOA

-**Critical: column renaming in `merge.py`** — The pipeline renames columns from snake_case to human-readable names before writing `wide.parquet`. The Rust server and frontend use **only** these human-readable names — there are no fallbacks to snake_case. Key renames:
+**Critical: column renaming in `merge.py`** — The pipeline renames columns from snake_case to human-readable names before writing `properties.parquet`. The Rust server and frontend use **only** these human-readable names — there are no fallbacks to snake_case. Key renames:
 - `pp_address` → `Address per Property Register`
 - `postcode` → `Postcode`
 - `latest_price` → `Last known price`
-- `duration` → `Leashold/Freehold`
+- `duration` → `Leasehold/Freehold`
 - `total_floor_area` → `Total floor area (sqm)`
 - `current_energy_rating` → `Current energy rating`
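
The rename contract in the hunk above can be illustrated with a minimal Python sketch. It is not the code in `merge.py`: the mapping holds only the six renames listed in this hunk, and `rename_columns` and `require_columns` are hypothetical helper names chosen for illustration. The second helper mimics the "no fallback to snake_case" rule by failing when a human-readable name is missing.

```python
# Only the renames listed in the diff above; the real mapping is likely longer.
RENAMES = {
    "pp_address": "Address per Property Register",
    "postcode": "Postcode",
    "latest_price": "Last known price",
    "duration": "Leasehold/Freehold",
    "total_floor_area": "Total floor area (sqm)",
    "current_energy_rating": "Current energy rating",
}


def rename_columns(columns):
    """Map snake_case names to human-readable ones; unknown names pass through."""
    return [RENAMES.get(name, name) for name in columns]


def require_columns(columns, required):
    """Hypothetical strict check: raise if any required name is absent.

    There is deliberately no attempt to look up a snake_case alias.
    """
    missing = [name for name in required if name not in columns]
    if missing:
        raise KeyError(f"missing required columns: {missing}")


if __name__ == "__main__":
    renamed = rename_columns(["postcode", "duration", "latest_price"])
    require_columns(renamed, ["Postcode", "Leasehold/Freehold"])  # passes
    print(renamed)
```

A consumer that still asks for `postcode` after the rename fails immediately, which is the behaviour the typo fix from `Leashold/Freehold` to `Leasehold/Freehold` depends on: every reader of the file must use the exact same spelling.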
@@ -321,7 +324,7 @@ Follow these conventions in all Rust code:
 - **POI transform validation**: Fails if any OSM category is unmapped — guarantees exhaustive coverage
 - **Fuzzy join**: Groups by postcode, uses `thefuzz.token_sort_ratio` with numeric token compatibility, greedy assignment from highest score
 - **Filter parsing is strict**: `parse_filters()` returns `Result` — malformed entries, unknown feature names, and unparseable numbers all return 400 Bad Request. No silent skipping of invalid filters.
-- **Data loading is strict**: `extract_string_col` and `lookup_enum_value` take a single column name (no fallback names). H3 precomputation panics on invalid coordinates. Required parquet columns must exist at startup.
+- **Data loading is strict**: `extract_string_col` and `lookup_enum_value` take a single column name (no fallback names). H3 precomputation panics on invalid coordinates. All configured features (defined in `features.rs`) must exist in at least one data source — the server panics at startup if any are missing (no NaN placeholders). This means all pipeline steps must be complete before starting the server. Polars `diagonal: true` concat fills nulls for features that exist in some but not all sources (e.g. "Listing date" from listings only).
 - **Travel time is strict**: `mode` param is required (400) when `destination` is set — no silent default to "car". R5 failures return 502 Bad Gateway, not silent omission. `r5_url` is `Option<String>` — returns 503 if travel time requested without R5 configured.
 - **Filter bounds format**: `south,west,north,east` (not standard bbox order)
 - **Server-side AABB filtering**: Both `/api/hexagons` and `/api/postcodes` filter results by bounding-box intersection with query bounds. Hexagons use `h3_cell_bounds()` (h3o returns degrees, not radians). Postcodes compute polygon AABB from vertices. See `bounds_intersect()` in `parsing/bounds.rs`.
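
The last two bullets can be sketched in Python as a small illustration of the `south,west,north,east` ordering and an axis-aligned bounding-box test. This is not the Rust code in `parsing/bounds.rs`. `Bounds`, `parse_bounds`, and `aabb_intersects` are hypothetical names. The sketch assumes that malformed input is rejected rather than skipped, that touching boxes count as intersecting, and that no box crosses the antimeridian.

```python
from typing import NamedTuple


class Bounds(NamedTuple):
    # Field order matches the query format: south,west,north,east.
    south: float
    west: float
    north: float
    east: float


def parse_bounds(raw: str) -> Bounds:
    """Parse 'south,west,north,east'; raise ValueError on anything malformed."""
    parts = raw.split(",")
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated numbers, got {len(parts)}")
    # float() itself raises ValueError for unparseable numbers.
    south, west, north, east = (float(p) for p in parts)
    if south > north or west > east:
        raise ValueError("bounds must satisfy south <= north and west <= east")
    return Bounds(south, west, north, east)


def aabb_intersects(a: Bounds, b: Bounds) -> bool:
    """True if the two boxes overlap or touch on both axes."""
    return (
        a.south <= b.north
        and b.south <= a.north
        and a.west <= b.east
        and b.west <= a.east
    )


if __name__ == "__main__":
    view = parse_bounds("51.40,-0.30,51.60,0.10")
    cell = Bounds(51.55, 0.05, 51.65, 0.15)  # e.g. a hexagon's AABB in degrees
    print(aabb_intersects(view, cell))  # True
```

Because the order differs from the common `west,south,east,north` bbox convention, a client that sends the standard order would swap latitude and longitude without any error if the values happen to pass the ordering check, so naming the fields explicitly, as `Bounds` does here, helps catch that mistake early.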