Update
This commit is contained in:
parent
cf7449e38b
commit
0b0a655599
1 changed files with 97 additions and 44 deletions
141
CLAUDE.md
|
|
@ -4,74 +4,127 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
|
||||||
|
|
||||||
## Project Overview
|
## Project Overview
|
||||||
|
|
||||||
Property Map is a full-stack geospatial application for visualizing UK property data on an interactive map. It combines Land Registry price-paid data, EPC energy certificates, postcode geolocation, TFL journey times, Index of Deprivation scores, and OpenStreetMap POIs into a single wide parquet file, then serves aggregated H3 hexagon statistics and POI data via a Rust backend.
|
Property Map is a full-stack geospatial application for visualizing UK property data on an interactive map. It combines Land Registry price-paid data, EPC energy certificates, postcode geolocation, TFL journey times, Index of Deprivation scores, crime statistics, ethnicity data, broadband speeds, school ratings, road noise, and OpenStreetMap POIs into a single wide parquet file, then serves aggregated H3 hexagon statistics and POI data via a Rust backend.
|
||||||
|
|
||||||
## Commands
|
## Commands
|
||||||
|
|
||||||
All commands use [Task](https://taskfile.dev) runner.
|
All commands use [Task](https://taskfile.dev) runner. Python uses `uv run`. Frontend uses `npm run` from `frontend/`.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
task prepare # Full setup: install deps, download data (~GB), run pipeline
|
# Development servers
|
||||||
task server # Rust backend on :8001 (cargo run --release)
|
task dev:server # Rust backend on :8001 (cargo run --release)
|
||||||
task frontend # Webpack dev server on :3030 (proxies /api to :8001)
|
task dev:frontend # Webpack dev server on :3030 (proxies /api to :8001)
|
||||||
|
|
||||||
task lint # Lint Python (ruff) + TypeScript (ESLint + Prettier)
|
# Data pipeline
|
||||||
task format # Auto-fix formatting (ruff + ESLint + Prettier)
|
task prepare # Build wide.parquet from all pre-downloaded sources
|
||||||
task typecheck # TypeScript type checking
|
|
||||||
task check # All checks (lint + typecheck + build)
|
# Quality
|
||||||
task test # Run Python tests (fuzzy join)
|
task lint # Lint all: Python (ruff) + TypeScript (ESLint+Prettier) + Rust (clippy+fmt)
|
||||||
task build # Build frontend for production
|
task format # Auto-fix formatting for all languages
|
||||||
|
task test # Python tests (fuzzy join, haversine, POI counts)
|
||||||
|
task check # Full validation: lint + build + test
|
||||||
|
|
||||||
|
# Building
|
||||||
|
task build:frontend # TypeScript typecheck + webpack production build
|
||||||
|
task build:server # cargo build --release (NOTE: the `dir` set in the Taskfile is wrong; run from server-rs/)
|
||||||
|
|
||||||
|
# Granular lint/format
|
||||||
|
task lint:python # uv run ruff check .
|
||||||
|
task lint:frontend # eslint + prettier --check
|
||||||
|
task lint:rust # cargo clippy -- -D warnings && cargo fmt --check
|
||||||
|
task format:python # ruff check --fix && ruff format
|
||||||
|
task format:frontend # eslint --fix + prettier --write
|
||||||
|
task format:rust # cargo fmt --all
|
||||||
```
|
```
|
||||||
|
|
||||||
Python commands use `uv run`. Frontend commands use `npm run` from `frontend/`.
|
Running individual tests:
|
||||||
|
```bash
|
||||||
|
uv run pytest pipeline/utils/test_haversine.py # Single test file
|
||||||
|
uv run pytest pipeline/utils/test_haversine.py -k "test_name" # Single test
|
||||||
|
```
|
||||||
|
|
||||||
## Architecture
|
## Architecture
|
||||||
|
|
||||||
|
### Data Flow
|
||||||
|
|
||||||
|
```
|
||||||
|
Raw sources → [Download scripts] → data/*.parquet
|
||||||
|
→ [Fuzzy join EPC ↔ Price-Paid] → epc_pp.parquet
|
||||||
|
→ [Merge all datasets] → wide.parquet
|
||||||
|
→ [Rust server loads into memory + precomputes H3 + spatial grid]
|
||||||
|
→ [Frontend renders deck.gl H3HexagonLayer over MapLibre GL]
|
||||||
|
```
|
||||||
|
|
||||||
### Data Pipeline (`pipeline/`)
|
### Data Pipeline (`pipeline/`)
|
||||||
|
|
||||||
Python + Polars. Orchestrated by `pipeline/run.py` which builds `data_sources/processed/wide.parquet`:
|
Python + Polars. Two phases:
|
||||||
|
|
||||||
1. **Download** (`pipeline/download/`) — Fetches raw data into `data_sources/`:
|
1. **Download** (`pipeline/download/`) — Each script fetches one raw dataset into `data/`
|
||||||
- `arcgis.py` — Postcode → lat/lon/LSOA mappings
|
2. **Transform** (`pipeline/transform/`) — Joins and derives features:
|
||||||
- `price_paid.py` — Land Registry price-paid records
|
- `join_epc_pp.py` — Fuzzy-joins EPC ↔ price-paid by address within postcode buckets
|
||||||
- `pois/` — OpenStreetMap POIs via osmium (PBF parsing)
|
- `merge.py` — **Main pipeline**: joins all datasets → `wide.parquet` with human-readable column names
|
||||||
- `deprivation_data.py` — English Indices of Deprivation 2025
|
- `transform_poi.py` — Filters POIs, maps to friendly names + emoji (exhaustive category validation)
|
||||||
2. **Join** (`pipeline/epc_pp.py`) — Fuzzy-joins EPC certificates with price-paid by address within postcode buckets → `epc_pp.parquet`
|
- `poi_proximity.py` — Counts POIs within 2km per postcode using 0.05° spatial grid
|
||||||
3. **Widen** (`pipeline/run.py`) — Joins epc_pp with GPS coords, journey times, IoD scores, POI proximity counts, derives `price_per_sqm` and numeric `construction_age_band`
|
- `crime.py` — Aggregates crime CSVs into yearly averages by LSOA
|
||||||
4. **Transform POIs** (`pipeline/download/pois/transform.py`) — Drops unwanted categories, remaps to friendly names + emoji → `filtered_uk_pois.parquet`
|
|
||||||
|
|
||||||
Shared utilities live in `pipeline/utils/` (haversine distance for both numpy and Polars expressions, fuzzy address matching).
|
**Critical: column renaming in `merge.py`** — The pipeline renames columns from snake_case to human-readable names before writing `wide.parquet`. The Rust server auto-discovers features from whatever column names exist in the parquet. Key renames:
|
||||||
|
- `pp_address` → `Address per Property Register`
|
||||||
|
- `postcode` → `Postcode`
|
||||||
|
- `latest_price` → `Last known price`
|
||||||
|
- `duration` → `Leashold/Freehold`
|
||||||
|
- `total_floor_area` → `Total floor area (sqm)`
|
||||||
|
- `current_energy_rating` → `Current energy rating`
|
||||||
|
|
||||||
|
The server and frontend must handle these human-readable names. See the full rename map in `merge.py`.
|
||||||
|
|
||||||
### Backend (`server-rs/`)
|
### Backend (`server-rs/`)
|
||||||
|
|
||||||
Rust + Axum. Loads `wide.parquet` and `filtered_uk_pois.parquet` into memory at startup with precomputed H3 indices (resolutions 7–11) and grid-based spatial indices (0.01° cells).
|
Rust + Axum. Loads parquet into memory at startup.
|
||||||
|
|
||||||
|
**Structure:**
|
||||||
|
- `data/property.rs` — Loads `wide.parquet`, auto-discovers numeric + enum features, computes histograms, sorts rows by spatial locality, precomputes H3 cells (resolutions 4–12)
|
||||||
|
- `data/poi.rs` — Loads `filtered_uk_pois.parquet`
|
||||||
|
- `index.rs` — `GridIndex`: 0.01° spatial grid for O(1) cell lookup
|
||||||
|
- `filter.rs` — Parses filter strings and checks rows. Format: `name:min:max` (numeric), `name:val1|val2` (enum)
|
||||||
|
- `routes/` — One file per endpoint
|
||||||
|
- `consts.rs` — Key constants (histogram bins, H3 range, max enum cardinality, excluded columns)
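
The filter string format handled by `filter.rs` can be illustrated with a short Python sketch. This is not the Rust implementation: `parse_filter`, `row_matches`, and the two dataclasses are hypothetical names, and the sketch assumes that feature names contain no `:`, that numeric bounds are inclusive, and that a null value never passes a filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumericFilter:
    name: str
    lo: float
    hi: float


@dataclass(frozen=True)
class EnumFilter:
    name: str
    allowed: frozenset


def parse_filter(spec):
    """Parse one filter string. Assumes feature names contain no ':'."""
    parts = spec.split(":")
    if len(parts) == 3:  # name:min:max (numeric)
        name, lo, hi = parts
        return NumericFilter(name, float(lo), float(hi))
    if len(parts) == 2:  # name:val1|val2 (enum)
        name, values = parts
        return EnumFilter(name, frozenset(values.split("|")))
    raise ValueError(f"unrecognised filter: {spec!r}")


def row_matches(row, filters):
    """Return True if row (a dict of feature name -> value) passes every filter."""
    for f in filters:
        value = row.get(f.name)
        if value is None:
            return False  # assumption: null values never match
        if isinstance(f, NumericFilter):
            if not (f.lo <= value <= f.hi):  # assumption: inclusive bounds
                return False
        elif value not in f.allowed:
            return False
    return True
```

For example, `parse_filter("Last known price:100000:500000")` yields a numeric range filter, while `parse_filter("Current energy rating:A|B")` yields an enum filter that accepts `A` or `B`.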
|
||||||
|
|
||||||
**API endpoints:**
|
**API endpoints:**
|
||||||
- `GET /api/features` — Numeric column metadata with histograms and percentiles
|
- `GET /api/features` — Feature metadata with histograms and 2nd/98th percentiles
|
||||||
- `GET /api/hexagons` — H3 aggregates filtered by bounds, resolution, and feature min/max
|
- `GET /api/hexagons?resolution=&bounds=&filters=` — H3 aggregates (min/max per feature per hex)
|
||||||
- `GET /api/pois` — POIs by bounds with optional category filter (max 5000)
|
- `GET /api/hexagon-properties?h3=&resolution=&filters=&limit=&offset=` — Paginated properties within a hexagon
|
||||||
- `GET /api/poi-categories` — Available POI categories
|
- `GET /api/pois?bounds=&categories=` — POIs by bounds (max 5000)
|
||||||
|
- `GET /api/poi-categories` — Available POI category names
|
||||||
|
|
||||||
Also serves `frontend/dist/` as static fallback.
|
Serves `frontend/dist/` as static fallback in production.
|
||||||
|
|
||||||
|
**Data representation:**
|
||||||
|
- Numeric features: row-major flat `Vec<f64>`, NaN = null
|
||||||
|
- Enum features: `Vec<u8>` indices into value list, 255 = null
|
||||||
|
- String fields (address, postcode): `Vec<String>`, empty = null
|
||||||
|
- The server accepts the parquet path as a CLI argument (defaults to `data_sources/processed/wide.parquet`)
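
As a rough illustration of the representation above, the following Python sketch mimics how a value might be read back from the flat buffers. The real server is Rust; `numeric_value` and `enum_value` are hypothetical helper names, and the sketch assumes one index list per enum feature.

```python
import math

ENUM_NULL = 255  # sentinel index for a missing enum value, as described above


def numeric_value(feature_data, num_features, row, feat_idx):
    """Read one numeric feature from a row-major flat buffer; NaN means null."""
    value = feature_data[row * num_features + feat_idx]
    return None if math.isnan(value) else value


def enum_value(indices, values, row):
    """Resolve an enum index to its label; index 255 means null."""
    idx = indices[row]
    return None if idx == ENUM_NULL else values[idx]
```

With two features per row, `numeric_value(data, 2, 1, 0)` reads element `1 * 2 + 0` of the buffer, so all features of one property sit next to each other in memory.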
|
||||||
|
|
||||||
### Frontend (`frontend/`)
|
### Frontend (`frontend/`)
|
||||||
|
|
||||||
React 18 + TypeScript SPA. deck.gl `H3HexagonLayer` over MapLibre GL basemap. Debounces API calls (150ms) on viewport changes. TailwindCSS for styling.
|
React 18 + TypeScript. deck.gl `H3HexagonLayer` over MapLibre GL. TailwindCSS. No state management library — pure React hooks.
|
||||||
|
|
||||||
|
**Key patterns:**
|
||||||
|
- `App.tsx` manages all state, API fetching (150ms debounce), and URL state sync (300ms debounce)
|
||||||
|
- URL encodes view/filters/POI categories/active tab as query params for shareable links
|
||||||
|
- AbortControllers cancel in-flight requests on new queries
|
||||||
|
- Zoom → H3 resolution: `<7→7, <9.5→8, <11→9, <13→10, ≥13→11`
|
||||||
|
- Bounds quantized to 0.01° to match backend caching
|
||||||
|
- Properties pane uses feature names from API response (human-readable), not hardcoded field names
|
||||||
|
- Proxy: dev server on :3030 proxies `/api` to :8001; also handles VS Code `/proxy/PORT` patterns
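
The zoom thresholds and bounds quantization listed above can be sketched as follows. The frontend itself is TypeScript; Python is used here only for illustration, and the function names are hypothetical. The source does not say in which direction bounds are snapped, so snapping outward (so the quantized box always contains the viewport) is an assumption.

```python
import math

CELLS_PER_DEGREE = 100  # 0.01 degree quantization step


def zoom_to_h3_resolution(zoom):
    """Map a map zoom level to an H3 resolution using the thresholds above."""
    if zoom < 7:
        return 7
    if zoom < 9.5:
        return 8
    if zoom < 11:
        return 9
    if zoom < 13:
        return 10
    return 11


def quantize_bounds(south, west, north, east):
    """Snap bounds outward to the 0.01 degree grid (assumed direction)."""
    return (
        math.floor(south * CELLS_PER_DEGREE) / CELLS_PER_DEGREE,
        math.floor(west * CELLS_PER_DEGREE) / CELLS_PER_DEGREE,
        math.ceil(north * CELLS_PER_DEGREE) / CELLS_PER_DEGREE,
        math.ceil(east * CELLS_PER_DEGREE) / CELLS_PER_DEGREE,
    )
```

Quantizing means that small pans within the same 0.01° cell produce identical request parameters, which is what makes cache hits possible.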
|
||||||
|
|
||||||
## Key Implementation Details
|
## Key Implementation Details
|
||||||
|
|
||||||
- Bounds quantized to 0.01° to improve cache hits on both backend and frontend
|
- **Spatial sort**: Rows sorted by 0.01° grid cell at load time for cache-friendly sequential access
|
||||||
- H3 hexagon results capped at 50,000 per request (truncated flag in response)
|
- **Row-major layout**: `feature_data[row * num_features + feat_idx]` — all features for one property are contiguous
|
||||||
- POI proximity counting uses a spatial grid (0.05° cells, ~5km) to avoid O(n×m) distance checks
|
- **H3 precomputation**: Resolutions 4–12 computed in parallel (rayon) at startup
|
||||||
- Fuzzy address matching uses `thefuzz.token_sort_ratio` with numeric token compatibility checks, parallelized across postcode buckets
|
- **Histogram percentiles without sorting**: O(n) two-pass algorithm — build histogram, interpolate percentiles
|
||||||
- The Rust server writes JSON via direct string buffer (avoids serde_json::Value allocations)
|
- **Direct JSON writing**: Hexagon endpoint writes JSON via string buffer, avoids serde_json::Value allocations
|
||||||
- POI transform validates exhaustive category coverage — pipeline fails if any OSM category is unmapped
|
- **POI transform validation**: Fails if any OSM category is unmapped — guarantees exhaustive coverage
|
||||||
|
- **Fuzzy join**: Groups by postcode, uses `thefuzz.token_sort_ratio` with numeric token compatibility, greedy assignment from highest score
|
||||||
## Data Sources
|
- **Filter bounds format**: `south,west,north,east` (not the conventional `west,south,east,north` bbox order)
|
||||||
|
- **POI proximity**: Uses 0.05° grid (~5km cells) to reduce candidates before haversine distance check
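
The grid-then-haversine idea behind POI proximity counting can be sketched in plain Python. This is a simplified stand-in for the Polars pipeline code, not a copy of it: all function names are hypothetical, POIs are plain `(lat, lon)` tuples, and the 3×3 cell neighbourhood is only sufficient while a 0.05° cell is wider than the 2 km radius, which holds at UK latitudes.

```python
import math
from collections import defaultdict

CELL_DEG = 0.05          # grid cell size in degrees
RADIUS_KM = 2.0          # search radius
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def cell_of(lat, lon):
    """Integer grid cell containing the point."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def build_grid(pois):
    """Bucket POIs by grid cell so lookups only touch nearby candidates."""
    grid = defaultdict(list)
    for lat, lon in pois:
        grid[cell_of(lat, lon)].append((lat, lon))
    return grid


def count_within(grid, lat, lon, radius_km=RADIUS_KM):
    """Count POIs within radius_km, checking only the 3x3 neighbouring cells."""
    ci, cj = cell_of(lat, lon)
    count = 0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            for plat, plon in grid.get((ci + di, cj + dj), ()):
                if haversine_km(lat, lon, plat, plon) <= radius_km:
                    count += 1
    return count
```

Each postcode is compared only against POIs in nine cells rather than against every POI, which is what avoids the O(n×m) distance checks.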
|
||||||
- **Land Registry** — Price Paid bulk download
|
|
||||||
- **EPC** — Energy Performance Certificates (domestic)
|
|
||||||
- **ArcGIS** — Postcode → GPS/LSOA lookup
|
|
||||||
- **OpenStreetMap** — POIs from Geofabrik Great Britain PBF
|
|
||||||
- **IoD 2025** — English Indices of Deprivation (LSOA-level scores)
|
|
||||||
- **TFL API** — Journey time calculations to configurable destinations
|
|
||||||
|
|
|
||||||