perfect-postcode/CLAUDE.md
2026-02-01 08:49:44 +00:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Property Map is a full-stack geospatial application for visualizing UK property data on an interactive map. It combines Land Registry price-paid data, EPC energy certificates, postcode geolocation, TFL journey times, Index of Deprivation scores, crime statistics, ethnicity data, broadband speeds, school ratings, road noise, and OpenStreetMap POIs into a single wide parquet file, then serves aggregated H3 hexagon statistics and POI data via a Rust backend.

Commands

All commands run through the Task runner (task). Python commands use uv run; frontend commands use npm run from frontend/.

# Development servers
task dev:server           # Rust backend on :8001 (cargo run --release)
task dev:frontend         # Webpack dev server on :3030 (proxies /api to :8001)

# Data pipeline
task prepare              # Build wide.parquet from all pre-downloaded sources

# Quality
task lint                 # Lint all: Python (ruff) + TypeScript (ESLint+Prettier) + Rust (clippy+fmt)
task format               # Auto-fix formatting for all languages
task test                 # Python tests (fuzzy join, haversine, POI counts)
task check                # Full validation: lint + build + test

# Building
task build:frontend       # TypeScript typecheck + webpack production build
task build:server         # cargo build --release (NOTE: the Taskfile's dir is wrong; run from server-rs/)

# Granular lint/format
task lint:python          # uv run ruff check .
task lint:frontend        # eslint + prettier --check
task lint:rust            # cargo clippy -- -D warnings && cargo fmt --check
task format:python        # ruff check --fix && ruff format
task format:frontend      # eslint --fix + prettier --write
task format:rust          # cargo fmt --all

Running individual tests:

uv run pytest pipeline/utils/test_haversine.py       # Single test file
uv run pytest pipeline/utils/test_haversine.py -k "test_name"  # Single test

Architecture

Data Flow

Raw sources → [Download scripts] → data/*.parquet
  → [Fuzzy join EPC ↔ Price-Paid] → epc_pp.parquet
  → [Merge all datasets] → wide.parquet
  → [Rust server loads into memory + precomputes H3 + spatial grid]
  → [Frontend renders deck.gl H3HexagonLayer over MapLibre GL]

Data Pipeline (pipeline/)

Python + Polars. Two phases:

  1. Download (pipeline/download/) — Each script fetches one raw dataset into data/
  2. Transform (pipeline/transform/) — Joins and derives features:
    • join_epc_pp.py — Fuzzy-joins EPC ↔ price-paid by address within postcode buckets
    • merge.py: the main pipeline; joins all datasets → wide.parquet with human-readable column names
    • transform_poi.py — Filters POIs, maps to friendly names + emoji (exhaustive category validation)
    • poi_proximity.py — Counts POIs within 2km per postcode using 0.05° spatial grid
    • crime.py — Aggregates crime CSVs into yearly averages by LSOA
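
The fuzzy join in join_epc_pp.py can be pictured with the sketch below. It is not the pipeline's code: difflib stands in for thefuzz.token_sort_ratio so the example has no third-party dependency, and the function names, the 80-point threshold, and the reading of "numeric token compatibility" as "identical sets of numbers" are all assumptions.

```python
import re
from difflib import SequenceMatcher


def token_sort_score(a: str, b: str) -> float:
    """Similarity (0-100) of two addresses after sorting their tokens.

    Stand-in for thefuzz.token_sort_ratio, built on difflib.
    """
    norm_a = " ".join(sorted(re.findall(r"\w+", a.lower())))
    norm_b = " ".join(sorted(re.findall(r"\w+", b.lower())))
    return 100.0 * SequenceMatcher(None, norm_a, norm_b).ratio()


def numeric_tokens(address: str) -> set[str]:
    """House and flat numbers in an address, e.g. {'14', '3'}."""
    return set(re.findall(r"\d+", address))


def greedy_match(epc: list[str], pp: list[str], min_score: float = 80.0) -> dict[int, int]:
    """Match EPC addresses to price-paid addresses within one postcode.

    Returns {epc_index: pp_index}. Pairs whose numeric tokens differ are
    never matched (assumed meaning of "numeric token compatibility").
    """
    candidates = []
    for i, a in enumerate(epc):
        for j, b in enumerate(pp):
            if numeric_tokens(a) != numeric_tokens(b):
                continue
            score = token_sort_score(a, b)
            if score >= min_score:
                candidates.append((score, i, j))

    # Greedy assignment: highest-scoring pairs first, each row used at most once.
    candidates.sort(reverse=True)
    matches: dict[int, int] = {}
    used_pp: set[int] = set()
    for _score, i, j in candidates:
        if i not in matches and j not in used_pp:
            matches[i] = j
            used_pp.add(j)
    return matches
```

Calling greedy_match once per postcode bucket keeps the pairwise comparison small, which is the point of grouping by postcode before scoring.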

Critical: column renaming in merge.py — The pipeline renames columns from snake_case to human-readable names before writing wide.parquet. The Rust server auto-discovers features from whatever column names exist in the parquet. Key renames:

  • pp_address → Address per Property Register
  • postcode → Postcode
  • latest_price → Last known price
  • duration → Leasehold/Freehold
  • total_floor_area → Total floor area (sqm)
  • current_energy_rating → Current energy rating

The server and frontend must handle these human-readable names. See the full rename map in merge.py.
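
As a rough illustration of the rename step, the sketch below applies a subset of the map to a list of column names. RENAMES and rename_columns are hypothetical names, and leaving unmapped columns unchanged is an assumption; merge.py remains the source of truth. With a Polars DataFrame the equivalent call would be df.rename(RENAMES).

```python
# Subset of the rename map quoted above; the full map lives in merge.py.
RENAMES = {
    "pp_address": "Address per Property Register",
    "postcode": "Postcode",
    "latest_price": "Last known price",
    "total_floor_area": "Total floor area (sqm)",
    "current_energy_rating": "Current energy rating",
}


def rename_columns(columns: list[str], renames: dict[str, str]) -> list[str]:
    """Apply the rename map, leaving unmapped columns unchanged (assumption)."""
    renamed = [renames.get(name, name) for name in columns]
    # Guard against two source columns collapsing onto one display name.
    if len(set(renamed)) != len(renamed):
        raise ValueError("rename map produces duplicate column names")
    return renamed
```

Because the server discovers features from whatever names the parquet contains, any change to this map changes the feature names the API exposes.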

Backend (server-rs/)

Rust + Axum. Loads parquet into memory at startup.

Structure:

  • data/property.rs — Loads wide.parquet, auto-discovers numeric + enum features, computes histograms, sorts rows by spatial locality, precomputes H3 cells (resolutions 4–12)
  • data/poi.rs — Loads filtered_uk_pois.parquet
  • index.rs: GridIndex, a 0.01° spatial grid for O(1) cell lookup
  • filter.rs — Parses filter strings and checks rows. Format: name:min:max (numeric), name:val1|val2 (enum)
  • routes/ — One file per endpoint
  • consts.rs — Key constants (histogram bins, H3 range, max enum cardinality, excluded columns)
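
The filter string format handled by filter.rs can be sketched in Python as follows. This is an illustration, not a port of the Rust code: the class names are hypothetical, and it assumes feature names contain no colons, numeric bounds are inclusive, and null (NaN) values never match.

```python
import math
from dataclasses import dataclass


@dataclass
class NumericFilter:
    name: str
    lo: float
    hi: float

    def matches(self, value: float) -> bool:
        # NaN marks a null; treating nulls as non-matching is an assumption.
        return not math.isnan(value) and self.lo <= value <= self.hi


@dataclass
class EnumFilter:
    name: str
    allowed: frozenset

    def matches(self, value: str) -> bool:
        return value in self.allowed


def parse_filter(text: str):
    """Parse 'name:min:max' (numeric) or 'name:val1|val2' (enum)."""
    parts = text.split(":")
    if len(parts) == 3:
        name, lo, hi = parts
        return NumericFilter(name, float(lo), float(hi))
    if len(parts) == 2:
        name, values = parts
        return EnumFilter(name, frozenset(values.split("|")))
    raise ValueError(f"unrecognised filter: {text!r}")
```

For example, parse_filter("Last known price:100000:500000") yields a numeric range filter, while parse_filter("Current energy rating:A|B") yields an enum filter accepting A or B.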

API endpoints:

  • GET /api/features — Feature metadata with histograms and 2nd/98th percentiles
  • GET /api/hexagons?resolution=&bounds=&filters= — H3 aggregates (min/max per feature per hex)
  • GET /api/hexagon-properties?h3=&resolution=&filters=&limit=&offset= — Paginated properties within a hexagon
  • GET /api/pois?bounds=&categories= — POIs by bounds (max 5000)
  • GET /api/poi-categories — Available POI category names
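
The 2nd/98th percentiles reported by /api/features can be estimated from a histogram without sorting the data. The Python sketch below shows one way to do that in two passes; the function names are hypothetical, the real bin count is defined in consts.rs, and the Rust implementation may differ in its details.

```python
import math


def build_histogram(values, bins):
    """Two passes over the data: find the range, then count per bin."""
    lo, hi = math.inf, -math.inf
    for v in values:  # pass 1: range, skipping NaN nulls
        if not math.isnan(v):
            lo, hi = min(lo, v), max(hi, v)
    counts = [0] * bins
    if lo > hi:  # no non-null values at all
        return lo, hi, counts
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal
    for v in values:  # pass 2: bin counts
        if not math.isnan(v):
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
    return lo, hi, counts


def percentile(lo, hi, counts, p):
    """Estimate the p-th percentile (0-100) by interpolating inside a bin."""
    total = sum(counts)
    if total == 0:
        return math.nan
    target = total * p / 100.0
    width = (hi - lo) / len(counts)
    seen = 0
    for i, c in enumerate(counts):
        if c and seen + c >= target:
            frac = (target - seen) / c  # position of the target within this bin
            return lo + (i + frac) * width
        seen += c
    return hi
```

The estimate is accurate to within one bin width, which is usually enough for choosing colour-scale limits.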

Serves frontend/dist/ as static fallback in production.

Data representation:

  • Numeric features: row-major flat Vec<f64>, NaN = null
  • Enum features: Vec<u8> indices into value list, 255 = null
  • String fields (address, postcode): Vec<String>, empty = null
  • The server accepts the parquet path as a CLI argument (defaults to data_sources/processed/wide.parquet)
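
The layout above can be mirrored in a few lines of Python to make the null sentinels concrete. PropertyTable is a hypothetical name, and the sketch carries a single enum feature for brevity; the Rust server stores the same ideas in Vec<f64> and Vec<u8>.

```python
import math

ENUM_NULL = 255  # sentinel index for a missing enum value


class PropertyTable:
    """Toy version of the server's in-memory layout (one enum feature only)."""

    def __init__(self, feature_names, enum_values):
        self.num_features = len(feature_names)
        self.feature_data = []          # flat, row-major list of floats
        self.enum_values = enum_values  # e.g. ["A", "B", "C"]
        self.enum_data = []             # one index per row, 255 = null

    def push_row(self, numeric, enum_value):
        assert len(numeric) == self.num_features
        self.feature_data.extend(math.nan if v is None else float(v) for v in numeric)
        self.enum_data.append(
            ENUM_NULL if enum_value is None else self.enum_values.index(enum_value)
        )

    def numeric(self, row, feat_idx):
        # All features of one row are contiguous in the flat list.
        value = self.feature_data[row * self.num_features + feat_idx]
        return None if math.isnan(value) else value

    def enum(self, row):
        idx = self.enum_data[row]
        return None if idx == ENUM_NULL else self.enum_values[idx]
```

Using 255 as the null index leaves at most 255 distinct values per enum feature, which is why a maximum enum cardinality constant is needed.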

Frontend (frontend/)

React 18 + TypeScript. deck.gl H3HexagonLayer over MapLibre GL. TailwindCSS. No state management library — pure React hooks.

Key patterns:

  • App.tsx manages all state, API fetching (150ms debounce), and URL state sync (300ms debounce)
  • URL encodes view/filters/POI categories/active tab as query params for shareable links
  • AbortControllers cancel in-flight requests on new queries
  • Zoom → H3 resolution: <7→7, <9.5→8, <11→9, <13→10, ≥13→11
  • Bounds quantized to 0.01° to match backend caching
  • Properties pane uses feature names from API response (human-readable), not hardcoded field names
  • Proxy: dev server on :3030 proxies /api to :8001; also handles VS Code /proxy/PORT patterns
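
The zoom mapping and bounds quantization combine into a hexagon request roughly as sketched below. The real logic lives in App.tsx in TypeScript; this Python version uses hypothetical function names, and snapping the bounds outward (floor for south/west, ceil for north/east) is an assumption, since the source only states the 0.01° step. The filters parameter is omitted.

```python
import math
from urllib.parse import urlencode


def zoom_to_resolution(zoom: float) -> int:
    """Map a map zoom level to an H3 resolution using the thresholds above."""
    if zoom < 7:
        return 7
    if zoom < 9.5:
        return 8
    if zoom < 11:
        return 9
    if zoom < 13:
        return 10
    return 11


def quantize_bounds(south, west, north, east, steps_per_degree=100):
    """Snap bounds to a 0.01° grid, expanding outward (assumption)."""
    return (
        math.floor(south * steps_per_degree) / steps_per_degree,
        math.floor(west * steps_per_degree) / steps_per_degree,
        math.ceil(north * steps_per_degree) / steps_per_degree,
        math.ceil(east * steps_per_degree) / steps_per_degree,
    )


def hexagons_url(zoom, south, west, north, east):
    """Build the /api/hexagons path; bounds order is south,west,north,east."""
    bounds = quantize_bounds(south, west, north, east)
    query = {
        "resolution": zoom_to_resolution(zoom),
        "bounds": ",".join(f"{v:.2f}" for v in bounds),
    }
    return "/api/hexagons?" + urlencode(query, safe=",")
```

Quantizing before building the URL means that small pans produce identical request strings, so repeated requests can hit the same cache entry.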

Key Implementation Details

  • Spatial sort: Rows sorted by 0.01° grid cell at load time for cache-friendly sequential access
  • Row-major layout: feature_data[row * num_features + feat_idx] — all features for one property are contiguous
  • H3 precomputation: Resolutions 4–12 computed in parallel (rayon) at startup
  • Histogram percentiles without sorting: O(n) two-pass algorithm — build histogram, interpolate percentiles
  • Direct JSON writing: Hexagon endpoint writes JSON via string buffer, avoids serde_json::Value allocations
  • POI transform validation: Fails if any OSM category is unmapped — guarantees exhaustive coverage
  • Fuzzy join: Groups by postcode, uses thefuzz.token_sort_ratio with numeric token compatibility, greedy assignment from highest score
  • Filter bounds format: south,west,north,east (not standard bbox order)
  • POI proximity: Uses 0.05° grid (~5km cells) to reduce candidates before haversine distance check
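
The POI proximity approach (coarse grid first, haversine second) can be sketched as follows. This is a pure-Python illustration with hypothetical names, not the Polars code in poi_proximity.py; the Earth radius constant and the 3×3 neighbourhood search are assumptions that hold for a 2 km radius at UK latitudes.

```python
import math
from collections import defaultdict

CELL_DEG = 0.05          # grid cell size in degrees
RADIUS_KM = 2.0          # search radius
EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def cell_of(lat, lon):
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def build_grid(pois):
    """Bucket (lat, lon) POIs by grid cell."""
    grid = defaultdict(list)
    for lat, lon in pois:
        grid[cell_of(lat, lon)].append((lat, lon))
    return grid


def count_nearby(grid, lat, lon):
    """Count POIs within RADIUS_KM, checking only the 3x3 block of cells.

    A 0.05° cell is wider than 2 km in both directions anywhere in the UK,
    so no POI within the radius can lie outside the neighbouring cells.
    """
    ci, cj = cell_of(lat, lon)
    count = 0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            for plat, plon in grid.get((ci + di, cj + dj), ()):
                if haversine_km(lat, lon, plat, plon) <= RADIUS_KM:
                    count += 1
    return count
```

The grid lookup limits the exact distance check to nine cells' worth of candidates rather than every POI in the country.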