# perfect-postcode/pipeline/transform/crime_spatial.py

"""Aggregate police.uk street crime to postcodes by spatial proximity.

Instead of attributing each incident to its published LSOA code, this transform
counts the anonymised incident *points* that fall within ``buffer_m`` (default
100m) of each postcode's boundary polygon (the polygon buffered outward). A point
inside several overlapping buffers counts for each postcode -- the same
multiplicity the tree-density filter uses for features near more than one
postcode. The wide 100m buffer deliberately smooths police.uk's snap-to-grid
coordinates, which would otherwise make the count hypersensitive to which side of
a narrow line a shared "map point" anchor happened to land on.
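
A minimal sketch of the multiplicity rule, not the module's implementation: it
swaps the buffered-polygon STRtree for a brute-force distance test against
hypothetical axis-aligned rectangles. The postcodes, coordinates and
``distance_to_rect`` helper are all invented for illustration.

```python
import math

BUFFER_M = 100.0

# Hypothetical "postcode" rectangles (xmin, ymin, xmax, ymax) in metres.
postcodes = {"AA1 1AA": (0, 0, 50, 50), "AA1 1AB": (200, 0, 250, 50)}
# Hypothetical incident points: one between both rectangles, one near the
# first rectangle only.
points = [(125.0, 25.0), (-50.0, 25.0)]


def distance_to_rect(x, y, rect):
    # Distance from a point to a rectangle; 0.0 when the point is inside.
    xmin, ymin, xmax, ymax = rect
    dx = max(xmin - x, 0.0, x - xmax)
    dy = max(ymin - y, 0.0, y - ymax)
    return math.hypot(dx, dy)


counts = {pc: 0 for pc in postcodes}
for x, y in points:
    for pc, rect in postcodes.items():
        # Inside the buffer == within BUFFER_M of the boundary polygon.
        if distance_to_rect(x, y, rect) <= BUFFER_M:
            counts[pc] += 1

print(counts)  # {'AA1 1AA': 2, 'AA1 1AB': 1}
```

The first point lies 75m from both rectangles, so it is counted once for each.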

Counts are **area-normalised**: each postcode's count is divided by its buffered
catchment area and rescaled by the median catchment area, so the metric reflects
crime *density* rather than how much ground the buffer sweeps (a median-sized
catchment is left unchanged; a large rural postcode is no longer inflated simply
for covering more of the map). Normalising by the buffered area -- the region
that actually collects points -- rather than the raw polygon keeps tiny unit
postcodes from being over-inflated by the fixed buffer-ring floor. NOTE: this is
an incident *density of the surrounding streets*, not a per-resident risk --
zero-resident commercial centres (Soho, retail parks) legitimately rank high.
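
As a worked illustration of the normalisation (a sketch with invented areas and
rates, using plain Python rather than the numpy arrays this module works on):

```python
from statistics import median

# Hypothetical buffered catchment areas in m^2; 0.0 stands in for a postcode
# whose geometry is unusable.
areas = [40_000.0, 80_000.0, 160_000.0, 0.0]
raw_rate = 10.0  # the same raw incidents/yr counted in every catchment

median_area = median(a for a in areas if a > 0)  # 80_000.0


def density(rate, area):
    # Rescale a raw rate to the median catchment; None means unknown.
    if area <= 0:
        return None
    return rate * median_area / area


densities = [density(raw_rate, a) for a in areas]
print(densities)  # [20.0, 10.0, 5.0, None]
```

The median-sized catchment keeps its raw value, the half-sized one doubles, the
double-sized one halves, and the unusable one stays unknown rather than zero.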

**Force-coverage calendar.** police.uk has multi-year publication gaps for whole
forces (Greater Manchester has published nothing between 2019-07 and the present
except 2022-08; BTP, Gloucestershire, Devon & Cornwall and others have shorter
gaps). A missing month is *no data*, not zero crime, so every figure here is
computed against the months the postcode's own force actually published:

* Each postcode is assigned a home force by majority vote of the incidents that
  matched it (BTP, which reports nationwide, is excluded from the vote);
  postcodes with no incidents inherit their outcode's majority force, then the
  national modal force.
* The headline ``"{type} (avg/yr)"`` is the POOLED annualised rate over the
  force's covered months: ``sum(counts in covered years) * 12 / covered_months``.
  Years in which the force published nothing contribute neither incidents nor
  months, so a coverage gap no longer reads as a low-crime period. (Pooling over
  covered months also fixes the old "divide by years-with-incidents" headline,
  which inflated sporadic categories by up to ~15x.)
* The by-year series only emits bars for years with at least
  ``min_bar_months`` covered months (default 6): annualising a single observed
  month x12 produced misleading spikes. Each bar is scaled by the force's
  covered months in that year, not the global month calendar.
* ``covered_years`` (list[struct{year, months}]) is written for every postcode
  so the server can tell "covered, zero crime" (year listed, no bar) from "no
  data" (year absent) instead of charting gaps as zeros.
* Postcodes whose boundary buffer is unusable (broken geometry) get null
  headline columns and an empty ``covered_years`` -- unknown, not zero.
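
The arithmetic in the bullets above can be sketched for a single postcode and
crime type. The calendar and counts are invented, area normalisation is left
out, and the dict layout is only for illustration (the module itself works on
numpy arrays):

```python
MIN_BAR_MONTHS = 6  # same value as this module's default

# Hypothetical home-force calendar: year -> months the force published.
covered_months = {2019: 12, 2020: 0, 2021: 3}
# Hypothetical incident counts for one postcode and one crime type.
counts = {2019: 6, 2020: 0, 2021: 3}

# Pooled headline: incidents in covered years * 12 / covered months.
total_months = sum(covered_months.values())  # 15
pooled = sum(n for y, n in counts.items() if covered_months[y] > 0)  # 9
headline = pooled * 12 / total_months

# A plain calendar-year average would read the 2020 gap as a quiet year.
naive = sum(counts.values()) / len(counts)

# By-year bars: non-zero counts in years with enough covered months, each
# scaled by that year's own covered months.
bars = {
    y: n * 12 / covered_months[y]
    for y, n in counts.items()
    if n > 0 and covered_months[y] >= MIN_BAR_MONTHS
}

print(headline, naive, bars)  # 7.2 3.0 {2019: 6.0}
```

Here 2021 still feeds the pooled headline (3 months, 3 incidents) but gets no
bar, because annualising 3 months would multiply the count by 4.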

Outputs mirror the old LSOA transform's shape but are keyed on ``postcode``:

* ``crime_by_postcode.parquet`` -- ``postcode`` + ``"{type} (avg/yr)"`` columns.
* ``crime_by_postcode_by_year.parquet`` -- one row per postcode: ``postcode`` +
  ``covered_years`` + nested ``"{type} (by year)"`` ``list[struct{year, count}]``
  columns, with Serious/Minor rollups.

Caveat: police.uk coordinates are snapped to a fixed set of anonymous "map
points", not true locations, and a share of rows have no coordinate at all
(dropped here). Spatial totals are therefore fuzzier than the old LSOA-tagged
counts -- by design, not a regression.
"""

from __future__ import annotations

import argparse
import re
import sys
from pathlib import Path

import numpy as np
import polars as pl
import shapely
from pyproj import Transformer

from pipeline.transform.crime import (
    LEGACY_CRIME_TYPE_ALIASES,
    MINOR_CRIME_TYPES,
    SERIOUS_CRIME_TYPES,
    find_street_crime_csvs,
)
from pipeline.transform.postcode_boundaries.loader import load_postcode_polygons

# Serious types first so column order is stable and self-documenting.
ALL_CRIME_TYPES: tuple[str, ...] = SERIOUS_CRIME_TYPES + MINOR_CRIME_TYPES

DEFAULT_BUFFER_M = 100.0
MONTH_DIR_RE = re.compile(r"^\d{4}-\d{2}$")
STREET_CSV_NAME_RE = re.compile(r"^(\d{4}-\d{2})-(.+)-street\.csv$")

# Minimum covered months for a year to get a by-year chart bar (and to be
# listed in `covered_years`). Annualising fewer observed months (x12 from a
# single month at the worst) produces bars dominated by noise, and the first
# (2010: one month) and current partial year would otherwise always chart as
# spikes/dips. Six months keeps the annualisation factor <= 2.
MIN_BAR_MONTHS = 6

# Forces that report nationwide rather than policing a territory. They never
# define a postcode's home force (their publication calendar says nothing about
# whether the *territorial* force covering the postcode published), but their
# incidents still count toward whichever postcodes they fall in.
NON_TERRITORIAL_FORCES = frozenset({"btp"})

COVERAGE_COLUMN = "covered_years"

# Generous GB bounds; points outside fall in no English postcode anyway, but
# filtering first keeps the WGS84->BNG transform out of its undefined region.
LON_BOUNDS = (-9.5, 2.5)
LAT_BOUNDS = (49.0, 61.5)

# Read CSVs in chunks of files to bound peak memory while keeping the STRtree
# query vectorised over a useful number of points.
_CSV_BATCH = 64


def _force_calendar(
    csvs: list[Path],
) -> tuple[list[int], list[str], np.ndarray]:
    """Derive the per-force publication calendar from the CSV paths.

    Each police.uk file lives under
    ``{crime_dir}/{YYYY-MM}/{YYYY-MM}-{force}-street.csv`` and holds that
    force's incidents for that month, so file presence IS the coverage signal:
    a (force, month) with no file published nothing. Paths that do not follow
    this layout are skipped. Returns the sorted distinct years, the sorted
    force slugs, and ``months_in_year_force`` of shape (n_forces, n_years):
    how many months each force published in each year.
    """
    month_force: set[tuple[str, str]] = set()
    for path in csvs:
        if not MONTH_DIR_RE.fullmatch(path.parent.name):
            continue
        m = STREET_CSV_NAME_RE.fullmatch(path.name)
        if m is None or m.group(1) != path.parent.name:
            continue
        month_force.add((m.group(1), m.group(2)))
    if not month_force:
        raise ValueError("No valid YYYY-MM street crime CSVs found")

    years = sorted({int(month[:4]) for month, _ in month_force})
    forces = sorted({force for _, force in month_force})
    year_to_idx = {year: idx for idx, year in enumerate(years)}
    force_to_idx = {force: idx for idx, force in enumerate(forces)}

    months_in_year_force = np.zeros((len(forces), len(years)), dtype=np.int32)
    for month, force in month_force:
        months_in_year_force[force_to_idx[force], year_to_idx[int(month[:4])]] += 1

    # Surface coverage gaps loudly: any force (BTP included) missing months
    # inside the global publication window is exactly the data hole the
    # coverage masking exists for.
    all_months = {month for month, _ in month_force}
    for force in forces:
        published = {m for m, f in month_force if f == force}
        missing = len(all_months) - len(published)
        if missing:
            print(
                f"  coverage gap: {force} missing {missing}/{len(all_months)} months"
            )

    return years, forces, months_in_year_force


def _build_tree(
    polygons: np.ndarray, buffer_m: float
) -> tuple[np.ndarray, shapely.STRtree]:
    """Buffer postcode polygons outward by ``buffer_m`` and index them.

    Buffer index == postcode index. Geometries that fail to buffer are replaced
    with an empty polygon so the index stays aligned; they simply never match.
    """
    buffers = shapely.buffer(polygons, buffer_m, quad_segs=8)
    broken = shapely.is_missing(buffers) | ~shapely.is_valid(buffers)
    if broken.any():
        print(f"  {int(broken.sum()):,} postcode buffers unusable; left empty")
        buffers[broken] = shapely.from_wkt("POLYGON EMPTY")
    return buffers, shapely.STRtree(buffers)


def _accumulate_counts(
    csvs: list[Path],
    tree: shapely.STRtree,
    type_to_idx: dict[str, int],
    year_to_idx: dict[int, int],
    force_to_idx: dict[str, int],
    transformer: Transformer,
    counts: np.ndarray,
    force_votes: np.ndarray,
) -> None:
    """Stream the crime CSVs, counting points-in-buffer per (postcode, type, year).

    Also accumulates ``force_votes`` (n_postcodes, n_forces): how many matched
    incidents each force's files contributed to each postcode, which later
    elects the postcode's home force for the coverage calendar.
    """
    schema = {
        "Longitude": pl.Float64,
        "Latitude": pl.Float64,
        "Month": pl.String,
        "Crime type": pl.String,
    }
    years = list(year_to_idx)
    total_points = 0
    total_matches = 0
    total_dropped = 0
    unknown_type_counts: dict[str, int] = {}

    for start in range(0, len(csvs), _CSV_BATCH):
        batch = csvs[start : start + _CSV_BATCH]
        # The source file identifies the publishing force (police.uk has no
        # force column with consistent naming); map each path back to its
        # force index for the home-force vote.
        path_to_fidx: dict[str, int] = {}
        for path in batch:
            m = STREET_CSV_NAME_RE.fullmatch(path.name)
            if m is not None and m.group(2) in force_to_idx:
                path_to_fidx[str(path)] = force_to_idx[m.group(2)]
        frame = (
            pl.scan_csv(
                batch,
                schema_overrides=schema,
                ignore_errors=True,
                include_file_paths="_source_path",
            )
            .select("Longitude", "Latitude", "Month", "Crime type", "_source_path")
            # strict=False: a single malformed Month drops only that row instead
            # of aborting the whole build (a non-numeric year becomes null and is
            # filtered out by the year membership check below).
            .with_columns(
                pl.col("Month").str.slice(0, 4).cast(pl.Int32, strict=False).alias("year")
            )
            .filter(
                pl.col("Longitude").is_not_null()
                & pl.col("Latitude").is_not_null()
                & pl.col("Longitude").is_between(*LON_BOUNDS)
                & pl.col("Latitude").is_between(*LAT_BOUNDS)
                & pl.col("Crime type").is_not_null()
                & (pl.col("Crime type") != "")
                & pl.col("year").is_in(years)
            )
            # Canonicalise legacy pre-2014 crime-type names ("Violent crime",
            # "Public disorder and weapons") to their current equivalents before
            # indexing, so ~1.9M historical incidents are counted instead of
            # dropped. `.replace` leaves current types unchanged.
            .with_columns(pl.col("Crime type").replace(LEGACY_CRIME_TYPE_ALIASES))
            # Map crime types to indices with default=None so an unrecognised
            # type yields a null index we can *report* rather than silently drop
            # (the legacy LSOA path surfaced unknown types via its dynamic pivot).
            .with_columns(
                pl.col("Crime type")
                .replace_strict(type_to_idx, default=None, return_dtype=pl.Int32)
                .alias("tidx"),
                pl.col("year")
                .replace_strict(year_to_idx, return_dtype=pl.Int32)
                .alias("yidx"),
                pl.col("_source_path")
                .replace_strict(path_to_fidx, default=-1, return_dtype=pl.Int32)
                .alias("fidx"),
            )
            .select("Longitude", "Latitude", "Crime type", "tidx", "yidx", "fidx")
            .collect(engine="streaming")
        )

        if frame.height == 0:
            continue

        unknown = frame.filter(pl.col("tidx").is_null())
        if unknown.height:
            for name, cnt in unknown.group_by("Crime type").len().iter_rows():
                unknown_type_counts[name] = unknown_type_counts.get(name, 0) + cnt
            frame = frame.filter(pl.col("tidx").is_not_null())
            if frame.height == 0:
                continue

        lon = frame["Longitude"].to_numpy()
        lat = frame["Latitude"].to_numpy()
        tidx = frame["tidx"].to_numpy()
        yidx = frame["yidx"].to_numpy()
        fidx = frame["fidx"].to_numpy()

        x, y = transformer.transform(lon, lat)
        finite = np.isfinite(x) & np.isfinite(y)
        total_dropped += int((~finite).sum())
        if not finite.any():
            continue
        x, y, tidx, yidx, fidx = (
            x[finite],
            y[finite],
            tidx[finite],
            yidx[finite],
            fidx[finite],
        )
        total_points += x.size

        points = shapely.points(x, y)
        point_index, postcode_index = tree.query(points, predicate="intersects")
        if point_index.size:
            np.add.at(
                counts,
                (postcode_index, tidx[point_index], yidx[point_index]),
                1,
            )
            matched_fidx = fidx[point_index]
            known_force = matched_fidx >= 0
            if known_force.any():
                np.add.at(
                    force_votes,
                    (postcode_index[known_force], matched_fidx[known_force]),
                    1,
                )
            total_matches += point_index.size

        print(
            f"  files {start + len(batch):,}/{len(csvs):,}: "
            f"{total_points:,} located points, {total_matches:,} postcode matches"
        )

    if total_dropped:
        print(f"Dropped {total_dropped:,} points outside the BNG transform domain")
    if unknown_type_counts:
        total_unknown = sum(unknown_type_counts.values())
        listed = ", ".join(
            f"{name!r} ({cnt:,})"
            for name, cnt in sorted(
                unknown_type_counts.items(), key=lambda kv: kv[1], reverse=True
            )
        )
        print(
            f"WARNING: dropped {total_unknown:,} incidents with crime types not in "
            f"ALL_CRIME_TYPES (taxonomy is stale -- update SERIOUS/MINOR_CRIME_TYPES): "
            f"{listed}",
            file=sys.stderr,
        )


def _assign_home_force(
    postcodes: np.ndarray,
    force_votes: np.ndarray,
    forces: list[str],
) -> np.ndarray:
    """Elect each postcode's home (territorial) force.

    Plurality vote of matched incidents per publishing force: the force with
    the most matches wins, and ties go to the force listed first in
    ``forces``. Non-territorial forces (BTP) are excluded from the vote because
    their calendar says nothing about local coverage. Postcodes with no votes
    (no incidents ever, or BTP-only) inherit the most common home force among
    voted postcodes in their outcode, then the national modal force, so every
    postcode gets a coverage calendar.
    """
    votes = force_votes.astype(np.int64, copy=True)
    for idx, force in enumerate(forces):
        if force in NON_TERRITORIAL_FORCES:
            votes[:, idx] = 0

    home = votes.argmax(axis=1).astype(np.int32)
    has_vote = votes.max(axis=1) > 0
    home[~has_vote] = -1

    if not has_vote.any():
        raise ValueError("No incidents matched any postcode; cannot assign forces")

    # Outcode-majority fallback for postcodes with no (territorial) incidents.
    outcodes = np.array([pc.split(" ")[0] for pc in postcodes], dtype=object)
    national_modal = int(
        np.bincount(home[has_vote], minlength=len(forces)).argmax()
    )
    if (~has_vote).any():
        outcode_modal: dict[str, int] = {}
        voted_outcodes = outcodes[has_vote]
        voted_home = home[has_vote]
        for oc in np.unique(voted_outcodes):
            counts = np.bincount(voted_home[voted_outcodes == oc], minlength=len(forces))
            outcode_modal[oc] = int(counts.argmax())
        fallback = np.array(
            [outcode_modal.get(oc, national_modal) for oc in outcodes[~has_vote]],
            dtype=np.int32,
        )
        home[~has_vote] = fallback
        print(
            f"  {int((~has_vote).sum()):,} postcodes had no territorial incidents; "
            "home force inherited from outcode majority (national modal force "
            "where the outcode has no voted postcodes)"
        )

    return home


def _rollup_long(
    long: pl.DataFrame, types: tuple[str, ...], rollup_name: str
) -> pl.DataFrame:
    """Sum per-year annualised counts across ``types`` into a single rollup."""
    return (
        long.filter(pl.col("Crime type").is_in(list(types)))
        .group_by("postcode", "year")
        .agg(pl.col("count").sum().round(1).alias("count"))
        .with_columns(pl.lit(rollup_name).alias("Crime type"))
        .select("postcode", "Crime type", "year", "count")
    )


def _write_avg_yr(
    postcodes: np.ndarray,
    counts: np.ndarray,
    months_in_year_force: np.ndarray,
    home_fidx: np.ndarray,
    norm: np.ndarray,
    output_path: Path,
) -> None:
    """Write ``postcode`` + ``"{type} (avg/yr)"`` density-normalised averages.

    The headline is the POOLED annualised rate over the home force's covered
    months: ``sum(counts in covered years) * 12 / covered_months``. Years the
    force published nothing contribute neither incidents nor months, so a
    coverage gap (e.g. Greater Manchester 2019-07 onwards) is excluded instead
    of read as zero crime. Pooling over the full covered window -- rather than
    averaging only over years a type happened to occur -- is what keeps a
    single robbery-year from printing as a perennial robbery rate. Each
    postcode's value is then multiplied by ``norm`` (median_area / buffered
    catchment area) so the metric is a density rather than a footprint-inflated
    raw count; postcodes with unusable geometry (norm == 0) are null, not 0.
    """
    n_postcodes, n_types = counts.shape[0], counts.shape[1]
    avg = np.full((n_postcodes, n_types), np.nan, dtype=np.float64)
    for f in range(months_in_year_force.shape[0]):
        sel = home_fidx == f
        if not sel.any():
            continue
        cov_months = months_in_year_force[f].astype(np.float64)
        denom = cov_months.sum()
        if denom <= 0:
            continue  # force never published; stays null
        covered_years = cov_months > 0
        pooled = counts[sel][:, :, covered_years].sum(axis=2, dtype=np.float64)
        avg[sel] = pooled * 12.0 / denom

    avg *= norm[:, None]
    avg[norm <= 0] = np.nan  # unusable geometry: unknown, not zero
    avg = np.round(avg, 1).astype(np.float32)

    data: dict[str, np.ndarray] = {"postcode": postcodes}
    for type_idx, name in enumerate(ALL_CRIME_TYPES):
        data[f"{name} (avg/yr)"] = avg[:, type_idx]

    # Serious/Minor rollup headlines = the exact SUM of their component (avg/yr)
    # columns, so each rollup always equals the sum of the parts shown beside it
    # and can never fall below one of its own components. All components share
    # the postcode's pooled covered-month denominator, so the sum is itself the
    # pooled rollup rate. Null components (unusable geometry) propagate to a
    # null rollup.
    for rollup_name, rollup_types in (
        ("Serious crime", SERIOUS_CRIME_TYPES),
        ("Minor crime", MINOR_CRIME_TYPES),
    ):
        rollup_idx = [ALL_CRIME_TYPES.index(name) for name in rollup_types]
        data[f"{rollup_name} (avg/yr)"] = np.round(
            avg[:, rollup_idx].sum(axis=1), 1
        ).astype(np.float32)

    frame = pl.DataFrame(data)
    value_cols = [c for c in frame.columns if c != "postcode"]
    frame = frame.with_columns(pl.col(c).fill_nan(None) for c in value_cols)

    output_path.parent.mkdir(parents=True, exist_ok=True)
    frame.write_parquet(output_path, compression="zstd")
    print(f"Wrote postcode crime averages: {output_path}")


def _write_by_year(
    postcodes: np.ndarray,
    counts: np.ndarray,
    years: list[int],
    months_in_year_force: np.ndarray,
    home_fidx: np.ndarray,
    norm: np.ndarray,
    min_bar_months: int,
    output_path: Path,
) -> None:
    """Write nested ``"{type} (by year)"`` series plus rollups and coverage.

    A bar is only emitted for (postcode, year)s where the postcode's home force
    published at least ``min_bar_months`` months -- annualising a thinner year
    (x12 from a single month at the extreme) charts noise, and a force-gap year
    must chart as *no data*, not zero. Bars are scaled by the force's covered
    months in that year and area-normalised by the same ``norm`` factor as the
    headline so chart and headline stay mutually consistent.

    Every postcode gets a row (the output is dense) carrying ``covered_years``
    -- the list of {year, months} entries for years in which the home force
    published at least ``min_bar_months`` months -- so consumers can
    distinguish covered-but-crime-free years (year listed, no bar => genuine
    zero) from coverage gaps (year absent => unknown). Postcodes with unusable
    geometry get an empty coverage list: their crime picture is unknown.
    """
    # (n_postcodes, n_years): covered months of each postcode's home force.
    cov_pc_year = months_in_year_force[home_fidx, :]
    usable = norm > 0

    annual = np.round(
        counts.astype(np.float64)
        * 12.0
        / np.maximum(cov_pc_year[:, None, :], 1)
        * norm[:, None, None],
        1,
    )
    bar_ok = (
        (counts > 0)
        & (cov_pc_year[:, None, :] >= min_bar_months)
        & usable[:, None, None]
    )

    pc_i, ty_i, yr_i = np.nonzero(bar_ok)

    type_names = np.array(ALL_CRIME_TYPES, dtype=object)
    year_values = np.array(years, dtype=np.int32)
    # Explicit schema: with full masking (e.g. every year below min_bar_months)
    # the fancy-indexed numpy object arrays are empty and polars would infer
    # Object columns, which breaks the rollup `is_in` below.
    long = pl.DataFrame(
        {
            "postcode": postcodes[pc_i].astype(str),
            "Crime type": type_names[ty_i].astype(str),
            "year": year_values[yr_i],
            "count": annual[pc_i, ty_i, yr_i].astype(np.float32),
        },
        schema_overrides={"postcode": pl.String, "Crime type": pl.String},
    )

    serious = _rollup_long(long, SERIOUS_CRIME_TYPES, "Serious crime")
    minor = _rollup_long(long, MINOR_CRIME_TYPES, "Minor crime")
    combined = pl.concat([long, serious, minor])

    by_type = (
        combined.sort("year")
        .group_by("postcode", "Crime type")
        .agg(pl.struct("year", "count").alias("series"))
    )
    wide = by_type.pivot(on="Crime type", index="postcode", values="series")
    type_cols = [c for c in wide.columns if c != "postcode"]
    wide = wide.rename({col: f"{col} (by year)" for col in type_cols})

    # Dense base: every postcode, with its home force's coverage calendar.
    # Built per force (there are ~45) and joined on the force index.
    coverage_per_force: list[list[dict[str, int]]] = []
    for f in range(months_in_year_force.shape[0]):
        coverage_per_force.append(
            [
                {"year": int(years[y]), "months": int(m)}
                for y, m in enumerate(months_in_year_force[f])
                if m >= min_bar_months
            ]
        )
    coverage_frame = pl.DataFrame(
        {
            "_fidx": pl.Series(range(len(coverage_per_force)), dtype=pl.Int32),
            COVERAGE_COLUMN: pl.Series(
                coverage_per_force,
                dtype=pl.List(pl.Struct({"year": pl.Int32, "months": pl.Int32})),
            ),
        }
    )
    base = pl.DataFrame(
        {
            "postcode": postcodes,
            "_fidx": pl.Series(home_fidx, dtype=pl.Int32),
            "_usable": pl.Series(usable),
        }
    )
    dense = (
        base.join(coverage_frame, on="_fidx", how="left")
        .with_columns(
            # Unusable geometry: empty coverage -- the crime picture is unknown.
            pl.when(pl.col("_usable"))
            .then(pl.col(COVERAGE_COLUMN))
            .otherwise(pl.col(COVERAGE_COLUMN).list.head(0))
            .alias(COVERAGE_COLUMN)
        )
        .drop("_fidx", "_usable")
    )
    wide = dense.join(wide, on="postcode", how="left")

    output_path.parent.mkdir(parents=True, exist_ok=True)
    wide.write_parquet(output_path, compression="zstd")
    print(f"Wrote postcode crime by-year series: {output_path}  {wide.shape}")


def transform_crime_spatial(
    crime_dir: Path,
    boundaries_dir: Path,
    output_path: Path,
    by_year_output_path: Path,
    buffer_m: float = DEFAULT_BUFFER_M,
    max_postcodes: int | None = None,
    max_files: int | None = None,
    min_bar_months: int = MIN_BAR_MONTHS,
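) -> None:
    """Run the full spatial crime transform and write both parquet outputs.

    Derives the per-force publication calendar from the CSV paths, buffers the
    postcode polygons by ``buffer_m`` and indexes them, streams the incident
    points into per-(postcode, type, year) counts, elects each postcode's home
    force, then writes the ``(avg/yr)`` headline file to ``output_path`` and
    the by-year series to ``by_year_output_path``. ``max_postcodes`` and
    ``max_files`` truncate the inputs for testing only.
    """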
    csvs, ignored_csv_count = find_street_crime_csvs(crime_dir)
    if not csvs:
        raise FileNotFoundError(f"No street crime CSV files found in {crime_dir}")
    if max_files is not None:
        csvs = csvs[:max_files]

    years, forces, months_in_year_force = _force_calendar(csvs)
    print(
        f"Found {len(csvs):,} street crime CSVs across {len(forces)} forces "
        f"({years[0]}-{years[-1]})"
        + (f" (ignored {ignored_csv_count} non-street CSVs)" if ignored_csv_count else "")
    )

    postcodes, polygons = load_postcode_polygons(boundaries_dir, max_postcodes)

    print(f"Buffering {len(postcodes):,} postcode polygons by {buffer_m:g}m...")
    buffers, tree = _build_tree(polygons, buffer_m)

    # Area-normalisation factor (median_area / catchment_area): divides out the
    # size of each postcode's catchment so the count measures crime density, not
    # how much ground the buffer sweeps. We normalise by the *buffered* area --
    # the region that actually collects points -- rather than the raw polygon, so
    # a tiny unit postcode isn't over-inflated by the fixed buffer-ring floor.
    # Buffers are in EPSG:27700, so shapely.area is in m^2.
    areas = shapely.area(buffers).astype(np.float64)
    usable_area = np.isfinite(areas) & (areas > 0)
    if not usable_area.any():
        raise ValueError("No postcode buffers have a positive area to normalise by")
    median_area = float(np.median(areas[usable_area]))
    norm = np.zeros(len(postcodes), dtype=np.float64)
    norm[usable_area] = median_area / areas[usable_area]
    print(
        f"Area-normalising to median catchment area {median_area:,.0f} m^2 "
        f"({int(usable_area.sum()):,}/{len(areas):,} postcodes have usable area)"
    )

    type_to_idx = {name: idx for idx, name in enumerate(ALL_CRIME_TYPES)}
    year_to_idx = {year: idx for idx, year in enumerate(years)}
    force_to_idx = {force: idx for idx, force in enumerate(forces)}
    counts = np.zeros((len(postcodes), len(ALL_CRIME_TYPES), len(years)), dtype=np.int32)
    force_votes = np.zeros((len(postcodes), len(forces)), dtype=np.int32)

    transformer = Transformer.from_crs("EPSG:4326", "EPSG:27700", always_xy=True)
    _accumulate_counts(
        csvs, tree, type_to_idx, year_to_idx, force_to_idx, transformer, counts, force_votes
    )

    home_fidx = _assign_home_force(np.asarray(postcodes), force_votes, forces)

    _write_avg_yr(
        postcodes, counts, months_in_year_force, home_fidx, norm, output_path
    )
    _write_by_year(
        postcodes,
        counts,
        years,
        months_in_year_force,
        home_fidx,
        norm,
        min_bar_months,
        by_year_output_path,
    )


def main() -> None:
    """CLI entry point: parse arguments, validate them, and run the transform."""
    parser = argparse.ArgumentParser(
        description="Count police.uk crime points near each postcode boundary"
    )
    parser.add_argument(
        "--input",
        type=Path,
        default=Path("property-data/crime"),
        help="Directory containing police.uk street crime CSVs",
    )
    parser.add_argument(
        "--boundaries",
        type=Path,
        default=Path("property-data/postcode_boundaries/units"),
        help="Directory of per-district postcode boundary GeoJSONs",
    )
    parser.add_argument(
        "--output",
        type=Path,
        required=True,
        help="Output parquet: postcode + '{type} (avg/yr)' columns",
    )
    parser.add_argument(
        "--output-by-year",
        type=Path,
        required=True,
        help="Output parquet: postcode + nested '{type} (by year)' columns",
    )
    parser.add_argument(
        "--buffer-m",
        type=float,
        default=DEFAULT_BUFFER_M,
        help="Outward buffer (metres) added to each postcode boundary",
    )
    parser.add_argument(
        "--max-postcodes",
        type=int,
        default=None,
        help="Testing only: process the first N postcodes",
    )
    parser.add_argument(
        "--max-files",
        type=int,
        default=None,
        help="Testing only: process the first N monthly CSV files",
    )
    parser.add_argument(
        "--min-bar-months",
        type=int,
        default=MIN_BAR_MONTHS,
        help="Minimum covered months for a year to get a by-year bar",
    )
    args = parser.parse_args()

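    if args.buffer_m <= 0:
        raise SystemExit("--buffer-m must be greater than zero")
    # A threshold below 1 would list years with no published months as covered
    # (a gap reported as a genuine zero); above 12 no year could ever qualify.
    if not 1 <= args.min_bar_months <= 12:
        raise SystemExit("--min-bar-months must be between 1 and 12")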

    transform_crime_spatial(
        crime_dir=args.input,
        boundaries_dir=args.boundaries,
        output_path=args.output,
        by_year_output_path=args.output_by_year,
        buffer_m=args.buffer_m,
        max_postcodes=args.max_postcodes,
        max_files=args.max_files,
        min_bar_months=args.min_bar_months,
    )


if __name__ == "__main__":
    main()