"""Aggregate police.uk street crime to postcodes by spatial proximity. Instead of attributing each incident to its published LSOA code, this transform counts the anonymised incident *points* that fall within ``buffer_m`` (default 100m) of each postcode's boundary polygon (the polygon buffered outward). A point inside several overlapping buffers counts for each postcode -- the same multiplicity the tree-density filter uses for features near more than one postcode. The wide 100m buffer deliberately smooths police.uk's snap-to-grid coordinates, which would otherwise make the count hypersensitive to which side of a narrow line a shared "map point" anchor happened to land on. Counts are **area-normalised**: each postcode's count is divided by its buffered catchment area and rescaled by the median catchment area, so the metric reflects crime *density* rather than how much ground the buffer sweeps (a median-sized catchment is left unchanged; a large rural postcode is no longer inflated simply for covering more of the map). Normalising by the buffered area -- the region that actually collects points -- rather than the raw polygon keeps tiny unit postcodes from being over-inflated by the fixed buffer-ring floor. NOTE: this is an incident *density of the surrounding streets*, not a per-resident risk -- zero-resident commercial centres (Soho, retail parks) legitimately rank high. **Force-coverage calendar.** police.uk has multi-year publication gaps for whole forces (Greater Manchester has published nothing between 2019-07 and the present except 2022-08; BTP, Gloucestershire, Devon & Cornwall and others have shorter gaps). 
A missing month is *no data*, not zero crime, so every figure here is computed
against the months the postcode's own force actually published:

* Each postcode is assigned a home force by majority vote of the incidents that
  matched it (BTP, which reports nationwide, is excluded from the vote);
  postcodes with no incidents inherit their outcode's majority force, then the
  national modal force.
* The headline ``"{type} (avg/yr)"`` is the POOLED annualised rate over the
  force's covered months:
  ``sum(counts in covered years) * 12 / covered_months``.
  Years in which the force published nothing contribute neither incidents nor
  months, so a coverage gap no longer reads as a low-crime period. (Pooling
  over covered months also fixes the old "divide by years-with-incidents"
  headline, which inflated sporadic categories by up to ~15x.)
* The by-year series only emits bars for years with at least
  ``min_bar_months`` covered months (default 6): annualising a single observed
  month x12 produced misleading spikes. Each bar is scaled by the force's
  covered months in that year, not the global month calendar.
* ``covered_years`` (list[struct{year, months}]) is written for every postcode
  so the server can tell "covered, zero crime" (year listed, no bar) from "no
  data" (year absent) instead of charting gaps as zeros.
* Postcodes whose boundary buffer is unusable (broken geometry) get null
  headline columns and an empty ``covered_years`` -- unknown, not zero.

Outputs mirror the old LSOA transform's shape but are keyed on ``postcode``:

* ``crime_by_postcode.parquet`` -- ``postcode`` + ``"{type} (avg/yr)"`` columns.
* ``crime_by_postcode_by_year.parquet`` -- one row per postcode: ``postcode`` +
  ``covered_years`` + nested ``"{type} (by year)"``
  ``list[struct{year, count}]`` columns, with Serious/Minor rollups.

Caveat: police.uk coordinates are snapped to a fixed set of anonymous "map
points", not true locations, and a share of rows have no coordinate at all
(dropped here).
Spatial totals are therefore fuzzier than the old LSOA-tagged counts -- by
design, not a regression.
"""

from __future__ import annotations

import argparse
import re
import sys
from pathlib import Path

import numpy as np
import polars as pl
import shapely
from pyproj import Transformer

from pipeline.transform.crime import (
    LEGACY_CRIME_TYPE_ALIASES,
    MINOR_CRIME_TYPES,
    SERIOUS_CRIME_TYPES,
    find_street_crime_csvs,
)
from pipeline.transform.postcode_boundaries.loader import load_postcode_polygons

# Serious types first so column order is stable and self-documenting.
ALL_CRIME_TYPES: tuple[str, ...] = SERIOUS_CRIME_TYPES + MINOR_CRIME_TYPES

DEFAULT_BUFFER_M = 100.0
MONTH_DIR_RE = re.compile(r"^\d{4}-\d{2}$")
STREET_CSV_NAME_RE = re.compile(r"^(\d{4}-\d{2})-(.+)-street\.csv$")

# Minimum covered months for a year to get a by-year chart bar (and to be
# listed in `covered_years`). Annualising fewer observed months (x12 from a
# single month at the worst) produces bars dominated by noise, and the first
# (2010: one month) and current partial year would otherwise always chart as
# spikes/dips. Six months keeps the annualisation factor <= 2.
MIN_BAR_MONTHS = 6

# Forces that report nationwide rather than policing a territory. They never
# define a postcode's home force (their publication calendar says nothing about
# whether the *territorial* force covering the postcode published), but their
# incidents still count toward whichever postcodes they fall in.
NON_TERRITORIAL_FORCES = frozenset({"btp"})

COVERAGE_COLUMN = "covered_years"

# Generous GB bounds; points outside fall in no English postcode anyway, but
# filtering first keeps the WGS84->BNG transform out of its undefined region.
LON_BOUNDS = (-9.5, 2.5)
LAT_BOUNDS = (49.0, 61.5)

# Read CSVs in chunks of files to bound peak memory while keeping the STRtree
# query vectorised over a useful number of points.
_CSV_BATCH = 64


def _force_calendar(
    csvs: list[Path],
) -> tuple[list[int], list[str], np.ndarray]:
    """Derive the per-force publication calendar from the CSV paths.

    Each police.uk file lives under
    ``{crime_dir}/{YYYY-MM}/{YYYY-MM}-{force}-street.csv`` and holds that
    force's incidents for that month, so file presence IS the coverage signal:
    a (force, month) with no file published nothing.

    Returns the sorted distinct years, the force slugs (sorted), and
    ``months_in_year_force`` of shape (n_forces, n_years) -- how many months
    each force published in each year.
    """
    month_force: set[tuple[str, str]] = set()
    for path in csvs:
        if not MONTH_DIR_RE.fullmatch(path.parent.name):
            continue
        m = STREET_CSV_NAME_RE.fullmatch(path.name)
        if m is None or m.group(1) != path.parent.name:
            continue
        month_force.add((m.group(1), m.group(2)))
    if not month_force:
        raise ValueError("No valid YYYY-MM street crime CSVs found")

    years = sorted({int(month[:4]) for month, _ in month_force})
    forces = sorted({force for _, force in month_force})
    year_to_idx = {year: idx for idx, year in enumerate(years)}
    force_to_idx = {force: idx for idx, force in enumerate(forces)}
    months_in_year_force = np.zeros((len(forces), len(years)), dtype=np.int32)
    for month, force in month_force:
        months_in_year_force[force_to_idx[force], year_to_idx[int(month[:4])]] += 1

    # Surface coverage gaps loudly: any territorial force missing months inside
    # the global publication window is exactly the data hole the coverage
    # masking exists for.
    all_months = {month for month, _ in month_force}
    for force in forces:
        published = {m for m, f in month_force if f == force}
        missing = len(all_months) - len(published)
        if missing:
            print(
                f" coverage gap: {force} missing {missing}/{len(all_months)} months"
            )
    return years, forces, months_in_year_force


def _build_tree(
    polygons: np.ndarray, buffer_m: float
) -> tuple[np.ndarray, shapely.STRtree]:
    """Buffer postcode polygons outward by ``buffer_m`` and index them.

    Buffer index == postcode index.
    Geometries that fail to buffer are replaced with an empty polygon so the
    index stays aligned; they simply never match.
    """
    buffers = shapely.buffer(polygons, buffer_m, quad_segs=8)
    broken = shapely.is_missing(buffers) | ~shapely.is_valid(buffers)
    if broken.any():
        print(f" {int(broken.sum()):,} postcode buffers unusable; left empty")
        buffers[broken] = shapely.from_wkt("POLYGON EMPTY")
    return buffers, shapely.STRtree(buffers)


def _accumulate_counts(
    csvs: list[Path],
    tree: shapely.STRtree,
    type_to_idx: dict[str, int],
    year_to_idx: dict[int, int],
    force_to_idx: dict[str, int],
    transformer: Transformer,
    counts: np.ndarray,
    force_votes: np.ndarray,
) -> None:
    """Stream the crime CSVs, counting points-in-buffer per (postcode, type, year).

    Also accumulates ``force_votes`` (n_postcodes, n_forces): how many matched
    incidents each force's files contributed to each postcode, which later
    elects the postcode's home force for the coverage calendar.
    """
    schema = {
        "Longitude": pl.Float64,
        "Latitude": pl.Float64,
        "Month": pl.Utf8,
        "Crime type": pl.Utf8,
    }
    years = list(year_to_idx)
    total_points = 0
    total_matches = 0
    total_dropped = 0
    unknown_type_counts: dict[str, int] = {}

    for start in range(0, len(csvs), _CSV_BATCH):
        batch = csvs[start : start + _CSV_BATCH]
        # The source file identifies the publishing force (police.uk has no
        # force column with consistent naming); map each path back to its
        # force index for the home-force vote.
        path_to_fidx = {}
        for path in batch:
            m = STREET_CSV_NAME_RE.fullmatch(path.name)
            if m is not None and m.group(2) in force_to_idx:
                path_to_fidx[str(path)] = force_to_idx[m.group(2)]
        frame = (
            pl.scan_csv(
                batch,
                schema_overrides=schema,
                ignore_errors=True,
                include_file_paths="_source_path",
            )
            .select("Longitude", "Latitude", "Month", "Crime type", "_source_path")
            # strict=False: a single malformed Month drops only that row instead
            # of aborting the whole build (a non-numeric year becomes null and is
            # filtered out by the year membership check below).
            .with_columns(
                pl.col("Month").str.slice(0, 4).cast(pl.Int32, strict=False).alias("year")
            )
            .filter(
                pl.col("Longitude").is_not_null()
                & pl.col("Latitude").is_not_null()
                & pl.col("Longitude").is_between(*LON_BOUNDS)
                & pl.col("Latitude").is_between(*LAT_BOUNDS)
                & pl.col("Crime type").is_not_null()
                & (pl.col("Crime type") != "")
                & pl.col("year").is_in(years)
            )
            # Canonicalise legacy pre-2014 crime-type names ("Violent crime",
            # "Public disorder and weapons") to their current equivalents before
            # indexing, so ~1.9M historical incidents are counted instead of
            # dropped. `.replace` leaves current types unchanged.
            .with_columns(pl.col("Crime type").replace(LEGACY_CRIME_TYPE_ALIASES))
            # Map crime types to indices with default=None so an unrecognised
            # type yields a null index we can *report* rather than silently drop
            # (the legacy LSOA path surfaced unknown types via its dynamic pivot).
            .with_columns(
                pl.col("Crime type")
                .replace_strict(type_to_idx, default=None, return_dtype=pl.Int32)
                .alias("tidx"),
                pl.col("year")
                .replace_strict(year_to_idx, return_dtype=pl.Int32)
                .alias("yidx"),
                pl.col("_source_path")
                .replace_strict(path_to_fidx, default=-1, return_dtype=pl.Int32)
                .alias("fidx"),
            )
            .select("Longitude", "Latitude", "Crime type", "tidx", "yidx", "fidx")
            .collect(engine="streaming")
        )
        if frame.height == 0:
            continue

        unknown = frame.filter(pl.col("tidx").is_null())
        if unknown.height:
            for name, cnt in unknown.group_by("Crime type").len().iter_rows():
                unknown_type_counts[name] = unknown_type_counts.get(name, 0) + cnt
            frame = frame.filter(pl.col("tidx").is_not_null())
            if frame.height == 0:
                continue

        lon = frame["Longitude"].to_numpy()
        lat = frame["Latitude"].to_numpy()
        tidx = frame["tidx"].to_numpy()
        yidx = frame["yidx"].to_numpy()
        fidx = frame["fidx"].to_numpy()
        x, y = transformer.transform(lon, lat)
        finite = np.isfinite(x) & np.isfinite(y)
        total_dropped += int((~finite).sum())
        if not finite.any():
            continue
        x, y, tidx, yidx, fidx = (
            x[finite],
            y[finite],
            tidx[finite],
            yidx[finite],
            fidx[finite],
        )
        total_points += x.size

        points = shapely.points(x, y)
        point_index, postcode_index = tree.query(points, predicate="intersects")
        if point_index.size:
            np.add.at(
                counts,
                (postcode_index, tidx[point_index], yidx[point_index]),
                1,
            )
            matched_fidx = fidx[point_index]
            known_force = matched_fidx >= 0
            if known_force.any():
                np.add.at(
                    force_votes,
                    (postcode_index[known_force], matched_fidx[known_force]),
                    1,
                )
            total_matches += point_index.size
        print(
            f" files {start + len(batch):,}/{len(csvs):,}: "
            f"{total_points:,} located points, {total_matches:,} postcode matches"
        )

    if total_dropped:
        print(f"Dropped {total_dropped:,} points outside the BNG transform domain")
    if unknown_type_counts:
        total_unknown = sum(unknown_type_counts.values())
        listed = ", ".join(
            f"{name!r} ({cnt:,})"
            for name, cnt in sorted(
                unknown_type_counts.items(), key=lambda kv: kv[1], reverse=True
            )
        )
        print(
            f"WARNING: dropped {total_unknown:,} incidents with crime types not in "
            f"ALL_CRIME_TYPES (taxonomy is stale -- update SERIOUS/MINOR_CRIME_TYPES): "
            f"{listed}",
            file=sys.stderr,
        )


def _assign_home_force(
    postcodes: np.ndarray,
    force_votes: np.ndarray,
    forces: list[str],
) -> np.ndarray:
    """Elect each postcode's home (territorial) force.

    Majority vote of matched incidents per publishing force; non-territorial
    forces (BTP) are excluded from the vote because their calendar says nothing
    about local coverage. Postcodes with no votes (no incidents ever, or
    BTP-only) inherit the majority force of their outcode, then the national
    modal force, so every postcode gets a coverage calendar.
    """
    votes = force_votes.astype(np.int64, copy=True)
    for idx, force in enumerate(forces):
        if force in NON_TERRITORIAL_FORCES:
            votes[:, idx] = 0
    home = votes.argmax(axis=1).astype(np.int32)
    has_vote = votes.max(axis=1) > 0
    home[~has_vote] = -1
    if not has_vote.any():
        raise ValueError("No incidents matched any postcode; cannot assign forces")

    # Outcode-majority fallback for postcodes with no (territorial) incidents.
    outcodes = np.array([pc.split(" ")[0] for pc in postcodes], dtype=object)
    national_modal = int(
        np.bincount(home[has_vote], minlength=len(forces)).argmax()
    )
    if (~has_vote).any():
        outcode_modal: dict[str, int] = {}
        voted_outcodes = outcodes[has_vote]
        voted_home = home[has_vote]
        for oc in np.unique(voted_outcodes):
            counts = np.bincount(voted_home[voted_outcodes == oc], minlength=len(forces))
            outcode_modal[oc] = int(counts.argmax())
        fallback = np.array(
            [outcode_modal.get(oc, national_modal) for oc in outcodes[~has_vote]],
            dtype=np.int32,
        )
        home[~has_vote] = fallback
        print(
            f" {int((~has_vote).sum()):,} postcodes had no territorial incidents; "
            "home force inherited from outcode majority"
        )
    return home


def _rollup_long(
    long: pl.DataFrame, types: tuple[str, ...], rollup_name: str
) -> pl.DataFrame:
    """Sum per-year annualised counts across ``types`` into a single rollup."""
    return (
        long.filter(pl.col("Crime type").is_in(list(types)))
        .group_by("postcode", "year")
        .agg(pl.col("count").sum().round(1).alias("count"))
        .with_columns(pl.lit(rollup_name).alias("Crime type"))
        .select("postcode", "Crime type", "year", "count")
    )


def _write_avg_yr(
    postcodes: np.ndarray,
    counts: np.ndarray,
    months_in_year_force: np.ndarray,
    home_fidx: np.ndarray,
    norm: np.ndarray,
    output_path: Path,
) -> None:
    """Write ``postcode`` + ``"{type} (avg/yr)"`` density-normalised averages.

    The headline is the POOLED annualised rate over the home force's covered
    months: ``sum(counts in covered years) * 12 / covered_months``.
    Years the force published nothing contribute neither incidents nor months,
    so a coverage gap (e.g. Greater Manchester 2019-07 onwards) is excluded
    instead of read as zero crime. Pooling over the full covered window --
    rather than averaging only over years a type happened to occur -- is what
    keeps a single robbery-year from printing as a perennial robbery rate.

    Each postcode's value is then multiplied by ``norm`` (median_area /
    buffered catchment area) so the metric is a density rather than a
    footprint-inflated raw count; postcodes with unusable geometry
    (norm == 0) are null, not 0.
    """
    n_postcodes, n_types = counts.shape[0], counts.shape[1]
    avg = np.full((n_postcodes, n_types), np.nan, dtype=np.float64)
    for f in range(months_in_year_force.shape[0]):
        sel = home_fidx == f
        if not sel.any():
            continue
        cov_months = months_in_year_force[f].astype(np.float64)
        denom = cov_months.sum()
        if denom <= 0:
            continue  # force never published; stays null
        covered_years = cov_months > 0
        pooled = counts[sel][:, :, covered_years].sum(axis=2, dtype=np.float64)
        avg[sel] = pooled * 12.0 / denom
    avg *= norm[:, None]
    avg[norm <= 0] = np.nan  # unusable geometry: unknown, not zero
    avg = np.round(avg, 1).astype(np.float32)

    data: dict[str, np.ndarray] = {"postcode": postcodes}
    for type_idx, name in enumerate(ALL_CRIME_TYPES):
        data[f"{name} (avg/yr)"] = avg[:, type_idx]

    # Serious/Minor rollup headlines = the exact SUM of their component (avg/yr)
    # columns, so each rollup always equals the sum of the parts shown beside it
    # and can never fall below one of its own components. All components share
    # the postcode's pooled covered-month denominator, so the sum is itself the
    # pooled rollup rate. Null components (unusable geometry) propagate to a
    # null rollup.
    for rollup_name, rollup_types in (
        ("Serious crime", SERIOUS_CRIME_TYPES),
        ("Minor crime", MINOR_CRIME_TYPES),
    ):
        rollup_idx = [ALL_CRIME_TYPES.index(name) for name in rollup_types]
        data[f"{rollup_name} (avg/yr)"] = np.round(
            avg[:, rollup_idx].sum(axis=1), 1
        ).astype(np.float32)

    frame = pl.DataFrame(data)
    value_cols = [c for c in frame.columns if c != "postcode"]
    frame = frame.with_columns(pl.col(c).fill_nan(None) for c in value_cols)

    output_path.parent.mkdir(parents=True, exist_ok=True)
    frame.write_parquet(output_path, compression="zstd")
    print(f"Wrote postcode crime averages: {output_path}")


def _write_by_year(
    postcodes: np.ndarray,
    counts: np.ndarray,
    years: list[int],
    months_in_year_force: np.ndarray,
    home_fidx: np.ndarray,
    norm: np.ndarray,
    min_bar_months: int,
    output_path: Path,
) -> None:
    """Write nested ``"{type} (by year)"`` series plus rollups and coverage.

    A bar is only emitted for (postcode, year) pairs where the postcode's home
    force published at least ``min_bar_months`` months -- annualising a thinner
    year (x12 from a single month at the extreme) charts noise, and a force-gap
    year must chart as *no data*, not zero. Bars are scaled by the force's
    covered months in that year and area-normalised by the same ``norm`` factor
    as the headline so chart and headline stay mutually consistent.

    Every postcode gets a row (the output is dense) carrying ``covered_years``
    -- the list of {year, months} entries for the years in which the home force
    published at least ``min_bar_months`` months -- so consumers can distinguish
    covered-but-crime-free years (year listed, no bar => genuine zero) from
    coverage gaps (year absent => unknown). Postcodes with unusable geometry get
    an empty coverage list: their crime picture is unknown.
    """
    # (n_postcodes, n_years): covered months of each postcode's home force.
    cov_pc_year = months_in_year_force[home_fidx, :]
    usable = norm > 0
    annual = np.round(
        counts.astype(np.float64)
        * 12.0
        / np.maximum(cov_pc_year[:, None, :], 1)
        * norm[:, None, None],
        1,
    )
    bar_ok = (
        (counts > 0)
        & (cov_pc_year[:, None, :] >= min_bar_months)
        & usable[:, None, None]
    )
    pc_i, ty_i, yr_i = np.nonzero(bar_ok)

    type_names = np.array(ALL_CRIME_TYPES, dtype=object)
    year_values = np.array(years, dtype=np.int32)
    # Explicit schema: with full masking (e.g. every year below min_bar_months)
    # the fancy-indexed numpy object arrays are empty and polars would infer
    # Object columns, which breaks the rollup `is_in` below.
    long = pl.DataFrame(
        {
            "postcode": postcodes[pc_i].astype(str),
            "Crime type": type_names[ty_i].astype(str),
            "year": year_values[yr_i],
            "count": annual[pc_i, ty_i, yr_i].astype(np.float32),
        },
        schema_overrides={"postcode": pl.String, "Crime type": pl.String},
    )

    serious = _rollup_long(long, SERIOUS_CRIME_TYPES, "Serious crime")
    minor = _rollup_long(long, MINOR_CRIME_TYPES, "Minor crime")
    combined = pl.concat([long, serious, minor])

    by_type = (
        combined.sort("year")
        .group_by("postcode", "Crime type")
        .agg(pl.struct("year", "count").alias("series"))
    )
    wide = by_type.pivot(on="Crime type", index="postcode", values="series")
    type_cols = [c for c in wide.columns if c != "postcode"]
    wide = wide.rename({col: f"{col} (by year)" for col in type_cols})

    # Dense base: every postcode, with its home force's coverage calendar.
    # Built per force (there are ~45) and joined on the force index.
    coverage_per_force: list[list[dict[str, int]]] = []
    for f in range(months_in_year_force.shape[0]):
        coverage_per_force.append(
            [
                {"year": int(years[y]), "months": int(m)}
                for y, m in enumerate(months_in_year_force[f])
                if m >= min_bar_months
            ]
        )
    coverage_frame = pl.DataFrame(
        {
            "_fidx": pl.Series(range(len(coverage_per_force)), dtype=pl.Int32),
            COVERAGE_COLUMN: pl.Series(
                coverage_per_force,
                dtype=pl.List(pl.Struct({"year": pl.Int32, "months": pl.Int32})),
            ),
        }
    )
    base = pl.DataFrame(
        {
            "postcode": postcodes,
            "_fidx": pl.Series(home_fidx, dtype=pl.Int32),
            "_usable": pl.Series(usable),
        }
    )
    dense = (
        base.join(coverage_frame, on="_fidx", how="left")
        .with_columns(
            # Unusable geometry: empty coverage -- the crime picture is unknown.
            pl.when(pl.col("_usable"))
            .then(pl.col(COVERAGE_COLUMN))
            .otherwise(pl.col(COVERAGE_COLUMN).list.head(0))
            .alias(COVERAGE_COLUMN)
        )
        .drop("_fidx", "_usable")
    )
    wide = dense.join(wide, on="postcode", how="left")

    output_path.parent.mkdir(parents=True, exist_ok=True)
    wide.write_parquet(output_path, compression="zstd")
    print(f"Wrote postcode crime by-year series: {output_path} {wide.shape}")


def transform_crime_spatial(
    crime_dir: Path,
    boundaries_dir: Path,
    output_path: Path,
    by_year_output_path: Path,
    buffer_m: float = DEFAULT_BUFFER_M,
    max_postcodes: int | None = None,
    max_files: int | None = None,
    min_bar_months: int = MIN_BAR_MONTHS,
) -> None:
    csvs, ignored_csv_count = find_street_crime_csvs(crime_dir)
    if not csvs:
        raise FileNotFoundError(f"No street crime CSV files found in {crime_dir}")
    if max_files is not None:
        csvs = csvs[:max_files]
    years, forces, months_in_year_force = _force_calendar(csvs)
    print(
        f"Found {len(csvs):,} street crime CSVs across {len(forces)} forces "
        f"({years[0]}-{years[-1]})"
        + (f" (ignored {ignored_csv_count} non-street CSVs)" if ignored_csv_count else "")
    )

    postcodes, polygons = load_postcode_polygons(boundaries_dir, max_postcodes)
    print(f"Buffering {len(postcodes):,} postcode polygons by {buffer_m:g}m...")
    buffers, tree = _build_tree(polygons, buffer_m)

    # Area-normalisation factor (median_area / catchment_area): divides out the
    # size of each postcode's catchment so the count measures crime density, not
    # how much ground the buffer sweeps. We normalise by the *buffered* area --
    # the region that actually collects points -- rather than the raw polygon, so
    # a tiny unit postcode isn't over-inflated by the fixed buffer-ring floor.
    # Buffers are in EPSG:27700, so shapely.area is in m^2.
    areas = shapely.area(buffers).astype(np.float64)
    usable_area = np.isfinite(areas) & (areas > 0)
    if not usable_area.any():
        raise ValueError("No postcode buffers have a positive area to normalise by")
    median_area = float(np.median(areas[usable_area]))
    norm = np.zeros(len(postcodes), dtype=np.float64)
    norm[usable_area] = median_area / areas[usable_area]
    print(
        f"Area-normalising to median catchment area {median_area:,.0f} m^2 "
        f"({int(usable_area.sum()):,}/{len(areas):,} postcodes have usable area)"
    )

    type_to_idx = {name: idx for idx, name in enumerate(ALL_CRIME_TYPES)}
    year_to_idx = {year: idx for idx, year in enumerate(years)}
    force_to_idx = {force: idx for idx, force in enumerate(forces)}
    counts = np.zeros((len(postcodes), len(ALL_CRIME_TYPES), len(years)), dtype=np.int32)
    force_votes = np.zeros((len(postcodes), len(forces)), dtype=np.int32)
    transformer = Transformer.from_crs("EPSG:4326", "EPSG:27700", always_xy=True)

    _accumulate_counts(
        csvs, tree, type_to_idx, year_to_idx, force_to_idx, transformer, counts, force_votes
    )
    home_fidx = _assign_home_force(np.asarray(postcodes), force_votes, forces)
    _write_avg_yr(
        postcodes, counts, months_in_year_force, home_fidx, norm, output_path
    )
    _write_by_year(
        postcodes,
        counts,
        years,
        months_in_year_force,
        home_fidx,
        norm,
        min_bar_months,
        by_year_output_path,
    )


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Count police.uk crime points near each postcode boundary"
    )
    parser.add_argument(
        "--input",
        type=Path,
        default=Path("property-data/crime"),
        help="Directory containing police.uk street crime CSVs",
    )
    parser.add_argument(
        "--boundaries",
        type=Path,
        default=Path("property-data/postcode_boundaries/units"),
        help="Directory of per-district postcode boundary GeoJSONs",
    )
    parser.add_argument(
        "--output",
        type=Path,
        required=True,
        help="Output parquet: postcode + '{type} (avg/yr)' columns",
    )
    parser.add_argument(
        "--output-by-year",
        type=Path,
        required=True,
        help="Output parquet: postcode + nested '{type} (by year)' columns",
    )
    parser.add_argument(
        "--buffer-m",
        type=float,
        default=DEFAULT_BUFFER_M,
        help="Outward buffer (metres) added to each postcode boundary",
    )
    parser.add_argument(
        "--max-postcodes",
        type=int,
        default=None,
        help="Testing only: process the first N postcodes",
    )
    parser.add_argument(
        "--max-files",
        type=int,
        default=None,
        help="Testing only: process the first N monthly CSV files",
    )
    parser.add_argument(
        "--min-bar-months",
        type=int,
        default=MIN_BAR_MONTHS,
        help="Minimum covered months for a year to get a by-year bar",
    )
    args = parser.parse_args()
    if args.buffer_m <= 0:
        raise SystemExit("--buffer-m must be greater than zero")

    transform_crime_spatial(
        crime_dir=args.input,
        boundaries_dir=args.boundaries,
        output_path=args.output,
        by_year_output_path=args.output_by_year,
        buffer_m=args.buffer_m,
        max_postcodes=args.max_postcodes,
        max_files=args.max_files,
        min_bar_months=args.min_bar_months,
    )


if __name__ == "__main__":
    main()
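

# --- Illustrative sketch (not called by the pipeline) -------------------------
# A minimal worked example of the pooled annualised rate described in the module
# docstring: sum(counts in covered years) * 12 / covered_months. The helper name
# and the toy numbers below are hypothetical, chosen only for illustration; the
# real computation lives in _write_avg_yr, which additionally applies the
# area-normalisation factor and works on whole arrays rather than one postcode.
def _example_pooled_rate(counts_by_year: list, months_by_year: list) -> float:
    """Return the pooled annualised rate for one postcode and one crime type."""
    covered_months = sum(months_by_year)
    if covered_months <= 0:
        # The force never published: the rate is unknown, not zero.
        return float("nan")
    # Only years with at least one published month contribute incidents.
    covered_count = sum(c for c, m in zip(counts_by_year, months_by_year) if m > 0)
    return covered_count * 12.0 / covered_months


# Toy case: three calendar years, the middle one a full coverage gap.
# Pooling gives (3 + 3) * 12 / 24 = 3.0 incidents/yr, whereas naively dividing
# by three calendar years would read the gap as zero crime and report 2.0.
_EXAMPLE_POOLED = _example_pooled_rate([3, 0, 3], [12, 0, 12])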