Andras Schmelczer 2026-05-28 14:27:52 +01:00
parent d83691323f
commit f5f017b01f
14 changed files with 103 additions and 46 deletions

@@ -23,12 +23,12 @@ links:
**The short version:**
- One Alpine container, ~75 lines of Bash, that snapshots a BTRFS volume and pushes the snapshot to one or more [Borg](https://borgbackup.readthedocs.io/) repositories on a fixed interval. The snapshot is the only thing standing between "consistent backup" and "corrupt database in the archive."
- Multi-target via numeric env vars (`BORG_REPO_0`, `BORG_REPO_1`, ...). The wrapper iterates until the next index isn't set. No config format, no DSL the env file is the configuration.
- Multi-target via numeric env vars (`BORG_REPO_0`, `BORG_REPO_1`, ...). The wrapper iterates until the next index isn't set. No config format, no DSL; the env file is the configuration.
- Two years of self-hosting, multiple restored incidents, zero data loss I noticed.
## The problem the snapshot solves
I self-host several databases that are mid-write at every moment of the day. `tar | borg create` against the live volume is a race: a Postgres or SQLite file that's half-written when borg reads it goes into the archive in a state nothing on Earth can replay. The "right" answer is to coordinate a quiesce with every database a fan-out of `pg_dump`, SQLite `.backup`, Redis `BGSAVE`, and so on, all with retry, timeouts, and per-app credentials.
I self-host several databases that are mid-write at every moment of the day. `tar | borg create` against the live volume is a race: a Postgres or SQLite file that's half-written when borg reads it goes into the archive in a state nothing on Earth can replay. The "right" answer is to coordinate a quiesce with every database: a fan-out of `pg_dump`, SQLite `.backup`, Redis `BGSAVE`, and so on, all with retry, timeouts, and per-app credentials.
The cheaper answer, if you've put everything on one BTRFS volume, is `btrfs subvolume snapshot`. It returns instantly with a copy-on-write fork of the entire filesystem. Every file is now atomically consistent at exactly the same instant. Run borg against the snapshot, not against the live volume.
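The snapshot-then-backup flow can be sketched roughly as follows. This is an illustration under assumptions, not the container's actual script: the `VOLUME` and `SNAPSHOT` paths, the `run` helper, the `DRY_RUN` switch, and the `snapshot_and_backup` name are all hypothetical. Only `btrfs subvolume snapshot`, `btrfs subvolume delete`, and `borg create` are real commands.

```bash
# Assumed paths, purely illustrative.
VOLUME="${VOLUME:-/data}"
SNAPSHOT="${SNAPSHOT:-/data/.backup-snapshot}"

# Hypothetical helper: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Snapshot the volume, back up the snapshot, then remove the snapshot.
snapshot_and_backup() {
  repo=$1
  # -r makes the snapshot read-only, so nothing can change under borg.
  run btrfs subvolume snapshot -r "$VOLUME" "$SNAPSHOT" || return 1
  status=0
  # borg reads the frozen snapshot, never the live volume.
  run borg create "$repo::{now}" "$SNAPSHOT" || status=$?
  # Always drop the snapshot, even when borg failed.
  run btrfs subvolume delete "$SNAPSHOT" || status=$?
  return "$status"
}
```

Calling `DRY_RUN=1 snapshot_and_backup /local-backup` prints the three commands in order without touching the filesystem, which is a cheap way to check the sequencing before granting the container real privileges.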
@@ -59,7 +59,7 @@ BORG_REPO_1=/local-backup
There's also a no-index fallback (`BORG_REPO=...` with no number) for the single-target case. Same script, no extra config plane.
I keep coming back to this pattern for small-system orchestration. The env file *is* the data structure. There's no YAML parsing, no JSON schema, no config-validation layer between you and the variable that actually matters.
I keep coming back to this pattern for small-system orchestration. The env file _is_ the data structure. There's no YAML parsing, no JSON schema, no config-validation layer between you and the variable that actually matters.
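A minimal sketch of the numbered-variable iteration and the no-index fallback, to make the pattern concrete. It is not the wrapper's actual code: `list_borg_repos` is a hypothetical name, and the `eval`-based lookup is just one portable way to read a variable whose name is built at runtime (in Bash, `${!var}` would do the same job).

```bash
# Hypothetical helper: print every configured Borg repository, one per line.
list_borg_repos() {
  i=0
  while :; do
    # Read BORG_REPO_<i>. The name is built from a counter we control, so eval is safe here.
    eval "repo=\${BORG_REPO_${i}:-}"
    # Stop at the first index that is unset or empty.
    [ -n "$repo" ] || break
    printf '%s\n' "$repo"
    i=$((i + 1))
  done
  # No numbered targets at all: fall back to the plain BORG_REPO variable.
  if [ "$i" -eq 0 ] && [ -n "${BORG_REPO:-}" ]; then
    printf '%s\n' "$BORG_REPO"
  fi
}
```

One consequence of stopping at the first missing index: a gap in the numbering (say `BORG_REPO_0` and `BORG_REPO_2` with no `BORG_REPO_1`) silently hides every target after the gap, at least in this sketch.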
## The scheduler is a sleep, not cron
@@ -79,7 +79,7 @@ A comment in the file says it out loud: "Using a simple sleep loop to schedule b
Two subtleties worth naming:
- **First-boot grace period.** If `backup_completion_time.log` doesn't exist yet (fresh container, first backup still running), fall back to `container_start_time.log` so the container isn't reported unhealthy during the first scheduled run.
- **Partial success is not success.** In multi-target mode, the completion log is only written if *every* target succeeded. One repo failing means the healthcheck stays red even if the other two are fine. Stale-but-quiet was the failure mode I wanted to make impossible.
- **Partial success is not success.** In multi-target mode, the completion log is only written if _every_ target succeeded. One repo failing means the healthcheck stays red even if the other two are fine. Stale-but-quiet was the failure mode I wanted to make impossible.
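Both subtleties fit in a few lines of shell. The sketch below is illustrative only and rests on assumptions: the log files are taken to hold a Unix timestamp, and `STATE_DIR`, `MAX_AGE_SECONDS`, `backup_one`, `backup_all`, and `healthcheck` are hypothetical names rather than the script's own.

```bash
STATE_DIR="${STATE_DIR:-/var/run/backup}"    # hypothetical location of the two log files
MAX_AGE_SECONDS="${MAX_AGE_SECONDS:-90000}"  # hypothetical staleness threshold (25 hours)

# Placeholder for the real per-repository backup step.
backup_one() {
  borg create "$1::{now}" "${SNAPSHOT:-/data/.backup-snapshot}"
}

# Try every target, but record completion only when all of them succeeded.
backup_all() {
  failed=0
  for repo in "$@"; do
    backup_one "$repo" || failed=1
  done
  if [ "$failed" -eq 0 ]; then
    date +%s > "$STATE_DIR/backup_completion_time.log"
  fi
  return "$failed"
}

# Healthy only if the reference timestamp is recent enough.
healthcheck() {
  reference="$STATE_DIR/backup_completion_time.log"
  # First boot: no completed backup yet, so measure from container start instead.
  [ -f "$reference" ] || reference="$STATE_DIR/container_start_time.log"
  [ -f "$reference" ] || return 1
  last=$(cat "$reference")
  now=$(date +%s)
  [ $((now - last)) -le "$MAX_AGE_SECONDS" ]
}
```

Because a failed target leaves the old completion timestamp in place, the healthcheck goes red on its own once that timestamp ages past the threshold; no separate failure flag is needed.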
## Smaller calls
@@ -90,7 +90,7 @@ Two subtleties worth naming:
- **`--files-cache=ctime,size,inode`.** The default `mtime,size,inode` re-hashes files when their mtime changes; on BTRFS, ctime is the more honest signal of "this content actually changed."
- **`compression=zstd,12`.** The sweet spot for backup data on my hardware: substantially better than zlib, not so slow it dominates the run.
- **`borg compact --threshold=5 --cleanup-commits`.** Reclaims space from pruned archives whenever the segment-file fragmentation crosses 5%.
- **`IGNORE_GIT_UNTRACKED=true`.** Optional. Walks every `.git` dir under the snapshot, runs `git ls-files --others --exclude-standard`, and feeds the result into `--exclude-from`. Skips `target/`, `node_modules/`, build caches anything the repo already knows isn't worth keeping.
- **`IGNORE_GIT_UNTRACKED=true`.** Optional. Walks every `.git` dir under the snapshot, runs `git ls-files --others --exclude-standard`, and feeds the result into `--exclude-from`. Skips `target/`, `node_modules/`, build caches; anything the repo already knows isn't worth keeping.
- **`SYS_ADMIN` capability on the container.** Needed for `btrfs subvolume snapshot` and `delete` from inside the namespace. The narrower capability set didn't have a way through.
## What I'd change