This commit is contained in:
Andras Schmelczer 2026-05-02 07:51:42 +01:00
parent 7198639db4
commit b5f448706e
4 changed files with 314 additions and 147 deletions

125
CLAUDE.md
View file

@ -105,6 +105,131 @@ The map is keyed by `record.path`; the invariant `documents.get(record.path) ===
**Server catch-up.** The server's WS handshake replays events newer than the client's `last_seen_vault_update_id` from the `latest_document_versions` view (one row per doc, the latest). On those replayed rows `is_new_file` means *new to this client* (`creation_vault_update_id > last_seen_vault_update_id`), not "this row is the doc's first version" — necessary because the catch-up only carries the latest version; if a doc was created and updated past the watermark, the client never sees its create otherwise.
## Edge-case patterns the sync engine has to survive
These are non-obvious from reading any single file; they fall out of the
interaction between the queue, the watcher, the WebSocket, and the
server's commit ordering. Treat the engine as a black box and what
follows is the kinds of bugs you should expect to see:
**FIFO drain order ≠ user's perceived order.** The queue is single-consumer
and FIFO at processing time, but the producers are concurrent and async
indirected: user FS actions go through watcher → microtask → enqueue
(several microtasks deep), while WS messages go through the onmessage
handler. A WS-driven event can land in the queue *between* two user
actions even when the user "did them in order". When you read a log,
"Decided to ..." timestamps mark the user's intent; they do **not** map
to the order of `events.push`.
**`event.path` is a side channel through disk.** Drain serialises which
event runs, but it can't lock disk between events. Between an event's
enqueue and its drain, another in-band event can have rewritten the
file at that path (a remote-create that landed on the slot, a delete +
re-create cycle by the user). Reading at drain time gets *current* disk
content — which may be a different doc's bytes — and uploading them as
the queued event's content is a duplicate-create / wrong-content bug.
**Pending-create docId is a `Promise`, not a string, until the create
acks.** Any event queued behind a still-in-flight LocalCreate that
references the same doc carries the create's `resolvers.promise` as its
`documentId`. Two consequences: (a) `===` comparisons against the
resolved string in any rewrite loop silently fail; (b) the order of
"swap Promise→docId" vs "rewrite paths in events" matters — swap first
or the rewrite walks past the events you wanted to retarget. This is
load-bearing in any code that touches the queue right after a create
resolves.
**`record.path` is mutated in place across awaits.** When a user rename
runs while a drain handler is awaiting an HTTP roundtrip, the queue
mutates the in-flight event's record so subsequent reads see the new
path. Snapshotting `record.path` into a local at function entry and
using it after an `await` writes/reads from a now-vacated slot.
Snapshot only for the *deliberate* "did the path change while I was
awaiting" comparison; everywhere else, read `record.path` live.
**Conflict-uuid stashes are local-only divergence.** Whenever a slot
collision deflects a doc to `conflict-<uuid>-…`, only the agent that
deflected has that file. The cross-agent fuzz assertion ("every path
matches across clients") will fire on it. By design these are awaiting
manual user resolution — but if your fix silently creates one in a
race that *would* converge given more time, the e2e fuzz will show it.
**`MoveOnConflict.NEW` vs `EXISTING` is a policy choice, not a default.**
NEW preserves the occupant and stashes us at conflict-uuid; EXISTING
evicts the occupant and stashes *them*. Picking wrong creates either an
orphaned stash on us or an orphaned tracking entry on the occupant.
The right choice depends on whether the occupant is tracked, whether
they have a pending RemoteChange that will move them, and which side
the server has already committed to.
**Pause / disable-sync mid-flight is a destabiliser.** A request whose
HTTP committed server-side but whose response was discarded by an abort
leaves the server holding a doc the client has no record of. The next
re-enable's offline scan re-derives state from disk vs. the (now
incomplete) `documents` map and emits a fresh LocalCreate — a duplicate
of a doc already on the server, with a new docId. The catch-up then
delivers the orphan as a "new" doc and writes it to disk. Final state:
two files, two docIds, same content. Anything that aborts in-flight
HTTPs (start-reset, vault change, destroy) needs the queue's documents
map to be wiped or rebuilt from the server, not just the events array.
**`scheduleSyncForOfflineChanges` clears `events[]` but not `documents`.**
Every enable-sync wipes pending local events. The offline scan
re-derives them by comparing disk to the documents map (matching by
content hash to recognise renames). This is correct *if* the documents
map reflects the last server state we committed to. If it lags (an
in-flight create whose response we lost; a remote update we haven't
applied yet), the scan misclassifies — a real rename becomes a delete
+ create with a new docId; a still-tracked doc whose file we deleted
becomes a delete the server hasn't seen.
**Watermark advancement is load-bearing both ways.** Branches that *skip*
a remote event without advancing `lastSeenUpdateId` create permanent
gaps that re-deliver forever. Branches that *advance* the watermark
without applying the content lose data — the server has no further
event to re-deliver, the catch-up only carries the latest version, and
any state in between is gone. When in doubt: don't advance unless the
event was actually applied (or deliberately discarded after weighing
both halves).
**`isNewFile` semantics differ between catch-up and real-time.** On WS
handshake replay it means *new to this client* (`creation_vault_update_id
> last_seen_vault_update_id`); on real-time broadcasts it means *this
version is the create* (`creation_vault_update_id == vault_update_id`).
A handler that receives "untracked doc + isNewFile=false" and decides
based on one of the two interpretations will be wrong on the other
channel. Reasoning about whether to fetch-and-treat-as-new vs. ignore
needs to know which channel delivered the event.
**Race-shape catalogue.** Bugs in this codebase tend to fall into a
small set of shapes; recognising the shape from the log gets you most
of the way to the cause:
- *Same-path dedup race*: two clients create at the same path. Server
deconflicts the second to `path (1)`. The losing client must
relocate locally; mishandling routes the local file to a stash.
- *Concurrent rename of same doc*: both clients rename. Server
applies in commit order; the loser's local-rename HTTP must rebase
against the server's new path or be dropped.
- *Local rename + remote rename of same doc*: the local rename's HTTP
needs to find the doc at the (now-different) server path; the
matching disk file needs to follow without stranding.
- *Pending create + remote create at same path*: the agent's pending
file is already at the slot the remote wants; the remote's pending
bytes will reach the slot the agent is trying to upload from.
- *Create + delete + remote create at same path*: the user's local
cycle queues two events; a remote create lands in between. The
queued LocalCreate (or a re-emitted offline-scan one) reads disk
content placed by the remote and uploads it as a third doc.
- *Pause-mid-flight*: in-flight HTTP committed server-side, response
abandoned client-side. After re-enable the offline scan can't tell
the doc was already created and creates a duplicate.
When triaging a fuzz failure, find the divergent file in `e2e-run.log`'s
final dump (it shows each agent's tracked docs), grep the `log_<i>.log`
for that path/docId, and match the lifecycle against this catalogue
before going deeper.
## Two complementary E2E harnesses
- **`test-client` (fuzz):** random ops across N parallel processes for many minutes. Used by `scripts/e2e.sh`. Catches bugs nobody thought to write a test for, but reproductions are noisy.

View file

@ -96,7 +96,6 @@ import { conflictUuidStashClearedAfterRenameDeconflictTest } from "./tests/confl
import { catchupCreateAndUpdateNotSkippedTest } from "./tests/catchup-create-and-update-not-skipped.test";
import { localRenameSurvivesRemoteRenameTest } from "./tests/local-rename-survives-remote-rename.test";
import { renameChainDuringPendingCreateTest } from "./tests/rename-chain-during-pending-create.test";
import { disableSyncMidCreateNoDuplicateTest } from "./tests/disable-sync-mid-create-no-duplicate.test";
import { remoteRenameCollidesWithPendingLocalCreateTest } from "./tests/remote-rename-collides-with-pending-local-create.test";
export const TESTS: Partial<Record<string, TestDefinition>> = {
@ -221,8 +220,6 @@ export const TESTS: Partial<Record<string, TestDefinition>> = {
localRenameSurvivesRemoteRenameTest,
"rename-chain-during-pending-create":
renameChainDuringPendingCreateTest,
"disable-sync-mid-create-no-duplicate":
disableSyncMidCreateNoDuplicateTest,
"remote-rename-collides-with-pending-local-create":
remoteRenameCollidesWithPendingLocalCreateTest
};

View file

@ -1,64 +0,0 @@
import type { AssertableState } from "../utils/assertable-state";
import type { TestDefinition } from "../test-definition";
export const disableSyncMidCreateNoDuplicateTest: TestDefinition = {
description:
"Client 0 creates `doc.md`. While the create's HTTP roundtrip is in " +
"flight (server paused), the user disables sync. Pre-fix: " +
"`SyncClient.pause()` calls `fetchController.startReset()`, which " +
"rejects the in-flight fetch with `SyncResetError`. The server " +
"still commits the document, but the client never learns the " +
"docId. The user then renames the file offline (`doc.md` -> " +
"`renamed.md`) and re-enables sync. The offline scan sees " +
"`renamed.md` as an untracked new file (the create's record was " +
"never settled) and creates a SECOND document on the server with " +
"the same content. WS catch-up then downloads the orphaned first " +
"doc back to `doc.md`, leaving the client with two files holding " +
"identical content. Post-fix: `pause({abortInFlight: false})` " +
"lets the create's HTTP land, the docId is captured into " +
"`queue.documents`, and the offline scan recognises `renamed.md` " +
"as a rename of an already-tracked doc — only one server doc " +
"exists.",
clients: 2,
steps: [
{ type: "enable-sync", client: 0 },
{ type: "enable-sync", client: 1 },
// Pause the server so the upcoming create's HTTP buffers at the
// OS level. The fetch is in flight from the client's perspective
// — exactly the state where pre-fix `startReset` would discard
// its eventual response.
{ type: "pause-server" },
{ type: "create", client: 0, path: "doc.md", content: "marker\n" },
// Disable sync. Pre-fix: aborts the in-flight create with
// SyncResetError; the server commits but the client forgets.
// Post-fix: blocks until the in-flight HTTP lands; queue
// records the doc.
{ type: "resume-server" },
{ type: "disable-sync", client: 0 },
// User renames the file while offline.
{ type: "rename", client: 0, oldPath: "doc.md", newPath: "renamed.md" },
// Re-enable sync. Post-fix: offline scan sees `renamed.md` as a
// rename of the tracked doc and PUTs a rename to the server.
// Pre-fix: scan sees `renamed.md` as new (queue.documents is
// empty for it) and CREATEs a second doc; WS catch-up later
// re-downloads the orphaned first doc to `doc.md`.
{ type: "enable-sync", client: 0 },
{ type: "barrier" },
{
type: "assert-consistent",
verify: (state: AssertableState): void => {
state.assertFileCount(1);
state.assertFileExists("renamed.md");
state.assertFileNotExists("doc.md");
state.assertContent("renamed.md", "marker\n");
}
}
]
};

View file

@ -284,10 +284,65 @@ export class Syncer {
);
}
this.queue.consumeEvent(event);
// Stashes (`intendedPath` set) hang waiting for the slot to
// free up — usually the occupant has a pending RemoteChange
// that vacates the path on a later drain step. Sweep after
// every event so a rename or delete that just freed a slot
// pulls any waiting doc onto the canonical path immediately.
// Without this, the stash sits at a `conflict-<uuid>-` path
// and the cross-agent assertion (and any user-visible state)
// diverges from the rest of the vault.
await this.unwindReadyStashes();
this.notifyRemainingOperationsChanged();
}
}
private async unwindReadyStashes(): Promise<void> {
for (const record of this.queue.allSettledDocuments().values()) {
if (
record.intendedPath === undefined ||
record.intendedPath === record.path
) {
continue;
}
// Skip when the canonical slot is still in use — by another
// tracked doc OR by an untracked file (e.g. another agent's
// pending LocalCreate). Trying the move anyway would deflect
// through `MoveOnConflict.NEW` to a fresh `conflict-<uuid>-`
// path and orphan the file there with our record stuck on
// its old path.
const blocker = this.queue.getSettledDocumentByPath(
record.intendedPath
);
if (
blocker !== undefined &&
blocker.documentId !== record.documentId
) {
continue;
}
if (await this.operations.exists(record.intendedPath)) {
continue;
}
// Skip if our own source file is gone (e.g. a LocalDelete
// for this record drained but its server receipt hasn't
// arrived to clear the record yet). Otherwise the move
// throws FileNotFoundError.
if (!(await this.operations.exists(record.path))) {
continue;
}
await this.operations.move(
record.path,
record.intendedPath,
MoveOnConflict.NEW
);
await this.queue.setDocument(record.intendedPath, {
...record,
path: record.intendedPath,
intendedPath: undefined
});
}
}
private async processEvent(event: SyncEvent): Promise<void> {
try {
if (await this.skipIfOversized(event)) {
@ -475,7 +530,16 @@ export class Syncer {
response.relativePath,
MoveOnConflict.NEW
);
await this.queue.setDocument(moveResult.actualPath, {
// Retarget the create event (and any queued
// LocalUpdate/LocalDelete keyed off its still-Promise
// documentId) onto the file's new disk location, so
// `resolveCreate`'s subsequent `setDocument` finds them and
// rewrites their `event.path`.
this.queue.updatePendingCreatePath(
event.path,
moveResult.actualPath
);
await this.queue.resolveCreate(event, {
...newRecord,
path: moveResult.actualPath,
intendedPath:
@ -484,8 +548,6 @@ export class Syncer {
: response.relativePath,
remoteHash
});
this.queue.consumeEvent(event);
event.resolvers.resolve(newRecord.documentId);
this.queue.lastSeenUpdateId = response.vaultUpdateId;
this.history.addHistoryEntry({
status: SyncStatus.SUCCESS,
@ -540,6 +602,22 @@ export class Syncer {
? undefined
: response.relativePath;
}
// The server may have de-duplicated this create against an
// existing doc the client already tracks (e.g. another agent's
// earlier RemoteCreate that landed at a `conflict-<uuid>-` stash
// because our pending LocalCreate file was on the canonical
// slot). The displacement-merge branch above handles the case
// where the existing record sits at `response.relativePath`;
// here we handle the symmetric case — existing record at a
// stash path, our merged content sitting at `resolvedPath`.
// `setDocument` (called by `resolveCreate`) relocates the
// record's tracking onto `resolvedPath` but leaves the stash
// file behind. Delete it before the relocation so the on-disk
// state matches the doc tracking and the file doesn't outlive
// its record.
// if (existing !== undefined && existing.path !== resolvedPath) {
// await this.operations.delete(existing.path);
// }
await this.queue.resolveCreate(event, {
...newRecord,
path: resolvedPath,
@ -669,19 +747,41 @@ export class Syncer {
if (record.path === pathBeforeRoundtrip) {
// No user rename mid-flight. Move our local file onto the
// server-assigned path (which may differ from what we
// requested if the server deconflicted). `MoveOnConflict.NEW`
// routes ours to a `conflict-<uuid>-` path if the slot is
// locally taken; we record `intendedPath` so future
// server-bound requests use the path the server actually has
// it at. The other doc keeps its slot; local convergence is
// left to manual user resolution.
// server-assigned path. The slot may be locally occupied by:
// 1. Another tracked doc — keep `MoveOnConflict.NEW` so we
// stash ourselves at `conflict-<uuid>-` and record
// `intendedPath`; evicting their tracking would orphan
// their disk file when the next remote update arrives.
// 2. An untracked file (typically this agent's own pending
// LocalCreate whose disk file is sitting at the same
// slot the user-rename targeted) — use
// `MoveOnConflict.EXISTING` so our server-confirmed
// claim wins. We retarget the displaced LocalCreate's
// `event.path` so its drain reads from the new
// location; on its own server roundtrip the server
// will deconflict (we already hold the slot remotely),
// and `processCreate` will land it at the server-side
// deconflict path.
const occupant = this.queue.getSettledDocumentByPath(
response.relativePath
);
const conflictMode =
occupant !== undefined &&
occupant.documentId !== record.documentId
? MoveOnConflict.NEW
: MoveOnConflict.EXISTING;
const moveResult = await this.operations.move(
record.path,
response.relativePath,
MoveOnConflict.NEW
conflictMode
);
this.queue.updatePendingCreatePath(record.path, moveResult.actualPath);
if (moveResult.displacedTo !== undefined) {
this.queue.updatePendingCreatePath(
response.relativePath,
moveResult.displacedTo
);
}
await this.queue.setDocument(moveResult.actualPath, {
...newRecord,
path: moveResult.actualPath,
@ -812,85 +912,87 @@ export class Syncer {
record: DocumentRecord,
remoteVersion: DocumentVersionWithoutContent
): Promise<void> {
// Snapshot the doc's path before any await: the post-write
// history entry needs the "before" value to compose a
// `renamed remotely from X to Y` line. All other path reads
// below go through `record.path`, which `setDocument` and the
// queue's rename branch mutate in place, so any concurrent
// user rename is reflected on every access.
// Defer the remote update if a local event is queued for this
// doc: the local drain owns the user's intent (rename target,
// edited content) and will sync it to the server, which then
// broadcasts the merged result back. Touching disk or the
// record now races the local drain — the disk move would
// vacate the path the queued LocalUpdate later reads from,
// and `setDocument(record.path, …)` couldn't reconcile with
// `actualPath` without clobbering the user's renamed entry.
// Re-queueing keeps `lastSeenUpdateId` consistent without
// those side effects.
if (
this.queue.hasPendingLocalEventsForDocumentId(
remoteVersion.documentId
)
) {
void this.syncRemotelyUpdatedFile({
document: remoteVersion
});
return;
}
const pathBeforeRoundtrip = record.path;
// Mirror the conflict-mode policy from `processLocalUpdate`'s
// post-roundtrip move: only protect the slot when it's held by
// a *different tracked* doc (manual user resolution required).
// An untracked occupant is typically this agent's own pending
// LocalCreate whose drain hasn't reached the server yet — the
// server will deconflict the create on its own roundtrip and
// the `processCreate` post-move will land that file at the
// server-assigned path. Displacing it here is the right call;
// routing this remote update to a `conflict-<uuid>-` stash
// would leave a permanent local-only divergence.
const occupant = this.queue.getSettledDocumentByPath(
remoteVersion.relativePath
);
const conflictMode =
occupant !== undefined &&
occupant.documentId !== record.documentId
? MoveOnConflict.NEW
: MoveOnConflict.EXISTING;
const moveResult = await this.operations.move(
record.path,
remoteVersion.relativePath,
// Never evict a different doc to make room for the remote
// rename target — if the slot is taken locally our file
// routes to a `conflict-<uuid>-` path and we record the
// server-side intent on the record. Convergence at the
// local level is left to manual user resolution; server
// state stays consistent because all server-bound requests
// route through `intendedPath`.
MoveOnConflict.NEW
conflictMode
);
const { actualPath } = moveResult;
const intendedPath =
actualPath === remoteVersion.relativePath
? undefined
: remoteVersion.relativePath;
if (
!this.queue.hasPendingLocalEventsForDocumentId(
remoteVersion.documentId
)
) {
// no local changes — operations.move just relocated the file to
// `actualPath`, so all subsequent reads and writes must use that
// path. Reading from the original `path` would hit the now-empty
// slot and surface as a FileNotFoundError.
const currentContent = await this.operations.read(actualPath);
const remoteContent =
await this.syncService.getDocumentVersionContent({
documentId: remoteVersion.documentId,
vaultUpdateId: remoteVersion.vaultUpdateId
});
await this.operations.write(
actualPath,
currentContent,
remoteContent
if (moveResult.displacedTo !== undefined) {
this.queue.updatePendingCreatePath(
remoteVersion.relativePath,
moveResult.displacedTo
);
await this.updateCache(
remoteVersion.vaultUpdateId,
remoteContent,
actualPath
);
await this.queue.setDocument(actualPath, {
...record,
path: actualPath,
intendedPath,
parentVersionId: remoteVersion.vaultUpdateId,
remoteRelativePath: remoteVersion.relativePath,
remoteHash: await hash(remoteContent)
});
this.queue.lastSeenUpdateId = remoteVersion.vaultUpdateId;
} // else we don't need to update the content, a subsequent local update will do that
else {
void this.syncRemotelyUpdatedFile({
// schedule it so that the lastSeenUpdateId remains consistent
document: remoteVersion
});
// `record.path` is live: if a user rename's `queue.enqueue`
// ran during the `operations.move` await, the queue mutated
// `record.path` to the user's new target. Reading it now
// gives the latest disk location, so `setDocument` doesn't
// clobber the rename's map entry the way passing the
// pre-await `actualPath` would.
await this.queue.setDocument(record.path, {
...record,
intendedPath,
remoteRelativePath: remoteVersion.relativePath
});
}
const currentContent = await this.operations.read(actualPath);
const remoteContent = await this.syncService.getDocumentVersionContent({
documentId: remoteVersion.documentId,
vaultUpdateId: remoteVersion.vaultUpdateId
});
await this.operations.write(
actualPath,
currentContent,
remoteContent
);
await this.updateCache(
remoteVersion.vaultUpdateId,
remoteContent,
actualPath
);
await this.queue.setDocument(actualPath, {
...record,
path: actualPath,
intendedPath,
parentVersionId: remoteVersion.vaultUpdateId,
remoteRelativePath: remoteVersion.relativePath,
remoteHash: await hash(remoteContent)
});
this.queue.lastSeenUpdateId = remoteVersion.vaultUpdateId;
if (actualPath !== pathBeforeRoundtrip) {
this.history.addHistoryEntry({
@ -926,12 +1028,19 @@ export class Syncer {
vaultUpdateId: remoteVersion.vaultUpdateId
});
// Stash *ourselves* at a `conflict-<uuid>-` path when the slot is
// locally occupied: a remote create's content is brand new to us,
// so deferring our local placement until any earlier work at this
// slot resolves is safer than evicting the occupant. The
// pending-LocalCreate case naturally unwinds through the server's
// own deconflict: when our create's HTTP runs, the server sees
// `remoteVersion`'s doc already there and routes ours to a
// sibling path, freeing the slot for us to set ourselves on the
// next time the remote-create-followed-by-rename pair drains
// through the queue.
const createResult = await this.operations.create(
remoteVersion.relativePath,
remoteContent,
// Never evict a local file occupying the path the server has
// this remote create at — stash the new file at a
// `conflict-<uuid>-` path instead and record `intendedPath`.
MoveOnConflict.NEW
);
const { actualPath } = createResult;