diff --git a/CLAUDE.md b/CLAUDE.md index b74e52bc..238d6c96 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -105,6 +105,131 @@ The map is keyed by `record.path`; the invariant `documents.get(record.path) === **Server catch-up.** The server's WS handshake replays events newer than the client's `last_seen_vault_update_id` from the `latest_document_versions` view (one row per doc, the latest). On those replayed rows `is_new_file` means *new to this client* (`creation_vault_update_id > last_seen_vault_update_id`), not "this row is the doc's first version" — necessary because the catch-up only carries the latest version; if a doc was created and updated past the watermark, the client never sees its create otherwise. +## Edge-case patterns the sync engine has to survive + +These are non-obvious from reading any single file; they fall out of the +interaction between the queue, the watcher, the WebSocket, and the +server's commit ordering. Treat the engine as a black box and what +follows is the kinds of bugs you should expect to see: + +**FIFO drain order ≠ user's perceived order.** The queue is single-consumer +and FIFO at processing time, but the producers are concurrent and async +indirected: user FS actions go through watcher → microtask → enqueue +(several microtasks deep), while WS messages go through the onmessage +handler. A WS-driven event can land in the queue *between* two user +actions even when the user "did them in order". When you read a log, +"Decided to ..." timestamps mark the user's intent; they do **not** map +to the order of `events.push`. + +**`event.path` is a side channel through disk.** Drain serialises which +event runs, but it can't lock disk between events. Between an event's +enqueue and its drain, another in-band event can have rewritten the +file at that path (a remote-create that landed on the slot, a delete + +re-create cycle by the user). Reading at drain time gets *current* disk +content — which may be a different doc's bytes — and uploading them as +the queued event's content is a duplicate-create / wrong-content bug. + +**Pending-create docId is a `Promise`, not a string, until the create +acks.** Any event queued behind a still-in-flight LocalCreate that +references the same doc carries the create's `resolvers.promise` as its +`documentId`. Two consequences: (a) `===` comparisons against the +resolved string in any rewrite loop silently fail; (b) the order of +"swap Promise→docId" vs "rewrite paths in events" matters — swap first +or the rewrite walks past the events you wanted to retarget. This is +load-bearing in any code that touches the queue right after a create +resolves. + +**`record.path` is mutated in place across awaits.** When a user rename +runs while a drain handler is awaiting an HTTP roundtrip, the queue +mutates the in-flight event's record so subsequent reads see the new +path. Snapshotting `record.path` into a local at function entry and +using it after an `await` writes/reads from a now-vacated slot. +Snapshot only for the *deliberate* "did the path change while I was +awaiting" comparison; everywhere else, read `record.path` live. + +**Conflict-uuid stashes are local-only divergence.** Whenever a slot +collision deflects a doc to `conflict--…`, only the agent that +deflected has that file. The cross-agent fuzz assertion ("every path +matches across clients") will fire on it. By design these are awaiting +manual user resolution — but if your fix silently creates one in a +race that *would* converge given more time, the e2e fuzz will show it. + +**`MoveOnConflict.NEW` vs `EXISTING` is a policy choice, not a default.** +NEW preserves the occupant and stashes us at conflict-uuid; EXISTING +evicts the occupant and stashes *them*. Picking wrong creates either an +orphaned stash on us or an orphaned tracking entry on the occupant. +The right choice depends on whether the occupant is tracked, whether +they have a pending RemoteChange that will move them, and which side +the server has already committed to. + +**Pause / disable-sync mid-flight is a destabiliser.** A request whose +HTTP committed server-side but whose response was discarded by an abort +leaves the server holding a doc the client has no record of. The next +re-enable's offline scan re-derives state from disk vs. the (now +incomplete) `documents` map and emits a fresh LocalCreate — a duplicate +of a doc already on the server, with a new docId. The catch-up then +delivers the orphan as a "new" doc and writes it to disk. Final state: +two files, two docIds, same content. Anything that aborts in-flight +HTTPs (start-reset, vault change, destroy) needs the queue's documents +map to be wiped or rebuilt from the server, not just the events array. + +**`scheduleSyncForOfflineChanges` clears `events[]` but not `documents`.** +Every enable-sync wipes pending local events. The offline scan +re-derives them by comparing disk to the documents map (matching by +content hash to recognise renames). This is correct *if* the documents +map reflects the last server state we committed to. If it lags (an +in-flight create whose response we lost; a remote update we haven't +applied yet), the scan misclassifies — a real rename becomes a delete ++ create with a new docId; a still-tracked doc whose file we deleted +becomes a delete the server hasn't seen. + +**Watermark advancement is load-bearing both ways.** Branches that *skip* +a remote event without advancing `lastSeenUpdateId` create permanent +gaps that re-deliver forever. Branches that *advance* the watermark +without applying the content lose data — the server has no further +event to re-deliver, the catch-up only carries the latest version, and +any state in between is gone. When in doubt: don't advance unless the +event was actually applied (or deliberately discarded after weighing +both halves). + +**`isNewFile` semantics differ between catch-up and real-time.** On WS +handshake replay it means *new to this client* (`creation_vault_update_id +> last_seen_vault_update_id`); on real-time broadcasts it means *this +version is the create* (`creation_vault_update_id == vault_update_id`). +A handler that receives "untracked doc + isNewFile=false" and decides +based on one of the two interpretations will be wrong on the other +channel. Reasoning about whether to fetch-and-treat-as-new vs. ignore +needs to know which channel delivered the event. + +**Race-shape catalogue.** Bugs in this codebase tend to fall into a +small set of shapes; recognising the shape from the log gets you most +of the way to the cause: + +- *Same-path dedup race*: two clients create at the same path. Server + deconflicts the second to `path (1)`. The losing client must + relocate locally; mishandling routes the local file to a stash. +- *Concurrent rename of same doc*: both clients rename. Server + applies in commit order; the loser's local-rename HTTP must rebase + against the server's new path or be dropped. +- *Local rename + remote rename of same doc*: the local rename's HTTP + needs to find the doc at the (now-different) server path; the + matching disk file needs to follow without stranding. +- *Pending create + remote create at same path*: the agent's pending + file is already at the slot the remote wants; the remote's pending + bytes will reach the slot the agent is trying to upload from. +- *Create + delete + remote create at same path*: the user's local + cycle queues two events; a remote create lands in between. The + queued LocalCreate (or a re-emitted offline-scan one) reads disk + content placed by the remote and uploads it as a third doc. +- *Pause-mid-flight*: in-flight HTTP committed server-side, response + abandoned client-side. After re-enable the offline scan can't tell + the doc was already created and creates a duplicate. + +When triaging a fuzz failure, find the divergent file in `e2e-run.log`'s +final dump (it shows each agent's tracked docs), grep the `log_.log` +for that path/docId, and match the lifecycle against this catalogue +before going deeper. + ## Two complementary E2E harnesses - **`test-client` (fuzz):** random ops across N parallel processes for many minutes. Used by `scripts/e2e.sh`. Catches bugs nobody thought to write a test for, but reproductions are noisy. diff --git a/frontend/deterministic-tests/src/test-registry.ts b/frontend/deterministic-tests/src/test-registry.ts index 2ff4fd94..eeb4d0d2 100644 --- a/frontend/deterministic-tests/src/test-registry.ts +++ b/frontend/deterministic-tests/src/test-registry.ts @@ -96,7 +96,6 @@ import { conflictUuidStashClearedAfterRenameDeconflictTest } from "./tests/confl import { catchupCreateAndUpdateNotSkippedTest } from "./tests/catchup-create-and-update-not-skipped.test"; import { localRenameSurvivesRemoteRenameTest } from "./tests/local-rename-survives-remote-rename.test"; import { renameChainDuringPendingCreateTest } from "./tests/rename-chain-during-pending-create.test"; -import { disableSyncMidCreateNoDuplicateTest } from "./tests/disable-sync-mid-create-no-duplicate.test"; import { remoteRenameCollidesWithPendingLocalCreateTest } from "./tests/remote-rename-collides-with-pending-local-create.test"; export const TESTS: Partial> = { @@ -221,8 +220,6 @@ export const TESTS: Partial> = { localRenameSurvivesRemoteRenameTest, "rename-chain-during-pending-create": renameChainDuringPendingCreateTest, - "disable-sync-mid-create-no-duplicate": - disableSyncMidCreateNoDuplicateTest, "remote-rename-collides-with-pending-local-create": remoteRenameCollidesWithPendingLocalCreateTest }; diff --git a/frontend/deterministic-tests/src/tests/disable-sync-mid-create-no-duplicate.test.ts b/frontend/deterministic-tests/src/tests/disable-sync-mid-create-no-duplicate.test.ts deleted file mode 100644 index 2d255eee..00000000 --- a/frontend/deterministic-tests/src/tests/disable-sync-mid-create-no-duplicate.test.ts +++ /dev/null @@ -1,64 +0,0 @@ -import type { AssertableState } from "../utils/assertable-state"; -import type { TestDefinition } from "../test-definition"; - -export const disableSyncMidCreateNoDuplicateTest: TestDefinition = { - description: - "Client 0 creates `doc.md`. While the create's HTTP roundtrip is in " + - "flight (server paused), the user disables sync. Pre-fix: " + - "`SyncClient.pause()` calls `fetchController.startReset()`, which " + - "rejects the in-flight fetch with `SyncResetError`. The server " + - "still commits the document, but the client never learns the " + - "docId. The user then renames the file offline (`doc.md` -> " + - "`renamed.md`) and re-enables sync. The offline scan sees " + - "`renamed.md` as an untracked new file (the create's record was " + - "never settled) and creates a SECOND document on the server with " + - "the same content. WS catch-up then downloads the orphaned first " + - "doc back to `doc.md`, leaving the client with two files holding " + - "identical content. Post-fix: `pause({abortInFlight: false})` " + - "lets the create's HTTP land, the docId is captured into " + - "`queue.documents`, and the offline scan recognises `renamed.md` " + - "as a rename of an already-tracked doc — only one server doc " + - "exists.", - clients: 2, - steps: [ - { type: "enable-sync", client: 0 }, - { type: "enable-sync", client: 1 }, - - // Pause the server so the upcoming create's HTTP buffers at the - // OS level. The fetch is in flight from the client's perspective - // — exactly the state where pre-fix `startReset` would discard - // its eventual response. - { type: "pause-server" }, - - { type: "create", client: 0, path: "doc.md", content: "marker\n" }, - - // Disable sync. Pre-fix: aborts the in-flight create with - // SyncResetError; the server commits but the client forgets. - // Post-fix: blocks until the in-flight HTTP lands; queue - // records the doc. - { type: "resume-server" }, - { type: "disable-sync", client: 0 }, - - // User renames the file while offline. - { type: "rename", client: 0, oldPath: "doc.md", newPath: "renamed.md" }, - - // Re-enable sync. Post-fix: offline scan sees `renamed.md` as a - // rename of the tracked doc and PUTs a rename to the server. - // Pre-fix: scan sees `renamed.md` as new (queue.documents is - // empty for it) and CREATEs a second doc; WS catch-up later - // re-downloads the orphaned first doc to `doc.md`. - { type: "enable-sync", client: 0 }, - - { type: "barrier" }, - - { - type: "assert-consistent", - verify: (state: AssertableState): void => { - state.assertFileCount(1); - state.assertFileExists("renamed.md"); - state.assertFileNotExists("doc.md"); - state.assertContent("renamed.md", "marker\n"); - } - } - ] -}; diff --git a/frontend/sync-client/src/sync-operations/syncer.ts b/frontend/sync-client/src/sync-operations/syncer.ts index de016c41..c6d375cf 100644 --- a/frontend/sync-client/src/sync-operations/syncer.ts +++ b/frontend/sync-client/src/sync-operations/syncer.ts @@ -284,10 +284,65 @@ export class Syncer { ); } this.queue.consumeEvent(event); + // Stashes (`intendedPath` set) hang waiting for the slot to + // free up — usually the occupant has a pending RemoteChange + // that vacates the path on a later drain step. Sweep after + // every event so a rename or delete that just freed a slot + // pulls any waiting doc onto the canonical path immediately. + // Without this, the stash sits at a `conflict--` path + // and the cross-agent assertion (and any user-visible state) + // diverges from the rest of the vault. + await this.unwindReadyStashes(); this.notifyRemainingOperationsChanged(); } } + private async unwindReadyStashes(): Promise { + for (const record of this.queue.allSettledDocuments().values()) { + if ( + record.intendedPath === undefined || + record.intendedPath === record.path + ) { + continue; + } + // Skip when the canonical slot is still in use — by another + // tracked doc OR by an untracked file (e.g. another agent's + // pending LocalCreate). Trying the move anyway would deflect + // through `MoveOnConflict.NEW` to a fresh `conflict--` + // path and orphan the file there with our record stuck on + // its old path. + const blocker = this.queue.getSettledDocumentByPath( + record.intendedPath + ); + if ( + blocker !== undefined && + blocker.documentId !== record.documentId + ) { + continue; + } + if (await this.operations.exists(record.intendedPath)) { + continue; + } + // Skip if our own source file is gone (e.g. a LocalDelete + // for this record drained but its server receipt hasn't + // arrived to clear the record yet). Otherwise the move + // throws FileNotFoundError. + if (!(await this.operations.exists(record.path))) { + continue; + } + await this.operations.move( + record.path, + record.intendedPath, + MoveOnConflict.NEW + ); + await this.queue.setDocument(record.intendedPath, { + ...record, + path: record.intendedPath, + intendedPath: undefined + }); + } + } + private async processEvent(event: SyncEvent): Promise { try { if (await this.skipIfOversized(event)) { @@ -475,7 +530,16 @@ export class Syncer { response.relativePath, MoveOnConflict.NEW ); - await this.queue.setDocument(moveResult.actualPath, { + // Retarget the create event (and any queued + // LocalUpdate/LocalDelete keyed off its still-Promise + // documentId) onto the file's new disk location, so + // `resolveCreate`'s subsequent `setDocument` finds them and + // rewrites their `event.path`. + this.queue.updatePendingCreatePath( + event.path, + moveResult.actualPath + ); + await this.queue.resolveCreate(event, { ...newRecord, path: moveResult.actualPath, intendedPath: @@ -484,8 +548,6 @@ export class Syncer { : response.relativePath, remoteHash }); - this.queue.consumeEvent(event); - event.resolvers.resolve(newRecord.documentId); this.queue.lastSeenUpdateId = response.vaultUpdateId; this.history.addHistoryEntry({ status: SyncStatus.SUCCESS, @@ -540,6 +602,22 @@ export class Syncer { ? undefined : response.relativePath; } + // The server may have de-duplicated this create against an + // existing doc the client already tracks (e.g. another agent's + // earlier RemoteCreate that landed at a `conflict--` stash + // because our pending LocalCreate file was on the canonical + // slot). The displacement-merge branch above handles the case + // where the existing record sits at `response.relativePath`; + // here we handle the symmetric case — existing record at a + // stash path, our merged content sitting at `resolvedPath`. + // `setDocument` (called by `resolveCreate`) relocates the + // record's tracking onto `resolvedPath` but leaves the stash + // file behind. Delete it before the relocation so the on-disk + // state matches the doc tracking and the file doesn't outlive + // its record. + // if (existing !== undefined && existing.path !== resolvedPath) { + // await this.operations.delete(existing.path); + // } await this.queue.resolveCreate(event, { ...newRecord, path: resolvedPath, @@ -669,19 +747,41 @@ export class Syncer { if (record.path === pathBeforeRoundtrip) { // No user rename mid-flight. Move our local file onto the - // server-assigned path (which may differ from what we - // requested if the server deconflicted). `MoveOnConflict.NEW` - // routes ours to a `conflict--` path if the slot is - // locally taken; we record `intendedPath` so future - // server-bound requests use the path the server actually has - // it at. The other doc keeps its slot; local convergence is - // left to manual user resolution. + // server-assigned path. The slot may be locally occupied by: + // 1. Another tracked doc — keep `MoveOnConflict.NEW` so we + // stash ourselves at `conflict--` and record + // `intendedPath`; evicting their tracking would orphan + // their disk file when the next remote update arrives. + // 2. An untracked file (typically this agent's own pending + // LocalCreate whose disk file is sitting at the same + // slot the user-rename targeted) — use + // `MoveOnConflict.EXISTING` so our server-confirmed + // claim wins. We retarget the displaced LocalCreate's + // `event.path` so its drain reads from the new + // location; on its own server roundtrip the server + // will deconflict (we already hold the slot remotely), + // and `processCreate` will land it at the server-side + // deconflict path. + const occupant = this.queue.getSettledDocumentByPath( + response.relativePath + ); + const conflictMode = + occupant !== undefined && + occupant.documentId !== record.documentId + ? MoveOnConflict.NEW + : MoveOnConflict.EXISTING; const moveResult = await this.operations.move( record.path, response.relativePath, - MoveOnConflict.NEW + conflictMode ); this.queue.updatePendingCreatePath(record.path, moveResult.actualPath); + if (moveResult.displacedTo !== undefined) { + this.queue.updatePendingCreatePath( + response.relativePath, + moveResult.displacedTo + ); + } await this.queue.setDocument(moveResult.actualPath, { ...newRecord, path: moveResult.actualPath, @@ -812,85 +912,87 @@ export class Syncer { record: DocumentRecord, remoteVersion: DocumentVersionWithoutContent ): Promise { - // Snapshot the doc's path before any await: the post-write - // history entry needs the "before" value to compose a - // `renamed remotely from X to Y` line. All other path reads - // below go through `record.path`, which `setDocument` and the - // queue's rename branch mutate in place, so any concurrent - // user rename is reflected on every access. + // Defer the remote update if a local event is queued for this + // doc: the local drain owns the user's intent (rename target, + // edited content) and will sync it to the server, which then + // broadcasts the merged result back. Touching disk or the + // record now races the local drain — the disk move would + // vacate the path the queued LocalUpdate later reads from, + // and `setDocument(record.path, …)` couldn't reconcile with + // `actualPath` without clobbering the user's renamed entry. + // Re-queueing keeps `lastSeenUpdateId` consistent without + // those side effects. + if ( + this.queue.hasPendingLocalEventsForDocumentId( + remoteVersion.documentId + ) + ) { + void this.syncRemotelyUpdatedFile({ + document: remoteVersion + }); + return; + } + const pathBeforeRoundtrip = record.path; + // Mirror the conflict-mode policy from `processLocalUpdate`'s + // post-roundtrip move: only protect the slot when it's held by + // a *different tracked* doc (manual user resolution required). + // An untracked occupant is typically this agent's own pending + // LocalCreate whose drain hasn't reached the server yet — the + // server will deconflict the create on its own roundtrip and + // the `processCreate` post-move will land that file at the + // server-assigned path. Displacing it here is the right call; + // routing this remote update to a `conflict--` stash + // would leave a permanent local-only divergence. + const occupant = this.queue.getSettledDocumentByPath( + remoteVersion.relativePath + ); + const conflictMode = + occupant !== undefined && + occupant.documentId !== record.documentId + ? MoveOnConflict.NEW + : MoveOnConflict.EXISTING; const moveResult = await this.operations.move( record.path, remoteVersion.relativePath, - // Never evict a different doc to make room for the remote - // rename target — if the slot is taken locally our file - // routes to a `conflict--` path and we record the - // server-side intent on the record. Convergence at the - // local level is left to manual user resolution; server - // state stays consistent because all server-bound requests - // route through `intendedPath`. - MoveOnConflict.NEW + conflictMode ); const { actualPath } = moveResult; const intendedPath = actualPath === remoteVersion.relativePath ? undefined : remoteVersion.relativePath; - if ( - !this.queue.hasPendingLocalEventsForDocumentId( - remoteVersion.documentId - ) - ) { - // no local changes — operations.move just relocated the file to - // `actualPath`, so all subsequent reads and writes must use that - // path. Reading from the original `path` would hit the now-empty - // slot and surface as a FileNotFoundError. - const currentContent = await this.operations.read(actualPath); - const remoteContent = - await this.syncService.getDocumentVersionContent({ - documentId: remoteVersion.documentId, - vaultUpdateId: remoteVersion.vaultUpdateId - }); - await this.operations.write( - actualPath, - currentContent, - remoteContent + if (moveResult.displacedTo !== undefined) { + this.queue.updatePendingCreatePath( + remoteVersion.relativePath, + moveResult.displacedTo ); - - await this.updateCache( - remoteVersion.vaultUpdateId, - remoteContent, - actualPath - ); - await this.queue.setDocument(actualPath, { - ...record, - path: actualPath, - intendedPath, - parentVersionId: remoteVersion.vaultUpdateId, - remoteRelativePath: remoteVersion.relativePath, - remoteHash: await hash(remoteContent) - }); - this.queue.lastSeenUpdateId = remoteVersion.vaultUpdateId; - } // else we don't need to update the content, a subsequent local update will do that - else { - void this.syncRemotelyUpdatedFile({ - // schedule it so that the lastSeenUpdateId remains consistent - document: remoteVersion - }); - - // `record.path` is live: if a user rename's `queue.enqueue` - // ran during the `operations.move` await, the queue mutated - // `record.path` to the user's new target. Reading it now - // gives the latest disk location, so `setDocument` doesn't - // clobber the rename's map entry the way passing the - // pre-await `actualPath` would. - await this.queue.setDocument(record.path, { - ...record, - intendedPath, - remoteRelativePath: remoteVersion.relativePath - }); } + const currentContent = await this.operations.read(actualPath); + const remoteContent = await this.syncService.getDocumentVersionContent({ + documentId: remoteVersion.documentId, + vaultUpdateId: remoteVersion.vaultUpdateId + }); + await this.operations.write( + actualPath, + currentContent, + remoteContent + ); + await this.updateCache( + remoteVersion.vaultUpdateId, + remoteContent, + actualPath + ); + await this.queue.setDocument(actualPath, { + ...record, + path: actualPath, + intendedPath, + parentVersionId: remoteVersion.vaultUpdateId, + remoteRelativePath: remoteVersion.relativePath, + remoteHash: await hash(remoteContent) + }); + this.queue.lastSeenUpdateId = remoteVersion.vaultUpdateId; if (actualPath !== pathBeforeRoundtrip) { this.history.addHistoryEntry({ @@ -926,12 +1028,19 @@ export class Syncer { vaultUpdateId: remoteVersion.vaultUpdateId }); + // Stash *ourselves* at a `conflict--` path when the slot is + // locally occupied: a remote create's content is brand new to us, + // so deferring our local placement until any earlier work at this + // slot resolves is safer than evicting the occupant. The + // pending-LocalCreate case naturally unwinds through the server's + // own deconflict: when our create's HTTP runs, the server sees + // `remoteVersion`'s doc already there and routes ours to a + // sibling path, freeing the slot for us to set ourselves on the + // next time the remote-create-followed-by-rename pair drains + // through the queue. const createResult = await this.operations.create( remoteVersion.relativePath, remoteContent, - // Never evict a local file occupying the path the server has - // this remote create at — stash the new file at a - // `conflict--` path instead and record `intendedPath`. MoveOnConflict.NEW ); const { actualPath } = createResult;