From 3d382ad7415c1a8d5945cb8524763515ba0e501a Mon Sep 17 00:00:00 2001
From: Andras Schmelczer
Date: Tue, 10 Mar 2026 20:29:35 +0000
Subject: [PATCH] Improve docs and compare with alternatives

---
 README.md                                   | 56 ++++++++++++++++++---
 docs/advanced-ts.md                         |  4 +-
 src/lib.rs                                  |  5 +-
 src/operation_transformation/edited_text.rs | 39 ++++++--------
 src/operation_transformation/operation.rs   | 23 ++++-----
 src/raw_operation.rs                        |  4 +-
 src/tokenizer.rs                            |  2 +-
 src/tokenizer/token.rs                      | 10 ++--
 src/types/history.rs                        |  3 +-
 src/types/number_or_text.rs                 |  4 ++
 src/types/span_with_history.rs              |  3 +-
 src/types/text_with_cursors.rs              |  5 +-
 src/utils/string_builder.rs                 |  7 ++-
 src/wasm.rs                                 | 10 ++--
 14 files changed, 106 insertions(+), 69 deletions(-)

diff --git a/README.md b/README.md
index 7925ed5..bdb6c39 100644
--- a/README.md
+++ b/README.md
@@ -33,7 +33,7 @@ Alternatively, add `reconcile-text` to your `Cargo.toml`:
 
 ```toml
 [dependencies]
-reconcile-text = "0.5"
+reconcile-text = "0.8"
 ```
 
 Then start merging:
@@ -52,7 +52,7 @@ let result = reconcile(parent, &left.into(), &right.into(), &*BuiltinTokenizer::
 assert_eq!(result.apply().text(), "Hi beautiful world");
 ```
 
-See the [merge-file example](examples/merge-file.rs) for another example or the [library's documentation](https://docs.rs/reconcile-text/latest/reconcile_text).
+See the [merge-file example](examples/merge-file.rs) for another example, or the [library's documentation](https://docs.rs/reconcile-text/latest/reconcile_text).
 
 ### JavaScript/TypeScript
 
@@ -77,7 +77,7 @@ const result = reconcile(parent, left, right);
 console.log(result.text); // "Hi beautiful world"
 ```
 
-See the [example website source](examples/website/src/index.ts) for a more complex example or the [advanced examples document](https://github.com/schmelczer/reconcile/blob/main/docs/advanced-ts.md).
+See the [example website source](examples/website/src/index.ts) for a more complex example, or the [advanced examples document](https://github.com/schmelczer/reconcile/blob/main/docs/advanced-ts.md).
 
 ## Motivation
 
@@ -87,13 +87,13 @@ This creates **Differential Synchronisation** scenarios ([2], [3]): we only know
 
 > **Note**: Some text domains require more careful handling. Legal contracts, for instance, could have unintended meaning changes from conflicting edits that create double negations. At the same time, semantic conflicts can still arise when merging code, even in the absence of syntactic conflicts.
 
-Differential sync is implemented by [universal-sync](https://github.com/invisible-college/universal-sync) and my Obsidian plugin [vault-link](https://github.com/schmelczer/vault-link), and it requires a merging tool which creates conflict-free results for the best user experience.
+Differential sync is implemented by [universal-sync](https://github.com/invisible-college/universal-sync) and my Obsidian plugin [vault-link](https://github.com/schmelczer/vault-link), and it requires a merging tool that creates conflict-free results for the best user experience.
 
 ## How it works
 
 `reconcile-text` starts off similarly to `diff3` ([4], [5]) but adds automated conflict resolution. Given a **parent** document and two modified versions (`left` and `right`), the following happens:
 
-1. **Tokenisation** — Input texts get split into meaningful units (words, characters, etc.) for granular merging
+1. **Tokenisation** — Input texts are split into meaningful units (words, characters, etc.) for granular merging
 2. **Diff computation** — Myers' algorithm calculates differences between (parent ↔ left) and (parent ↔ right)
 3. **Diff optimisation** — Operations are reordered and consolidated to maximise chained changes
 4. **Operational Transformation** — Edits are woven together using OT principles, preserving all modifications and updating cursors
@@ -102,6 +102,48 @@ Whilst the primary goal of `reconcile-text` isn't to implement OT, it provides a
 
 However, when only the end result of concurrent changes is observable, merge quality depends entirely on the quality of the underlying 2-way diffs. For instance, `move` operations cannot be supported because Myers' algorithm decomposes them into separate `insert` and `delete` operations, regardless of the merging algorithm used.
 
+## Comparison with other approaches
+
+### Traditional 3-way merge (diff3, Git)
+
+Tools like `diff3` ([4]) and Git produce **conflict markers** (`<<<<<<<` / `=======` / `>>>>>>>`) when both sides modify the same region. This works for source code where a human must verify correctness, but breaks the reading flow for prose. `reconcile-text` uses the same diff3-like foundation but adds an OT-inspired resolution step that eliminates conflict markers entirely. Libraries like [diffy](https://crates.io/crates/diffy), [merge3](https://github.com/breezy-team/merge3-rs) (Rust), and [node-diff3](https://github.com/bhousel/node-diff3) (JavaScript) all fall into this category.
+
+### diff-match-patch
+
+[diff-match-patch](https://github.com/google/diff-match-patch) ([6]) is a widely-used library created by Neil Fraser at Google in 2006, providing character-level diffing (Myers' algorithm), fuzzy string matching (Bitap algorithm), and patch application. It powers Fraser's **Differential Synchronisation** protocol ([2]): compute a diff between two texts, apply the patch to a third text that may have drifted, and repeat until convergence. If a patch fails, the failure self-corrects in the next sync cycle.
+
+The key differences from `reconcile-text`:
+
+- **2-way vs 3-way** — diff-match-patch diffs two texts and applies the result as a patch. It has no concept of a common ancestor and cannot reason about "left changes" vs "right changes". `reconcile-text` performs true 3-way merging, understanding the intent behind each side's edits.
+
+- **Character-level only** — Word-level and line-level diffs require encoding tokens as single Unicode characters before diffing ([7]). `reconcile-text` supports word, character, line, and custom tokenisation natively.
+
+- **Patches can fail** — `patch_apply` returns a boolean array indicating success per patch; failed patches are silently dropped. In Differential Synchronisation, failures self-correct in the next cycle, but for one-shot merges edits can be lost. `reconcile-text` always produces a complete merged result.
+
+- **No cursor tracking or change provenance** — diff-match-patch does not reposition cursors or track which side made which edit. `reconcile-text` does both automatically.
+
+See the [comparison example](examples/compare-with-diff-match-patch.rs) for concrete cases where diff-match-patch garbles adjacent edits and silently drops an entire sentence, while `reconcile-text` merges both users' changes correctly.
+
+> **When to use diff-match-patch instead**: when you don't have a common ancestor—for example, synchronising texts that have diverged through an unknown sequence of edits. If you have a common ancestor (as in most version control and collaborative editing scenarios), `reconcile-text` produces more reliable results.
+
+### CRDTs (Yjs, Automerge, Loro, diamond-types)
+
+Conflict-free Replicated Data Types guarantee convergence by mathematical construction: every operation commutes, so the order of application doesn't matter. Libraries like [Yjs](https://github.com/yjs/yjs) (and its Rust port [Yrs](https://github.com/y-crdt/y-crdt)), [Automerge](https://github.com/automerge/automerge), [Loro](https://github.com/loro-dev/loro), [cola](https://github.com/nomad/cola), and [diamond-types](https://github.com/josephg/diamond-types) implement this approach.
+
+CRDTs capture every individual keystroke or operation, assigning each a unique identity. This makes them ideal when you control the complete editing infrastructure: the editor, the transport layer, and the storage format. They work peer-to-peer, handle arbitrary numbers of concurrent editors, and never lose an edit.
+
+The trade-off is that CRDTs require **maintaining document state over time**—an operation log or internal data structure that grows with the document's edit history. You cannot simply hand a CRDT library three plain strings and get a merged result. This makes them unsuitable for Differential Synchronisation scenarios where you only observe the final state of each document, which is exactly the niche `reconcile-text` fills.
+
+> **When to use CRDTs instead**: if you control the complete editing stack and can capture every operation as it happens, CRDTs provide stronger convergence guarantees. They also support more than two concurrent editors naturally, whereas `reconcile-text` merges exactly two forks at a time (though merges can be chained).
+
+### Operational Transformation (OT)
+
+OT libraries like [ot.js](https://ot.js.org/) and [ShareJS](https://github.com/josephg/ShareJS) transform concurrent operations against each other so that applying them in any order produces the same result. Like CRDTs, they capture individual operations and require infrastructure to coordinate them—typically a central server that determines the canonical operation order.
+
+`reconcile-text` borrows the *concept* of OT (transforming one side's edits against the other) but applies it to a different problem. Instead of transforming individual keystrokes in real time, it transforms the consolidated diff output of two complete edits. This means it doesn't need a server, doesn't need to capture operations as they happen, and works entirely offline.
+
+> **When to use OT instead**: if you need real-time collaboration with sub-second latency and can run a coordination server, dedicated OT libraries handle this well. `reconcile-text` is designed for merge points, not live keystroke-by-keystroke synchronisation.
+
 ## Development
 
 Contributions are welcome!
@@ -142,8 +184,10 @@ Install [rustup](https://rustup.rs):
 
 [MIT](./LICENSE)
 
-[1]:https://marijnhaverbeke.nl/blog/collaborative-editing-cm.html
+[1]: https://marijnhaverbeke.nl/blog/collaborative-editing-cm.html
 [2]: https://neil.fraser.name/writing/sync/
 [3]: https://www.cis.upenn.edu/~bcpierce/papers/diff3-short.pdf
 [4]: https://blog.jcoglan.com/2017/05/08/merging-with-diff3/
 [5]: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35605.pdf
+[6]: https://github.com/google/diff-match-patch
+[7]: https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs
diff --git a/docs/advanced-ts.md b/docs/advanced-ts.md
index 5ecf065..0480bd4 100644
--- a/docs/advanced-ts.md
+++ b/docs/advanced-ts.md
@@ -40,7 +40,7 @@ console.log(result.history); /*
 
 ## Tokenisation Strategies
 
-Reconcile offers different approaches to split text for merging:
+`reconcile-text` offers different approaches to split text for merging:
 
 - **Word tokeniser** (`"Word"`) — Splits on word boundaries (recommended for prose)
 - **Character tokeniser** (`"Character"`) — Individual characters (fine-grained control)
@@ -48,7 +48,7 @@ Reconcile offers different approaches to split text for merging:
 
 ## Cursor Tracking
 
-Reconcile automatically tracks cursor positions through merges, which is handy in collaborative editors. Selections can be tracked by providing them as a pair of cursors.
+`reconcile-text` automatically tracks cursor positions through merges, which is useful for collaborative editors. Selections can be tracked by providing them as a pair of cursors.
 
 ```javascript
 const result = reconcile(
diff --git a/src/lib.rs b/src/lib.rs
index 654daa7..e759cc2 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -59,7 +59,7 @@
 //!
 //! For specialised use cases, such as structured languages, custom
 //! tokenisation logic can be implemented by providing a function with the
-//! signature `Fn(&str) -> Vec>`::
+//! signature `Fn(&str) -> Vec>`:
 //!
 //! ```
 //! use reconcile_text::{reconcile, Token, BuiltinTokenizer};
@@ -151,10 +151,11 @@
 //! ]
 //! );
 //! ```
+//!
 //! ## Efficiently serialize changes
 //!
 //! The edits can be serialized into a compact representation without the full
-//! original text, making the size only depends on the changes made.
+//! original text, making the size depend only on the changes made.
 //!
 //! ```rust
 //! # #[cfg(feature = "serde")]
diff --git a/src/operation_transformation/edited_text.rs b/src/operation_transformation/edited_text.rs
index 1f41cb1..bc30137 100644
--- a/src/operation_transformation/edited_text.rs
+++ b/src/operation_transformation/edited_text.rs
@@ -18,18 +18,16 @@ use crate::{
     utils::string_builder::StringBuilder,
 };
 
-/// A text document and a sequence of operations that can be applied to the text
-/// document. `EditedText` supports merging two sequences of operations using
-/// the principles of Operational Transformation.
+/// A text document with a sequence of operations derived from diffing it
+/// against an updated version. Supports merging two `EditedText` instances
+/// (from the same original) via Operational Transformation.
 ///
-/// It's mainly created through the `from_strings` method, then merged with
-/// another `EditedText` derived from the same original text and then applied to
-/// the original text to get the reconciled text of concurrent edits.
+/// Created via `from_strings`, `from_strings_with_tokenizer`, or `from_diff`,
+/// then merged with another `EditedText` and applied to get the reconciled
+/// text.
 ///
-/// In addition to text and operations, it also keeps track of cursor positions
-/// in the original text. The cursor positions are updated when the operations
-/// are applied, so that the cursor positions can be used to restore the
-/// cursor positions in the updated text.
+/// Also tracks cursor positions from the updated text, repositioning them
+/// when operations are applied.
 #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
 #[derive(Debug, Clone, PartialEq, Default)]
 pub struct EditedText<'a, T>
@@ -43,12 +41,8 @@ where
 }
 
 impl<'a> EditedText<'a, String> {
-    /// Create an `EditedText` from the given original (old) and updated (new)
-    /// strings. The returned `EditedText` represents the changes from the
-    /// original to the updated text. When the return value is applied to
-    /// the original text, it will result in the updated text. The default
-    /// word tokenizer is used to tokenize the text which splits the text on
-    /// whitespaces.
+    /// Create an `EditedText` from the given original and updated strings.
+    /// Uses the default word tokenizer (splits on word boundaries).
     #[must_use]
     pub fn from_strings(original: &'a str, updated: &TextWithCursors) -> Self {
         Self::from_strings_with_tokenizer(original, updated, &*BuiltinTokenizer::Word)
@@ -59,11 +53,8 @@ impl<'a, T> EditedText<'a, T>
 where
     T: PartialEq + Clone + Debug,
 {
-    /// Create an `EditedText` from the given original (old) and updated (new)
-    /// strings. The returned `EditedText` represents the changes from the
-    /// original to the updated text. When the return value is applied to
-    /// the original text, it will result in the updated text. The tokenizer
-    /// function is used to tokenize the text.
+    /// Create an `EditedText` from the given original and updated strings
+    /// using the provided tokenizer.
     pub fn from_strings_with_tokenizer(
         original: &'a str,
         updated: &TextWithCursors,
@@ -110,7 +101,7 @@ where
     ///
     /// # Panics
     ///
-    /// Panics if there's an integer overflow (in i64) when calculating new
+    /// Panics if there's an integer overflow (in isize) when calculating new
     /// cursor positions.
     #[must_use]
     #[allow(clippy::too_many_lines)]
@@ -280,7 +271,7 @@ where
     /// Apply the operations to the text and return the resulting text in chunks
     /// together with the provenance describing where each chunk came from.
     ///
-    /// The result includes deleted spans as well.
+    /// Returns all spans including deletions (not present in the merged text).
     ///
     /// ```
     /// use reconcile_text::{History, SpanWithHistory, BuiltinTokenizer, reconcile};
@@ -422,7 +413,7 @@ where
         result
     }
 
-    /// Deserialize an `EditedText` from a change list and the original text.
+    /// Reconstruct an `EditedText` from a diff and the original text.
     ///
     /// # Errors
     ///
diff --git a/src/operation_transformation/operation.rs b/src/operation_transformation/operation.rs
index 7a8f92a..823a50b 100644
--- a/src/operation_transformation/operation.rs
+++ b/src/operation_transformation/operation.rs
@@ -46,9 +46,8 @@ impl Operation
 where
     T: PartialEq + Clone + Debug,
 {
-    /// Creates an equal operation with the given index.
-    /// This operation is used to indicate that the text at the given index
-    /// is unchanged.
+    /// Creates an equal (retain) operation starting at the given character
+    /// offset in the original text.
     pub fn create_equal(order: usize, length: usize) -> Self {
         Operation::Equal {
             order,
@@ -69,13 +68,14 @@ where
         }
     }
 
-    /// Creates an insert operation with the given index and text.
+    /// Creates an insert operation at the given character offset with the
+    /// given tokens.
     pub fn create_insert(order: usize, text: Vec>) -> Self {
         Operation::Insert { order, text }
     }
 
-    /// Creates a delete operation with the given index and number of
-    /// to-be-deleted characters.
+    /// Creates a delete operation at the given character offset for the
+    /// specified number of characters.
     pub fn create_delete(order: usize, deleted_character_count: usize) -> Self {
         Operation::Delete {
             order,
@@ -179,8 +179,8 @@ where
         builder
     }
 
-    /// Returns the number of affected characters. It is always greater than 0
-    /// because empty operations cannot be created.
+    /// Returns the number of affected characters. May be 0 after
+    /// `merge_operations`.
     pub fn len(&self) -> usize {
         match self {
             Operation::Equal { length, .. } => *length,
@@ -192,10 +192,9 @@ where
         }
     }
 
-    /// Merges the operation with the given context, producing a new operation
-    /// and updating the context. This implements a comples FSM that handles
-    /// the merging of operations in a way that is consistent with the text.
-    /// The contexts are updated in-place.
+    /// Adjusts this operation based on `previous_operation` from the other side
+    /// to avoid duplicating or conflicting changes. Updates
+    /// `previous_operation` in-place.
     #[allow(clippy::too_many_lines)]
     pub fn merge_operations(self, previous_operation: &mut Option) -> Operation {
         let operation = self;
diff --git a/src/raw_operation.rs b/src/raw_operation.rs
index aa3ab8f..1572e88 100644
--- a/src/raw_operation.rs
+++ b/src/raw_operation.rs
@@ -2,9 +2,9 @@ use std::fmt::Debug;
 
 use crate::{tokenizer::token::Token, utils::myers_diff::myers_diff};
 
-/// Text editing operation containing the to-be-changed `Tokens`-s.
+/// Text editing operation containing the affected tokens.
 ///
-/// `RawOperations` can be joined together when the underlying tokens
+/// `RawOperation`s can be joined together when the underlying tokens
 /// allow for joining subsequent operations.
 #[derive(Debug, Clone, PartialEq)]
 pub enum RawOperation
diff --git a/src/tokenizer.rs b/src/tokenizer.rs
index 62ab528..fabafcd 100644
--- a/src/tokenizer.rs
+++ b/src/tokenizer.rs
@@ -12,7 +12,7 @@ use wasm_bindgen::prelude::*;
 
 pub mod token;
 
-/// A trait for tokenizers that take a string and return a list of tokens.
+/// Type alias for tokenizer functions that split a string into tokens.
 pub type Tokenizer = dyn Fn(&str) -> Vec>;
 
 #[cfg_attr(feature = "wasm", wasm_bindgen)]
diff --git a/src/tokenizer/token.rs b/src/tokenizer/token.rs
index 58e6ab6..2f2dc82 100644
--- a/src/tokenizer/token.rs
+++ b/src/tokenizer/token.rs
@@ -3,13 +3,11 @@ use std::fmt::Debug;
 #[cfg(feature = "serde")]
 use serde::{Deserialize, Serialize};
 
-/// A token is a string that has been normalized in some way.
+/// A token with a normalized form (used for diffing) and an original form
+/// (used when applying operations). Joinability flags control whether
+/// adjacent insertions interleave or group.
 ///
-/// A token consists of the normalized form is used for comparison, and the
-/// original form used for subsequently applying `Operation`-s to a text
-/// document.
-///
-/// It's UTF-8 compatible.
+/// UTF-8 compatible.
 #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
 #[derive(Debug, Clone)]
 pub struct Token
diff --git a/src/types/history.rs b/src/types/history.rs
index e30ae90..431a481 100644
--- a/src/types/history.rs
+++ b/src/types/history.rs
@@ -15,8 +15,7 @@ pub enum History {
     RemovedFromRight = "RemovedFromRight",
 }
 
-/// Simple enum for describing the result of `reconcile` in a flat list.
-/// When compiled to WASM, the enum values are the same as their names.
+/// Provenance label for each span returned by `apply_with_history`.
 #[derive(Debug, Clone, Copy, PartialEq, Eq)]
 #[cfg(not(feature = "wasm"))]
 #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
diff --git a/src/types/number_or_text.rs b/src/types/number_or_text.rs
index bb42bc7..fe17af4 100644
--- a/src/types/number_or_text.rs
+++ b/src/types/number_or_text.rs
@@ -27,6 +27,10 @@ impl TryFrom for NumberOrText {
         }
 
         if let Some(num) = value.clone().as_f64() {
+            if num.is_nan() {
+                return Err(DeserialisationError::new("NaN is not a valid number"));
+            }
+
             if num.abs() > INTEGRAL_LIMIT {
                 return Err(DeserialisationError::new(
                     "Floating-point number exceeds safe integer limit, use BigInt instead",
diff --git a/src/types/span_with_history.rs b/src/types/span_with_history.rs
index 09f778f..1e4481c 100644
--- a/src/types/span_with_history.rs
+++ b/src/types/span_with_history.rs
@@ -5,8 +5,7 @@ use wasm_bindgen::prelude::*;
 
 use crate::types::history::History;
 
-/// Wrapper type for `(String, History)` where History describes the origin of
-/// `text`.
+/// A text span annotated with its origin in a merge result.
 #[allow(clippy::unsafe_derive_deserialize)]
 #[cfg_attr(feature = "wasm", wasm_bindgen)]
 #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
diff --git a/src/types/text_with_cursors.rs b/src/types/text_with_cursors.rs
index c9dec89..8f4af19 100644
--- a/src/types/text_with_cursors.rs
+++ b/src/types/text_with_cursors.rs
@@ -12,12 +12,15 @@ pub struct TextWithCursors {
 
 #[cfg_attr(feature = "wasm", wasm_bindgen)]
 impl TextWithCursors {
+    /// # Panics
+    ///
+    /// Panics if any cursor's `char_index` exceeds the text's character length.
     #[cfg_attr(feature = "wasm", wasm_bindgen(constructor))]
     #[must_use]
     pub fn new(text: String, cursors: Vec) -> Self {
         let length = text.chars().count();
         for cursor in &cursors {
-            debug_assert!(
+            assert!(
                 cursor.char_index <= length,
                 // cursor.char_index == length means that the cursor is at the end
                 "Cursor positions ({}) must be contained within the text (of length {length}) or \
diff --git a/src/utils/string_builder.rs b/src/utils/string_builder.rs
index 34110d8..40928c3 100644
--- a/src/utils/string_builder.rs
+++ b/src/utils/string_builder.rs
@@ -1,9 +1,8 @@
 use std::{fmt, iter::Iterator};
 
-/// A helper for building a string in-order based on an original string and a
-/// series of insertions, deletions, and copies applied to it. It is safe to use
-/// with UTF-8 strings as all operations are based on character indices. The
-/// methods must be called in-order.
+/// A helper for building a string sequentially from an original string via
+/// insertions, deletions, and copies. All operations use character counts,
+/// safe for UTF-8. Methods must be called in-order.
 pub struct StringBuilder<'a> {
     original: Box + 'a>,
     buffer: String,
diff --git a/src/wasm.rs b/src/wasm.rs
index 6fc02f2..959f1a6 100644
--- a/src/wasm.rs
+++ b/src/wasm.rs
@@ -22,7 +22,7 @@ pub fn reconcile(
     crate::reconcile(parent, left, right, &*tokenizer).apply()
 }
 
-/// WASM wrapper around `crate::reconcile` for merging text.
+/// WASM wrapper around `crate::reconcile` that also returns provenance history.
 #[wasm_bindgen(js_name = reconcileWithHistory)]
 #[must_use]
 pub fn reconcile_with_history(
@@ -94,12 +94,12 @@ pub fn diff(parent: &str, changed: &TextWithCursors, tokenizer: BuiltinTokenizer
         .collect()
 }
 
-/// Inverse of `diff`, applies a compact diff representation to a parent text
+/// Inverse of `diff`, applies a compact diff representation to a parent text.
 ///
-/// # Panics
+/// # Errors
 ///
-/// Panics if the diff format is invalid or there's an integer overflow when
-/// applying the diff.
+/// Returns a JS error if the diff format is invalid or references ranges
+/// exceeding the original text length.
 #[wasm_bindgen(js_name = undiff)]
 #[must_use]
 pub fn undiff(parent: &str, diff: Vec, tokenizer: BuiltinTokenizer) -> String {