Traverse allows AI to read and write ink

Created

28 March 2026

Last updated

07 May 2026

Authors

Ilia Breitburg

Introduction

Despite the overwhelming popularity of digital ink for note-taking in academic and professional settings, ink remains a niche research direction in machine learning, often limited to narrowly focused models for handwriting recognition.

With the rapidly improving capabilities of large language models, there have been multiple efforts to bring ink into the transformer paradigm, such as the Cursive Transformer and Microsoft’s TrInk. Yet none of these efforts have produced a general-purpose tokenizer that would allow LLMs to understand and generate ink the way text tokenizers allow them to understand and generate text.

Traverse is a tokenizer for digital pen sequences, built as a direct successor to the Cursive Transformer, capable of encoding practically any digital ink sequence at any scale with dynamic precision.

Related Work

In early 2025, Sam Greydanus and Zachary Wimpee released a research paper called The Cursive Transformer. It proposes tokenizing pen sequences into discrete tokens using absolute polar coordinates.

The tokenizer has a vocabulary of pen tokens, where each movement is encoded as a pair of tokens: an angle token associated with a specific absolute rotation angle $\theta$, and a magnitude token associated with a specific absolute travel distance $d$. Apart from these movement tokens, there are two more tokens to indicate a ‘pen up’ event and a ‘pen down’ event.

In short, given a list of strokes, the tokenizer loops over each point within each stroke, keeping track of its current position on the canvas, and for each step picks the closest angle and magnitude tokens that would bring the current position to the next target point. Pen up and pen down tokens are emitted on transition segments, for example for a space between ‘hello’ and ‘world’, or a gap between letters within a word.
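A minimal Python sketch of that loop, with illustrative bin values and helper names rather than the paper's actual implementation, might look like this:

import numpy as np

# Illustrative vocabulary: 4 rotation bins and 2 absolute distance bins.
ANGLE_BINS = np.deg2rad([0.0, 90.0, 180.0, 270.0])
DIST_BINS = np.array([5.0, 10.0])  # absolute travel distances in px

def encode_stroke(points, pos):
    """Greedily pick the (angle, magnitude) token pair that moves the cursor
    closest to each target point, advancing the cursor by the quantized step."""
    tokens = []
    pos = np.asarray(pos, dtype=float)
    for target in np.asarray(points, dtype=float):
        delta = target - pos
        angle = np.arctan2(delta[1], delta[0]) % (2 * np.pi)
        dist = np.linalg.norm(delta)
        # Nearest angle bin, accounting for wrap-around at 360 degrees.
        diff = np.abs((ANGLE_BINS - angle + np.pi) % (2 * np.pi) - np.pi)
        a = int(np.argmin(diff))
        d = int(np.argmin(np.abs(DIST_BINS - dist)))
        tokens += [("angle", a), ("dist", d)]
        pos = pos + DIST_BINS[d] * np.array([np.cos(ANGLE_BINS[a]), np.sin(ANGLE_BINS[a])])
    return tokens, pos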

Suppose the vocabulary has 4 rotation bins (0°, 90°, 180°, 270°) and 2 absolute distance bins (5px and 10px), plus dedicated pen-up and pen-down tokens. If you want to draw a 10×20 rectangle with a starting point at (0, 0), both side lengths divide evenly into the available bins:

Cursive Transformer tokenizing a 10×20 rectangle with absolute distance bins: a pen-down token followed by six (angle, magnitude) pairs (right 10px, right 10px, down 10px, left 10px, left 10px, up 10px), 13 tokens in total

Now suppose you scale that rectangle to 1.5×, giving a 15×30 rectangle. There is no 15px or 30px bin, so the tokenizer has to approximate each side with multiple tokens, producing a completely different and longer sequence for the same shape:

The same rectangle at 1.5× scale requires over 50% more tokens: a pen-down token followed by ten (angle, magnitude) pairs (right 10px ×3, down 10px, down 5px, left 10px ×3, up 10px, up 5px), 21 tokens in total

This illustrates the core issues that make the approach largely impractical for large-scale pen sequence modelling:

  • Travel distances are absolute, so the same shape at different scales produces completely different token sequences. This wastes model capacity on learning scale rather than structure, and small strokes become impossible to represent at all.
  • The Douglas-Peucker resampling used to reduce point count is also scale-dependent, so small strokes get resampled into an uninterpretable mess of straight lines.

Method

Traverse is a scale-invariant tokenizer for digital pen sequences based on the Cursive Transformer work, introducing a number of significant improvements:

  • Scale Invariance. Encoding ‘hello’ at 1× and ‘hello’ at 2× results in identical or near-identical token sequences.
  • Inference-time Dynamic Precision. Despite the fixed vocabulary, the tokenizer can encode sequences at any precision value, making it possible to adjust the speed/quality trade-off at deployment time.
  • Adaptive Precision. If the scale changes within a sequence (for example, an equation whose exponent is written in small print), the precision automatically adapts to the new scale, preserving the precision value relative to the rest of the handwriting and making it possible to express any stroke sequence.
  • Curvature-based Resampling. Instead of resampling against an absolute distance threshold, points are resampled based on curvature, which makes the resampling itself scale-invariant.

Traverse operates in two stages: raw strokes are first resampled based on curvature, then the resampled points are encoded into discrete tokens using relative distance bins at a given precision value. The following sections describe each stage in order.

Raw strokes → curvature resampling → token encoding → $t_0; t_1; \ldots; t_n$
Traverse encoding pipeline

Resampling

One of the key components that made the Cursive Transformer work was the Douglas-Peucker resampling, which reduces point count in a stroke while preserving its visual shape. The problem is that Douglas-Peucker resampling operates on an absolute distance threshold, making it scale-dependent.

For the tokenizer to produce consistent sequences across scales, the resampling itself has to be scale-invariant too. Curvature turned out to be a natural choice here, since it’s a dimensionless geometric property that doesn’t depend on scale.

The resampling works by building a dimensionless ‘complexity’ metric for each stroke from its length and curvature. For each segment $i$ between consecutive points, the contribution to the stroke’s total complexity is

$$\frac{l_i}{P} \cdot (1 + w \cdot \bar{\kappa}_i)$$
Segment complexity, where $l_i$ is the segment length, $P$ is the stroke’s perimeter, $w$ is the curvature weight, and $\bar{\kappa}_i$ is the average Menger curvature normalized by $P$

Each segment’s share of the budget is proportional to its length, but boosted if it bends sharply. Normalizing curvature by the perimeter is what makes this scale-invariant, which follows from the property of a circle having the same normalized curvature regardless of its radius.
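A minimal sketch of this complexity computation, under the assumption that each segment's $\bar{\kappa}_i$ is the average of the Menger curvature at its two endpoints; the averaging rule and the default weight $w$ are illustrative, not the reference implementation:

import numpy as np

def menger_curvature(a, b, c):
    """Menger curvature of the circle through three points: 4 * area / (|ab| * |bc| * |ca|)."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    twice_area = abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))
    denom = np.linalg.norm(b - a) * np.linalg.norm(c - b) * np.linalg.norm(a - c)
    return 2.0 * twice_area / denom if denom > 0 else 0.0

def segment_complexities(points, w=1.0):
    """Per-segment complexity (l_i / P) * (1 + w * kappa_i), where kappa_i is the
    average Menger curvature at the segment's endpoints, normalized by the perimeter P."""
    points = np.asarray(points, dtype=float)
    seg_len = np.linalg.norm(np.diff(points, axis=0), axis=1)
    P = float(seg_len.sum())
    if P == 0.0:
        return np.zeros_like(seg_len)
    vertex_kappa = np.zeros(len(points))
    for i in range(1, len(points) - 1):
        # Multiplying by P makes the curvature dimensionless, hence scale-invariant.
        vertex_kappa[i] = menger_curvature(points[i - 1], points[i], points[i + 1]) * P
    seg_kappa = 0.5 * (vertex_kappa[:-1] + vertex_kappa[1:])
    return (seg_len / P) * (1.0 + w * seg_kappa)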

Raw input (129,675 points)
Curvature-based resampling (595 points)

This gives us each stroke’s total complexity. The point budget is then allocated globally, where each stroke receives points proportional to its share of the total complexity across all strokes, scaled by a density hyperparameter $\rho$ (with a minimum of 2 points per stroke). Each stroke is then resampled by placing its allocated points at uniform intervals along its cumulative complexity, so that high-curvature regions receive denser sampling than straight regions.
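A sketch of the allocation and resampling step, reusing the segment_complexities helper from the previous snippet; the exact relationship between $\rho$ and the per-stroke budget is an assumption:

import numpy as np

def curvature_resample(strokes, rho=20.0, w=1.0):
    """Allocate a point budget to each stroke in proportion to its complexity share,
    then resample it at uniform steps of cumulative complexity."""
    per_stroke = [segment_complexities(s, w) for s in strokes]
    totals = np.array([c.sum() for c in per_stroke])
    grand_total = float(totals.sum()) or 1.0
    budget = rho * grand_total  # assumed form: rho points per unit of complexity
    resampled = []
    for pts, seg_c, total in zip(strokes, per_stroke, totals):
        pts = np.asarray(pts, dtype=float)
        n = max(2, int(round(budget * total / grand_total)))  # minimum of 2 points per stroke
        cum = np.concatenate([[0.0], np.cumsum(seg_c)])       # cumulative complexity along the stroke
        targets = np.linspace(0.0, cum[-1], n)
        # High-curvature regions cover more of the complexity axis, so they get more samples.
        xs = np.interp(targets, cum, pts[:, 0])
        ys = np.interp(targets, cum, pts[:, 1])
        resampled.append(np.stack([xs, ys], axis=1))
    return resampled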

The result is that a small tight curve and a large gentle arc of the same visual complexity receive comparable point counts, giving a consistent representation of strokes regardless of their absolute size.

Tokenization

Since the resampling is scale-invariant, the relative distances between consecutive points are roughly preserved regardless of the absolute scale. That observation became the foundation of Traverse’s tokenizer.

Instead of having absolute distance bins, Traverse uses relative multiplicative distance bins. The tokenizer keeps track of a base distance $b$ that evolves throughout encoding: each token is picked based on its multiplicative distance bin relative to $b$, and then $b$ is updated by multiplying it with that bin’s value.
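A minimal sketch of how $b$ might evolve during encoding; the nearest-bin selection rule and the three-bin vocabulary are illustrative assumptions:

import numpy as np

DIST_MULTIPLIERS = np.array([0.5, 1.0, 2.0])  # illustrative relative distance bins

def pick_distance_bin(target_dist, b):
    """Choose the multiplier whose product with the base distance b best matches
    the target distance, then evolve b multiplicatively."""
    k = int(np.argmin(np.abs(b * DIST_MULTIPLIERS - target_dist)))
    return k, b * DIST_MULTIPLIERS[k]

# Example: with b = 20, a 10 px step selects the 0.5x bin and halves b to 10.
bin_index, b = pick_distance_bin(10.0, 20.0)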

Unlike the Cursive Transformer, Traverse packs the distance, the rotation, and the pen state into a single token. This makes the model more efficient during inference and makes patterns around pen-up and pen-down tokens easier to learn during training.

Each token $t_n$ encodes a rotation $\theta_n$, a relative distance multiplier $d_n$, and a pen state $s_n$ (for example, $t_0$ = rotation 0, distance 2×, pen down; $t_1$ = rotation $\frac{\pi}{3}$, distance 1×, pen up; $t_2$ = rotation $\frac{5\pi}{6}$, distance ½×, pen down)

Because the multiplicative bins include both expanding (2×, 4×) and contracting (½×, ¼×) values, and each token encodes a step toward a target point, the encoder can approach the target with arbitrary precision: it can prioritize quality by spending more ‘refinement’ tokens to get closer to the target, or prioritize inference speed by tolerating a larger error and producing shorter token sequences.

Suppose our vocabulary has 12 rotation bins (30° each) and 3 distance multipliers at [½×, 1×, 2×], for both pen-down and pen-up states. Now suppose you draw the same 10×20 and 15×30 rectangles using Traverse. $b$ is initialized from the first segment, so the first long side is 1×, the short side halves it to ½×, and the pattern continues with the same multipliers regardless of absolute size:

Traverse tokenizing both rectangles with identical token sequences: $t_0$ = right, 1×, pen down; $t_1$ = down, ½×; $t_2$ = left, 2×; $t_3$ = up, ½× (4 tokens, with $b$ tracking 20 → 10 → 20 → 10 px for the small rectangle and 30 → 15 → 30 → 15 px for the large one)

Since every distance bin is a power of 2, $b$ is quantized to the nearest power of 2 as well: $b = 2^{\text{round}(\log_2 b)}$. This keeps $b$ on a discrete power-of-2 lattice throughout encoding, so two encoders starting from slightly different initial scales snap to the same lattice point and produce identical token sequences. $b$ is also re-snapped to the lattice after each pen-up transition, giving every new stroke a clean starting point.
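The snap itself is a one-liner; the example values below are purely illustrative:

import math

def snap_to_lattice(b):
    """Quantize the base distance to the nearest power of two."""
    return 2.0 ** round(math.log2(b))

snap_to_lattice(18.0)  # -> 16.0 (log2(18) ≈ 4.17 rounds to 4)
snap_to_lattice(11.0)  # -> 8.0  (log2(11) ≈ 3.46 rounds to 3)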

Precision

This notion of variable precision introduces a question: how do you decide that your error is ‘close enough’ for some given precision value? In the original Cursive Transformer, the ‘close enough’ metric was the absolute distance between the current point and the target point. But absolute distance breaks scale invariance: smaller shapes become less precise.

In Traverse, the precision $p$ is a ratio with $0 \leq p < 1$, and the acceptance threshold is $(1 - p) \cdot \tilde{b}$, where $\tilde{b}$ is the smoothed base distance. At $p = 0$ the threshold equals $\tilde{b}$ and each point is reached in a single token. As $p \to 1$ the threshold approaches zero and token density grows as $\frac{1}{1 - p}$, so the same finite vocabulary can reproduce a shape to any level of detail.
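A sketch of the resulting refinement loop for a single target point, ignoring rotation quantization and using illustrative bin values; the emission strategy shown here is an assumption, not the reference implementation:

import numpy as np

DIST_MULTIPLIERS = (0.5, 1.0, 2.0)  # illustrative relative distance bins

def refine_to_target(pos, target, b_smooth, p, max_steps=16):
    """Emit relative-distance steps until the remaining error drops below the
    acceptance threshold (1 - p) * b_smooth."""
    pos, target = np.asarray(pos, dtype=float), np.asarray(target, dtype=float)
    threshold = (1.0 - p) * b_smooth
    b, emitted = b_smooth, []
    for _ in range(max_steps):  # safety cap for the sketch
        delta = target - pos
        dist = float(np.linalg.norm(delta))
        if dist <= threshold:
            break
        m = min(DIST_MULTIPLIERS, key=lambda x: abs(b * x - dist))  # nearest relative bin
        pos = pos + delta / dist * (b * m)  # step along the current heading
        b = b * m                           # the base distance follows the chosen bin
        emitted.append(m)
    return emitted, pos, b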

Lowest precision ($p = 0$) yields 1,052 tokens
Higher precision ($p = 0.2$) yields 1,980 tokens

The initial idea was to set the precision threshold relative to the current segment length. That would be scale-invariant, but also very sensitive to sudden jumps in segment length: the encoding quality would fragment, jumping from very dense, precise points in some areas to extremely low precision in others, making them practically illegible.

Instead, the precision anchor is computed as an Exponential Moving Average of $b$ in log-space:

$$\log \tilde{b}_n = \alpha \cdot \operatorname{clamp}(\log b_n) + (1 - \alpha) \cdot \log \tilde{b}_{n-1}$$
Smoothed base distance, where $\alpha$ is the smoothing factor and the clamp suppresses sudden spikes

The latest distance contributes most, while all previous distances contribute exponentially less, providing a stable signal to normalize precision against. The delta clamping suppresses sudden spikes in $b$ across fragments, which further stabilizes the precision.
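A minimal sketch of this update; the smoothing factor, the clamp width, and clamping relative to the previous smoothed value are all assumptions for illustration:

import math

def update_smoothed_base(log_b_smooth, b_n, alpha=0.3, max_delta=1.0):
    """EMA of the base distance in log2 space, with the newest value clamped to
    within +/- max_delta of the previous smoothed value to suppress spikes."""
    log_b = math.log2(b_n)
    clamped = min(max(log_b, log_b_smooth - max_delta), log_b_smooth + max_delta)
    return alpha * clamped + (1.0 - alpha) * log_b_smooth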

Because pen-up distances tend to be much larger than pen-down distances, it was empirically determined that refining the precision only on pen-down steps is optimal for retaining the geometric features of the shape the ink represents.

Adaptive precision across a handwriting sample (fine to coarse)

The result is that if the writing includes some elements that are relatively smaller than other elements within the same sequence, the encoder dynamically adapts its precision to ensure uniform quality across the ink sequence, while remaining scale-invariant.

Evaluation

To quantitatively evaluate performance, the IAM On-Line Handwriting Database is used as a benchmark.

Vocabulary usage uniformity is comparable across both tokenizers, with Traverse landing within a few percentage points of The Cursive Transformer despite operating on a much smaller vocabulary. Uniformity is reported as the normalized Shannon entropy $H / \log_2 |V|$, where $H = -\sum_i p_i \log_2 p_i$ over token probabilities $p_i$ and $|V|$ is the vocabulary size, reaching $1$ when every token is used equally.
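The metric is straightforward to compute from the empirical token distribution:

import numpy as np
from collections import Counter

def vocab_usage_uniformity(token_ids, vocab_size):
    """Normalized Shannon entropy H / log2(|V|); equals 1 when every token in the
    vocabulary is used equally often."""
    counts = np.array(list(Counter(token_ids).values()), dtype=float)
    p = counts / counts.sum()
    H = -np.sum(p * np.log2(p))
    return float(H / np.log2(vocab_size))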

Vocabulary usage uniformity: Traverse vs. Cursive Transformer across precision levels ($p = 0$ to $p = 0.8$)

To make the comparison between tokenization approaches fair, each sample is uniformly resampled, encoded and decoded with Traverse (5 distance bins at [0.125, 0.5, 1, 2, 4] and 32 uniform rotation bins per pen state, for a vocabulary of 320 pen tokens) and The Cursive Transformer (220 shared angle bins and 151 absolute distance bins per pen state, for a vocabulary of 522 pen tokens) to produce reconstructed stroke sequences, then uniformly resampled again to match the original point count. Reconstruction quality is measured as the Mean Squared Error between the original and reconstructed sequences.
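A sketch of this round-trip protocol; encode, decode, and resample_uniform stand in for the tokenizer under test and a uniform arc-length resampler, and are hypothetical names:

import numpy as np

def reconstruction_mse(sample, encode, decode, resample_uniform):
    """Uniformly resample a sample, round-trip it through a tokenizer, resample the
    reconstruction back to the same point count, and measure the point-wise MSE."""
    reference = np.asarray(resample_uniform(sample, num_points=len(sample)), dtype=float)
    reconstructed = decode(encode(reference))
    reconstructed = np.asarray(resample_uniform(reconstructed, num_points=len(reference)), dtype=float)
    return float(np.mean((reference - reconstructed) ** 2))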

MSE (lower is better) vs. median token count for The Cursive Transformer and Traverse at $p = 0$ through $p = 0.8$

At a comparable token budget, Traverse cuts reconstruction error by roughly 2.3x against The Cursive Transformer (177 tokens at $p = 0.4$ vs. 160 tokens, MSE 2.2e-4 vs. 5.0e-4). Crucially, this is achieved with a single fixed 320-token vocabulary: precision becomes an inference-time parameter, letting the same tokenizer trade tokens for fidelity, from $p = 0$ for fast, coarse output up to $p = 0.8$ for reconstructions roughly 5x sharper than The Cursive Transformer.

Limitations

Although Traverse supports encoding with arbitrary precision, the overall quality remains constrained by the fixed set of distance and rotation bins in its vocabulary. While the tokenizer can theoretically reach any level of detail by emitting multiple fraction tokens to lower the base distance, how quickly it can adapt is ultimately limited by the smallest available distance bin.

Furthermore, despite significant improvements in token efficiency compared to The Cursive Transformer, Traverse still produces relatively long sequences. Given the quadratic attention complexity of transformer models, very long sequences can strain memory and inference latency.

Experimental Setup

To validate the proposed tokenization approach, a small GPT-2-style decoder-only model called Graphite was trained. The model uses a vanilla GPT-2 decoder-only architecture with 8 layers, 8 heads, and an embedding size of 468. It was trained on a single NVIDIA RTX A4000 for 16 epochs with a batch size of 24 and a learning rate of 5e-4.
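As a compact reference, the reported hyperparameters can be summarized in a config like the following; the field names are illustrative and not taken from the Graphite codebase:

from dataclasses import dataclass

@dataclass
class GraphiteConfig:
    # Architecture and training hyperparameters as reported above.
    n_layer: int = 8
    n_head: int = 8
    n_embd: int = 468
    epochs: int = 16
    batch_size: int = 24
    learning_rate: float = 5e-4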

Dataset

The training set consisted of short snippets of handwritten words, sentences, equations, and numbers. The data was collected from volunteer students who agreed to share study notes taken with an Apple Pencil in an iPad note-taking app. In total, about 50K samples were collected across 10 students.

To make the dataset supervised, all the handwriting within the samples was transcribed using an open-source OCR model.

Vocabulary

To allow the model to learn the connection between text symbols and pen tokens, a character-level tokenizer was used. The allowed text symbol set was built from the ASCII character list, which includes uppercase and lowercase letters, digits, and a short set of symbols.

Apart from text symbol tokens, all the discrete tokens from Traverse have been added, along with a handful of special tokens for prompt formatting. The specific tokenizer parameters (5 distance bins at [0.125, 0.5, 1, 2, 4] and 32 uniformly spaced rotation bins for both pen-down and pen-up states, yielding a vocabulary of 320 pen tokens) were empirically determined by inspecting the bin usage distribution on the training dataset and selecting the values that produced the most uniform coverage across the vocabulary.
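A sketch of how such a combined vocabulary could be assembled; the ASCII subset and the special-token names are assumptions, while the pen-token parameters match the ones listed above:

import string

def build_vocab(num_rotation_bins=32, distance_bins=(0.125, 0.5, 1, 2, 4),
                special_tokens=("<bos>", "<eos>", "<sep>")):
    """Character-level text tokens, packed Traverse pen tokens, and special tokens."""
    text_tokens = list(string.ascii_letters + string.digits + string.punctuation + " ")
    pen_tokens = [
        f"<pen:{state}:{rot}:{dist}>"
        for state in ("down", "up")           # pen state
        for rot in range(num_rotation_bins)   # 32 rotation bins
        for dist in distance_bins             # 5 relative distance bins
    ]  # 2 * 32 * 5 = 320 pen tokens
    return list(special_tokens) + text_tokens + pen_tokens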

Citation

@online{breitburg2026traverse,
  title  = {Traverse allows AI to read and write ink},
  author = {Breitburg, Ilia},
  year   = {2026},
  url    = {https://breitburg.com/research/traverse/},
}


Contact

Questions or feedback? Reach out at research@breitburg.com

Colophon

KaTeX for equations, TikZ for illustrations

AI assisted with prose refinement and visualization code, under human supervision