A tiny character-level transformer, trained on Project Gutenberg books and running fully in your browser via ONNX Runtime Web.
This page runs a small generative language model entirely in your browser. No server, no API call, no data leaving your device.
A character-level transformer with ~12M parameters: 6 layers, 6 attention heads, a 384-dimensional residual stream, and a 1024-character context window. Trained from scratch on a subset of Project Gutenberg books published before 1919 — that's why the prose tends to sound vaguely Victorian.
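As a rough sanity check on the size quoted above, the large weight matrices can be counted by hand. The sketch below assumes a standard GPT-style block (four d×d attention projections plus an MLP with 4x expansion) and learned position embeddings; none of those details are stated on this page. It also ignores token embeddings, the output head, biases, and layer norms, so it is a lower-bound estimate rather than an exact count.

```python
def estimate_block_params(n_layers: int, d_model: int, mlp_ratio: int = 4) -> int:
    """Count only the large weight matrices inside the transformer blocks."""
    attn = 4 * d_model * d_model             # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model  # up-projection and down-projection
    return n_layers * (attn + mlp)


def estimate_position_params(context: int, d_model: int) -> int:
    """Learned position embeddings (an assumption): one vector per position."""
    return context * d_model


blocks = estimate_block_params(n_layers=6, d_model=384)
positions = estimate_position_params(context=1024, d_model=384)
print(f"blocks: {blocks:,}  positions: {positions:,}  total: {blocks + positions:,}")
```

Under these assumptions the blocks account for about 10.6M weights and the position table for about 0.4M, which is the same ballpark as the quoted figure. The exact number depends on architectural details not given here.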
Each colored box above is one token — the model's atomic unit of input and output. Tokens are character-level here, so every box is a single letter, digit, punctuation mark, or piece of whitespace. The underlined ones are what you typed; the rest were sampled one at a time by the model. Adjacent tokens get cycling background colors so the boundaries are visible.
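The box layout described above can be sketched in a few lines. This is an illustration only: the palette values and the `render_plan` helper are hypothetical, and the page itself does this in the browser rather than in Python.

```python
PALETTE = ["#fde68a", "#bfdbfe", "#bbf7d0", "#fecaca"]  # hypothetical colors


def tokenize(text: str) -> list[str]:
    """Character-level tokenization: every character is its own token."""
    return list(text)


def render_plan(prompt: str, generated: str) -> list[dict]:
    """Describe each box: its token, background color, and underline flag."""
    boxes = []
    for i, ch in enumerate(tokenize(prompt + generated)):
        boxes.append({
            "token": ch,
            "color": PALETTE[i % len(PALETTE)],  # cycle so neighbors differ
            "underlined": i < len(prompt),       # typed by the user
        })
    return boxes


for box in render_plan("It was", " a d"):
    print(box)
```

Because the colors cycle by position, two adjacent boxes never share a background as long as the palette has at least two entries.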
The trained PyTorch model is exported to ONNX and loaded by ONNX Runtime Web. The browser runs the actual matrix multiplications via WebGPU (when available) or WASM SIMD. Sampling — temperature, top-k, optional lookahead — is implemented in plain JavaScript and drives the model in a tight autoregressive loop, one token per forward pass.
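The sampling loop itself is small. The page implements it in JavaScript; the sketch below shows the same idea in Python for brevity. `toy_forward` is a made-up stand-in for the real ONNX forward pass, and the function names and default values are assumptions, not the page's actual code.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, k=5, rng=random):
    """Sample an index from the k highest logits after temperature scaling."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


def generate(forward, prompt_ids, n_new, context=1024, **sampler_args):
    """Autoregressive loop: one forward pass per generated token."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = forward(ids[-context:])  # keep only the last `context` tokens
        ids.append(sample_top_k(logits, **sampler_args))
    return ids


def toy_forward(ids):
    """Hypothetical stand-in model over 4 token ids: favors (last id + 1) mod 4."""
    nxt = (ids[-1] + 1) % 4
    return [3.0 if i == nxt else 0.0 for i in range(4)]


print(generate(toy_forward, [0], 8, temperature=0.8, k=2))
```

Lower temperatures sharpen the distribution toward the most likely token, and `k=1` reduces the sampler to always picking the argmax.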
When the lookahead checkbox is on, the sampler no longer commits to one token at a time. Instead it expands a depth-N tree with branching factor K, ranks all K^N candidate paths by their joint probability, and picks one weighted by temperature. This trades wall-clock time for more coherent output, which is useful for a small char-level model where token-by-token sampling drifts quickly.
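A minimal sketch of the lookahead idea follows, written in Python rather than the page's JavaScript. `toy_next_probs` is a hypothetical stand-in for the model, and the helper names are illustrative. Whether the real sampler commits the whole chosen path or only its first token is not stated here, so this sketch simply returns the path.

```python
import math
import random


def top_k_probs(probs, k):
    """Return the k most likely (token, probability) pairs."""
    return sorted(enumerate(probs), key=lambda p: p[1], reverse=True)[:k]


def expand_paths(next_probs, prefix, depth, k):
    """Enumerate all k**depth continuations with their joint log-probability."""
    paths = [((), 0.0)]
    for _ in range(depth):
        grown = []
        for path, logp in paths:
            probs = next_probs(list(prefix) + list(path))
            for tok, p in top_k_probs(probs, k):
                # Joint probability is a product, so log-probabilities add.
                grown.append((path + (tok,), logp + math.log(max(p, 1e-12))))
        paths = grown
    return paths


def pick_path(paths, temperature=1.0, rng=random):
    """Sample one path with weight proportional to exp(logp / temperature)."""
    peak = max(logp for _, logp in paths)
    weights = [math.exp((logp - peak) / temperature) for _, logp in paths]
    return rng.choices([p for p, _ in paths], weights=weights, k=1)[0]


def toy_next_probs(ids):
    """Hypothetical stand-in model over 3 tokens: prefers repeating the last one."""
    last = ids[-1]
    return [0.6 if t == last else 0.2 for t in range(3)]


candidates = expand_paths(toy_next_probs, [0], depth=3, k=2)
print(len(candidates), pick_path(candidates, temperature=0.5))
```

Each level of the tree costs one forward pass per live path, so the work grows geometrically with depth. That is the wall-clock cost being traded for coherence.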
Training uses a linear warmup over the first 100 iterations followed by ReduceLROnPlateau — the learning rate only drops when validation loss stops improving for several eval intervals, rather than following a predetermined decay schedule. The loss curve and lr staircase are both logged to TensorBoard during training.
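The schedule logic can be sketched without any framework. The class below is a simplified stand-in for combining a warmup with PyTorch's ReduceLROnPlateau, not the training script's actual code: the `factor` and `patience` values are assumptions, and features such as thresholds, cooldown, and a minimum learning rate are left out.

```python
class WarmupThenPlateau:
    """Linear warmup, then cut the learning rate when validation loss stalls."""

    def __init__(self, base_lr, warmup_iters=100, factor=0.5, patience=3):
        self.base_lr = base_lr
        self.warmup_iters = warmup_iters
        self.factor = factor      # assumed multiplier applied on each drop
        self.patience = patience  # assumed number of tolerated stalled evals
        self.scale = 1.0
        self.best = float("inf")
        self.bad_evals = 0

    def lr_at(self, iteration):
        """Learning rate for a 0-based training iteration."""
        warm = min(1.0, (iteration + 1) / self.warmup_iters)
        return self.base_lr * warm * self.scale

    def report_val_loss(self, loss):
        """Call once per eval interval; drops the rate after a long stall."""
        if loss < self.best:
            self.best = loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals > self.patience:
                self.scale *= self.factor
                self.bad_evals = 0


sched = WarmupThenPlateau(base_lr=1e-3)
print(sched.lr_at(0), sched.lr_at(49), sched.lr_at(99))
```

Because the rate changes only when `report_val_loss` sees a stall, the logged learning rate forms a staircase whose step positions depend on the run rather than on a fixed schedule.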
torch.onnx.export with a dynamic time axis