Vectors
Local embeddings, TurboQuant 3-bit, HNSW above 1K. The retrieval layer of memory.
Memory has two retrieval modes. Structured reads walk the typed-object graph by attribute and edge. Semantic reads use vector similarity. The vectors layer makes the second mode work on a smartphone budget, with no cloud calls and no separate service. Three choices carry the design: embed locally, compress aggressively, switch to an approximate index only when the corpus is big enough to need one.
LOCALLocal embeddings — fastembed + ONNX
The embedding model is all-MiniLM-L6-v2 (384 dimensions), run through fastembed, which bundles ONNX Runtime. The model cache (~87 MB on first download) is anchored under .ioboxx/fastembed_cache/ so it stays per-twin and is cleaned by --purge. After the first download, the model runs offline forever. No API keys, no per-call billing, no telemetry — data never leaves the machine.
384 dimensions is deliberately modest. Higher-dimensional models exist and score marginally better on benchmarks, but the marginal recall is not worth the storage and the compute cost on-device. 384-dim is the point where semantic retrieval is good enough for working memory and the per-vector budget stays in range of the smartphone target.
CODECTurboQuant 3-bit compression
Raw 32-bit floats are too expensive to keep at scale. Every vector is compressed to 3 bits per dimension using TurboQuant — the Google Research codec (ICLR 2026) that combines PolarQuant (Cartesian-to-polar rotation, fixed circular grid) with a QJL error-correction bit (Quantized Johnson-Lindenstrauss). The result is >9× compression with negligible accuracy loss on standard semantic-search benchmarks.
The arithmetic decides whether on-device memory is feasible at all. With a 384-dim vector and one 32-bit float per dimension, raw storage per vector is roughly 4,100 bytes (1,536 bytes of payload plus index and node overhead measured against the original Google paper's 1024-dim working figure of ~4 KB). At 3 bits per dimension, that drops to roughly 384 bytes per vector. The result for a realistic working set:
| Scenario | Per vector | 77.5K vectors | Verdict |
|---|---|---|---|
| Uncompressed 32-bit floats | ~4,100 B | ~318 MB | needs a server |
| TurboQuant 3-bit PolarQuant + QJL | ~384 B | ~30 MB | fits on a phone |
That difference — ~310 MB versus ~30 MB — is the difference between an embedding tensor that has to live on a server and one that fits in the smartphone memory budget alongside the rest of the typed-object substrate.
INDEXHNSW + flat hybrid
The retrieval index lives at crates/vectors/src/hnsw.rs — a custom HNSW implementation with M=16, ef_construct=200, and ef_search=50. It auto-activates once the corpus exceeds 1,000 vectors. Below that threshold, the engine runs a flat cosine scan — cheaper than maintaining a graph for tiny indexes and easier to reason about during the cold-start phase of a new twin.
HNSW is incremental: inserts attach a new node to the layered proximity graph without rebuilding. Deletes are lazy-repair — the node is tombstoned, neighbour links are pruned on the next traversal that encounters them. The whole index persists to .ioboxx/vectors.bin via bincode, so startup only re-embeds nodes that are missing a vector; the rest are loaded from disk.
The 1,000-vector threshold is the empirical sweet spot. Below it, the flat scan is fast enough that the graph maintenance cost is wasted. Above it, the index build cost pays back across queries that would otherwise sweep the entire embedding tensor.
WRITE PATHSubscribe-on-write embedding
Embedding is a native engine capability, not a queued workflow. create_object and set_attributes emit through an embed_sink that computes the vector immediately and attaches it to the node — no BPMN process instance, no scheduled poller, no separate embedder binary. The earlier iobox-ai design queued embeddings through a full BPMN "Embeddings Backfill" process with a 5-second dispatch loop. The ioboxx-core rewrite collapsed that into a direct write-through path and freed the BPMN engine for the workflows it was meant for.
RATIONALEWhy these specific choices
- 384-dim is good enough for most semantic retrieval. Bigger models cost more for marginal recall and don't change behaviour qualitatively for working memory.
- HNSW above 1K is the crossover point where the index build cost pays back. Below it, the flat scan wins; above it, the graph wins.
- 3-bit TurboQuant brings the embedding tensor into the smartphone memory budget — alongside the typed-object substrate, the commitlog, and the snapshot. Without it, sovereign on-device memory is not feasible at any realistic scale.
- Subscribe-on-write eliminates a class of "stale embeddings" bugs that come with any queue. The vector lands with the attribute that produced it.
COMPOSITIONWhere this fits
The embedding tensor is one of the four tensor slices on a Node — the structured-attribute slice, the lineage-edge slice, the surprise-and- relevance slice, and the embedding slice all live on the same object. The unified argument is in tensor-as-substrate. If you arrived here from memory, go back up one level for the broader shape; this page only covers the retrieval layer.