I'd love to see those results independently verified, and I'd also love a good explanation of how they're getting such great performance.
Typically, the recipe is to keep the hot parts of the data structure in CPU caches (SRAM) and lean heavily on SIMD. At the time of those measurements, USearch used ~100 custom kernels for different data types, similarity metrics, and hardware platforms. The upcoming release of the underlying SimSIMD micro-kernels project will push that number beyond 1000, so we should be able to squeeze out a lot more performance later this year.
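To make the "many kernels" idea concrete (this is an illustrative sketch, not USearch's or SimSIMD's actual internals): you can think of the kernel zoo as a dispatch table keyed by data type and metric, where each entry would, in a real library, be a hand-written SIMD routine selected at runtime for the CPU's feature set.

```python
# Hypothetical sketch of kernel dispatch. Each "kernel" here is plain Python
# standing in for a hand-vectorized implementation (AVX-512, NEON, SVE, ...).

def dot_f32(a, b):
    # Inner product over float32 vectors.
    return sum(x * y for x, y in zip(a, b))

def sqeuclidean_f32(a, b):
    # Squared Euclidean distance over float32 vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def hamming_u8(a, b):
    # Popcount of XOR-ed bytes -- the kind of kernel that maps to a single
    # population-count instruction per lane on modern CPUs.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# One entry per (data type, metric) pair; a real library multiplies this
# by the number of supported instruction sets, hence hundreds of kernels.
KERNELS = {
    ("f32", "dot"): dot_f32,
    ("f32", "sqeuclidean"): sqeuclidean_f32,
    ("u8", "hamming"): hamming_u8,
}

def distance(dtype, metric, a, b):
    return KERNELS[(dtype, metric)](a, b)
```

The point of the table is that adding a data type or metric is a new entry, not a rewrite of the search structure above it.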
That said, self-reported numbers only go so far; it'd be great to see USearch in more third-party benchmarks like VectorDBBench or ANN-Benchmarks. Those would make for a much more interesting comparison!
On the technical side, USearch has some impressive work, and you're right that SIMD and cache optimization are well-established techniques (definitely part of our toolbox too). Curious about your setup, though: vector search has a fairly uniform compute pattern, so while 100+ custom kernels are great for adapting to different hardware (something we're also pursuing), I suspect most of the gain comes from a core set of techniques, especially when optimizing for peak QPS on a given machine and index type. Looking forward to seeing what your upcoming release brings!
And we always welcome independent verification—if you have any questions or want to discuss the results, feel free to reach out via GitHub Issues or our Discord.
You're absolutely right that a basic HNSW implementation is relatively straightforward. But achieving this level of performance required going beyond the usual techniques.
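For context on what "relatively straightforward" means here: the heart of HNSW is a greedy best-first walk over a proximity graph, repeated per layer. A minimal sketch of that search step, assuming a prebuilt adjacency list (illustrative only, not any particular library's code):

```python
import heapq

def greedy_search(graph, vectors, query, entry, ef=4):
    """Greedy best-first search over one proximity-graph layer.

    graph:   node -> list of neighbor nodes (prebuilt adjacency list)
    vectors: node -> vector (list of floats)
    query:   target vector
    entry:   starting node for the walk
    ef:      beam width (size of the dynamic result list)
    Returns the top-ef (distance, node) pairs, nearest first.
    """
    def dist(u):
        return sum((x - y) ** 2 for x, y in zip(vectors[u], query))

    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap of nodes to expand
    best = [(-dist(entry), entry)]        # max-heap (negated) of current top-ef
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0] and len(best) >= ef:
            break  # nearest unexpanded candidate is worse than the worst result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the current worst result
    return sorted((-d, n) for d, n in best)
```

The "beyond the usual techniques" part is everything this sketch leaves out: the layer hierarchy, neighbor-selection heuristics during construction, memory layout, prefetching, and the SIMD distance kernels underneath.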
A better comparison would be with Meta's FAISS.
This sort of behaviour is now absolutely rampant in the AI industry.
Average latency across ~500 queries per collection per database:
Qdrant: 21.1ms
LanceDB: 5.9ms
Zvec: 0.8ms
Both Qdrant and LanceDB are running with Inverse Document Frequency enabled, so that's a slight performance hit; Zvec is running with HNSW.
The overlap of answers between the three is virtually identical with the same default ranking.
So yes, Zvec is incredible, but the gotcha is that Zvec is fast because it's primarily constrained by local disk performance, and the data must live on local disk. You may have a central repository storing the data, but every instance running Zvec needs a high-performance local disk attached. I mounted blobfuse2 object storage to test, and Zvec's numbers went to over 100ms, so disk is almost all that matters.
My take? The way Zvec behaves right now, it will be amazing for on-device vector lookups, but not as helpful for cloud vectors.
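For anyone wanting to reproduce this kind of comparison, the measurement side is simple; a sketch of a latency harness (the `search_fn` callable is a placeholder for whichever client you're testing, not any database's actual API):

```python
import time
import statistics

def measure_latency(search_fn, queries):
    """Run each query once and report latency stats in milliseconds.

    search_fn: callable taking one query (placeholder for the DB client call)
    queries:   iterable of query payloads, e.g. ~500 per collection
    """
    timings_ms = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.fmean(timings_ms),
        "p50_ms": statistics.median(timings_ms),
    }
```

Worth noting that means hide tail behavior; reporting p50/p99 alongside the average gives a fairer picture, especially when one system is disk-bound and the others aren't.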
Just a bit of context on the storage behavior: Zvec currently uses memory-mapped files (mmap) by default, so once the relevant data is warmed up in the page cache, performance should be nearly identical regardless of whether the underlying storage is local disk or object storage—it's essentially in-memory at that point. The 100ms latency you observed with blobfuse2 likely reflects cold reads (data not yet cached), which can be slower than local disk in practice. Our published benchmarks are all conducted with sufficient RAM and full warmup, so the storage layer's latency isn't a factor in those numbers.
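To make the warmup point concrete, here is a minimal stdlib sketch of the general mmap-plus-page-cache pattern (the file path is hypothetical, and this is not Zvec's code): touching one byte per page faults the whole page into the OS page cache, after which reads are served from RAM regardless of whether the backing store is a local NVMe drive or a blobfuse2 mount. On a cold object-store mount, every one of those faults is a network round trip, which would explain the 100ms+ numbers.

```python
import mmap

def warm_mapping(path):
    """Memory-map a file and touch every page once to pull it into the page cache."""
    page = mmap.PAGESIZE
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    checksum = 0
    for offset in range(0, len(mm), page):
        checksum ^= mm[offset]  # reading one byte faults in the entire page
    return mm, checksum
```

Once this pass completes, repeated searches over the mapped region never hit the storage layer at all, which is why warmed benchmarks look storage-independent.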
For benchmarks, you could prepare a Phoronix Test Suite module (https://www.phoronix-test-suite.com/) to facilitate replication on a variety of machines.
It would be great to see how it compares to Faiss / HNSWLib etc. I'd consider integrating it into txtai as an ANN backend.
See e.g., https://scikit-learn.org/stable/auto_examples/neighbors/plot...