Check it out! https://leebutterman.com/wikipedia-search-by-vibes/

This webapp, running in airplane mode on an iPhone, renders search results in real time: first for the default onomatopoeia query and then for a sequoia query, over millions of embeddings.

What is this?

This is a browser-based search engine for Wikipedia, where you can search for “the reddish tall trees on the san francisco coast” and find results like “Sequoia sempervirens” (the name of a redwood tree). The browser downloads the database, and search happens offline. Two million Wikipedia pages with their titles are roughly a 100MB download, and the final results appear in under 50 milliseconds. This uses sentence transformers to embed documents, product quantization to compress embeddings, pq.js to run distance computation in the browser, and transformers.js to run sentence transformers in the browser for queries.

Is this good?

Yes.

Search over millions of documents happens in real time, completely offline. Results stream back every 10ms on a mobile device, and search results update gradually as the database is sequentially scanned.

Timing: first results in 21ms, 70% of final results in 116ms, faceted search in 23ms

The distance computation over 2M embeddings takes 250ms in total, over 20 iterations, and we can display intermediate results with a faceted top-10 computation that takes 8ms. To display intermediate results, we run batches of 100k distance computations at a time, then compute the top-k and repaint after a 30ms timer runs out.
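For illustration, here is a minimal Python sketch of that batching and running-top-k logic. The production code is JavaScript in the browser; `score_batch` stands in for the PQ distance computation described later, and all names here are illustrative.

```python
# Score the database in 100k-row batches, keep a running top-k, and let the UI
# repaint between batches (the real app repaints on a ~30ms timer).
import numpy as np

def streaming_topk(n_rows, score_batch, k=10, batch=100_000):
    best_d = np.full(k, np.inf)
    best_i = np.full(k, -1)
    for start in range(0, n_rows, batch):
        d = score_batch(start, min(start + batch, n_rows))  # distances for this batch
        cand_d = np.concatenate([best_d, d])
        cand_i = np.concatenate([best_i, np.arange(start, start + len(d))])
        keep = np.argpartition(cand_d, k)[:k]               # running top-k
        best_d, best_i = cand_d[keep], cand_i[keep]
        yield best_i[np.argsort(best_d)]                    # intermediate results
```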

We order embeddings by compressed page size: more information-dense pages are the first to be analyzed and returned in a top-10 ranking, and might be more useful in a search result. Note that the search results continue to stream in and update the top results, but most of the lower-page-size pages do not rank in the top 10, so the search appears faster than if we did not update the UI until everything returned.

70% of the final search results were in the first 670K embeddings, which in total rendered in 116 milliseconds (note the topk timing at the bottom left, which counts distance calculations as positive times and topk calculations as negative times):

(Screenshot: the final redwoods search query, with timing data.)

Note that changing the facet for the onomatopoeia search (changing the first letter of the page to return) avoided running a new embedding, and returned in under 25ms. Changing the number of results from top 10 to top 20 or top 100 is similarly instantaneous.

200k embeddings and page titles compress down to 10MB in Arrow

The database is small enough to support casual use cases of up to a million embeddings without special treatment.

Note that, for high performance, we use Arrow instead of JSON. Arrow can store our 8-bit integer product quantization arrays compactly, and Arrow can store an array of strings as an array of indexes into one buffer, which is a significant savings over a million Javascript string objects.

These ONNX models run in WASM for now

There is no GPU acceleration, only WebAssembly, so far. ONNX is a convenient compile target. WebGPU is still very new, and is an eagerly-anticipated future direction.

Step 1: embed all of Wikipedia with a sentence transformer

There are a lot of sentence transformers to choose from! There is a leaderboard of sentence embeddings: https://huggingface.co/blog/mteb

The all-MiniLM-L6-v2 model has reasonable performance (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), is small, and is available as ONNX weights (https://huggingface.co/Xenova/all-MiniLM-L6-v2/) for transformers.js (https://github.com/xenova/transformers.js).
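As a rough sketch of the embedding step, assuming the sentence-transformers Python package; the sharding and batching of the real pipeline are omitted, and the page texts are placeholders:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
pages = [
    "Sequoia sempervirens is a species of redwood on the California coast ...",
    "Onomatopoeia is a word that phonetically imitates the sound it describes ...",
]
# One 384-dimensional float32 vector per page.
embeddings = model.encode(pages, normalize_embeddings=True)
```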

Step 2: use product quantization to compress embeddings

6M pages * 384-dimension embeddings * 32-bit floats is over 9GB. Even a million embeddings in float16 precision is 800MB. This is too large for casual usage.
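For reference, that arithmetic in bytes:

```python
# Storage arithmetic for the raw embeddings.
full_precision = 6_000_000 * 384 * 4  # float32: 9,216,000,000 bytes ≈ 9.2 GB
half_precision = 1_000_000 * 384 * 2  # float16:   768,000,000 bytes ≈ 800 MB
```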

As a first approximation, one way to choose the top million is to pick the pages with the most information: compress each page and see how many bytes come out. Lists would be overrepresented (lists are less compressible than general text) and there is no appreciation of the link structure between pages, but it is cheap to compute and easy to start with.
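A sketch of that heuristic, using zlib; representing the dump as a list of {"title", "text"} dicts is a hypothetical shape for illustration, not the pipeline's actual data structure.

```python
import zlib

def compressed_size(text: str) -> int:
    # Information density proxy: how many bytes the page compresses down to.
    return len(zlib.compress(text.encode("utf-8")))

pages = [
    {"title": "Sequoia sempervirens", "text": "Sequoia sempervirens is ..."},
    {"title": "Onomatopoeia", "text": "Onomatopoeia is ..."},
]

# Keep the most information-dense pages.
ranked = sorted(pages, key=lambda p: compressed_size(p["text"]), reverse=True)
top_million = ranked[:1_000_000]
```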

FAISS (https://faiss.ai) is a highly popular server-side embedding search engine, with a lot of tuning knobs for creating different styles of search indices. Autofaiss (https://github.com/criteo/autofaiss) will usually recommend using Product Quantization, after creating IVF indices or HNSW indices (Pinecone has a great intro to vector indexing: https://www.pinecone.io/learn/vector-indexes/).

Product quantization is exceptionally simple to implement: creating a ‘distance table’ is under 5 lines of numpy and using that to find distances is a one-liner.
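To back up that claim, here is a numpy sketch of asymmetric PQ distances. The shapes are assumptions that match 8-bit codes with 48 subquantizers of 256 centroids each (48 × 8 dims = 384), and the variable names are illustrative.

```python
import numpy as np

def distance_table(query, codebook):
    # query: (384,), codebook: (48, 256, 8) -> squared L2 distance from each
    # query subvector to each centroid, giving a (48, 256) table.
    n_sub, n_cent, d_sub = codebook.shape
    q = query.reshape(n_sub, 1, d_sub)
    return ((codebook - q) ** 2).sum(axis=-1)

def pq_distances(table, codes):
    # codes: (N, 48) uint8 -> look up each code in the table and sum over
    # subquantizers; this is the one-liner distance computation.
    return table[np.arange(codes.shape[1]), codes].sum(axis=1)
```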

Intermezzo: faceted search

Oftentimes you will want to search within some product subcategory, like finding only PDFs in a web search, or only results in ancient Latin. Splitting the distance computation from the top-10 ranking allows us to fudge the distances in flight before ranking; for million-scale search, this is highly feasible. In this search of Wikipedia, there is one search facet: the first character of the page. Because the top-k ranking is separate from the distance computation, we can explore different facet values in real time without recomputing query embeddings or distances.
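A sketch of that fudge step, assuming the distances have already been computed and that an array of per-page facet values (here, first characters) is available; names are illustrative.

```python
import numpy as np

def faceted_topk(distances, first_chars, facet, k=10):
    # Push non-matching pages to infinity, then rank; the query embedding and
    # the distance scan are never re-run when the facet changes.
    masked = np.where(first_chars == facet, distances, np.inf)
    idx = np.argpartition(masked, k)[:k]    # unordered top-k
    return idx[np.argsort(masked[idx])]     # ordered top-k
```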

Step 3: hand-write ONNX

ONNX has a specific opcode that does exactly the product quantization step! That opcode is GatherElements. Unfortunately, the PyTorch ONNX export does not use this special opcode for the model as written. Thankfully, there is abundant support for reading and writing ONNX outside of a PyTorch compilation step.

A useful graphical editing tool for ONNX is ONNX-modifier, at https://github.com/ZhangGe6/onnx-modifier , which presents a friendly interface to add elements into the dataflow graph of any exported ONNX model.

Taking the multiple ONNX steps that the PyTorch model compiles to, and replacing them all with this one opcode, makes the distance computation roughly 4x faster.
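For concreteness, here is a minimal sketch of hand-building such a graph with the onnx helper API. The shapes (a 256 × 48 distance table, i.e. the numpy table above transposed, and int64 codes) and names are assumptions, not the app's exact graph; the real 8-bit codes would need a Cast to int64 before GatherElements, omitted here.

```python
import onnx
from onnx import helper, TensorProto

# out[i][j] = dist_table[codes[i][j]][j]: one table entry per page and subquantizer.
gather = helper.make_node(
    "GatherElements", inputs=["dist_table", "codes"], outputs=["partial"], axis=0
)
# Sum over the subquantizer axis to get one distance per page.
axes = helper.make_tensor("axes", TensorProto.INT64, [1], [1])
reduce_sum = helper.make_node(
    "ReduceSum", inputs=["partial", "axes"], outputs=["distances"], keepdims=0
)

graph = helper.make_graph(
    [gather, reduce_sum],
    "pq_distances",
    inputs=[
        helper.make_tensor_value_info("dist_table", TensorProto.FLOAT, [256, 48]),
        helper.make_tensor_value_info("codes", TensorProto.INT64, ["N", 48]),
    ],
    outputs=[helper.make_tensor_value_info("distances", TensorProto.FLOAT, ["N"])],
    initializer=[axes],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
onnx.checker.check_model(model)
onnx.save(model, "pq_distances.onnx")
```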

Step 4: export numpy to Arrow

As mentioned, the Arrow format is much more compact, both in memory and on disk, for storing the embeddings and the metadata (page titles).

Because the Arrow array format only stores one-dimensional data, because we have 48 dimensions of embedding data per page, and because we do not want to wrap the embedding data in another format, we need two separate schemas: one for the metadata (with a hundred thousand rows each) and one for the embeddings (with a hundred thousand × 48 rows each). We reshape the embeddings at load time.
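A hedged sketch of that export, assuming pyarrow; the column names and file layout are illustrative rather than the exact ones the app reads.

```python
import numpy as np
import pyarrow as pa
import pyarrow.feather as feather

titles = ["Sequoia sempervirens", "Onomatopoeia"]                # page titles
codes = np.random.randint(0, 256, size=(len(titles), 48), dtype=np.uint8)

# Metadata schema: one row per page; Arrow stores the strings as offsets into
# a single buffer instead of millions of separate string objects.
meta_table = pa.table({"title": pa.array(titles, type=pa.string())})

# Embedding schema: Arrow arrays are one-dimensional, so flatten (N, 48) to
# (N * 48,) here and reshape back to (N, 48) at load time in the browser.
emb_table = pa.table({"code": pa.array(codes.reshape(-1), type=pa.uint8())})

feather.write_feather(meta_table, "metadata.arrow", compression="uncompressed")
feather.write_feather(emb_table, "embeddings.arrow", compression="uncompressed")
```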

Storing the product quantization codebook in JSON is under 1.0MB, so it is less crucial to optimize this part.

Step 5: let me know what you think 🙂

Lots of the library functions in the full Wikipedia search app should migrate into reusable pq.js components. A lot of the ONNX shapes are pre-baked, so it would be useful to support different quantization levels and different embedding dimensions. Give a shout!