While building Depths AI, I have been scratching my head over how we can make vector databases at least 10x cheaper. Just like you, I hate paying bank-breaking amounts to store vector embeddings.

In this article, I cover 4 possible ways to cut your vector database storage bill 2x, 4x, 8x, and up to 16x (the storage math is sketched right after the list):

  1. 2x: Just store the embeddings as Float16. The simplest trick, so lame that it is surprising it is not the default.
  2. 4x: Slim down even further: approximate the embedding as an array of 8-bit integers, 4x fewer bits per vector embedding.
  3. 8x: Don’t just stop at Float16: also make your vector embedding 4x smaller, so, for example, 384 instead of 1536 dimensions.
  4. 16x: What if we quantize our embeddings down to int8 and also make them 4x smaller?
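Before diving in, here is the back-of-the-envelope storage math for a single 1536-dimension embedding (a size I picked purely for illustration), as a minimal numpy sketch:

```python
import numpy as np

# One 1536-dim embedding at full float32 precision: the baseline.
dims = 1536
vec = np.random.rand(dims).astype(np.float32)

print(vec.nbytes)                                   # 6144 bytes: float32 baseline
print(vec.astype(np.float16).nbytes)                # 3072 bytes: 2x cheaper
print(dims * np.dtype(np.int8).itemsize)            # 1536 bytes: int8, 4x cheaper
print((dims // 4) * np.dtype(np.float16).itemsize)  #  768 bytes: 384 dims at float16, 8x cheaper
print((dims // 4) * np.dtype(np.int8).itemsize)     #  384 bytes: 384 dims at int8, 16x cheaper
```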

What do we compress?

Current dense vector embeddings possibly use more bytes than needed to convey their information. Can we squeeze the same semantic information into fewer bytes? We have 2 ways to do this:

  1. Either make the vector smaller: cut down on the number of dimensions
  2. Keep the dimensions intact, but represent each dimension with a lower-precision number: int8 or float16 instead of float32, using 8 or 16 bits per number respectively, instead of 32 (sketched right after this list).

And of course, we can merge the two techniques.
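To make the precision half concrete, here is a minimal sketch of both casts. The symmetric per-vector scale used for int8 below is one common recipe I am assuming here, not the only way to quantize:

```python
import numpy as np

def to_float16(vec: np.ndarray) -> np.ndarray:
    # float32 -> float16 is a plain cast: same dimensions, half the bytes.
    return vec.astype(np.float16)

def to_int8(vec: np.ndarray) -> tuple[np.ndarray, float]:
    # Symmetric scalar quantization: map [-max|x|, +max|x|] onto [-127, 127].
    # The per-vector scale must be stored alongside the vector for dequantization.
    scale = float(np.abs(vec).max()) / 127.0
    return np.round(vec / scale).astype(np.int8), scale

vec = np.random.randn(1536).astype(np.float32)
quantized, scale = to_int8(vec)
recovered = quantized.astype(np.float32) * scale
print(np.abs(vec - recovered).max())  # worst-case quantization error, typically tiny
```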

The larger the vector embedding, the more scope there is to cut down on less relevant info

Shortening the vector: Principal Component Analysis

The OG ML gang knows the power of PCA, an algorithm that basically:

  1. Takes in a batch of vectors and finds the “components” that define the information stored in those vectors (eigenvectors of their covariance matrix, to be precise)
  2. Then, we simply take, say, the top 25% or top 50% of those components, sorted by how much information (variance, to be precise) each one contributes within the vector. It turns out that for most vector embeddings, ~90% of the information is captured within just the top 50% of components (sketched below).

Obviously, a smaller vector embedding has less room for irrelevance, so compression shows a bigger impact on larger embeddings.
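Here is what those two steps look like in a minimal scikit-learn sketch (the library choice is mine, not prescribed). One caveat: the corpus below is random noise, whose variance spreads evenly across components; real embeddings are heavily correlated, which is exactly why their top components capture so much more:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in corpus: 10,000 fake "document embeddings" at 1536 dims.
docs = np.random.randn(10_000, 1536).astype(np.float32)

# Keep the top 384 components: a 4x reduction in dimensions.
pca = PCA(n_components=384)
compressed_docs = pca.fit_transform(docs)  # shape: (10000, 384)

# Fraction of the total variance ("information") the kept components retain.
print(pca.explained_variance_ratio_.sum())
```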

<aside> 💡

Note that the only drawback of PCA is that you have to separately store the “matrix” used for “compressing” the document vectors, and the mean of all your document vectors as well. These two artifacts are then needed to first “compress” an incoming query to the same size as the document vectors, and then perform the search.

</aside>
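Concretely, those two artifacts are the fitted components matrix and the mean vector; compressing a query is then one subtraction and one matrix multiply, which mirrors what sklearn's PCA.transform does internally. A standalone sketch (re-fitting on fake data so it runs on its own):

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit on the document corpus, then keep only the two artifacts needed at query time.
docs = np.random.randn(10_000, 1536).astype(np.float32)
pca = PCA(n_components=384).fit(docs)

components = pca.components_  # the "compression matrix", shape (384, 1536)
mean = pca.mean_              # the corpus mean, shape (1536,)

# Query time: center with the stored mean, project with the stored matrix.
query = np.random.randn(1536).astype(np.float32)
compressed_query = (query - mean) @ components.T  # shape: (384,)

# Sanity check: this reproduces sklearn's own transform.
assert np.allclose(compressed_query, pca.transform(query[None, :])[0], atol=1e-3)
```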

Show me the numbers, homeboy

I chose two different embedding sizes to drive the point home.