sqlite-vec
By: Sid
Vector DB's are something I'll be using soon and while going through them I was reminded of a lecture given by Alex Garcia at Bengaluru System Meetup and it was truly an amazing talk. Before we get started lets have a breif intro with what Vector Databases are 😋
Vector Databases :
As the name suggests...
They store and query a lot of vectors.
But first, What is a Vector? 🤔
•Vector : A vector is basically a list of numbers
•Embeddings : Embeddings are basically a representation of text, images, audio etc in the form of a vector
Nowww, these embeddings actually let you perform calculations on these texts.
How?
Consider these text exampeles
Book A: "A beginner's guide to baking bread and pastries"
Book B: "Complete cookbook for making cakes and cookies"
Book C: "Advanced quantum physics and string theory"
These get converted to embeddings that looks something like this :
Book A: [0.8, 0.9, 0.7, 0.1, ...] # high values for cooking/baking related dimensions
Book B: [0.7, 0.8, 0.6, 0.2, ...] # similar pattern because it's also about baking
Book C: [0.1, 0.1, 0.2, 0.9, ...] # completely different pattern (science related)
Now using these vectors, we can perform multiple operations like finding related documents, finding groups/clusters or even answering questions :))
COOL RIGHT?????
Now, how do they actually store these ?
Consider A single 1024 dimension vector
The size required for this 1024 dimension vector is :
1024 * sizeof(float) = 1024 * 4 = 4096 bytes ~ 4KB
Now, if we were to store 1 million of them, that'd be about 4.1 GB's!
While storing 1 million 1024-dimensional vectors takes ~4.1GB, a traditional full-text search index for similar searchable content would require dozens of megabytes just to index text data.
This realllyyy shows how vector storage can be more space-efficient for certain types of data retrieval
And, what about the querying part??
Something he covered the most during the lecture was
KNN Queries
What are KNN Queries ?
An example that made me understand this was : Let's say we have a restaurant recommendation system:
Now, if we were using a vector DB, the original restaurant descriptions of the foods like
Quick burger joint with drive-through
are converted into embeddings like:
restaurant1_vector = [0.2, 0.8, 0.5, ...] # 1024 dimensions
and now, when the user searches for something like
"Fast food restaurants for burgers"
The query is:
1.First converted into a vector
2.KNN then finds the closest vectors by distance using something like cosine similarity
3.and returns the closest, second closest values and so on ...
What is sqlite-vec?
Now that we've understood all of this.
sqlite-vec is a sqlite extension for vector search✨
•It works inside any custom functions and virtual tables
•It also works inside an sqlite instance.
💡Note : It actually stores your vectors inside your sqlite database ! so all your backup/restore and streaming replication services actually work with sqlite-vec because all the vectors they produce are eventually stored in the DB.\
btw, sqlite-vec doesn't generate embeddings for you but there are multiple services out there that'll do it for you. feel free to try them
Why sqlite-vec?
•PURE SQL API
•runs everywhere
•has different compression techniques to reduce the size of a vector index [tradeoff of quality for speed and size ].
1.One example of this is Binary Quantization : where if the value is >0.0 store as 1 else 0. This basically stores a 32-bit floating vector as a bit vector[THATS A 32X SIZE REDUCTION!!! 🥳🥳🥳🥳].
It uses hamming distance instead of cosine distance btw
2.Another one is Scalar Quantization : where you basically scale a 32-bit floating point into a int8, int16, float16, etc but for now sqlite-vec supports only int8 so thats a (4x SIZE REDUCTION!! 🥳🥳)
3.Matryoshka Embeddings
•has Hybrid Search [Vector Search + Full-Text search] [read this blog plspls, really cool]
•has Metadata Filtering
What did I learn from this ?
1.How cool sqlite actually is => I'm gonna give sqllite-vec a shot and you can too btw here
2.He spoke about how cool sqlite extensions are and what they can and can't do. He listed a few cool extensions as well. Imma try them out
•sqlite-http
•sqlite-html
•sqlite-lines
•sqlite-path
•sqlite-url
3.Try out vector dbs and benchmark them and try building something on these