database

I've previously experimented with storing and retrieving text embeddings in SQLite: I calculated the cosine similarity of every pair of entries and stored the scores in a table. Similar entries could then be queried by filtering records on ID and sorting by similarity score. This was a pattern I copied from Simon Willison.
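A minimal sketch of that precomputed-similarity pattern (the schema and data are illustrative, not the exact code I used):

```python
import sqlite3
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

db = sqlite3.connect("embeddings.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS similarities "
    "(id TEXT, other_id TEXT, score REAL, PRIMARY KEY (id, other_id))"
)

# embeddings: mapping of entry ID -> vector, computed elsewhere
embeddings = {"post-a": np.random.rand(384), "post-b": np.random.rand(384)}

# Precompute the score for every pair of entries (O(n^2) comparisons)
for id_a, vec_a in embeddings.items():
    for id_b, vec_b in embeddings.items():
        if id_a != id_b:
            db.execute(
                "INSERT OR REPLACE INTO similarities VALUES (?, ?, ?)",
                (id_a, id_b, cosine_similarity(vec_a, vec_b)),
            )
db.commit()

# Related entries are then a simple filter-and-sort query
related = db.execute(
    "SELECT other_id, score FROM similarities WHERE id = ? ORDER BY score DESC LIMIT 5",
    ("post-a",),
).fetchall()
```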

Max Woolf explores using Apache Parquet files for this purpose, generating text embeddings for 32,254 Magic: The Gathering cards and then calculating the similarity between them. The write-up also includes instructions for reading and writing Parquet files with Polars.

Polars is a relatively new Python library, primarily written in Rust and built on Apache Arrow, which gives it a massive performance advantage over pandas and many other DataFrame libraries.
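A minimal sketch of that round trip (the column names are my own, not Max's):

```python
import polars as pl

# Store each embedding as a list-typed column alongside its identifier
df = pl.DataFrame({
    "card": ["Black Lotus", "Counterspell"],
    "embedding": [[0.12, 0.48, 0.91], [0.33, 0.07, 0.64]],
})

# A single call in each direction; nested types survive the round trip
df.write_parquet("embeddings.parquet")
loaded = pl.read_parquet("embeddings.parquet")
print(loaded.schema)  # the embedding column comes back as a list of floats
```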

Max concludes by comparing this method to more traditional vector databases, among which SQLite (with sqlite-vec) gets a mention.

Notably, a SQLite database is just a single portable file; however, interacting with it involves more technical overhead and considerations than the read_parquet() and write_parquet() of Polars. A prominent implementation of vector search in SQLite is the sqlite-vec extension, which also allows for simultaneous filtering and similarity calculations.

Discovered via Simon Willison.

Read from link

sqlite-vec is a vector search extension for SQLite with install options for Node.js, Python, Rust, rqlite and more. I previously used SQLite to store embeddings when investigating the viability of using a text embedding model to recommend related posts on the blog. However, back then I'd calculated the cosine similarity between every pair of embeddings up front, which is compute intensive. Using this extension would allow me to retrieve the distance between a query embedding and the embeddings stored in the database dynamically, at query time.

The extension can be compiled for various platforms, but the PyPI package currently lacks an ARM release.

I also encountered an error when trying to retrieve distances for embeddings: "OperationalError: A LIMIT or 'k = ?' constraint is required on vec0 knn queries". I wasn't alone in this; the workaround is to use k instead of LIMIT in the query, as in the sketch below. In the coming few weeks I'll be playing around with this extension some more; it looks very promising so far.
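A minimal sketch of how the pieces fit together, based on my reading of the sqlite-vec docs (the table name and dimensions are made up):

```python
import sqlite3
import sqlite_vec
from sqlite_vec import serialize_float32

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)  # loads the compiled extension bundled with the PyPI package
db.enable_load_extension(False)

# A vec0 virtual table holding 4-dimensional float vectors
db.execute("CREATE VIRTUAL TABLE vec_posts USING vec0(embedding float[4])")
db.execute(
    "INSERT INTO vec_posts(rowid, embedding) VALUES (?, ?)",
    (1, serialize_float32([0.1, 0.2, 0.3, 0.4])),
)

# The knn query: 'k = ?' satisfies the constraint the error message asks for
rows = db.execute(
    "SELECT rowid, distance FROM vec_posts "
    "WHERE embedding MATCH ? AND k = ? ORDER BY distance",
    (serialize_float32([0.1, 0.2, 0.3, 0.4]), 3),
).fetchall()
print(rows)
```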

Read from link

Philipp Keller documents three ways of defining a database schema for your tagging strategy, with performance tests and sample queries: the simple "MySQLicious" solution with a single table for items and tags; the "Scuttle" solution with two tables, one for tags and the other for items; and finally the classic associative-table approach, which the author calls the "Toxi" solution, with a table for items, another for tags, and an item-tag mapping table.

The last approach also has a Wikipedia entry that I sometimes refer to when building similar tables, as a subtle reminder.
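For reference, a minimal sketch of the "Toxi" layout in SQLite (table and column names are my own):

```python
import sqlite3

db = sqlite3.connect("tags.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS items (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS tags (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- The associative (mapping) table: one row per item/tag pairing
CREATE TABLE IF NOT EXISTS item_tags (
    item_id INTEGER NOT NULL REFERENCES items(id),
    tag_id  INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (item_id, tag_id)
);
""")

# Items for a given tag become a two-join query
rows = db.execute("""
    SELECT items.title
    FROM items
    JOIN item_tags ON item_tags.item_id = items.id
    JOIN tags ON tags.id = item_tags.tag_id
    WHERE tags.name = ?
""", ("database",)).fetchall()
```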

Read from link

What if I told you that by tuning a few knobs, you can configure SQLite to reach ~8,300 writes/s and ~168,000 reads/s concurrently, with 0 errors?

Some interesting configurations are possible with SQLite today, making it much more versatile even though it isn't designed to be a client/server SQL database. Discovered via Simon Willison's weblog.
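I don't have the article's exact settings to hand, but the usual knobs for this kind of throughput are the journal-mode and synchronization pragmas; a minimal sketch:

```python
import sqlite3

db = sqlite3.connect("app.db")

# WAL lets readers and a writer proceed concurrently instead of blocking each other
db.execute("PRAGMA journal_mode = WAL")

# NORMAL skips an fsync per transaction; durable enough for most apps under WAL
db.execute("PRAGMA synchronous = NORMAL")

# Wait up to 5s for a lock instead of failing immediately with SQLITE_BUSY
db.execute("PRAGMA busy_timeout = 5000")

# Larger page cache (a negative value is a size in KiB)
db.execute("PRAGMA cache_size = -64000")
```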

Read from link