database

I've previously experimented with storing and retrieving text embeddings in SQLite: I calculated the cosine similarity of every pair of entries and stored the scores in a table. Similar entries could then be queried by filtering records on ID and sorting by similarity score. This was a pattern I copied from Simon Willison.
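A minimal sketch of that precomputed-similarity pattern (the schema and data are illustrative, not the exact code I used):

```python
import sqlite3
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

db = sqlite3.connect("embeddings.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS similarities "
    "(id TEXT, other_id TEXT, score REAL, PRIMARY KEY (id, other_id))"
)

# embeddings: mapping of entry ID -> vector, computed elsewhere
embeddings = {"post-a": np.random.rand(384), "post-b": np.random.rand(384)}

# Precompute the score for every pair of entries (O(n^2) comparisons)
for id_a, vec_a in embeddings.items():
    for id_b, vec_b in embeddings.items():
        if id_a != id_b:
            db.execute(
                "INSERT OR REPLACE INTO similarities VALUES (?, ?, ?)",
                (id_a, id_b, cosine_similarity(vec_a, vec_b)),
            )
db.commit()

# Related entries are then a simple filter-and-sort query
related = db.execute(
    "SELECT other_id, score FROM similarities WHERE id = ? ORDER BY score DESC LIMIT 5",
    ("post-a",),
).fetchall()
```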

Max Woolf explores using Apache Parquet files for this purpose, generating text embeddings for 32,254 Magic: The Gathering cards and then calculating the similarity between them. The write-up also includes instructions for reading and writing Parquet files with Polars.

Polars is a relatively new Python library, primarily written in Rust and built on Apache Arrow, which gives it a massive performance advantage over pandas and many other DataFrame libraries.
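A minimal sketch of that round trip (the column names are my own, not Max's):

```python
import polars as pl

# Store each embedding as a list-typed column alongside its identifier
df = pl.DataFrame({
    "card": ["Black Lotus", "Counterspell"],
    "embedding": [[0.12, 0.48, 0.91], [0.33, 0.07, 0.64]],
})

# A single call in each direction; nested types survive the round trip
df.write_parquet("embeddings.parquet")
loaded = pl.read_parquet("embeddings.parquet")
print(loaded.schema)  # the embedding column comes back as a list of floats
```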

Max concludes by comparing this method to more traditional vector databases, among which SQLite (with sqlite-vec) gets a mention.

Notably, a SQLite database is just a single portable file; however, interacting with it involves more technical overhead and considerations than the read_parquet() and write_parquet() of Polars. A prominent implementation of vector search in SQLite is the sqlite-vec extension, which also allows for simultaneous filtering and similarity calculations.

Discovered via Simon Willison.

Read from link

sqlite-vec is a vector search extension for SQLite with install options for Node.js, Python, Rust, rqlite and more. I previously used SQLite to store embeddings when investigating the viability of using a text embedding model to recommend related posts on the blog. However, back then I'd calculated the cosine similarity between every pair of embeddings up front, which is compute intensive. Using this extension would allow me to retrieve the distance between a query embedding and the embeddings stored in the database dynamically, at query time.

The extension can be compiled for various platforms, but the PyPI package currently lacks an ARM release.

I also encountered an error when trying to retrieve distances for embeddings: "OperationalError: A LIMIT or 'k = ?' constraint is required on vec0 knn queries". I wasn't alone in this; the workaround is to use k instead of LIMIT in the query, as in the sketch below. In the coming few weeks I'll be playing around with this extension some more; it looks very promising so far.
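A minimal sketch of how the pieces fit together, based on my reading of the sqlite-vec docs (the table name and dimensions are made up):

```python
import sqlite3
import sqlite_vec
from sqlite_vec import serialize_float32

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)  # loads the compiled extension bundled with the PyPI package
db.enable_load_extension(False)

# A vec0 virtual table holding 4-dimensional float vectors
db.execute("CREATE VIRTUAL TABLE vec_posts USING vec0(embedding float[4])")
db.execute(
    "INSERT INTO vec_posts(rowid, embedding) VALUES (?, ?)",
    (1, serialize_float32([0.1, 0.2, 0.3, 0.4])),
)

# The knn query: 'k = ?' satisfies the constraint the error message asks for
rows = db.execute(
    "SELECT rowid, distance FROM vec_posts "
    "WHERE embedding MATCH ? AND k = ? ORDER BY distance",
    (serialize_float32([0.1, 0.2, 0.3, 0.4]), 3),
).fetchall()
print(rows)
```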

Read from link

Philipp Keller documents three ways of defining a database schema for your tagging strategy, with performance tests and sample queries: the simple "MySQLicious" solution with a single table for items and tags; the "Scuttle" solution with two tables, one for tags and the other for items; and finally the classic associative-table approach, which the author calls the "Toxi" solution, with a table for items, another for tags, and an item-tag mapping table.

The last approach also has a Wikipedia entry that I sometimes refer to when building similar tables, as a subtle reminder.
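For reference, a minimal sketch of the "Toxi" layout in SQLite (table and column names are my own):

```python
import sqlite3

db = sqlite3.connect("tags.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS items (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS tags (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- The associative (mapping) table: one row per item/tag pairing
CREATE TABLE IF NOT EXISTS item_tags (
    item_id INTEGER NOT NULL REFERENCES items(id),
    tag_id  INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (item_id, tag_id)
);
""")

# Items for a given tag become a two-join query
rows = db.execute("""
    SELECT items.title
    FROM items
    JOIN item_tags ON item_tags.item_id = items.id
    JOIN tags ON tags.id = item_tags.tag_id
    WHERE tags.name = ?
""", ("database",)).fetchall()
```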

Read from link

What if I told you that by tuning a few knobs, you can configure SQLite to reach ~8,300 writes/s and ~168,000 reads/s concurrently, with 0 errors?

Some interesting configurations are possible with SQLite today, making it much more versatile even though it isn't designed to be a client/server SQL database. Discovered via Simon Willison's weblog.
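I don't have the article's exact settings to hand, but the usual knobs for this kind of throughput are the journal-mode and synchronization pragmas; a minimal sketch:

```python
import sqlite3

db = sqlite3.connect("app.db")

# WAL lets readers and a writer proceed concurrently instead of blocking each other
db.execute("PRAGMA journal_mode = WAL")

# NORMAL skips an fsync per transaction; durable enough for most apps under WAL
db.execute("PRAGMA synchronous = NORMAL")

# Wait up to 5s for a lock instead of failing immediately with SQLITE_BUSY
db.execute("PRAGMA busy_timeout = 5000")

# Larger page cache (a negative value is a size in KiB)
db.execute("PRAGMA cache_size = -64000")
```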

Read from link