The Best Way to Use Text Embeddings Portably is With Parquet and Polars - Max Woolf

Tags: database embedding programming python

I've previously experimented with storing and retrieving text embeddings using SQLite: I calculated the cosine similarity of each entry against every other entry and stored the scores in a table. Similar entries could then be queried by filtering records on ID and sorting by similarity score. This was a pattern I copied from Simon Willison.
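A minimal sketch of that precomputed-similarity pattern, using toy two-dimensional embeddings and hypothetical table/column names (the real schema may differ):

```python
import sqlite3
import numpy as np

# Toy embeddings keyed by document ID (hypothetical data).
embeddings = {
    1: np.array([1.0, 0.0]),
    2: np.array([0.9, 0.1]),
    3: np.array([0.0, 1.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE similarity (id_a INTEGER, id_b INTEGER, score REAL)")

# Precompute every pairwise score and store both directions.
ids = sorted(embeddings)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        s = cosine(embeddings[a], embeddings[b])
        db.execute("INSERT INTO similarity VALUES (?, ?, ?)", (a, b, s))
        db.execute("INSERT INTO similarity VALUES (?, ?, ?)", (b, a, s))

# Most similar entries to document 1, best first.
rows = db.execute(
    "SELECT id_b, score FROM similarity WHERE id_a = ? ORDER BY score DESC",
    (1,),
).fetchall()
print(rows)
```

Note the quadratic cost: every pair gets a row, which is why this pattern stops scaling once the corpus gets large.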

Max Woolf explores using Apache Parquet files for this purpose: he generated text embeddings for 32,254 Magic: The Gathering cards and calculated the similarity between them. The write-up also includes instructions for reading and writing Parquet files with Polars.

Polars is a relatively new Python library, primarily written in Rust, that supports Apache Arrow, which gives it a large performance advantage over pandas and many other DataFrame libraries.

Max concludes by comparing this method to more traditional vector databases, including SQLite with sqlite-vec.

Notably, SQLite databases are also a single portable file, but interacting with them carries more technical overhead and considerations than Polars' read_parquet() and write_parquet(). One implementation of vector search in SQLite is the sqlite-vec extension, which also allows for simultaneous filtering and similarity calculations.

Discovered via Simon Willison.
