programming

I've previously experimented with storing and retrieving text embeddings using SQLite, and opted to calculate the cosine similarity between every pair of entries and store the scores in a table. Similar entries could then be queried by filtering records on ID and sorting by similarity score. This was a pattern I copied from Simon Willison.
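A minimal sketch of that pattern, using only the standard library. The table name, column names and toy vectors are my own assumptions for illustration, not what either of us actually used.

```python
import itertools
import math
import sqlite3


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings standing in for real model output (hypothetical data).
embeddings = {
    1: [1.0, 0.0, 0.0],
    2: [0.9, 0.1, 0.0],
    3: [0.0, 1.0, 0.0],
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE similarities ("
    " id INTEGER, other_id INTEGER, score REAL,"
    " PRIMARY KEY (id, other_id))"
)

# Store each pair in both directions so a lookup only has to filter on `id`.
for (id_a, vec_a), (id_b, vec_b) in itertools.combinations(embeddings.items(), 2):
    score = cosine_similarity(vec_a, vec_b)
    conn.executemany(
        "INSERT INTO similarities VALUES (?, ?, ?)",
        [(id_a, id_b, score), (id_b, id_a, score)],
    )


def most_similar(conn, entry_id, limit=5):
    """Return (other_id, score) rows for one entry, best match first."""
    return conn.execute(
        "SELECT other_id, score FROM similarities"
        " WHERE id = ? ORDER BY score DESC LIMIT ?",
        (entry_id, limit),
    ).fetchall()
```

Calling `most_similar(conn, 1)` returns the other entries ordered by score. The trade-off is that the table grows with the square of the number of entries, which is part of what makes the alternatives interesting.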

Max Woolf explores using Apache Parquet files for this purpose, generating text embeddings and then calculating the similarity between 32,254 Magic: The Gathering cards. The write-up also includes instructions for reading and writing Parquet files using Polars.

Polars is a relatively new DataFrame library for Python, written primarily in Rust and built on Arrow, which makes it considerably faster than pandas and many other DataFrame libraries.

Max concludes by comparing this method with more traditional vector databases, including SQLite with the sqlite-vec extension.

Notably, SQLite databases are just a single portable file, however interacting with them has more technical overhead and considerations than the read_parquet() and write_parquet() of polars. One notable implementation of vector databases in SQLite is the sqlite-vec extension, which also allows for simultaneous filtering and similarity calculations.

Discovered via Simon Willison.

Read from link

Adrian Cockcroft writes about A/B testing and building an "abtest" service where "Each customer should be in a small number of tests." and the "abtest service should be used across your entire system, for all experiments that touch a customer."

There are three ways a user is enrolled in a test: (1) new users at sign-up, (2) a feature-specific test when the user meets a condition, and (3) a random selection of existing users.
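To make the idea concrete, here is a rough sketch of what a central enrolment service could look like. Everything in it is my own assumption rather than anything from the post: the class name, the hash-based assignment, and the cap of two tests per customer (the post only says "a small number").

```python
import hashlib

MAX_TESTS_PER_CUSTOMER = 2  # assumed cap; the post only says "a small number"


class ABTestService:
    """Hypothetical single point of enrolment for every experiment."""

    def __init__(self):
        self.assignments = {}  # customer_id -> {test_name: variant}

    @staticmethod
    def _bucket(key, size):
        """Map a string to a stable integer in range(size)."""
        digest = hashlib.sha256(key.encode()).hexdigest()
        return int(digest, 16) % size

    def enroll(self, customer_id, test_name, variants=("control", "treatment")):
        """Return the customer's variant, or None if they are already in too many tests."""
        tests = self.assignments.setdefault(customer_id, {})
        if test_name in tests:
            return tests[test_name]  # enrolment is sticky
        if len(tests) >= MAX_TESTS_PER_CUSTOMER:
            return None
        index = self._bucket(f"{test_name}:{customer_id}", len(variants))
        tests[test_name] = variants[index]
        return tests[test_name]

    def in_sample(self, customer_id, test_name, fraction):
        """Deterministically select roughly `fraction` of existing customers."""
        bucket = self._bucket(f"sample:{test_name}:{customer_id}", 10_000)
        return bucket < fraction * 10_000
```

The three routes then become three call sites: call `enroll` at sign-up, call it when a customer meets a feature's condition, and call it for existing customers where `in_sample` is true. Hashing keeps assignment stable without extra state, although a real service would persist the assignments.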

Discovered via Matt Weagle on Mastodon.

Read from link

When submitting a document, design change or any other proposal for peer review, we've all been exposed to the law of triviality: a disproportionate number of comments like "typo?", "change the variable name?" or other trivial objections that don't address the main topic.

The law of triviality is C. Northcote Parkinson's 1957 argument that people within an organization commonly give disproportionate weight to trivial issues. Parkinson provides the example of a fictional committee whose job was to approve the plans for a nuclear power plant spending the majority of its time on discussions about relatively minor but easy-to-grasp issues, such as what materials to use for the staff bicycle shed, while neglecting the proposed design of the plant itself, which is far more important and a far more difficult and complex task.

The terms bike-shedding and bike-shed effect are used in software development to refer to the same idea, which was brought up in a 1999 FreeBSD email thread.

Read from link

I was not aware of how much of an accessibility issue having only a dark colour scheme poses. I know people have preferences, but it was surprising to see the replies on Nai's Mastodon post about how difficult white text on a dark background is to read for some people with astigmatism.

But there are some people (like me) who may be visually impaired. Astigmatism, for example, can make reading text that is white on dark a real PITA. An effect known as "halation" occurs, where each letter behaves as if it were a flashlight, gaining its own halo of light and making all text read more blurry than normal.

No matter how good your glasses are, astigmatism still causes you to see a little blurry—it's something you get used to. But this damn effect makes all the text read as if you don't have your glasses on, or even worse, leading to much more tired eyes or even pain.

Linked in the thread is a Vice article whose author describes similar difficulty reading dark colour schemes with astigmatism, and also explains why dark backgrounds work better for others.

My own very-astigmatic eyes are exhausted by dark mode, but for many others, dark themes are an accessibility benefit. White backgrounds emphasize floaters, those tiny spots of fibers that appear in some people’s vision. People with disorders like photophobia or keratoconus, conditions that cause high sensitivity to light, might read more easily with dark themes.

Read from link

"Simplicity is a great virtue but it requires hard work to achieve it and education to appreciate it. And to make matters worse: complexity sells better."— Edsger Dijkstra

Derek Kedziora made some good comments on Eugene Yan's post titled "Simplicity is An Advantage but Sadly Complexity Sells Better".

Removing unnecessary complexity is a thankless job.

Read from link

Philipp Keller documents three database schemas for a tagging strategy, with performance tests and sample queries: the simple "MySQLicious" solution, with a single table holding both items and tags; the "Scuttle" solution, with two tables, one for tags and the other for items; and the classic associative-table approach, which the author calls the "Toxi" solution, with a table for items, another for tags, and a table mapping items to tags.
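For reference, a small sketch of the three-table "Toxi" layout in SQLite. The table and column names and the helper functions are my own choices for illustration; the original write-up has its own schema and queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    -- The mapping table holds one row per (item, tag) pair.
    CREATE TABLE item_tags (
        item_id INTEGER NOT NULL REFERENCES items(id),
        tag_id INTEGER NOT NULL REFERENCES tags(id),
        PRIMARY KEY (item_id, tag_id)
    );
    """
)


def add_item(conn, title, tag_names):
    """Insert an item, creating any tags that do not exist yet."""
    item_id = conn.execute(
        "INSERT INTO items (title) VALUES (?)", (title,)
    ).lastrowid
    for name in dict.fromkeys(tag_names):  # drop duplicate names, keep order
        conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
        tag_id = conn.execute(
            "SELECT id FROM tags WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute("INSERT INTO item_tags VALUES (?, ?)", (item_id, tag_id))
    return item_id


def items_with_all_tags(conn, tag_names):
    """Return titles of items carrying every one of the given tags."""
    tag_names = list(dict.fromkeys(tag_names))
    placeholders = ", ".join("?" for _ in tag_names)
    query = f"""
        SELECT items.title FROM items
        JOIN item_tags ON item_tags.item_id = items.id
        JOIN tags ON tags.id = item_tags.tag_id
        WHERE tags.name IN ({placeholders})
        GROUP BY items.id
        HAVING COUNT(DISTINCT tags.id) = ?
        ORDER BY items.title
    """
    return [row[0] for row in conn.execute(query, (*tag_names, len(tag_names)))]
```

The `HAVING COUNT(...)` clause is what turns "any of these tags" into "all of these tags". Dropping it gives a union query instead of an intersection.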

The last approach also has a Wikipedia entry that I sometimes refer to as a reminder when building similar tables.

Read from link

A clear, illustrated example of using the term frequency-inverse document frequency (TF-IDF) measure to determine the importance of words within a collection of documents. Taking this one step further, Jana Vembunarayanan, the blog's author, uses cosine similarity to match a search query against the documents and return the most relevant ones.
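A compact sketch of the same idea in plain Python. The whitespace tokenizer, the unsmoothed `log(N / df)` weighting and the function names are my own simplifying assumptions; the post may weight terms differently.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_idf(documents):
    """Inverse document frequency: log(N / number of documents containing the term)."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log(len(documents) / df) for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict; terms unknown to the corpus are dropped."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors, 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs, most relevant first."""
    idf = build_idf(documents)
    query_vec = tfidf_vector(query, idf)
    scored = [(cosine(query_vec, tfidf_vector(doc, idf)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

With this weighting, a term that appears in every document gets an IDF of zero and so contributes nothing to the ranking, which is the behaviour TF-IDF is meant to capture.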

Read from link

What if I told you that by tuning a few knobs, you can configure SQLite to reach ~8,300 writes / s and ~168,000 read / s concurrently, with 0 errors

The post covers some interesting configurations that are possible with SQLite today and make it much more versatile, even though it isn't designed to be a client/server SQL database. Discovered via Simon Willison's weblog.
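As a rough idea of what "tuning a few knobs" can look like from Python, here is a sketch that applies a handful of commonly recommended pragmas. The specific values and the `open_tuned` helper are my own assumptions, not necessarily the settings from the post, and I haven't benchmarked them.

```python
import os
import sqlite3
import tempfile


def open_tuned(path):
    """Open a connection with settings often suggested for concurrent use."""
    conn = sqlite3.connect(path)
    # Write-ahead logging lets readers keep working while a writer is active.
    conn.execute("PRAGMA journal_mode = WAL")
    # NORMAL syncs less often than FULL; commonly paired with WAL mode.
    conn.execute("PRAGMA synchronous = NORMAL")
    # Wait up to 5 seconds for a lock instead of failing immediately.
    conn.execute("PRAGMA busy_timeout = 5000")
    # Foreign key enforcement is off by default and is set per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    return conn


# WAL mode needs a real file, so this uses a temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "example.db")
conn = open_tuned(db_path)
```

Most of these settings apply per connection, so they need to run every time a connection is opened. The journal mode is the exception: WAL is stored in the database file and persists.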

Read from link

A backdoor containing malicious code was introduced into xz, a widely used open source compression tool. This in turn affected a number of applications and distributions, the most notable of which are Fedora, Debian (unstable, experimental) and Homebrew. Evan Boehs has pieced together a timeline of events going as far back as 2021, which tells the story of how JiaT75 used social engineering to become a trusted member of the open source project. Pressure (very harshly so) was applied, seemingly by multiple people, to Lasse Collin, the sole active maintainer at the time, to add another maintainer to xz. This coordinated attempt, lasting two years, is honestly quite shocking.

Read from link

SVG is an interesting and versatile text-based image format. Now I know it's not the Christmas season, but Hunor Márton Borbély has put together an advent calendar for SVG examples, and I've only now started working through them. It's very interactive and informative. I know I'll definitely be using these examples as references in the future.

Read from link