a curated list of database news from authoritative sources

July 29, 2024

A Deep Dive into German Strings

“Strings are Everywhere”! At least according to a 2018 DBTest paper from the Hyper team at Tableau. In fact, strings make up nearly half of the data processed at Tableau. This high prevalence undoubtedly applies to many other companies as well, as the paper’s dataset consists of data analyzed by Tableau’s users. The string-heavy nature of the data makes string processing one of the most important tasks of a database system.

July 25, 2024

The Hidden Cost of Data Movement

Recently, Mark Raasveldt of DuckDB wrote an excellent post about why memory management is crucial for efficient data processing. In his post, he focuses on the cost of having data on disk and moving it to memory. After all, everyone knows that having data in memory is what you want. As Jim Gray famously said in 2006:

Tape is Dead, Disk is Tape, Flash is Disk, RAM Locality is King

July 23, 2024

July 22, 2024

Optimizing aggregation in the Vitess query planner

The Vitess query planner takes multiple passes over a query plan to optimize it as much as possible before execution. A recent tricky bug report led to an improvement in how the optimizer functions.

An Interesting Optimization

I recently encountered an intriguing bug. A user reported that their query was causing vtgate to fetch a large amount of data, sometimes resulting in an Out Of Memory (OOM) error. For a deeper understanding of grouping and aggregations on Vitess, I recommend reading this prior blog post.

The problematic query was:

select sum(user.type) from user join user_extra on user.team_id = user_extra.id group by user_extra.id order by user_extra.id;

The planner was unable to delegate aggregation to MySQL, leading to the fetching of a significant amount of data for aggregation.
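To make the cost of that missed delegation concrete, here is a minimal Python sketch. It is not Vitess code and does not reflect how the planner is implemented. The `SHARDS` data, the function names, and the assumption that each shard already holds joined `(user_extra.id, user.type)` rows are all hypothetical and chosen only for illustration. It compares how many rows reach the gateway when it aggregates everything itself with how many arrive when each shard sums its own groups first.

```python
from collections import defaultdict

# Hypothetical data: each shard holds already-joined
# (user_extra_id, user_type) rows. Real data would live in MySQL.
SHARDS = [
    [(1, 10), (1, 20), (2, 5)],
    [(1, 1), (2, 7), (2, 8), (3, 4)],
]


def aggregate_at_gateway(shards):
    """Fetch every row, then group and sum in the gateway."""
    fetched = [row for shard in shards for row in shard]
    totals = defaultdict(int)
    for group_id, value in fetched:
        totals[group_id] += value
    return dict(sorted(totals.items())), len(fetched)


def aggregate_on_shards(shards):
    """Each shard returns one partial sum per group; the gateway merges them."""
    partials = []
    for shard in shards:
        local = defaultdict(int)
        for group_id, value in shard:
            local[group_id] += value
        partials.extend(local.items())
    totals = defaultdict(int)
    for group_id, partial_sum in partials:
        totals[group_id] += partial_sum
    return dict(sorted(totals.items())), len(partials)


if __name__ == "__main__":
    print(aggregate_at_gateway(SHARDS))
    print(aggregate_on_shards(SHARDS))
```

Both functions produce the same per-group sums. The second one transfers at most one row per group per shard, so the data reaching the gateway scales with the number of groups, not the number of joined rows.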

July 19, 2024

July 16, 2024

Why German Strings are Everywhere

German Strings

Strings are conceptually very simple: a string is essentially just a sequence of characters, right? Why, then, does every programming language have its own slightly different string implementation? It turns out that there is a lot more to a string than “just a sequence of characters”.

We’re no different and built our own custom string type that is highly optimized for data processing. Even though we didn’t expect it when we first wrote about it in our inaugural Umbra research paper, many new systems have adopted our format. It is now implemented in DuckDB, Apache Arrow, Polars, and Facebook Velox.

July 11, 2024

Supabase Security Suite

Learn how to use range columns in Postgres to simplify time-based queries and add constraints to prevent overlaps.