February 03, 2026
February 02, 2026
Importance of Tuning Checkpoint in PostgreSQL
February 01, 2026
The Doctor's On-Call Shift solved with SQL Assertions
In a previous article, I explained that enforcing application-level rules, such as “each shift must always have at least one doctor on call”, typically requires either serializable isolation or explicit locking.
There's another possibility: enforce the rules in the database, so that they fall under ACID’s C (consistency) rather than I (isolation), using SQL assertions. SQL databases have implemented only part of the SQL standard, limiting constraints to single-row CHECK constraints, unique constraints across rows of the same table, and referential integrity constraints between tables. Oracle recently implemented one of the missing parts: SQL assertions, which can express cross-row and cross-table conditions, including some joins and subqueries.
Oracle 23.26.1
Here is an example. SQL assertions are available in Oracle AI Database 26ai Enterprise Edition Release 23.26.1.0.0. This is how I started a Docker container:
# get the image (12GB)
docker pull container-registry.oracle.com/database/enterprise:23.26.1.0
# start the container
docker run -d --name ora26 container-registry.oracle.com/database/enterprise:23.26.1.0
# wait a few minutes for the database to start
until docker logs ora26 | grep "DATABASE IS READY TO USE" ; do sleep 1 ; done
# use a weak password for this lab
docker exec -it ora26 ./setPassword.sh "franck"
# create a user with the right privileges to test assertions
docker exec -i ora26 sqlplus sys/franck@ORCLPDB1 as sysdba <<'SQL'
grant connect, resource to franck identified by franck;
grant create assertion to franck;
grant alter session to franck;
alter user franck quota unlimited on users;
SQL
# Start the command-line
docker exec -it ora26 sqlplus franck/franck@ORCLPDB1
Create the single-table schema
I created the same table as in the previous posts:
SQL> CREATE TABLE doctors (
shift_id INT NOT NULL,
name VARCHAR2(42) NOT NULL,
on_call BOOLEAN NOT NULL,
CONSTRAINT pk_doctors PRIMARY KEY (shift_id, name)
);
Table DOCTORS created.
Two doctors are on-call for the shift '1':
SQL> INSERT INTO doctors VALUES
(1, 'Alice', true),
(1, 'Bob', true)
;
2 rows inserted.
SQL> COMMIT
;
Commit complete.
According to the SQL-92 standard, the following assertion would guarantee that, for every shift, the number of doctors with on_call = true is ≥ 1:
CREATE ASSERTION at_least_one_doctor_on_call_per_shift
CHECK (
NOT EXISTS (
SELECT shift_id
FROM doctors
GROUP BY shift_id
HAVING COUNT(CASE WHEN on_call THEN 1 END) < 1
)
);
Unfortunately, this is not supported yet:
SQL> CREATE ASSERTION at_least_one_doctor_on_call_per_shift
CHECK (
NOT EXISTS (
SELECT shift_id
FROM doctors
GROUP BY shift_id
HAVING COUNT(CASE WHEN on_call THEN 1 END) < 1
)
);
FROM doctors
*
ERROR at line 5:
ORA-08689: CREATE ASSERTION failed
ORA-08661: Aggregates are not supported.
Help: https://docs.oracle.com/error-help/db/ora-08689/
Oracle implements a pragmatic, performance‑oriented subset of SQL‑92 assertions. It uses internal change tracking, similar to session‑scope materialized view logs, to limit what must be validated. Aggregates are currently disallowed, and assertions are mainly implemented as anti‑joins.
Here is a more creative way to express “Every shift must have at least one on-call doctor” as a double negation: “There must not exist any doctor who belongs to a shift that has no on-call doctor”:
DROP ASSERTION IF EXISTS no_shift_without_on_call_doctor;
CREATE ASSERTION no_shift_without_on_call_doctor
CHECK (
NOT EXISTS (
SELECT 'any doctor'
FROM doctors
WHERE NOT EXISTS (
SELECT 'potential on-call doctor in same shift'
FROM doctors on_call_doctors
WHERE on_call_doctors.shift_id = doctors.shift_id
AND on_call_doctors.on_call = TRUE
)
)
);
If the inner NOT EXISTS is true, the shift has ZERO on-call doctors: we have found a doctor who belongs to a shift with no on-call doctor, so the outer NOT EXISTS becomes false and the assertion is violated.
I replay the conflicting changes from the previous post. In one session, Bob removes his on-call status:
SQL> UPDATE doctors
SET on_call = false
WHERE shift_id = 1 AND name = 'Bob'
;
In another session, Alice also tries to remove her on-call status:
SQL> UPDATE doctors
SET on_call = false
WHERE shift_id = 1 AND name = 'Alice'
;
Bob’s update succeeds. Before Bob commits, Alice’s update hangs, waiting on enq: AN - SQL assertion DDL/DML. Once Bob commits, Alice's statement fails:
SQL> update doctors set on_call = false where shift_id = 1 and name = 'Alice';
update doctors set on_call = false where shift_id = 1 and name = 'Alice'
*
ERROR at line 1:
ORA-08601: SQL assertion (FRANCK.NO_SHIFT_WITHOUT_ON_CALL_DOCTOR) violated.
Help: https://docs.oracle.com/error-help/db/ora-08601/
This prevents race condition anomalies by reading not only the current state—which, without a serializable isolation level, is vulnerable to write-skew anomalies—but also the ongoing changes to other rows from other sessions.
Some internals: change tracking and enqueue
You can trace what happens, including the recursive statements and the locks:
ALTER SESSION SET EVENTS 'sql_trace bind=true, wait=true';
ALTER SESSION SET EVENTS 'trace[ksq] disk medium';
ALTER SESSION SET tracefile_identifier=AliceRetry;
UPDATE doctors
SET on_call = false
WHERE shift_id = 1 AND name = 'Alice'
;
The update stores the change in an internal change tracking table prefixed by ORA$SA$TE_ (SQL Assertion Table Event):
INSERT INTO "FRANCK"."ORA$SA$TE_DOCTORS"
(DMLTYPE$$, OLD_NEW$$, SEQUENCE$$, CHANGE_VECTOR$$, ROW$$ ,"SHIFT_ID","ON_CALL")
VALUES (
'U', -- update
'O', -- old value
2, -- ordered session-level sequence ORA$SA$SEQ$$
..., -- change vector bits
'0000031D.0000.0000', -- ROWID
1, true -- column values
);
INSERT INTO "FRANCK"."ORA$SA$TE_DOCTORS"
(DMLTYPE$$, OLD_NEW$$, SEQUENCE$$, CHANGE_VECTOR$$, ROW$$ ,"SHIFT_ID","ON_CALL")
VALUES (
'U', -- update
'N', -- new value
2, -- ordered per-session sequence ORA$SA$SEQ$$
'0a', -- change vector bits to find quickly which column has changed
'0000031D.0000.0000', -- ROWID
1, false -- column values
);
You can spot the column names used in materialized view logs—introduced in Oracle7 as a “snapshot log”. It’s an internal feature from 1992 used in 2026 to implement a 1992 SQL feature 🤓 However, in this context the table isn’t a regular MV log, but an internal GTT (global temporary table) with special read restrictions:
SQL> select * from "FRANCK"."ORA$SA$TE_DOCTORS";
select * from "FRANCK"."ORA$SA$TE_DOCTORS"
*
ERROR at line 1:
ORA-08709: Reads from SQL assertion auxiliary tables are restricted.
Help: https://docs.oracle.com/error-help/db/ora-08709/
This change-tracking table is used only to decide whether revalidation is needed and what to lock, so that only what’s necessary is revalidated, without scanning whole tables. For example, here it can detect that only one SHIFT_ID needs validation and use it as the resource to lock, similar to the range locks used by databases that provide a serializable isolation level. This is an alternative to a data-modeling solution with a dedicated "shifts" table and explicit SELECT FOR UPDATE locking on it.
During validation, the lock type is AN, distinct from the well-known TX and TM lock types. The ksq trace shows an exclusive lock acquired by Bob's session once it detects that the changes may require validating the assertion:
ORCLCDB_ora_7785_BOB.trc:
2026-01-31 23:44:54.587860*:ksq.c@7635:ksqcmi():ksqcmi AN-00011B78-C8CC96CA-739F2901-00000000 mode=0 timeout=0 lockmode=6 lockreq=0
The locked resource is a hash bucket computed from the assertion definition combined with the SHIFT_ID value. Alice's session requested a similar lock and waited on enq: AN - SQL assertion DDL/DML:
ORCLCDB_ora_7592_ALICE.trc:
2026-01-31 22:55:53.607221*:ksq.c@7635:ksqcmi(): ksqcmi AN-00011B78-C8CC96CA-739F2901-00000000 mode=6 timeout=21474836 lockmode=0 lockreq=0
This is where Alice's session waited during her update, before the assertion was validated.
Conclusion
With Oracle 23.26.1, SQL assertions finally move a long‑standing SQL‑92 feature from theory into practical enforcement, combining declarative constraint semantics with change tracking and fine‑grained locking.
There are now three solutions to the doctor's on-call shift problem:
- Serializable isolation level (not in Oracle Database)
- Data model and explicit locking from the application (all databases)
- SQL Assertion when the business logic is in the database (only Oracle Database)
January 31, 2026
OSTEP Chapters 4,5
I recently started reading "Operating Systems: Three Easy Pieces" (OSTEP) as part of Phil Eaton's offline reading group. We are tackling a very doable pace of 2 chapters a week.
The book is structured into three major parts: Virtualization, Concurrency, and Persistence. It is openly accessible to everyone for free, which is a tremendous contribution to computer science education by the Arpaci-Dusseau couple (Remzi and Andrea).
This is a very user-friendly book, sprinkled with a lot of jokes and asides that keep the mood light. The fourth wall is broken upfront: the authors talk directly to you, which is great. It makes it feel like we are learning together rather than being lectured at. Their approach is more than superficial; it inspires you, motivates the problems, and connects them to the big-picture context. The book builds scaffolding through "The Crux" of the problem and "Aside" panels. It actively teaches you the thought processes, not just the results. I’ve talked about the importance of this pedagogical style before: Tell me about your thought process, not just the results.
Chapter 4: The Process Abstraction
In the first part, Virtualization, we start with the most fundamental abstraction: The Process. Informally, a process is simply a running program.
To understand a process, we have to look at its machine state. This includes its address space (memory), registers (like the Program Counter and Stack Pointer), and I/O information (like open files).
One distinct memory component here is the Stack versus the Heap.
- The Stack: C programs use the stack for local variables, function parameters, and return addresses. It's called a stack because it operates on a Last-In, First-Out basis, growing and shrinking automatically as functions are called and return.
- The Heap: This is used for explicitly requested, dynamically-allocated data (via malloc in C). It’s needed for data structures like linked lists or hash tables where the size isn't known at compile time.
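To make the distinction concrete, here is a tiny C sketch of my own (not from the book): the array lives on the stack and vanishes when the function returns, while the malloc'd buffer lives on the heap until it is explicitly freed.

#include <stdio.h>
#include <stdlib.h>

void count_things(int n) {
    int on_stack[4] = {0};                   /* stack: local variable, reclaimed automatically on return */
    int *on_heap = malloc(n * sizeof(int));  /* heap: explicitly requested, size known only at run time  */
    if (on_heap == NULL) return;
    for (int i = 0; i < n; i++) on_heap[i] = i;
    printf("%d %d\n", on_stack[0], on_heap[n - 1]);
    free(on_heap);                           /* heap memory must be freed explicitly */
}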
Reading this reminded me of 30 years ago, using Turbo Pascal to debug programs line-by-line. I remember watching the call stack in the IDE, seeing exactly how variables changed and stack frames were pushed and popped as the program executed. Turbo Pascal in the 1990s was truly a thing of beauty for learning these concepts interactively.
The book provides some excellent diagrams to visualize these concepts.
This figure shows how the OS takes a lifeless program on disk (code and static data) and hydrates it into a living process in memory. Like the trisolarans dehydrating to survive dormancy, the program exists as inert structure until the OS loads its bytes into an address space, reconstructing state and enabling execution. Only after this hydration step can the program spring into action.
To track all this, the OS needs a data structure. In the xv6 operating system (a teaching OS based on Unix), this is the "struct proc".
Figure 4.5 shows the code for the process structure in xv6. It tracks the process state (RUNNING, RUNNABLE, etc.), the process ID (PID), pointers to the parent process, and the context (registers) saved when the process is stopped. In other words, this struct captures the "inventory" the OS keeps for every running program.
Chapter 5: The Process API via Fork and Exec
In Chapter 5, we get to the UNIX process API. This is where the beauty of UNIX design really shines. UNIX creates processes with a pair of system calls: fork() and exec(). The fork() call is weird: it creates an almost exact copy of the calling process. Then things get really weird with exec(), which takes an existing process and transforms it into a different running program by overwriting its code and static data.
This code snippet demonstrates the fork() call. It shows how the child process comes to life as if it had called fork() itself, but with a return code of 0, while the parent receives the child's PID.
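For readers without the book open, here is a minimal sketch along those lines (my paraphrase of OSTEP's p1.c, not the exact figure): fork() returns 0 in the child and the child's PID in the parent.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    printf("hello (pid:%d)\n", (int)getpid());
    int rc = fork();
    if (rc < 0) {                        /* fork failed */
        fprintf(stderr, "fork failed\n");
        exit(1);
    } else if (rc == 0) {                /* child: fork() returned 0 */
        printf("child (pid:%d)\n", (int)getpid());
    } else {                             /* parent: fork() returned the child's PID */
        wait(NULL);
        printf("parent of %d (pid:%d)\n", rc, (int)getpid());
    }
    return 0;
}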
This figure puts it all together. It shows a child process using execvp() to run the word count program ('wc') on a source file, effectively becoming a new program entirely.
You might ask: Why not just have a single CreateProcess() system call? Why this dance of cloning and then overwriting? The separation of fork() and exec() is essential for building the UNIX Shell. It allows the shell to run code after the fork but before the exec. This space is where the magic of redirection and pipes happens. For example, if you run 'wc p3.c > newfile.txt', the shell:
- Calls fork() to create a child.
- Inside the child (before exec): Closes standard output and opens 'newfile.txt'. Because UNIX starts looking for free file descriptors at zero, the new file becomes the standard output.
- Calls 'exec()' to run 'wc'.
- The 'wc' program writes to standard output as usual, unaware that its output is now going to a file instead of the screen.
Figure 5.4 shows the code for redirection. It shows the child closing STDOUT_FILENO and immediately calling 'open()', ensuring the output of the subsequent 'execvp' is routed to the file.
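Here is a condensed sketch of that trick, in the spirit of the book's p4.c (my paraphrase, not the exact figure): close stdout in the child, let open() hand out the lowest free descriptor, then exec wc.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/wait.h>

int main(void) {
    int rc = fork();
    if (rc == 0) {                                                    /* child */
        close(STDOUT_FILENO);                                         /* free descriptor 1 */
        open("./newfile.txt", O_CREAT | O_WRONLY | O_TRUNC, S_IRWXU); /* becomes fd 1 */
        char *myargs[] = { "wc", "p3.c", NULL };
        execvp(myargs[0], myargs);                                    /* wc now writes to the file */
    } else if (rc > 0) {
        wait(NULL);                                                   /* parent waits for the child */
    }
    return 0;
}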
This design is unmatched. It allows us to compose programs using pipes and redirection without changing the programs themselves. It is a testament to the brilliance of Thompson and Ritchie. The book refers to this as Lampson's Law: "Get it right... Neither abstraction nor simplicity is a substitute for getting it right". The UNIX designers simply got it right, and that is why this design still holds up 55 years later as the gold standard.
As a final thought to spark some discussion, it is worth remembering that science advances by challenging even its most sacred cows. While we laud fork() and exec() as brilliant design, a 2019 HotOS paper titled "A fork() in the road" offers a counterpoint, arguing that fork() was merely a "clever hack" for the constraints of the 1970s that has now become a liability. The authors contend that it is a terrible abstraction for modern programmers that compromises OS implementations, going so far as to suggest we should deprecate it and teach it only as a historical artifact. This kind of debate is exactly how science works, as the OSTEP authors remind us in the endnote for the chapter.
January 30, 2026
Durastar Heat Pump Hysteresis
In which I discover that lying to HVAC manufacturers is an important life skill, and share a closely guarded secret: Durastar heat pumps like the DRADH24F2A / DRA1H24S2A with the DR24VINT2 24-volt control interface will infer the set point based on a 24-volt thermostat’s discrete heating and cooling calls, smoothing out the motor speed.
Modern heat pumps often use continuously variable inverters, so their compressors and fans can run at a broad variety of speeds. To support this feature, they usually ship with a “communicating thermostat” which speaks some kind of proprietary wire protocol. This protocol lets the thermostat tell the heat pump detailed information about the temperature and humidity indoors, and together they figure out a nice, constant speed to run the heat pump at. This is important because cycling a heat pump between “off” and “high speed” is noisy, inefficient, and wears it out faster.
Unfortunately, the manufacturer’s communicating thermostats are often Bad, Actually.™ They might be well-known lemons, or they don’t talk to Home Assistant. You might want to use a third-party thermostat like an Ecobee or a Honeywell. The problem is that there is no standard for communicating thermostats. Instead, general-purpose thermostats have just a few binary 24V wires. They can ask for three levels (off, low, and high) of heat pump cooling, of heating, and of auxiliary heat. There’s no way to ask for 53% or 71% heat.
So! How does the heat pump map these three discrete levels to continuously variable motor speeds? Does it use a bang-bang controller which jumps between, say, 30% and 100% intensity on calls for low and high heat, respectively? Or does it perform some sort of temporal smoothing, or try to guess the desired set point based on recently observed behavior?
How the heat pump interprets 24V signals is often hinted at in the heat pump’s manual. Lennox’s manuals, for instance, describe a sort of induced hysteresis mechanism where the heat pump ramps up gradually over time, rather than jumping to maximum. However, Durastar omits this information from their manuals. My HVAC contractor was also confused about this. After weeks of frustration, I tried to reach out to the manufacturer directly, and remembered that heat pump manufacturers are like paranoid wizards who refuse to disclose information about their products to everyday people. Only licensed HVAC professionals can speak to them. I wasted so, so much time on this, and have two secrets to share.
First: “licensed HVAC contractor” is not a real requirement. Many states have no licensing program, so you are just as licensed as anyone else in, say, rural Indiana. The trick that folks in construction use is to simply lie and tell them you’re an HVAC installer. As a midwesterner I do not like this, but it is apparently the only way to get things done. Durastar’s contractor support number is 877-616-2885.
Second: I talked to an actual Durastar engineer who immediately understood the question and why it was important. He explained that they use the thermistor on the air handler’s inlet as a proxy for indoor temperature, and learn the set point by tracking the 24V thermostat’s calls for heating over time. As long as the thermostat maintains a stable set point, the heat pump can run at a nice intermediate rate, trying to keep the indoor temperature close to—but not reaching—the inferred set point. That way the thermostat never stops calling for stage 1 heating/cooling, and the heat pump avoids short-cycling.
Finally, if the industry could please get its act together and make a standard protocol for communicating thermostats, we could all be free of this nonsense. I believe in you.
January 29, 2026
Rebuilding a Replica with MyDumper
Efficient String Compression for Modern Database Systems
Why you should care about strings
If dealing with strings wasn’t important, why bother thinking about compressing them? Whether you like it or not, “strings are everywhere”1. They are by far the most prominent data type in the real world. In fact, roughly 50% of data is stored as strings. This is largely because strings are flexible and convenient: you can store almost anything as a string. As a result, they’re often used even when better alternatives exist. For example, have you ever found yourself storing enum-like values in a text column? Or, even worse, using a text column for UUIDs? Guilty as charged, I’ve done it too, and it turns out this is actually very common.1
Recently, Snowflake published insights into one of their analytical workloads2. They found that string columns were not only the most common data type, but also the most frequently used in filters.
This means two things: First, it is important to store these strings efficiently, i.e., in a way that does not waste resources, including money. Second, they must be stored in a way that allows queries to be answered efficiently, because that is what users want: fast responses to their queries.
Why you want to compress (your strings)
Compression (usually) reduces the size of your data, and therefore the resources your data consumes. That can mean real money if you’re storing data in cloud object stores where you pay per GB stored, or simply less space used on your local storage device. However, in database systems, size reduction isn’t the only reason to compress. As Prof. Thomas Neumann always used to say in one of my foundational database courses (paraphrased): “In database systems, you don’t primarily compress data to reduce size, but to improve query performance.” As compression also reduces the data’s memory footprint, your data might all of a sudden fit into CPU caches where it previously only fit into RAM, cutting the access time by more than 10x. And because data must travel through bandwidth-limited physical channels from disk to RAM to the CPU registers, smaller data also means reading more information in the same amount of time, and thus better bandwidth-utilization.
String Compression in CedarDB
Before diving deeper into FSST, it makes sense to take a brief detour and look at CedarDB’s current compression suite for text columns, as this helps explain some design decisions and better contextualize the impact of FSST. Until January 22nd, 2026, CedarDB supported the following compression schemes for strings:
- Uncompressed
- Single Value
- Dictionary
The first two schemes are special cases: we choose them either when compression is not worth it because all the strings are very short (Uncompressed), or when there is only a single value in the domain (Single Value).
Since dictionary compression will play an important role later in this post, let’s first take a closer look at how CedarDB builds dictionaries and the optimizations they enable.
To illustrate string compression in CedarDB, let’s consider the following string data throughout the remainder of this blog post:
https://www.cit.tum.de/
https://cedardb.com/
https://www.wikipedia.org/
https://www.vldb.org/
https://cedardb.com/
…
Dictionary Compression
Dictionary compression is a well-known and widely applied technique. The basic idea is that you store all the unique input values within a dictionary, and then you compress the input by substituting the input values with smaller fixed-size integer keys that act as offsets into the dictionary. Building a CedarDB dictionary on our input data and compressing the data would look like this:
The attentive reader may have noticed two things. First, we store the offsets to the strings in our dictionary. This is necessary because we want to enable efficient random access to our dictionary. Efficient random access means that, given a key, we can directly jump to the correct string using the stored offset. Without storing the offset, we would need to read the dictionary from beginning to end until we find the desired string. Also, since strings have variable sizes, some kind of length information must be stored somewhere.
Second, our dictionary data is lexicographically ordered.
Performing insertions and deletions in such an ordered dictionary would be quite costly. CedarDB already treats compressed data as immutable3, sidestepping this issue.
An ordered dictionary provides interesting properties that will be useful when evaluating queries on the compressed representation.
For example, if str1 < str2, then key_str1 < key_str2 in the compressed representation. You’ll understand later why this comes in handy.
Additionally, we might be able to further compress the keys. Since the value range of the keys depends on the number of values in the dictionary, we might be able to represent the keys with one or two bytes instead of four bytes. This technique, called truncation, is part of our compression repertoire for integers and floats.
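To make this concrete, here is a hypothetical C sketch of the layout just described (illustrative names, not CedarDB's actual structures): the distinct strings are stored back to back in sorted order, an offset array delimits them, and each row keeps only a small fixed-size key.

#include <stdint.h>

typedef struct {
    const char     *bytes;       /* sorted distinct strings, concatenated           */
    const uint32_t *offsets;     /* entry i spans bytes[offsets[i] .. offsets[i+1]) */
    uint32_t        num_entries;
} Dictionary;

/* random access: jump straight to entry `key` instead of scanning the dictionary */
static inline const char *dict_entry(const Dictionary *d, uint32_t key, uint32_t *len) {
    *len = d->offsets[key + 1] - d->offsets[key];
    return d->bytes + d->offsets[key];
}

For the sample URLs, the sorted dictionary would hold https://cedardb.com/ (key 0), https://www.cit.tum.de/ (1), https://www.vldb.org/ (2), and https://www.wikipedia.org/ (3), so the column itself is stored as the key sequence 1, 0, 3, 2, 0, …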
Evaluating Queries on Dictionary-Compressed Data
No matter the scheme, decompressing data is always more CPU-intensive than not decompressing data. Thus, we want to keep the data compressed for as long as possible. CedarDB evaluates filters directly on the compressed representation when possible.
Consider the following query on our input data:
SELECT * FROM hits WHERE url='https://cedardb.com/';
Our ordered dictionary comes in handy now. Instead of comparing against every value in the dictionary, we can perform a binary search on the dictionary values to find the key representing our search string "https://cedardb.com/". This is because the order of the keys matches the order of the uncompressed strings in the dictionary.
If we don’t find a match, we already know that none of the compressed data will equal our search string; therefore, we can stop here. If we find a match, then we know the compressed representation of the string, i.e., the index in the dictionary where the match was found. Note that we perform this procedure only once per compressed block, so the operation is amortized across 2¹⁸ tuples (the size of our compressed blocks).
Now that we have found the key, we can perform cheap integer comparisons on the compressed keys.
These small, fixed-size integer keys allow us to perform comparisons more efficiently since we can leverage modern processor features, such as SIMD (vectorized instructions), that enable us to perform multiple comparisons with a single instruction. Note that using SIMD for string comparisons is also possible. However, the variable size of strings makes these comparisons more involved and thus less efficient.
The comparisons then produce the tuples (or rather, their index in our compressed block) that satisfy the predicate.
Then, we can decompress only the values needed for the query output or use the qualifying tuples for further processing.
For example, we can reduce the amount of work required by restricting other filters to only look at tuples 1 and 4, since the other tuples won’t make it into the result.
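Putting the pieces together, a minimal, hypothetical sketch of this evaluation strategy could look as follows (the dictionary is simplified to an array of sorted C strings here; none of this is CedarDB's actual code): one binary search per compressed block, then plain integer comparisons over the per-row keys.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* binary search the sorted dictionary; returns true and sets *key on a match */
static bool dict_find(const char *const *dict, uint32_t num_entries,
                      const char *needle, uint32_t *key) {
    uint32_t lo = 0, hi = num_entries;
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(dict[mid], needle);
        if (cmp == 0) { *key = mid; return true; }
        if (cmp < 0) lo = mid + 1; else hi = mid;
    }
    return false;            /* no entry matches: no row in the block can qualify */
}

/* cheap, SIMD-friendly scan over the fixed-size keys of one compressed block */
static size_t filter_eq(const uint32_t *row_keys, size_t num_rows,
                        uint32_t key, uint32_t *matching_rows) {
    size_t hits = 0;
    for (size_t i = 0; i < num_rows; i++)
        if (row_keys[i] == key) matching_rows[hits++] = (uint32_t)i;
    return hits;
}

With the sample data, the binary search finds the key for "https://cedardb.com/" once per block, and the key scan then flags exactly the two matching tuples.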
Why not stop here?
Hopefully, all of what I just described sounds interesting, and you also get an intuition for how this can accelerate query processing. If dictionaries were the ultimate solution for all problems without any drawbacks, thus the best compression scheme for strings, we would stop here. Unfortunately, dictionaries also have drawbacks. For one, they only perform well with data that contains few distinct values. After all, they force us to store every single distinct string in full. While real-world data is usually highly repetitive, we can’t rely on that: The number of possible distinct strings is boundless.
As you can see, and as the colors indicate, the strings share many common patterns. From an information theoretical perspective, the strings have fairly low entropy, meaning they are more predictable than completely random strings. Another compression scheme could exploit this predictability; and this is where FSST comes in!
FSST
FSST (Fast Static Symbol Table)4 operates similarly to a tokenizer in that it replaces frequently occurring substrings with short, fixed-size tokens. In FSST lingo, substrings are called “symbols” and can be up to eight bytes long. Tokens are called “codes” and are one byte long, meaning there can be at most 256 of them. These 1-byte codes are useful because they allow working on byte boundaries during compression and decompression. Since there can be only 256 different symbols, all the symbols easily fit into the first-level cache of modern processors (L1), allowing for very fast access (~1 ns). The table that stores the symbols is static, i.e. it won’t change after construction, and FSST’s design goals are to support fast compression and decompression; thus the name “Fast Static Symbol Table.” Let’s look at it in a bit more detail.
FSST compression
FSST compresses in two phases. First, it creates a symbol table using a sample of the input data. Then, it uses the symbol table to tokenize the input. Working on a sample accelerates the compression process. Intuitively, sampling should work well because symbols that appear frequently in the input data will also appear frequently in a randomly selected sample. To illustrate the compression process, consider the following sample of our input data:
https://www.wikipedia.org/
https://cedardb.com/
…
During construction of the symbol table, the algorithm iteratively modifies an initially empty table.
First, it compresses the sample using the existing table and, while doing that, counts the appearances of the table’s symbols and individual bytes in the sample.
To extend existing symbols, it also counts the occurrences of combinations of two symbols. Then, in a second step, it selects the 255 symbols that yield the best compression gain.
The compression gain of a symbol is the number of bytes that would be eliminated from the input by having this symbol in the table.
It is calculated as follows: gain(s1) = frequency(s1) * size(s1). This process is illustrated below using our sample.
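As a rough, hypothetical illustration of one construction round, the following deliberately naive C program starts from an empty table, so every current "symbol" is a single byte and every candidate extension is an adjacent byte pair, and then ranks candidates by gain = frequency * size. The real FSST algorithm iterates this several times with longer symbols and is far more optimized; see the paper4 for details.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char bytes[9]; int len; long count; } Cand;

static Cand cands[1024];
static int ncands = 0;

/* count one occurrence of a candidate symbol of `len` bytes */
static void bump(const char *s, int len) {
    for (int i = 0; i < ncands; i++)
        if (cands[i].len == len && memcmp(cands[i].bytes, s, len) == 0) { cands[i].count++; return; }
    if (ncands < 1024) {
        memcpy(cands[ncands].bytes, s, len);
        cands[ncands].len = len;
        cands[ncands].count = 1;
        ncands++;
    }
}

/* sort candidates by descending gain = frequency * size */
static int by_gain_desc(const void *a, const void *b) {
    const Cand *x = a, *y = b;
    long gx = x->count * x->len, gy = y->count * y->len;
    return (gy > gx) - (gy < gx);
}

int main(void) {
    const char *sample[] = { "https://www.wikipedia.org/", "https://cedardb.com/" };
    for (size_t s = 0; s < sizeof sample / sizeof *sample; s++)
        for (size_t i = 0; sample[s][i]; i++) {
            bump(&sample[s][i], 1);                        /* single byte (current "symbol")   */
            if (sample[s][i + 1]) bump(&sample[s][i], 2);  /* combination of two symbols       */
        }
    qsort(cands, ncands, sizeof *cands, by_gain_desc);
    for (int i = 0; i < ncands && i < 255; i++)            /* keep at most 255 symbols          */
        printf("%.*s -> gain %ld\n", cands[i].len, cands[i].bytes, cands[i].count * (long)cands[i].len);
    return 0;
}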
If you’ve been paying close attention, you might have noticed that only 255 symbols are picked, even though a single byte can represent 256 values. The reason for this is that code 255 is reserved for the “escape code”. This code is necessary because the symbol table cannot (or rather, should not) hold all 256 individual byte values. Otherwise, FSST would not compress at all since one-byte symbols would be substituted by one-byte codes. Consequently, the input may contain a byte that cannot be represented by one of the codes. The “escape code” tells the decoder to interpret the next byte literally.
Now that the Symbol Table is constructed, it can be used to compress the input. This is done by scanning each string in the input and looking for the longest symbol at the current string offset. When such a symbol is found, its code is written to the output buffer and the input buffer is advanced by the length of the symbol. If there is no symbol for the current string offset byte, the escape code and the literal byte are written to the output buffer.
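A hypothetical sketch of that encoding loop (a naive linear lookup stands in for FSST's fast table lookup; this is not the actual fsst library API):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FSST_ESC 255            /* escape code: next output byte is a literal */

typedef struct {
    uint8_t bytes[255][8];      /* symbol i, up to 8 bytes         */
    uint8_t len[255];           /* length of symbol i, 0 if unused */
} SymbolTable;

/* naive longest-match lookup; the real FSST uses a much faster table lookup */
static int find_longest_symbol(const SymbolTable *st, const uint8_t *in,
                               size_t remaining, int *sym_len) {
    int best = -1, best_len = 0;
    for (int code = 0; code < 255; code++) {
        int l = st->len[code];
        if (l > best_len && (size_t)l <= remaining && memcmp(st->bytes[code], in, l) == 0) {
            best = code;
            best_len = l;
        }
    }
    *sym_len = best_len;
    return best;
}

size_t fsst_encode(const SymbolTable *st, const uint8_t *in, size_t in_len, uint8_t *out) {
    size_t i = 0, o = 0;
    while (i < in_len) {
        int sym_len;
        int code = find_longest_symbol(st, in + i, in_len - i, &sym_len);
        if (code >= 0) {
            out[o++] = (uint8_t)code;   /* one code replaces up to 8 input bytes */
            i += (size_t)sym_len;
        } else {
            out[o++] = FSST_ESC;        /* escape, then the literal byte */
            out[o++] = in[i++];
        }
    }
    return o;                           /* compressed size in bytes */
}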
FSST decompression
Decompression is straightforward. For each code in the compressed input, the symbol table is consulted to find the corresponding symbol, which is then written to the output buffer. The output buffer is advanced by the length of the symbol.
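The matching decompression sketch, reusing the SymbolTable type and FSST_ESC constant from the encoding sketch above (again hypothetical, not the real library interface):

size_t fsst_decode(const SymbolTable *st, const uint8_t *in, size_t in_len, uint8_t *out) {
    size_t i = 0, o = 0;
    while (i < in_len) {
        uint8_t code = in[i++];
        if (code == FSST_ESC) {
            out[o++] = in[i++];                           /* literal byte               */
        } else {
            memcpy(out + o, st->bytes[code], st->len[code]);
            o += st->len[code];                           /* symbol expands to <=8 bytes */
        }
    }
    return o;                                             /* decompressed size */
}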
Note that the first “c” is different from the second “c.” The first “c” is prepended by the escape code “ESC” and is therefore interpreted literally. The second “c,” however, is a real code that refers to the symbol at index “c” (99 in ASCII) in the symbol table.
For more information on FSST compression and decompression and its design decisions, see the paper4.
Integrating FSST into a Database System
Until now, we have discussed FSST compression and decompression, but not how to integrate it into a modern database that optimizes for very low query latencies, like CedarDB. Specifically, we have not discussed what compressed FSST data looks like when written to persistent storage so that it can be decompressed efficiently upon re-reading, so let’s do that now.
Obviously, you want to serialize the symbol table with your data, i.e., the symbols and their lengths. Without it, you won’t be able to decode the compressed strings and will lose all your data. To support efficient random access to individual strings in our compressed corpus, we also store the offsets to the compressed strings. To decompress the third string in our input, for example, we first look at the third value in the offset array. Since both the symbol table and the offsets have a fixed size, we always know where the offset is in our compressed corpus. The offset tells us the location of our string in the compressed strings. Finally, we use the symbol table to decompress the string.
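Spelled out as a hypothetical C layout (illustrative field names, not CedarDB's actual on-disk format), the chunk could look like this; because the symbol table and the offset array have known sizes, the offset of string i, and therefore the string itself, can be located directly:

#include <stdint.h>

typedef struct {
    uint8_t  symbol_bytes[255][8];  /* the symbols of the table                  */
    uint8_t  symbol_len[255];       /* their lengths (1..8 bytes)                */
    uint32_t num_strings;           /* number of compressed strings that follow  */
    /* then: uint32_t offsets[num_strings + 1];  start of each compressed string */
    /* then: uint8_t  codes[];                   the concatenated FSST codes     */
} FsstChunkLayout;

/* string i occupies codes[offsets[i] .. offsets[i+1]) and is decoded against
 * symbol_bytes/symbol_len, e.g. with a loop like fsst_decode above. */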
We planned to use the above layout in the beginning. However, it has significant drawbacks when it comes to query processing, i.e., evaluating predicates on the data. To illustrate this, let’s evaluate our query on the above layout:
SELECT * FROM hits WHERE url='https://cedardb.com/';
One naive way to evaluate this query on the data would be to decompress each compressed string and then perform a string comparison. This process is illustrated below:
This is quite slow. Alternatively, one could be smarter and use the compressed representation of the search string as soon as the first match is found for further comparisons. Another option is to directly compress the search string and then compare the compressed strings. Note that this only works for equality comparisons, not for greater than or less than comparisons, as our symbols are not lexicographically sorted. Even if we sorted them, we’d ultimately end up with string comparisons because FSST-compressed strings are still strings, meaning they’re variable-size byte sequences. As already mentioned, this makes comparing them more difficult and slower than comparing small fixed-size integers, for example. Wait, does comparing small fixed-size integers ring a bell? This is exactly what dictionaries allowed us to do before! So is it possible to combine the advantages of a dictionary and FSST? Yes, it is!
Combining FSST with a Dictionary
The key idea is to create a dictionary from the input data and then use FSST to compress only the dictionary. This allows for efficient filtering of the compressed representation (i.e., the dictionary keys), as illustrated previously, while achieving better compression ratios than regular dictionary compression, as FSST allows us to leverage common patterns in the data for compression. Combining FSST and dictionary compression would look like this:
Evaluating predicates on this data works the same way as it does for dictionary compression. Decompression works the same way as it does for FSST. However, accessing the strings involves an additional level of indirection due to the dictionary keys.
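A hypothetical sketch of that extra indirection, reusing the SymbolTable and fsst_decode sketches from the FSST section (not CedarDB's code): rows store small dictionary keys, the dictionary entries themselves are FSST-compressed, and a string is materialized only when needed by following key → offset → decode.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    SymbolTable     table;          /* symbols used to compress the dictionary entries             */
    const uint8_t  *codes;          /* FSST codes of all dictionary entries, concatenated          */
    const uint32_t *entry_offsets;  /* entry i spans codes[entry_offsets[i] .. entry_offsets[i+1]) */
} FsstDictionary;

/* decode the string for one row, given its dictionary key */
static size_t row_string(const FsstDictionary *d, uint32_t key, uint8_t *out) {
    uint32_t begin = d->entry_offsets[key], end = d->entry_offsets[key + 1];
    return fsst_decode(&d->table, d->codes + begin, end - begin, out);
}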
Note that combining FSST with a dictionary is nothing new: DuckDB integrated DICT_FSST half a year ago5, which is a very similar approach.
Deciding when to choose FSST
We’re almost there, but not quite yet. The open question is still: How do we decide when to apply FSST to a text column? Before CedarDB supported FSST, it chose a scheme by compressing the input and selecting the one scheme that produced the smallest size (though there are some edge cases). However, applying FSST just to save one byte is not worth it since decompressing a FSST-compressed string is much more expensive than decompressing a dictionary-compressed string. Thus, we introduced a penalty, X, such that the FSST-compressed data must be X% smaller than the second-smallest scheme to be chosen.
We evaluated multiple penalty values by contrasting storage size and query runtime on some popular benchmarks (see the Benchmarks section). In the end, we chose a penalty of 40%. A more detailed discussion of the trade-off will follow in the next section.
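The rule itself is simple; here is a minimal sketch (a hypothetical helper, not CedarDB's code) with the 40% penalty we settled on:

#include <stdbool.h>
#include <stdint.h>

/* FSST is chosen only if its output is at least penalty_pct percent smaller
 * than the second-smallest scheme, e.g. penalty_pct = 40 */
static bool prefer_fsst(uint64_t fsst_bytes, uint64_t best_other_bytes, unsigned penalty_pct) {
    return fsst_bytes * 100 <= best_other_bytes * (uint64_t)(100 - penalty_pct);
}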
Benchmarks
As CedarDB is rooted in research, let’s stop talking and empirically validate my claims with some benchmark results. We’ll take a look at two popular analytical benchmarks, ClickBench and TPC-H. Why these two? TPC-H is an industry standard, but its data is artificially generated, while ClickBench is based on real-world data.
Storage Size
As one of the reasons for compressing data is reducing its storage size, let’s have a look at the impact of activating FSST in CedarDB on the data size:
Enabling FSST reduces the storage size of ClickBench by roughly 6GB, corresponding to a 20% reduction in total data size and a 35% reduction in string data size. For TPC-H, the effect is even more pronounced: total storage size is reduced by over 40%, while string data size shrinks by almost 60%. This is likely because the data is artificially generated and therefore contains patterns that are more easily captured by the symbol table.
Query Runtime
As previously mentioned, we also aim to compress data to improve query performance. To measure this, we ran all queries from ClickBench and TPC-H on CedarDB with FSST active, and compared them to runs without FSST, normalizing the results by the runtime without FSST. To provide a clearer picture, we differentiate between cold runs, where the data is stored on disk and the query is executed for the first time, and hot runs, where the data is already cached in memory.
Cold Runs
As shown, activating FSST has a positive effect on cold runs for both ClickBench and TPC-H. For ClickBench, query runtimes improve by up to 40% for queries that operate mainly on FSST-compressed data, such as queries 21, 22, 23, 24, and 28. On my machine, this 40% reduction corresponds to an absolute runtime decrease of over a second, which is a substantial impact. For TPC-H, the effect is less pronounced, likely because the queries are more complex, i.e. there is a lot of other stuff to be done, and thus loading data from disk is less of a bottleneck compared to the mostly simple queries in ClickBench. Nevertheless, we still observe a speedup of up to 10% for query 13. Moreover, not only is the impact on individual TPC-H queries smaller, but activating FSST also affects fewer queries. This is because fewer filters are applied to FSST-compressed string columns compared to ClickBench.
Hot Runs
Looking at the hot runs, the effect is quite the opposite. For the ClickBench queries 21, 22, 23, 24, and 28, which showed the largest runtime improvements in the cold runs, the runtime for hot runs is higher by up to 2.8x. Since the data is already cached in memory, loading the data from disk is no longer the bottleneck; instead, decompressing the FSST-compressed strings becomes the limiting factor. This is because all these queries need to decompress most of the strings in the text columns to evaluate the LIKE (or length, in the case of Q28) predicates. And since decompressing FSST is more expensive than simple dictionary lookups, this results in slower execution. Note that this is really the worst-case scenario for compression in database systems: very simple queries that require decompressing nearly all the data to produce a result. As they say, there’s no free lunch; you can’t beat physics.
Queries that can operate on FSST-compressed columns without full decompression, such as query 31, continue to benefit in the hot runs, in this case achieving a notable 25% speedup.
Once again, the effect is less pronounced for TPC-H. For query 13, the runtime is 2.5× higher because a LIKE predicate is applied on almost all values of an FSST-compressed column.
Note that, although the relative runtime difference is higher than in the cold runs, the absolute difference is an order of magnitude smaller. For example, for query 22, being 2.8x slower corresponds to only about 100ms.
This is because reading data from memory is much faster than reading it from disk.
As another idea for those wanting to integrate this into their system, one way to improve the performance of simple, decompression-heavy queries might be to cache decompressed strings in the buffer manager. This way, subsequent queries touching the same data can read the decompressed strings directly without repeated decompression. However, decompressed strings take up more space than compressed strings, so there is less space in the buffer manager for other things that could help answer queries efficiently, ultimately making it a trade-off again. I don’t have any numbers on that yet because I haven’t had the chance to implement this. Let me know if you do.
Conclusions
As you might have noticed, compressing data is always a trade-off between storage size and decompression speed. After careful consideration, we decided to activate FSST to reduce CedarDB’s storage footprint and improve loading times.
While some queries may not benefit, or even slightly suffer from FSST, working on better-compressed strings improves overall resource usage and query performance, making it a net win.
If you’ve read this far, I hope I’ve given you some insight into string compression schemes and FSST, as well as what it takes to integrate such a scheme into a database system, along with the potential implications.
If you have any questions about the topics covered in this post, feel free to reach out at contact@cedardb.com or contact me directly at nicolas@cedardb.com.
To see how well CedarDB compresses your own data and how this affects query performance, download the community version from our website and examine the compression schemes applied to your data using the system table cedardb_compression_infos.