March 20, 2026
CedarDB: Catching Up on Recent Releases
This post takes a closer look at some of the most impactful features we have shipped in CedarDB across our recent releases. Whether you have been following along closely or are just catching up, here is a deeper look at the additions we are most excited about.
Parquet Support: Your On-Ramp to CedarDB
v2026-03-03
Due to the significant compression facilitated by its columnar format, Parquet has become quite popular in the past several years. Today we find Parquet in projects like Spark, DuckDB, Apache Iceberg, ClickHouse, and others. Because of this, Parquet has become not only a data format for running analytical queries, but also a handy format for data exchange between OLAP engines. CedarDB is now able to read Parquet data and directly load it into its own native data format with a single SQL statement:
postgres=# select * from 'data.parquet' limit 6;
id | name | email | age | city | created_at
----+---------------+-------------------+-----+-------------+------------
1 | Alice Johnson | alice@example.com | 29 | New York | 2024-01-15
2 | Bob Smith | bob@example.com | 34 | Los Angeles | 2024-02-20
3 | Clara Davis | clara@example.com | 27 | Chicago | 2024-03-05
4 | David Lee | david@example.com | 41 | Houston | 2024-04-18
5 | Eva Martinez | eva@example.com | 23 | Phoenix | 2024-05-30
6 | Frank Wilson | frank@example.com | 38 | San Antonio | 2024-06-12
(6 rows)
To load the data into a native CedarDB table:
CREATE TABLE my_table AS SELECT * FROM 'data.parquet';
This makes migrating from other OLAP systems, or ingesting data from your data lake, straightforward. Check out our Parquet Documentation and our hands-on comparison of CedarDB vs. ClickHouse using StackOverflow data, where CedarDB was 1.5-11x faster than ClickHouse after a Parquet-based migration.
Better Compression, Less Storage: Floats and Text
v2026-01-22 and v2025-12-10
Storage efficiency has seen a major leap on two fronts. For floating-point columns (i.e., of type REAL and DOUBLE PRECISION), CedarDB now applies Adaptive Lossless Floating-Point compression (ALP), a state-of-the-art technique that halves the on-disk footprint on average, perfect for workloads heavy on sensor readings or metrics.
For TEXT columns, we’ve adopted FSST (Fast Static Symbol Tables) combined with dictionary encoding: in practice this can halve the storage size of text-heavy tables while also making queries faster, since less data needs to be read from disk. Read our deep-dive blog post on the FSST implementation here.
without DictFSST:
postgres=# SELECT 100 * ROUND(SUM(compressedSize)::float / SUM(uncompressedSize), 4) || '%' as of_original FROM cedardb_compression_info WHERE tableName = 'hits' AND attributeName = 'title';
of_original
-------------
21.83%
(1 row)
with DictFSST:
postgres=# SELECT 100 * ROUND(SUM(compressedSize)::float / SUM(uncompressedSize), 4) || '%' as of_original FROM cedardb_compression_info WHERE tableName = 'hits' AND attributeName = 'title';
of_original
-------------
13.42%
(1 row)
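To get a feel for why the combination works, here is a toy Python sketch of dictionary encoding followed by a static symbol table. This illustrates the general idea only, not CedarDB's implementation; the sample titles and the symbol table are invented:

```python
# Toy sketch: dictionary encoding + a static symbol table (FSST-like).
# NOT CedarDB's implementation; values and symbols are invented for illustration.

def dict_encode(values):
    """Replace each string with a small integer ID into a dictionary."""
    dictionary, ids = [], []
    index = {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index[v])
    return dictionary, ids

def fsst_like_compress(s, symbols):
    """Greedily replace known substrings with 1-byte codes."""
    out = bytearray()
    i = 0
    while i < len(s):
        for sym, code in symbols.items():
            if s.startswith(sym, i):
                out.append(code)
                i += len(sym)
                break
        else:
            out.append(0xFF)        # escape marker for an unknown character
            out.append(ord(s[i]))
            i += 1
    return bytes(out)

titles = ["how to sort a list", "how to sort a dict", "how to sort a list"] * 1000
dictionary, ids = dict_encode(titles)

symbols = {"how to sort a ": 0x01, "list": 0x02, "dict": 0x03}
compressed_dict = [fsst_like_compress(s, symbols) for s in dictionary]

raw = sum(len(s) for s in titles)
enc = sum(len(b) for b in compressed_dict) + len(ids)  # each ID fits in 1 byte here
print(f"{100 * enc / raw:.2f}% of original size")
```

Dictionary encoding collapses repeated values into small IDs; the symbol table then shrinks the remaining distinct strings, which is why the two techniques compose so well.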
Improved Parallelism for DDL and Large Writes
v2026-02-03
Schema changes and parallel bulk loads no longer bring your database to a halt. CedarDB now supports DDL operations like ALTER TABLE and CREATE INDEX on different tables in parallel. Similarly, large INSERT, UPDATE, and DELETE operations now also run in parallel. None of these operations impact concurrent readers anymore. For teams running mixed OLTP/OLAP workloads, this is a significant quality-of-life improvement.
PostgreSQL Advisory Locks
v2025-11-06
CedarDB now has full support for PostgreSQL advisory locks, including blocking variants that wait until a resource is freed and automatic deadlock detection. This fills an important compatibility gap for operational applications that use advisory locks for custom mutual exclusion, either to implement application-level coordination (think job schedulers or distributed task queues) or because they depend on the locks for correctness (think schema migration tools). Check out our advisory locks documentation for more info.
SELECT pg_advisory_lock(123);
-- ... do work ...
SELECT pg_advisory_unlock(123);
Late Materialization
v2026-01-15
Internally, CedarDB now delays fetching column data until it’s actually needed, a technique called late materialization. When a query filters out most rows early, this means only the surviving rows ever have their full column data fetched from storage. The result is faster queries on wide tables, with less wasted I/O. This improvement is transparent: no changes required to your queries or schema.
In the example below, a table scan is reduced from reading 24 GB on disk to just 75 MB.
Before:
postgres=# explain analyze SELECT * FROM hits WHERE url LIKE '%google%' ORDER BY eventtime LIMIT 10;
plan
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🖩 OUTPUT () +
▲ SORT (In-Memory) (Result Materialized: 626 KB, Result Utilization: 83 %, Peak Materialized: 635 KB, Peak Utilization: 83 %, Card: 10, Estimate: 10, Time: 0 ms (0 %))+
🗐 TABLESCAN on hits (num IOs: 29'046, Fetched: 24 GB, Card: 853, Estimate: 67'688, Time: 10632 ms (100 % ***))
(1 row)
Time: 10651.826 ms (00:10.652)
After:
postgres=# explain analyze SELECT * FROM hits WHERE url LIKE '%google%' ORDER BY eventtime LIMIT 10;
plan
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🖩 OUTPUT () +
l LATEMATERIALIZATION (Card: 10, Estimate: 10) +
├───▲ SORT (In-Memory) (Result Materialized: 29 KB, Result Utilization: 58 %, Peak Materialized: 34 KB, Peak Utilization: 60 %, Card: 10, Estimate: 10, Time: 10 ms (8 % *))+
│ 🗐 TABLESCAN on hits (num IOs: 20, Fetched: 75 MB, Card: 482, Estimate: 67'688, Time: 116 ms (92 % ***)) +
└───🗐 TABLESCAN on hits (Estimate: 10)
(1 row)
Time: 139.806 ms
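The same idea can be sketched in a few lines of Python (a toy model, not CedarDB's executor): run the filter and the top-10 sort over only the narrow columns they touch, and fetch the wide columns just for the survivors:

```python
# Toy model of late materialization (not CedarDB's executor): the filter and
# sort read only the narrow columns they need; wide columns are fetched from
# "storage" only for the surviving rows.

rows = [{"id": i, "url": f"http://example.com/{i}", "eventtime": 10_000 - i,
         "payload": "x" * 100} for i in range(10_000)]
rows[42]["url"] = "http://google.com/search"
rows[7]["url"] = "http://google.com/maps"

fetched_bytes = 0

def fetch_wide(row_id):
    """Simulate fetching the full row from storage, tracking I/O volume."""
    global fetched_bytes
    fetched_bytes += len(rows[row_id]["payload"])
    return rows[row_id]

# Early phase: only the columns the filter and sort touch (url, eventtime, id).
matching = sorted((r["eventtime"], r["id"]) for r in rows if "google" in r["url"])

# Late phase: materialize the wide payload only for the top-10 survivors.
result = [fetch_wide(row_id) for _, row_id in matching[:10]]
print(len(result), "rows materialized,", fetched_bytes, "payload bytes fetched")
```

Because most rows never survive the filter, the bytes fetched for wide columns stay proportional to the result size, not the table size.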
New Types: Unsigned Integers, UUIDv7, and Enums
v2025-11-06, v2025-10-22, and v2026-01-22
Three type additions make CedarDB more expressive:
- Unsigned integers (UINT1 - UINT8): Useful for importing data from Parquet, which also has native unsigned types. Also the right fit for any domain where negative values are nonsensical (counts, IDs, IP ports, flags). No more workarounds with larger signed types!
- UUIDv7: A newcomer from Postgres 18. Unlike UUIDv4, UUIDv7 embeds a timestamp and is monotonically increasing, making it index-friendly and suitable as a primary key without the index fragmentation problems of random UUIDs. Check out our UUID Documentation for more info.
- Enum types: Columns can now be declared with a fixed set of string values, saving storage and making constraints explicit at the type level. Check out our Enum Documentation for more info.
CREATE TYPE order_status AS ENUM ('pending', 'processing', 'shipped', 'delivered', 'cancelled');
CREATE TABLE orders (
id UUID PRIMARY KEY DEFAULT uuidv7(),
status order_status NOT NULL DEFAULT 'pending',
quantity UINT4 NOT NULL,
total_cents UINT4 NOT NULL
);
INSERT INTO orders (status, quantity, total_cents) VALUES ('pending', 2, 1999), ('processing', 1, 4999), ('shipped', 5, 2495);
postgres=# select * from orders;
id | status | quantity | total_cents
--------------------------------------+------------+----------+-------------
019d0701-cb64-7c96-b90c-ebd4a4410dee | pending | 2 | 1999
019d0701-cb65-7c49-8a1a-2e4ea96a85b1 | processing | 1 | 4999
019d0701-cb65-74ce-a987-b72905eb7f7a | shipped | 5 | 2495
(3 rows)
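For illustration, here is roughly how a UUIDv7 is laid out per RFC 9562, sketched in Python. This is not CedarDB's uuidv7() implementation, which may add stronger monotonicity guarantees within a single millisecond:

```python
# Sketch of the UUIDv7 layout from RFC 9562 (illustrative only; CedarDB's
# uuidv7() and PostgreSQL 18's implementation may add extra guarantees).
import os
import time
import uuid

def uuid7():
    ts_ms = time.time_ns() // 1_000_000      # milliseconds since the Unix epoch
    b = bytearray(16)
    b[0:6] = ts_ms.to_bytes(6, "big")        # 48-bit timestamp up front
    b[6:16] = os.urandom(10)                 # the rest is random
    b[6] = (b[6] & 0x0F) | 0x70              # set version 7
    b[8] = (b[8] & 0x3F) | 0x80              # set the RFC 4122 variant
    return uuid.UUID(bytes=bytes(b))

first = uuid7()
time.sleep(0.002)
second = uuid7()
# The timestamp prefix makes later UUIDs sort after earlier ones.
print(first, "<", second, "?", str(first) < str(second))
```

Because the timestamp occupies the most significant bits, newly generated keys always land at the right edge of a B-tree index instead of at random positions.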
PostgreSQL Compatibility: Moving Fast
CedarDB’s PostgreSQL compatibility continues to expand rapidly. Recent months have brought new SQL grammar support, a growing roster of system functions and un-stubbed catalog tables, and bug-for-bug compatibility fixes that allow more existing PostgreSQL tooling to work out of the box.
If a tool or library failed to connect or misbehaved in a previous version, it’s worth trying again: the list of working clients and frameworks is growing steadily. If your favorite tool isn’t working yet, please let us know by creating an issue.
That’s it for now
Questions or feedback? Join us on Slack or reach out directly.
Do you want to try CedarDB straight away? Sign up for our free Enterprise Trial below. No credit card required.
March 19, 2026
Break Paxos
As I mentioned in my previous blog post, I recently got my hands on Claude Code. In the morning, I used it to build a Hybrid Logical Clocks (HLC) visualizer. That evening, I couldn't pull myself away and decided to model something more ambitious.
I prompted Claude Code to design a Paxos tutorial game, BeatPaxos, where the player tries to "beat" the Paxos algorithm by killing one node at a time and slowing nodes down. Spoiler alert: you cannot violate Paxos safety (agreement). The best you can do is delay the decision by inducing a series of well-timed failures, and that is how you increase your score.
Thinking of the game idea and semantics was my contribution, but the screen design and animations were mostly Claude’s. I do claim credit for the "red, green, blue" theme for the nodes and the colored messages; those worked well for visualization. I also specified that double-clicking a node kills it, and that the player cannot kill another node until they double-click to recover the down node. I also instructed that the player can click and hold a node to slow its replies. These two actions capture the standard distributed consensus fault model; no Byzantine behavior is allowed, as that is a different problem and algorithm. I include my full prompt at the end of the post.
I was surprised Claude got this mostly right in one shot. Most importantly, it got the safety-critical part of the implementation right on the first try, which is no small feat. I used the Opus model this time because I wanted the extra firepower.
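For reference, the safety-critical acceptor rules the game has to get right fit in a few lines. This toy Python sketch of a single-decree acceptor is mine, not the game's code:

```python
# Toy single-decree Paxos acceptor (not the BeatPaxos code).
# Safety rests on two rules: never go back on a promise, and report the
# highest-ballot accepted value so a new leader must adopt it.

class Acceptor:
    def __init__(self):
        self.promised = -1        # highest ballot promised so far
        self.accepted = None      # (ballot, value) or None

    def on_p1a(self, ballot):
        """Phase 1: promise only ballots higher than anything seen."""
        if ballot > self.promised:
            self.promised = ballot
            return ("p1b", ballot, self.accepted)  # leader must adopt this value
        return ("nack", self.promised, None)

    def on_p2a(self, ballot, value):
        """Phase 2: accept only if it doesn't violate an outstanding promise."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return ("p2b", ballot)
        return ("nack", self.promised)

# A later leader (ballot 2) must learn and re-propose the value from ballot 1.
a = Acceptor()
a.on_p1a(1)
a.on_p2a(1, "red")
kind, _, prior = a.on_p1a(2)
print(kind, prior)    # the new leader sees "red" accepted at ballot 1
```

Killing and slowing nodes can only delay which ballot wins; these two acceptor rules are why the decided value can never flip.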
Here is the game, live, for you to try.
What needed fixing
There were timing issues again. The animation progressed too quickly for the player to follow, so I asked it to slow down. But more serious problems emerged, especially around the leader timeout.
The leader timeout felt wrong. I was seeing dueling leaders even in fault-free runs. After some back and forth, I found a deeper issue: even a single leader could time out on itself, start a new ballot, and end up dueling itself. This was clearly a bug in the timeout logic. After I pointed it out, Claude fixed it, and things were good.
My red, blue, green idea was a good design choice, but Claude did not follow my intent. It colored any message sent by blue as blue, even when it was a response to a green leader. I reminded Claude that messages should be colored by the ballot leader’s color to clearly show how different ballots interleave, and this was sufficient to fix the issue.
Takeaway
This was a lot of fun, again. I feel like a kid in a candy store. It’s now trivial to build great learning tools with Claude Code. As I mentioned in my previous post, people need hands-on, interactive learning, not static figures. They need to play with algorithms, explore corner cases, and see how they behave.
A LinkedIn follower, after seeing BeatPaxos, prompted his Claude Code to translate it to Raft. The result is BeatRaft: same logic, but with Raft messages. Check it out if that’s more your vibe.
Here is the original prompt as I promised.
let's design a paxos tutorial game, maybe we name it BeatPaxos, because the player will try to make the Paxos algorithm violate safety, spoiler alert it is not possible to.
There will be three columns, to denote N=3 nodes, let's call them red, green, blue, and we can then use the consistent color theme in the messages they send. The current leader may have a crown image on top of its box.
The single synod Paxos protocol will be implemented: p1a, p1b, p2a, p2b, p3 messages. if there is nothing learned in p1, red would want to propose red, blue would propose blue, and green would propose green. But of course the implementation follows the Paxos protocol.
The player will be given only a set of actions. It can kill one node a time, by double clicking on it. if a node is down, then doubleclicking on another node does not work. but double clicking on the down node recovers it.
The red, green, blue nodes are rounded-corner rectangular boxes in respective color. When a node gets to p2, it gets a crown, denoting it thinks it is the leader. So there can be two nodes with crown, it is possible under Paxos rules. Only one would have the highest ballot number. We can display the ballot number of a leader in its box. When a node has to return to p1, it loses the crown.
for the visualization layout I am thinking of these three lanes, lined up below red, green, blue, it just lists the messages it sent. For red, this could be p1a (ballotnum) to blue, p1a (ballotnum) to green, p2a ("red", ballotnum) to blue, p2a ("red", ballotnum) to green and blue may have p1b (ballotnum, [val]) to red etc.
The player can also click-and-hold on a node, to make that node slow to send its next message. Otherwise, messages scheduled according to Paxos, goes 5sec interval from each other.
If the leader node is down, or the player delays it by click-holding it, then a random timeout may make another node propose its candidacy by sending a p1 message.
The right pane may give the player the rules, and add some explanation about the protocol. And display the safety invariant and its evaluation on the current state. If the player manages to violate the safety invariant, player wins!
You can program this app on javascript on single page again, so I can deploy on Github Pages.
Supabase joins the Stripe Projects developer preview
MariaDB innovation: binlog_storage_engine, 32-core server, Insert Benchmark
MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB).
My previous post had results for sysbench with a small server. This post has results for the Insert Benchmark with a large (32-core) server. Both servers use an SSD that has high fsync latency. This is probably a best-case comparison for the feature. If you really care, then get enterprise SSDs with power-loss protection. But you might encounter high fsync latency on public cloud servers.
While throughput improves with the InnoDB doublewrite buffer disabled, I am not suggesting people do that for production workloads without understanding the risks it creates.
tl;dr for a CPU-bound workload
- throughput for write-heavy steps is larger with the InnoDB doublewrite buffer disabled
- throughput for write-heavy steps is much larger with the binlog storage engine enabled
- throughput for write-heavy steps is largest with both the binlog storage engine enabled and the InnoDB doublewrite buffer disabled. In this case it was up to 8.9X larger.
tl;dr for an IO-bound workload
- see the tl;dr above
- the best throughput comes from enabling the binlog storage engine and disabling the InnoDB doublewrite buffer; it was up to 3.26X larger
- z12b_sync
- my.cnf.cz12b_sync_c32r128 (z12b_sync) uses sync-on-commit for the binlog and InnoDB
- z12c_sync
- my.cnf.cz12c_sync_c32r128 (z12c_sync) is like z12b_sync and then enables the binlog storage engine
- z12b_sync_dw0
- my.cnf.cz12b_sync_dw0_c32r128 (z12b_sync_dw0) is like z12b_sync and then disables the InnoDB doublewrite buffer
- z12c_sync_dw0
- my.cnf.cz12c_sync_dw0_c32r128 (z12c_sync_dw0) is like z12c_sync and then disables the InnoDB doublewrite buffer
- CPU-bound - the database is cached by InnoDB, but there is still much write IO
- IO-bound - most, but not all, benchmark steps are IO-bound
- l.i0
- insert XM rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 10M for CPU-bound and 300M for IO-bound.
- l.x
- create 3 secondary indexes per table. There is one connection per client.
- l.i1
- use 2 connections/client. One inserts XM rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. X is 16M for CPU-bound and 4M for IO-bound.
- l.i2
- like l.i1 but each transaction modifies 5 rows (small transactions) and YM rows are inserted and deleted per table. Y is 4M for CPU-bound and 1M for IO-bound.
- Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
- qr100
- use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload. This step runs for 1800 seconds.
- qp100
- like qr100 except uses point queries on the PK index
- qr500
- like qr100 but the insert and delete rates are increased from 100/s to 500/s
- qp500
- like qp100 but the insert and delete rates are increased from 100/s to 500/s
- qr1000
- like qr100 but the insert and delete rates are increased from 100/s to 1000/s
- qp1000
- like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Relative QPS is the throughput for a config divided by the throughput for the base config (z12b_sync). When it is > 1.0 the config improves performance; when it is < 1.0 there is a regression. The Q in relative QPS measures:
- insert/s for l.i0, l.i1, l.i2
- indexed rows/s for l.x
- range queries/s for qr100, qr500, qr1000
- point queries/s for qp100, qp500, qp1000
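For clarity, here is how the relative QPS values in the result tables are derived (the throughput numbers below are hypothetical; the real ones come from the benchmark runs):

```python
# How relative QPS is computed: each config's throughput normalized to the
# base config. The inserts/s values here are hypothetical, for illustration.

def relative_qps(results, base="z12b_sync"):
    """Return each config's throughput divided by the base config's."""
    return {cfg: round(qps / results[base], 2) for cfg, qps in results.items()}

# Hypothetical l.i2 inserts/s for one workload:
li2 = {"z12b_sync": 10_000, "z12c_sync": 81_300, "z12c_sync_dw0": 89_000}
print(relative_qps(li2))
```

The base config always appears as 1.00 in the tables; every other cell is this ratio for that benchmark step.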
- throughput for l.i0 (load in PK order) is 3.63X larger for z12c_sync
- throughput for l.i1 (write-only, larger transactions) is 2.80X larger for z12c_sync
- throughput for l.i2 (write-only, smaller transactions) is 8.13X larger for z12c_sync
- throughput for l.i0 (load in PK order) is the same for z12b_sync and z12b_sync_dw0
- throughput for l.i1 (write-only, larger transactions) is 1.14X larger for z12b_sync_dw0
- throughput for l.i2 (write-only, smaller transactions) is 1.93X larger for z12b_sync_dw0
- throughput for l.i0 (load in PK order) is 3.61X larger for z12c_sync_dw0
- throughput for l.i1 (write-only, larger transactions) is 3.03X larger for z12c_sync_dw0
- throughput for l.i2 (write-only, smaller transactions) is 8.90X larger for z12c_sync_dw0
| dbms | l.i0 | l.x | l.i1 | l.i2 | qr100 | qp100 | qr500 | qp500 | qr1000 | qp1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| ma120301_rel_withdbg.cz12b_sync_c32r128 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| ma120301_rel_withdbg.cz12c_sync_c32r128 | 3.63 | 1.00 | 2.80 | 8.13 | 1.01 | 1.01 | 1.02 | 1.01 | 1.02 | 1.02 |
| ma120301_rel_withdbg.cz12b_sync_dw0_c32r128 | 1.00 | 1.00 | 1.14 | 1.93 | 1.01 | 0.99 | 1.01 | 1.00 | 1.01 | 0.99 |
| ma120301_rel_withdbg.cz12c_sync_dw0_c32r128 | 3.61 | 0.86 | 3.03 | 8.90 | 1.01 | 1.00 | 1.01 | 1.00 | 1.01 | 1.01 |
- throughput for l.i0 (load in PK order) is 3.05X larger for z12c_sync
- throughput for l.i1 (write-only, larger transactions) is 1.22X larger for z12c_sync
- throughput for l.i2 (write-only, smaller transactions) is 1.58X larger for z12c_sync
- throughput for l.i0 (load in PK order) is the same for z12b_sync and z12b_sync_dw0
- throughput for l.i1 (write-only, larger transactions) is 2.06X larger for z12b_sync_dw0
- throughput for l.i2 (write-only, smaller transactions) is 1.59X larger for z12b_sync_dw0
- throughput for l.i0 (load in PK order) is 3.01X larger for z12c_sync_dw0
- throughput for l.i1 (write-only, larger transactions) is 3.26X larger for z12c_sync_dw0
- throughput for l.i2 (write-only, smaller transactions) is 2.78X larger for z12c_sync_dw0
| dbms | l.i0 | l.x | l.i1 | l.i2 | qr100 | qp100 | qr500 | qp500 | qr1000 | qp1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| ma120301_rel_withdbg.cz12b_sync_c32r128 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| ma120301_rel_withdbg.cz12c_sync_c32r128 | 3.05 | 0.96 | 1.22 | 1.58 | 1.04 | 1.71 | 0.98 | 1.30 | 0.99 | 1.23 |
| ma120301_rel_withdbg.cz12b_sync_dw0_c32r128 | 1.01 | 0.94 | 2.06 | 1.59 | 1.05 | 2.92 | 1.16 | 1.54 | 1.02 | 1.86 |
| ma120301_rel_withdbg.cz12c_sync_dw0_c32r128 | 3.01 | 1.03 | 3.26 | 2.78 | 1.08 | 3.76 | 1.43 | 2.87 | 1.02 | 2.64 |
Developer Spotlight: Somtochi Onyekwere from Fly.io
This is an external post of mine.
March 18, 2026
MariaDB innovation: binlog_storage_engine, 48-core server, Insert Benchmark
MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB). See this blog post for more details on the new feature.
My previous post had results for sysbench with a small server. This post has results for the Insert Benchmark with a large (48-core) server. Storage on this server has a low fsync latency while the small server has high fsync latency.
tl;dr
- binlog storage engine makes some things better without making other things worse
- binlog storage engine doesn't make all write-heavy steps faster because the commit path isn't the bottleneck in all cases on a server with storage that has low fsync latency
tl;dr for a CPU-bound workload
- the l.i0 step (load in PK order) is ~1.3X faster with binlog storage engine
- the l.i2 step (write-only with smaller transactions) is ~1.5X faster with binlog storage engine
tl;dr for an IO-bound workload
- the l.i0 step (load in PK order) is ~1.08X faster with binlog storage engine
- z12b_sync
- my.cnf.cz12b_sync_c32r128 (z12b_sync) is like z12b except it enables sync-on-commit for the binlog and InnoDB
- z12c_sync
- my.cnf.cz12c_sync_c32r128 (z12c_sync) is like z12c except it enables sync-on-commit for InnoDB. Note that InnoDB is used to store the binlog so there is nothing else to sync on commit.
- z12b_sync_dw0
- my.cnf.cz12b_sync_dw0_c32r128 (z12b_sync_dw0) is like z12b_sync but disables the InnoDB doublewrite buffer
- z12c_sync_dw0
- my.cnf.cz12c_sync_dw0_c32r128 (z12c_sync_dw0) is like z12c_sync but disables the InnoDB doublewrite buffer
- CPU-bound - the database is cached by InnoDB, but there is still much write IO
- IO-bound - most, but not all, benchmark steps are IO-bound
- l.i0
- insert XM rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 10M for CPU-bound and 200M for IO-bound.
- l.x
- create 3 secondary indexes per table. There is one connection per client.
- l.i1
- use 2 connections/client. One inserts XM rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. X is 40M for CPU-bound and 4M for IO-bound.
- l.i2
- like l.i1 but each transaction modifies 5 rows (small transactions) and YM rows are inserted and deleted per table. Y is 10M for CPU-bound and 1M for IO-bound.
- Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
- qr100
- use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload. This step runs for 3600 seconds.
- qp100
- like qr100 except uses point queries on the PK index
- qr500
- like qr100 but the insert and delete rates are increased from 100/s to 500/s
- qp500
- like qp100 but the insert and delete rates are increased from 100/s to 500/s
- qr1000
- like qr100 but the insert and delete rates are increased from 100/s to 1000/s
- qp1000
- like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Relative QPS is the throughput for a config divided by the throughput for the base config (z12b_sync). When it is > 1.0 the config improves performance; when it is < 1.0 there is a regression. The Q in relative QPS measures:
- insert/s for l.i0, l.i1, l.i2
- indexed rows/s for l.x
- range queries/s for qr100, qr500, qr1000
- point queries/s for qp100, qp500, qp1000
- Disabling the InnoDB doublewrite buffer doesn't improve performance.
- l.i0, load in PK order, gets ~1.3X more throughput
- when the binlog storage engine is enabled (see here)
- storage writes per insert (wpi) are reduced by about 1/2
- KB written to storage per insert (wkbpi) is a bit smaller
- context switches per insert (cspq) are reduced by about 1/3
- l.x, create secondary indexes, is unchanged
- when the binlog storage engine is enabled (see here)
- storage writes per insert (wpi) are reduced by about 4/5
- KB written to storage per insert (wkbpi) are reduced almost in half
- context switches per insert (cspq) are reduced by about 1/4
- l.i1, write-only with larger transactions, is unchanged
- l.i2, write-only with smaller transactions, gets ~1.5X more throughput
| dbms | l.i0 | l.x | l.i1 | l.i2 | qr100 | qp100 | qr500 | qp500 | qr1000 | qp1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| ma120301_rel_withdbg.cz12b_sync_c32r128 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| ma120301_rel_withdbg.cz12c_sync_c32r128 | 1.32 | 1.02 | 0.99 | 1.52 | 1.01 | 1.02 | 1.01 | 1.02 | 1.01 | 1.01 |
| ma120301_rel_withdbg.cz12b_sync_dw0_c32r128 | 1.00 | 0.94 | 1.00 | 1.03 | 1.03 | 1.02 | 1.03 | 1.02 | 1.03 | 1.02 |
| ma120301_rel_withdbg.cz12c_sync_dw0_c32r128 | 1.31 | 1.04 | 1.00 | 1.55 | 1.01 | 1.02 | 1.02 | 1.02 | 1.02 | 1.02 |
- For the read-write steps the insert SLA was not met for qr500, qp500, qr1000 and qp1000 as those steps needed more IOPs than the storage devices can provide.
- Disabling the InnoDB doublewrite buffer improves throughput by ~1.25X on the l.i2 step (write-only with smaller transactions) but doesn't change performance on the other steps.
- as expected there is a large reduction in KB written to storage (see wkbpi here)
- Enabling the binlog storage engine improves throughput by 9% and 8% on the l.i0 step (load in PK order) but doesn't have a significant impact on other steps.
- with the binlog storage engine there is a large reduction in storage writes per insert (wpi), a small reduction in KB written to storage per insert (wkbpi) and small increases in CPU per insert (cpupq) and context switches per insert (cspq) -- see here
| dbms | l.i0 | l.x | l.i1 | l.i2 | qr100 | qp100 | qr500 | qp500 | qr1000 | qp1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| ma120301_rel_withdbg.cz12b_sync_c32r128 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| ma120301_rel_withdbg.cz12c_sync_c32r128 | 1.09 | 1.01 | 0.99 | 1.01 | 0.99 | 0.99 | 0.97 | 0.97 | 1.00 | 0.98 |
| ma120301_rel_withdbg.cz12b_sync_dw0_c32r128 | 1.01 | 1.01 | 1.01 | 1.25 | 1.01 | 1.04 | 0.80 | 1.31 | 0.94 | 0.90 |
| ma120301_rel_withdbg.cz12c_sync_dw0_c32r128 | 1.08 | 1.01 | 1.00 | 1.26 | 0.99 | 1.04 | 0.68 | 1.31 | 0.93 | 0.90 |
Rate Limiting Strategies with Valkey/Redis
Automated parameter and option group change monitoring in Amazon RDS and Amazon Aurora
Claude Code experiment: Visualizing Hybrid Logical Clocks
Yesterday morning I downloaded Claude Code, and wanted to see what this bad boy can do. What better way to learn how this works than coding up a toy example with it. The first thing that occurred to me was to build a visualizer for Hybrid Logical Clocks (HLC).
HLC is a simple idea we proposed in 2014: combine physical time with a logical counter to get timestamps that are close to real time but still safe under clock skew. With HLC, you get the best of both worlds: real-time affinity augmented with causality when you need it. Since then HLC has been adopted by many distributed databases, including MongoDB, CockroachDB, Amazon Aurora DSQL, YugabyteDB, etc.
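For readers new to HLC, the update rules are compact enough to sketch in Python. This is a simplified rendition of the algorithm from the paper, with the physical clock reading passed in explicitly so the behavior is deterministic:

```python
# Simplified Hybrid Logical Clock (send/local and receive rules from the
# 2014 HLC paper); pt is the node's physical clock reading.

class HLC:
    def __init__(self):
        self.l = 0   # logical time: max physical time seen so far
        self.c = 0   # counter: breaks ties when l doesn't advance

    def send_or_local(self, pt):
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def receive(self, msg_l, msg_c, pt):
        old_l = self.l
        self.l = max(old_l, msg_l, pt)
        if self.l == old_l == msg_l:
            self.c = max(self.c, msg_c) + 1
        elif self.l == old_l:
            self.c += 1
        elif self.l == msg_l:
            self.c = msg_c + 1
        else:
            self.c = 0
        return (self.l, self.c)

# Node B's clock is behind (pt=5) but it receives a message stamped (10, 0):
b = HLC()
print(b.receive(10, 0, 5))   # causality is preserved despite clock skew
```

The timestamp stays within the clock-skew bound of real time, while the counter guarantees that a receive always gets a timestamp larger than the send.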
This felt well scoped for a first project with Claude Code. Choosing JavaScript let me host it on GitHub Pages for free. Easy peasy way of sharing something small yet useful with people.
Claude Code is a clever idea. It is essentially an agent wrapped around the Claude LLM. Chat works well for general Q/A, but it falls short for coding. In Claude Code, the agent wraps the LLM with a UNIX terminal abstraction, and voila, we are set up for coding. Simple and effective. It is the right interface. The terminal reduces everything to small, modular, composable commands. Terminal commands were the original microservices! You compose these well-defined tools with pipes to build larger workflows. Everything is text, which plays to the strengths of LLMs. Add git for versioning, and you get a tight development loop.
The process went smoothly (very smoothly, in fact) despite this being my first time using Claude Code. I described to Claude what I wanted to build. Since this was my first time, I gave a very brief description; I didn't expect it to jump into action based on this minimal description. (More on this at the end of the post.)
Claude created the hlc-visualizer directory, produced the first version of index.html, and opened it in Chrome to show me the demo. I was impressed by the speed. Wow. It almost got there on the first shot. The liberties it took with the layout were smart. It used a vertical time diagram and three nodes at the top. Both made it to the final version.
The initial visualization used buttons for send and local events. These sat on the right pane, away from the action, and were not easy to use. I prompted Claude to switch to double click on a node timeline for local events, and drag and drop between timelines for send events. This felt more natural, but I was not sure Claude could pull it off. It did, on the first try.
For the record, I used Sonnet. Since this was my first experiment with Claude, I did not want to use Opus, the more expensive model. Maybe Opus would have produced a better first version. I do not know.
Agents do not seem good with timing. The simulation ran too fast for human interaction. I kept tuning it to a tolerable speed. The screen also did not follow the timeline as new events extended beyond the view. See the commit log for how this evolved. I was not very efficient because this was my first time using Claude.
I think my biggest contribution to this collaboration was to notice that we need a snapshot feature. That's the killer app for HLC. So I explained to Claude how the snapshot should be taken, and after a couple of iterations, that worked. After that, I focused on improving the interaction and visuals.
Here is the end product. Try it here. Feedback is welcome in comments.
All in all, this was delightful. I wish I had this when I was teaching. It would help create visualizations for algorithms quickly. Students need hands-on interactive learning, not static figures. They need to play with the algorithms, explore corner cases, and see how the algorithms behave. I used TLA+ for teaching my distributed systems class, but visualizations like this are the real deal. I will do my usual plug for Spectacle, a browser-based TLA+ trace explorer and visualizer. But even with the manual animation mode, I think it would be hard to code this time diagram visual and snapshots there.
A final note on personality. Claude has high energy. It is a go-getter. It skips small talk, like a seasoned developer. It does not ramble like ChatGPT. Gemini Pro comes across as sound, but it sounds too uptight and uncreative when writing prose. Claude Code feels smart and sharp when coding.
OSTEP Chapter 10: Multiprocessor Scheduling
This chapter from Operating Systems: Three Easy Pieces explores multiprocessor scheduling as we transition from the simpler world of single-CPU systems to the challenges of modern multicore architectures.
This is part of our series going through OSTEP book chapters. The OSTEP textbook is freely available at Remzi's website if you'd like to follow along.
Core Challenges in Multiprocessor Scheduling
The shift to multiple CPUs introduces several hardware challenges that the operating system must manage:
- Cache Coherence: Hardware caches improve performance by storing frequently used data. In multiprocessor systems, if one CPU modifies data in its local cache without updating main memory immediately, other CPUs may read "stale" (incorrect) data.
- Synchronization: Accessing shared data structures across multiple CPUs requires mutual exclusion (e.g., locks). Without these, concurrent operations can lead to data corruption, such as double frees in a linked list.
- Cache Affinity: Processes run faster when they stay on the same CPU because they can reuse state already built up in that CPU's local cache. Frequent migration across CPUs forces the system to reload this state, degrading performance.
Yep. Now we are talking about some distributed systems concepts, even inside a single computer.
Scheduling Strategies
The chapter compares two primary architectural approaches to scheduling:
- Single-Queue Multiprocessor Scheduling (SQMS): All jobs are placed into a single global queue. This is simple to implement and inherently balances the load across all available CPUs, but it does not scale well due to lock contention on the single queue and often ignores cache affinity.
- Multi-Queue Multiprocessor Scheduling (MQMS): The system maintains multiple queues, typically one per CPU. This is highly scalable and naturally preserves cache affinity since jobs stay in their assigned queue, but is vulnerable to load imbalance, where one CPU may become idle while another is overloaded.
To address load imbalances in MQMS, systems use "work stealing", where an under-utilized CPU peeks at another queue and steals jobs to balance the workload.
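To make the mechanics concrete, here is a minimal Python sketch of MQMS with work stealing. Everything here is a hypothetical toy (the `CPU` class, tick-based quanta, and the steal-from-the-longest-queue policy are made up for illustration), not how Linux implements it:

```python
import collections

class CPU:
    """One run queue per CPU, as in MQMS (illustrative toy)."""
    def __init__(self, name):
        self.name = name
        self.queue = collections.deque()

    def steal_from(self, victim):
        # An under-utilized CPU peeks at a peer's queue and takes a job.
        if victim.queue:
            self.queue.append(victim.queue.popleft())

def schedule_tick(cpus):
    """One scheduling tick: idle CPUs steal, then each CPU runs one quantum."""
    for cpu in cpus:
        if not cpu.queue:
            # Steal from the most loaded peer to fix the load imbalance.
            victim = max(cpus, key=lambda c: len(c.queue))
            if victim is not cpu:
                cpu.steal_from(victim)
        if cpu.queue:
            job = cpu.queue.popleft()
            job["ticks"] -= 1          # "run" the job for one quantum
            if job["ticks"] > 0:
                cpu.queue.append(job)  # requeue on the SAME CPU: cache affinity
```

A job stays on its queue until it finishes (preserving affinity), and stealing only happens when a CPU would otherwise idle, which is roughly the trade-off the chapter describes.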
Modern Linux schedulers have utilized both approaches:
- O(1) Scheduler: Multi-queue.
- Completely Fair Scheduler (CFS): Multi-queue.
- BF Scheduler (BFS): Single-queue.
March 17, 2026
Measuring Agents in Production
When you are in the TPOT echo chamber, you would think fully autonomous AI agents are running the world. But this December 2025 paper, "Measuring Agents in Production" (MAP), cuts through the hype. It surveys 306 practitioners and conducts 20 in-depth case studies across 26 domains to document what is actually running in live environments. The reality is far more basic, constrained, and human-dependent than TPOT suggests.
The Most Surprising Findings
Simplicity and Bounded Autonomy: 80% of case studies use predefined structured workflows rather than open-ended autonomous planning, and 68% execute fewer than 10 steps before requiring human intervention. Frankly, these systems sound to me less like "autonomous agents" than glorified state machines or multi-step RAG pipelines.
Prompting Beats Fine-Tuning: Despite the academic obsession with reinforcement learning and fine-tuning, 70% of teams building production agents simply prompt off-the-shelf proprietary models. Custom-tuned models are often too brittle, and they break when foundation model providers update their models.
Tolerance for Latency: While in database systems and distributed systems we focus on shaving milliseconds and microseconds off response times, in the agent world 66% of deployed systems take minutes or even longer to respond. I am not comparing or criticizing, given the intrinsically different nature of the work; I am just noting how vastly different the latency expectations are.
Custom Infrastructure Over Heavy Frameworks: Though many developers experiment with frameworks like LangChain, 85% of the detailed production case studies ended up building their systems completely in-house using direct API calls. Teams actively migrate away from heavy frameworks to reduce dependency bloat and maintain the flexibility to integrate with their own proprietary enterprise infrastructure.
Benchmarks are Abandoned: 75% of production teams skip formal benchmarking entirely. Because real-world tasks are incredibly messy and domain-specific, teams rely instead on A/B testing, production monitoring, and human-in-the-loop evaluation (which a massive 74% of systems use as their primary check for correctness).
Reliability (consistent correct behavior over time) remains the primary bottleneck and challenge. OK, this one was not really a surprising finding.
The paper says agents in production deliver tangible value: 80% of practitioners explicitly deploy them for productivity gains, and 72% use them to drastically reduce human task-hours. This would have been a great place to be concrete and dive deeper into a couple of these cases, because there is a lot of incentive for companies to exaggerate their agent use. The closest the paper gets to this is Table 3 in the appendix.
Discussion
So the data says that the state of multi-agent systems in production is exaggerated. Everyone says they are doing it, but only a few actually are. And those who are doing it are keeping it basic.
This feels familiar.
Remember 2018? IBM published a whitepaper stating that "7 in 10 consumer industry executives expect to have a blockchain production network in 3 years". They famously claimed blockchains would cure almost every business ailment, reducing 9 distinct frictions including "inaccessible marketplaces", "restrictive regulations", "institutional inertia", "invisible threats", and "imperfect information". Ha, "invisible threats", it cracks me up every time!
Of course, Pepperidge Farm remembers the massive 2018 hype about Walmart tracking lettuce on the blockchain to pinpoint E. Coli contamination events. We were promised a decentralized revolution, but we only got shitcoins.
But comparing AI agents to blockchains is unfair. Agents already have a couple of killer applications. They have also made it into deployment, albeit in a very basic and constrained manner. It's just that they aren't the fully autonomous hyper-intelligent multi-agent swarms that people claim they are. They remain basic, human-supervised, highly constrained tools.
March 16, 2026
CPU efficiency for MariaDB, MySQL and Postgres on TPROC-C with a small server
I started to use TPROC-C from HammerDB to test MariaDB, MySQL and Postgres and published results for MySQL and Postgres on small and large servers. This post provides more detail on CPU overheads for MariaDB, MySQL and Postgres on a small server.
tl;dr
- Postgres gets the most throughput and the difference is large.
- MariaDB gets more throughput than MySQL
- Throughput improves for MariaDB and MySQL but not for Postgres when stored procedures are enabled. It is possible that the stored procedure support in MariaDB and MySQL is more CPU efficient than in Postgres.
- Postgres uses ~2X to ~4X more CPU for background tasks than InnoDB. The largest contributor is (auto)vacuum. But the total amount of CPU for background tasks is not significant relative to other CPU consumers.
For MySQL the config file is named my.cnf.cz12a_c8r32.
The benchmark was run for one workload, the working set is cached and there is only one user:
- vu=1, w=100 - 1 virtual user, 100 warehouses
- stored procedures are enabled
- partitioning is used for when the warehouse count is >= 1000
- a 5 minute rampup is used
- then performance is measured for 120 minutes
- Postgres sustains the most throughput with and without stored procedures
- MariaDB sustains more throughput than MySQL
- Stored procedures help MariaDB and MySQL, but do not improve Postgres throughput
- sp0 - stored procedures disabled
- The ratio of user (us) to system (sy) CPU time is almost 2X larger in Postgres than in MariaDB and MySQL with stored procedures disabled. But the ratios are similar with stored procedures enabled.
- The context switch rate is about 5X larger in MariaDB and MySQL vs Postgres with stored procedures disabled before normalizing by throughput; with normalization the difference would be even larger. But the difference is smaller with stored procedures enabled.
- Postgres has better throughput because MariaDB and MySQL use more CPU per NOPM. The difference is larger with stored procedures disabled. Perhaps the stored procedure evaluator in MariaDB and MySQL is more efficient than in Postgres.
- the CPU distributions by area are mostly similar for MariaDB, MySQL and Postgres
- Postgres uses 2X to 4X more CPU for background work (vacuum)
- the CPU distributions by area are mostly similar for MariaDB and MySQL
- Postgres uses 2X to 4X more CPU for background work (vacuum)
What Is in pg_gather Version 33?
Modeling Token Buckets in PlusCal and TLA+
Retry storms are infamous in distributed systems. It is easy to run into them. Inevitably, a downstream service experiences a hiccup, so your clients automatically retry their failed requests. Those retries add more load to the struggling service, causing more failures, which trigger more retries. Before you know it, the tiny unavailability cascades into a full-blown self-inflicted denial of service.
The token bucket is a popular technique for gracefully avoiding retry storms. Here is how the token bucket algorithm works for retries:
- Sending is always free. When a client sends a brand-new request, it doesn't need a token. It just sends it.
- Successes give you credit. Every time a request succeeds, the client deposits a small fraction of a token into its bucket (up to a maximum capacity).
- Retries cost you credit. If a request fails, the client must spend a whole token (or a large fraction of one) to attempt a retry.
If the downstream service is healthy, the bucket stays full. But if the service goes down, the initial failures quickly drain the bucket. Once the bucket is empty, the client stops retrying. It gracefully degrades, protecting the downstream service until it recovers and new requests start succeeding again.
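The three rules above fit in a few lines of code. Here is an illustrative Python toy (the class and method names are made up, and the 0.1-token earn rate and 1-token retry cost are example values, not prescribed constants):

```python
class RetryTokenBucket:
    """Client-side retry budget (illustrative sketch of the scheme above)."""

    def __init__(self, capacity=10.0, fill_per_success=0.1, retry_cost=1.0):
        self.capacity = capacity
        self.tokens = capacity   # start full: the service is assumed healthy
        self.fill = fill_per_success
        self.cost = retry_cost

    def on_success(self):
        # Successes deposit a small fraction of a token, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + self.fill)

    def try_spend_for_retry(self):
        # A retry must pay a whole token; when broke, the caller should
        # drop the request rather than retry.
        if self.tokens >= self.cost:
            self.tokens -= self.cost
            return True
        return False
```

Note that sending a new request touches the bucket only on success; that asymmetry is what lets a recovering service refill the budget.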
It is an elegant solution. But when implementing it in a client driver, there is a subtle concurrency trap that is easy to fall into. We model it below to show exactly where things go wrong.
The Sequential Processing Trap
When an engineer sits down to write a client driver, they usually think sequentially: Pull a request from a queue, send it, check the response, and retry if it fails. We can model this sequential control flow using PlusCal (the pseudoalgorithm DSL that compiles down to TLA+). In this first model below, we simulate a thread iterating over a queue of requests. If a request fails, the thread checks the token bucket. If the bucket is empty, the thread naturally waits for tokens to become available.
(Aside: Before we look at the results, notice the placement of labels in the model. In PlusCal, labels are not just aesthetic markers or simple goto targets. They are structurally vital because they define the boundaries of atomic steps. Label to label is the unit of atomic execution, so we have to place them very deliberately. For example, making a request to a server and evaluating its response cannot be a single, instantaneous action; that would violate the physical reality of the network boundary. By placing the ProcessResponse: label right before the if server_succeeds check, we correctly split this into two atomic actions. As a fun piece of TLA+ trivia: if we omitted that label, the model checker would evaluate the network call and the await statement simultaneously. To avoid getting stuck at an invalid await state, TLC would magically "time travel" and force the server to succeed every single time, completely hiding our bug!)
If we run this model, the TLC model checker immediately throws a Termination violation. Why?
We accidentally built a Head-of-Line (HOL) blocking bug. This is how it happens. A burst of failures drains the bucket. The thread hits the "await condition" and goes to sleep, waiting for tokens. But because the thread is blocked, it can't loop around to send new requests. Since only new requests can earn tokens, the bucket will never refill, and the client driver is permanently deadlocked.
The Fix: Fail-Fast
To fix this, we have to change how the client driver reacts to an empty bucket. Instead of waiting, it must fail fast. So we remove the "await" and simply drop the request if we can't afford the retry. We unblock the queue as the thread moves on to the next request. TLC confirms that this model passes with flying colors: no starvation, no deadlocks.
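To see the same contrast outside of PlusCal, here is a plain-Python analogue of the two designs. It is a deterministic toy, not the actual spec: `server_ok` stands in for the network call, the capacity and fill values are arbitrary, and returning a `stuck` flag models the thread blocking forever on an empty bucket (the HOL-blocking trap):

```python
def run_client(requests, server_ok, fail_fast, capacity=2.0, fill=0.5):
    """Process a queue of requests against a fake server (toy model).

    server_ok: function(request) -> bool, stands in for the network call.
    Returns (completed, dropped, stuck). stuck=True models the blocking
    design's thread parked forever on an empty token bucket.
    """
    tokens = capacity
    completed, dropped = [], []
    for req in requests:
        if server_ok(req):
            completed.append(req)
            tokens = min(capacity, tokens + fill)   # success earns credit
        elif tokens >= 1.0:
            tokens -= 1.0                           # pay a token for one retry
            if server_ok(req):
                completed.append(req)
            else:
                dropped.append(req)
        elif fail_fast:
            dropped.append(req)                     # drop; keep the queue moving
        else:
            # Blocking design: "await tokens" — but only new requests earn
            # tokens, and we can't send them while blocked. Deadlock.
            return completed, dropped, True
    return completed, dropped, False
```

With the server down, the blocking variant stalls after the bucket drains, while the fail-fast variant drains the whole queue by dropping requests, exactly the behavior TLC's Termination check distinguishes.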
The Pure TLA+ Perspective
It is worth noting that if we had modeled this system using pure TLA+ from the start, we wouldn't have stumbled into this specific deadlock. Instead of sequential thread logic, Pure TLA+ models state machines and event-driven logic. In the below TLA+ model, Send, ServerRespond, Retry, and Drop are independent actions. If the bucket is empty, the Retry action simply becomes disabled. But because there is no while loop tying actions together, the Send action remains perfectly valid for any new incoming requests. The guarded-command TLA+ model below naturally avoids the head-of-line blocking problem.
To run this TLA+ model in the Spectacle web tool for visualization, simply click this link. The Spectacle tool loads the TLA+ spec from GitHub, interprets it using a JavaScript interpreter, and visualizes step-by-step state changes as you press a button corresponding to an enabled action. You can step backwards and forwards, and explore different executions. This makes model outputs accessible to engineers unfamiliar with TLA+. You can share a trace simply by sending a URL as I did above.
This brings us to an important lesson about formal verification and system design: the paradigm gap. Pure TLA+ is a beautiful event-driven way to describe the mathematically correct state of your system. However, the environments where these systems actually live (Java, Go, C++, or Rust) are fundamentally built around sequential threads, loops, and queues, just like our PlusCal model. The impedance mismatch between an event-driven specification and a sequential implementation introduces the risk of HOL blocking. Because modern programming languages make it so effortless to pause a thread and wait for a resource, it is incredibly easy for a system to fall into the blocking trap. We should be cognizant of this pitfall when implementing our designs.
March 15, 2026
The Serial Safety Net: Efficient Concurrency Control on Modern Hardware
This paper proposes a way to get serializability without completely destroying your system's performance. I quite like the paper, as it flips the script on how we think about database isolation levels.
The Idea
In modern hardware setups (where we have massive multi-core processors, huge main memory, and I/O is no longer the main bottleneck), strict concurrency control schemes like Two-Phase Locking (2PL) choke the system due to contention on centralized structures. To keep things fast, most systems default to weaker schemes like Snapshot Isolation (SI) or Read Committed (RC) at the cost of allowing dependency cycles and data anomalies. Specifically, RC leaves your application vulnerable to unrepeatable reads as data shifts mid-flight, while SI famously opens the door to write skew, where two concurrent transactions update different halves of the same logical constraint.
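To make write skew concrete, here is a toy Python illustration (the balances and the `x + y >= 0` constraint are made up for the example). Each transaction checks the constraint against its own start-of-transaction snapshot, both checks pass, and since the two write sets are disjoint, SI happily commits both:

```python
def write_skew_demo():
    """Toy simulation of write skew under snapshot isolation.

    The application's intended invariant: x + y >= 0.
    Both transactions validate against the same stale snapshot,
    so each check passes, yet the combined result breaks the invariant.
    """
    db = {"x": 50, "y": 50}
    snap1 = dict(db)   # T1's snapshot at start
    snap2 = dict(db)   # T2's snapshot at start (same instant)

    # T1: withdraw 90 from x if the snapshot says it's safe.
    if snap1["x"] + snap1["y"] - 90 >= 0:
        db["x"] = snap1["x"] - 90
    # T2: withdraw 90 from y, judging from its own (now stale) snapshot.
    # SI sees no write-write conflict (x vs y), so both commit.
    if snap2["x"] + snap2["y"] - 90 >= 0:
        db["y"] = snap2["y"] - 90

    return db["x"] + db["y"]   # -80: invariant broken
```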
Can we have our cake and eat it too? The paper introduces the Serial Safety Net (SSN), a certifier that sits entirely on top of fast weak schemes like RC or SI, tracking the dependency graph and blessing a transaction only if it is serializable with respect to the others.
Figure 1 shows the core value proposition of SSN. By layering SSN onto high-concurrency but weak schemes like RC or SI, the system eliminates all dependency cycles to achieve serializability without the performance hits seen in 2PL or Serializable Snapshot Isolation (SSI).
SSN Implementation
When a transaction T tries to commit, SSN calculates a low watermark $\pi(T)$ (the oldest transaction in the future that depends on T) and a high watermark $\eta(T)$ (the newest transaction in the past that T depends on). If $\pi(T) \le \eta(T)$, it means the past has collided with the future, and a dependency cycle has closed. SSN aborts the transaction.
Because SSN throws out any transaction that forms a cycle, the final committed history is mathematically guaranteed to be cycle-free, and hence Serializable (SER).
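The commit-time test can be sketched in a few lines. This is an illustrative simplification of the pre-commit check (the real protocol maintains these watermarks incrementally on tuple versions; the function and parameter names here are hypothetical):

```python
def ssn_commit_check(cstamp, pred_cstamps, succ_cstamps):
    """Simplified SSN exclusion-window test (illustrative, not the full protocol).

    cstamp:       commit timestamp requested by transaction T
    pred_cstamps: commit stamps of committed transactions T depends on
    succ_cstamps: commit stamps of committed transactions that depend on T
                  (via read anti-dependencies)

    pi(T):  low watermark  = T's oldest committed successor
    eta(T): high watermark = T's newest committed predecessor
    If pi(T) <= eta(T), the past has collided with the future: abort.
    """
    pi = min(succ_cstamps, default=cstamp)   # no successors: pi defaults high
    eta = max(pred_cstamps, default=0)       # no predecessors: eta defaults low
    return eta < pi   # True -> safe to commit, False -> abort
```

Summarizing the whole dependency neighborhood into just two numbers is what makes the check cheap, and also what produces the occasional false-positive abort.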
Figure 2 illustrates how SSN detects serialization cycles using a serial-temporal graph. The x-axis represents the dependency order, while the y-axis tracks the global commit order. Forward dependency edges point upward, and backward edges (representing read anti-dependencies) point downward. Subfigures (a) and (b) illustrate a transaction cycle closing and the local exclusion window violation that triggers an abort: transaction T2 detects that its predecessor T1 committed after T2's oldest successor, $\pi(T2)$. This overlap proves T1 could also act as a successor, forming a forbidden loop. Subfigures (c) and (d) demonstrate SSN's safe conditions and its conservative trade-offs. In (c), the exclusion window is satisfied because the predecessor T3 committed before the low watermark $\pi(Tx)$, making it impossible for T3 to loop back as a successor. Subfigure (d), however, shows a false positive where transaction T3 is aborted because its exclusion window is violated, even though no actual cycle exists yet. This strictness is necessary, though: allowing T3 to commit would be dangerous, as a future transaction could silently close the cycle later without triggering any further warnings. Since SSN summarizes complex graphs into just two numbers ($\pi$ and $\eta$), it will sometimes abort a transaction simply because the exclusion window was violated, even if a true cycle hasn't formed yet.
SSN vs. Pure OCC
Now, you might be asking: Wait, this sounds a lot like Optimistic Concurrency Control (OCC), so why not just use standard OCC for Serializability?
Yes, SSN is a form of optimistic certification, but the mechanisms are different, and the evaluation section of the paper exposes exactly why SSN is a superior architecture for high-contention workloads.
Standard OCC does validation by checking exact read/write set intersections. If someone overwrote your data, you abort. The problem is the OCC Retry Bloodbath! When standard OCC aborts a transaction, retrying it often throws it right back into the exact same conflict because the overwriting transaction might still be active. In the paper's evaluation, when transaction retries were enabled, the standard OCC prototype collapsed badly, wasting over 60% of its CPU cycles just fighting over index insertions.
SSN, however, possesses the "Safe Retry" property. If SSN aborts your transaction T because a predecessor U violated the exclusion window, U must have already committed. When you immediately retry, the conflict is physically in the past; your new transaction simply reads $U$'s freshly committed data, bypassing the conflict entirely. SSN's throughput stays stable under pressure while OCC falls over.
Discussion
So what do we have here? SSN offers a nice way to get to SER, while keeping decent concurrency. It proves that with a little bit of clever timestamp math, you can turn a dirty high-speed concurrency scheme into a serializable one.
Of course, no system is perfect. If you are going to deploy SSN, you have to pay the piper. Here are some critical trade-offs.
To track these dependencies, SSN requires you to store extra timestamps on every single version of a tuple in your database. In a massive in-memory system, this metadata bloat is a significant cost compared to leaner OCC implementations.
SSN is also not a standalone silver bullet for full serializability. While it is great at tracking row-level dependencies on existing records, it does not natively track phantoms (range-query insertions). Because an acyclic dependency graph only guarantees serializability in the absence of phantoms, you cannot just drop SSN onto vanilla RC or SI; you must actively extend the underlying CC scheme with separate mechanisms like index versioning or key-range locking to prevent them.
To bring closure on the SSN approach, let's address one final architectural puzzle. If you've been following the logic so far, you might have noticed a glaring question. The paper demonstrates that layering SSN on top of Read Committed guarantees serializability (RC + SSN = SER). It also shows that doing the exact same thing with Snapshot Isolation gets you to the exact same destination (SI + SSN = SER). If both combinations mathematically yield a serializable database, why would we ever willingly pay the higher performance overhead of Snapshot Isolation? Why would we want SI+SSN when we have RC+SSN at home?
While layering SSN on top of Read Committed (RC) guarantees a serializable outcome, it exposes your application to in-flight problems. Under RC, reads simply return the newest committed version of a record and never block. This means the underlying data can change right under your application's feet while the transaction is running. Your code might read Account A, and milliseconds later read Account B after a concurrent transfer committed, seeing a logically impossible total, an inconsistent snapshot. Even though SSN will ultimately catch this dependency cycle and safely abort the transaction during the pre-commit phase, your application logic might crash before it ever reaches that protective exit door. Furthermore, even if your code survives the run, this late-abort mechanism hides a big performance penalty: your system might burn a lot of CPU and memory executing a complex doomed transaction, only for SSN to throw all that work away at the final commit check.
This is why we gladly pay the extra concurrency control overhead for SI. Under SI, each transaction reads from a perfectly consistent snapshot of the database taken at its start time. From your application's perspective, time stops, completely shielding your code from ever seeing those transiently broken states mid-flight. However, as we mentioned in the beginning, SI still allows write skews, and pairing it with SSN covers for that to guarantee serializability.
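A toy Python illustration of the difference (account balances are made up; the `isolation` flag selects whether reads go to a start-of-transaction snapshot, as under SI, or to the latest committed state, as under RC):

```python
def read_accounts(isolation):
    """Toy read-skew demo: a concurrent transfer commits between two reads.

    Invariant at every committed state: a + b == 100.
    Under RC the transaction observes an impossible total mid-flight;
    under SI the snapshot shields it.
    """
    db = {"a": 50, "b": 50}
    snapshot = dict(db)   # taken at transaction start (used only under SI)

    read = (lambda k: snapshot[k]) if isolation == "si" else (lambda k: db[k])

    a = read("a")
    # A concurrent transfer of 30 from a to b commits mid-transaction.
    db["a"] -= 30
    db["b"] += 30
    b = read("b")
    return a + b   # SI: 100 (consistent); RC: 130 (impossible total)
```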
If you would like to dive deeper, the authors later published a 20-page journal version here. I also found a recent follow-up by Japanese researchers here.