a curated list of database news from authoritative sources

March 25, 2026

Non-First Normal Form (¬1NF) and MongoDB: an alternative to 4NF to address 3NF anomalies

SQL databases are grounded in relational algebra, but they are not the only databases with a strong theoretical basis. MongoDB, designed as a document database to solve practical engineering problems and improve developer experience, also builds on theory. It supports non–first-normal-form schemas and extends relational operators to work with them. This foundation is described in a 1986 paper, published 23 years before MongoDB's first release: Theory of Non-First Normal Form Relational Databases

MongoDB's aggregation pipeline operators ($unwind, $group, $lookup, set operations) are concrete implementations of the paper's abstract algebraic operators.

I built the following examples while reading the paper, to illustrate its concepts in practice.

The Basic ¬1NF Data Model

The paper defines nested relations where attributes can be atomic or relation-valued:

This structure records facts about employees, such as their children, their skills, and the exams they passed.

This maps directly to a MongoDB document schema:

// This IS the paper's example, expressed as MongoDB documents
db.employees.insertMany([
  {
    ename: "Smith",
    Children: [
      { name: "Sam", dob: "2/10/84" },
      { name: "Sue", dob: "1/20/85" }
    ],
    Skills: [
      {
        type: "typing",
        Exams: [
          { year: 1984, city: "Atlanta" },
          { year: 1985, city: "Dallas" }
        ]
      },
      {
        type: "dictation",
        Exams: [
          { year: 1984, city: "Atlanta" }
        ]
      }
    ]
  },
  {
    ename: "Watson",
    Children: [
      { name: "Sam", dob: "3/12/78" }
    ],
    Skills: [
      {
        type: "filing",
        Exams: [
          { year: 1984, city: "Atlanta" },
          { year: 1975, city: "Austin" },
          { year: 1971, city: "Austin" }
        ]
      },
      {
        type: "typing",
        Exams: [
          { year: 1962, city: "Waco" }
        ]
      }
    ]
  }
])

MongoDB's document model is essentially what Roth called a "database scheme" — a collection of rules where attributes can be zero-order (scalar fields) or higher-order (embedded arrays of documents). The paper's Figure 3-1 is literally a MongoDB collection:

To developers using object-oriented languages, this model looks natural, but the paper dates from 1986, so its motivation lay elsewhere.

Motivation: ¬1NF as an alternative to 4NF decomposition

The paper's example (Figure 1-1) shows that an employee relation in 1NF requires 10 rows with massive redundancy:

employee | child | skill
-----------------------------
Smith    | Sam   | typing
Smith    | Sue   | typing
Smith    | Sam   | filing
Smith    | Sue   | filing
Jones    | Joe   | typing
Jones    | Mike  | typing
Jones    | Joe   | dictation
Jones    | Mike  | dictation
Jones    | Joe   | data entry
Jones    | Mike  | data entry

The ¬1NF version needs only 2 tuples. The paper notes three problems with 1NF:

  • Insert anomaly: Adding a child for Jones requires adding 3 tuples (one per skill)
  • Update anomaly: Changing Smith's "typing" to "word processing" requires updating multiple rows
  • Decomposition cost: The 1NF solution requires splitting into two tables and joining them back

Today, these anomalies are less of a problem: a single updateMany() or insertMany() operation in a MongoDB transaction is atomic, just like an UPDATE or INSERT in SQL databases. More importantly, the document model makes multi-document operations rare. The Cartesian explosion of redundant data, however, remains undesirable.

The first four normal forms (1NF, 2NF, 3NF, BCNF) do not address this, because there are no non-trivial functional dependencies in this relation: all three attributes together form the only candidate key. This relation is therefore already in BCNF. This is precisely why Fagin introduced 4NF in 1977.

This example illustrates a multivalued dependency (MVD) violation: children and skills are independent multivalued attributes of an employee, but they’re stored together. This creates a redundant Cartesian product (10 rows for 2 employees). Fourth Normal Form (4NF) addresses this by splitting the independent dependencies into separate tables to remove these anomalies, but at the cost of more tables touched per transaction and more joins per query.
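To make the multiplicative redundancy concrete, here is a minimal JavaScript sketch (sample data mirroring the paper's Figure 1-1) that counts the rows each representation needs:

```javascript
// Storing independent multivalued attributes together forces a
// Cartesian product per employee (the MVD redundancy).
const employees = [
  { employee: "Smith", children: ["Sam", "Sue"], skills: ["typing", "filing"] },
  { employee: "Jones", children: ["Joe", "Mike"], skills: ["typing", "dictation", "data entry"] }
];

// 1NF rows needed = sum over employees of |children| x |skills|
const rows1NF = employees.reduce(
  (n, e) => n + e.children.length * e.skills.length, 0
);

// The ¬1NF representation needs one tuple per employee
const rowsNF1 = employees.length;

console.log(rows1NF, rowsNF1); // 10 2
```

Adding a third skill for Jones would turn his 6 flat rows into 8, while the nested form still changes a single tuple.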

Even a Third Normal Form (3NF) schema can still suffer from insert and update anomalies and data redundancy. MongoDB’s document model can offer many of 4NF’s benefits without incurring the join overhead.

Here is the MongoDB equivalent:

// The ¬1NF representation — exactly what MongoDB does naturally
db.employees.insertMany([
  {
    employee: "Smith",
    Children: [{ child: "Sam" }, { child: "Sue" }],
    Skills: [{ skill: "typing" }, { skill: "filing" }]
  },
  {
    employee: "Jones",
    Children: [{ child: "Joe" }, { child: "Mike" }],
    Skills: [
      { skill: "typing" },
      { skill: "dictation" },
      { skill: "data entry" }
    ]
  }
])

Adding a new child for Jones is a single operation, with no anomaly:


db.employees.updateOne(
  { employee: "Jones" },
  { $push: { Children: { child: "Sara" } } }
)

Changing Smith's "typing" to "word processing" — one operation:


db.employees.updateOne(
  { employee: "Smith", "Skills.skill": "typing" },
  { $set: { "Skills.$.skill": "word processing" } }
)

Querying everything about an employee doesn't need a join:


db.employees.findOne({ employee: "Smith" })

MongoDB exists for a fundamental reason: its document model eliminates the need for decomposition and joins, as well as the update anomalies introduced by the 1NF constraint. Most workloads rely on single-document operations, which MongoDB physically optimizes as single-shard transactions, a single disk read or write, and a single replication operation. These inserts and updates lock the document, which serves as the consistency boundary, but modify only the fields whose values change and update only the corresponding index entries.

Nest and Unnest Operators

The paper defines nest (ν) to aggregate flat rows into nested structures, and unnest (μ) to flatten nested structures back:

  • ν_{B=(C,D)}(r): nest attributes C,D into a new nested relation B
  • μ_{B}(r): unnest nested relation B

MongoDB's aggregation pipeline has direct equivalents.

The paper's unnest (μ) operator is $unwind:

db.employees.aggregate([
  { $unwind: "$Skills" }
])

Each skill becomes its own document, duplicating employee info.

Deep unnest (μ* in the paper) is possible with multiple $unwind stages:

db.employees.aggregate([
  { $unwind: "$Children" },
  { $unwind: "$Skills" }
])

This produces the full 1NF Cartesian expansion, flattening both Children and Skills.
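Outside the database, $unwind behaves like a flatMap over the array field. A minimal JavaScript sketch (hypothetical helper and sample document) shows the expansion:

```javascript
// $unwind ≈ flatMap: each array element yields a copy of the parent
// document with the array field replaced by that single element
function unwind(docs, field) {
  return docs.flatMap(doc =>
    (doc[field] || []).map(el => ({ ...doc, [field]: el }))
  );
}

const docs = [
  { ename: "Smith",
    Children: [{ name: "Sam" }, { name: "Sue" }],
    Skills: [{ type: "typing" }, { type: "dictation" }] }
];

// Deep unnest: 2 children x 2 skills = 4 flat documents
const flat = unwind(unwind(docs, "Children"), "Skills");
console.log(flat.length); // 4
```

Unwinding Skills first and Children second yields the same four documents, which previews the order-independence result discussed next.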

When starting with a flat collection, the paper's nest operator (ν) is $group:

db.flat_employees.aggregate([
  {
    $group: {
      _id: "$employee",
      Children: { $addToSet: { child: "$child" } },
      Skills:   { $addToSet: { skill: "$skill" } }
    }
  }
])
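For intuition, the same nest (ν) can be sketched in plain JavaScript by grouping flat rows with a Map and deduplicating like $addToSet (helper and sample rows are hypothetical):

```javascript
// nest (ν): group flat rows by employee, collecting deduplicated
// sets of children and skills (like $group with $addToSet)
function nest(rows) {
  const byEmp = new Map();
  for (const r of rows) {
    if (!byEmp.has(r.employee)) {
      byEmp.set(r.employee, { employee: r.employee, Children: new Set(), Skills: new Set() });
    }
    const e = byEmp.get(r.employee);
    e.Children.add(r.child);
    e.Skills.add(r.skill);
  }
  return [...byEmp.values()].map(e => ({
    employee: e.employee,
    Children: [...e.Children],
    Skills: [...e.Skills]
  }));
}

const flatRows = [
  { employee: "Smith", child: "Sam", skill: "typing" },
  { employee: "Smith", child: "Sue", skill: "typing" },
  { employee: "Smith", child: "Sam", skill: "filing" },
  { employee: "Smith", child: "Sue", skill: "filing" }
];

// Four redundant 1NF rows collapse back into one nested tuple
console.log(nest(flatRows));
```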

The paper proves that the order of unnesting doesn't matter (Thomas & Fischer's result). In MongoDB, these produce the same fully-flat result:

db.employees.aggregate([
  { $unwind: "$Children" },
  { $unwind: "$Skills" }
])

db.employees.aggregate([
  { $unwind: "$Skills" },
  { $unwind: "$Children" }
])

The paper also notes that nest is not always an inverse of unnest in general, but it IS an inverse for Partitioned Normal Form (PNF) relations.

Partitioned Normal Form (PNF)

PNF requires that the atomic attributes of each relation (and each nested relation) form a key. The paper shows a pathological relation (Figure 1-3) that violates PNF:

Smith, the same employee, has two different skill sets, and Jones has a duplicate across sets.

In MongoDB, PNF corresponds to the fundamental design principle that _id determines the document. The pathological case above would mean having two documents for "Smith" with different Skills arrays — which is exactly what MongoDB's _id uniqueness constraint prevents at the collection level.

Without using _id as the employee identifier, we could have two employees with the same name:

db.bad_design.insertMany([
  { employee: "Smith", Skills: ["typing", "filing"] },
  { employee: "Smith", Skills: ["sorting", "mailing"] }
])

If we don't want to use _id as the identifier, the correct design enforces PNF with a unique index:

db.good_design.createIndex({ employee: 1 }, { unique: true })

db.good_design.insertOne({
  employee: "Smith",
  Skills: ["typing", "filing", "sorting", "mailing"]
})

Like many papers, this one uses the employee name as an identifier (two "Smith" tuples denote the same person), but obviously in real life a generated identifier identifies the physical person.

The paper, in theorem 5-1, proves that PNF is closed under unnesting. In MongoDB terms: if your documents are well-designed (one document per logical entity), then $unwind won't create ambiguous or inconsistent flat results.

For nested relations, PNF also applies recursively.

MongoDB schema validation can enforce the PNF constraint by comparing the size of an array with the size of the same array after duplicates are removed (using $setUnion):

db.createCollection("employees", {
  validator: {
    $and: [
      // JSON Schema for structure and types
      {
        $jsonSchema: {
          bsonType: "object",
          required: ["ename"],
          properties: {
            ename: { bsonType: "string" },
            Children: { bsonType: "array", items: { bsonType: "object", required: ["name"], properties: { name: { bsonType: "string" }, dob: { bsonType: "string" } } } },
            Skills: { bsonType: "array", items: { bsonType: "object", required: ["type"], properties: { type: { bsonType: "string" },
             Exams: { bsonType: "array", items: { bsonType: "object", required: ["year"], properties: { year: { bsonType: "int" }, city: { bsonType: "string" } } } } } } }
          }
        }
      },
      // PNF uniqueness constraints using $expr
      {
        $expr: {
          $and: [
            // Level 1: Children.name must be unique within the document
            {
              $eq: [
                { $size: { $ifNull: ["$Children.name", []] } },
                { $size: { $setUnion: [{ $ifNull: ["$Children.name", []] }] } }
              ]
            },
            // Level 1: Skills.type must be unique within the document
            {
              $eq: [
                { $size: { $ifNull: ["$Skills.type", []] } },
                { $size: { $setUnion: [{ $ifNull: ["$Skills.type", []] }] } }
              ]
            },
            // Level 2: Within EACH Skills element, Exams.year must be unique
            {
              $allElementsTrue: {
                $map: {
                  input: { $ifNull: ["$Skills", []] },
                  as: "skill",
                  in: { $eq: [ { $size: { $ifNull: ["$$skill.Exams.year", []] } }, { $size: { $setUnion: [{ $ifNull: ["$$skill.Exams.year", []] }] } } ] }
                }
              }
            }
          ]
        }
      }
    ]
  },
  validationLevel: "strict",
  validationAction: "error"
});
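The validator's size-comparison trick can be mirrored outside MongoDB. This JavaScript sketch (a hypothetical helper, not MongoDB's implementation) checks the same PNF uniqueness conditions at both nesting levels:

```javascript
// PNF key check: atomic key values must be unique within each array,
// i.e. array length == size after deduplication (the $setUnion trick)
const uniqueOn = (arr, key) =>
  arr.length === new Set(arr.map(e => e[key])).size;

function satisfiesPNF(doc) {
  return uniqueOn(doc.Children || [], "name")       // level 1: Children.name
      && uniqueOn(doc.Skills || [], "type")         // level 1: Skills.type
      && (doc.Skills || []).every(                  // level 2: Exams.year
           s => uniqueOn(s.Exams || [], "year"));
}

const good = {
  ename: "Smith",
  Children: [{ name: "Sam" }, { name: "Sue" }],
  Skills: [{ type: "typing", Exams: [{ year: 1984 }, { year: 1985 }] }]
};
const bad = { ename: "Watson", Children: [{ name: "Sam" }, { name: "Sam" }] };

console.log(satisfiesPNF(good), satisfiesPNF(bad)); // true false
```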

The fact that nest is not always the inverse of unnest is a well-known MongoDB pitfall:

db.employees.drop()  
db.employees.insertMany([  
  {  
    _id: 1,  
    employee: "Smith",  
    Children: [{ child: "Sam" }, { child: "Sue" }],  
    Skills: [{ skill: "typing" }]  
  },  
  {  
    _id: 2,  
    employee: "Smith",  
    Children: [{ child: "Tom" }],  
    Skills: [{ skill: "filing" }, { skill: "sorting" }]  
  },  
  {  
    _id: 3,  
    employee: "Jones",  
    Children: [{ child: "Joe" }],  
    Skills: [{ skill: "typing" }]  
  }  
])  

db.employees.aggregate([  
  // unnest (μ) 
  { $unwind: "$Children" },  
  // nest (ν), grouping by employee, which is NOT a key  
  {  
    $group: {  
      _id: "$employee",  
      Children: { $addToSet: "$Children" },  
      Skills: { $first: "$Skills" }  
    }  
  },  
  // Clean up  
  { $project: { _id: 0, employee: "$_id", Children: 1, Skills: 1 } },  
  { $sort: { employee: 1 } }  
])  

This results in two Smiths collapsed into one:

[
  {
    Children: [ { child: 'Joe' } ],
    Skills: [ { skill: 'typing' } ],
    employee: 'Jones'
  },
  {
    Children: [ { child: 'Tom' }, { child: 'Sam' }, { child: 'Sue' } ],
    Skills: [ { skill: 'typing' } ],
    employee: 'Smith'
  }
]

The correct way is grouping by the proper key:

db.employees.aggregate([  
  // unnest (μ)
  { $unwind: "$Children" },  
  // nest (ν), grouping by _id, which IS a key (PNF)  
  {  
    $group: {  
      _id: "$_id",  
      employee: { $first: "$employee" },  
      Children: { $push: "$Children" },  
      Skills: { $first: "$Skills" }  
    }  
  },  
  // Clean up  
  { $project: { _id: 1, employee: 1, Children: 1, Skills: 1 } },  
  { $sort: { _id: 1 } }  
])  

This is the paper's Theorem 5-2 demonstrated in MongoDB: nest inverts unnest if and only if the grouping key satisfies PNF.

Extended Algebra Operators

In a previous post, From Relational Algebra to Document Semantics, I explained how MongoDB extends the semantics of relational selection (σ) to non-1NF schemas, and mentioned that other relational operations are available in MongoDB. They are covered by the ¬1NF paper.

The paper defines extended union (∪ᵉ) to merge nested relations for tuples that agree on atomic attributes, rather than treating them as separate tuples. In MongoDB, this is achieved with $merge or with aggregation. Suppose we want to merge new course data into existing student records:

db.students.insertMany([
  {
    sname: "Jones",
    Courses: [
      { cname: "Math", grade: "A" },
      { cname: "Science", grade: "B" }
    ]
  },
  {
    sname: "Smith",
    Courses: [
      { cname: "Math", grade: "A" },
      { cname: "Physics", grade: "C" },
      { cname: "Science", grade: "A" }
    ]
  }
])

db.students.updateOne(
  { sname: "Jones" },
  { $addToSet: { Courses: { cname: "Physics", grade: "B" } } }
)

db.students.updateOne(
  { sname: "Smith" },
  {
    $addToSet: {
      Courses: {
        $each: [
          { cname: "Chemistry", grade: "A" },
          { cname: "English", grade: "B" }
        ]
      }
    }
  }
)

This added "Physics: B" to Jones, and "Chemistry: A" and "English: B" to Smith, without adding new tuples with different course sets the way a standard union would:

db.students.find()

[
  {
    _id: ObjectId('69c3a72344fc089068d4b0c2'),
    sname: 'Jones',
    Courses: [
      { cname: 'Math', grade: 'A' },
      { cname: 'Science', grade: 'B' },
      { cname: 'Physics', grade: 'B' }
    ]
  },
  {
    _id: ObjectId('69c3a72344fc089068d4b0c3'),
    sname: 'Smith',
    Courses: [
      { cname: 'Math', grade: 'A' },
      { cname: 'Physics', grade: 'C' },
      { cname: 'Science', grade: 'A' },
      { cname: 'Chemistry', grade: 'A' },
      { cname: 'English', grade: 'B' }
    ]
  }
]
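The same extended-union semantics can be sketched in plain JavaScript as a merge keyed on the atomic attribute (hypothetical helper; sname plays the role of the atomic key):

```javascript
// Extended union (∪ᵉ): tuples that agree on atomic attributes are
// merged by unioning their nested relations, instead of appearing twice
function extendedUnion(r, s) {
  const byKey = new Map(r.map(t => [t.sname, { ...t, Courses: [...t.Courses] }]));
  for (const t of s) {
    const hit = byKey.get(t.sname);
    if (!hit) { byKey.set(t.sname, { ...t, Courses: [...t.Courses] }); continue; }
    for (const c of t.Courses) {
      // $addToSet-like: skip exact duplicates
      if (!hit.Courses.some(x => x.cname === c.cname && x.grade === c.grade)) {
        hit.Courses.push(c);
      }
    }
  }
  return [...byKey.values()];
}

const r = [{ sname: "Jones", Courses: [{ cname: "Math", grade: "A" }] }];
const s = [{ sname: "Jones", Courses: [{ cname: "Physics", grade: "B" }] }];

// One Jones tuple with both courses, not two Jones tuples
console.log(extendedUnion(r, s));
```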

The paper emphasizes that standard set union treats entire tuples as atomic — if two tuples differ at all, both appear in the result. Extended union is ... (truncated)

March 24, 2026

MongoDB Transaction Performance

Many believe MongoDB transactions are slow, but this misconception often comes from misunderstanding optimistic locking — where transactions retry on conflict rather than blocking, making perceived slowness a function of contention, not inherent overhead (see this previous post).

In MongoDB, all data manipulation operations are transactional at the storage engine level. Single-document operations (insert, update, delete) use internal storage transactions via the WriteUnitOfWork and RecoveryUnit interfaces, ensuring ACID properties for each document change and its index entries.

In MongoDB, the term transaction refers specifically to multi-document transactions, which provide atomicity across multiple documents and collections via sessions. Their performance differs from single-document operations, but they are not clearly faster or slower:

  • Atomicity: Multi-document transactions use extra memory to track uncommitted changes and maintain transaction state across operations.

  • Consistency: Both single-document and multi-document operations enforce index updates and schema validation at write time.

  • Isolation: Multi-document transactions offer snapshot isolation using optimistic concurrency control, which can lead to write conflicts. When conflicts occur, MongoDB labels the error as TransientTransactionError — clients should handle this with retry logic and exponential backoff, as described in the previous post.

  • Durability: Multi-document transactions batch changes into fewer oplog entries, so they can reduce latency compared to multiple single-document transactions.
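The retry logic mentioned under Isolation can be sketched as a generic helper (names and backoff values are hypothetical; the drivers' transaction helpers implement a similar pattern):

```javascript
// Sketch: retry a transactional callback when the error carries the
// TransientTransactionError label, with exponential backoff
async function withRetry(fn, maxRetries = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const transient = err.errorLabels?.includes("TransientTransactionError");
      if (!transient || attempt >= maxRetries) throw err;
      // backoff: 10ms, 20ms, 40ms, ... before retrying the transaction
      await new Promise(r => setTimeout(r, 10 * 2 ** attempt));
    }
  }
}
```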

Example: Inserting One Million Documents

Here is an example on a MongoDB Atlas free cluster where I insert one million documents with a single insertMany() call:

db.test.drop();
const start = new Date();
// Insert a million documents
const res = db.test.insertMany(
  Array.from({ length: 1e6 }, (_, i) => ({
    name: `user_${i}`,
    value: Math.random(),
  }))
);
// Show timing
const elapsed = new Date() - start;
print(`Elapsed: ${elapsed} ms`);

On a free MongoDB Atlas cluster, this operation takes Elapsed: 53025 ms. Although it's a single call, ACID properties apply per document — each document is its own unit of consistency, much like an aggregate in domain-driven design. This means another session can see some inserted documents before the operation completes.

If we want the inserted documents to be visible only when they are all there, we can run the same operation in a transaction:

db.test.drop();
// Start a transaction in a session
const se = db.getMongo().startSession();
const sessionDb = se.getDatabase(db.getName());
const start = new Date();
se.startTransaction();
// Insert a million documents
const res = sessionDb.test.insertMany(
  Array.from({ length: 1e6 }, (_, i) => ({
    name: `user_${i}`,
    value: Math.random(),
  }))
);
// Commit and show timing
se.commitTransaction();
const elapsed = new Date() - start;
print(`Elapsed: ${elapsed} ms`);
se.endSession();

On a MongoDB Atlas free cluster, this takes Elapsed: 49047 ms, which is slightly faster. Note that this approaches the default transactionLifetimeLimitSeconds of 60 seconds — larger operations or slower clusters should adjust this parameter to avoid hitting the limit. MongoDB is optimized for OLTP with short transactions — it’s better to fail than to wait an unpredictable amount of time.

You might expect a larger speedup from batched durability, but single-document transactions are already highly optimized. MongoDB can piggyback multiple inserts into a single applyOps oplog entry, and WiredTiger batches log flushes so multiple transactions can share a single disk sync. For j: true and w: majority (the defaults), MongoDB often triggers a journal flush without waiting, piggybacking on replication acknowledgment to ensure durability.

When you hear that transactions are slow, remember this is usually a misunderstanding. Use transactions as your application requires, based on its atomicity, consistency, isolation, and durability boundaries. Performance depends on whether documents are on the same shard or the transaction spans multiple shards. Transactions aren’t inherently slow—the document model is tuned to the domain model’s consistency boundaries, favoring single-document transactions.

Sysbench vs MySQL on a small server: no new regressions, many old ones

This post has performance results for InnoDB from MySQL 5.6.51, 5.7.44, 8.0.X, 8.4.8 and 9.7.0 on a small server with sysbench microbenchmarks. The workload here is cached by InnoDB and my focus is on regressions from new CPU overheads.

In many cases, MySQL 5.6.51 gets about 1.5X more QPS than modern MySQL (8.0.x thru 9.7). The root cause is new CPU overhead, possibly from code bloat.

tl;dr

  • There are too many performance regressions in MySQL 8.0.X
  • There are few performance regressions in MySQL 8.4 through 9.7.0
  • In many cases MySQL 5.6.51 gets ~1.5X more QPS than 9.7.0 because 9.7.0 uses more CPU
  • Large regressions arrived in MySQL 8.0.30 and 8.0.32, especially for full-table scans

Builds, configuration and hardware

I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.X, 8.4.8 and 9.7.0. For MySQL 8.0.X I used 8.0.28, 8.0.30, 8.0.31, 8.0.32, 8.0.33, 8.0.34, 8.0.35, 8.0.36 and 8.0.45.

The server is an ASUS ExpertCenter PN53 with AMD Ryzen 7 7735HS, 32G RAM and an m.2 device for the database. More details on it are here. The OS is Ubuntu 24.04 and the database filesystem is ext4 with discard enabled.

The my.cnf files are here for 5.6, 5.7, 8.4 and 9.7.

The my.cnf files are here for 8.0.28, 8.0.30, 8.0.31, 8.0.32, 8.0.33, 8.0.34, 8.0.35, 8.0.36 and 8.0.45.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by InnoDB.

The tests are run using 1 table with 50M rows. The read-heavy microbenchmarks run for 630 seconds and the write-heavy for 930 seconds.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide tables below with relative QPS, computed as:
(QPS for some version) / (QPS for base version)
where the base version is either MySQL 5.6.51 or 8.0.28. When the relative QPS is > 1, a version is faster than the base version; when it is < 1, there might be a regression. Values from iostat and vmstat divided by QPS are here for 5.6.51 as the base version and here for 8.0.28 as the base version. These help explain why something is faster or slower, because they show how much hardware is used per request.
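To make the metric concrete, here is a throwaway JavaScript calculation with hypothetical QPS numbers:

```javascript
// relative QPS = QPS(version) / QPS(base); < 1 suggests a regression
const relativeQPS = (qps, baseQPS) => qps / baseQPS;

console.log(relativeQPS(6500, 10000));  // 0.65 -> likely regression vs base
console.log(relativeQPS(12000, 10000)); // 1.2  -> faster than base
```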

Results: point queries

Summary:
  • there are large regressions from 5.6.51 to 5.7.44
  • there are larger regressions from 5.7.44 to 8.0.45
  • the regressions from 8.0.45 to 9.7.0 are small
  • the regressions in the random-points tests are larger for range=10 than range=1000 (larger when the range is smaller). So the regressions are more likely to be in places other than InnoDB. The problem is new CPU overhead (see cpu/o here) which is 1.55X larger in 9.7.0 vs 5.6.51 for random-points_range=10 but only 1.19X larger in 9.7.0 for random-points_range=1000.
Relative to: 5.6.51
col-1 : 5.7.44
col-2 : 8.0.45
col-3 : 8.4.8
col-4 : 9.7.0

col-1   col-2   col-3   col-4
0.87    0.65    0.65    0.64    hot-points
0.87    0.69    0.67    0.63    point-query
0.87    0.72    0.72    0.71    points-covered-pk
0.90    0.78    0.78    0.76    points-covered-si
0.89    0.73    0.72    0.71    points-notcovered-pk
0.89    0.77    0.76    0.75    points-notcovered-si
1.00    0.84    0.83    0.83    random-points_range=1000
0.89    0.72    0.72    0.72    random-points_range=100
0.87    0.69    0.68    0.66    random-points_range=10

Summary:
  • The large regressions in 8.0.x for point queries (see above) occur prior to 8.0.28
Relative to: 8.0.28
col-1 : 8.0.30
col-2 : 8.0.31
col-3 : 8.0.32
col-4 : 8.0.33
col-5 : 8.0.34
col-6 : 8.0.35
col-7 : 8.0.36
col-8 : 8.0.45

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8
0.92    1.14    1.14    1.12    1.17    1.16    1.16    1.16    hot-points
0.97    0.97    0.95    0.96    0.95    0.95    0.95    0.95    point-query
0.94    1.09    1.09    1.08    1.12    1.12    1.11    1.15    points-covered-pk
0.90    1.08    1.07    1.07    1.12    1.13    1.12    1.16    points-covered-si
0.91    1.04    1.04    1.03    1.07    1.07    1.06    1.11    points-notcovered-pk
0.88    0.96    0.96    0.95    1.00    1.01    1.00    1.06    points-notcovered-si
0.79    2.35    2.42    2.37    2.45    2.45    2.47    2.56    random-points_range=1000
0.94    1.07    1.06    1.06    1.09    1.08    1.10    1.12    random-points_range=100
0.93    0.94    0.93    0.93    0.94    0.94    0.93    0.95    random-points_range=10

Results: range queries without aggregation

Summary:
  • there are large regressions from 5.6.51 to 5.7.44
  • there are larger regressions from 5.7.44 to 8.0.45
  • the regressions from 8.0.45 to 9.7.0 are small
  • the problem is new CPU overhead and for the scan test the CPU overhead per query is about 1.5X larger in modern MySQL (8.0 thru 9.7) relative to MySQL 5.6.51 (see cpu/o here)
Relative to: 5.6.51
col-1 : 5.7.44
col-2 : 8.0.45
col-3 : 8.4.8
col-4 : 9.7.0

col-1   col-2   col-3   col-4
0.83    0.68    0.66    0.65    range-covered-pk
0.83    0.70    0.69    0.67    range-covered-si
0.84    0.66    0.65    0.64    range-notcovered-pk
0.88    0.74    0.73    0.73    range-notcovered-si
0.84    0.67    0.66    0.67    scan

Summary:
  • There is a large regression in 8.0.30 and a larger one in 8.0.32
  • The scan test is the worst case for the regression.
Relative to: 8.0.28
col-1 : 8.0.30
col-2 : 8.0.31
col-3 : 8.0.32
col-4 : 8.0.33
col-5 : 8.0.34
col-6 : 8.0.35
col-7 : 8.0.36
col-8 : 8.0.45

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8
0.95    0.94    0.92    0.92    0.92    0.93    0.93    0.96    range-covered-pk
0.96    0.96    0.94    0.93    0.93    0.94    0.93    0.95    range-covered-si
0.94    0.94    0.93    0.93    0.94    0.94    0.93    0.93    range-notcovered-pk
0.89    0.87    0.87    0.86    0.89    0.91    0.89    0.95    range-notcovered-si
0.93    0.92    0.79    0.82    0.83    0.77    0.82    0.80    scan

Results: range queries with aggregation

Summary:
  • there are large regressions from 5.6.51 to 5.7.44
  • there are larger regressions from 5.7.44 to 8.0.45
  • the regressions from 8.0.45 to 9.7.0 are small
Relative to: 5.6.51
col-1 : 5.7.44
col-2 : 8.0.45
col-3 : 8.4.8
col-4 : 9.7.0

col-1   col-2   col-3   col-4
0.86    0.70    0.69    0.68    read-only-count
1.42    1.27    1.24    1.23    read-only-distinct
0.91    0.75    0.74    0.73    read-only-order
1.23    1.01    1.01    1.01    read-only_range=10000
0.93    0.77    0.76    0.74    read-only_range=100
0.86    0.69    0.68    0.66    read-only_range=10
0.83    0.68    0.68    0.66    read-only-simple
0.83    0.67    0.67    0.66    read-only-sum

Summary:
  • There are significant regressions in 8.0.30 and 8.0.32
Relative to: 8.0.28
col-1 : 8.0.30
col-2 : 8.0.31
col-3 : 8.0.32
col-4 : 8.0.33
col-5 : 8.0.34
col-6 : 8.0.35
col-7 : 8.0.36
col-8 : 8.0.45

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8
0.95    0.94    0.87    0.87    0.88    0.87    0.89    0.91    read-only-count
0.97    0.96    0.94    0.95    0.96    0.95    0.95    0.96    read-only-distinct
0.97    0.96    0.93    0.95    0.95    0.94    0.95    0.95    read-only-order
0.96    0.95    0.93    0.94    0.95    0.95    0.96    0.98    read-only_range=10000
0.96    0.96    0.94    0.95    0.95    0.94    0.95    0.94    read-only_range=100
0.96    0.97    0.95    0.95    0.95    0.94    0.95    0.94    read-only_range=10
0.94    0.94    0.92    0.93    0.93    0.93    0.94    0.94    read-only-simple
0.94    0.94    0.89    0.91    0.92    0.90    0.93    0.91    read-only-sum

Results: writes

Summary:
  • there are large regressions from 5.6.51 to 5.7.44
  • there are larger regressions from 5.7.44 to 8.0.45
  • the regressions from 8.0.45 to 9.7.0 are small
  • the insert test is the worst case and a big part of that is new CPU overhead, see cpu/o here, where it is 2.13X larger in 9.7.0 than 5.6.51. But for update-one the problem is writing more to storage per commit (see wkbpi here) rather than new CPU overhead.
Relative to: 5.6.51
col-1 : 5.7.44
col-2 : 8.0.45
col-3 : 8.4.8
col-4 : 9.7.0

col-1   col-2   col-3   col-4
0.85    0.60    0.59    0.55    delete
0.81    0.55    0.54    0.52    insert
0.93    0.75    0.74    0.71    read-write_range=100
0.87    0.70    0.68    0.66    read-write_range=10
1.20    0.88    0.89    0.91    update-index
1.04    0.74    0.73    0.71    update-inlist
0.87    0.62    0.61    0.57    update-nonindex
0.87    0.62    0.60    0.57    update-one
0.87    0.63    0.61    0.58    update-zipf
0.93    0.69    0.68    0.66    write-only

Summary:
  • There are significant regressions in 8.0.30 and 8.0.32
Relative to: 8.0.28
col-1 : 8.0.30
col-2 : 8.0.31
col-3 : 8.0.32
col-4 : 8.0.33
col-5 : 8.0.34
col-6 : 8.0.35
col-7 : 8.0.36
col-8 : 8.0.45

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8
0.96    0.95    0.92    0.92    0.92    0.91    0.91    0.91    delete
0.94    0.93    0.91    0.91    0.91    0.90    0.90    0.90    insert
0.96    0.96    0.94    0.94    0.94    0.94    0.94    0.93    read-write_range=100
0.96    0.96    0.94    0.94    0.94    0.94    0.94    0.93    read-write_range=10
0.91    0.91    0.84    0.84    0.86    0.85    0.86    0.79    update-index
0.94    0.95    0.92    0.91    0.92    0.91    0.91    0.91    update-inlist
0.95    0.96    0.92    0.92    0.92    0.91    0.91    0.90    update-nonindex
0.96    0.96    0.93    0.92    0.92    0.91    0.92    0.91    update-one
0.96    0.96    0.92    0.92    0.92    0.91    0.91    0.90    update-zipf
0.94    0.94    0.91    0.91    0.91    0.91    0.91    0.89    write-only

SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems

This paper presents SysMoBench, a benchmark designed to evaluate generative AI's ability to formally model complex concurrent and distributed systems. Although the paper was published in January 2026, the AI landscape moves so fast that the models evaluated (like Claude-Sonnet-4 and GPT-5) already feel dated after the release of heavy hitters like Claude 3.5 Opus and OpenAI's Codex.

The paper draws a distinction between algorithms/protocols and system modeling. As the paper (somewhat circularly) defines it, "system models enable verification of system code via comprehensive testing and model checking". The paper says: Modeling these systems requires the AI to deeply understand the system design, reason about safety and liveness under unexpected faults, and abstract system behavior into an executable program.


This paper is a great follow-up to my post yesterday on TLA+ Mental Models. It is instructive to see how the paper corroborates claims from that post. My post identified deciding what to abstract as the hardest skill in TLA+. Rather than letting the AI decide what to ignore, the benchmark tasks in Section 3.1 give the AI detailed instructions about which core actions to model and which implementation details to explicitly exclude.

The paper also validates my observation that invariants are the most signal-heavy part of a spec and, hence, the most difficult part for AI and people to write. The authors abandoned having the AI write invariants from scratch; instead they provide invariant templates that contain both a natural-language description and a formal TLA+ example. The AI is only asked to concretize these templates by mapping them to its own variable names.

I'll be criticizing the paper in several places in my writeup below, so just to make it clear upfront, I really like the paper. This paper is a strong accept from me for any conference.

SysMoBench uses the following automated quality metrics for evaluating TLA+ models:

  • Syntax correctness: Statically checked using the SANY Syntactic Analyzer.
  • Runtime correctness: Checked by running the TLC model checker as a proxy for logical self-consistency.
  • Conformance: Measured via trace validation to see if the model conforms to the actual system implementation.
  • Invariant correctness: Model-checked against system-specific safety and liveness properties.

Here "conformance" looks contentious to me because of the uncertainty around how refinement mappings are handled and how step-size impedance mismatches are resolved. The paper circumvents this by instrumenting the system code to generate trace logs, and then uses an LLM to automatically map the TLA+ variables and actions to the code traces. If AI-generated models inherently lean towards line-by-line code translation, are we risking losing the cognitive and design benefits of abstraction? Does the Conformance metric inherently bias the benchmark against good abstraction? How could a benchmark be designed to automatically evaluate and reward an AI for successfully cutting away irrelevant mechanics rather than matching them?


Looking at Table 3, the LLMs fail miserably on complex systems, with only Claude-Sonnet-4 managing to piece together anything functional (and even then, scoring very low on conformance for Etcd Raft). The paper's analysis of LLMs highlights fundamental weaknesses. LLMs constantly introduce syntax errors (as shown in Figure 4a). For instance, DeepSeek-R1 frequently hallucinates mathematical symbols (like $\cap, \forall$) instead of standard ASCII TLA+ operators. GPT-5 and Gemini-2.5-Pro often mix TLA+ syntax with Python. Regarding runtime errors (Figure 4b), LLMs frequently generate inconsistent TLC configurations or fundamentally misunderstand TLA+ data structures (e.g., trying to apply set operations to records). I think that things have improved significantly with newer models, so an update from the team would be really great!

The paper notes that LLMs violated 41.9% of liveness properties but only 8.3% of safety properties (as seen in Figure 4c). This indicates a severe limitation in temporal reasoning. The violations are largely due to missing or incorrectly specified fairness assumptions, though logical structural errors often block the model from making progress before the fairness issues even manifest.

The Trace Learning Agent, which attempts to infer the TLA+ model purely from system execution traces rather than source code, fared even worse. I think this is another indication of the temporal-reasoning problem LLMs have with modeling. The trace learning agent underperformed the basic modeling and code-translation agents, failing to pass runtime checks entirely. Even Claude-Sonnet-4 achieved extremely low syntax scores when using the Trace Learning Agent.

The paper is a great start for an LLM system-modeling benchmark. But it currently lacks diversity: the dataset leans heavily on consensus protocols. Of the 7 distributed-systems artifacts, five (Etcd Raft, Redis Raft, Xline CURP, PGo raftkvs, ZooKeeper FLE) are essentially variations of distributed consensus/leader election. The authors mention that adding new systems to SysMoBench requires significant effort to instrument the system to collect execution logs for trace validation. I think they need a better sell. Why should an engineer go through the effort of instrumenting their system for SysMoBench if the AI is going to fail at modeling it? A clearer value proposition is needed to drive community contributions here.

Restoring a 2018 iPad Pro

This was surprisingly hard to find—hat tip to Reddit’s Nakkokaro and xBl4ck. Apple’s instructions for restoring an iPad Pro (3rd generation, 2018) seem to be wrong; both an Apple Store technician and I found that Finder, at least in Tahoe, won’t show the iPad once it reboots into recovery mode. The trick seems to be that you need to unplug the cable, start the reset process, and plug the cable back in during the reset:

  1. Unplug the USB cable from the iPad.
  2. Tap volume-up.
  3. Tap volume-down.
  4. Begin holding the power button.
  5. After roughly two seconds of holding the power button, plug in the USB cable.
  6. Continue holding until the iPad reboots into recovery mode.

Hopefully this helps someone else!

March 23, 2026

TLA+ mental models

In the age of LLMs, syntax is no longer the bottleneck for writing, reading, or learning TLA+. People are even getting value by generating TLA+ models and counterexamples directly from Google Docs descriptions of the algorithms. The accidental complexity of TLA+ (its syntax and tooling) is going away.

But the intrinsic complexity remains: knowing where to start a model, what to ignore, and how to choose the right abstractions. This is modeling judgment, and it is the hardest skill to teach. Engineers are trained to think in code, control flow, and local state. TLA+ forces you into a different mode: mathematical, declarative, and global. You specify what must hold, not how to achieve it. Once you get comfortable with this shift, it changes how you think about systems, even away from the keyboard.

In a companion post, I described TLA+ as a design accelerator based on lessons from 8 industry projects. Here I want to go deeper and articulate the mental models behind effective TLA+ use. These are the thinking patterns that TLA+ experts apply implicitly, the kind of knowledge acquired by osmosis over the years. I will try to make them explicit and actionable.

The mental models are:

  1. Abstraction, abstraction, abstraction
  2. Embrace the global shared memory model
  3. Refine to local guards and effects (slow is fast)
  4. Derive good invariants
  5. Explore alternatives through stepwise refinement
  6. Aggressively refine atomicity
  7. Share your mental models


1. Abstraction, abstraction, abstraction

Abstraction is a powerful tool for avoiding distraction. The word abstract comes from the Latin for "to cut and draw away". With abstraction, you slice the protocol out of a complex system, omit unnecessary details, and reduce the system to a useful model. You don't have to model everything; most of the value comes from knowing what to ignore. Omission is the default: add a component only when leaving it out breaks your reasoning goal.

Abstraction is the art of knowing what to discard. You can cross-cut to focus on the protocol you want to investigate. If you are interested in the consistency model of your distributed system, you can abstract away the mechanics of communication when that is an unnecessary distraction. If you are interested in a replication protocol, you can abstract away the client interaction. Consider two examples from our industry experience.

  • CosmosDB consistency modeling: We needed to specify CosmosDB's client-facing consistency semantics. The anti-pattern would have been modeling the distributed database engine, which would have caused an immediate state-space explosion and an unreadable spec. Instead, we modeled just the "history as a log" abstraction for client-facing behavior. The inner details of the database were "environment": irrelevant to what we were trying to reason about. We used sort-merge to capture the internals of replication, and a read index to model consistency. This way, five consistency levels became clear predicates over operation histories.
  • Secondary Index at Aurora DSQL: Here, we focused on the secondary index and abstracted away the rest of the database as a log of operations. Log is a common abstraction for State Machine Replication (SMR), which is itself a common abstraction for databases and data stores. By cutting the problem down to its essential slice, we made a complex system tractable.

Leslie Lamport gave a beautiful demonstration of this when he described how he derived Paxos. He started with the most abstract specification: defining agreement based on a vote array. In his words: "I don't remember the thought process that led me to Paxos. But I knew that execution on computers sending messages to one another was an irrelevant detail. I was thinking only about a set of processes and what they needed to know about one another. How they could get that information from messages was the easy part that came later. What I had first was what, many years later, I described as the Voting algorithm."

TLA+ is useful for teaching engineers the art of abstraction and the right way to think and reason about distributed systems. The modeling process itself trains you to ask: what is the behavioral slice I care about, and what can I safely ignore? And because TLA+ gives you a rapid prototyping tool with a tight feedback loop (write a model, check it, revise) you accumulate design experience much faster than you would by building and debugging real systems.


2. Embrace the Global Shared Memory Model

TLA+ gives you a deliberate fiction: a global shared memory that all processes can read and write. This fiction is the foundation of its computational model, and understanding it is essential to thinking in TLA+.

In TLA+, a program consists of two things: (1) a set of variables that define a global state space, and (2) a finite set of actions that transition from one state to the next. This is state-centric reasoning: everything is a predicate (a function mapping state to a boolean). This approach promotes invariant-based thinking (see mental model 4).

Each action follows Sir Tony Hoare's triples: {state} action {state'}. The execution of an action is conditional on a predicate (the "guard") being true. For example: x > 0  →  x := 0; y := y + 1. Or in TLA+ notation:

/\ x > 0
/\ x' = 0
/\ y' = y + 1

The guard is a predicate on the state space. If the guard is true in some state, the action is said to be "enabled" in that state. A program begins in any state satisfying the initial predicate. Then an enabled action is nondeterministically selected and executed atomically. This process repeats infinitely. If a selected action's guard is false, it is simply a skip and it does not change the state.
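Put together, a complete toy spec in this guarded-command style might look like the following sketch (the module and action names are mine, for illustration):

```tla
---- MODULE GuardedToy ----
EXTENDS Integers
VARIABLES x, y
vars == <<x, y>>

Init == x = 3 /\ y = 0

\* Enabled only while x > 0: drain x and bump y.
Drain == /\ x > 0
         /\ x' = 0
         /\ y' = y + 1

\* Enabled only after a drain: restore x, leave y unchanged.
Refill == /\ x = 0
          /\ x' = 3
          /\ y' = y

\* Each step, an enabled disjunct is nondeterministically selected.
Next == Drain \/ Refill

Spec == Init /\ [][Next]_vars
====
```

The model checker starts from any state satisfying Init and repeatedly executes, atomically, whichever enabled action it nondeterministically picks: exactly the loop described above.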

The global shared memory (GSM) model fits naturally with this style of reasoning. You read the current state and install a new state; no channels, no message serialization, no process boundaries to manage. It is the most minimal way to write models that fit the guarded-command model. Be frugal in defining variables, though: each new variable multiplies the size of the state space. The payoff is that safety and liveness become compact predicates over global state. Your program defines an invariant set (i.e., the good states) and must never transition out of it.

You don't have to model channels, message serialization, or network topology unless those are the specific things you're reasoning about. It is possible to map GSM to message passing if you keep to "local-ish" guards and strictly local variable assignments. What do we mean by "local variable" in a global shared state space? A common way is to use indices per node, so vote[i] refers to node i's vote. The global variable is the vote array, and the local version is vote[i]. It's all math, and math needs abstraction. TLA+'s computational model, shaped around the global shared memory fiction, enables you to reason at the right level of abstraction.
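A sketch of that indexing convention (Node, Value, None, and the action name are illustrative):

```tla
CONSTANTS Node, Value, None
VARIABLE vote   \* global array; vote[i] is node i's "local" state

Init == vote = [n \in Node |-> None]

\* Node i writes only its own slot: a local effect inside global memory.
CastVote(i) == /\ vote[i] = None
               /\ \E v \in Value : vote' = [vote EXCEPT ![i] = v]
```

The guard reads only vote[i], and the EXCEPT update touches only node i's slot, so this action maps cleanly onto a message-passing implementation.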


3. Refine to local guards and effects (slow is fast)

The global shared memory fiction of TLA+ is powerful for reasoning, but it creates a trap: it is easy to write guards that read global state no real process could observe atomically. This is one of the most common modeling errors. A guard that checks what three different nodes have done simultaneously is "illegal knowledge" as no single node in a real distributed system can know all of that at once. A dedicated review pass should ask, for every action: what information could a real node actually know when it decides to act?

So how do you transform your guards to be local? There is a folk theorem that reading your immediate neighbor is fine because it can be mapped to message passing (think hygienic dining philosophers). This essentially boils down to adding another level of indirection and proxy variables, which means locking. Locking is not good for performance, and in mental model 6 I will advocate that you should refine your atomicity and allow as much concurrency as possible for the implementation to enjoy freedom responsibly. So what do we do?

The key insight is to try to figure out stable, monotonic, or locally stable predicates to use for your guards. In my "Hints for Distributed Systems Design", when I tell you to "embrace monotonicity", I am thinking of exactly this: "For doing coordination efficiently and safely in a distributed system, the insight is almost always to transform the problem into a monotonic one."

This is where things get subtle, and it touches on esoteric knowledge that is rarely made explicit. This is something you get to intuit after years of working on distributed systems close to the theoretical side. Your advisor intuited it, their advisor intuited it, they had developed vocabulary shortcuts to refer to this but never made it fully explicit. UC Berkeley's efforts on CALM (Consistency As Logical Monotonicity) made this more concrete by basing it on a language like Dedalus, but the basic gist lives at the level of abstract global state reasoning.

In our "Slow is Fast" paper, we formalized this intuition by partitioning actions into "slow" and "fast". A slow action's guard remains true even if the node's information is slightly stale. This is because either the guard is a stable predicate (once true, stays true), depends only on local variables, or is a locally stable predicate (only the node's own actions can falsify it). A fast action, by contrast, requires fresh global state to evaluate its guard. The key result is this: if you can make your guards locally stable, the protocol requires less coordination and tolerates communication delays gracefully. Hence, slow is fast.

This is not easy to do. Ernie Cohen had internalized this skill, and he could quickly zero in on the monotonic or locally stable predicates to exploit in his protocol explorations. Some people live in this way of thinking so completely that they may not even know how to articulate it, like fish in water. (Case in point: the phrase "global shared memory fiction" was pointed out to me by Marc Brooker. Sometimes it takes a slight outsider to name what insiders take for granted.) The practical takeaway for TLA+ modeling is this: when you write your guards, ask yourself whether the information the guard relies on could become stale, and if so, whether the guard is still safe to act on. If you can make your guards depend on monotonic or locally stable predicates, your protocol will be more robust, more concurrent, and closer to a correct distributed implementation.

Lamport’s derivation of Paxos illustrates this beautifully. He begins with the simplest specification of consensus: chosen starts as the empty set and transitions to a singleton {v}. That is the entire next-state formula. He then refines to a voting algorithm where acceptors vote and a value is chosen if a majority votes for it, and refines further to Paxos to handle the problems that arise (what if N acceptors vote for v1, N for v2, and the remaining acceptor fails?). At each refinement step, the guards become more local. In Paxos, the guard for whether an acceptor should cast a vote depends on local knowledge: what ballots this acceptor has participated in. The monotonic structure of ballot numbers ensures that this local knowledge does not become invalid: once an acceptor knows something about the progress of voting, that fact is permanent. This is what makes Paxos work despite asynchrony and failures.


4. Derive good invariants

You did the modeling for a purpose, not for sport. You want to arrive at reasoning insights about your protocol, and invariants are the distilled version of those insights. Invariant-based reasoning is non-operational: instead of tracing execution paths and happy-path thinking, you ask "what needs to go right?" You specify the boundary conditions, and the model checker explores all possible interleavings to verify them.

Spend time distilling good invariants because they serve you in many ways. They guide you as you explore protocol variants (see mental model 5), telling you what you can and cannot change. They act as contracts when you compose components, defining what each part guarantees to the others. They translate directly into runtime assertions and test oracles for your implementation. And they are essential for adding fault-tolerance: once you know the invariant, you know where you need to recover to after a fault, and you can design recovery actions to reestablish it.

Invariant writing is still mostly manual. LLMs struggle with it because invariants are the most signal-heavy part of a spec: they distill your understanding of what this protocol must guarantee. Getting them right means you have closure on the problem.

People new to TLA+ and formal methods repeatedly fail at this step. I occasionally struggle with it too, especially when entering an unfamiliar domain, where I first have to pay my dues and think harder to gain understanding. The most common failure mode is writing "trivial invariants" that are always true regardless of what the protocol does; you've written the spec for naught. Another is confusing the "end state" with an invariant: an invariant must hold at every reachable state, not just the final one. We are not expecting inductive invariants (those are harder still, and more valuable, since a formal proof follows easily from one). But a reasonably tight invariant that demonstrates understanding and scaffolds further exploration is what you should aim for.

Don't stop at safety properties (what the system is allowed to do). Write liveness properties too (what the system must eventually do). It is important to check properties like <> Termination and Init ~> Solution. Do requests complete? Do leaders emerge? Many "correct" models quietly do nothing forever. A model that never violates safety but makes no progress is useless. Checking liveness catches paths that stall, specs that are overly constrained, and actions that never get enabled.
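As a sketch, with illustrative names (pc, Solution, and vars are placeholders, not from any particular spec), checking liveness means conjoining a fairness condition to the spec and stating temporal properties:

```tla
\* Weak fairness: an action that stays enabled must eventually be taken.
\* Without it, a behavior may legally stutter forever and "do nothing".
LiveSpec == Init /\ [][Next]_vars /\ WF_vars(Next)

\* Eventually something completes (assumes a pc variable tracks progress).
Termination == <>(pc = "Done")

\* Leads-to: every behavior starting from Init eventually reaches Solution.
Progress == Init ~> Solution
```

TLC checks these temporal properties only against LiveSpec; forgetting the fairness conjunct is the usual reason a liveness check spuriously fails.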

Returning to our running example: in Lamport's Paxos derivation, the invariants at each refinement level are instructive. At the Consensus level, the safety invariant is simply that at most one value is chosen (Cardinality(chosen) <= 1). At the Voting level, chosen is defined as the set of values for which there exists a ballot where a quorum of acceptors all voted for that value (chosen == {v \in Value : \E b \in Ballot : ChosenAt(b, v)}). The safety of the Voting algorithm rests on two rules: (1) different acceptors cannot vote for different values in the same ballot, and (2) an acceptor may only vote for value v in ballot b if v is safe at b. These invariants are tight, non-trivial, and they scaffold the entire refinement chain down to Paxos. They show genuine understanding of why consensus works, and that is what I mean by "closure on the problem".


5. Explore alternatives through stepwise refinement

One of TLA+'s greatest strengths is that it supports fast exploration of protocol variants. The key technique is stepwise refinement: start with the most abstract specification of your problem, then progressively add implementation detail, verifying at each step that the refinement preserves the invariants from the level above.

The canonical example is again Lamport's derivation of Paxos. The abstract Consensus spec defines a single variable chosen that transitions from the empty set to a singleton. Then an intermediate Voting model is defined and shown to be a refinement of Consensus. It describes what properties voting must satisfy, but not how acceptors implement them. Finally, the Paxos model refines Voting by adding message passing, ballot leaders, and the two-phase protocol. At each level, the refinement adds detail (ballots, quorums, messages) while preserving the safety property from above.
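The top of that chain is strikingly small. A close paraphrase of the abstract Consensus spec:

```tla
---- MODULE Consensus ----
EXTENDS FiniteSets
CONSTANT Value
VARIABLE chosen

Init == chosen = {}

\* The only allowed step: while no value is chosen, pick exactly one.
Next == /\ chosen = {}
        /\ \E v \in Value : chosen' = {v}

Spec == Init /\ [][Next]_chosen

\* Safety: at most one value is ever chosen.
Inv == Cardinality(chosen) <= 1
====
```

Everything that Voting and Paxos add (ballots, quorums, messages) must refine this two-line next-state relation.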

Refinement is at the heart of abstraction and a cornerstone of TLA+. In TLA+, refinement is simply implication: the concrete system's behaviors must be a subset of the abstract system's allowed behaviors. You check this by declaring an instance of the abstract spec in the concrete one and verifying via TLC that every behavior of the concrete system is an accepted behavior of the abstract system. Even invariant checking is refinement in disguise: does the system model implement this invariant formula?

A big advantage of stepwise refinement is that when you want to explore a protocol variant, you don't start from scratch. You go back up to the appropriate level of abstraction, change one refinement step, and get a different protocol that still satisfies the same high-level specification. This is systematic design space exploration. With LeaseGuard, for example, we started modeling a lease protocol for Raft early. While refining the abstract spec, we discovered two optimizations we hadn't anticipated, including inherited lease reads, which we probably wouldn't have found without thinking at multiple levels of abstraction.

A common failure pattern here is getting stuck at a level of detail, patching corner cases one by one. This is the implementation mindset leaking into modeling. When this happens, go back up. I saw this with the Secondary Index project at Aurora DSQL: an engineer's design was growing by accretion, each corner-case patch creating new corner cases. TLA+ forced a different approach: specify what the secondary index must guarantee abstractly, then search the solution space through refinement. Over a weekend, with no prior TLA+ experience, the engineer had written several variations. The lesson: specify behavior, not implementation, then explore different "how" choices through refinement.


6. Aggressively refine atomicity

Overly large atomic actions hide races. If your TLA+ action does ten things atomically in a single step, you're sweeping concurrency under the rug. The model will look correct, but it won't represent the interleavings your real system will face. Actions should be as fine-grained as correctness allows. Smaller steps expose the interleavings the protocol must tolerate and make invariants more meaningful.

A practical approach is to start coarse-grained to establish correctness, then systematically split actions into smaller steps and verify safety still holds. This is stepwise refinement (mental model 5) applied to action granularity. Each split increases the interleaving space, which is precisely where TLC earns its keep: the interference surface area may explode, but TLC will exhaustively check that your invariants hold.
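A schematic example of such a split, with illustrative names (unchanged-variable clauses omitted for brevity):

```tla
\* Coarse-grained: debit and credit happen in one atomic step.
Transfer(a, b, amt) ==
    /\ bal[a] >= amt
    /\ bal' = [bal EXCEPT ![a] = @ - amt, ![b] = @ + amt]

\* Refined: two steps, exposing the interleaving where the amount is
\* "in flight" and other actions can observe the intermediate state.
Withdraw(a, amt) == /\ bal[a] >= amt
                    /\ bal' = [bal EXCEPT ![a] = @ - amt]
                    /\ inflight' = inflight + amt

Deposit(b, amt) ==  /\ inflight >= amt
                    /\ bal' = [bal EXCEPT ![b] = @ + amt]
                    /\ inflight' = inflight - amt
```

After the split, TLC re-checks whether the conservation invariant still holds across the new intermediate states; if it doesn't, you have found a real race, not a modeling artifact.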

The goal here is to give the implementation maximum freedom to reorder and parallelize, while verifying that the protocol remains correct under that concurrency. Fine-grained atomicity in the spec means the implementation can schedule operations in any order that respects the guard conditions.

Our StableEmptySet project is a good example. Ernie Cohen pushed atomic actions to the finest possible granularity to avoid distributed locking, and the resulting protocol was more concurrent than a lock-based design. The interleaving surface area increased, but TLC handled it without breaking a sweat. This is where the "slow is fast" insight (mental model 3) connects back: by refining guards to depend on locally stable predicates, you can split actions finely without introducing illegal knowledge: the guards remain valid even as the global state changes between steps.


7. Share your mental models

TLA+ models are also meant to be communication tools. A well-written spec serves as precise, executable documentation of a protocol, capturing design intent in a way that prose specifications and code comments cannot match.

At AWS, when I wrote the first TLA+ model of Aurora DSQL's distributed transaction protocol, the model's value quickly went beyond correctness confidence. It served as a communication anchor for a large team. When we sought further formal methods support, the TLA+ models sped up onboarding for new team members and kept everyone aligned on the protocol's design. Instead of arguing over ambiguous prose in a design document, the team could point to specific actions and invariants in the spec.

Our MongoDB distributed transactions work reinforces this. We wrote the first modular TLA+ specification of the protocol many years after it was in production. The model now serves as an authoritative description of what the protocol actually guarantees, bringing clarity to a system that had become difficult to reason about as a whole.

So, you should write your specs for humans, not just for TLC. Use clear action names, write TypeOK invariants early (they serve as executable documentation of your data model), and make every important property an explicit invariant. A well-structured spec communicates protocol intent more clearly than thousands of lines of implementation code, because it shows the essential logic without the noise of error handling, serialization, and configuration.
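For instance, a TypeOK invariant (variable names illustrative) documents the data model at a glance:

```tla
\* Executable documentation of the data model: checked at every state.
TypeOK == /\ vote   \in [Node -> Value \cup {None}]
          /\ chosen \in SUBSET Value
          /\ ballot \in [Node -> Nat]
```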

Tools like Spectacle can visualize TLA+ state spaces and execution traces, bridging the gap between mathematical precision and the operational intuition engineers thrive on. TLA+ is one of the purest intellectual pleasures: reducing a complex distributed system to its essential logic. Please share that pleasure with others by writing about your models, presenting them, and using them to teach.

As LLMs write more of our code, the value of TLA+ for design and reasoning will only grow. TLA+ has the potential to become a cornerstone in an AI+formal methods stack for building systems. The mental models I've described here are the foundation for that future. By mastering abstraction, embracing the global shared memory model, refining to local guards, deriving good invariants, exploring alternatives through refinement, aggressively refining atomicity, and sharing our mental models, we can unlock the full power of TLA+ to design better distributed systems in the age of AI. 

Introducing Database Traffic Control

Enforce real-time limits on your Postgres query traffic to protect your database from runaway queries and unexpected load spikes.

March 20, 2026

MongoDB Query Plan Cache Explained: Performance, Pitfalls, and Re-Planning

When MongoDB receives a query, it performs the following steps:

  1. Evaluate the available indexes that could be used.
  2. Generate and test multiple execution plans using candidate indexes.
  3. Measure their performance during a trial phase.
  4. Select the fastest plan (the winning plan) and execute the query.

These steps are known as query planning, and they are […]

CedarDB: Catching Up on Recent Releases

This post takes a closer look at some of the most impactful features we have shipped in CedarDB across our recent releases. Whether you have been following along closely or are just catching up, here is a deeper look at the additions we are most excited about.

Parquet support: Your On-Ramp to CedarDB

v2026-03-03

Due to the significant compression facilitated by its columnar format, Parquet has become quite popular in the past several years. Today we find Parquet in projects like Spark, DuckDB, Apache Iceberg, ClickHouse, and others. Because of this, Parquet has become not only a data format for running analytical queries, but also a handy format for data exchange between OLAP engines. CedarDB is now able to read Parquet data and directly load it into its own native data format with a single SQL statement:

postgres=# select * from 'data.parquet' limit 6;
 id | name | email | age | city | created_at
----+---------------+-------------------+-----+-------------+------------
 1 | Alice Johnson | alice@example.com | 29 | New York | 2024-01-15
 2 | Bob Smith | bob@example.com | 34 | Los Angeles | 2024-02-20
 3 | Clara Davis | clara@example.com | 27 | Chicago | 2024-03-05
 4 | David Lee | david@example.com | 41 | Houston | 2024-04-18
 5 | Eva Martinez | eva@example.com | 23 | Phoenix | 2024-05-30
 6 | Frank Wilson | frank@example.com | 38 | San Antonio | 2024-06-12
(6 rows)

CREATE TABLE my_table as SELECT * from 'data.parquet';

This makes migrating from other OLAP systems, or ingesting data from your data lake, straightforward. Check out our Parquet Documentation and our hands-on comparison of CedarDB vs. ClickHouse using StackOverflow data, where CedarDB was 1.5-11x faster than ClickHouse after a Parquet-based migration.

Better Compression, Less Storage: Floats and Text

v2026-01-22 and v2025-12-10

Storage efficiency has seen a major leap on two fronts. For floating-point columns (i.e., of type REAL and DOUBLE PRECISION), CedarDB now applies Adaptive Lossless Floating-Point compression (ALP), a state-of-the-art technique that halves the on-disk footprint (on average) - perfect for workloads heavy on sensor readings or metrics.

For TEXT columns, we’ve adopted FSST (Fast Static Symbol Tables) combined with dictionary encoding: in practice this can halve the storage size of text-heavy tables while also making queries faster, since less data needs to be read from disk. Read our deep-dive blog post on the FSST implementation here.

without DictFSST:

postgres=# SELECT 100 * ROUND(SUM(compressedSize)::float / SUM(uncompressedSize), 4) || '%' as of_original FROM cedardb_compression_info WHERE tableName = 'hits' AND attributeName = 'title';
 of_original
-------------
 21.83%
(1 row)

with DictFSST:

postgres=# SELECT 100 * ROUND(SUM(compressedSize)::float / SUM(uncompressedSize), 4) || '%' as of_original FROM cedardb_compression_info WHERE tableName = 'hits' AND attributeName = 'title';
 of_original
-------------
 13.42%
(1 row)

Improve Parallelism of DDL and Large Writes

v2026-02-03

Schema changes and parallel bulk loads no longer bring your database to a halt. CedarDB now supports DDL operations like ALTER TABLE and CREATE INDEX on different tables in parallel. Similarly, large INSERT, UPDATE, and DELETE operations now also run in parallel. None of these operations impact parallel readers anymore. For teams running mixed OLTP/OLAP workloads, this is a significant quality-of-life improvement.

PostgreSQL Advisory Locks

v2025-11-06

CedarDB now has full support for PostgreSQL advisory locks, including blocking variants that wait until a resource is freed and automatic deadlock detection. This fills an important compatibility gap for operational applications that use advisory locks for custom mutual exclusion either to implement custom functionality (think job schedulers or distributed task queues) or require them for correctness guarantees (think schema migration tools). Check out our advisory locks documentation for more info.

SELECT pg_advisory_lock(123);
-- ... do work ...
SELECT pg_advisory_unlock(123);

Late Materialization

v2026-01-15

Internally, CedarDB now delays fetching column data until it’s actually needed, a technique called late materialization. When a query filters out most rows early, this means only the surviving rows ever have their full column data fetched from storage. The result is faster queries on wide tables, with less wasted I/O. This improvement is transparent: no changes required to your queries or schema.

In the example below, a table scan is reduced from reading 24 GB on disk to just 75 MB.

Before:

postgres=# explain analyze SELECT * FROM hits WHERE url LIKE '%google%' ORDER BY eventtime LIMIT 10;
 plan 
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 🖩 OUTPUT () +
 ▲ SORT (In-Memory) (Result Materialized: 626 KB, Result Utilization: 83 %, Peak Materialized: 635 KB, Peak Utilization: 83 %, Card: 10, Estimate: 10, Time: 0 ms (0 %))+
 🗐 TABLESCAN on hits (num IOs: 29'046, Fetched: 24 GB, Card: 853, Estimate: 67'688, Time: 10632 ms (100 % ***))
(1 row)

Time: 10651.826 ms (00:10.652)

After:

postgres=# explain analyze SELECT * FROM hits WHERE url LIKE '%google%' ORDER BY eventtime LIMIT 10;
 plan 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 🖩 OUTPUT () +
 l LATEMATERIALIZATION (Card: 10, Estimate: 10) +
 ├───▲ SORT (In-Memory) (Result Materialized: 29 KB, Result Utilization: 58 %, Peak Materialized: 34 KB, Peak Utilization: 60 %, Card: 10, Estimate: 10, Time: 10 ms (8 % *))+
 │ 🗐 TABLESCAN on hits (num IOs: 20, Fetched: 75 MB, Card: 482, Estimate: 67'688, Time: 116 ms (92 % ***)) +
 └───🗐 TABLESCAN on hits (Estimate: 10)
(1 row)

Time: 139.806 ms

New Types: Unsigned Integers, UUIDv7, and Enums

v2025-11-06, v2025-10-22, and v2026-01-22

Three type additions make CedarDB more expressive:

  • Unsigned integers (UINT1 - UINT8): Useful for importing data from Parquet, which also has native unsigned types. Also the right fit for any domain where negative values are nonsensical (counts, IDs, IP ports, flags). No more workaround with larger signed types!
  • UUIDv7: A newcomer from Postgres 18. Unlike UUIDv4, UUIDv7 embeds a timestamp and is monotonically increasing, making it index-friendly and suitable as a primary key without the index fragmentation problems of random UUIDs. Check out our UUID Documentation for more info.
  • Enum types: Columns can now be declared with a fixed set of string values, saving storage and making constraints explicit at the type level. Check out our Enum Documentation for more info.
CREATE TYPE order_status AS ENUM ('pending', 'processing', 'shipped', 'delivered', 'cancelled');

CREATE TABLE orders (
 id UUID PRIMARY KEY DEFAULT uuidv7(),
 status order_status NOT NULL DEFAULT 'pending',
 quantity UINT4 NOT NULL,
 total_cents UINT4 NOT NULL
);

INSERT INTO orders (status, quantity, total_cents) VALUES ('pending', 2, 1999), ('processing', 1, 4999), ('shipped', 5, 2495);
postgres=# select * from orders;
                  id                  |   status   | quantity | total_cents 
--------------------------------------+------------+----------+-------------
 019d0701-cb64-7c96-b90c-ebd4a4410dee | pending    |        2 |        1999
 019d0701-cb65-7c49-8a1a-2e4ea96a85b1 | processing |        1 |        4999
 019d0701-cb65-74ce-a987-b72905eb7f7a | shipped    |        5 |        2495
(3 rows)
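The index-friendliness of UUIDv7 comes from its layout (RFC 9562): the top 48 bits are a Unix millisecond timestamp, so new values sort roughly by creation time instead of scattering across the index. A Python sketch of a v7-style generator, for illustration only (CedarDB's uuidv7() is the real thing):

```python
import os
import time
import uuid

def uuid7_like():
    """Build a UUIDv7-style value: 48-bit ms timestamp, then version,
    variant, and random bits, per the RFC 9562 field layout."""
    ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF          # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 random bits
    value = (ms << 80) | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b
    return uuid.UUID(int=value)

# IDs generated in later milliseconds compare greater, so B-tree inserts
# land near the right edge of the index rather than at random pages.
u = uuid7_like()
```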

PostgreSQL Compatibility: Moving Fast

CedarDB’s PostgreSQL compatibility continues to expand rapidly. Recent months have brought new SQL grammar support, a growing roster of system functions and un-stubbed catalog tables, and bug-for-bug compatibility fixes that allow more existing PostgreSQL tooling to work out of the box.

If a tool or library failed to connect or misbehaved in a previous version, it’s worth trying again - the list of working clients and frameworks is growing steadily. If your favorite tool isn’t working yet, please let us know by creating an issue.


That’s it for now


Questions or feedback? Join us on Slack or reach out directly.

Do you want to try CedarDB straight away? Sign up for our free Enterprise Trial below. No credit card required.

March 19, 2026

Break Paxos

As I mentioned in my previous blog post, I recently got my hands on Claude Code. In the morning, I used it to build a Hybrid Logical Clocks (HLC) visualizer. That evening, I couldn't pull myself away and decided to model something more ambitious. 

I prompted Claude Code to design a Paxos tutorial game, BeatPaxos, where the player tries to "beat" the Paxos algorithm by killing one node at a time and slowing nodes down. Spoiler alert: you cannot violate Paxos safety (agreement). The best you can do is delay the decision by inducing a series of well-timed failures, and that is how you increase your score.

Thinking of the game idea and semantics was my contribution, but the screen design and animations were mostly Claude’s. I do claim credit for the "red, green, blue" theme for the nodes and the colored messages; those worked well for visualization. I also specified that double-clicking a node kills it, and that the player cannot kill another node until they double-click to recover the down one. I also instructed that the player can click and hold a node to slow its replies. These two actions capture the standard distributed consensus fault model; no Byzantine behavior is allowed, as that is a different problem with different algorithms. I include my full prompt at the end of the post.

I was surprised Claude got this mostly right in one shot. Most importantly, it got the safety-critical part of the implementation right on the first try, which is no small feat. I used the Opus model this time because I wanted the extra firepower.
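For reference, the safety-critical piece in single-synod Paxos is the acceptor's promise/accept discipline: never act on a ballot lower than one already promised, and report any previously accepted value back in the p1b. A minimal Python sketch of that logic (my paraphrase, not the game's actual code):

```python
class Acceptor:
    def __init__(self):
        self.promised = -1      # highest ballot promised in a p1b
        self.accepted = None    # (ballot, value) last accepted, if any

    def on_p1a(self, ballot):
        """Phase 1: promise the ballot if it is the highest seen,
        and report any prior accept so the new leader must adopt it."""
        if ballot > self.promised:
            self.promised = ballot
            return ("p1b", ballot, self.accepted)
        return ("nack", self.promised, None)

    def on_p2a(self, ballot, value):
        """Phase 2: accept only if no higher ballot was promised meanwhile."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return ("p2b", ballot)
        return ("nack", self.promised)

a = Acceptor()
a.on_p1a(1)                    # promise ballot 1
a.on_p2a(1, "red")             # accept ("red") at ballot 1
kind, _, prior = a.on_p1a(2)   # a higher ballot must learn about (1, "red")
```

Killing or slowing nodes only delays which ballot finishes Phase 2; the promise rule is what makes the safety invariant unbreakable.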

Here is the game, live, for you to try.




What needed fixing

There were timing issues again. The animation progressed too quickly for the player to follow, so I asked it to slow down. But more serious problems emerged, especially around the leader timeout.

The leader timeout felt wrong. I was seeing dueling leaders even in fault-free runs. After some back and forth, I found a deeper issue: even a single leader could time out on itself, start a new ballot, and end up dueling itself. This was clearly a bug in the timeout logic. After I pointed it out, Claude fixed it, and things were good.

My red, blue, green idea was a good design choice, but Claude did not follow my intent. It colored any message sent by the blue node blue, even when it was a response to a green leader. I reminded Claude that messages should be colored by the ballot leader’s color to clearly show how different ballots interleave, and this was sufficient to fix the issue.


Takeaway

This was a lot of fun, again. I feel like a kid in a candy store. It’s now trivial to build great learning tools with Claude Code. As I mentioned in my previous post, people need hands-on, interactive learning, not static figures. They need to play with algorithms, explore corner cases, and see how they behave.

A LinkedIn follower, after seeing BeatPaxos, prompted his Claude Code to translate it to Raft. The result is BeatRaft: same logic, but with Raft messages. Check it out if that’s more your vibe.



Here is the original prompt as I promised. 

let's design a paxos tutorial game, maybe we name it BeatPaxos, because the player will try to make the Paxos algorithm violate safety, spoiler alert it is not possible to.

There will be three columns, to denote N=3 nodes, let's call them red, green, blue, and we can then use the consistent color theme in the messages they send. The current leader may have a crown image on top of its box.

The single synod Paxos protocol will be implemented: p1a, p1b, p2a, p2b, p3 messages. if there is nothing learned in p1, red would want to propose red, blue would propose blue, and green would propose green. But of course the implementation follows the Paxos protocol.

The player will be given only a set of actions. It can kill one node a time, by double clicking on it. if a node is down, then doubleclicking on another node does not work. but double clicking on the down node recovers it.

The red, green, blue nodes are rounded-corner rectangular boxes in respective color. When a node gets to p2, it gets a crown, denoting it thinks it is the leader. So there can be two nodes with crown, it is possible under Paxos rules. Only one would have the highest ballot number. We can display the ballot number of a leader in its box. When a node has to return to p1, it loses the crown.

for the visualization layout I am thinking of these three lanes, lined up below red, green, blue, it just lists the messages it sent. For red, this could be p1a (ballotnum) to blue, p1a (ballotnum) to green, p2a ("red", ballotnum) to blue, p2a ("red", ballotnum) to green and blue may have p1b (ballotnum, [val]) to red etc.

The player can also click-and-hold on a node, to make that node slow to send its next message. Otherwise, messages scheduled according to Paxos, goes 5sec interval from each other.

If the leader node is down, or the player delays it by click-holding it, then a random timeout may make another node propose its candidacy by sending a p1 message.

The right pane may give the player the rules, and add some explanation about the protocol. And display the safety invariant and its evaluation on the current state. If the player manages to violate the safety invariant, player wins! 

You can program this app on javascript on single page again, so I can deploy on Github Pages.

MariaDB innovation: binlog_storage_engine, 32-core server, Insert Benchmark

MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB).

My previous post had results for sysbench with a small server. This post has results for the Insert Benchmark with a large (32-core) server. Both servers use an SSD that has high fsync latency. This is probably a best-case comparison for the feature. If you really care, then get enterprise SSDs with power loss protection. But you might encounter high fsync latency on public cloud servers.

While throughput improves with the InnoDB doublewrite buffer disabled, I am not suggesting people do that for production workloads without understanding the risks it creates.

tl;dr for a CPU-bound workload

  • throughput for write-heavy steps is larger with the InnoDB doublewrite buffer disabled
  • throughput for write-heavy steps is much larger with the binlog storage engine enabled
  • throughput for write-heavy steps is largest with both the binlog storage engine enabled and the InnoDB doublewrite buffer disabled. In this case it was up to 8.9X larger.
tl;dr for an IO-bound workload
  • see the tl;dr above
  • the best throughput comes from enabling the binlog storage engine and disabling the InnoDB doublewrite buffer, and was up to 3.26X larger.
Builds, configuration and hardware

I compiled MariaDB 12.3.1 from source.

The server has 32-cores and 128G of RAM. Storage is 1 NVMe device with ext-4 and discard enabled. The OS is Ubuntu 24.04. AMD SMT is disabled. The SSD has high fsync latency.

I tried 4 my.cnf files: z12b_sync, z12c_sync, z12b_sync_dw0 and z12c_sync_dw0.
The Benchmark

The benchmark is explained here. It was run with 12 clients for two workloads:
  • CPU-bound - the database is cached by InnoDB, but there is still much write IO
  • IO-bound - most, but not all, benchmark steps are IO-bound
The benchmark steps are:

  • l.i0
    • insert XM rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 10M for CPU-bound and 300M for IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts XM rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. X is 16M for CPU-bound and 4M for IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and YM rows are inserted and deleted per table. Y is 4M for CPU-bound and 1M for IO-bound.
    • Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload. This step runs for 1800 seconds.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: summary

The performance reports are here for CPU-bound and IO-bound.

The summary sections from the performance reports have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA. And from the third table for the IO-bound workload I see that there were failures to meet the SLA for qp500, qr500, qp1000 and qr1000.

I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from the base version.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.
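As a concrete example of the relative QPS arithmetic (the absolute numbers below are made up, not from the reports):

```python
def relative_qps(me, base):
    """Relative QPS per benchmark step: (QPS for $me) / (QPS for $base)."""
    return {step: round(me[step] / base[step], 2) for step in base}

base = {"l.i0": 10000, "l.i1": 4000}   # hypothetical absolute throughput
me   = {"l.i0": 36300, "l.i1": 11200}
relative_qps(me, base)                 # {'l.i0': 3.63, 'l.i1': 2.8}
```

Values greater than 1.0 mean the tested config beat the base config on that step.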

I often use context switch rates as a proxy for mutex contention.

Results: CPU-bound

The summary is here.

Some of the improvements here are huge courtesy of storage with high fsync latency.

Throughput is much better with the binlog storage engine enabled when the InnoDB doublewrite buffer is also enabled. Comparing z12b_sync and z12c_sync (z12c_sync uses the binlog storage engine):
  • throughput for l.i0 (load in PK order) is 3.63X larger for z12c_sync
  • throughput for l.i1 (write-only, larger transactions) is 2.80X larger for z12c_sync
  • throughput for l.i2 (write-only, smaller transactions) is 8.13X larger for z12c_sync
There is a smaller benefit from only disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12b_sync_dw0:
  • throughput for l.i0 (load in PK order) is the same for z12b_sync and z12b_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 1.14X larger for z12b_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 1.93X larger for z12b_sync_dw0
The largest benefits come from using the binlog storage engine and disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12c_sync_dw0:
  • throughput for l.i0 (load in PK order) is 3.61X larger for z12c_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 3.03X larger for z12c_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 8.90X larger for z12c_sync_dw0
The second table from the summary section has been inlined below. That table shows relative throughput which is: (QPS for my config / QPS for z12b_sync).

dbms                                          l.i0  l.x   l.i1  l.i2  qr100  qp100  qr500  qp500  qr1000  qp1000
ma120301_rel_withdbg.cz12b_sync_c32r128       1.00  1.00  1.00  1.00  1.00   1.00   1.00   1.00   1.00    1.00
ma120301_rel_withdbg.cz12c_sync_c32r128       3.63  1.00  2.80  8.13  1.01   1.01   1.02   1.01   1.02    1.02
ma120301_rel_withdbg.cz12b_sync_dw0_c32r128   1.00  1.00  1.14  1.93  1.01   0.99   1.01   1.00   1.01    0.99
ma120301_rel_withdbg.cz12c_sync_dw0_c32r128   3.61  0.86  3.03  8.90  1.01   1.00   1.01   1.00   1.01    1.01

Results: IO-bound

The summary is here.

For the read-write steps the insert SLA was not met for qr500, qp500, qr1000 and qp1000 as those steps needed more IOPs than the storage devices can provide. So I ignore those steps.

Some of the improvements here are huge courtesy of storage with high fsync latency.

Throughput is much better with the binlog storage engine enabled when the InnoDB doublewrite buffer is also enabled. Comparing z12b_sync and z12c_sync (z12c_sync uses the binlog storage engine):
  • throughput for l.i0 (load in PK order) is 3.05X larger for z12c_sync
  • throughput for l.i1 (write-only, larger transactions) is 1.22X larger for z12c_sync
  • throughput for l.i2 (write-only, smaller transactions) is 1.58X larger for z12c_sync
There is a smaller benefit from only disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12b_sync_dw0:
  • throughput for l.i0 (load in PK order) is the same for z12b_sync and z12b_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 2.06X larger for z12b_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 1.59X larger for z12b_sync_dw0
The largest benefits come from using the binlog storage engine and disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12c_sync_dw0:
  • throughput for l.i0 (load in PK order) is 3.01X larger for z12c_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 3.26X larger for z12c_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 2.78X larger for z12c_sync_dw0
The second table from the summary section has been inlined below. That table shows relative throughput which is: (QPS for my config / QPS for z12b_sync).

dbms                                          l.i0  l.x   l.i1  l.i2  qr100  qp100  qr500  qp500  qr1000  qp1000
ma120301_rel_withdbg.cz12b_sync_c32r128       1.00  1.00  1.00  1.00  1.00   1.00   1.00   1.00   1.00    1.00
ma120301_rel_withdbg.cz12c_sync_c32r128       3.05  0.96  1.22  1.58  1.04   1.71   0.98   1.30   0.99    1.23
ma120301_rel_withdbg.cz12b_sync_dw0_c32r128   1.01  0.94  2.06  1.59  1.05   2.92   1.16   1.54   1.02    1.86
ma120301_rel_withdbg.cz12c_sync_dw0_c32r128   3.01  1.03  3.26  2.78  1.08   3.76   1.43   2.87   1.02    2.64

March 18, 2026

MariaDB innovation: binlog_storage_engine, 48-core server, Insert Benchmark

MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB). See this blog post for more details on the new feature.

My previous post had results for sysbench with a small server. This post has results for the Insert Benchmark with a large (48-core) server. Storage on this server has a low fsync latency while the small server has high fsync latency.

tl;dr

  • binlog storage engine makes some things better without making other things worse
  • binlog storage engine doesn't make all write-heavy steps faster because the commit path isn't the bottleneck in all cases on a server with storage that has low fsync latency

tl;dr for a CPU-bound workload

  • the l.i0 step (load in PK order) is ~1.3X faster with binlog storage engine
  • the l.i2 step (write-only with smaller transactions) is ~1.5X faster with binlog storage engine
tl;dr for an IO-bound workload
  • the l.i0 step (load in PK order) is ~1.08X faster with binlog storage engine
Builds, configuration and hardware

I compiled MariaDB 12.3.1 from source.

The server has 48-cores and 128G of RAM. Storage is 2 NVMe device with ext-4, discard enabled and RAID. The OS is Ubuntu 22.04. AMD SMT is disabled. The SSD has low fsync latency.

I tried 4 my.cnf files:
  • z12b_sync
  • z12c_sync
    • my.cnf.cz12c_sync_c32r128 (z12c_sync) is like cz12c except it enables sync-on-commit for InnoDB. Note that InnoDB is used to store the binlog so there is nothing else to sync on commit.
  • z12b_sync_dw0
  • z12c_sync_dw0
The Benchmark

The benchmark is explained here. It was run with 20 clients for two workloads:
  • CPU-bound - the database is cached by InnoDB, but there is still much write IO
  • IO-bound - most, but not all, benchmark steps are IO-bound
The benchmark steps are:

  • l.i0
    • insert XM rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 10M for CPU-bound and 200M for IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts XM rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. X is 40M for CPU-bound and 4M for IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and YM rows are inserted and deleted per table. Y is 10M for CPU-bound and 1M for IO-bound.
    • Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload. This step runs for 3600 seconds.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: summary

The performance reports are here for CPU-bound and IO-bound.

The summary sections from the performance reports have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA. And from the third table for the IO-bound workload I see that there were failures to meet the SLA for qp500, qr500, qp1000 and qr1000.

I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from the base version.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.

I often use context switch rates as a proxy for mutex contention.

Results: CPU-bound

The summary is here.
  • Enabling the InnoDB doublewrite buffer doesn't improve performance.
With and without the InnoDB doublewrite buffer enabled, enabling the binlog storage engine improves throughput a lot for two of the write-heavy steps while there are only small changes on the other two write-heavy steps:
  • l.i0, load in PK order, gets ~1.3X more throughput
    • when the binlog storage engine is enabled (see here)
      • storage writes per insert (wpi) are reduced by about 1/2
      • KB written to storage per insert (wkbpi) is a bit smaller
      • context switches per insert (cspq) are reduced by about 1/3
  • l.x, create secondary indexes, is unchanged
    • when the binlog storage engine is enabled (see here)
      • storage writes per insert (wpi) are reduced by about 4/5
      • KB written to storage per insert (wkbpi) are reduced almost in half
      • context switches per insert (cspq) are reduced by about 1/4
  • l.i1, write-only with larger transactions, is unchanged
  • l.i2, write-only with smaller transactions, gets ~1.5X more throughput
The second table from the summary section has been inlined below. That table shows relative throughput which is: (QPS for my config / QPS for z12b_sync)

dbms                                          l.i0  l.x   l.i1  l.i2  qr100  qp100  qr500  qp500  qr1000  qp1000
ma120301_rel_withdbg.cz12b_sync_c32r128       1.00  1.00  1.00  1.00  1.00   1.00   1.00   1.00   1.00    1.00
ma120301_rel_withdbg.cz12c_sync_c32r128       1.32  1.02  0.99  1.52  1.01   1.02   1.01   1.02   1.01    1.01
ma120301_rel_withdbg.cz12b_sync_dw0_c32r128   1.00  0.94  1.00  1.03  1.03   1.02   1.03   1.02   1.03    1.02
ma120301_rel_withdbg.cz12c_sync_dw0_c32r128   1.31  1.04  1.00  1.55  1.01   1.02   1.02   1.02   1.02    1.02

Results: IO-bound

The summary is here.
  • For the read-write steps the insert SLA was not met for qr500, qp500, qr1000 and qp1000 as those steps needed more IOPs than the storage devices can provide.
  • Enabling the InnoDB doublewrite buffer improves throughput by ~1.25X on the l.i2 step (write-only with smaller transactions) but doesn't change performance on the other steps.
    • as expected there is a large reduction in KB written to storage (see wkbpi here)
  • Enabling the binlog storage engine improves throughput by 9% and 8% on the l.i0 step (load in PK order) but doesn't have a significant impact on other steps.
    • with the binlog storage engine there is a large reduction in storage writes per insert (wpi), a small reduction in KB written to storage per insert (wkbpi) and small increases in CPU per insert (cpupq) and context switches per insert (cspq) -- see here
The second table from the summary section has been inlined below. That table shows relative throughput which is: (QPS for my config / QPS for z12b_sync)

dbms                                          l.i0  l.x   l.i1  l.i2  qr100  qp100  qr500  qp500  qr1000  qp1000
ma120301_rel_withdbg.cz12b_sync_c32r128       1.00  1.00  1.00  1.00  1.00   1.00   1.00   1.00   1.00    1.00
ma120301_rel_withdbg.cz12c_sync_c32r128       1.09  1.01  0.99  1.01  0.99   0.99   0.97   0.97   1.00    0.98
ma120301_rel_withdbg.cz12b_sync_dw0_c32r128   1.01  1.01  1.01  1.25  1.01   1.04   0.80   1.31   0.94    0.90
ma120301_rel_withdbg.cz12c_sync_dw0_c32r128   1.08  1.01  1.00  1.26  0.99   1.04   0.68   1.31   0.93    0.90

Rate Limiting Strategies with Valkey/Redis

Rate limiting is one of those topics that looks simple until you’re actually doing it in production. Implement a counter with the INCR command and a TTL and away you go. But when you ask questions like “what happens at the boundary?”, “should I use a Valkey/Redis cluster?”, or “why are we getting twice the […]
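The naive counter the excerpt alludes to (INCR plus a TTL) is easy to sketch; a small in-memory stand-in for the two Redis commands keeps the sketch self-contained and makes the window-boundary problem visible:

```python
class FakeRedis:
    """In-memory stand-in for INCR/EXPIRE so the sketch runs without a server."""
    def __init__(self):
        self.store = {}   # key -> [count, expiry_timestamp]

    def incr(self, key, now):
        entry = self.store.get(key)
        if entry is None or entry[1] <= now:
            entry = self.store[key] = [0, float("inf")]
        entry[0] += 1
        return entry[0]

    def expire(self, key, ttl, now):
        self.store[key][1] = now + ttl

def allow(r, user, limit, window, now):
    # Fixed window: key is the current window number; INCR, then set the
    # TTL on the first hit so the counter resets when the window rolls over.
    key = f"rl:{user}:{int(now // window)}"
    count = r.incr(key, now)
    if count == 1:
        r.expire(key, window, now)
    return count <= limit

r = FakeRedis()
# Boundary problem: with limit=5/min, 5 requests at t=59s and 5 more at
# t=61s land in different windows, so all 10 pass within ~2 seconds.
burst = [allow(r, "u1", 5, 60, t) for t in [59] * 5 + [61] * 5]
```

That doubled burst at the window edge is exactly why sliding-window and token-bucket variants exist.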

Claude Code experiment: Visualizing Hybrid Logical Clocks

Yesterday morning I downloaded Claude Code, and wanted to see what this bad boy can do. What better way to learn how this works than coding up a toy example with it. The first thing that occurred to me was to build a visualizer for Hybrid Logical Clocks (HLC).

HLC is a simple idea we proposed in 2014: combine physical time with a logical counter to get timestamps that are close to real time but still safe under clock skew. With HLC, you get the best of both worlds: real-time affinity augmented with causality when you need it. Since then HLC has been adopted by many distributed databases, including MongoDB, CockroachDB, Amazon Aurora DSQL, YugabyteDB, etc.
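The HLC update rules are compact enough to show inline. A simplified Python sketch of the send/receive logic from the 2014 paper (illustrative, not any database's implementation):

```python
class HLC:
    """Hybrid Logical Clock: (l, c) where l tracks the max physical time
    seen so far and c is a logical counter that breaks ties."""
    def __init__(self):
        self.l, self.c = 0, 0

    def now(self, physical):
        """Local or send event: take the max of wall clock and last l;
        bump the counter when physical time has not advanced."""
        if physical > self.l:
            self.l, self.c = physical, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def recv(self, physical, msg):
        """Receive event: merge with the message timestamp so the result
        dominates both clocks, preserving causality under clock skew."""
        ml, mc = msg
        l = max(self.l, ml, physical)
        if l == self.l == ml:
            c = max(self.c, mc) + 1
        elif l == self.l:
            c = self.c + 1
        elif l == ml:
            c = mc + 1
        else:
            c = 0
        self.l, self.c = l, c
        return (l, c)

a, b = HLC(), HLC()
t = a.now(10)        # a sends at physical time 10
b_ts = b.recv(5, t)  # b's wall clock is behind, but causality wins
```

Note that b's timestamp jumps to the sender's physical component with the counter bumped, which is what keeps HLC timestamps close to real time yet causally consistent.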

This felt well scoped for a first project with Claude Code. Choosing JavaScript let me host this on GitHub Pages for free. Easy peasy way of sharing something small yet useful with people.

Claude Code is a clever idea. It is essentially an agent wrapped around the Claude LLM. Chat works well for general Q/A, but it falls short for coding. In Claude Code, the agent wraps the LLM with a UNIX terminal abstraction, and voila, we are set up for coding. Simple and effective. It is the right interface. The terminal reduces everything to small modular, composable commands. Terminal commands were the original microservices! You compose these well-defined tools with pipes to build larger workflows. Everything is text, which plays to the strengths of LLMs. Add git for versioning, and you get a tight development loop.

The process went smoothly (very smoothly, in fact) despite this being my first time using Claude Code. I described to Claude what I wanted to build. Since this was my first time, I gave a very brief description and didn't expect it to jump into action based on that minimal description. (More on this at the end of the post.)

Claude created the hlc-visualizer directory, produced the first version of index.html, and opened it in Chrome to show me the demo. I was impressed by the speed. Wow. It almost got there on the first shot. The liberties it took with the layout were smart. It used a vertical time diagram and three nodes at the top. Both made it to the final version.

The initial visualization used buttons for send and local events. These sat on the right pane, away from the action, and were not easy to use. I prompted Claude to switch to double click on a node timeline for local events, and drag and drop between timelines for send events. This felt more natural, but I was not sure Claude could pull it off. It did, on the first try.

For the record, I used Sonnet. Since this was my first experiment with Claude, I did not want to use Opus, the more expensive model. Maybe Opus would have produced a better first version. I do not know.

Agents do not seem good with timing. The simulation ran too fast for human interaction. I kept tuning it to a tolerable speed. The screen also did not follow the timeline as new events extended beyond the view. See the commit log for how this evolved. I was not very efficient because this was my first time using Claude.

I think my biggest contribution to this collaboration was to notice that we need a snapshot feature. That's the killer app for HLC. So I explained to Claude how the snapshot should be taken, and after a couple of iterations, it worked. After that, I focused on improving the interaction and visuals.

Here is the end product. Try it here. Feedback is welcome in comments. 


All in all, this was delightful. I wish I had this when I was teaching. It would help create visualizations for algorithms quickly. Students need hands-on interactive learning, not static figures. They need to play with the algorithms, explore corner cases, and see how the algorithms behave. I used TLA+ for teaching my distributed systems class, but visualizations like this are the real deal. I will do my usual plug for Spectacle, a browser-based TLA+ trace explorer and visualizer. But even with the manual animation mode, I think it would be hard to code this time-diagram visual and the snapshots there.

A final note on personality. Claude has high energy. It is a go-getter. It skips small talk, like a seasoned developer. It does not ramble like ChatGPT. Gemini Pro comes across as sound, but it sounds too uptight and uncreative when writing prose. Claude Code feels smart and sharp when coding.

OSTEP Chapter 10: MultiProcessor Scheduling

This chapter from Operating Systems: Three Easy Pieces explores multiprocessor scheduling as we transition from the simpler world of single-CPU systems to the challenges of modern multicore architectures.

This is part of our series going through OSTEP book chapters. The OSTEP textbook is freely available at Remzi's website if you'd like to follow along.


Core Challenges in Multiprocessor Scheduling

The shift to multiple CPUs introduces several hardware challenges that the operating system must manage:

  • Cache Coherence: Hardware caches improve performance by storing frequently used data. In multiprocessor systems, if one CPU modifies data in its local cache without updating main memory immediately, other CPUs may read "stale" (incorrect) data.
  • Synchronization: Accessing shared data structures across multiple CPUs requires mutual exclusion (e.g., locks). Without these, concurrent operations can lead to data corruption, such as double frees in a linked list.
  • Cache Affinity: Processes run faster when they stay on the same CPU because they can reuse state already built up in that CPU's local cache. Frequent migration across CPUs forces the system to reload this state, degrading performance.

Yep. Now we are talking about some distributed systems concepts, even inside a single computer.


Scheduling Strategies

The chapter compares two primary architectural approaches to scheduling:

  1. Single-Queue Multiprocessor Scheduling (SQMS): All jobs are placed into a single global queue. This is simple to implement and inherently balances the load across all available CPUs, but it does not scale well due to lock contention on the single queue and often ignores cache affinity.
  2. Multi-Queue Multiprocessor Scheduling (MQMS): The system maintains multiple queues, typically one per CPU. This is highly scalable and naturally preserves cache affinity since jobs stay in their assigned queue, but is vulnerable to load imbalance, where one CPU may become idle while another is overloaded.

To address load imbalances in MQMS, systems use "work stealing", where an under-utilized CPU peeks at another queue and steals jobs to balance the workload.

Modern Linux schedulers have utilized both approaches:

  • O(1) Scheduler: Multi-queue.
  • Completely Fair Scheduler (CFS): Multi-queue.
  • BF Scheduler (BFS): Single-queue.

March 17, 2026

Measuring Agents in Production

When you are in the TPOT echo chamber, you would think fully autonomous AI agents are running the world. But this December 2025 paper, "Measuring Agents in Production" (MAP), cuts through the hype. It surveys 306 practitioners and conducts 20 in-depth case studies across 26 domains to document what is actually running in live environments. The reality is far more basic, constrained, and human-dependent than TPOT suggests.


The Most Surprising Findings

Simplicity and Bounded Autonomy: 80% of case studies use predefined structured workflows rather than open-ended autonomous planning, and 68% execute fewer than 10 steps before requiring human intervention. Frankly, these systems sound to me less like autonomous agents than glorified state machines or multi-step RAG pipelines.

Prompting Beats Fine-Tuning: Despite the academic obsession with reinforcement learning and fine-tuning, 70% of teams building production agents simply prompt off-the-shelf proprietary models. Custom-tuned models are often too brittle, and they break when foundation model providers update their models.

Tolerance for Latency: While in database systems and distributed systems we focus on shaving milliseconds and microseconds off response times, in the agent world 66% of deployed systems take minutes or even longer to respond. I am not comparing or criticizing because of the intrinsically different nature of the work, I am just stating how vastly different the latency expectations are.

Custom Infrastructure Over Heavy Frameworks: Though many developers experiment with frameworks like LangChain, 85% of the detailed production case studies ended up building their systems completely in-house using direct API calls. Teams actively migrate away from heavy frameworks to reduce dependency bloat and maintain the flexibility to integrate with their own proprietary enterprise infrastructure.

Benchmarks are Abandoned: 75% of production teams skip formal benchmarking entirely. Because real-world tasks are incredibly messy and domain-specific, teams rely instead on A/B testing, production monitoring, and human-in-the-loop evaluation (which a massive 74% of systems use as their primary check for correctness).

Reliability (consistent correct behavior over time) remains the primary bottleneck and challenge. OK, this one was not really a surprising finding. 



The paper says agents in production deliver tangible value: 80% of practitioners explicitly deploy them for productivity gains, and 72% use them to drastically reduce human task-hours. This would have been a great place to be concrete and dive deeper into a couple of these cases, because companies have a strong incentive to exaggerate their agent use. The closest the paper comes to this is in the appendix, Table 3.


Discussion

So the data says that the state of multi-agent systems in production is exaggerated. Everyone says they are doing it, but only a few actually are. And those who are doing it are keeping it basic.

This feels familiar.

Remember 2018? IBM published a whitepaper stating that "7 in 10 consumer industry executives expect to have a blockchain production network in 3 years". They famously claimed blockchains would cure almost every business ailment, reducing 9 distinct frictions including "inaccessible marketplaces", "restrictive regulations", "institutional inertia", "invisible threats", and "imperfect information". Ha, "invisible threats", it cracks me up every time!


Of course, Pepperidge Farm remembers the massive 2018 hype about Walmart tracking lettuce on the blockchain to pinpoint E. coli contamination events. We were promised a decentralized revolution, but we only got shitcoins.

But comparing AI agents to blockchains is unfair. Agents already have a couple of killer applications, and they have made it into deployment, albeit in a very basic and constrained manner. It's just that they aren't the fully autonomous, hyper-intelligent multi-agent swarms that people claim they are. They remain basic, human-supervised, highly constrained tools.