a curated list of database news from authoritative sources

April 04, 2026

CPU-bound sysbench on a large server: Postgres, MySQL and MariaDB

This post has results for CPU-bound sysbench vs Postgres, MySQL and MariaDB on a large server using older and newer releases. 

The goal is to measure:

  • how performance changes over time from old versions to new versions
  • performance between modern MySQL, MariaDB and Postgres

The context here is a collection of microbenchmarks using a large server with high concurrency. Results on other workloads might be different. But you might be able to predict performance for a more complex workload using the data I share here.

tl;dr

  • for point queries
    • Postgres is faster than MySQL, MySQL is faster than MariaDB
    • modern MariaDB suffers from huge regressions that arrived in 10.5 and remain in 12.x
  • for range queries without aggregation
    • MySQL is about as fast as MariaDB, both are faster than Postgres (often 2X faster)
  • for range queries with aggregation
    • MySQL is about as fast as MariaDB, both are faster than Postgres (often 2X faster)
  • for writes
    • Postgres is much faster than MariaDB and MySQL (up to 4X faster)
    • MariaDB is between 1.3X and 1.5X faster than MySQL
  • on regressions
    • Postgres tends to be boring with few regressions from old to new versions
    • MySQL and MariaDB are exciting, with more regressions to debug
Hand-wavy summary

My hand-wavy summary about performance over time has been the following. It needs a revision, but also needs to be concise. 

Modern Postgres is about as fast as old Postgres, with some improvements. It has done great at avoiding perf regressions.

Modern MySQL at low concurrency has many performance regressions from new CPU overheads (code bloat). At high concurrency it is faster than old MySQL because the improvements for concurrency are larger than the regressions from code bloat.

Modern MariaDB at low concurrency has similar perf as old MariaDB. But at high concurrency it has large regressions for point queries, small regressions for range queries and some large improvements for writes. Note that many things use point queries internally - range scan on non-covering index, updates, deletes. The regressions arrive in 10.5, 10.6, 10.11 and 11.4.

For results on a small server with a low concurrency workload, I have many posts including:
Builds, configuration and hardware

I compiled:
  • Postgres from source for versions 12.22, 13.23, 14.21, 15.16, 16.12, 17.8 and 18.2.
  • MySQL from source for versions 5.6.51, 5.7.44, 8.0.44, 8.4.7 and 9.5.0
  • MariaDB from source for versions 10.2.30, 10.2.44, 10.3.39, 10.4.34, 10.5.29, 10.6.25, 10.11.15, 11.4.10, 11.8.6, 12.2.2 and 12.3.1
I used a 48-core server from Hetzner
  • an ax162s with an AMD EPYC 9454P 48-Core Processor with SMT disabled
  • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
  • 128G RAM
  • Ubuntu 22.04 running the non-HWE kernel (5.5.0-118-generic). The server has since been updated to Ubuntu 24.04 and I am repeating tests.
Configuration files for Postgres:
  • the config file is named conf.diff.cx10a_c32r128 (x10a_c32r128) and is here for versions 1213141516 and 17.
  • for Postgres 18 I used conf.diff.cx10b_c32r128 (x10b_c32r128) which is as close as possible to the Postgres 17 config and uses io_method=sync
The my.cnf files for MySQL are here: 5.6.515.7.448.0.4x8.4.x9.x.0

The my.cnf files for MariaDB are here: 10.2, 10.3, 10.4, 10.5, 10.6, 10.11, 11.4, 11.8, 12.2, 12.3.

I thought I was using the latin1 charset for all versions of MariaDB and MySQL but I recently learned I was using somehting like utf8mb4 on recent versions (maybe MariaDB 11.4+ and MySQL 8.0+). See here for details. I will soon repeat tests using latin1 for all versions. For some tests, the use of a multi-byte charset increases CPU overhead by up to 5%, which reduces throughput by a similar amount.

With Postgres I have been using a multi-byte charset for all versions.

Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Benchmarks are run with the database cached by Postgres.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. The benchmark is run with 40 clients and 8 tables with 10M rows per table. The database is cached.

The purpose is to search for regressions from new CPU overhead and mutex contention. I use the small server with low concurrency to find regressions from new CPU overheads and then larger servers with high concurrency to find regressions from new CPU overheads and mutex contention.

The tests can be called microbenchmarks. They are very synthetic. But microbenchmarks also make it easy to understand which types of SQL statements have great or lousy performance. Performance testing benefits from a variety of workloads -- both more and less synthetic.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries without aggregation while part 2 has queries with aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for base version)
When the relative QPS is > 1 then some version is faster than base version.  When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than base version.

The per-test results from vmstat and iostat can help to explain why something is faster or slower because it shows how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o) which are often a proxy for mutex contention.

The spreadsheet with charts is here and in some cases is easier to read than the charts below. Files with performance summaries are archived here.

The relative QPS numbers are also here for:
Files with HW efficiency numbers, average values from vmstat and iostat normalized by QPS, are here for:
Results: MySQL vs MariaDB vs Postgres

HW efficiency metrics are here. They have metrics from vmstat and iostat normalized by QPS.

Point queries
  • Postgres is faster than MySQL is faster than MariaDB
  • MySQL gets about 2X more QPS than MariaDB on 5 of the 9 tests
  • a table for relative QPS by test is here
  • from HW efficiency metrics for the random-points.range1000 test:
    • Postgres is 1.35X faster than MySQL, MySQL is more than 2X faster than MariaDB
    • MariaDB uses 2.28X more CPU and does 23.41X more context switches than MySQL
    • Postgres uses less CPU but does ~1.93X more context switches than MySQL
Range queries without aggregation
  • MySQL is about as fast as MariaDB, both are faster than Postgres (often 2X faster)
  • MariaDB has lousy results on the range-notcovered-si test because it must do many point lookups to fetch columns not in the index and MariaDB has problems with point queries at high concurrency
  • a table for relative QPS by test is here
  • from HW efficiency metrics for the scan:
    • MySQL is 1.2X faster than Postgres and 1.5X faster than MariaDB
    • MariaDB uses 1.19X more CPU and does ~1000X more context switches than MySQL
    • Postgres uses 1.55X more CPU but does few context switches than MySQL
Range queries with aggregation
  • MySQL is about as fast as MariaDB, both are faster than Postgres (often 2X faster)
  • a table for relative QPS by test is here
  • from HW efficiency metrics for read-only-count
    • MariaDB is 1.22X faster than MySQL, MySQL is 4.2X faster than Postgres
    • MariaDB uses 1.22X more CPU than MySQL but does ~2X more context switches
    • Postgres uses 4.11X more CPU than MySQL and does 1.08X more context switches
    • Query plans are here and MySQL + MariaDB benefit from the InnoDB clustered index
  • from HW efficiency metrics for read-only.range=10
    • MariaDB is 1.22X faster than MySQL, MySQL is 4.2X fasterMySQL is 1.2X faster than Postgres and 1.5X faster than MariaDB
    • MariaDB uses 1.19X more CPU and does ~1000X more context switches than MySQL
    • Postgres uses 1.55X more CPU but does few context switches than MySQL
Writes
  • Postgres is much faster than MariaDB and MySQL (up to 4X faster)
  • MariaDB is between 1.3X and 1.5X faster than MySQL
  • a table for relative QPS by test is here
  • from HW efficiency metrics for insert
    • Postgres is 3.03X faster than MySQL, MariaDB is 1.32X faster than MySQL
    • MySQL uses ~1.5X more CPU than MariaDB and ~2X more CPU than Postgres
    • MySQL does ~1.3X more context switches than MariaDB and ~2.9X more than Postgres
Results: MySQL

HW efficiency metrics are here. They have metrics from vmstat and iostat normalized by QPS.

Point queries
  • For 7 of 9 tests QPS is ~1.8X larger or more in 5.7.44 than in 5.6.51
  • For 2 tests there are small regressions after 5.6.51 -- points-covered-si & points-notcovered-si
  • a table for relative QPS by test is here
  • from HW efficiency metrics for points-covered-si:
    • the regression is explained by an increase in CPU
Range queries without aggregation
  • there is a small regression from 5.6 to 5.7 and a larger one from 5.7 to 8.0
  • a table for relative QPS by test is here
  • from HW efficiency metrics for range-covered-pk:
    • CPU overhead grows by up to 1.4X after 5.6.51, this is true for all of the tests
Range queries with aggregation
  • regressions after 5.6.51 here are smaller than in the other groups, but 5.7 tends to do better than 8.0, 8.4 and 9.5
  • a table for relative QPS by test is here
  • HW efficiency metrics are here for read-only_range=100
    • QPS changes because CPU/query changes
Writes
  • QPS improves after 5.6 by up to ~7X
  • a table for relative QPS by test is here
  • HW efficiency metrics are here insert
    • QPS improves after 5.6.51 because CPU per statement drops
Results: MariaDB

HW efficiency metrics are here. The have metrics from vmstat and iostat normalized by QPS.

Point queries
  • QPS for 6 of 9 tests drops in half (or more) from 10.2 to 12.3
  • a table for relative QPS is here
  • most of the regressions arrive in 10.5 and the root cause might be remove support for innodb_buffer_pool_intances and only support one buffer pool instance
  • HW efficiency metrics are here for points-covered-pk
    • there are large increases in CPU overhead and the context switch rate starting in 10.5
Range queries without aggregation
  • for range-covered-* and range-notcovered-pk there is a small regression in 10.4
  • for range-not-covered-si there is a large regression in 10.5 because this query does frequent point lookups on the PK to get missing columns
  • for scan there is a regression in 10.5 that goes away, but the regressions return in 10.11 and 11.4 
  • a table for relative QPS by test is here
  • HW efficiency metrics are here
Range queries with aggregation
  • for most tests there are small regressions in 10.4 and 10.5
  • a table for relative QPS by test is here
  • HW efficiency metrics are here
Writes
  • for most tests modern MariaDB is faster than 10.2
  • table for relative QPS by test is here
  • HW efficiency metrics are here
Results: Postgres

HW efficiency metrics are here. They have metrics from vmstat and iostat normalized by QPS.

Point queries
  • QPS for hot-points increased by ~2.5X starting in Postgres 17.x
  • otherwise QPS is stable from 12.22 through 18.2
  • a table for relative QPS by test is here
  • HW efficiency metrics for the hot-points test are here
    • CPU drops by more than half starting in 17.x
Range queries without aggregation
  • QPS is stable for the range-not-covered-* and scan tests
  • QPS drops almost in half for the range-covered-* tests
  • a table for relative QPS by test is here
  • all versions use the same query plan for the range-covered-pk test
  • HW efficiency metrics are here for range-covered-pk and for range-covered-si
    • An increase in CPU overhead explains the regressions for range-covered-*
    • I hope to get flamegraphs and thread stacks for these tests to explain what happens
Range queries with aggregation
  • QPS is stable from 12.22 through 18.2
  • a table for relative QPS by test is here
  • HW efficiency metrics are here
Writes
  • QPS is stable for 5 of 10 tests
  • QPS improves by up to 1.7X for the other 5 tests, most of that arrives in 17.x
  • a table for relative QPS by test is here
  • HW efficiency metrics are here for update-index




















    April 03, 2026

    OSTEP Chapter 14: Interlude -- Memory API

    This is a short chapter covering the nuts and bolts of memory allocation in C: malloc(), free(), and the many ways programmers get them wrong.

    This is part of our series going through OSTEP book chapters. The OSTEP textbook is freely available at Remzi's website if you like to follow along.


    Stack vs. Heap

    C gives you two kinds of memory. Stack memory is automatic: the compiler allocates it when you enter a function and reclaims it when you return. Heap memory is manual: you allocate it with malloc() and free it with free(). Let's remember the layout from Chapter 13.


    The distinction is simple in principle: use the stack for short-lived local data, use the heap for anything that must outlive the current function call. The heap is where the trouble lives. It forces the programmer to reason about object lifetimes at every allocation site. The compiler won't save you; a C program with memory bugs compiles and runs just fine, until it doesn't.


    The API

    malloc(size_t size) takes a byte count and returns a void * pointer to the allocated region, or NULL on failure. The caller casts the pointer and is responsible for passing the right size. The idiomatic way is sizeof(), which is a compile-time operator, not a function: double *d = (double *) malloc(sizeof(double));

    For strings, you must use malloc(strlen(s) + 1) to account for the null terminator. Using sizeof() on a string pointer gives you the pointer size (4 or 8 bytes), not the string length. This is a classic pitfall.

    free() takes a pointer previously returned by malloc(). It does not take a size argument; the allocator tracks that internally.

    Note that malloc() and free() are library calls, not system calls. The malloc library manages a region of your virtual address space (the heap) and calls into the OS when it needs more. The underlying system calls are brk / sbrk (which move the program break, i.e., the end of the heap segment) and mmap (which creates anonymous memory regions backed by swap). You should never call brk or sbrk directly.


    The Rogues' Gallery of Memory Bugs

    The chapter catalogs the common errors. Every C programmer has hit most of these, as I did back in the day:

    • Forgetting to allocate: Using an uninitialized pointer, e.g., calling strcpy(dst, src) where dst was never allocated. Segfault.
    • Allocating too little: The classic buffer overflow. malloc(strlen(s)) instead of malloc(strlen(s) + 1). This may silently corrupt adjacent memory or crash later. This is a sneaky bug, because it can appear to work for years.
    • Forgetting to initialize: malloc() does not zero memory. You read garbage. Use calloc() if you need zeroed memory.
    • Forgetting to free: Memory leaks. Benign in short-lived programs (the OS reclaims everything at process exit), catastrophic in long-running servers and databases.
    • Freeing too early: Dangling pointers. The memory gets recycled, and you corrupt some other allocation.
    • Freeing twice: Undefined behavior. The allocator's internal bookkeeping gets corrupted.
    • Freeing wrong pointers: Passing free() an address it didn't give you. Same result: corruption.

    The compiler catches none of these. You need runtime tools: valgrind for memory error detection, gdb for debugging crashes (oh, noo!!), purify for leak detection.

    A while ago, I had a pair of safety goggles sitting on my computer desk (I guess I had left them there after some DIY work). My son asked me what they are for. At the spur of the moment, I told him, they are for when I am writing C code. Nobody wants to get stabbed in the eye by a rogue pointer.


    Discussion

    This chapter reads like a war story. Every bug it describes has brought down production systems. The buffer overflow alone has been responsible for decades of security vulnerabilities. The fact that C requires manual memory management, and that the compiler is silent about misuse, is simultaneously the language's power and its curse. In case you haven't read this by now, do yourself a favor and read "Worse is Better". It highlights a fundamental tradeoff in system architecture: do you aim for theoretical correctness and perfect safety, or do you prioritize simplicity to ensure practical evolutionary survival? It argues that intentionally accepting a few rough/unsafe edges and building a lightweight practical system is often the smarter choice, as these simple good enough tools are the ones that adapt the fastest, survive, and run the world. This is a big and contentious discussion point, where it is possible to defend both sides equally vigorously. The debate is far from over, and LLMs bring a new dimension to it.  

    Anyhoo, the modern response to the dangers of C programming has been to move away from manual memory management entirely. Java and Go use garbage collectors. Python uses reference counting plus a cycle collector. These eliminate use-after-free and double-free by design, at the cost of runtime overhead and unpredictable latency, which make them not as applicable for systems programming.

    The most interesting recent response is Rust's ownership model. Rust enforces memory safety at compile time through ownership rules: every value has exactly one owner, ownership can be transferred (moved) or borrowed (referenced), and the compiler inserts free calls automatically when values go out of scope. This eliminates the entire gallery of memory bugs (no dangling pointers, no double frees, no leaks for owned resources, no buffer overflows) without garbage collection overhead. Rust achieves the performance of manual memory management with the safety of a managed language. But, the tradeoff is a steep learning curve; the borrow checker forces you to think about lifetimes explicitly, which is the same reasoning C requires but now enforced by the Rust compiler rather than left to hope and valgrind.

    There has also been a push from the White House and NSA toward memory-safe languages for critical infrastructure. The argument is straightforward: roughly 70% of serious security vulnerabilities in large C/C++ codebases (Chrome, Windows, Android) are memory safety bugs. The industry is slowly moving toward this direction: Android's new code is increasingly Rust, Linux has accepted Rust for kernel modules, and the curl project has been rewriting components in Rust and memory-safe C.

    For those of us working on distributed systems and databases, memory management remains a concern. Database buffer pools, memory-mapped I/O, custom allocators for hot paths all require the kind of low-level control and care when wielding that low-level control. The bugs described in this chapter can also corrupt data.

    April 02, 2026

    100,000 GitHub stars

    Supabase hits 100,000 GitHub stars. A reflection on community, open source, and what got us here.

    A Conversation with Paul Masurel, Creator of Tantivy

    We sat down with Paul Masurel (creator of Tantivy and co-founder of Quickwit) to talk about building a search engine in Rust, following the competitive happy-path with Lucene, open-source maintenance, and the lessons learned along the way.

    April 01, 2026

    Percona Operator for PostgreSQL 2.9.0: PostgreSQL 18 Default, PVC Snapshot Backups, LDAP Support, and More!

    We are excited to announce Percona Operator for PostgreSQL 2.9.0! In this release, we bring significant improvements across database lifecycle management, security, backup/restore, and operational observability, making it easier than ever to run production PostgreSQL on Kubernetes. Here’s a deep dive into what’s new.   Percona Operator for PostgreSQL 2.9.0 PostgreSQL 18 Is Now the […]

    Percona Operator for PostgreSQL 2.9.0: PostgreSQL 18 Default, PVC Snapshot Backups, LDAP Support, and More!

    We are excited to announce Percona Operator for PostgreSQL 2.9.0! In this release, we bring significant improvements across database lifecycle management, security, backup/restore, and operational observability, making it easier than ever to run production PostgreSQL on Kubernetes. Here’s a deep dive into what’s new.   Percona Operator for PostgreSQL 2.9.0 PostgreSQL 18 Is Now the […]

    PolitePaxos: A New Consensus Variant Where the Proposer Just Asks Nicely

    Paxos consensus protocol, despite its many theoretical virtues, is fundamentally rude. One need only look at the way it behaves to see the problem.

    A leader seizes power. It dictates values. When two leaders happen to propose simultaneously, they do not pause, tip their hats, and work things out over tea. No,  they duel, each furiously incrementing ballot numbers at the other like barbarians engaging in a perpetual pissing contest (please excuse my language). The follower nodes, meanwhile, are reduced to the role of obedient subjects, promising their allegiance to whichever proposer shouts loudest and most recently.

    Having spent decades studying this rude behavior (with WPaxos, PigPaxos, and through a great many posts on this very blog) I became convinced that the field had made a fundamental error. We had been asking "how do we reach agreement?" when the real question we should have been asking, all along, was "would it kill you to ask nicely?"

    It is therefore with considerable pride that I present PolitePaxos, which we believe to be the first consensus protocol that could be used in polite company. In PolitePaxos the Prepare and Promise phases are replaced with a single Polite Request phase. The proposer does not demand promises. It introduces itself, compliments the acceptors on their uptime. (In testing, the compliment "you have lovely page cache hit rates" proved most effective. "Is that a new kernel?" was also well-received.) It then explains why it believes this particular value might be rather nice if it isn't too much trouble, and then waits. We formalize politeness as a tuple (courtesy_level, sincerity_score, compliment_quality) and prove that when all three exceed a threshold τ, consensus is reached in a single round, assuming no sociopathic nodes.

    Early results were extremely encouraging... right up until the moment they became catastrophic. The problem arose when two nodes happened to submit concurrent Polite Proposals. What followed was not a leader duel but something far worse: an infinite loop of mutual deference. "Oh no, please, your value is far more deserving." "I couldn't possibly, yours is clearly the superior proposal." "I absolutely insist." "No, I absolutely insist." Both nodes continued doing so with increasing sincerity until the cluster ground to a halt. Anyone who has ever tried to enter a doorway at the same time as another British person, or anyone familiar with the Persian social protocol of taarof, will recognize this failure mode immediately: an infinite, perfectly symmetrical, exquisitely polite livelock! The logs were the most courteous system failure we had ever witnessed.

    Our solution, after considerable anguish and several pots of tea, was the introduction of a Pre-Pre-Pre-Polite Vibecheck Phase. Before a node so much as thinks about proposing a value, it sends a tentative, noncommittal, almost apologetic probe to the cluster. Not a proposal, not even a pre-proposal, but a sort of gentle clearing of the throat, a conversational "I don't suppose anyone would mind terribly if I were to, at some point, perhaps, maybe, suggest something?" If the vibes come back favorable, the node proceeds to the Polite Request. If another node is already vibechecking, the node with the lower timestamp gracefully withdraws, muttering that it wasn't that important anyway.

    This resolved the taarof livelock entirely and restored liveness, though we note that the three additional sub-rounds of pre-politeness have increased latency to what we describe in the paper as leisurely. I had previously built a BeatPaxos simulator for classic Paxos. You can go play with it to see just how boorish the original protocol is. We intend to build a PolitePaxos simulator as well, just as soon as we can figure out how to animate a node looking embarrassed. The full paper is available on our website, but only if you ask nicely.

    Happy April Fools' Day.

    Benchmarking MyRocks vs. InnoDB in Memory-Constrained Environments

    Benchmarking MyRocks vs. InnoDB in Memory-Constrained Environments It is a well-known fact in the database world that InnoDB is incredibly fast when the entire database fits into memory. But what happens when your data grows beyond your available RAM? MyRocks, built on RocksDB, is frequently recommended as a superior choice for environments constrained by memory, […]

    Single-Cluster Duality View 🃏

    In DynamoDB, a single-table design stores one-to-many relationships in a single physical block while still following relational-like normal form decomposition. In MongoDB, the Single Collection Pattern unnests relationships from a single document, but goes against the general recommendation as it sacrifices one of MongoDB’s key advantages—keeping a document in a single block. In Oracle Database and MySQL, JSON-relational duality views normalize JSON documents into relational tables that span multiple blocks, also without the data-locality benefits of MongoDB. Can we turn these duality views into a document-database equivalent that keeps together the data accessed together?

    Here is the Single-Cluster Duality View 🥁.

    DynamoDB

    NoSQL started with a simple key‑value API. For example, DynamoDB can get an item using the full key, query multiple items using a partial key, or scan all items without a key. The value, in this key‑value datastore, is atomic, but sometimes you need partial reads or writes. Then you have two solutions: read or write the full item, which is inefficient if you don't need it, or change your schema design to split it into multiple items.

    The latter was known as the single table design because it not only splits items into multiple entities but also stores them in the same table, sharing a key prefix, to retrieve all items with a single query. The idea is that you can access individual objects while also benefiting from data locality when querying the full aggregate.

    The benefit of this design relies heavily on DynamoDB specifics: the storage internals, where items are partitioned and clustered by their key, and the billing model, where you pay per table request unit.

    Here is an example, using the traditional one-to-many schema with departments employees:

    • One DynamoDB table
    • Department and employees stored as separate items
    • Same partition key (DEPT#<deptno>) for both:
    aws dynamodb create-table \
      --table-name scott \
      --attribute-definitions \
        AttributeName=PK,AttributeType=S \
        AttributeName=SK,AttributeType=S \
      --key-schema \
        AttributeName=PK,KeyType=HASH \
        AttributeName=SK,KeyType=RANGE \
      --billing-mode PAY_PER_REQUEST
    

    The logical model, where the relationship is materialized by sharing the department number as the same partition key, is also the physical model in which they are stored together, since items are partitioned on this key:

    PK SK Meaning
    DEPT#10 DEPT Department row
    DEPT#10 EMP#7782 Employee
    DEPT#10 EMP#7839 Employee
    DEPT#20 DEPT Department row
    DEPT#20 EMP#7369 Employee

    I insert some data

    aws dynamodb put-item --table-name scott --item '{
      "PK": {"S": "DEPT#10"},
      "SK": {"S": "DEPT"},
      "deptno": {"N": "10"},
      "dname": {"S": "ACCOUNTING"},
      "loc": {"S": "NEW YORK"}
    }'
    
    aws dynamodb put-item --table-name scott --item '{
      "PK": {"S": "DEPT#10"},
      "SK": {"S": "EMP#7782"},
      "empno": {"N": "7782"},
      "ename": {"S": "CLARK"},
      "job": {"S": "MANAGER"},
      "sal": {"N": "2450"}
    }'
    

    Here is a query that retrieves the whole aggregate by its partition key:

    aws dynamodb query \
      --table-name scott \
      --key-condition-expression "PK = :d" \
      --expression-attribute-values '{":d":{"S":"DEPT#10"}}' \
      --consistent-read \
      --return-consumed-capacity TOTAL 
    
    
    {
        "Items": [
            {
                "deptno": { "N": "10" },
                "PK": { "S": "DEPT#10" },
                "loc": { "S": "NEW YORK" },
                "dname": { "S": "ACCOUNTING" },
                "SK": { "S": "DEPT" }
            },
            {
                "ename": { "S": "CLARK" },
                "PK": { "S": "DEPT#10" },
                "job": { "S": "MANAGER" },
                "empno": { "N": "7782" },
                "sal": { "N": "2450" },
                "SK": { "S": "EMP#7782" }
            }
        ],
        "Count": 2,
        "ScannedCount": 2,
        "ConsumedCapacity": {
            "TableName": "scott",
            "CapacityUnits": 1.0
        }
    }
    

    All items are read from a single partition with 1 consistent read capacity unit (RCU), thanks to physical colocation by key design.

    MongoDB

    MongoDB doesn't need to split documents to get partial reads and writes. The data modeling objective is to keep aggregates in a single document and use the advanced API to access individual items. For example, you can use covering indexes or search indexes with returnStoredSource to avoid reading the whole document. And you can use $push or $set with arrayFilters to update individual array items.

    Still, some users preferred to split the documents and, by analogy to DynamoDB, called it the Single Collection Pattern. In truth, it is rarely useful. MongoDB is not DynamoDB. Storing documents in one or multiple collections doesn't impact billing, and using the same key prefix doesn't co‑locate them — except in two cases: clustered collections (special‑purpose, not generally recommended), and sharing a sharding key to co‑locate data on the same shard (but still different blocks in the filesystem).

    At the storage block level, documents may be stored together only if they were inserted together.

    Although this is not a recommendation when you use the full power of the MongoDB API, here is an example using a single collection and embedding only references, following the documentation: one document per department and one document per employee:

    db.empdept.insertMany([
    {
      _id: "DEPT#10",
      doc_type: "dept",
      deptno: 10,
      dname: "ACCOUNTING",
      loc: "NEW YORK",
      links: [
        { target: "DEPT#10", doc_type: "dept" },
        { target: "EMP#7782", doc_type: "emp" },
        { target: "EMP#7839", doc_type: "emp" }
      ]
    },{
      _id: "EMP#7782",
      doc_type: "emp",
      empno: 7782,
      ename: "CLARK",
      job: "MANAGER",
      sal: 2450,
      deptno: 10,
      links: [
        { target: "EMP#7782", doc_type: "emp" },
        { target: "DEPT#10", doc_type: "dept" }
      ]
    },{
      _id: "EMP#7839",
      doc_type: "emp",
      empno: 7839,
      ename: "KING",
      job: "PRESIDENT",
      sal: 5000,
      deptno: 10,
      links: [
        { target: "EMP#7839", doc_type: "emp" },
        { target: "DEPT#10", doc_type: "dept" }
      ]
    }];
    
    

    With this schema, the following gets the full aggregate as multiple documents without an aggregation pipeline:

    
    db.empdept.find({ "links.target": "DEPT#10" })
    
    

    This can identify the document using a single index range with a multi-key index on "links.target", but it will have to fetch multiple small documents, introduces strong coupling between those documents without the possibility of validating the reference, and requires an explicit transaction and retry logic to update them.

    MongoDB is designed to store aggregates as a single document, and the right schema is simply:

    db.empdept.insertOne({
      _id: 10,
      dname: "ACCOUNTING",
      loc: "NEW YORK",
      employees: [
        { empno: 7782, ename: "CLARK", job: "MANAGER", sal: 2450 },
        { empno: 7839, ename: "KING",  job: "PRESIDENT", sal: 5000 }
      ]
    })
    

    With this schema, you can still query the full aggregate with a simple find():

    
    db.empdept.find({ _id: 10 })
    
    

    and you can still update a single employee in‑place:

    db.empdept.updateOne(
      { _id: 10, "employees.empno": 7782 },
      { $set: { "employees.$.sal": 2600 } }
    )
    

    This updates the first matching array item using the positional $ operator. You can be more precise with arrayFilters

    db.empdept.updateOne(  
      { _id: 10 },  
      { $set: { "employees.$[e].sal": 2600 } },  
      { arrayFilters: [{ "e.empno": 7782 }] }  
    )  
    

    I've written about this and measured the efficiency in a previous post:


    Emulations

    A single document per aggregate, with embedded one-to-many relationships, is efficient in MongoDB, not in emulations. For example, in Oracle Autonomous Database, where the document is stored as OSON, the update will rewrite the full document:

    ora> db.empdept.updateOne(
      { _id: 10, "employees.empno": 7782 },
      { $set: { "employees.$.sal": 2600 } }
    )
    
    {
      acknowledged: true,
      insertedId: null,
      matchedCount: 1,
      modifiedCount: 1,
      upsertedCount: 0
    }
    
    ora> db.aggregate( [ { $sql : " select * from dbms_xplan.display_cursor( format=>'BASIC') " } ] ).forEach(row => print(row.PLAN_TABLE_OUTPUT));
    
    EXPLAINED SQL STATEMENT:
    ------------------------
    update "ORA"."empdept" set "DATA" = :1 where ("RESID" = :2 )  returning "RESID", "ETAG" into :3 , :4
    
    Plan hash value: 893016358
    
    -----------------------------------------------------
    | Id  | Operation                    | Name         |
    -----------------------------------------------------
    |   0 | UPDATE STATEMENT             |              |
    |   1 |  UPDATE                      | empdept      |
    |   2 |   TABLE ACCESS BY INDEX ROWID| empdept      |
    |   3 |    INDEX UNIQUE SCAN         | SYS_C0031043 |
    -----------------------------------------------------
    
    

    You have also the possibility to store the collection with JSON-relational duality views (JDV), but this partial update is not supported:

    arrayFilters option is not supported on duality view collections
    

    However, even if the JDV doesn't allow partial updates, the aggregate is stored as individual in SQL tables, and users can switch to SQL statements for partial updates.

    Duality Views

    In relational databases, applications typically work with a logical view of the data model, while the storage layer may transparently cluster or scatter data into fixed-size blocks. Oracle Database (and MySQL) added JSON-Relational Duality Views (JDV) to present a single logical view for JSON documents, while splitting them to multiple SQL tables. As each SQL tables have their own physical segment(s), aggregates are stored across multiple physical blocks.

    If you are a fan of the single‑table or single‑collection idea, you may create a single view. However, it has no advantage in data locality, as you explicitly normalized to multiple tables:

    CREATE TABLE dept (
      deptno NUMBER CONSTRAINT dept_pk PRIMARY KEY,
      dname  VARCHAR2(14),
      loc    VARCHAR2(13)
    );
    
    CREATE TABLE emp (
      empno  NUMBER CONSTRAINT emp_pk PRIMARY KEY,
      ename  VARCHAR2(10),
      job    VARCHAR2(9),
      sal    NUMBER(7,2),
      deptno NUMBER NOT NULL,
      CONSTRAINT emp_dept_fk FOREIGN KEY (deptno) REFERENCES dept(deptno)
    );
    
    CREATE INDEX emp_dept_fk ON emp (deptno);
    
    INSERT INTO dept (deptno, dname, loc)  VALUES (10, 'ACCOUNTING', 'NEW YORK');
    INSERT INTO emp (empno, ename, job, sal, deptno)  VALUES (7782, 'CLARK', 'MANAGER', 2450, 10);
    INSERT INTO emp (empno, ename, job, sal, deptno)  VALUES (7839, 'KING', 'PRESIDENT', 5000, 10);
    -- add many departments to check the number of blocks scanned
    INSERT INTO dept(deptno) SELECT rownum+1000
     FROM xmltable('0 to 999')
    ;
    
    -- one view over multiple tables with JDV
    
    CREATE OR REPLACE JSON RELATIONAL DUALITY VIEW empdept_dv AS
    SELECT JSON {
      '_id'  : d.deptno,
      'dname': d.dname,
      'loc'  : d.loc,
      'employees' : [
        SELECT JSON {
          'empno': e.empno,
          'ename': e.ename,
          'job'  : e.job,
          'sal'  : e.sal
        }
        FROM emp e WITH INSERT UPDATE DELETE
        WHERE e.deptno = d.deptno
      ]
    }
    FROM dept d WITH INSERT UPDATE DELETE
    ;
    

    The view looks like a document collection with an embedded one-to-many relationship, similar to MongoDB’s recommended model, but it’s not truly embedded—the view actually splits it into two tables. We have lost the main advantage of MongoDB: data that's accessed together should be stored together.

    I can query it as if it were a single table:

    SELECT data 
    FROM empdept_dv
    WHERE
    JSON_EXISTS("DATA",'$?(@._id.numberOnly() == $B0)' passing 10 as "B0"
    type(strict))
    ;
    

    but it actually reads two tables:

    SQL> select * from dbms_xplan.display_cursor(format=>'ALLSTATS LAST -COST');
    
    PLAN_TABLE_OUTPUT
    ____________________________________________________________________________________________________________________________________________
    SQL_ID  gfr8jkdwatdnz, child number 0
    -------------------------------------
    SELECT data  FROM empdept_dv WHERE JSON_EXISTS("DATA",'$?(@._id.numberOnly() == $B0)' passing 10 as "B0" type(strict))
    
    Plan hash value: 2755083285
    
    ----------------------------------------------------------------------------------------------------------------------------
    | Id  | Operation                            | Name        | Starts | E-Rows | A-Rows | Buffers |  OMem |  1Mem | Used-Mem |
    ----------------------------------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                     |             |      1 |        |      1 |       3 |       |       |          |
    |   1 |  SORT GROUP BY                       |             |      1 |      1 |      1 |       2 |  2048 |  2048 | 2048  (0)|
    |   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| EMP         |      1 |      1 |      2 |       2 |       |       |          |
    |*  3 |    INDEX RANGE SCAN                  | EMP_DEPT_FK |      1 |      1 |      2 |       1 |       |       |          |
    |   4 |  TABLE ACCESS BY INDEX ROWID         | DEPT        |      1 |      1 |      1 |       3 |       |       |          |
    |*  5 |   INDEX UNIQUE SCAN                  | DEPT_PK     |      1 |      1 |      1 |       2 |       |       |          |
    ----------------------------------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       3 - access("E"."DEPTNO"=:B1)
       5 - access("D"."DEPTNO"=10)
    
    

    The access is efficient, using the index on the foreign key, but it doesn't colocate a one-to-many relationship as we expect when storing an aggregate into a single document, or when using the single-table design.

    You can update through the view, but the following will update a lot more than what you expect, as it rewrites the full document:

    UPDATE empdept_dv
    SET data = JSON_TRANSFORM(
      data,
      SET '$.employees[0].sal' = 2600
    )
    WHERE
    JSON_EXISTS("DATA",'$?(@._id.numberOnly() == $B0)' passing 10 as "B0"
    type(strict))
    ;
    

    Using duality views instead of OSON storage allows you to fall back to SQL to update a single row, similar to a single-table design. However, unlike DynamoDB, you cannot group different item types in the same table. These are SQL tables with a fixed schema and no polymorphism, so you cannot store both department and employee attributes in a single table. JSON-Relational Duality Views don’t replace document databases: they act like an object-relational mapper (ORM) with all logic deployed in the SQL database.

    Cluster

    To get closer to the performance of a document database, I need a single-table duality view capable of storing documents that are split across multiple SQL tables, yet behave as if they were in a single physical table. In Oracle Database, this corresponds to a CLUSTER, where only the key is declared, like DynamoDB tables, and an estimated value size are predefined to fill the fixed-size blocks:

    CREATE CLUSTER dept_cluster (
      deptno NUMBER ------------------------------------- clustering key
    ) 
    SIZE 1024 --  expected bytes per cluster key
    ;
    
    CREATE INDEX dept_cluster_idx ON CLUSTER dept_cluster 
    ;
    

    The logical SQL tables are stored in the cluster under a common clustering key, instead of allocating their own physical segment:

    CREATE TABLE dept ( 
      deptno NUMBER CONSTRAINT dept_pk PRIMARY KEY,   --- clustering key
      dname  VARCHAR2(14), 
      loc    VARCHAR2(13)  
     )
     CLUSTER dept_cluster (deptno) ---------------------- clustering key
    ; 
    
    CREATE TABLE emp (
      empno  NUMBER CONSTRAINT emp_pk PRIMARY KEY, 
      ename  VARCHAR2(10),  
      job    VARCHAR2(9), 
      sal    NUMBER(7,2), 
      deptno NUMBER NOT NULL,      ---------------------- clustering key
      CONSTRAINT emp_dept_fk 
                 FOREIGN KEY (deptno) 
                 REFERENCES dept(deptno)
     )
     CLUSTER dept_cluster (deptno) ---------------------- clustering key
    ;
    
    INSERT INTO dept (deptno, dname, loc)  VALUES (10, 'ACCOUNTING', 'NEW YORK');
    INSERT INTO emp (empno, ename, job, sal, deptno)  VALUES (7782, 'CLARK', 'MANAGER', 2450, 10);
    INSERT INTO emp (empno, ename, job, sal, deptno)  VALUES (7839, 'KING', 'PRESIDENT', 5000, 10);
    -- add many departments to check the number of blocks scanned
    INSERT INTO dept(deptno) SELECT rownum+1000
     FROM xmltable('0 to 999')
    ;
    

    I created the same duality view as before, as the logical tables are the same, and query it:

    SELECT data 
    FROM empdept_dv
    WHERE
    JSON_EXISTS("DATA",'$?(@._id.numberOnly() == $B0)' passing 10 as "B0"
    type(strict))
    ;
    

    Physically, it accesses the logical tables in the pre-joined cluster, via its index:

    SQL> select * from dbms_xplan.display_cursor(format=>'ALLSTATS LAST');
    
    PLAN_TABLE_OUTPUT
    ------------------------------------------------------------------------------------------------------------------------
    SQL_ID  26py2vsmch5kx, child number 0
    -------------------------------------
    SELECT data FROM <... (truncated)
                                        

    March 31, 2026

    gcc vs clang for sysbench on a small server with Postgres, MySQL and MariaDB

    This has results for sysbench on a small server and compares performanc for Postgres, MySQL and MariaDB compiled using clang vs using gcc.

    tl;dr

    • Throughput with clang and gcc is similar

    Builds, configuration and hardware

    I compiled Postgres 18.3, MySQL 8.4.8 and MariaDB 11.8.6 from source. The server has 8 AMD cores with SMT disabled and 32G of RAM. The OS is Ubuntu 24.04, gcc is version 13.3.0 and clang is version 18.1.3. Storage is ext-4 with discard enabled and an NVMe SSD.

    Benchmark

    I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by InnoDB.

    The tests are run using 1 client and 1 table with 50M rows. The read-heavy microbenchmarks run for 630 seconds and the write-heavy for 930 seconds.

    Results

    The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

    I provide tables below with relative QPS. When the relative QPS is > 1 then some version is faster than the base version. When it is < 1 then there might be a regression. The number below are the relative QPS computed as: (QPS with a gcc build / QPS with a clang build)

    Legend:
    * pg - for Postgres 18.3, (QPS with gcc / QPS with clang)
    * my - for MySQL 8.4.8, (QPS with gcc / QPS with clang)
    * ma - for MariaDB 11.8.6, (QPS with gcc / QPS with clang)

    -- point queries
    pg      my      ma
    1.02    1.00    0.99    hot-points
    1.02    0.98    1.02    point-query
    0.95    1.01    1.02    points-covered-pk
    0.96    1.02    1.02    points-covered-si
    0.97    1.00    1.02    points-notcovered-pk
    0.96    1.03    1.02    points-notcovered-si
    0.97    1.01    1.01    random-points_range=1000
    0.98    1.01    1.01    random-points_range=100
    1.00    0.99    1.00    random-points_range=10

    -- range queries without aggregation
    pg      my      ma
    1.01    0.98    1.03    range-covered-pk
    1.00    0.98    1.05    range-covered-si
    0.99    0.98    1.04    range-notcovered-pk
    0.99    1.02    0.97    range-notcovered-si
    1.02    1.06    1.03    scan

    -- range queries with aggregation
    pg      my      ma
    1.01    0.96    1.05    read-only-count
    0.99    0.99    1.01    read-only-distinct
    0.99    1.00    1.00    read-only-order
    0.99    1.00    1.01    read-only_range=10000
    1.00    0.98    1.00    read-only_range=100
    1.01    0.97    1.00    read-only_range=10
    0.99    0.97    1.03    read-only-simple
    1.02    0.98    1.02    read-only-sum

    -- writes
    pg      my      ma
    1.03    0.98    1.00    delete
    1.01    1.00    1.00    insert
    1.00    0.98    1.00    read-write_range=100
    1.00    0.98    1.00    read-write_range=10
    0.99    1.01    0.97    update-index
    0.96    1.01    0.99    update-inlist
    0.99    0.99    0.99    update-nonindex
    1.02    0.98    0.99    update-one
    0.98    0.98    0.99    update-zipf
    1.00    0.99    0.99    write-only

    Graceful degradation in Postgres

    Not all traffic is created equal.When a database is overwhelmed, you want the important queries to keep executing, even if that means shedding lower-priority work.This is a much better outcome than the alternative: a total database outage.

    March 30, 2026

    Read Concern "snapshot" for snapshot isolation outside explicit transactions

    TL;DR: I a previous post I explained Why isn't "majority" the default read concern in MongoDB. If you’re used to SQL databases, you’ll likely prefer the snapshot read concern, which guarantees all rows are read from the same snapshot. Alternatively, run your operations in an explicit transaction, as transactions are ACID.

    With majority, you avoid dirty or uncommitted reads because you only see data acknowledged by a majority. However, a scan can yield while the majority state advances, so some rows may come from a newer majority snapshot.

    By default, MongoDB’s consistency boundary is the document—similar to an aggregate in domain‑driven design. For multi‑document consistency, use explicit transactions or the snapshot read concern. This is MongoDB’s tunable consistency model, whose defaults often suit event‑driven architectures more than traditional SQL workloads.

    To illustrate this, here’s a demo.

    I start by setting up the collection, the document count, and other parameters:

    
    const NUM_ACCOUNTS = 1_000_000;
    const INITIAL_BALANCE = 10000;  // cents
    const BATCH_SIZE = 10_000;
    const TOTAL_TRANSFERS = 100_000;
    const NUM_WRITERS = 5;
    const mydb = db.getSiblingDB("demo");
    const coll = mydb.accounts;
    
    

    I load one million accounts, all with the same amount:

    
    for (let i = 0; i < NUM_ACCOUNTS; i += BATCH_SIZE) {
        const batch = [];
        for (let j = i; j < Math.min(i + BATCH_SIZE, NUM_ACCOUNTS); j++) {
            batch.push({ _id: j, balance: INITIAL_BALANCE });
        }
        coll.insertMany(batch, { ordered: false });
        if (i % 100_000 === 0) print("  " + (i / 1000) + "k...");
    }
    print("✅ Loaded " + NUM_ACCOUNTS.toLocaleString() + " accounts\n");
    
    

    This function checks the total balance every second:

    
    async function periodicReader(readConcern) {
        while (readingActive) {
            const result = coll.aggregate([
                { $group: { _id: null, total: { $sum: "$balance" } } }
            ], { readConcern: { level: readConcern } }
                                       // -> majority or snapshot
            ).toArray();
            const total = result.length ? result[0].total : "N/A";
            print("  📊 Aggregate read — total balance: " + total +
                  " | transfers so far: " + transfersDone);
            // Wait ~1 second before next read
            await new Promise(r => setTimeout(r, 1000));
        }
    }
    // run it:
    let transfersDone = 0;
    let readingActive = true;
    const readerPromise = periodicReader("majority");
    
    

    With no writes and one million accounts initialized with ten thousand each, the total balance is 10,000,000,000:

      📊 Aggregate read — total balance: 10000000000 | transfers so far: 0
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 0
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 0
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 0
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 0
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 0
    

    I perform random account-to-account transfers that debit one account and credit another for the same amount in a single transaction, so the total balance stays the same, and run it in five threads:

    
    print("⏳ Launching " + NUM_WRITERS + " writers + 1 reader...\n");
    let writingDone = false;
    let transfersDone = 0;
    async function writer(id, count) {
        const s = mydb.getMongo().startSession();
        const sc = s.getDatabase("demo").accounts;
        for (let t = 0; t < count; t++) {
            // pick two random accounts
            const from = Math.floor(Math.random() * NUM_ACCOUNTS);
            let to =     Math.floor(Math.random() * NUM_ACCOUNTS);
            while (to === from) to = Math.floor(Math.random() * NUM_ACCOUNTS);
            // same amount to debit on one, credit on the other
            const amount = Math.floor(Math.random() * 1000);
            // do that in a transaction
            try {
                s.startTransaction();
                sc.updateOne({ _id: from }, { $inc: { balance: -amount } });
                sc.updateOne({ _id: to },   { $inc: { balance:  amount } });
                s.commitTransaction();
                transfersDone++;
            } catch (e) {
                try { s.abortTransaction(); } catch (_) {}
            }
        }
        s.endSession();
        print("  ✅ Writer " + id + " done");
    }
    // Run in threads:
    const writerPromises = [];
    for (let i = 0; i < NUM_WRITERS; i++) {
        writerPromises.push(writer(i + 1, TOTAL_TRANSFERS / NUM_WRITERS));
    }
    
    

    The reader thread is still running, but it now reports an inconsistent total balance due to the "majority" read concern:

      📊 Aggregate read — total balance: 10000002595 | transfers so far: 634
      📊 Aggregate read — total balance: 10000003902 | transfers so far: 1177
      📊 Aggregate read — total balance: 9999999180 | transfers so far: 1742
      📊 Aggregate read — total balance: 10000002564 | transfers so far: 2325
      📊 Aggregate read — total balance: 9999995030 | transfers so far: 2900
      📊 Aggregate read — total balance: 10000001154 | transfers so far: 3462
      📊 Aggregate read — total balance: 9999996910 | transfers so far: 4029
      📊 Aggregate read — total balance: 9999992085 | transfers so far: 4655
      📊 Aggregate read — total balance: 9999995372 | transfers so far: 5215
      📊 Aggregate read — total balance: 9999999916 | transfers so far: 5792
      📊 Aggregate read — total balance: 9999998316 | transfers so far: 6396
      📊 Aggregate read — total balance: 9999997128 | transfers so far: 6976
      📊 Aggregate read — total balance: 10000006447 | transfers so far: 7516
      📊 Aggregate read — total balance: 9999998330 | transfers so far: 8091
      📊 Aggregate read — total balance: 10000001286 | transfers so far: 8656
      📊 Aggregate read — total balance: 10000001899 | transfers so far: 9240
      📊 Aggregate read — total balance: 9999996708 | transfers so far: 9845
      📊 Aggregate read — total balance: 10000005159 | transfers so far: 10444
      📊 Aggregate read — total balance: 10000002749 | transfers so far: 11012
      📊 Aggregate read — total balance: 9999999925 | transfers so far: 11623
    

    If you’re coming from SQL databases, it may be surprising: in SQL, every statement runs in an explicit transaction with a defined isolation level. MongoDB instead offers several consistency boundaries:

    • Document-level consistency by default
    • Statement-level consistency with the snapshot read concern
    • Transaction-level consistency with explicit transactions

    I’ve been using the default document-level consistency, so the total was inconsistent, and I’ll now show the other levels that provide stronger read-time consistency.

    I stop the reading thread:

    readingActive = false;
    

    Then I restart it with a snapshot read concern:

    let readingActive = true;
    const readerPromise = periodicReader("snapshot");
    

    Now the results are consistent: the total balance stays the same while money is being transferred between accounts.

      📊 Aggregate read — total balance: 10000000000 | transfers so far: 92845
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 93439
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 94022
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 94590
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 95161
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 95737
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 96307
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 96835
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 97353
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 97920
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 98478
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 99038
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 99621
      ✅ Writer 3 done
      ✅ Writer 1 done
      ✅ Writer 4 done
      ✅ Writer 2 done
      ✅ Writer 5 done
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 99999
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 99999
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 99999
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 99999
      📊 Aggregate read — total balance: 10000000000 | transfers so far: 99999
    

    This doesn't lock anything because it uses WiredTiger Multi-Version Concurrency Control (MVCC) and may even be faster as it doesn't have to wait for the majority committed snapshot that may be delayed if secondaries are not available.

    I start the writes again, as they ended, without starting the reader thread as I'll run it manually:

    let readingActive = false;
    for (let i = 0; i < NUM_WRITERS; i++) {
        writerPromises.push(writer(i + 1, TOTAL_TRANSFERS / NUM_WRITERS));
    }
    
    

    I run a manual aggregate with the default read concern, and it shows an inconsistent balance:

    const s = mydb.getMongo().startSession();
    const sc = s.getDatabase("demo").accounts;
    sc.aggregate([
            { $group: { _id: null, total: { $sum: "$balance" } } }
            ] 
    )
    
    test> sc.aggregate([
    ...         { $group: { _id: null, total: { $sum: "$balance" } } }
    ...         ] 
    ... )
    [ { _id: null, total: 9999997810 } ]
    
    

    Instead of specifying "snapshot" read concern, I run it in a transaction:

    s.startTransaction();
    sc.aggregate([
            { $group: { _id: null, total: { $sum: "$balance" } } }
            ] 
    )
    s.commitTransaction();
    
    test> s.startTransaction();
    
    test> sc.aggregate([
    ...         { $group: { _id: null, total: { $sum: "$balance" } } }
    ...         ] 
    ... )
    [ { _id: null, total: 10000000000 } ]
    test> s.commitTransaction();
    

    As you would expect, transactions are ACID and consistent. It's another way to get snapshot isolation.

    Before giving more explanation I stop all:

    (async () => {
    await Promise.all(writerPromises);
    readingActive = false;
    await readerPromise;
    })();
    
    

    All databases provide different levels of consistency, and “consistency” actually means two things:

    • Snapshot consistency: all data is read from a single commit point in time. The result is then stale because the read timestamp is fixed at the start of the statement (e.g., SQL Read Committed) or at the start of the transaction (e.g., SQL Repeatable Read). This is the consistency as defined by the ACID properties.

    • Read-your-writes consistency: you see the latest committed state, and the read time can advance during the scan to reduce staleness. This is the consistency as defined by the CAP theorem.

    SQL databases don’t always use the first model. Isolation levels guarantee that you see committed changes, but not that all rows were committed at the same instant. MVCC databases usually provide snapshot consistency by setting a read timestamp in the past, but some may re-read rows later if they change and are about to be updated, to avoid restarting the whole query at a later point in time.

    To validate a total balance, you need a consistent snapshot. In some other cases, reading the latest values is preferable to using a stale snapshot. Neither approach is always best.

    In the SQL bubble, inconsistent snapshots often seem unacceptable, whereas in event-driven NoSQL systems, reading the latest writes makes more sense. Any database can work well if you understand it, and poorly if you ignore its behavior under race conditions.

    Non-MVCC databases can return the latest writes within a single, consistent state but require heavy locking to do so. In contrast, MVCC databases avoid blocking reads, but must choose between a stable, timeline-consistent snapshot that may be stale or a latest-write view whose read times keep moving.

    For SQL databases, you need to understand the isolation levels:

    • Per-statement read time in Read Committed Snapshot Isolation.
    • Per-transaction read time in Snapshot Isolation

    For MongoDB, you should understand the consistency boundaries:

    • Document-level consistency by default
    • Statement-level consistency with the "snapshot" read concern
    • Transaction-level consistency with explicit transactions

    High memory usage in Postgres is good, actually

    A high memory percentage in PlanetScale Postgres is not necessarily a problem. Let's compare how memory and CPU usage are different, how not all memory usage is created equal, and which signals actually require attention.

    March 29, 2026

    The insert benchmark on a small server : Postgres 12.22 through 18.3

    This has results for Postgres versions 12.22 through 18.3 with the Insert Benchmark on a small server. My previous post for the same hardware with results up to Postgres 18.1 is here. This post also has results for:

    • all 17.x releases from 17.0 through 17.9 
    • 18.2 with and without full page writes enabled
    • both 1 and 4 users

    Postgres continues to be boring in a good way. It is hard to find performance regressions. Performance wasn't always stable, but I am reluctant to expect it to show no changes because there are sources of variance beyond the DBMS, especially HW (a too-hot SSD or CPU will run slower). Sometimes perf changes because there are obvious perf bugs, sometimes it changes for other reasons.

     tl;dr for a  CPU-bound workload

    • performance is stable from Postgres 12 through 18
    • performance is stable from Postgres 17.0 through 17.9
    • disabling full-page writes improves throughput on write-heavy benchmark steps
    tl;dr for an IO-bound workload
    • performance is mostly stable from Postgres 12 through 18
    • performance is stable from Postgres 17.0 through 17.9
    • disabling full-page writes improves throughput on write-heavy benchmark steps
    • in a few cases there are large improvements to point-query throughput on the qp1000 benchmark step. I will try to explain that soon.
    Builds, configuration and hardware

    I compiled Postgres from source using -O2 -fno-omit-frame-pointer for versions 12.22, 13.23, 14.22, 15.17, 16.13, 17.0 to 17.9, 18.2 and 18.3.

    The server is an Beelink SER7 with a Ryzen 7 7840HS CPU with 8 cores and AMD SMT disabled, 32G of RAM. Storage is one SSD for the OS and an NVMe SSD for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04.

    For versions prior to 18, the config file is named conf.diff.cx10a_c8r32 and they are as similar as possible and here for versions 1213141516 and 17.

    For Postgres 18 in most cases I used a config named conf.diff.cx10b_c8r32 (aka cx10b) which is as similar as possible to the configs for versions 17 and earlier. But for tests with full-page writes disabled I used additional configs to compare with results from the cx10b config.
    The Benchmark

    The benchmark is explained here and is run with 1 client 4 clients. In each case each client uses a separate table. I repeated it with two workloads:
    • CPU-bound
      • for 1 user the values for X, Y, Z are 30M, 40M, 10M
      • for 4 users the values for X, Y, Z are 10M, 16M, 4M
    • IO-bound
      • for 1 user the values for X, Y, Z are 800M, 4M, 1M
      • for 4 users the values for X, Y, Z are 200M, 4M, 1M
    The point query (qp100, qp500, qp1000) and range query (qr100, qr500, qr1000) steps are run for 1800 seconds each.

    The benchmark steps are:

    • l.i0
      • insert X rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts Y rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions) and Z rows are inserted and deleted per table.
      • Wait for S seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
    • qr100
      • use 3 connections/client. One does range queries and performance is reported for this. The second does does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
    Results: overview

    The performance reports are here for:
    The summary sections from the performances report have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

    Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version. The base version is Postgres 12.22 for the latest point releases comparison, 17.0 for the 17.x releases comparison and 18.2 with the cx10b config for the full-page writes comparison. 

    When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
    • insert/s for l.i0, l.i1, l.i2
    • indexed rows/s for l.x
    • range queries/s for qr100, qr500, qr1000
    • point queries/s for qp100, qp500, qp1000
    This statement doesn't apply to this blog post, but I keep it here for copy/paste into future posts. Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

    Results: CPU-bound

    The performance summaries are here for:
    For latest point releases at 1 user
    • there is either no change or a small improvement for l.i0 (load in PK order), l.x (create indexes) and the read-write tests (qr*, qp*).
    • for l.i1 and l.i2 (random write-only) throughput drops by 5% to 10% from 12.22 to 13.23 and has been stable since then (throughput in 18.2 is similar to 13.23. The CPU per operation overhead (cpupq here) increases after 12.22 for the l.i2 step but there wasn't an obvious increase for the l.i1 step - but the way I measure this is far from perfect. The results I share here are worse than what I measured in December 2025.
    For latest point releases at 4 users
    • there might be a small (3%) regression for l.i0 (load in PK order) in 18.2 vs 12.22. Perhaps this is noise. From vmstat and iostat metrics there aren't obvious changes.
    • throughput in 18.2 is better than 12.22 for all other benchmark steps
    For all 17.x releases at 1 user
    • throughput is stable from 17.0 to 17.9 for all benchmark steps except l.i1 and l.i2 (random writes) where there might be a 5% regression late in 17.x. This might be from new CPU overhead - see cpupq here.
    For all 17.x releases at 4 users
    • throughput is stable with small improvements from 17.0 to 17.9
    For full-page writes at 1 user
    • throughput improves by ~5% for l.i1 and l.i2 (random writes) when full-page writes are disabled  and KB written to storage per commit drops by ~20% -- see wkbpi here.
    • enabling wal_compression=lz4 decreases write throughput for all write-heavy steps when full-page writes are enabled. The impact is smaller when full page writes are disabled.
    For full-page writes at 4 users
    • throughput improves by <= 5% for all write-heavy steps when full-page writes are disabled
    • the impact from wal_compression=lz4 isn't obvious
    Results: IO-bound

    The performance summaries are here for:
    For latest point releases at 1 user
    • there are small (<= 10%) improvements for l.i0 (load in PK order) and l.x (create index). I don't see anything obvious in vmstat and iostat metrics to explain this.
    • there are small (<= 10%) regressions for l.i1 and l.i2 (random writes) that might be from a sequence of small regressions from 13.x through 18.x. I don't see anything obvious in vmstat and iostat metrics to explain this.
    • throughput is unchanged for the range-query read+write tests (qr*)
    • throughput improves by ~1.4X for the point-query read+write tests (qp*). This improvement arrived in 13.x. This can be explained by large drops in CPU overhead (cpupq) and context switch rates (cspq) -- see here.
    • the results here are similar to what I measured in December 2025
    For latest point releases at 4 users
    • there are small (~10%) regressions for l.i0 (load in PK order) that arrived in 17.x. The context switch rate (cspq) increases in 17.x
    • there are small (<= 20%) improvements for l.x (create index) that arrived in 13.x
    • there are large regressions for l.i1 and l.i2 (random writes) that arrive in 15.x through 18.x. There are large increases in CPU overhead (cpupq) -- see here.
    • throughput is unchanged for the range-query read+write tests (qr*)
    • throughput improves for the point-query read+write tests (qp*) at higher write rates (qp500, qp1000).
    For all 17.x releases at 1 user
    • throughput is stable with a few exceptions
    • for qp1000 (point-query, read+write) it improves by ~5% in 17.1 and is then stable to 17.9
    • in 17.9 there are large (~1.4x) improvements for all of the point-query, read+write tests
    • the changes in throughput for qp1000 might be explained by a small drop in CPU overhead per query (cpupq) that arrived in 17.1 and a large drop that arrived in 17.9 -- see here.
    For all 17.x releases at 4 users
    • throughput for most steps (l.i0, l.x, qr*, qp100, qp500) is stable
    • throughput for l.i1 and l.i2 (random writes) has more variance
    • throughput for qp1000 drops by up to 10% from 17.3 through 17.8 and in those cases the CPU overhead increased -- see cpupq here.
    For full-page writes at 1 user
    • throughput improves by 6% for l.i1 (random writes) when full-page writes are disabled
    • throughput improved for qp* tests when either full-page writes were disabled or lz4 was used for log_compression. That is harder to explain, perhaps it is noise.
    For full-page writes at 4 users
    • throughput improves by 20% for l.i1 (random writes) when full-page writes are disabled
    • throughput improved for qp* tests when either full-page writes were disabled or lz4 was used for log_compression. That is harder to explain, perhaps it is noise.

    n_dead_tup vs n_live_tup

    The tables below show the ratio: n_dead_tup / (n_dead_tup + n_live_tup) for the CPU-bound and IO-bound workloads using 1 user (and one table). These were measured at the end of each benchmark step.

    CPU-bound
            12.22   18.3
    l.i0    0.000   0.000
    l.x     0.000   0.000
    l.i1    0.065   0.035
    l.i2    0.045   0.020
    qr100   0.006   0.006
    qp100   0.012   0.012
    qr500   0.040   0.040
    qp500   0.021   0.024
    qr1000  0.031   0.036
    qp1000  0.040   0.003

    IO-bound
            12.22   18.3
    l.i0    0.000   0.000
    l.x     0.000   0.000
    l.i1    0.005   0.005
    l.i2    0.006   0.006
    qr100   0.000   0.000
    qp100   0.000   0.000
    qr500   0.002   0.002
    qp500   0.003   0.003
    qr1000  0.005   0.005
    qp1000  0.007   0.007

















     

    SQL Assertions, ANSI join, and ORA-08697

    In the previous post on consistency boundaries, we saw that an updatable join view can hide a write-skew anomaly when the developer assumes the consistency boundary is a single row, even though there are actually two underlying rows and only one is locked. I wanted to see how the new SQL Assertions introduced in "Oracle AI Database 26ai Release 23.26.1.0.0" handle this, since they use a different locking mechanism.

    TL;DR: they handle it correctly, but without ANSI joins.

    Without assertions, for an employee with a salary of 1000 and a commission of 100, two concurrent users could each add 42 to both values because they couldn’t see each other’s changes, and updating one column didn’t lock the other. I then added a SQL assertion that doesn’t allow the commission to exceed 1150, so that only one session can add 42. This assertion works as expected, waiting on the other transaction before checking the sum:

    CREATE ASSERTION salary_plus_commission_le_1150
    CHECK (
      NOT EXISTS (
        SELECT 'violation'
        FROM emp_salary s, emp_commission c
        WHERE s.empno = c.empno
          AND s.salary + c.commission > 1150
      )
    );
    

    When the first transaction committed, the second transaction detected the write skew:

    ERROR at line 1:
    ORA-08601: SQL assertion (FRANCK.SALARY_PLUS_COMMISSION_LE_1150) violated.
    
    

    I'm writing this blog post because I had trouble finding the correct declaration for this assertion, and I hope it helps others. My first attempt was simply:

    CREATE ASSERTION salary_plus_commission_le_1150
    CHECK (
      NOT EXISTS (
        SELECT 'violation'
        FROM emp_salary s join emp_commission c USING (empno)
        WHERE s.salary + c.commission > 1150
      )
    );
    

    This joins the two tables and raises a violation is the sum of salary and commission is higher than 1150, but the CREATE ASSERTION failed with:

    ORA-08689: CREATE ASSERTION failed
    ORA-08697: SYS owned tables are not supported.
    
    

    I'm not using SYS, and cannot use SYS because I'm on the managed service "Autonomous Database" which doesn't give SYSDBA privileges.

    I reproduced this in a local environment to trace the SQL statements (ALTER SESSION SET EVENTS 'sql_trace bind=false, wait=false') and found that Oracle internally checked this constraint with:

    /* SQL Analyze(250,0) */
    SELECT /*+ ALL_ROWS BYPASS_RECURSIVE_CHECK */
    1 FROM "SYS"."DUAL" WHERE (
      NOT EXISTS (
        SELECT 1
        FROM emp_salary s
        JOIN emp_commission c ON s.empno = c.empno
        WHERE s.salary + c.commission > 1150
      )
    )
    

    The good old DUAL table is owned by SYS and cannot be used in the assertion.

    After trying many variations, I realized that SQL assertions are always evaluated with a SELECT from "SYS"."DUAL", even for cases that work, like in my previous post on SQL Assertions.

    Then, I remembered what I’d learned from 30 years of working with Oracle Databases: if something fails with an ANSI join in Oracle, try the legacy join syntax (no JOIN clauses, only WHERE conditions). When ANSI joins were added in 9i, Oracle implemented them with internal transformations that affected other features. Most of those issues were fixed over the 20 years that followed, but SQL assertions seem to have fallen into that trap.

    With the legacy join syntax, the internal query still reads "SYS"."DUAL", but it works:

    =====================
    PARSING IN CURSOR #140737366953320 len=208 dep=0 uid=136 oct=290 lid=136 tim=6202261308 hv=3501011950 ad='7ffff8be4960'
     sqlid='fzcfcfv8auczf'
    CREATE ASSERTION salary_plus_commission_le_1150
    CHECK (
      NOT EXISTS (
        SELECT 'violation'
        FROM emp_salary s, emp_commission c
        WHERE s.empno = c.empno
          AND s.salary + c.commission > 1150
      )
    )
    END OF STMT
    
    ...
    
    =====================
    PARSING IN CURSOR #140737364327008 len=286 dep=1 uid=136 oct=3 lid=0 tim=6202334328 hv=225567679 ad='783b9f10' sqlid='fv1va6n6r3sxz'
     /* SQL_ASSERTION obj#=72600 */ 
    SELECT /*+  ALL_ROWS BYPASS_RECURSIVE_CHECK */ 1 
    FROM "SYS"."DUAL" 
    WHERE (   NOT EXISTS (SELECT 'violation' "'VIOLATION'" FROM "FRANCK"."EMP_SALARY" "S","FRANCK"."EMP_COMMISSION" "C" WHERE "S"."EMPNO"="C"."EMPNO" AND "S"."SALARY"+"C"."COMMISSION">1150) )
    END OF STMT
    

    Two things to remember:

    • SQL Assertions provide correct consistency boundaries to avoid write skew in the absence of a serializable isolation level.
    • ANSI joins are nice, but Oracle was designed before this standard, with another syntax, and some bugs persist. Falling back to the old syntax may be safer.

    Consistency boundaries in SQL databases vs. MongoDB

    Two common myths surround database consistency:

    • Myth 1: Only SQL databases provide strong consistency. In fact, MongoDB—a NoSQL database—supports ACID transactions with strong consistency, even in sharded clusters, with clear serializable boundaries: the document.
    • Myth 2: SQL consistency is simple and fully abstracted, so you can treat views like tables and run transactions as if race-condition anomalies never occur. In reality, no RDBMS fully achieves this ideal, so you must understand each system’s consistency boundaries (column?, scanned rows?, returned rowset?, range?, partition?, table?).

    With MongoDB, the consistency boundaries follow domain driven design aggregates: concurrent writes to the same document conflict, even if they effectively read or write different fields. This guarantees that the database cannot break the business invariants set by the application. I've detailed an example in Serializable Transactions in MongoDB: The Doctor's On-Call Shift example. Document-level operations provide stronger integrity constraints, since schema validation can cover the entire aggregate, including one-to-many relationships. SQL databases need assertions to provide the same guarantees, but these are rarely implemented in RDBMSs, so only a limited subset of integrity constraints are available, like foreign keys.

    Relational databases and SQL are meant to hide locking internals from developers, and full support for the serializable isolation level would make this feasible. In practice, though, serializable is often incomplete or hurts scalability, so developers use weaker isolation levels and must consider what is locked. They commonly add explicit locking, like SELECT FOR UPDATE, to prevent race-condition anomalies.

    Before looking at an example, ask yourself whether you understand the consistency boundaries of your databases. Updating a single column often locks the entire row of that column. Some databases lock at a finer granularity (for example, YugabyteDB’s column-level locking), while others may lock more via lock escalation.

    Normalization also affects this by spreading a logical aggregate—which in Domain-Driven Design corresponds to a consistency boundary—across multiple tables and rows. If you run a SELECT ... FOR UPDATE with a JOIN, do you expect all the tables to be locked? The FOR UPDATE clause’s OF option lets PostgreSQL specify tables and Oracle Database specify columns, but in both cases, the underlying effect is to lock all the rows involved.

    Defaults may be inconsistent. SELECT FOR UPDATE without OF locks all joined rows, but UPDATE on a join behaves differently as we will demonstrate below. Column-level locking also raises another issue without joins: check constraints must see a consistent state, even for columns that were not updated (see the now-fixed YugabyteDB issue: https://github.com/yugabyte/yugabyte-db/issues/15196).

    SQL is an abstraction where the tables referenced in queries can also be views. In Oracle Database, join views can be updatable, which raises an important question about consistency: when you update a single column from one underlying table through a join view, should locks be taken on both underlying tables, or only on the one that’s actually being updated?

    Demonstration

    I'll use Oracle Database for this demonstration because it’s the only one I know that supports updates on joins. Those who say “this wouldn’t happen on PostgreSQL” are right that PostgreSQL has a true serializable isolation level, but updatable views are more limited and would not allow to build such an example. There’s no single best database—only different ones with different trade-offs. Expertise means understanding those differences.

    Here is an example. I create two tables to store the employees salary and commission separately, and build a view on top of them to expose it as one:

    create table emp_salary ( 
      empno      number primary key,
      salary     number
    );
    
    create table emp_commission ( 
      empno      number primary key,
      commission number
    );
    
    insert into emp_salary     values (99, 1000);
    insert into emp_commission values (99,  100);
    commit;
    
    create view emp as 
     select * from emp_salary 
     join emp_commission 
     using ( empno )
    ;
    

    This view is updateable and exposes all information per employee:

    21:44:07 SQL> select * from emp;
    
       EMPNO    SALARY    COMMISSION
    ________ _________ _____________
          99     1,000           100
    
    

    Now, an instruction is sent to the Human Resources department to add 42 to this employee, and the HR agent can choose to add it either to the salary or to the commission.

    Alice, in HR, starts a transaction (using the serializable isolation level, which does not change the behavior here because in Oracle it only achieves Snapshot Isolation). She checks the employee’s salary and commission and decides to add 42 to the commission. As an extra safety measure, to ensure the operation is idempotent, she adds a condition on the salary and commission so that the update occurs only if neither has already been increased:

    21:43:58 SQL> set transaction isolation level serializable;
    
    Transaction ISOLATION succeeded.
    
    21:44:07 SQL> select * from emp;
    
       EMPNO    SALARY    COMMISSION
    ________ _________ _____________
          99     1,000           100
    
    21:44:11 SQL> update emp
      2   set commission=commission+42
      3   where empno=99
      4   and salary=1000 and commission=100
      5* ;
    
    1 row updated.
    
    21:44:19 SQL>
    
    

    Alice didn't commit yet, and her colleague, Bob, in another office got the same instruction. He does the same, but decides to add 42 to the salary:

    21:43:58 SQL> set transaction isolation level serializable;
    
    Transaction ISOLATION succeeded.
    
    21:44:26 SQL> select * from emp;
    
       EMPNO    SALARY    COMMISSION
    ________ _________ _____________
          99     1,000           100
    
    21:44:29 SQL> update emp
      2   set salary=salary+42
      3   where empno=99
      4   and salary=1000 and commission=100
      5* ;
    
    1 row updated.
    
    21:44:37 SQL>
    
    

    As both check the current state (salary=1000 and commission=100) only one should actually update a row. However, nothing blocks, and nothing fails. Alice is able to commit:

    21:44:19 SQL> commit;
    
    Commit complete.
    
    21:44:45 SQL> select * from emp;
    
       EMPNO    SALARY    COMMISSION
    ________ _________ _____________
          99     1,000           142
    
    

    Bob is also allowed to commit:

    21:44:37 SQL> commit;
    
    Commit complete.
    
    21:44:50 SQL> select * from emp;
    
       EMPNO    SALARY    COMMISSION
    ________ _________ _____________
          99     1,042           142
    
    

    The employee incorrectly received a raise of 42 in both salary and commission because the database failed to prevent a race condition.

    Explanation

    This is a Multi-Version Concurrency Control (MVCC) anomaly: the WHERE clause was evaluated on a consistent snapshot, stale, but the UPDATE was applied to the current state, and the resulting conflict went undetected. You can reproduce this in db<>fiddle by using an autonomous transaction to simulate a second user.

    Neither Bob or Alice are responsible for this error because both of them checked the initial salary and commission in the same statement that updated one of them. It's the database that lied to them, or rather didn't expose that it was not one consistnt state. Users are not supposed to know that they query a view where data is physically stored in different tables and that the database locked only one of the two rows that appeared as one after the join.

    Solutions

    Write to lock. For safety, users should update both columns, even when values do not change, so that write conflicts protect against concurrent updates. Hibernate does this by default, updating all mapped columns. It is less efficient, but this consistency boundary is one reason dynamic update is not the default. Object–Relational Mappers (ORMs) and document databases such as MongoDB not only map domain model entities but also help ensure their consistency when those entities are mapped to a relational model that splits aggregates behind the abstraction layer.

    Explicit locking. Another option is to add FOR UPDATE to a first SELECT in the transaction. Without an OF clause, it locks all rows in the join. Bob’s session then waits, sees Alice’s changes, and skips its own update. However, unless this is wrapped in a stored procedure, it adds an extra database round trip and holds locks across context switches and network calls, which harms scalability.

    Avoid abstraction. A third solution is to avoid using views on joins, as they create an ambiguous consistency boundary where the user can’t tell what is locked. Still, the user could update on a join and experience the same:

    update (
      select * from emp_salary join emp_commission using ( empno )
    ) set salary=salary+42
      where empno=99 and salary=1000 and commission=100
    ;
    

    You might then ask how JSON-relational Duality views work, since they are also updatable views over multiple tables. The following example is similar to a join:

    CREATE OR REPLACE JSON RELATIONAL DUALITY VIEW emp_dv AS
      emp_salary @insert @update @delete
      {
        _id      : empno,
        salary     : salary @update,
        commission : emp_commission @update
        @link ( from: empno, to: empno ) {
         empno : empno
         commission: commission
        }
      }
    /
    

    If I try to update different fields concurrently, the conflict is detected:

    UPDATE emp_dv
      SET data = JSON_TRANSFORM(data, 
                 SET '$.commission.commission' = 10)
    ;
    
    declare
     -- using autonomous transaction to simulate a concurrent transaction 
     pragma autonomous_transaction;
    begin
     UPDATE emp_dv
      SET data = JSON_TRANSFORM(data, 
                 SET '$.salary' = 100)
    ;
     commit;
    end;
    /
    
    declare
    *
    ERROR at line 1:
    ORA-40896: Cannot update JSON Relational Duality View 'EMP_DV': Concurrent modification detected to document with ID 'FB03C16400'.
    ORA-06512: at line 5
    Help: https://docs.oracle.com/error-help/db/ora-40896/
    
    

    It fails, so you must implement retry logic (MongoDB would automatically retry here). JSON-Relational Duality Views use ETAG to detect conflicts on the same document rather than using row locks.

    You may also wonder how SQL assertions work, since they require consistency across multiple rows. They use a different locking mechanism to ensure a consistent state. I’ve included an example in the next post.

    Conclusion

    Relational databases aim to hide physical storage and locking so developers don’t have to manage them directly. But to achieve good performance, and avoid consistency anomalies, developers still need to understand locking boundaries. Not all RDBMSs support updatable join views (SQL-92 only specifies updatability for single-table views), though all provide explicit locking mechanisms such as SELECT FOR UPDATE. Row-store databases usually lock at the row level, but views may aggregate multiple rows without the developer knowing it.

    By contrast, MongoDB, as a NoSQL database, exposes a simpler model: the document is the consistency boundary, like the aggregate in Domain-Driven Design. Developers update documents, not logical views, and write consistency applies to the entire document rather than a portion of it.

    Selecting a character set for MySQL and MariaDB clients

     MySQL and MariaDB have many character-set related options, perhaps too many:

    1. character_set_client
    2. character_set_connection
    3. character_set_database
    4. character_set_filesystem
    5. character_set_results
    6. character_set_server
    7. character_set_system
    This is a topic that I don't know much about and I am still far from an expert. My focus has been other DBMS topics. But I spent time recently on this topic while explaining what looked like a performance regression, but really was just a new release of MySQL using a charset that is less CPU-efficient than the previous charset that was used.

    Debugging

    The intial sequence to understand what was going on was:
    1. mysql -e 'SHOW GLOBAL VARIABLES like "character_set_%"
    2. mysql -e 'SHOW SESSION VARIABLES like "character_set_%"
    3. run "SHOW SESSION VARIABLES" from my benchmark client
    Note:
    • the output from steps 1 and 2 was different
      • with SHOW GLOBAL VARIABLES I got character_set_client =latin1 but with SHOW SESSION VARIABLES I got character_set_client =utf8mb3. This happens. One reason is that some MySQL client binaries autodetect the charset based on the value of LANG or LC_TYPE from your Linux env. Another reason is that if autodetection isn't done then the clients can use the default charset that was set at compile time. That charset is then passed to the server during connection handshake (see thd_init_client_charset). So it is likely that character_set_client as displayed by SHOW GLOBAL VARIABLES isn't what your client will use.
    • the output from steps 2 and 3 was different
      • autodetection is only done when mysql_options() is called with a certain flag (see below). And that is not done by the MySQL driver in sysbench, nor is it done by Python's MySQLdb. So my benchmark clients are likely selecting the default charset and don't do autodetection. And that default is determined by the version of the MySQL client library, meaning that default can change over the years. For the source that implements this, search for MYSQL_AUTODETECT_CHARSET_NAME and read sql-common/client.c.
    The following enables autodetection and should be called before calling mysql_real_connect():
        mysql_options(..., 
                      MYSQL_SET_CHARSET_NAME,
                      MYSQL_AUTODETECT_CHARSET_NAME);

    Note that adding the following into my.cnf isn't a workaround for clients that don't do autodetect.
        [client]
        default-character-set=...

    Notes

    These are from my usage of MySQL 5.7.44, 8.0.45 and 8.4.8 along with MariaDB 10.6.25, 10.11.16 and 11.4.10. All were compiled from source as was sysbench. I installed MySQLdb and the MySQL client library via apt for Ubuntu 24.04.

    The values for character_set_client, character_set_results and character_set_connection were measured via the MySQL command-line client running SHOW GLOBAL VARIABLES and SHOW SESSION VARIABLES and then the benchmark clients running SHOW SESSION VARIABLES.

    The reason for sharing this is to explain the many possible values your session might use for character_set_client, character_set_results and character_set_connection. And using the wrong value might waste CPU.

    What per-session values are used for character_set_client|results|connection?
    * my.cnf has character_set_server=latin1
    * per SHOW GLOBAL VARIABLES each is set to =latin1
    * values below measured via SHOW SESSION VARIABLES
    Values for character_set_client|results|connection
    ... with "mysql" command line client
    ... this is easy to change with --default-character-set command line option or equivalent option in my.cnf
    dbms
    5.7.44 utf8
    8.0.45 utf8mb4
    8.4.8 utf8mb4
    10.6.25 utf8mb3
    10.11.16 utf8mb3
    11.4.10 utf8mb3
    Values for character_set_client|results|connection
    ... with sysbench

    client library version
    dbms 5.7 8.0 8.4 10.6 10.11 11.4
    5.7.44 latin1 latin1 latin1 NA NA NA
    8.0.45 latin1 utf8mb4 utf8mb4 NA NA NA
    8.4.8 latin1 utf8mb4 utf8mb4 NA NA NA
    10.6.25 latin1 latin1 latin1 utf8mb4 utf8mb4 utf8mb4
    10.11.16 latin1 latin1 latin1 utf8mb4 utf8mb4 utf8mb4
    11.4.10 latin1 utf8mb4 utf8mb4 utf8mb4 utf8mb4 utf8mb4
    Values for character_set_client|results|connection
    ... with insert benchmark (Python MySQLdb and /lib/x86_64-linux-gnu/libmysqlclient.so.21
    ... I am not what version is libmysqlclient.so.21, this is on Ubuntu 24.04

    dbms
    5.7.44 latin1
    8.0.45 utf8mb4
    8.4.8 utf8mb4
    10.6.25 latin1
    10.11.16 latin1
    11.4.10 utf8mb4