a curated list of database news from authoritative sources

March 25, 2026

Non-First Normal Form (¬1NF) and MongoDB: an alternative to 4NF to address 3NF anomalies

SQL databases are grounded in relational algebra, but they are not the only databases with a strong theoretical basis. MongoDB, designed as a document database to solve practical engineering problems and improve developer experience, also builds on theory. It supports non–first-normal-form schemas and extends relational operators to work with them. This foundation is described in a 1986 paper, published 23 years before MongoDB's first release: Theory of Non-First Normal Form Relational Databases

MongoDB's aggregation pipeline operators ($unwind, $group, $lookup, set operations) are concrete implementations of the paper's abstract algebraic operators.

I built the following examples while reading the paper, to illustrate its concepts in practice.

The Basic ¬1NF Data Model

The paper defines nested relations where attributes can be atomic or relation-valued:

This structure records facts about employees, such as their children, their skills, and the exams they passed.

This maps directly to a MongoDB document schema:

// This IS the paper's example, expressed as MongoDB documents
db.employees.insertMany([
  {
    ename: "Smith",
    Children: [
      { name: "Sam", dob: "2/10/84" },
      { name: "Sue", dob: "1/20/85" }
    ],
    Skills: [
      {
        type: "typing",
        Exams: [
          { year: 1984, city: "Atlanta" },
          { year: 1985, city: "Dallas" }
        ]
      },
      {
        type: "dictation",
        Exams: [
          { year: 1984, city: "Atlanta" }
        ]
      }
    ]
  },
  {
    ename: "Watson",
    Children: [
      { name: "Sam", dob: "3/12/78" }
    ],
    Skills: [
      {
        type: "filing",
        Exams: [
          { year: 1984, city: "Atlanta" },
          { year: 1975, city: "Austin" },
          { year: 1971, city: "Austin" }
        ]
      },
      {
        type: "typing",
        Exams: [
          { year: 1962, city: "Waco" }
        ]
      }
    ]
  }
])

MongoDB's document model is essentially what Roth called a "database scheme" — a collection of rules where attributes can be zero-order (scalar fields) or higher-order (embedded arrays of documents). The paper's Figure 3-1 is literally a MongoDB collection:

To developers using object-oriented languages, this model looks natural, but the paper dates from 1986, so its motivation lay elsewhere.

Motivation: ¬1NF as an alternative to 4NF decomposition

The paper's example (Figure 1-1) shows that an employee relation in 1NF requires 10 rows with massive redundancy:

employee | child | skill
-----------------------------
Smith    | Sam   | typing
Smith    | Sue   | typing
Smith    | Sam   | filing
Smith    | Sue   | filing
Jones    | Joe   | typing
Jones    | Mike  | typing
Jones    | Joe   | dictation
Jones    | Mike  | dictation
Jones    | Joe   | data entry
Jones    | Mike  | data entry

The ¬1NF version needs only 2 tuples. The paper notes three problems with 1NF:

  • Insert anomaly: Adding a child for Jones requires adding 3 tuples (one per skill)
  • Update anomaly: Changing Smith's "typing" to "word processing" requires updating multiple rows
  • Decomposition cost: The 1NF solution requires splitting into two tables and joining them back

Today, these anomalies are less of a problem: a single updateMany() or insertMany() operation in a MongoDB transaction is atomic, just like an UPDATE or INSERT in SQL databases. More importantly, the document model makes multi-document operations rare. The Cartesian explosion of redundant data, however, remains undesirable.

The first four normal forms (1NF, 2NF, 3NF, BCNF) do not address this, because there are no non-trivial functional dependencies in this relation: all three attributes together form the only candidate key. This relation is therefore already in BCNF. This is precisely why Fagin introduced 4NF in 1977.

This example illustrates a multivalued dependency (MVD) violation: children and skills are independent multivalued attributes of an employee, but they’re stored together. This creates a redundant Cartesian product (10 rows for 2 employees). Fourth Normal Form (4NF) addresses this by splitting the independent dependencies into separate tables to remove these anomalies, but at the cost of more tables touched per transaction and more joins per query.
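To make the multiplicative redundancy concrete, here is a minimal JavaScript sketch (sample data mirroring the paper's Figure 1-1) that counts the rows each representation needs:

```javascript
// Storing independent multivalued attributes together forces a
// Cartesian product per employee (the MVD redundancy).
const employees = [
  { employee: "Smith", children: ["Sam", "Sue"], skills: ["typing", "filing"] },
  { employee: "Jones", children: ["Joe", "Mike"], skills: ["typing", "dictation", "data entry"] }
];

// 1NF rows needed = sum over employees of |children| x |skills|
const rows1NF = employees.reduce(
  (n, e) => n + e.children.length * e.skills.length, 0
);

// The ¬1NF representation needs one tuple per employee
const rowsNF1 = employees.length;

console.log(rows1NF, rowsNF1); // 10 2
```

Adding a third skill for Jones would turn his 6 flat rows into 8, while the nested form still changes a single tuple.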

Even a Third Normal Form (3NF) schema can still suffer from insert and update anomalies and data redundancy. MongoDB’s document model can offer many of 4NF’s benefits without incurring the join overhead.

Here is the MongoDB equivalent:

// The ¬1NF representation — exactly what MongoDB does naturally
db.employees.insertMany([
  {
    employee: "Smith",
    Children: [{ child: "Sam" }, { child: "Sue" }],
    Skills: [{ skill: "typing" }, { skill: "filing" }]
  },
  {
    employee: "Jones",
    Children: [{ child: "Joe" }, { child: "Mike" }],
    Skills: [
      { skill: "typing" },
      { skill: "dictation" },
      { skill: "data entry" }
    ]
  }
])

Adding a new child for Jones is a single operation, with no anomaly:


db.employees.updateOne(
  { employee: "Jones" },
  { $push: { Children: { child: "Sara" } } }
)

Changing Smith's "typing" to "word processing" — one operation:


db.employees.updateOne(
  { employee: "Smith", "Skills.skill": "typing" },
  { $set: { "Skills.$.skill": "word processing" } }
)

Querying everything about an employee doesn't need a join:


db.employees.findOne({ employee: "Smith" })

MongoDB exists for a fundamental reason: its document model eliminates the need for decomposition and joins, as well as the update anomalies introduced by the 1NF constraint. Most workloads rely on single-document operations, which MongoDB physically optimizes as single-shard transactions, a single disk read or write, and a single replication operation. These inserts and updates lock the document, which serves as the consistency boundary, but modify only the fields whose values change and update only the corresponding index entries.

Nest and Unnest Operators

The paper defines nest (ν) to aggregate flat rows into nested structures, and unnest (μ) to flatten nested structures back:

  • ν_{B=(C,D)}(r): nest attributes C,D into a new nested relation B
  • μ_{B}(r): unnest nested relation B

MongoDB's aggregation pipeline has direct equivalents.

The paper's unnest (μ) operator is $unwind:

db.employees.aggregate([
  { $unwind: "$Skills" }
])

Each skill becomes its own document, duplicating employee info.

Deep unnest (μ* in the paper) is possible with multiple $unwind stages:

db.employees.aggregate([
  { $unwind: "$Children" },
  { $unwind: "$Skills" }
])

This produces the full 1NF Cartesian expansion, flattening both Children and Skills.
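Outside the database, $unwind behaves like a flatMap over the array field. A minimal JavaScript sketch (hypothetical helper and sample document) shows the expansion:

```javascript
// $unwind ≈ flatMap: each array element yields a copy of the parent
// document with the array field replaced by that single element
function unwind(docs, field) {
  return docs.flatMap(doc =>
    (doc[field] || []).map(el => ({ ...doc, [field]: el }))
  );
}

const docs = [
  { ename: "Smith",
    Children: [{ name: "Sam" }, { name: "Sue" }],
    Skills: [{ type: "typing" }, { type: "dictation" }] }
];

// Deep unnest: 2 children x 2 skills = 4 flat documents
const flat = unwind(unwind(docs, "Children"), "Skills");
console.log(flat.length); // 4
```

Unwinding Skills first and Children second yields the same four documents, which previews the order-independence result discussed next.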

When starting with a flat collection, the paper's nest operator (ν) is $group:

db.flat_employees.aggregate([
  {
    $group: {
      _id: "$employee",
      Children: { $addToSet: { child: "$child" } },
      Skills:   { $addToSet: { skill: "$skill" } }
    }
  }
])
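For intuition, the same nest (ν) can be sketched in plain JavaScript by grouping flat rows with a Map and deduplicating like $addToSet (helper and sample rows are hypothetical):

```javascript
// nest (ν): group flat rows by employee, collecting deduplicated
// sets of children and skills (like $group with $addToSet)
function nest(rows) {
  const byEmp = new Map();
  for (const r of rows) {
    if (!byEmp.has(r.employee)) {
      byEmp.set(r.employee, { employee: r.employee, Children: new Set(), Skills: new Set() });
    }
    const e = byEmp.get(r.employee);
    e.Children.add(r.child);
    e.Skills.add(r.skill);
  }
  return [...byEmp.values()].map(e => ({
    employee: e.employee,
    Children: [...e.Children],
    Skills: [...e.Skills]
  }));
}

const flatRows = [
  { employee: "Smith", child: "Sam", skill: "typing" },
  { employee: "Smith", child: "Sue", skill: "typing" },
  { employee: "Smith", child: "Sam", skill: "filing" },
  { employee: "Smith", child: "Sue", skill: "filing" }
];

// Four redundant 1NF rows collapse back into one nested tuple
console.log(nest(flatRows));
```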

The paper proves that the order of unnesting doesn't matter (Thomas & Fischer's result). In MongoDB, these produce the same fully-flat result:

db.employees.aggregate([
  { $unwind: "$Children" },
  { $unwind: "$Skills" }
])

db.employees.aggregate([
  { $unwind: "$Skills" },
  { $unwind: "$Children" }
])

The paper also notes that nest is not always an inverse of unnest in general, but it IS an inverse for Partitioned Normal Form (PNF) relations.

Partitioned Normal Form (PNF)

PNF requires that the atomic attributes of each relation (and each nested relation) form a key. The paper shows a pathological relation (Figure 1-3) that violates PNF:

Smith, the same employee, has two different skill sets, and Jones has a duplicate across sets.

In MongoDB, PNF corresponds to the fundamental design principle that _id determines the document. The pathological case above would mean having two documents for "Smith" with different Skills arrays — which is exactly what MongoDB's _id uniqueness constraint prevents at the collection level.

Without using _id as the employee identifier, we could have two employees with the same name:

db.bad_design.insertMany([
  { employee: "Smith", Skills: ["typing", "filing"] },
  { employee: "Smith", Skills: ["sorting", "mailing"] }
])

If we don't want to use _id as the identifier, the correct design enforces PNF with a unique index:

db.good_design.createIndex({ employee: 1 }, { unique: true })

db.good_design.insertOne({
  employee: "Smith",
  Skills: ["typing", "filing", "sorting", "mailing"]
})

Like many papers, this one uses the employee name as an identifier (two "Smith" tuples denote the same person), but obviously in real life a generated identifier identifies the physical person.

The paper, in theorem 5-1, proves that PNF is closed under unnesting. In MongoDB terms: if your documents are well-designed (one document per logical entity), then $unwind won't create ambiguous or inconsistent flat results.

For nested relations, PNF also applies recursively.

MongoDB schema validation can enforce the PNF constraint by comparing the size of an array with the size of the same array after duplicates are removed (using $setUnion):

db.createCollection("employees", {
  validator: {
    $and: [
      // JSON Schema for structure and types
      {
        $jsonSchema: {
          bsonType: "object",
          required: ["ename"],
          properties: {
            ename: { bsonType: "string" },
            Children: { bsonType: "array", items: { bsonType: "object", required: ["name"], properties: { name: { bsonType: "string" }, dob: { bsonType: "string" } } } },
            Skills: { bsonType: "array", items: { bsonType: "object", required: ["type"], properties: { type: { bsonType: "string" },
             Exams: { bsonType: "array", items: { bsonType: "object", required: ["year"], properties: { year: { bsonType: "int" }, city: { bsonType: "string" } } } } } } }
          }
        }
      },
      // PNF uniqueness constraints using $expr
      {
        $expr: {
          $and: [
            // Level 1: Children.name must be unique within the document
            {
              $eq: [
                { $size: { $ifNull: ["$Children.name", []] } },
                { $size: { $setUnion: [{ $ifNull: ["$Children.name", []] }] } }
              ]
            },
            // Level 1: Skills.type must be unique within the document
            {
              $eq: [
                { $size: { $ifNull: ["$Skills.type", []] } },
                { $size: { $setUnion: [{ $ifNull: ["$Skills.type", []] }] } }
              ]
            },
            // Level 2: Within EACH Skills element, Exams.year must be unique
            {
              $allElementsTrue: {
                $map: {
                  input: { $ifNull: ["$Skills", []] },
                  as: "skill",
                  in: { $eq: [ { $size: { $ifNull: ["$$skill.Exams.year", []] } }, { $size: { $setUnion: [{ $ifNull: ["$$skill.Exams.year", []] }] } } ] }
                }
              }
            }
          ]
        }
      }
    ]
  },
  validationLevel: "strict",
  validationAction: "error"
});
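The validator's size-comparison trick can be mirrored outside MongoDB. This JavaScript sketch (a hypothetical helper, not MongoDB's implementation) checks the same PNF uniqueness conditions at both nesting levels:

```javascript
// PNF key check: atomic key values must be unique within each array,
// i.e. array length == size after deduplication (the $setUnion trick)
const uniqueOn = (arr, key) =>
  arr.length === new Set(arr.map(e => e[key])).size;

function satisfiesPNF(doc) {
  return uniqueOn(doc.Children || [], "name")       // level 1: Children.name
      && uniqueOn(doc.Skills || [], "type")         // level 1: Skills.type
      && (doc.Skills || []).every(                  // level 2: Exams.year
           s => uniqueOn(s.Exams || [], "year"));
}

const good = {
  ename: "Smith",
  Children: [{ name: "Sam" }, { name: "Sue" }],
  Skills: [{ type: "typing", Exams: [{ year: 1984 }, { year: 1985 }] }]
};
const bad = { ename: "Watson", Children: [{ name: "Sam" }, { name: "Sam" }] };

console.log(satisfiesPNF(good), satisfiesPNF(bad)); // true false
```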

The fact that nest is not always the inverse of unnest is a well-known MongoDB pitfall:

db.employees.drop()  
db.employees.insertMany([  
  {  
    _id: 1,  
    employee: "Smith",  
    Children: [{ child: "Sam" }, { child: "Sue" }],  
    Skills: [{ skill: "typing" }]  
  },  
  {  
    _id: 2,  
    employee: "Smith",  
    Children: [{ child: "Tom" }],  
    Skills: [{ skill: "filing" }, { skill: "sorting" }]  
  },  
  {  
    _id: 3,  
    employee: "Jones",  
    Children: [{ child: "Joe" }],  
    Skills: [{ skill: "typing" }]  
  }  
])  

db.employees.aggregate([  
  // unnest (μ) 
  { $unwind: "$Children" },  
  // nest (ν), grouping by employee, which is NOT a key  
  {  
    $group: {  
      _id: "$employee",  
      Children: { $addToSet: "$Children" },  
      Skills: { $first: "$Skills" }  
    }  
  },  
  // Clean up  
  { $project: { _id: 0, employee: "$_id", Children: 1, Skills: 1 } },  
  { $sort: { employee: 1 } }  
])  

This results in two Smiths collapsed into one:

[
  {
    Children: [ { child: 'Joe' } ],
    Skills: [ { skill: 'typing' } ],
    employee: 'Jones'
  },
  {
    Children: [ { child: 'Tom' }, { child: 'Sam' }, { child: 'Sue' } ],
    Skills: [ { skill: 'typing' } ],
    employee: 'Smith'
  }
]

The correct way is grouping by the proper key:

db.employees.aggregate([  
  // unnest (μ)
  { $unwind: "$Children" },  
  // nest (ν), grouping by _id, which IS a key (PNF)  
  {  
    $group: {  
      _id: "$_id",  
      employee: { $first: "$employee" },  
      Children: { $push: "$Children" },  
      Skills: { $first: "$Skills" }  
    }  
  },  
  // Clean up  
  { $project: { _id: 1, employee: 1, Children: 1, Skills: 1 } },  
  { $sort: { _id: 1 } }  
])  

This is the paper's Theorem 5-2 demonstrated in MongoDB: nest inverts unnest if and only if the grouping key satisfies PNF.

Extended Algebra Operators

In a previous post, From Relational Algebra to Document Semantics, I explained how MongoDB extends the semantics of relational selection (σ) to non-1NF schemas, and mentioned that other relational operations are available in MongoDB. They are covered by the ¬1NF paper.

The paper defines extended union (∪ᵉ) to merge nested relations for tuples that agree on atomic attributes, rather than treating them as separate tuples. In MongoDB, this is achieved with $merge or with aggregation. Suppose we want to merge new course data into existing student records:

db.students.insertMany([
  {
    sname: "Jones",
    Courses: [
      { cname: "Math", grade: "A" },
      { cname: "Science", grade: "B" }
    ]
  },
  {
    sname: "Smith",
    Courses: [
      { cname: "Math", grade: "A" },
      { cname: "Physics", grade: "C" },
      { cname: "Science", grade: "A" }
    ]
  }
])

db.students.updateOne(
  { sname: "Jones" },
  { $addToSet: { Courses: { cname: "Physics", grade: "B" } } }
)

db.students.updateOne(
  { sname: "Smith" },
  {
    $addToSet: {
      Courses: {
        $each: [
          { cname: "Chemistry", grade: "A" },
          { cname: "English", grade: "B" }
        ]
      }
    }
  }
)

This added "Physics: B" to Jones, and "Chemistry: A" and "English: B" to Smith, without adding new tuples with different course sets the way a standard union would:

db.students.find()

[
  {
    _id: ObjectId('69c3a72344fc089068d4b0c2'),
    sname: 'Jones',
    Courses: [
      { cname: 'Math', grade: 'A' },
      { cname: 'Science', grade: 'B' },
      { cname: 'Physics', grade: 'B' }
    ]
  },
  {
    _id: ObjectId('69c3a72344fc089068d4b0c3'),
    sname: 'Smith',
    Courses: [
      { cname: 'Math', grade: 'A' },
      { cname: 'Physics', grade: 'C' },
      { cname: 'Science', grade: 'A' },
      { cname: 'Chemistry', grade: 'A' },
      { cname: 'English', grade: 'B' }
    ]
  }
]
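The same extended-union semantics can be sketched in plain JavaScript as a merge keyed on the atomic attribute (hypothetical helper; sname plays the role of the atomic key):

```javascript
// Extended union (∪ᵉ): tuples that agree on atomic attributes are
// merged by unioning their nested relations, instead of appearing twice
function extendedUnion(r, s) {
  const byKey = new Map(r.map(t => [t.sname, { ...t, Courses: [...t.Courses] }]));
  for (const t of s) {
    const hit = byKey.get(t.sname);
    if (!hit) { byKey.set(t.sname, { ...t, Courses: [...t.Courses] }); continue; }
    for (const c of t.Courses) {
      // $addToSet-like: skip exact duplicates
      if (!hit.Courses.some(x => x.cname === c.cname && x.grade === c.grade)) {
        hit.Courses.push(c);
      }
    }
  }
  return [...byKey.values()];
}

const r = [{ sname: "Jones", Courses: [{ cname: "Math", grade: "A" }] }];
const s = [{ sname: "Jones", Courses: [{ cname: "Physics", grade: "B" }] }];

// One Jones tuple with both courses, not two Jones tuples
console.log(extendedUnion(r, s));
```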

The paper emphasizes that standard set union treats entire tuples as atomic — if two tuples differ at all, both appear in the result. Extended union is ... (truncated)

March 24, 2026

MongoDB Transaction Performance

Many believe MongoDB transactions are slow, but this misconception often comes from misunderstanding optimistic locking — where transactions retry on conflict rather than blocking, making perceived slowness a function of contention, not inherent overhead (see this previous post).

In MongoDB, all data manipulation operations are transactional at the storage engine level. Single-document operations (insert, update, delete) use internal storage transactions via the WriteUnitOfWork and RecoveryUnit interfaces, ensuring ACID properties for each document change and its index entries.

In MongoDB, the term transaction refers specifically to multi-document transactions, which provide atomicity across multiple documents and collections via sessions. Their performance differs from single-document operations, but they are not clearly faster or slower:

  • Atomicity: Multi-document transactions use extra memory to track uncommitted changes and maintain transaction state across operations.

  • Consistency: Both single-document and multi-document operations enforce index updates and schema validation at write time.

  • Isolation: Multi-document transactions offer snapshot isolation using optimistic concurrency control, which can lead to write conflicts. When conflicts occur, MongoDB labels the error as TransientTransactionError — clients should handle this with retry logic and exponential backoff, as described in the previous post.

  • Durability: Multi-document transactions batch changes into fewer oplog entries, so they can reduce latency compared to multiple single-document transactions.
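The retry logic mentioned under Isolation can be sketched as a generic helper (names and backoff values are hypothetical; the drivers' transaction helpers implement a similar pattern):

```javascript
// Sketch: retry a transactional callback when the error carries the
// TransientTransactionError label, with exponential backoff
async function withRetry(fn, maxRetries = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const transient = err.errorLabels?.includes("TransientTransactionError");
      if (!transient || attempt >= maxRetries) throw err;
      // backoff: 10ms, 20ms, 40ms, ... before retrying the transaction
      await new Promise(r => setTimeout(r, 10 * 2 ** attempt));
    }
  }
}
```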

Example: Inserting One Million Documents

Here is an example on a MongoDB Atlas free cluster where I insert one million documents with a single insertMany() call:

db.test.drop();
const start = new Date();
// Insert a million documents
const res = db.test.insertMany(
  Array.from({ length: 1e6 }, (_, i) => ({
    name: `user_${i}`,
    value: Math.random(),
  }))
);
// Show timing
const elapsed = new Date() - start;
print(`Elapsed: ${elapsed} ms`);

On a free MongoDB Atlas cluster, this operation takes Elapsed: 53025 ms. Although it's a single call, ACID properties apply per document — each document is its own unit of consistency, much like an aggregate in domain-driven design. This means another session can see some inserted documents before the operation completes.

If we want the inserted documents to be visible only when they are all there, we can run the same operation in a transaction:

db.test.drop();
// Start a transaction in a session
const se = db.getMongo().startSession();
const sessionDb = se.getDatabase(db.getName());
const start = new Date();
se.startTransaction();
// Insert a million documents
const res = sessionDb.test.insertMany(
  Array.from({ length: 1e6 }, (_, i) => ({
    name: `user_${i}`,
    value: Math.random(),
  }))
);
// Commit and show timing
se.commitTransaction();
const elapsed = new Date() - start;
print(`Elapsed: ${elapsed} ms`);
se.endSession();

On a MongoDB Atlas free cluster, this takes Elapsed: 49047 ms, which is slightly faster. Note that this approaches the default transactionLifetimeLimitSeconds of 60 seconds — larger operations or slower clusters should adjust this parameter to avoid hitting the limit. MongoDB is optimized for OLTP with short transactions — it’s better to fail than to wait an unpredictable amount of time.

You might expect a larger speedup from batched durability, but single-document transactions are already highly optimized. MongoDB can piggyback multiple inserts into a single applyOps oplog entry, and WiredTiger batches log flushes so multiple transactions can share a single disk sync. For j: true and w: majority (the defaults), MongoDB often triggers a journal flush without waiting, piggybacking on replication acknowledgment to ensure durability.

When you hear that transactions are slow, remember this is usually a misunderstanding. Use transactions as your application requires, based on its atomicity, consistency, isolation, and durability boundaries. Performance depends on whether documents are on the same shard or the transaction spans multiple shards. Transactions aren’t inherently slow—the document model is tuned to the domain model’s consistency boundaries, favoring single-document transactions.

Sysbench vs MySQL on a small server: no new regressions, many old ones

This post has performance results for InnoDB from MySQL 5.6.51, 5.7.44, 8.0.X, 8.4.8 and 9.7.0 on a small server with sysbench microbenchmarks. The workload here is cached by InnoDB and my focus is on regressions from new CPU overheads.

In many cases, MySQL 5.6.51 gets about 1.5X more QPS than modern MySQL (8.0.x thru 9.7). The root cause is new CPU overhead, possibly from code bloat.

tl;dr

  • There are too many performance regressions in MySQL 8.0.X
  • There are few performance regressions in MySQL 8.4 through 9.7.0
  • In many cases MySQL 5.6.51 gets ~1.5X more QPS than 9.7.0 because 9.7.0 uses more CPU
  • Large regressions arrived in MySQL 8.0.30 and 8.0.32, especially for full-table scans

Builds, configuration and hardware

I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.X, 8.4.8 and 9.7.0. For MySQL 8.0.X I used 8.0.28, 8.0.30, 8.0.31, 8.0.32, 8.0.33, 8.0.34, 8.0.35, 8.0.36 and 8.0.45.

The server is an ASUS ExpertCenter PN53 with AMD Ryzen 7 7735HS, 32G RAM and an m.2 device for the database. More details on it are here. The OS is Ubuntu 24.04 and the database filesystem is ext4 with discard enabled.

The my.cnf files are here for 5.6, 5.7, 8.4 and 9.7.

The my.cnf files are here for 8.0.28, 8.0.30, 8.0.31, 8.0.32, 8.0.33, 8.0.34, 8.0.35, 8.0.36 and 8.0.45.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by InnoDB.

The tests are run using 1 table with 50M rows. The read-heavy microbenchmarks run for 630 seconds and the write-heavy for 930 seconds.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide tables below with relative QPS, computed as:
(QPS for some version) / (QPS for base version)
where the base version is either MySQL 5.6.51 or 8.0.28. When the relative QPS is > 1, a version is faster than the base version; when it is < 1, there might be a regression. Values from iostat and vmstat divided by QPS are here for 5.6.51 as the base version and here for 8.0.28 as the base version. These help explain why something is faster or slower, because they show how much hardware is used per request.
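To make the metric concrete, here is a throwaway JavaScript calculation with hypothetical QPS numbers:

```javascript
// relative QPS = QPS(version) / QPS(base); < 1 suggests a regression
const relativeQPS = (qps, baseQPS) => qps / baseQPS;

console.log(relativeQPS(6500, 10000));  // 0.65 -> likely regression vs base
console.log(relativeQPS(12000, 10000)); // 1.2  -> faster than base
```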

Results: point queries

Summary:
  • there are large regressions from 5.6.51 to 5.7.44
  • there are larger regressions from 5.7.44 to 8.0.45
  • the regressions from 8.0.45 to 9.7.0 are small
  • the regressions in the random-points tests are larger for range=10 than range=1000 (larger when the range is smaller). So the regressions are more likely to be in places other than InnoDB. The problem is new CPU overhead (see cpu/o here) which is 1.55X larger in 9.7.0 vs 5.6.51 for random-points_range=10 but only 1.19X larger in 9.7.0 for random-points_range=1000.
Relative to: 5.6.51
col-1 : 5.7.44
col-2 : 8.0.45
col-3 : 8.4.8
col-4 : 9.7.0

col-1   col-2   col-3   col-4
0.87    0.65    0.65    0.64    hot-points
0.87    0.69    0.67    0.63    point-query
0.87    0.72    0.72    0.71    points-covered-pk
0.90    0.78    0.78    0.76    points-covered-si
0.89    0.73    0.72    0.71    points-notcovered-pk
0.89    0.77    0.76    0.75    points-notcovered-si
1.00    0.84    0.83    0.83    random-points_range=1000
0.89    0.72    0.72    0.72    random-points_range=100
0.87    0.69    0.68    0.66    random-points_range=10

Summary:
  • The large regressions in 8.0.x for point queries (see above) occur prior to 8.0.28
Relative to: 8.0.28
col-1 : 8.0.30
col-2 : 8.0.31
col-3 : 8.0.32
col-4 : 8.0.33
col-5 : 8.0.34
col-6 : 8.0.35
col-7 : 8.0.36
col-8 : 8.0.45

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8
0.92    1.14    1.14    1.12    1.17    1.16    1.16    1.16    hot-points
0.97    0.97    0.95    0.96    0.95    0.95    0.95    0.95    point-query
0.94    1.09    1.09    1.08    1.12    1.12    1.11    1.15    points-covered-pk
0.90    1.08    1.07    1.07    1.12    1.13    1.12    1.16    points-covered-si
0.91    1.04    1.04    1.03    1.07    1.07    1.06    1.11    points-notcovered-pk
0.88    0.96    0.96    0.95    1.00    1.01    1.00    1.06    points-notcovered-si
0.79    2.35    2.42    2.37    2.45    2.45    2.47    2.56    random-points_range=1000
0.94    1.07    1.06    1.06    1.09    1.08    1.10    1.12    random-points_range=100
0.93    0.94    0.93    0.93    0.94    0.94    0.93    0.95    random-points_range=10

Results: range queries without aggregation

Summary:
  • there are large regressions from 5.6.51 to 5.7.44
  • there are larger regressions from 5.7.44 to 8.0.45
  • the regressions from 8.0.45 to 9.7.0 are small
  • the problem is new CPU overhead and for the scan test the CPU overhead per query is about 1.5X larger in modern MySQL (8.0 thru 9.7) relative to MySQL 5.6.51 (see cpu/o here)
Relative to: 5.6.51
col-1 : 5.7.44
col-2 : 8.0.45
col-3 : 8.4.8
col-4 : 9.7.0

col-1   col-2   col-3   col-4
0.83    0.68    0.66    0.65    range-covered-pk
0.83    0.70    0.69    0.67    range-covered-si
0.84    0.66    0.65    0.64    range-notcovered-pk
0.88    0.74    0.73    0.73    range-notcovered-si
0.84    0.67    0.66    0.67    scan

Summary:
  • There is a large regression in 8.0.30 and a larger one in 8.0.32
  • The scan test is the worst case for the regression.
Relative to: 8.0.28
col-1 : 8.0.30
col-2 : 8.0.31
col-3 : 8.0.32
col-4 : 8.0.33
col-5 : 8.0.34
col-6 : 8.0.35
col-7 : 8.0.36
col-8 : 8.0.45

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8
0.95    0.94    0.92    0.92    0.92    0.93    0.93    0.96    range-covered-pk
0.96    0.96    0.94    0.93    0.93    0.94    0.93    0.95    range-covered-si
0.94    0.94    0.93    0.93    0.94    0.94    0.93    0.93    range-notcovered-pk
0.89    0.87    0.87    0.86    0.89    0.91    0.89    0.95    range-notcovered-si
0.93    0.92    0.79    0.82    0.83    0.77    0.82    0.80    scan

Results: range queries with aggregation

Summary:
  • there are large regressions from 5.6.51 to 5.7.44
  • there are larger regressions from 5.7.44 to 8.0.45
  • the regressions from 8.0.45 to 9.7.0 are small
Relative to: 5.6.51
col-1 : 5.7.44
col-2 : 8.0.45
col-3 : 8.4.8
col-4 : 9.7.0

col-1   col-2   col-3   col-4
0.86    0.70    0.69    0.68    read-only-count
1.42    1.27    1.24    1.23    read-only-distinct
0.91    0.75    0.74    0.73    read-only-order
1.23    1.01    1.01    1.01    read-only_range=10000
0.93    0.77    0.76    0.74    read-only_range=100
0.86    0.69    0.68    0.66    read-only_range=10
0.83    0.68    0.68    0.66    read-only-simple
0.83    0.67    0.67    0.66    read-only-sum

Summary:
  • There are significant regressions in 8.0.30 and 8.0.32
Relative to: 8.0.28
col-1 : 8.0.30
col-2 : 8.0.31
col-3 : 8.0.32
col-4 : 8.0.33
col-5 : 8.0.34
col-6 : 8.0.35
col-7 : 8.0.36
col-8 : 8.0.45

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8
0.95    0.94    0.87    0.87    0.88    0.87    0.89    0.91    read-only-count
0.97    0.96    0.94    0.95    0.96    0.95    0.95    0.96    read-only-distinct
0.97    0.96    0.93    0.95    0.95    0.94    0.95    0.95    read-only-order
0.96    0.95    0.93    0.94    0.95    0.95    0.96    0.98    read-only_range=10000
0.96    0.96    0.94    0.95    0.95    0.94    0.95    0.94    read-only_range=100
0.96    0.97    0.95    0.95    0.95    0.94    0.95    0.94    read-only_range=10
0.94    0.94    0.92    0.93    0.93    0.93    0.94    0.94    read-only-simple
0.94    0.94    0.89    0.91    0.92    0.90    0.93    0.91    read-only-sum

Results: writes

Summary:
  • there are large regressions from 5.6.51 to 5.7.44
  • there are larger regressions from 5.7.44 to 8.0.45
  • the regressions from 8.0.45 to 9.7.0 are small
  • the insert test is the worst case and a big part of that is new CPU overhead, see cpu/o here, where it is 2.13X larger in 9.7.0 than 5.6.51. But for update-one the problem is writing more to storage per commit (see wkbpi here) rather than new CPU overhead.
Relative to: 5.6.51
col-1 : 5.7.44
col-2 : 8.0.45
col-3 : 8.4.8
col-4 : 9.7.0

col-1   col-2   col-3   col-4
0.85    0.60    0.59    0.55    delete
0.81    0.55    0.54    0.52    insert
0.93    0.75    0.74    0.71    read-write_range=100
0.87    0.70    0.68    0.66    read-write_range=10
1.20    0.88    0.89    0.91    update-index
1.04    0.74    0.73    0.71    update-inlist
0.87    0.62    0.61    0.57    update-nonindex
0.87    0.62    0.60    0.57    update-one
0.87    0.63    0.61    0.58    update-zipf
0.93    0.69    0.68    0.66    write-only

Summary:
  • There are significant regressions in 8.0.30 and 8.0.32
Relative to: 8.0.28
col-1 : 8.0.30
col-2 : 8.0.31
col-3 : 8.0.32
col-4 : 8.0.33
col-5 : 8.0.34
col-6 : 8.0.35
col-7 : 8.0.36
col-8 : 8.0.45

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8
0.96    0.95    0.92    0.92    0.92    0.91    0.91    0.91    delete
0.94    0.93    0.91    0.91    0.91    0.90    0.90    0.90    insert
0.96    0.96    0.94    0.94    0.94    0.94    0.94    0.93    read-write_range=100
0.96    0.96    0.94    0.94    0.94    0.94    0.94    0.93    read-write_range=10
0.91    0.91    0.84    0.84    0.86    0.85    0.86    0.79    update-index
0.94    0.95    0.92    0.91    0.92    0.91    0.91    0.91    update-inlist
0.95    0.96    0.92    0.92    0.92    0.91    0.91    0.90    update-nonindex
0.96    0.96    0.93    0.92    0.92    0.91    0.92    0.91    update-one
0.96    0.96    0.92    0.92    0.92    0.91    0.91    0.90    update-zipf
0.94    0.94    0.91    0.91    0.91    0.91    0.91    0.89    write-only

SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems

This paper presents SysMoBench, a benchmark designed to evaluate generative AI's ability to formally model complex concurrent and distributed systems. Although the paper was published in January 2026, the AI landscape moves so fast that the models evaluated (like Claude-Sonnet-4 and GPT-5) already feel dated after the release of heavy hitters like Claude 3.5 Opus and OpenAI's Codex.

The paper draws a distinction between algorithms/protocols and system modeling. As the paper (somewhat circularly) defines it, "system models enable verification of system code via comprehensive testing and model checking". The paper says: Modeling these systems requires the AI to deeply understand the system design, reason about safety and liveness under unexpected faults, and abstract system behavior into an executable program.


This paper is a great follow-up to my post yesterday on TLA+ Mental Models. It is instructive to see how the paper corroborates claims from that post. My post identified deciding what to abstract as the hardest skill in TLA+. Rather than letting the AI decide what to ignore, the benchmark tasks in Section 3.1 give the AI detailed instructions about which core actions to model and which implementation details to explicitly exclude.

The paper also validates my observation that invariants are the most signal-heavy part of a spec and, hence, the most difficult part for AI and people to write. The authors abandoned having the AI write invariants from scratch; instead they provide invariant templates that contain both a natural-language description and a formal TLA+ example. The AI is only asked to concretize these templates by mapping them to its own variable names.

I'll be criticizing the paper in several places in my writeup below, so just to make it clear upfront, I really like the paper. This paper is a strong accept from me for any conference.

SysMoBench uses the following automated quality metrics for evaluating TLA+ models:

  • Syntax correctness: Statically checked using the SANY Syntactic Analyzer.
  • Runtime correctness: Checked by running the TLC model checker as a proxy for logical self-consistency.
  • Conformance: Measured via trace validation to see if the model conforms to the actual system implementation.
  • Invariant correctness: Model-checked against system-specific safety and liveness properties.

Here "conformance" looks contentious to me because of the uncertainty around how refinement mappings are handled and how step-size impedance mismatches are resolved. The paper circumvents this by instrumenting the system code to generate trace logs, and then uses an LLM to automatically map the TLA+ variables and actions to the code traces. If AI-generated models inherently lean towards line-by-line code translation, are we risking losing the cognitive and design benefits of abstraction? Does the Conformance metric inherently bias the benchmark against good abstraction? How could a benchmark be designed to automatically evaluate and reward an AI for successfully cutting away irrelevant mechanics rather than matching them?


Looking at Table 3, the LLMs fail miserably on complex systems, with only Claude-Sonnet-4 managing to piece together anything functional (and even then, scoring very low on conformance for Etcd Raft). The paper's analysis of LLMs highlights fundamental weaknesses. LLMs constantly introduce syntax errors (as shown in Figure 4a). For instance, DeepSeek-R1 frequently hallucinates mathematical symbols (like $\cap, \forall$) instead of standard ASCII TLA+ operators. GPT-5 and Gemini-2.5-Pro often mix TLA+ syntax with Python. Regarding runtime errors (Figure 4b), LLMs frequently generate inconsistent TLC configurations or fundamentally misunderstand TLA+ data structures (e.g., trying to apply set operations to records). I think that things have improved significantly with newer models, so an update from the team would be really great!

The paper notes that LLMs violated 41.9% of liveness properties but only 8.3% of safety properties (as seen in Figure 4c). This indicates a severe limitation in temporal reasoning. The violations are largely due to missing or incorrectly specified fairness assumptions, though logical structural errors often block the model from making progress before the fairness issues even manifest.

The Trace Learning Agent, which attempts to infer the TLA+ model purely from system execution traces rather than source code, fared even worse. I think this is another indication of the temporal-reasoning problem LLMs have with modeling. The trace learning agent underperformed the basic modeling and code-translation agents, failing to pass runtime checks entirely. Even Claude-Sonnet-4 achieved extremely low syntax scores when using the Trace Learning Agent.

The paper is a great start for an LLM system-modeling benchmark. But it currently lacks diversity: the dataset leans heavily on consensus protocols. Of the 7 distributed-systems artifacts, five (Etcd Raft, Redis Raft, Xline CURP, PGo raftkvs, ZooKeeper FLE) are essentially variations of distributed consensus/leader election. The authors mention that adding new systems to SysMoBench requires significant effort to instrument the system to collect execution logs for trace validation. I think they need a better sell. Why should an engineer go through the effort of instrumenting their system for SysMoBench if the AI is going to fail at modeling it? A clearer value proposition is needed to drive community contributions here.

Restoring a 2018 iPad Pro

This was surprisingly hard to find—hat tip to Reddit’s Nakkokaro and xBl4ck. Apple’s instructions for restoring an iPad Pro (3rd generation, 2018) seem to be wrong; both an Apple Store technician and I found that Finder, at least in Tahoe, won’t show the iPad once it reboots into recovery mode. The trick seems to be that you need to unplug the cable, start the reset process, and plug the cable back in during the reset:

  1. Unplug the USB cable from the iPad.
  2. Tap volume-up.
  3. Tap volume-down.
  4. Begin holding the power button.
  5. After roughly two seconds of holding the power button, plug in the USB cable.
  6. Continue holding until the iPad reboots into recovery mode.

Hopefully this helps someone else!

March 23, 2026

TLA+ mental models

In the age of LLMs, syntax is no longer the bottleneck for writing, reading, or learning TLA+. People are even getting value by generating TLA+ models and counterexamples directly from Google Docs descriptions of the algorithms. The accidental complexity of TLA+ (its syntax and tooling) is going away.

But the intrinsic complexity remains: knowing where to start a model, what to ignore, and how to choose the right abstractions. This is modeling judgment, and it is the hardest skill to teach. Engineers are trained to think in code, control flow, and local state. TLA+ forces you into a different mode: mathematical, declarative, and global. You specify what must hold, not how to achieve it. Once you get comfortable with this shift, it changes how you think about systems, even away from the keyboard.

In a companion post, I described TLA+ as a design accelerator based on lessons from 8 industry projects. Here I want to go deeper and articulate the mental models behind effective TLA+ use. These are the thinking patterns that TLA+ experts apply implicitly, the kind of knowledge acquired by osmosis over the years. I will try to make them explicit and actionable.

The mental models are:

  1. Abstraction, abstraction, abstraction
  2. Embrace the global shared memory model
  3. Refine to local guards and effects (slow is fast)
  4. Derive good invariants
  5. Explore alternatives through stepwise refinement
  6. Aggressively refine atomicity
  7. Share your mental models


1. Abstraction, abstraction, abstraction

Abstraction is a powerful tool for avoiding distraction. The word abstract comes from the Latin for "to cut and draw away". With abstraction, you slice the protocol out of a complex system, omit unnecessary details, and reduce the system to a useful model. You don't have to model everything; most of the value comes from knowing what to ignore. Omission is the default: add a component only when leaving it out breaks your reasoning goal.

Abstraction is the art of knowing what to discard. You can cross-cut to focus on the protocol you want to investigate. If you are interested in the consistency model of your distributed system, you can abstract away the mechanics of communication when that is an unnecessary distraction. If you are interested in a replication protocol, you can abstract away the client interaction. Consider two examples from our industry experience.

  • CosmosDB consistency modeling: We needed to specify CosmosDB's client-facing consistency semantics. The anti-pattern would have been modeling the distributed database engine, which would have caused an immediate state-space explosion and an unreadable spec. Instead, we modeled just the "history as a log" abstraction for client-facing behavior. The inner details of the database were "environment": irrelevant to what we were trying to reason about. We used sort-merge to capture the internals of replication, and a read index to model consistency. This way, five consistency levels became clear predicates over operation histories.
  • Secondary Index at Aurora DSQL: Here, we focused on the secondary index and abstracted away the rest of the database as a log of operations. Log is a common abstraction for State Machine Replication (SMR), which is itself a common abstraction for databases and data stores. By cutting the problem down to its essential slice, we made a complex system tractable.

Leslie Lamport gave a beautiful demonstration of this when he described how he derived Paxos. He started with the most abstract specification: defining agreement based on a vote array. In his words: "I don't remember the thought process that led me to Paxos. But I knew that execution on computers sending messages to one another was an irrelevant detail. I was thinking only about a set of processes and what they needed to know about one another. How they could get that information from messages was the easy part that came later. What I had first was what, many years later, I described as the Voting algorithm."

TLA+ is useful for teaching engineers the art of abstraction and the right way to think and reason about distributed systems. The modeling process itself trains you to ask: what is the behavioral slice I care about, and what can I safely ignore? And because TLA+ gives you a rapid prototyping tool with a tight feedback loop (write a model, check it, revise) you accumulate design experience much faster than you would by building and debugging real systems.


2. Embrace the Global Shared Memory Model

TLA+ gives you a deliberate fiction: a global shared memory that all processes can read and write. This fiction is the foundation of its computational model, and understanding it is essential to thinking in TLA+.

In TLA+, a program consists of two things: (1) a set of variables that define a global state space, and (2) a finite set of actions that transition from one state to the next. This is state-centric reasoning: everything is a predicate (a function mapping state to a boolean). This approach promotes invariant-based thinking (see mental model 4).

Each action follows Sir Tony Hoare's triples: {state} action {state'}. The execution of an action is conditional on a predicate (the "guard") being true. For example: x > 0  →  x := 0; y := y + 1. Or in TLA+ notation:

/\ x > 0
/\ x' = 0
/\ y' = y + 1

The guard is a predicate on the state space. If the guard is true in some state, the action is said to be "enabled" in that state. A program begins in any state satisfying the initial predicate. Then an enabled action is nondeterministically selected and executed atomically. This process repeats infinitely. If a selected action's guard is false, it is simply a skip and it does not change the state.
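Put together, a complete toy spec in this guarded-command style might look like the following sketch (the module and action names are mine, for illustration):

```tla
---- MODULE GuardedToy ----
EXTENDS Integers
VARIABLES x, y
vars == <<x, y>>

Init == x = 3 /\ y = 0

\* Enabled only while x > 0: drain x and bump y.
Drain == /\ x > 0
         /\ x' = 0
         /\ y' = y + 1

\* Enabled only after a drain: restore x, leave y unchanged.
Refill == /\ x = 0
          /\ x' = 3
          /\ y' = y

\* Each step, an enabled disjunct is nondeterministically selected.
Next == Drain \/ Refill

Spec == Init /\ [][Next]_vars
====
```

The model checker starts from any state satisfying Init and repeatedly executes, atomically, whichever enabled action it nondeterministically picks: exactly the loop described above.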

The global shared memory (GSM) model fits naturally with this style of reasoning. You read the current state and install a new state; no channels, no message serialization, no process boundaries to manage. It is the most minimal way to write models that fit the guarded-command model. Be frugal in defining variables, though: each new variable multiplies the size of the state space. The payoff is that safety and liveness become compact predicates over global state. Your program defines an invariant set (i.e., the good states) and must never transition out of it.

You don't have to model channels, message serialization, or network topology unless those are the specific things you're reasoning about. It is possible to map GSM to message passing if you keep to "local-ish" guards and strictly local variable assignments. What do we mean by "local variable" in a global shared state space? A common way is to use indices per node, so vote[i] refers to node i's vote. The global variable is the vote array, and the local version is vote[i]. It's all math, and math needs abstraction. TLA+'s computational model, shaped around the global shared memory fiction, enables you to reason at the right level of abstraction.
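A sketch of that indexing convention (Node, Value, None, and the action name are illustrative):

```tla
CONSTANTS Node, Value, None
VARIABLE vote   \* global array; vote[i] is node i's "local" state

Init == vote = [n \in Node |-> None]

\* Node i writes only its own slot: a local effect inside global memory.
CastVote(i) == /\ vote[i] = None
               /\ \E v \in Value : vote' = [vote EXCEPT ![i] = v]
```

The guard reads only vote[i], and the EXCEPT update touches only node i's slot, so this action maps cleanly onto a message-passing implementation.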


3. Refine to local guards and effects (slow is fast)

The global shared memory fiction of TLA+ is powerful for reasoning, but it creates a trap: it is easy to write guards that read global state no real process could observe atomically. This is one of the most common modeling errors. A guard that checks what three different nodes have done simultaneously is "illegal knowledge" as no single node in a real distributed system can know all of that at once. A dedicated review pass should ask, for every action: what information could a real node actually know when it decides to act?

So how do you transform your guards to be local? There is a folk theorem that reading your immediate neighbor is fine because it can be mapped to message passing (think hygienic dining philosophers). This essentially boils down to adding another level of indirection and proxy variables, which means locking. Locking is not good for performance, and in mental model 6 I will advocate that you should refine your atomicity and allow as much concurrency as possible for the implementation to enjoy freedom responsibly. So what do we do?

The key insight is to try to figure out stable, monotonic, or locally stable predicates to use for your guards. In my "Hints for Distributed Systems Design", when I tell you to "embrace monotonicity", I am thinking of exactly this: "For doing coordination efficiently and safely in a distributed system, the insight is almost always to transform the problem into a monotonic one."

This is where things get subtle, and it touches on esoteric knowledge that is rarely made explicit. This is something you get to intuit after years of working on distributed systems close to the theoretical side. Your advisor intuited it, their advisor intuited it, they had developed vocabulary shortcuts to refer to this but never made it fully explicit. UC Berkeley's efforts on CALM (Consistency As Logical Monotonicity) made this more concrete by basing it on a language like Dedalus, but the basic gist lives at the level of abstract global state reasoning.

In our "Slow is Fast" paper, we formalized this intuition by partitioning actions into "slow" and "fast". A slow action's guard remains true even if the node's information is slightly stale. This is because either the guard is a stable predicate (once true, stays true), depends only on local variables, or is a locally stable predicate (only the node's own actions can falsify it). A fast action, by contrast, requires fresh global state to evaluate its guard. The key result is this: if you can make your guards locally stable, the protocol requires less coordination and tolerates communication delays gracefully. Hence, slow is fast.

This is not easy to do. Ernie Cohen had internalized this skill, and he could quickly zero in on the monotonic or locally stable predicates to exploit in his protocol explorations. Some people live in this way of thinking so completely that they may not even know how to articulate it, like fish in water. (Case in point: the phrase "global shared memory fiction" was pointed out to me by Marc Brooker. Sometimes it takes a slight outsider to name what insiders take for granted.) The practical takeaway for TLA+ modeling is this: when you write your guards, ask yourself whether the information the guard relies on could become stale, and if so, whether the guard is still safe to act on. If you can make your guards depend on monotonic or locally stable predicates, your protocol will be more robust, more concurrent, and closer to a correct distributed implementation.

Lamport’s derivation of Paxos illustrates this beautifully. He begins with the simplest specification of consensus: chosen starts as the empty set and transitions to a singleton {v}. That is the entire next-state formula. He then refines to a voting algorithm where acceptors vote and a value is chosen if a majority votes for it, and refines further to Paxos to handle the problems that arise (what if N acceptors vote for v1, N for v2, and the remaining acceptor fails?). At each refinement step, the guards become more local. In Paxos, the guard for whether an acceptor should cast a vote depends on local knowledge: what ballots this acceptor has participated in. The monotonic structure of ballot numbers ensures that this local knowledge does not become invalid: once an acceptor knows something about the progress of voting, that fact is permanent. This is what makes Paxos work despite asynchrony and failures.


4. Derive good invariants

You did the modeling for a purpose, not for sport. You want to arrive at reasoning insights about your protocol, and invariants are the distilled version of those insights. Invariant-based reasoning is non-operational: instead of tracing execution paths and happy-path thinking, you ask "what needs to go right?" You specify the boundary conditions, and the model checker explores all possible interleavings to verify them.

Spend time distilling good invariants because they serve you in many ways. They guide you as you explore protocol variants (see mental model 5), telling you what you can and cannot change. They act as contracts when you compose components, defining what each part guarantees to the others. They translate directly into runtime assertions and test oracles for your implementation. And they are essential for adding fault-tolerance: once you know the invariant, you know where you need to recover to after a fault, and you can design recovery actions to reestablish it.

Invariant writing is still mostly manual. LLMs struggle with it because invariants are the most signal-heavy part of a spec: they distill your understanding of what this protocol must guarantee. Getting them right means you have closure on the problem.

People new to TLA+ and formal methods repeatedly fail at this step. I occasionally struggle with it too, especially when entering an unfamiliar domain, where I first have to pay my dues and think harder to gain understanding. The most common failure mode is writing "trivial invariants" that are always true regardless of what the protocol does; you've written the spec for naught. Another is confusing the "end state" with an invariant: an invariant must hold at every reachable state, not just the final one. We are not expecting inductive invariants (those are harder still, and more valuable, since a formal proof follows easily from one). But a reasonably tight invariant that demonstrates understanding and scaffolds further exploration is what you should aim for.

Don't stop at safety properties (what the system is allowed to do). Write liveness properties too (what the system must eventually do). It is important to check properties like <> Termination and Init ~> Solution. Do requests complete? Do leaders emerge? Many "correct" models quietly do nothing forever. A model that never violates safety but makes no progress is useless. Checking liveness catches paths that stall, specs that are overly constrained, and actions that never get enabled.
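As a sketch, with illustrative names (pc, Solution, and vars are placeholders, not from any particular spec), checking liveness means conjoining a fairness condition to the spec and stating temporal properties:

```tla
\* Weak fairness: an action that stays enabled must eventually be taken.
\* Without it, a behavior may legally stutter forever and "do nothing".
LiveSpec == Init /\ [][Next]_vars /\ WF_vars(Next)

\* Eventually something completes (assumes a pc variable tracks progress).
Termination == <>(pc = "Done")

\* Leads-to: every behavior starting from Init eventually reaches Solution.
Progress == Init ~> Solution
```

TLC checks these temporal properties only against LiveSpec; forgetting the fairness conjunct is the usual reason a liveness check spuriously fails.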

Returning to our running example: in Lamport's Paxos derivation, the invariants at each refinement level are instructive. At the Consensus level, the safety invariant is simply that at most one value is chosen (Cardinality(chosen) <= 1). At the Voting level, chosen is defined as the set of values for which there exists a ballot where a quorum of acceptors all voted for that value (chosen == {v \in Value : \E b \in Ballot : ChosenAt(b, v)}). The safety of the Voting algorithm rests on two rules: (1) different acceptors cannot vote for different values in the same ballot, and (2) an acceptor may only vote for value v in ballot b if v is safe at b. These invariants are tight, non-trivial, and they scaffold the entire refinement chain down to Paxos. They show genuine understanding of why consensus works, and that is what I mean by "closure on the problem".


5. Explore alternatives through stepwise refinement

One of TLA+'s greatest strengths is that it supports fast exploration of protocol variants. The key technique is stepwise refinement: start with the most abstract specification of your problem, then progressively add implementation detail, verifying at each step that the refinement preserves the invariants from the level above.

The canonical example is again Lamport's derivation of Paxos. The abstract Consensus spec defines a single variable chosen that transitions from the empty set to a singleton. Then an intermediate Voting model is defined and shown to be a refinement of Consensus. It describes what properties voting must satisfy, but not how acceptors implement them. Finally, the Paxos model refines Voting by adding message passing, ballot leaders, and the two-phase protocol. At each level, the refinement adds detail (ballots, quorums, messages) while preserving the safety property from above.
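The top of that chain is strikingly small. A close paraphrase of the abstract Consensus spec:

```tla
---- MODULE Consensus ----
EXTENDS FiniteSets
CONSTANT Value
VARIABLE chosen

Init == chosen = {}

\* The only allowed step: while no value is chosen, pick exactly one.
Next == /\ chosen = {}
        /\ \E v \in Value : chosen' = {v}

Spec == Init /\ [][Next]_chosen

\* Safety: at most one value is ever chosen.
Inv == Cardinality(chosen) <= 1
====
```

Everything that Voting and Paxos add (ballots, quorums, messages) must refine this two-line next-state relation.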

Refinement is at the heart of abstraction and a cornerstone of TLA+. In TLA+, refinement is simply implication: the concrete system's behaviors must be a subset of the abstract system's allowed behaviors. You check this by declaring an instance of the abstract spec in the concrete one and verifying via TLC that every behavior of the concrete system is an accepted behavior of the abstract system. Even invariant checking is refinement in disguise: does the system model implement this invariant formula?

A big advantage of stepwise refinement is that when you want to explore a protocol variant, you don't start from scratch. You go back up to the appropriate level of abstraction, change one refinement step, and get a different protocol that still satisfies the same high-level specification. This is systematic design space exploration. With LeaseGuard, for example, we started modeling a lease protocol for Raft early. While refining the abstract spec, we discovered two optimizations we hadn't anticipated, including inherited lease reads, which we probably wouldn't have found without thinking at multiple levels of abstraction.

A common failure pattern here is getting stuck at a level of detail, patching corner cases one by one. This is the implementation mindset leaking into modeling. When this happens, go back up. I saw this with the Secondary Index project at Aurora DSQL: an engineer's design was growing by accretion, each corner-case patch creating new corner cases. TLA+ forced a different approach: specify what the secondary index must guarantee abstractly, then search the solution space through refinement. Over a weekend, with no prior TLA+ experience, the engineer had written several variations. The lesson: specify behavior, not implementation, then explore different "how" choices through refinement.


6. Aggressively refine atomicity

Overly large atomic actions hide races. If your TLA+ action does ten things atomically in a single step, you're sweeping concurrency under the rug. The model will look correct, but it won't represent the interleavings your real system will face. Actions should be as fine-grained as correctness allows. Smaller steps expose the interleavings the protocol must tolerate and make invariants more meaningful.

A practical approach is to start coarse-grained to establish correctness, then systematically split actions into smaller steps and verify safety still holds. This is stepwise refinement (mental model 5) applied to action granularity. Each split increases the interleaving space, which is precisely where TLC earns its keep: the interference surface area may explode, but TLC will exhaustively check that your invariants hold.
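A schematic example of such a split, with illustrative names (unchanged-variable clauses omitted for brevity):

```tla
\* Coarse-grained: debit and credit happen in one atomic step.
Transfer(a, b, amt) ==
    /\ bal[a] >= amt
    /\ bal' = [bal EXCEPT ![a] = @ - amt, ![b] = @ + amt]

\* Refined: two steps, exposing the interleaving where the amount is
\* "in flight" and other actions can observe the intermediate state.
Withdraw(a, amt) == /\ bal[a] >= amt
                    /\ bal' = [bal EXCEPT ![a] = @ - amt]
                    /\ inflight' = inflight + amt

Deposit(b, amt) ==  /\ inflight >= amt
                    /\ bal' = [bal EXCEPT ![b] = @ + amt]
                    /\ inflight' = inflight - amt
```

After the split, TLC re-checks whether the conservation invariant still holds across the new intermediate states; if it doesn't, you have found a real race, not a modeling artifact.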

The goal here is to give the implementation maximum freedom to reorder and parallelize, while verifying that the protocol remains correct under that concurrency. Fine-grained atomicity in the spec means the implementation can schedule operations in any order that respects the guard conditions.

Our StableEmptySet project is a good example. Ernie Cohen pushed atomic actions to the finest possible granularity to avoid distributed locking, and the resulting protocol was more concurrent than a lock-based design. The interleaving surface area increased, but TLC handled it without breaking a sweat. This is where the "slow is fast" insight (mental model 3) connects back: by refining guards to depend on locally stable predicates, you can split actions finely without introducing illegal knowledge: the guards remain valid even as the global state changes between steps.


7. Share your mental models

TLA+ models are also meant to be communication tools. A well-written spec serves as precise, executable documentation of a protocol, capturing design intent in a way that prose specifications and code comments cannot match.

At AWS, when I wrote the first TLA+ model of Aurora DSQL's distributed transaction protocol, the model's value quickly went beyond correctness confidence. It served as a communication anchor for a large team. When we sought further formal methods support, the TLA+ models sped up onboarding for new team members and kept everyone aligned on the protocol's design. Instead of arguing over ambiguous prose in a design document, the team could point to specific actions and invariants in the spec.

Our MongoDB distributed transactions work reinforces this. We wrote the first modular TLA+ specification of the protocol many years after it was in production. The model now serves as an authoritative description of what the protocol actually guarantees, bringing clarity to a system that had become difficult to reason about as a whole.

So, you should write your specs for humans, not just for TLC. Use clear action names, write TypeOK invariants early (they serve as executable documentation of your data model), and make every important property an explicit invariant. A well-structured spec communicates protocol intent more clearly than thousands of lines of implementation code, because it shows the essential logic without the noise of error handling, serialization, and configuration.
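For instance, a TypeOK invariant (variable names illustrative) documents the data model at a glance:

```tla
\* Executable documentation of the data model: checked at every state.
TypeOK == /\ vote   \in [Node -> Value \cup {None}]
          /\ chosen \in SUBSET Value
          /\ ballot \in [Node -> Nat]
```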

Tools like Spectacle can visualize TLA+ state spaces and execution traces, bridging the gap between mathematical precision and the operational intuition engineers thrive on. TLA+ is one of the purest intellectual pleasures: reducing a complex distributed system to its essential logic. Please share that pleasure with others by writing about your models, presenting them, and using them to teach.

As LLMs write more of our code, the value of TLA+ for design and reasoning will only grow. TLA+ has the potential to become a cornerstone in an AI+formal methods stack for building systems. The mental models I've described here are the foundation for that future. By mastering abstraction, embracing the global shared memory model, refining to local guards, deriving good invariants, exploring alternatives through refinement, aggressively refining atomicity, and sharing our mental models, we can unlock the full power of TLA+ to design better distributed systems in the age of AI. 

Introducing Database Traffic Control

Enforce real-time limits on your Postgres query traffic to protect your database from runaway queries and unexpected load spikes.

March 20, 2026

MongoDB Query Plan Cache Explained: Performance, Pitfalls, and Re-Planning

When MongoDB receives a query, it performs the following steps:

  1. Evaluate the available indexes that could be used.
  2. Generate and test multiple execution plans using candidate indexes.
  3. Measure their performance during a trial phase.
  4. Select the fastest plan (the winning plan) and execute the query.

These steps are known as query planning, and they are […]

CedarDB: Catching Up on Recent Releases

This post takes a closer look at some of the most impactful features we have shipped in CedarDB across our recent releases. Whether you have been following along closely or are just catching up, here is a deeper look at the additions we are most excited about.

Parquet support: Your On-Ramp to CedarDB

v2026-03-03

Due to the significant compression facilitated by its columnar format, Parquet has become quite popular in the past several years. Today we find Parquet in projects like Spark, DuckDB, Apache Iceberg, ClickHouse, and others. Because of this, Parquet has become not only a data format for running analytical queries, but also a handy format for data exchange between OLAP engines. CedarDB is now able to read Parquet data and directly load it into its own native data format with a single SQL statement:

postgres=# select * from 'data.parquet' limit 6;
 id | name | email | age | city | created_at
----+---------------+-------------------+-----+-------------+------------
 1 | Alice Johnson | alice@example.com | 29 | New York | 2024-01-15
 2 | Bob Smith | bob@example.com | 34 | Los Angeles | 2024-02-20
 3 | Clara Davis | clara@example.com | 27 | Chicago | 2024-03-05
 4 | David Lee | david@example.com | 41 | Houston | 2024-04-18
 5 | Eva Martinez | eva@example.com | 23 | Phoenix | 2024-05-30
 6 | Frank Wilson | frank@example.com | 38 | San Antonio | 2024-06-12
(6 rows)

CREATE TABLE my_table as SELECT * from 'data.parquet';

This makes migrating from other OLAP systems, or ingesting data from your data lake, straightforward. Check out our Parquet Documentation and our hands-on comparison of CedarDB vs. ClickHouse using StackOverflow data, where CedarDB was 1.5-11x faster than ClickHouse after a Parquet-based migration.

Better Compression, Less Storage: Floats and Text

v2026-01-22 and v2025-12-10

Storage efficiency has seen a major leap on two fronts. For floating-point columns (i.e., of type REAL and DOUBLE PRECISION), CedarDB now applies Adaptive Lossless Floating-Point compression (ALP), a state-of-the-art technique that halves the on-disk footprint (on average) - perfect for workloads heavy on sensor readings or metrics.

For TEXT columns, we’ve adopted FSST (Fast Static Symbol Tables) combined with dictionary encoding: in practice this can halve the storage size of text-heavy tables while also making queries faster, since less data needs to be read from disk. Read our deep-dive blog post on the FSST implementation here.

without DictFSST:

postgres=# SELECT 100 * ROUND(SUM(compressedSize)::float / SUM(uncompressedSize), 4) || '%' as of_original FROM cedardb_compression_info WHERE tableName = 'hits' AND attributeName = 'title';
 of_original
-------------
 21.83%
(1 row)

with DictFSST:

postgres=# SELECT 100 * ROUND(SUM(compressedSize)::float / SUM(uncompressedSize), 4) || '%' as of_original FROM cedardb_compression_info WHERE tableName = 'hits' AND attributeName = 'title';
 of_original
-------------
 13.42%
(1 row)

Improve Parallelism of DDL and Large Writes

v2026-02-03

Schema changes and parallel bulk loads no longer bring your database to a halt. CedarDB now supports DDL operations like ALTER TABLE and CREATE INDEX on different tables in parallel. Similarly, large INSERT, UPDATE, and DELETE operations now also run in parallel. None of these operations impact parallel readers anymore. For teams running mixed OLTP/OLAP workloads, this is a significant quality-of-life improvement.

PostgreSQL Advisory Locks

v2025-11-06

CedarDB now has full support for PostgreSQL advisory locks, including blocking variants that wait until a resource is freed and automatic deadlock detection. This fills an important compatibility gap for operational applications that use advisory locks for custom mutual exclusion either to implement custom functionality (think job schedulers or distributed task queues) or require them for correctness guarantees (think schema migration tools). Check out our advisory locks documentation for more info.

SELECT pg_advisory_lock(123);
-- ... do work ...
SELECT pg_advisory_unlock(123);

Late Materialization

v2026-01-15

Internally, CedarDB now delays fetching column data until it’s actually needed, a technique called late materialization. When a query filters out most rows early, this means only the surviving rows ever have their full column data fetched from storage. The result is faster queries on wide tables, with less wasted I/O. This improvement is transparent: no changes required to your queries or schema.

In the example below, a table scan is reduced from reading 24 GB on disk to just 75 MB.

Before:

postgres=# explain analyze SELECT * FROM hits WHERE url LIKE '%google%' ORDER BY eventtime LIMIT 10;
 plan 
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 🖩 OUTPUT () +
 ▲ SORT (In-Memory) (Result Materialized: 626 KB, Result Utilization: 83 %, Peak Materialized: 635 KB, Peak Utilization: 83 %, Card: 10, Estimate: 10, Time: 0 ms (0 %))+
 🗐 TABLESCAN on hits (num IOs: 29'046, Fetched: 24 GB, Card: 853, Estimate: 67'688, Time: 10632 ms (100 % ***))
(1 row)

Time: 10651.826 ms (00:10.652)

After:

postgres=# explain analyze SELECT * FROM hits WHERE url LIKE '%google%' ORDER BY eventtime LIMIT 10;
 plan 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 🖩 OUTPUT () +
 l LATEMATERIALIZATION (Card: 10, Estimate: 10) +
 ├───▲ SORT (In-Memory) (Result Materialized: 29 KB, Result Utilization: 58 %, Peak Materialized: 34 KB, Peak Utilization: 60 %, Card: 10, Estimate: 10, Time: 10 ms (8 % *))+
 │ 🗐 TABLESCAN on hits (num IOs: 20, Fetched: 75 MB, Card: 482, Estimate: 67'688, Time: 116 ms (92 % ***)) +
 └───🗐 TABLESCAN on hits (Estimate: 10)
(1 row)

Time: 139.806 ms

New Types: Unsigned Integers, UUIDv7, and Enums

v2025-11-06, v2025-10-22, and v2026-01-22

Three type additions make CedarDB more expressive:

  • Unsigned integers (UINT1 - UINT8): Useful for importing data from Parquet, which also has native unsigned types. Also the right fit for any domain where negative values are nonsensical (counts, IDs, IP ports, flags). No more workaround with larger signed types!
  • UUIDv7: A newcomer from Postgres 18. Unlike UUIDv4, UUIDv7 embeds a timestamp and is monotonically increasing, making it index-friendly and suitable as a primary key without the index fragmentation problems of random UUIDs. Check out our UUID Documentation for more info.
  • Enum types: Columns can now be declared with a fixed set of string values, saving storage and making constraints explicit at the type level. Check out our Enum Documentation for more info.
CREATE TYPE order_status AS ENUM ('pending', 'processing', 'shipped', 'delivered', 'cancelled');

CREATE TABLE orders (
 id UUID PRIMARY KEY DEFAULT uuidv7(),
 status order_status NOT NULL DEFAULT 'pending',
 quantity UINT4 NOT NULL,
 total_cents UINT4 NOT NULL
);

INSERT INTO orders (status, quantity, total_cents) VALUES ('pending', 2, 1999), ('processing', 1, 4999), ('shipped', 5, 2495);
postgres=# select * from orders;
                  id                  |   status   | quantity | total_cents 
--------------------------------------+------------+----------+-------------
 019d0701-cb64-7c96-b90c-ebd4a4410dee | pending    |        2 |        1999
 019d0701-cb65-7c49-8a1a-2e4ea96a85b1 | processing |        1 |        4999
 019d0701-cb65-74ce-a987-b72905eb7f7a | shipped    |        5 |        2495
(3 rows)
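The index-friendliness of UUIDv7 comes from its layout (RFC 9562): the top 48 bits are a Unix millisecond timestamp, so new values sort roughly by creation time instead of scattering across the index. A Python sketch of a v7-style generator, for illustration only (CedarDB's uuidv7() is the real thing):

```python
import os
import time
import uuid

def uuid7_like():
    """Build a UUIDv7-style value: 48-bit ms timestamp, then version,
    variant, and random bits, per the RFC 9562 field layout."""
    ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF          # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 random bits
    value = (ms << 80) | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b
    return uuid.UUID(int=value)

# IDs generated in later milliseconds compare greater, so B-tree inserts
# land near the right edge of the index rather than at random pages.
u = uuid7_like()
```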

PostgreSQL Compatibility: Moving Fast

CedarDB’s PostgreSQL compatibility continues to expand rapidly. Recent months have brought new SQL grammar support, a growing roster of system functions and un-stubbed catalog tables, and bug-for-bug compatibility fixes that allow more existing PostgreSQL tooling to work out of the box.

If a tool or library failed to connect or misbehaved in a previous version, it’s worth trying again - the list of working clients and frameworks is growing steadily. If your favorite tool isn’t working yet, please let us know by creating an issue.


That’s it for now


Questions or feedback? Join us on Slack or reach out directly.

Do you want to try CedarDB straight away? Sign up for our free Enterprise Trial below. No credit card required.

March 19, 2026

Break Paxos

As I mentioned in my previous blog post, I recently got my hands on Claude Code. In the morning, I used it to build a Hybrid Logical Clocks (HLC) visualizer. That evening, I couldn't pull myself away and decided to model something more ambitious. 

I prompted Claude Code to design a Paxos tutorial game, BeatPaxos, where the player tries to "beat" the Paxos algorithm by killing one node at a time and slowing nodes down. Spoiler alert: you cannot violate Paxos safety (agreement). The best you can do is delay the decision by inducing a series of well-timed failures, and that is how you increase your score.

Thinking of the game idea and semantics was my contribution, but the screen design and animations were mostly Claude’s. I do claim credit for the "red, green, blue" theme for the nodes and the colored messages; those worked well for visualization. I also specified that double-clicking a node kills it, and that the player cannot kill another node until they double-click to recover the down one. I also instructed that the player can click and hold a node to slow its replies. These two actions capture the standard distributed consensus fault model; no Byzantine behavior is allowed, as that is a different problem with different algorithms. I include my full prompt at the end of the post.

I was surprised Claude got this mostly right in one shot. Most importantly, it got the safety-critical part of the implementation right on the first try, which is no small feat. I used the Opus model this time because I wanted the extra firepower.
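For reference, the safety-critical piece in single-synod Paxos is the acceptor's promise/accept discipline: never act on a ballot lower than one already promised, and report any previously accepted value back in the p1b. A minimal Python sketch of that logic (my paraphrase, not the game's actual code):

```python
class Acceptor:
    def __init__(self):
        self.promised = -1      # highest ballot promised in a p1b
        self.accepted = None    # (ballot, value) last accepted, if any

    def on_p1a(self, ballot):
        """Phase 1: promise the ballot if it is the highest seen,
        and report any prior accept so the new leader must adopt it."""
        if ballot > self.promised:
            self.promised = ballot
            return ("p1b", ballot, self.accepted)
        return ("nack", self.promised, None)

    def on_p2a(self, ballot, value):
        """Phase 2: accept only if no higher ballot was promised meanwhile."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return ("p2b", ballot)
        return ("nack", self.promised)

a = Acceptor()
a.on_p1a(1)                    # promise ballot 1
a.on_p2a(1, "red")             # accept ("red") at ballot 1
kind, _, prior = a.on_p1a(2)   # a higher ballot must learn about (1, "red")
```

Killing or slowing nodes only delays which ballot finishes Phase 2; the promise rule is what makes the safety invariant unbreakable.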

Here is the game, live, for you to try.




What needed fixing

There were timing issues again. The animation progressed too quickly for the player to follow, so I asked it to slow down. But more serious problems emerged, especially around the leader timeout.

The leader timeout felt wrong. I was seeing dueling leaders even in fault-free runs. After some back and forth, I found a deeper issue: even a single leader could time out on itself, start a new ballot, and end up dueling itself. This was clearly a bug in the timeout logic. After I pointed it out, Claude fixed it, and things were good.

My red, blue, green idea was a good design choice, but Claude did not follow my intent. It colored any message sent by the blue node blue, even when it was a response to a green leader. I reminded Claude that messages should be colored by the ballot leader’s color to clearly show how different ballots interleave, and this was sufficient to fix the issue.


Takeaway

This was a lot of fun, again. I feel like a kid in a candy store. It’s now trivial to build great learning tools with Claude Code. As I mentioned in my previous post, people need hands-on, interactive learning, not static figures. They need to play with algorithms, explore corner cases, and see how they behave.

A LinkedIn follower, after seeing BeatPaxos, prompted his Claude Code to translate it to Raft. The result is BeatRaft: same logic, but with Raft messages. Check it out if that’s more your vibe.



Here is the original prompt as I promised. 

let's design a paxos tutorial game, maybe we name it BeatPaxos, because the player will try to make the Paxos algorithm violate safety, spoiler alert it is not possible to.

There will be three columns, to denote N=3 nodes, let's call them red, green, blue, and we can then use the consistent color theme in the messages they send. The current leader may have a crown image on top of its box.

The single synod Paxos protocol will be implemented: p1a, p1b, p2a, p2b, p3 messages. if there is nothing learned in p1, red would want to propose red, blue would propose blue, and green would propose green. But of course the implementation follows the Paxos protocol.

The player will be given only a set of actions. It can kill one node a time, by double clicking on it. if a node is down, then doubleclicking on another node does not work. but double clicking on the down node recovers it.

The red, green, blue nodes are rounded-corner rectangular boxes in respective color. When a node gets to p2, it gets a crown, denoting it thinks it is the leader. So there can be two nodes with crown, it is possible under Paxos rules. Only one would have the highest ballot number. We can display the ballot number of a leader in its box. When a node has to return to p1, it loses the crown.

for the visualization layout I am thinking of these three lanes, lined up below red, green, blue, it just lists the messages it sent. For red, this could be p1a (ballotnum) to blue, p1a (ballotnum) to green, p2a ("red", ballotnum) to blue, p2a ("red", ballotnum) to green and blue may have p1b (ballotnum, [val]) to red etc.

The player can also click-and-hold on a node, to make that node slow to send its next message. Otherwise, messages scheduled according to Paxos, goes 5sec interval from each other.

If the leader node is down, or the player delays it by click-holding it, then a random timeout may make another node propose its candidacy by sending a p1 message.

The right pane may give the player the rules, and add some explanation about the protocol. And display the safety invariant and its evaluation on the current state. If the player manages to violate the safety invariant, player wins! 

You can program this app on javascript on single page again, so I can deploy on Github Pages.

MariaDB innovation: binlog_storage_engine, 32-core server, Insert Benchmark

MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB).

My previous post had results for sysbench with a small server. This post has results for the Insert Benchmark with a large (32-core) server. Both servers use an SSD that has high fsync latency. This is probably a best-case comparison for the feature. If you really care, then get enterprise SSDs with power loss protection. But you might encounter high fsync latency on public cloud servers.

While throughput improves with the InnoDB doublewrite buffer disabled, I am not suggesting people do that for production workloads without understanding the risks it creates.

tl;dr for a CPU-bound workload

  • throughput for write-heavy steps is larger with the InnoDB doublewrite buffer disabled
  • throughput for write-heavy steps is much larger with the binlog storage engine enabled
  • throughput for write-heavy steps is largest with both the binlog storage engine enabled and the InnoDB doublewrite buffer disabled. In this case it was up to 8.9X larger.
tl;dr for an IO-bound workload
  • see the tl;dr above
  • the best throughput comes from enabling the binlog storage engine and disabling the InnoDB doublewrite buffer, and was up to 3.26X larger.
Builds, configuration and hardware

I compiled MariaDB 12.3.1 from source.

The server has 32-cores and 128G of RAM. Storage is 1 NVMe device with ext-4 and discard enabled. The OS is Ubuntu 24.04. AMD SMT is disabled. The SSD has high fsync latency.

I tried 4 my.cnf files: z12b_sync, z12c_sync, z12b_sync_dw0 and z12c_sync_dw0.
The Benchmark

The benchmark is explained here. It was run with 12 clients for two workloads:
  • CPU-bound - the database is cached by InnoDB, but there is still much write IO
  • IO-bound - most, but not all, benchmark steps are IO-bound
The benchmark steps are:

  • l.i0
    • insert XM rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 10M for CPU-bound and 300M for IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts XM rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. X is 16M for CPU-bound and 4M for IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and YM rows are inserted and deleted per table. Y is 4M for CPU-bound and 1M for IO-bound.
    • Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload. This step runs for 1800 seconds.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: summary

The performance reports are here for CPU-bound and IO-bound.

The summary sections from the performance reports have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA. And from the third table for the IO-bound workload I see that there were failures to meet the SLA for qp500, qr500, qp1000 and qr1000.

I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from the base version.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.
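As a concrete example of the relative QPS arithmetic (the absolute numbers below are made up, not from the reports):

```python
def relative_qps(me, base):
    """Relative QPS per benchmark step: (QPS for $me) / (QPS for $base)."""
    return {step: round(me[step] / base[step], 2) for step in base}

base = {"l.i0": 10000, "l.i1": 4000}   # hypothetical absolute throughput
me   = {"l.i0": 36300, "l.i1": 11200}
relative_qps(me, base)                 # {'l.i0': 3.63, 'l.i1': 2.8}
```

Values greater than 1.0 mean the tested config beat the base config on that step.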

I often use context switch rates as a proxy for mutex contention.

Results: CPU-bound

The summary is here.

Some of the improvements here are huge courtesy of storage with high fsync latency.

Throughput is much better with the binlog storage engine enabled when the InnoDB doublewrite buffer is also enabled. Comparing z12b_sync and z12c_sync (z12c_sync uses the binlog storage engine):
  • throughput for l.i0 (load in PK order) is 3.63X larger for z12c_sync
  • throughput for l.i1 (write-only, larger transactions) is 2.80X larger for z12c_sync
  • throughput for l.i2 (write-only, smaller transactions) is 8.13X larger for z12c_sync
There is a smaller benefit from only disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12b_sync_dw0:
  • throughput for l.i0 (load in PK order) is the same for z12b_sync and z12b_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 1.14X larger for z12b_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 1.93X larger for z12b_sync_dw0
The largest benefits come from using the binlog storage engine and disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12c_sync_dw0:
  • throughput for l.i0 (load in PK order) is 3.61X larger for z12c_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 3.03X larger for z12c_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 8.90X larger for z12c_sync_dw0
The second table from the summary section has been inlined below. That table shows relative throughput which is: (QPS for my config / QPS for z12b_sync).

dbms                                          l.i0  l.x   l.i1  l.i2  qr100  qp100  qr500  qp500  qr1000  qp1000
ma120301_rel_withdbg.cz12b_sync_c32r128       1.00  1.00  1.00  1.00  1.00   1.00   1.00   1.00   1.00    1.00
ma120301_rel_withdbg.cz12c_sync_c32r128       3.63  1.00  2.80  8.13  1.01   1.01   1.02   1.01   1.02    1.02
ma120301_rel_withdbg.cz12b_sync_dw0_c32r128   1.00  1.00  1.14  1.93  1.01   0.99   1.01   1.00   1.01    0.99
ma120301_rel_withdbg.cz12c_sync_dw0_c32r128   3.61  0.86  3.03  8.90  1.01   1.00   1.01   1.00   1.01    1.01

Results: IO-bound

The summary is here.

For the read-write steps the insert SLA was not met for qr500, qp500, qr1000 and qp1000 as those steps needed more IOPs than the storage devices can provide. So I ignore those steps.

Some of the improvements here are huge courtesy of storage with high fsync latency.

Throughput is much better with the binlog storage engine enabled when the InnoDB doublewrite buffer is also enabled. Comparing z12b_sync and z12c_sync (z12c_sync uses the binlog storage engine):
  • throughput for l.i0 (load in PK order) is 3.05X larger for z12c_sync
  • throughput for l.i1 (write-only, larger transactions) is 1.22X larger for z12c_sync
  • throughput for l.i2 (write-only, smaller transactions) is 1.58X larger for z12c_sync
There is a smaller benefit from only disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12b_sync_dw0:
  • throughput for l.i0 (load in PK order) is the same for z12b_sync and z12b_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 2.06X larger for z12b_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 1.59X larger for z12b_sync_dw0
The largest benefits come from using the binlog storage engine and disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12c_sync_dw0:
  • throughput for l.i0 (load in PK order) is 3.01X larger for z12c_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 3.26X larger for z12c_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 2.78X larger for z12c_sync_dw0
The second table from the summary section has been inlined below. That table shows relative throughput which is: (QPS for my config / QPS for z12b_sync).

dbms                                          l.i0  l.x   l.i1  l.i2  qr100  qp100  qr500  qp500  qr1000  qp1000
ma120301_rel_withdbg.cz12b_sync_c32r128       1.00  1.00  1.00  1.00  1.00   1.00   1.00   1.00   1.00    1.00
ma120301_rel_withdbg.cz12c_sync_c32r128       3.05  0.96  1.22  1.58  1.04   1.71   0.98   1.30   0.99    1.23
ma120301_rel_withdbg.cz12b_sync_dw0_c32r128   1.01  0.94  2.06  1.59  1.05   2.92   1.16   1.54   1.02    1.86
ma120301_rel_withdbg.cz12c_sync_dw0_c32r128   3.01  1.03  3.26  2.78  1.08   3.76   1.43   2.87   1.02    2.64

March 18, 2026

MariaDB innovation: binlog_storage_engine, 48-core server, Insert Benchmark

MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB). See this blog post for more details on the new feature.

My previous post had results for sysbench with a small server. This post has results for the Insert Benchmark with a large (48-core) server. Storage on this server has a low fsync latency while the small server has high fsync latency.

tl;dr

  • binlog storage engine makes some things better without making other things worse
  • binlog storage engine doesn't make all write-heavy steps faster because the commit path isn't the bottleneck in all cases on a server with storage that has low fsync latency

tl;dr for a CPU-bound workload

  • the l.i0 step (load in PK order) is ~1.3X faster with binlog storage engine
  • the l.i2 step (write-only with smaller transactions) is ~1.5X faster with binlog storage engine
tl;dr for an IO-bound workload
  • the l.i0 step (load in PK order) is ~1.08X faster with binlog storage engine
Builds, configuration and hardware

I compiled MariaDB 12.3.1 from source.

The server has 48-cores and 128G of RAM. Storage is 2 NVMe device with ext-4, discard enabled and RAID. The OS is Ubuntu 22.04. AMD SMT is disabled. The SSD has low fsync latency.

I tried 4 my.cnf files:
  • z12b_sync
  • z12c_sync
    • my.cnf.cz12c_sync_c32r128 (z12c_sync) is like cz12c except it enables sync-on-commit for InnoDB. Note that InnoDB is used to store the binlog so there is nothing else to sync on commit.
  • z12b_sync_dw0
  • z12c_sync_dw0
The Benchmark

The benchmark is explained here. It was run with 20 clients for two workloads:
  • CPU-bound - the database is cached by InnoDB, but there is still much write IO
  • IO-bound - most, but not all, benchmark steps are IO-bound
The benchmark steps are:

  • l.i0
    • insert XM rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 10M for CPU-bound and 200M for IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts XM rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. X is 40M for CPU-bound and 4M for IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and YM rows are inserted and deleted per table. Y is 10M for CPU-bound and 1M for IO-bound.
    • Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload. This step runs for 3600 seconds.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: summary

The performance reports are here for CPU-bound and IO-bound.

The summary sections from the performance reports have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA. And from the third table for the IO-bound workload I see that there were failures to meet the SLA for qp500, qr500, qp1000 and qr1000.

I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from the base version.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.

I often use context switch rates as a proxy for mutex contention.

Results: CPU-bound

The summary is here.
  • Enabling the InnoDB doublewrite buffer doesn't improve performance.
With and without the InnoDB doublewrite buffer enabled, enabling the binlog storage engine improves throughput a lot for two of the write-heavy steps while there are only small changes on the other two write-heavy steps:
  • l.i0, load in PK order, gets ~1.3X more throughput
    • when the binlog storage engine is enabled (see here)
      • storage writes per insert (wpi) are reduced by about 1/2
      • KB written to storage per insert (wkbpi) is a bit smaller
      • context switches per insert (cspq) are reduced by about 1/3
  • l.x, create secondary indexes, is unchanged
    • when the binlog storage engine is enabled (see here)
      • storage writes per insert (wpi) are reduced by about 4/5
      • KB written to storage per insert (wkbpi) are reduced almost in half
      • context switches per insert (cspq) are reduced by about 1/4
  • l.i1, write-only with larger transactions, is unchanged
  • l.i2, write-only with smaller transactions, gets ~1.5X more throughput
The second table from the summary section has been inlined below. That table shows relative throughput which is: (QPS for my config / QPS for z12b_sync)

dbms                                          l.i0  l.x   l.i1  l.i2  qr100  qp100  qr500  qp500  qr1000  qp1000
ma120301_rel_withdbg.cz12b_sync_c32r128       1.00  1.00  1.00  1.00  1.00   1.00   1.00   1.00   1.00    1.00
ma120301_rel_withdbg.cz12c_sync_c32r128       1.32  1.02  0.99  1.52  1.01   1.02   1.01   1.02   1.01    1.01
ma120301_rel_withdbg.cz12b_sync_dw0_c32r128   1.00  0.94  1.00  1.03  1.03   1.02   1.03   1.02   1.03    1.02
ma120301_rel_withdbg.cz12c_sync_dw0_c32r128   1.31  1.04  1.00  1.55  1.01   1.02   1.02   1.02   1.02    1.02

Results: IO-bound

The summary is here.
  • For the read-write steps the insert SLA was not met for qr500, qp500, qr1000 and qp1000 as those steps needed more IOPs than the storage devices can provide.
  • Enabling the InnoDB doublewrite buffer improves throughput by ~1.25X on the l.i2 step (write-only with smaller transactions) but doesn't change performance on the other steps.
    • as expected there is a large reduction in KB written to storage (see wkbpi here)
  • Enabling the binlog storage engine improves throughput by 9% and 8% on the l.i0 step (load in PK order) but doesn't have a significant impact on other steps.
    • with the binlog storage engine there is a large reduction in storage writes per insert (wpi), a small reduction in KB written to storage per insert (wkbpi) and small increases in CPU per insert (cpupq) and context switches per insert (cspq) -- see here
The second table from the summary section has been inlined below. That table shows relative throughput which is: (QPS for my config / QPS for z12b_sync)

dbms                                          l.i0  l.x   l.i1  l.i2  qr100  qp100  qr500  qp500  qr1000  qp1000
ma120301_rel_withdbg.cz12b_sync_c32r128       1.00  1.00  1.00  1.00  1.00   1.00   1.00   1.00   1.00    1.00
ma120301_rel_withdbg.cz12c_sync_c32r128       1.09  1.01  0.99  1.01  0.99   0.99   0.97   0.97   1.00    0.98
ma120301_rel_withdbg.cz12b_sync_dw0_c32r128   1.01  1.01  1.01  1.25  1.01   1.04   0.80   1.31   0.94    0.90
ma120301_rel_withdbg.cz12c_sync_dw0_c32r128   1.08  1.01  1.00  1.26  0.99   1.04   0.68   1.31   0.93    0.90

Rate Limiting Strategies with Valkey/Redis

Rate limiting is one of those topics that looks simple until you’re actually doing it in production. Implement a counter with the INCR command and a TTL and away you go. But when you ask questions like “what happens at the boundary?”, “should I use a Valkey/Redis cluster?”, or “why are we getting twice the […]
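The naive counter the excerpt alludes to (INCR plus a TTL) is easy to sketch; a small in-memory stand-in for the two Redis commands keeps the sketch self-contained and makes the window-boundary problem visible:

```python
class FakeRedis:
    """In-memory stand-in for INCR/EXPIRE so the sketch runs without a server."""
    def __init__(self):
        self.store = {}   # key -> [count, expiry_timestamp]

    def incr(self, key, now):
        entry = self.store.get(key)
        if entry is None or entry[1] <= now:
            entry = self.store[key] = [0, float("inf")]
        entry[0] += 1
        return entry[0]

    def expire(self, key, ttl, now):
        self.store[key][1] = now + ttl

def allow(r, user, limit, window, now):
    # Fixed window: key is the current window number; INCR, then set the
    # TTL on the first hit so the counter resets when the window rolls over.
    key = f"rl:{user}:{int(now // window)}"
    count = r.incr(key, now)
    if count == 1:
        r.expire(key, window, now)
    return count <= limit

r = FakeRedis()
# Boundary problem: with limit=5/min, 5 requests at t=59s and 5 more at
# t=61s land in different windows, so all 10 pass within ~2 seconds.
burst = [allow(r, "u1", 5, 60, t) for t in [59] * 5 + [61] * 5]
```

That doubled burst at the window edge is exactly why sliding-window and token-bucket variants exist.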

Claude Code experiment: Visualizing Hybrid Logical Clocks

Yesterday morning I downloaded Claude Code, and wanted to see what this bad boy can do. What better way to learn how this works than coding up a toy example with it. The first thing that occurred to me was to build a visualizer for Hybrid Logical Clocks (HLC).

HLC is a simple idea we proposed in 2014: combine physical time with a logical counter to get timestamps that are close to real time but still safe under clock skew. With HLC, you get the best of both worlds: real-time affinity augmented with causality when you need it. Since then HLC has been adopted by many distributed databases, including MongoDB, CockroachDB, Amazon Aurora DSQL, YugabyteDB, etc.
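The HLC update rules are compact enough to show inline. A simplified Python sketch of the send/receive logic from the 2014 paper (illustrative, not any database's implementation):

```python
class HLC:
    """Hybrid Logical Clock: (l, c) where l tracks the max physical time
    seen so far and c is a logical counter that breaks ties."""
    def __init__(self):
        self.l, self.c = 0, 0

    def now(self, physical):
        """Local or send event: take the max of wall clock and last l;
        bump the counter when physical time has not advanced."""
        if physical > self.l:
            self.l, self.c = physical, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def recv(self, physical, msg):
        """Receive event: merge with the message timestamp so the result
        dominates both clocks, preserving causality under clock skew."""
        ml, mc = msg
        l = max(self.l, ml, physical)
        if l == self.l == ml:
            c = max(self.c, mc) + 1
        elif l == self.l:
            c = self.c + 1
        elif l == ml:
            c = mc + 1
        else:
            c = 0
        self.l, self.c = l, c
        return (l, c)

a, b = HLC(), HLC()
t = a.now(10)        # a sends at physical time 10
b_ts = b.recv(5, t)  # b's wall clock is behind, but causality wins
```

Note that b's timestamp jumps to the sender's physical component with the counter bumped, which is what keeps HLC timestamps close to real time yet causally consistent.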

This felt well scoped for a first project with Claude Code. Choosing JavaScript let me host this on GitHub Pages for free. Easy peasy way of sharing something small yet useful with people.

Claude Code is a clever idea. It is essentially an agent wrapped around the Claude LLM. Chat works well for general Q/A, but it falls short for coding. In Claude Code, the agent wraps the LLM with a UNIX terminal abstraction, and voila, we are set up for coding. Simple and effective. It is the right interface. The terminal reduces everything to small modular, composable commands. Terminal commands were the original microservices! You compose these well-defined tools with pipes to build larger workflows. Everything is text, which plays to the strengths of LLMs. Add git for versioning, and you get a tight development loop.

The process went smoothly (very smoothly, in fact) despite this being my first time using Claude Code. I described to Claude what I wanted to build. Since this was my first time, I gave a very brief description and didn't expect it to jump into action based on that minimal description. (More on this at the end of the post.)

Claude created the hlc-visualizer directory, produced the first version of index.html, and opened it in Chrome to show me the demo. I was impressed by the speed. Wow. It almost got there on the first shot. The liberties it took with the layout were smart. It used a vertical time diagram and three nodes at the top. Both made it to the final version.

The initial visualization used buttons for send and local events. These sat on the right pane, away from the action, and were not easy to use. I prompted Claude to switch to double click on a node timeline for local events, and drag and drop between timelines for send events. This felt more natural, but I was not sure Claude could pull it off. It did, on the first try.

For the record, I used Sonnet. Since this was my first experiment with Claude, I did not want to use Opus, the more expensive model. Maybe Opus would have produced a better first version. I do not know.

Agents do not seem good with timing. The simulation ran too fast for human interaction. I kept tuning it to a tolerable speed. The screen also did not follow the timeline as new events extended beyond the view. See the commit log for how this evolved. I was not very efficient because this was my first time using Claude.

I think my biggest contribution to this collaboration was to notice that we need a snapshot feature. That's the killer app for HLC. So I explained to Claude how the snapshot should be taken, and after a couple of iterations, it worked. After that, I focused on improving the interaction and visuals.

Here is the end product. Try it here. Feedback is welcome in comments. 


All in all, this was delightful. I wish I had this when I was teaching. It would help create visualizations for algorithms quickly. Students need hands-on interactive learning, not static figures. They need to play with the algorithms, explore corner cases, and see how the algorithms behave. I used TLA+ for teaching my distributed systems class, but visualizations like this are the real deal. I will do my usual plug for Spectacle, a browser-based TLA+ trace explorer and visualizer. But even with the manual animation mode, I think it would be hard to code this time-diagram visual and the snapshots there.

A final note on personality. Claude has high energy. It is a go-getter. It skips small talk, like a seasoned developer. It does not ramble like ChatGPT. Gemini Pro comes across as sound, but it sounds too uptight and uncreative when writing prose. Claude Code feels smart and sharp when coding.

OSTEP Chapter 10: MultiProcessor Scheduling

This chapter from Operating Systems: Three Easy Pieces explores multiprocessor scheduling as we transition from the simpler world of single-CPU systems to the challenges of modern multicore architectures.

This is part of our series going through OSTEP book chapters. The OSTEP textbook is freely available at Remzi's website if you'd like to follow along.


Core Challenges in Multiprocessor Scheduling

The shift to multiple CPUs introduces several hardware challenges that the operating system must manage:

  • Cache Coherence: Hardware caches improve performance by storing frequently used data. In multiprocessor systems, if one CPU modifies data in its local cache without updating main memory immediately, other CPUs may read "stale" (incorrect) data.
  • Synchronization: Accessing shared data structures across multiple CPUs requires mutual exclusion (e.g., locks). Without these, concurrent operations can lead to data corruption, such as double frees in a linked list.
  • Cache Affinity: Processes run faster when they stay on the same CPU because they can reuse state already built up in that CPU's local cache. Frequent migration across CPUs forces the system to reload this state, degrading performance.

Yep. Now we are talking about some distributed systems concepts, even inside a single computer.


Scheduling Strategies

The chapter compares two primary architectural approaches to scheduling:

  1. Single-Queue Multiprocessor Scheduling (SQMS): All jobs are placed into a single global queue. This is simple to implement and inherently balances the load across all available CPUs, but it does not scale well due to lock contention on the single queue and often ignores cache affinity.
  2. Multi-Queue Multiprocessor Scheduling (MQMS): The system maintains multiple queues, typically one per CPU. This is highly scalable and naturally preserves cache affinity since jobs stay in their assigned queue, but is vulnerable to load imbalance, where one CPU may become idle while another is overloaded.

To address load imbalances in MQMS, systems use "work stealing", where an under-utilized CPU peeks at another queue and steals jobs to balance the workload.

Modern Linux schedulers have utilized both approaches:

  • O(1) Scheduler: Multi-queue.
  • Completely Fair Scheduler (CFS): Multi-queue.
  • BF Scheduler (BFS): Single-queue.

March 17, 2026

Measuring Agents in Production

When you are in the TPOT echo chamber, you would think fully autonomous AI agents are running the world. But this December 2025 paper, "Measuring Agents in Production" (MAP), cuts through the hype. It surveys 306 practitioners and conducts 20 in-depth case studies across 26 domains to document what is actually running in live environments. The reality is far more basic, constrained, and human-dependent than TPOT suggests.


The Most Surprising Findings

Simplicity and Bounded Autonomy: 80% of case studies use predefined structured workflows rather than open-ended autonomous planning, and 68% execute fewer than 10 steps before requiring human intervention. Frankly, these systems sound to me less like autonomous agents than glorified state machines or multi-step RAG pipelines.

Prompting Beats Fine-Tuning: Despite the academic obsession with reinforcement learning and fine-tuning, 70% of teams building production agents simply prompt off-the-shelf proprietary models. Custom-tuned models are often too brittle, and they break when foundation model providers update their models.

Tolerance for Latency: While in database systems and distributed systems we focus on shaving milliseconds and microseconds off response times, in the agent world 66% of deployed systems take minutes or even longer to respond. I am not comparing or criticizing because of the intrinsically different nature of the work, I am just stating how vastly different the latency expectations are.

Custom Infrastructure Over Heavy Frameworks: Though many developers experiment with frameworks like LangChain, 85% of the detailed production case studies ended up building their systems completely in-house using direct API calls. Teams actively migrate away from heavy frameworks to reduce dependency bloat and maintain the flexibility to integrate with their own proprietary enterprise infrastructure.

Benchmarks are Abandoned: 75% of production teams skip formal benchmarking entirely. Because real-world tasks are incredibly messy and domain-specific, teams rely instead on A/B testing, production monitoring, and human-in-the-loop evaluation (which a massive 74% of systems use as their primary check for correctness).

Reliability (consistent correct behavior over time) remains the primary bottleneck and challenge. OK, this one was not really a surprising finding. 



The paper says agents in production deliver tangible value: 80% of practitioners explicitly deploy them for productivity gains, and 72% use them to drastically reduce human task-hours. This would have been a great place to be concrete and dive deeper into a couple of these cases, because companies have a strong incentive to exaggerate their agent use. The closest the paper comes to this is in the appendix, Table 3.


Discussion

So the data says that the state of multi-agent systems in production is exaggerated. Everyone says they are doing it, but only a few actually are. And those who are doing it are keeping it basic.

This feels familiar.

Remember 2018? IBM published a whitepaper stating that "7 in 10 consumer industry executives expect to have a blockchain production network in 3 years". They famously claimed blockchains would cure almost every business ailment, reducing 9 distinct frictions including "inaccessible marketplaces", "restrictive regulations", "institutional inertia", "invisible threats", and "imperfect information". Ha, "invisible threats", it cracks me up every time!


Of course, Pepperidge Farm remembers the massive 2018 hype about Walmart tracking lettuce on the blockchain to pinpoint E. coli contamination events. We were promised a decentralized revolution, but we only got shitcoins.

But comparing AI agents to blockchains is unfair. Agents already have a couple of killer applications, and they have made it into deployment, albeit in a very basic and constrained manner. It's just that they aren't the fully autonomous, hyper-intelligent multi-agent swarms that people claim they are. They remain basic, human-supervised, highly constrained tools.