a curated list of database news from authoritative sources

February 16, 2026

MariaDB innovation: binlog_storage_engine

MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB).
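
For reference, a minimal my.cnf sketch along these lines should enable it; the option name comes from the post, but the exact value syntax is an assumption, so check the MariaDB 12.3 documentation:

[mysqld]
# assumption: the option takes the engine name; the post states the binlog is stored in InnoDB
binlog_storage_engine = InnoDB
# with InnoDB as the only resource manager, sync-on-commit needs just this setting
innodb_flush_log_at_trx_commit = 1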

In this post I have results for the performance benefit from this when using storage that has a high fsync latency. This is probably a best-case comparison for the feature. A future post will cover the benefit on servers that don't have high fsync latency.

tl;dr

  • the performance benefit from this is excellent when storage has a high fsync latency
  • my mental performance model needs to be improved. I guessed that throughput would increase by ~2X when using binlog_storage_engine relative to not using it while running with sync_binlog=1 and innodb_flush_log_at_trx_commit=1. However, the improvement is larger than 4X.

Builds, configuration and hardware

I compiled MariaDB 12.3.0 from source.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here. The storage device has high fsync latency.

I used 4 my.cnf files:
  • z12b
    • my.cnf.cz12b_c8r32 is my default configuration. Sync-on-commit is disabled for both the binlog and InnoDB so that write-heavy benchmarks create more stress.
  • z12c
    • like cz12b except it enables binlog_storage_engine so the binlog is stored in InnoDB
  • z12b_sync
    • like cz12b except it enables sync-on-commit for both the binlog and InnoDB
  • z12c_sync
    • my.cnf.cz12c_sync_c8r32 is like cz12c except it enables sync-on-commit for InnoDB. Note that InnoDB is used to store the binlog so there is nothing else to sync on commit.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks 
and most test only 1 type of SQL statement. Benchmarks are run with the database cached by MariaDB.

The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.

The benchmark is run with 1 client, 1 table and 50M rows. 

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation.  

But here I only report results for the write-heavy tests.

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for base version)
When the relative QPS is > 1 then some version is faster than base version.  When it is < 1 then there might be a regression. 

I present results for:
  • z12b, z12c, z12b_sync and z12c_sync with z12b as the base version
  •  z12b_sync and z12c_sync with z12b_sync as the base version
Results: z12b, z12c, z12b_sync, z12c_sync

Summary:
  • z12b_sync has the worst performance thanks to 2 fsyncs per commit
  • z12c_sync gets more than 4X the throughput vs z12b_sync. If fsync latency were the only thing that determined performance then I would expect the difference to be ~2X. There is more going on here and in the next section I mention that enabling binlog_storage_engine also reduces the CPU overhead.
  • some per-test data from iostat and vmstat is here
  • a representative sample of iostat collected at 1-second intervals during the update-inlist test is here. When comparing z12b_sync with z12c_sync
    • the fsync rate (f/s) is ~2.5X larger for z12c_sync vs z12b_sync (~690/s vs ~275/s) but fsync latency (f_await) is similar. So with binlog_storage_engine enabled MariaDB is more efficient -- perhaps thanks to a lower CPU overhead there is less work to do in between calls to fsync
Relative to: z12b
col-1 : z12c
col-2 : z12b_sync
col-3 : z12c_sync

col-1   col-2   col-3
1.06    0.01    0.05    delete
1.05    0.01    0.05    insert
1.01    0.12    0.47    read-write_range=100
1.01    0.10    0.44    read-write_range=10
1.03    0.01    0.11    update-index
1.02    0.02    0.12    update-inlist
1.05    0.01    0.06    update-nonindex
1.05    0.01    0.06    update-one
1.05    0.01    0.06    update-zipf
1.01    0.03    0.20    write-only

Results: z12b_sync, z12c_sync

Summary:
  • z12c_sync gets more than 4X the throughput vs z12b_sync. If fsync latency were the only thing that determined performance then I would expect the difference to be ~2X. There is more going on here and below I mention that enabling binlog_storage_engine also reduces the CPU overhead.
  • some per-test data from iostat and vmstat is here and the CPU overhead per operation is much smaller with binlog_storage_engine -- see here for the update-inlist test. In general, when sync-on-commit is enabled then the CPU overhead with binlog_storage_engine enabled is between 1/3 and 2/3 of the overhead without it enabled.
Relative to: z12b_sync
col-1 : z12c_sync

col-1
6.40    delete
5.64    insert
4.06    read-write_range=100
4.40    read-write_range=10
7.64    update-index
7.17    update-inlist
5.73    update-nonindex
5.82    update-one
5.80    update-zipf
6.61    write-only

Relational composition and Codd's "connection trap" in PostgreSQL and MongoDB

Relational composition is to joins what the cartesian product is to tables: it produces every result that could be true, not just what is true. This is a common source of SQL mistakes, and it can often be suspected when a SELECT DISTINCT is added after a query starts returning more rows than expected, without the root cause being understood.

In its mathematical definition, relational composition is the derived relation obtained by existentially joining two relations on a shared attribute and projecting away that attribute. In a database, it is meaningful only when a real‑world invariant ensures that the resulting pairs reflect actual facts. Otherwise, the result illustrates what E. F. Codd, in his 1970 paper A Relational Model of Data for Large Shared Data Banks, called the connection trap.
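
In set-builder notation, the composition of R(a, b) and S(b, c) over the shared attribute b is:

{ (a, c) | there exists a b such that (a, b) ∈ R and (b, c) ∈ S }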

Codd uses two relations in his example: a supplier supplies parts, and a project uses parts. At an intuitive level, this connection trap mirrors a syllogism: if a supplier supplies a part and a project uses that part, a join can derive that the supplier supplies the project—even when that conclusion was never stated as a fact.

Codd observed that the connection trap was common in pre‑relational network data models, where users navigated data by following physical pointers. Path existence was often mistaken for semantic relationship. The relational model solved this problem by replacing navigational access with declarative queries over explicitly defined relations, and modern document models now do the same.

However, while the relational model removes pointer‑based navigation, it does not eliminate the trap entirely. Joins can still compute relational compositions, and without appropriate cardinality constraints or business invariants, such compositions may represent only possible relationships rather than actual ones. In this way, the connection trap can be reintroduced at query time, even in modern relational systems such as PostgreSQL, and similarly through $lookup operations in MongoDB.

PostgreSQL — reproducing the connection trap

This model declares suppliers, parts, projects, and two independent many‑to‑many relationships:

CREATE TABLE suppliers (
    supplier_id TEXT PRIMARY KEY
);

CREATE TABLE parts (
    part_id TEXT PRIMARY KEY
);

CREATE TABLE projects (
    project_id TEXT PRIMARY KEY
);

-- Supplier supplies parts
CREATE TABLE supplier_part (
    supplier_id TEXT REFERENCES suppliers,
    part_id     TEXT REFERENCES parts,
    PRIMARY KEY (supplier_id, part_id)
);

-- Project uses parts
CREATE TABLE project_part (
    project_id TEXT REFERENCES projects,
    part_id    TEXT REFERENCES parts,
    PRIMARY KEY (project_id, part_id)
);

This follows Codd’s classic suppliers–parts–projects example, where suppliers supply parts and projects use parts as independent relationships.

The following data asserts that project Alpha uses parts P1 and P2, that supplier S1 supplies parts P1 and P2, and that supplier S2 supplies parts P2 and P3:

INSERT INTO suppliers VALUES ('S1'), ('S2');
INSERT INTO parts     VALUES ('P1'), ('P2'), ('P3');
INSERT INTO projects  VALUES ('Alpha');

-- Supplier capabilities
INSERT INTO supplier_part VALUES
('S1', 'P1'),
('S1', 'P2'),
('S2', 'P2'),
('S2', 'P3');

-- Project uses parts P1 and P2
INSERT INTO project_part VALUES
('Alpha', 'P1'),
('Alpha', 'P2');

The following query is valid SQL:

SELECT DISTINCT
    sp.supplier_id,
    pp.project_id
FROM supplier_part sp
JOIN project_part pp
  ON sp.part_id = pp.part_id;

However, this query falls into the connection trap:

 supplier_id | project_id
-------------+------------
 S2          | Alpha
 S1          | Alpha
(2 rows)

As we defined only supplier–part and project–part relationships, any derived supplier–project relationship is not a fact but a relational composition. We know that Alpha uses P1 and P2, and that part P2 can be supplied by either S1 or S2, but we have no record of which supplier actually supplies Alpha.

This query asserts “Supplier S1 supplies project Alpha”, but the data only says: “S1 and S2 supply P2” and “Alpha uses P2”.

This is the connection trap, expressed purely in SQL.

PostgreSQL — the correct relational solution

If a supplier actually supplies a part to a project, that fact must be represented directly. We need a new table:

CREATE TABLE supply (

    supplier_id TEXT,
    project_id  TEXT,
    part_id     TEXT,

    PRIMARY KEY (supplier_id, project_id, part_id),

    FOREIGN KEY (supplier_id, part_id)
        REFERENCES supplier_part (supplier_id, part_id),

    FOREIGN KEY (project_id, part_id)
        REFERENCES project_part (project_id, part_id)
);

These foreign keys encode subset constraints between relations and prevent inserting supplies of parts not supplied by the supplier or not used by the project.
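
For example, with these constraints in place an insert that pairs a supplier with a part it does not supply should be rejected (a sketch; the exact constraint name in the error depends on PostgreSQL's default naming):

-- S2 does not supply P1, so this violates the foreign key to supplier_part
INSERT INTO supply VALUES ('S2', 'Alpha', 'P1');
-- ERROR:  insert or update on table "supply" violates foreign key constraint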

This relation explicitly states who supplies what to which project. We assume that the real‑world fact is “Alpha gets part P2 from supplier S1”:

INSERT INTO supply VALUES
('S1', 'Alpha', 'P2');

The correct query reads from this relation:

SELECT supplier_id, project_id
FROM supply;

 supplier_id | project_id
-------------+------------
 S1          | Alpha
(1 row)

The relationship is now real and asserted, not inferred. In total, we have six tables:

postgres=# \d
             List of relations
 Schema |     Name      | Type  |  Owner
--------+---------------+-------+----------
 public | parts         | table | postgres
 public | project_part  | table | postgres
 public | projects      | table | postgres
 public | supplier_part | table | postgres
 public | suppliers     | table | postgres
 public | supply        | table | postgres
(6 rows)

In practice, you should either store the relationship explicitly or avoid claiming it exists. Although the relational model avoids pointers, it is still possible to join through an incorrect path, so the application must enforce the correct one.

In ad-hoc query environments such as data warehouses, data is typically organized into domains and modeled using a dimensional ("star schema") approach. Relationships like project–supplier are represented as fact tables within a single data mart, exposing only semantically valid join paths and preventing invalid joins.

MongoDB — reproducing the connection trap

The following MongoDB data mirrors the PostgreSQL example. MongoDB allows representing relationships either as separate collections or by embedding, depending on the bounded context. Here we start with separate collections to mirror the relational model:

db.suppliers.insertMany([
  { _id: "S1" },
  { _id: "S2" }
]);

db.parts.insertMany([
  { _id: "P1" },
  { _id: "P2" },
  { _id: "P3" }
]);

db.projects.insertMany([
  { _id: "Alpha" }
]);

// Supplier capabilities
db.supplier_parts.insertMany([
  { supplier: "S1", part: "P1" },
  { supplier: "S1", part: "P2" },
  { supplier: "S2", part: "P2" },
  { supplier: "S2", part: "P3" }
]);

// Project uses parts P1 and P2
db.project_parts.insertMany([
  { project: "Alpha", part: "P1" },
  { project: "Alpha", part: "P2" }
]);

Using the simple find() API, we cannot fall into the trap directly because there is no implicit connection between suppliers and projects. The application must issue two independent queries and combine the results explicitly.

Simulating the connection trap in a single query therefore requires explicit composition at the application level:

const partsUsedByAlpha = db.project_parts.find(
  { project: "Alpha" },
  { _id: 0, part: 1 }
).toArray();

const suppliersForParts = db.supplier_parts.find(
  { part: { $in: partsUsedByAlpha.map(p => p.part) } },
  { _id: 0, supplier: 1, part: 1 }
).toArray();

const supplierProjectPairs = suppliersForParts.map(sp => ({
  supplier: sp.supplier,
  project: "Alpha"
}));

print(supplierProjectPairs);

When forced by the application logic, here is the connection trap associating suppliers and projects:

[
  { supplier: 'S1', project: 'Alpha' },
  { supplier: 'S1', project: 'Alpha' },
  { supplier: 'S2', project: 'Alpha' }
]

As with SQL joins, a $lookup in an aggregation pipeline can fall into the same connection trap:

db.supplier_parts.aggregate([
  {
    $lookup: {
      from: "project_parts",
      localField: "part",
      foreignField: "part",
      as: "projects"
    }
  },
  { $unwind: "$projects" },
  {
    $project: {
      _id: 0,
      supplier: "$supplier",
      project: "$projects.project"
    }
  }
]);

The result is similar, and the projection removes the intermediate attributes:

{ "supplier": "S1", "project": "Alpha" }
{ "supplier": "S1", "project": "Alpha" }
{ "supplier": "S2", "project": "Alpha" }

We reproduced the connection trap by ignoring that $lookup produces a derived relationship, not a real one, and that matching keys does not carry business meaning.

MongoDB — normalized solution

As with SQL, we can add an explicit supplies collection that stores the relationship between projects and suppliers:

db.supplies.insertOne({
  project: "Alpha",
  supplier: "S1",
  part: "P2"
});

Then we simply query this collection:

db.supplies.find(
  { project: "Alpha" },
  { _id: 0, supplier: 1, part: 1 }
);

[ { supplier: 'S1', part: 'P2' } ]

The document model is a superset of the relational model as relations can be stored as flat collections. The difference is that referential integrity is enforced by the application rather than in-database. To enforce relationships in the database, they must be embedded as sub-documents and arrays.

MongoDB — domain-driven solution

It's not the only solution in a document database, as we can store a schema based on the domain model rather than normalized. MongoDB allows representing this relationship as part of an aggregate. In a project‑centric bounded context, the project is the aggregate root, and the supplier information can be embedded as part of the supply fact:

db.projects.updateOne(
  { _id: "Alpha" },
  {
    $set: {
      parts: [
        { part: "P2", supplier: "S1" },
        { part: "P1", supplier: null }
      ]
    }
  },
  { upsert: true }
);

The query doesn't need a join and cannot fall into the connection trap:

db.projects.find(
  { _id: "Alpha" },
  { _id: 1, "parts.supplier": 1 }
);

[
  {
    _id: 'Alpha',
    parts: [ 
      { supplier: 'S1' },
      { supplier: null }
    ]
  }
]

This avoids the connection trap by construction. It may look like data duplication—the same supplier name may appear in multiple project documents—and indeed this would be undesirable in a fully normalized model shared across all business domains. However, this structure represents a valid aggregate within a bounded context.

In this context, the embedded supplier information is part of the supply fact, not a reference to a global supplier record. If a supplier’s name changes, it is a business decision, not a database decision, whether that change should be propagated to existing projects or whether historical data should retain the supplier name as it was at the time of supply.

Even when propagation is desired, MongoDB allows updating embedded data efficiently:

db.projects.updateMany(
  // document filter
  { "parts.supplier": "S1" },
  // document update using the array's item from array filter
  {
    $set: {
      "parts.$[p].supplier": "Supplier One"
    }
  },
  // array filter defining the array's item for the update
  {
    arrayFilters: [{ "p.supplier": "S1" }]
  }
);

This update is not atomic across documents, but each document update is atomic and the operation is idempotent and can be safely retried or executed within an explicit transaction if full atomicity is required.

Conclusion

The connection trap occurs whenever relationships are inferred from shared keys, at query time, instead of being explicitly represented as facts, at write time. In SQL, avoiding it means introducing explicit association tables and enforcing integrity constraints, rather than deriving relationships through joins. In MongoDB, it means modeling relationships as explicit documents or embedded subdocuments rather than deriving them through lookups.

In a relational database, the schema is designed to be normalized and independent of specific use cases. All many‑to‑many and fact‑bearing relationships must be declared explicitly, and queries must follow the correct relational path. Referential and cardinality constraints are essential to restrict to only actual facts.

In MongoDB, the data model is typically driven by the domain and the application’s use cases. In a domain-driven design (DDD), strong relationships are modeled as aggregates, embedding related data directly within a document in MongoDB collections. This makes the intended semantics explicit and avoids inferring relationships at query time. Apparent duplication is not a flaw here, but a deliberate modeling choice within a bounded context.

Ultimately, the connection trap is not fully avoided by the data model, but can be a query-time error with joins and projections: deriving relationships that were never asserted. Whether using normalized relations or domain‑driven documents, the rule is the same—if a relationship is a fact, it must be stored as one.

February 15, 2026

HammerDB tproc-c on a large server, Postgres and MySQL

This has results for HammerDB tproc-c on a large server using MySQL and Postgres. I am new to HammerDB and still figuring out how to explain and present results so I will keep this simple and just share graphs without explaining the results.

The comparison might favor Postgres for the IO-bound workloads because I used smaller buffer pools than normal to avoid OOM. I had to do this because RSS for the HammerDB client grows over time as it buffers more response time stats. And while I used buffered IO for Postgres, I used O_DIRECT for InnoDB. So Postgres might have avoided some read IO thanks to the OS page cache while InnoDB did not.

tl;dr for MySQL

  • With vu=40 MySQL 8.4.8 uses about 2X more CPU per transaction and does more than 2X more context switches per transaction compared to Postgres 18.1. I will get CPU profiles soon.
  • Modern MySQL brings us great improvements to concurrency and too many new CPU overheads
    • MySQL 5.6 and 8.4 have similar throughput at the lowest concurrency (vu=10)
    • MySQL 8.4 is a lot faster than 5.6 at the highest concurrency (vu=40)

tl;dr for Postgres

  • Modern Postgres has regressions relative to old Postgres
  • The regressions increase with the warehouse count: at wh=4000 the NOPM drops between 3% and 13% depending on the virtual user count (vu).

tl;dr for Postgres vs MySQL
  • Postgres and MySQL have similar throughput for the largest warehouse count (wh=4000)
  • Otherwise Postgres gets between 1.4X and 2X more throughput (NOPM)

Builds, configuration and hardware

I compiled Postgres versions from source: 12.22, 13.23, 14.20, 15.15, 16.11, 17.7 and 18.1.

I compiled MySQL versions from source: 5.6.51, 5.7.44, 8.0.45, 8.4.8, 9.4.0 and 9.6.0.

I used a 48-core server from Hetzner
  • an ax162s with an AMD EPYC 9454P 48-Core Processor with SMT disabled
  • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
  • 128G RAM
  • Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)
Postgres configuration files:
  • prior to v18 the config file is named conf.diff.cx10a50g_c32r128 (x10a_c32r128) and is here for versions 12, 13, 14, 15, 16 and 17.
  • for Postgres 18 I used conf.diff.cx10b_c32r128 (x10b_c32r128) with io_method=sync to be similar to the config used for versions 12 through 17.
MySQL configuration files
  • prior to 9.6 the config file is named my.cnf.cz12a50g_c32r128 (z12a50g_c32r128 or z12a50g) and is here for versions 5.6, 5.7, 8.0 and 8.4
  • for 9.6 it is named my.cnf.cz13a50g_c32r128 (z13a50g_c32r128 or z13a50g) and is here
For both Postgres and MySQL fsync on commit is disabled to avoid turning this into an fsync benchmark. The server has 2 SSDs with SW RAID and low fsync latency.

Benchmark

The benchmark is tproc-c from HammerDB. The tproc-c benchmark is derived from TPC-C.

The benchmark was run for several workloads:
  • vu=10, wh=1000 - 10 virtual users, 1000 warehouses
  • vu=20, wh=1000 - 20 virtual users, 1000 warehouses
  • vu=40, wh=1000 - 40 virtual users, 1000 warehouses
  • vu=10, wh=2000 - 10 virtual users, 2000 warehouses
  • vu=20, wh=2000 - 20 virtual users, 2000 warehouses
  • vu=40, wh=2000 - 40 virtual users, 2000 warehouses
  • vu=10, wh=4000 - 10 virtual users, 4000 warehouses
  • vu=20, wh=4000 - 20 virtual users, 4000 warehouses
  • vu=40, wh=4000 - 40 virtual users, 4000 warehouses
The wh=1000 workloads are less heavy on IO. The wh=4000 workloads are more heavy on IO.

The benchmark for Postgres is run by a variant of this script which depends on scripts here. The MySQL scripts are similar.
  • stored procedures are enabled
  • partitioning is used because the warehouse count is >= 1000
  • a 5 minute rampup is used
  • then performance is measured for 60 minutes
Basic metrics: iostat

I am still improving my helper scripts to report various performance metrics. The table here has average values from iostat during the benchmark run phase for MySQL 8.4.8 and Postgres 18.1. For these configurations the NOPM values for Postgres and MySQL were similar so I won't present normalized values (average value / NOPM) and NOPM is throughput.
  • average wMB/s increases with the warehouse count for Postgres but not for MySQL
  • r/s increases with the warehouse count for Postgres and MySQL
iostat metrics
* r/s = average rate of reads/s from storage
* wMB/s = average MB/s written to storage

my8408
r/s     wMB/s
22833.0 906.2   vu=40, wh=1000
63079.8 1428.5  vu=40, wh=2000
82282.3 1398.2  vu=40, wh=4000

pg181
r/s     wMB/s
30394.9 1261.9  vu=40, wh=1000
59770.4 1267.8  vu=40, wh=2000
78052.3 1272.9  vu=40, wh=4000

Basic metrics: vmstat

I am still improving my helper scripts to report various performance metrics. The table here has average values from vmstat during the benchmark run phase for MySQL 8.4.8 and Postgres 18.1. For these configurations the NOPM values for Postgres and MySQL were similar so I won't present normalized values (average value / NOPM).
  • CPU utilization is almost 2X larger for MySQL
  • Context switch rates are more than 2X larger for MySQL
  • In the future I hope to learn why MySQL uses almost 2X more CPU per transaction and has more than 2X more context switches per transaction relative to Postgres
vmstat metrics
* cs - average value for cs (context switches/s)
* us - average value for us (user CPU)
* sy - average value for sy (system CPU)
* id - average value for id (idle)
* wa - average value for wa (waiting for IO)
* us+sy - sum of us and sy

my8408
cs      us      sy      id      wa      us+sy
455648  61.9    8.2     24.2    5.7     70.1    vu=40, wh=1000
484955  50.4    9.2     19.5    21.0    59.6    vu=40, wh=2000
487410  39.5    8.4     19.4    32.6    48.0    vu=40, wh=4000

pg181
cs      us      sy      id      wa      us+sy
127486  23.5    10.1    63.3    3.0     33.6    vu=40, wh=1000
166257  17.2    11.1    62.5    9.1     28.3    vu=40, wh=2000
203578  13.9    11.3    59.2    15.6    25.2    vu=40, wh=4000

Results

My analysis at this point is simple -- I only consider average throughput. Eventually I will examine throughput over time and efficiency (CPU and IO).

On the charts that follow the y-axis does not start at 0 to improve readability at the risk of overstating the differences. The y-axis shows relative throughput. There might be a regression when the relative throughput is less than 1.0. There might be an improvement when it is > 1.0. The relative throughput is:
(NOPM for some-version / NOPM for base-version)

I provide three charts below:

  • only MySQL - base-version is MySQL 5.6.51
  • only Postgres - base-version is Postgres 12.22
  • Postgres vs MySQL - base-version is Postgres 18.1, some-version is MySQL 8.4.8
Results: MySQL 5.6 to 9.6

Legend:

  • my5651.z12a is MySQL 5.6.51 with the z12a50g config
  • my5744.z12a is MySQL 5.7.44 with the z12a50g config
  • my8045.z12a is MySQL 8.0.45 with the z12a50g config
  • my8408.z12a is MySQL 8.4.8 with the z12a50g config
  • my9600.z13a is MySQL 9.6.0 with the z13a50g config

Summary

  • At the lowest concurrency (vu=10) MySQL 8.4.8 has similar throughput as 5.6.51 because CPU regressions in modern MySQL offset the concurrency improvements.
  • At the highest concurrency (vu=40) MySQL 8.4.8 is much faster than 5.6.51 and the regressions after 5.7 are small. This matches what I have seen elsewhere -- while modern MySQL suffers from CPU regressions it benefits from concurrency improvements. Imagine if we could get those concurrency improvements without the CPU regressions.

And the absolute NOPM values are here:

my5651  my5744  my8045  my8408  my9600
163059  183268  156039  155194  151748  vu=10, wh=1000
210506  321670  283282  281038  279269  vu=20, wh=1000
216677  454743  439589  435095  433618  vu=40, wh=1000
107492  130229  111798  110161  108386  vu=10, wh=2000
155398  225068  193658  190717  189847  vu=20, wh=2000
178278  302723  297236  307504  293217  vu=40, wh=2000
81242   103406  89414   89316   88458   vu=10, wh=4000
131241  179112  155134  152998  152301  vu=20, wh=4000
146809  228554  234922  229511  230557  vu=40, wh=4000

Results: Postgres 12 to 18

Legend:

  • pg1222 is Postgres 12.22 with the x10a50g config
  • pg1323 is Postgres 13.23 with the x10a50g config
  • pg1420 is Postgres 14.20 with the x10a50g config
  • pg1515 is Postgres 15.15 with the x10a50g config
  • pg1611 is Postgres 16.11 with the x10a50g config
  • pg177 is Postgres 17.7 with the x10a50g config
  • pg181 is Postgres 18.1 with the x10b50g config

Summary

  • Modern Postgres has regressions relative to old Postgres
  • The regressions increase with the warehouse count: at wh=4000 the NOPM drops between 3% and 13% depending on the virtual user count (vu).


The relative NOPM values are here:

pg1222  pg1323  pg1420  pg1515  pg1611  pg177   pg181
1.000   1.000   1.054   1.042   1.004   1.010   0.968   vu=10, wh=1000
1.000   1.035   1.037   1.028   1.028   1.001   0.997   vu=20, wh=1000
1.000   1.040   0.988   1.000   1.027   0.998   0.970   vu=40, wh=1000
1.000   1.026   1.059   1.075   1.068   1.081   1.029   vu=10, wh=2000
1.000   1.022   1.046   1.043   0.979   0.972   0.934   vu=20, wh=2000
1.000   1.014   1.032   1.036   0.979   1.010   0.947   vu=40, wh=2000
1.000   1.027   1.032   1.035   0.993   0.998   0.974   vu=10, wh=4000
1.000   1.005   1.049   1.048   0.940   0.927   0.876   vu=20, wh=4000
1.000   0.991   1.019   0.983   1.001   0.979   0.937   vu=40, wh=4000

The absolute NOPM values are here:

pg1222  pg1323  pg1420  pg1515  pg1611  pg177   pg181
353077  353048  372015  367933  354513  356469  341688  vu=10, wh=1000
423565  438456  439398  435454  435288  423986  422397  vu=20, wh=1000
445114  462851  439728  445144  457110  444364  431648  vu=40, wh=1000
223048  228914  236231  239868  238117  241185  229549  vu=10, wh=2000
314380  321380  328688  328044  307728  305452  293627  vu=20, wh=2000
320347  324769  330444  331896  313553  323454  303403  vu=40, wh=2000
162054  166461  167320  167761  160962  161716  157872  vu=10, wh=4000
244598  245804  256593  256231  230037  226844  214309  vu=20, wh=4000
252931  250634  257820  248584  253059  247610  236986  vu=40, wh=4000

Results: MySQL vs Postgres

Legend:

  • pg181 is Postgres 18.1 with the x10b50g config
  • my8408 is MySQL 8.4.8 with the z12a50g config

Summary

  • Postgres and MySQL have similar throughput for the largest warehouse count (wh=4000)
  • Otherwise Postgres gets between 1.4X and 2X more throughput (NOPM)
The absolute NOPM values are here:

pg181   my8408
341688  155194  vu=10, wh=1000
422397  281038  vu=20, wh=1000
431648  435095  vu=40, wh=1000
229549  110161  vu=10, wh=2000
293627  190717  vu=20, wh=2000
303403  307504  vu=40, wh=2000
157872  89316   vu=10, wh=4000
214309  152998  vu=20, wh=4000
by Mark Callaghan

Butlers or Architects?

In a recent viral post, Matt Shumer declares dramatically that we've crossed an irreversible threshold. He asserts that the latest AI models now exercise independent judgment, that he simply gives an AI plain-English instructions, steps away for a few hours, and returns to a flawlessly finished product that surpasses his own capabilities. In the near future, he claims, AI will autonomously handle all knowledge work and even build the next generation of AI itself, leaving human creators completely blindsided by the exponential curve.

This was a depressing read. The dramatic tone lands well. And by extrapolating from progress in the last six years, it's hard to argue against what AI might achieve in the next six.

I forwarded this to a friend of mine, who had the misfortune of reading it before bed. He told me he had a nightmare about it, dreaming of himself as an Uber driver, completely displaced from his high-tech career.


Someone on Twitter had a comeback: "The thing I don't get is: Claude Code is writing 100% of Claude's code now. But Anthropic has 100+ open dev positions on their jobs page?" Boris Cherny of Anthropic replied: "The reality is that someone has to prompt the Claudes, talk to customers, coordinate with other teams, and decide what to build next. Engineering is changing, and great engineers are more important than ever."

This is strongly reminiscent of the Shell Game podcast I wrote about recently. And it connects to my arguments in "Agentic AI and The Mythical Agent-Month" about the mathematical laws of scaling coordination. Throwing thousands of AI agents at a project does not magically bypass Brooks' Law. Agents can dramatically scale the volume of code generated, but they do not scale insight. Coordination complexity and verification bottlenecks remain firmly in place. Until you solve the epistemic gap of distributed knowledge, adding more agents simply produces a faster, more expensive way to generate merge conflicts. Design, at its core, is still very human.

Trung Phan's recent piece on how Docusign still employs 7,000 people in the age of AI provides useful context as well. Complex organizations don't dissolve overnight. Societal constructs, institutional inertia, regulatory frameworks, and the deeply human texture of business relationships all act as buffers. The world changes slower than the benchmarks suggest.


So we are nowhere near a fully autonomous AI that sweeps up all knowledge work and solves everything. When we step back, two ways of reading the situation come into view.

The first is that we are all becoming butlers for LLMs: priming the model, feeding it context in careful portions, adding constraints, nudging tone, coaxing the trajectory. Then stepping back to watch it cook. We do the setup and it does the real work.

But as a perennial optimist, I think we are becoming architects. Deep work will not disappear, rather it will become the only work that matters. We get to design the blueprint, break down logic in high-level parts, set the vision, dictate strategy, and chart trajectory. We do the real thinking, and then we make the model grind.

Either way, this shift brings a real danger. If we delegate execution, it becomes tempting to gradually delegate thought as well. LLMs make thinking feel optional. People were already reluctant to think; now they can bypass it entirely. It is unsettling to watch a statistical prediction machine stand in for reasoning. Humbling, too. Maybe we're not as special as we assumed.

This reminds me of Ted Chiang's story "Catching Crumbs from the Table" where humanity is reduced to interpreting the outputs of a vastly superior intellect. Human scientists no longer produce breakthroughs themselves; they spend their careers reverse-engineering discoveries made by "metahumans". The tragedy is that humans are no longer the source of the insight; they are merely trying to explain the metahumans' genius. The title captures the feeling really well. We're not at the table anymore. We're just gathering what falls from it.

Even if things come to that, I know I'll keep thinking, keep learning, keep striving to build things. As I reflected in an earlier post on finding one's true calling, this pursuit of knowledge and creation is my dharma. That basic human drive to understand things and build things is not something an LLM can automate away. This I believe.



I recently launched a free email newsletter for the blog. Subscribe here to get these essays delivered to your inbox, along with behind-the-scenes commentary and curated links on distributed systems, technology, and other curiosities. 

February 14, 2026

HammerDB tproc-c on a small server, Postgres and MySQL

This has results for HammerDB tproc-c on a small server using MySQL and Postgres. I am new to HammerDB and still figuring out how to explain and present results so I will keep this simple and just share graphs without explaining the results.

tl;dr

  • Modern Postgres is faster than old Postgres
  • Modern MySQL has large perf regressions relative to old MySQL, and they are worst at low concurrency for CPU-bound workloads. This is similar to what I see on other benchmarks.
  • Modern Postgres is about 2X faster than MySQL at low concurrency (vu=1) and when the workload isn't IO-bound (w=100). But with some concurrency (vu=6) or with more IO per transaction (w=1000, w=2000) they have similar throughput. Note that partitioning is used at w=1000 and 2000 but not at w=100.

Builds, configuration and hardware

I compiled Postgres versions from source: 12.22, 13.23, 14.20, 15.15, 16.11, 17.7 and 18.1.

I compiled MySQL versions from source: 5.6.51, 5.7.44, 8.0.44, 8.4.7, 9.4.0 and 9.5.0.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here.

For versions prior to 18, the config file is named conf.diff.cx10a_c8r32; the configs are as similar as possible and are here for versions 12, 13, 14, 15, 16 and 17.

For Postgres 18 the config file is named conf.diff.cx10b_c8r32 and adds io_method='sync' which matches behavior in earlier Postgres versions.

For MySQL the config files are named my.cnf.cz12a_c8r32 and are here: 5.6.51, 5.7.44, 8.0.4x, 8.4.x, 9.x.0.

For both Postgres and MySQL fsync on commit is disabled to avoid turning this into an fsync benchmark. The server has an SSD with high fsync latency.

Benchmark

The benchmark is tproc-c from HammerDB. The tproc-c benchmark is derived from TPC-C.

The benchmark was run for several workloads:
  • vu=1, w=100 - 1 virtual user, 100 warehouses
  • vu=6, w=100 - 6 virtual users, 100 warehouses
  • vu=1, w=1000 - 1 virtual user, 1000 warehouses
  • vu=6, w=1000 - 6 virtual users, 1000 warehouses
  • vu=1, w=2000 - 1 virtual user, 2000 warehouses
  • vu=6, w=2000 - 6 virtual users, 2000 warehouses
The w=100 workloads are less heavy on IO. The w=1000 and w=2000 workloads are more heavy on IO.

The benchmark for Postgres is run by this script which depends on scripts here. The MySQL scripts are similar.
  • stored procedures are enabled
  • partitioning is used when the warehouse count is >= 1000
  • a 5 minute rampup is used
  • then performance is measured for 120 minutes
Results

My analysis at this point is simple -- I only consider average throughput. Eventually I will examine throughput over time and efficiency (CPU and IO).

On the charts that follow the y-axis does not start at 0 to improve readability at the risk of overstating the differences. The y-axis shows relative throughput. There might be a regression when the relative throughput is less than 1.0. There might be an improvement when it is > 1.0. The relative throughput is:
(NOPM for some-version / NOPM for base-version)

I provide three charts below:

  • only MySQL - base-version is MySQL 5.6.51
  • only Postgres - base-version is Postgres 12.22
  • Postgres vs MySQL - base-version is Postgres 18.1, some-version is MySQL 8.4.7

Results: MySQL 5.6 to 8.4

Legend:

  • my5651.z12a is MySQL 5.6.51 with the z12a_c8r32 config
  • my5744.z12a is MySQL 5.7.44 with the z12a_c8r32 config
  • my8044.z12a is MySQL 8.0.44 with the z12a_c8r32 config
  • my847.z12a is MySQL 8.4.7 with the z12a_c8r32 config
  • my9400.z12a is MySQL 9.4.0 with the z12a_c8r32 config
  • my9500.z12a is MySQL 9.5.0 with the z12a_c8r32 config

Summary

  • Perf regressions in MySQL 8.4 are smaller with vu=6 and wh >= 1000 -- the cases where there is more concurrency (vu=6) and the workload does more IO per transaction (wh=1000 & 2000). Note that partitioning is used at w=1000 and 2000 but not at w=100.
  • Perf regressions in MySQL 8.4 are larger with vu=1 and even more so with wh=100 (low concurrency, less IO per transaction).
  • Performance has mostly been dropping from MySQL 5.6 to 8.4. From other benchmarks the problem is from new CPU overheads at low concurrency.
  • While perf regressions in modern MySQL at high concurrency have been less of a problem on other benchmarks, this server is too small to support high concurrency.

Results: Postgres 12 to 18

Legend:

  • pg1222.x10a is Postgres 12.22 with the x10a_c8r32 config
  • pg1323.x10a is Postgres 13.23 with the x10a_c8r32 config
  • pg1420.x10a is Postgres 14.20 with the x10a_c8r32 config
  • pg1515.x10a is Postgres 15.15 with the x10a_c8r32 config
  • pg1611.x10a is Postgres 16.11 with the x10a_c8r32 config
  • pg177.x10a is Postgres 17.7 with the x10a_c8r32 config
  • pg181.x10b is Postgres 18.1 with the x10b_c8r32 config

Summary

  • Modern Postgres is faster than old Postgres



Results: MySQL vs Postgres

Legend:

  • pg181.x10b is Postgres 18.1 with the x10b_c8r32 config
  • my847.z12a is MySQL 8.4.7 with the z12a_c8r32 config

Summary

  • MySQL and Postgres have similar throughput for vu=6 at w=1000 and 2000. Note that partitioning is used at w=1000 and 2000 but not at w=100.
  • Otherwise Postgres is 2X faster than MySQL

Cross join in MongoDB

Relational database joins are, conceptually, a cartesian product followed by a filter (the join condition). Without that condition, you get a cross join that returns every possible combination. In MongoDB, you can model the same behavior at read time using $lookup, or at write time by embedding documents.
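
As a point of comparison, the relational version needs no join condition at all (a sketch, assuming equivalent fits and sizes tables):

SELECT f.code || '-' || s.code AS variant
FROM fits f
CROSS JOIN sizes s;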

Example

Define two collections: one for clothing sizes and one for gender-specific fits:

db.sizes.insertMany([  
  { code: "XS", neckCm: { min: 31, max: 33 } },  
  { code: "S",  neckCm: { min: 34, max: 36 } },  
  { code: "M",  neckCm: { min: 37, max: 39 } },  
  { code: "L",  neckCm: { min: 40, max: 42 } },  
  { code: "XL", neckCm: { min: 43, max: 46 } }  
]);

db.fits.insertMany([
  {
    code: "MEN",
    description: "Straight cut, broader shoulders, narrower hips"
  },
  {
    code: "WOMEN",
    description: "Tapered waist, narrower shoulders, wider hips"
  }
]);

Each collection stores independent characteristics, and every size applies to every fit. The goal is to generate all valid product variants.

Cross join on read: $lookup + $unwind

To add all sizes to each body shape, use a $lookup without a filter condition; because it adds the matches as an embedded array, use $unwind to get one document per combination:

db.sizes.aggregate([
  {
    $lookup: {
      from: "fits",
      pipeline: [],
      as: "fit"
    }
  },
  { $unwind: "$fit" },
  { $sort: { "fit.code": 1, code: 1 } },
  {
    $project: {
      _id: 0,
      code: { $concat: ["$fit.code", "-", "$code"] }
    }
  }
]);

Here is the result:
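
(expected output, reconstructed from the pipeline above rather than captured from a live run)

[
  { code: 'MEN-L' },
  { code: 'MEN-M' },
  { code: 'MEN-S' },
  { code: 'MEN-XL' },
  { code: 'MEN-XS' },
  { code: 'WOMEN-L' },
  { code: 'WOMEN-M' },
  { code: 'WOMEN-S' },
  { code: 'WOMEN-XL' },
  { code: 'WOMEN-XS' }
]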

Application-side

For such small static reference collections, the application may simply read both and join with loops:

const sizes = db.sizes.find({}, { code: 1, _id: 0 }).sort({ code: 1 }).toArray();
const fits  = db.fits.find({},  { code: 1, _id: 0 }).sort({ code: 1 }).toArray();

for (const fit of fits) {
  for (const size of sizes) {
    print(`${fit.code}-${size.code}`);
  }
}

While it's good to keep the reference in a database, such static data can stay in cache in the application.

Cross join on write: embed the many-to-many

Because sizes are inherently tied to body shapes (no size exists without a body shape), embedding them in the fits documents is often a better model:

db.fits.aggregate([
  {
    $lookup: {
      from: "sizes",
      pipeline: [
        { $project: { _id: 0, code: 1, neckCm:1 } },
        { $sort: { code: 1 } }
      ],
      as: "sizes"
    }
  },
  {
    $merge: {
      into: "fits",
      on: "_id",
      whenMatched: "merge",
      whenNotMatched: "discard"
    }
  }
]);

Here is the new shape of the single collection:
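
(expected document shape, reconstructed rather than captured; the _id values are the auto-generated ObjectIds and the WOMEN document gets the same sizes array)

{
  _id: ObjectId('...'),
  code: 'MEN',
  description: 'Straight cut, broader shoulders, narrower hips',
  sizes: [
    { code: 'L',  neckCm: { min: 40, max: 42 } },
    { code: 'M',  neckCm: { min: 37, max: 39 } },
    { code: 'S',  neckCm: { min: 34, max: 36 } },
    { code: 'XL', neckCm: { min: 43, max: 46 } },
    { code: 'XS', neckCm: { min: 31, max: 33 } }
  ]
}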

Once embedded, the query becomes straightforward, simply unwind the embedded array:

db.fits.aggregate([
  { $unwind: "$sizes" },
  {
    $project: {
      _id: 0,
      code: {
        $concat: ["$code", "-", "$sizes.code"]
      }
    }
  }
]);

You may embed only the fields required, like the size code, or all fields like I did here with the neck size, and then remove the size collection:

db.sizes.drop()

Although this may duplicate the values for each body shape, it only requires using updateMany() instead of updateOne() when updating it. For example, the following updates one size:

db.fits.updateMany(  
  {},  
  { $set: { "sizes.$[s].neckCm": { min: 38, max: 40 } } },  
  {  
    arrayFilters: [  
      { "s.code": "M" }  
    ]  
  }  
); 

Duplication has the advantage of returning all required information in a single read, without joins or multiple queries, and it is not problematic for updates since it can be handled with a single bulk update operation. Unlike relational databases—where data can be modified through ad‑hoc SQL and business rules must therefore be enforced at the database level—MongoDB applications are typically domain‑driven, with clear ownership of data and a single responsibility for performing updates.

In that context, consistency is maintained by the application's service rather than by cross‑table constraints. This approach also lets business rules evolve, such as defining different sizes for men and women, without changing the data model.

Conclusion

In a fully normalized relational model, all relationships use the same pattern: a one-to-many relationship between two tables, enforced by a primary (or unique) key on one side and a foreign key on the other. This holds regardless of cardinality (many can be three or one million), lifecycle rules (cascade deletes or updates), ownership (shared or exclusive parent), navigation direction, and access patterns. Even many-to-many relationships are just two one-to-many relationships via a junction table.

MongoDB exposes these same concepts as modeling choices—handled at read time with $lookup, at write time through embedding, or in the application—instead of enforcing a single normalized representation. The choice depends on the domain data and access patterns.

February 13, 2026

Supabase incident on February 12, 2026

A detailed account of the February 12 outage in us-east-2, what caused it, and the steps we are taking to prevent it from happening again.

February 12, 2026

Achieve near-zero downtime database maintenance by using blue/green deployments with AWS JDBC Driver

In this post we introduce the blue/green deployment plugin for the AWS JDBC Driver, a built-in plugin that automatically handles connection routing, traffic management, and switchover detection during blue/green deployment switchovers. We show you how to configure and use the plugin to minimize downtime during database maintenance operations that use blue/green deployment switchovers.

Do You Think I Am a Goldfish?

Academic writing has long been criticized for its formulaic nature. As I wrote about earlier, research papers are unfortunately written to please 3 specific expert reviewers who are overwhelmingly from academia. Given this twisted incentive structure (looking impressive for peer-review), the papers end up becoming formulaic, defensive, and often impenetrable.

Ironically, this very uniformity makes it trivially easy for LLMs to replicate academic writing. It is easy to spot LLM use in personal essays, but I dare you to do it successfully in academic writing. 

Aside: Ok, I baited myself with my own dare. In general, it is very hard to detect LLM usage at the paragraph level in a research paper. But LLM usage in research papers becomes obvious when you see the same definition repeated 3-4 times across consecutive pages.  The memoryless nature of LLMs causes them to recycle the same terms and phrases, and I find myself thinking "you already explained this to me four times, do you think I am a goldfish?" I have been reviewing a lot of papers recently, and this is the number one tell-tale sign. A careful read by the authors would clean this up easily, making LLM usage nearly undetectable. To be clear, I am talking about LLM assistance in polishing writing, not wholesale generation. A paper with no original ideas is a different beast entirely. They are vacuous and easy to spot. 

Anyway, as LLM use becomes ubiquitous, conference/journal reviewing is facing a big crisis. There are simply too many articles being submitted, as it is easy to generate text and rush half-baked ideas into the presses. I am, of course, unhappy about this. Writing that feels effortless because an LLM smooths every step deprives you of the strain that produces "actual understanding". That strain in writing is not a defect; it creates the very impetus for discovering what you actually think, rather than faking/imitating thought.

But here we are. We are at an inflection point in academic publishing.  I recently came across this post, which documents an experiment where an LLM replicated and extended a published empirical political science paper with near-human fidelity, at a fraction of the time and cost.

I have been predicting the collapse of the publishing system for a decade. The flood of LLM-aided research might finally break its back. And here is where I want to take you in this post. I want to imagine how academic writing may change in this new publishing regime. Call it a 5-10 year outlook, because in this day and age, who can predict anything beyond that.

I claim that costly signals of genuine intelligence will become the currency of survival in this new environment.

Costly signals work because they are expensive to fake, like a peacock’s tail or an elk’s antlers. And I claim academic writing will increasingly demand features that are expensive to fake. Therefore, a distinctive voice becomes more valuable precisely because it cannot be generated without genuine intellectual engagement. Personal narratives, peculiar perspectives, unexpected conceptual leaps, and field-specific cultural fluency are things that require deep immersion and creative investment that LLMs lack. These are the costly signals that will make a paper worth publishing. 

Literature reviews are cheap to automate, so they will shrink -- as we are already seeing. But reviews with distinctive voice and genuine insight, ones that reflect on the author's own learning and thought process, will survive. Work that builds creative frameworks and surprising connections, which are expensive to produce, will flourish. When anyone can generate competent prose, only writing that screams "a specific human spent serious time thinking about this" will cut through.

So, LLMs may accidentally force academia toward what it always claimed to value: original thinking and clear communication. The costliest signal of all is having something genuinely new to say, and saying it well. I am an optimist, as you can easily tell, if you are a long time reader of this blog.

“Simplicity and elegance are unpopular because they require hard work and discipline to achieve and education to be appreciated.”

-- Edsger W. Dijkstra

February 11, 2026

Migrate relational-style data from NoSQL to Amazon Aurora DSQL

In this post, we demonstrate how to efficiently migrate relational-style data from NoSQL to Aurora DSQL, using Kiro CLI as our generative AI tool to optimize schema design and streamline the migration process.

Prisma + MongoDB “Hello World”

Prisma is an ORM (Object-Relational Mapper). With MongoDB, it acts as an Object Document Mapper, mapping collections to TypeScript models and providing a consistent, type-safe query API.

MongoDB is a document database with a flexible schema. Prisma does not provide schema migrations for MongoDB, but it supports nested documents and embedded types to take advantage of MongoDB’s data locality.
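
For example, embedded documents can be declared as composite types directly in the Prisma schema. A minimal sketch with hypothetical Product and Photo types (not used in the walkthrough below):

model Product {
  id     String  @id @default(auto()) @map("_id") @db.ObjectId
  name   String
  photos Photo[]
}

type Photo {
  height Int
  width  Int
  url    String
}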

This article walks through a minimal “Hello World” setup on a Docker environment:

  • Run MongoDB as a replica set
  • Connect to it using Prisma
  • Insert a "Hello World" document
  • Read and display all documents

Start MongoDB as a replica set

Prisma requires MongoDB to run as a replica set. While MongoDB supports many operations without transactions, Prisma relies on MongoDB sessions and transactional behavior internally, which are only available on replica sets.

Start MongoDB in a Docker container with replica set support enabled:

docker run --name mg -d mongo --replSet rs0

Initialize the replica set (a single‑node replica set is sufficient for local development and testing):

docker exec -it mg mongosh --eval "rs.initiate()"

Start a Node.js container

Start a Node.js container that can access MongoDB using the hostname mongo:

docker run --rm -it --link mg:mongo node bash

Prepare the Node.js environment

Update the package manager, install an editor, update npm, disable funding messages, and move to the working directory:

apt-get update
apt-get install -y vim
npm install -g npm@11.9.0
npm config set fund false
cd /home

Install Prisma Client and enable ES modules

Install Prisma Client and enable ES modules by adding "type": "module" to package.json:

npm install @prisma/client@6.19.0
sed -i '1s/{/{\n  "type": "module",/' package.json

Using ES modules enables standard import syntax and aligns the project with modern Node.js tooling.

Install Prisma CLI and TypeScript tooling

Install the Prisma CLI and supporting tooling, and generate the initial Prisma configuration:

npm install -D prisma@6.19.0 @types/node 
npm install -D tsx
npx prisma init

Configure the Prisma schema

Edit prisma/schema.prisma, change the provider from postgresql to mongodb, and define a minimal Message model:

generator client {
  provider = "prisma-client"
  output   = "../generated/prisma"
}

datasource db {
  provider = "mongodb"
  url      = env("DATABASE_URL")
}

model Message {
  id        String   @id @default(auto()) @map("_id") @db.ObjectId
  content   String
  createdAt DateTime @default(now())
}

Prisma maps MongoDB’s _id field to a String backed by an ObjectId.
The prisma-client generator produces TypeScript output in a custom directory to avoid using @prisma/client.

Configure the database connection

Define the MongoDB connection string in .env:

DATABASE_URL="mongodb://mongo:27017/test"

Prisma reads DATABASE_URL at generation time, while the application reads it at runtime. Importing dotenv/config ensures both environments are consistent.

Generate the Prisma client

Generate the Prisma client from the schema:

npx prisma generate

This produces TypeScript client files in generated/prisma.

Write the “Hello World” program

Create prisma/index.ts:

import 'dotenv/config'
import { PrismaClient } from '../generated/prisma/client.ts'

const prisma = new PrismaClient()

async function main() {
  await prisma.$connect()
  console.log('Connected to MongoDB')

  await prisma.message.create({
    data: {
      content: 'Hello World',
    },
  })

  const messages = await prisma.message.findMany()

  console.log('Messages in database:')
  for (const message of messages) {
    console.log(`- ${message.content} at ${message.createdAt}`)
  }
}

main()
  .catch(console.error)
  .finally(() => prisma.$disconnect())

This program connects to MongoDB, inserts a “Hello World” document, and prints all stored messages.

Run the program

For running TypeScript directly in modern Node.js projects, tsx is generally preferred over ts-node due to better ESM support and faster startup.

Execute the TypeScript file:

npx tsx prisma/index.ts

Output:

Connected to MongoDB
Messages in database:
- Hello World at Wed Feb 11 2026 17:36:08 GMT+0000 (Coordinated Universal Time)

Conclusion and final note on schemas in MongoDB

This example shows a minimal Prisma + MongoDB setup:

  • MongoDB running as a replica set
  • Prisma configured for MongoDB
  • A single model with one insert and one read

From here, you can add schema evolution, indexes, and more complex queries while keeping the same core configuration.

MongoDB is often called schemaless, but that’s misleading in practice, as we started to declare the database schema in schema.prisma and generate the client for it. Real‑world MongoDB applications are schema‑driven, with structure defined in the application layer through models, validation rules, and access patterns.

Unlike relational databases—where the schema is enforced in the database and then mapped into the application—MongoDB uses the same document structure across all layers: in‑memory cache, on‑disk storage, and application models. This preserves data locality, avoids ORM overhead and migration scripts, and simplifies the development.

Prisma makes this explicit by defining the schema in code, providing type safety and consistency while keeping MongoDB’s document model flexible as your application evolves.

OSTEP Chapter 8

The crux of this chapter is how to schedule tasks without perfect knowledge. If you remember from the previous chapter, the core tension in CPU scheduling is these two conflicting goals:

  • Minimizing Turnaround Time: Usually achieved by running shorter jobs first (SJF).
  • Minimizing Response Time: Usually achieved by Round Robin scheduling (RR). Essential for interactive users.

Unfortunately, the OS does not have a crystal ball. It doesn't know if a process is a short interactive job or a massive number-crunching batch job. The Multi-Level Feedback Queue (MLFQ) solves this by encoding/capturing information from the history of the job, and assumes that if a job has been CPU-intensive in the past, it likely will be in the future. As we'll see below, it also gives a chance for jobs to redeem themselves through the boosting process.

I really enjoyed this chapter. MLFQ, invented by Corbato in 1962, is a brilliant scheduling algorithm. This elegant solution served as the base scheduler for many systems, including BSD UNIX derivatives, Solaris, and Windows NT and subsequent Windows operating systems.

(This is part of our series going through OSTEP book chapters.)


How MLFQ Works: The Basic Rules

The chapter constructs the MLFQ algorithm iteratively, starting with a basic structure involving distinct queues, each with a different priority level.

  • Rule 1: If Priority(A) > Priority(B), A runs.
  • Rule 2: If Priority(A) = Priority(B), they run in Round-Robin.

But how does a job get its priority?

  • Rule 3: New jobs start at the highest priority.
  • Rule 4 (Initial Version): If a job uses up its time allotment, it moves down a queue. If it gives up the CPU (e.g., for I/O) before the time is up, it stays at the same priority.

This setup cleverly approximates Shortest Job First. Because the scheduler assumes every new job is short (giving it high priority), true short jobs finish quickly. Long jobs eventually exhaust their time slices and sink to the bottom queues, where they run only when the system isn't busy with interactive tasks.


Patching the initial MLFQ rules

However, this basic version has fatal flaws.

  • If too many interactive jobs flood the system, low-priority background jobs might starve.
  • A clever user could rewrite a program to yield the CPU (say through I/O: writing to a dummy file) just before its time slice ends. This resets the job's allotment, allowing it to monopolize the CPU at the highest priority.
  • A job that starts CPU-intensive but becomes interactive later (like a compiler finishing and waiting for input) would be stuck at the bottom priority.

To fix these issues, the chapter introduces two crucial modifications.

The Priority Boost: To prevent low-priority jobs from starving, the scheduler employs Rule 5: After a set time period (S), all jobs are moved back to the topmost queue. This "boost" ensures that CPU-bound jobs get at least some processing time and allows jobs that have become interactive to return to a high-priority state.

Better Accounting: To stop users from gaming the system, Rule 4 is rewritten to change how time is tracked. Rule 4 (revised): Instead of resetting the allotment every time a job yields the CPU, the scheduler tracks the total time a job uses at a given priority level. Once that allotment is used up, regardless of how many times the job yielded the CPU, the job is demoted.
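
To make the rules concrete, here is a small, simplified scheduler sketch in TypeScript. It is not from the book and ignores I/O, blocking, and real preemption, but it applies Rules 1 through 5, including the revised per-level allotment accounting and the periodic boost; the queue count, allotments, and boost period are made-up values.

// mlfq-sketch.ts -- a toy MLFQ simulation, not a real scheduler.
// Each loop iteration is one CPU tick; time slice and allotment are
// collapsed into a single per-level allotment for simplicity.
interface Job {
  name: string;
  remaining: number;   // CPU ticks still needed
  usedAtLevel: number; // time used at the current priority (revised Rule 4)
}

const NUM_QUEUES = 3;
const ALLOTMENT = [4, 8, 16]; // allotment per level, highest priority first
const BOOST_PERIOD = 50;      // Rule 5: boost interval S, in ticks

const queues: Job[][] = Array.from({ length: NUM_QUEUES }, () => []);

function addJob(name: string, length: number) {
  // Rule 3: new jobs enter at the highest priority.
  queues[0].push({ name, remaining: length, usedAtLevel: 0 });
}

function schedule(totalTicks: number) {
  for (let tick = 0; tick < totalTicks; tick++) {
    // Rule 5: periodically move every job back to the top queue.
    if (tick > 0 && tick % BOOST_PERIOD === 0) {
      for (let lvl = 1; lvl < NUM_QUEUES; lvl++) {
        for (const j of queues[lvl]) j.usedAtLevel = 0;
        queues[0].push(...queues[lvl].splice(0));
      }
    }

    // Rules 1 and 2: run the round-robin head of the highest non-empty queue.
    const lvl = queues.findIndex((q) => q.length > 0);
    if (lvl === -1) continue; // idle tick
    const job = queues[lvl].shift()!;

    job.remaining--;
    job.usedAtLevel++;

    if (job.remaining === 0) {
      console.log(`${job.name} finished at tick ${tick}`);
    } else if (job.usedAtLevel >= ALLOTMENT[lvl]) {
      // Revised Rule 4: once the allotment at this level is used up,
      // demote the job no matter how often it yielded.
      job.usedAtLevel = 0;
      queues[Math.min(lvl + 1, NUM_QUEUES - 1)].push(job);
    } else {
      queues[lvl].push(job); // stay at this level, back of the round-robin
    }
  }
}

addJob("short-interactive", 3);
addJob("long-batch", 60);
schedule(200);

Because every new job starts at the top, the short job finishes within a few ticks, while the long job sinks to the bottom queue and is periodically rescued by the boost.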


Tuning MLFQ

The remaining piece of the puzzle is parameterization. An MLFQ requires choosing the number of queues, the time slice length for each, and the frequency of the priority boost. There are no easy answers to these questions, and finding a satisfactory balance often requires deep experience with specific workloads. For example, most implementations employ varying time-slice lengths, assigning short slices (e.g., 10 ms) to high-priority queues for responsiveness and longer slices (e.g., 100s of ms) to low-priority queues for efficiency. Furthermore, the priority boost interval is often referred to as a "voodoo constant" because it requires magic to set correctly; if the value is too high, jobs starve, but if it is too low, interactive performance suffers.

MLFQ is a milestone in operating systems design. It delivers strong performance for interactive jobs without prior knowledge of job length, while remaining fair to long-running tasks. As noted earlier, it became the base scheduler for many operating systems, with several variants refining the core idea. One notable variant is the decay-usage approach used in FreeBSD 4.3. Instead of using fixed priority tables (as in Solaris), it computes priority using a mathematical function of recent CPU usage. Running increases a job’s usage counter and lowers its priority, while the passage of time decays this counter. Decay plays the same role as periodic priority boosts. As usage fades, priority rises, ensuring long-running jobs eventually run and allowing jobs that shift from CPU-bound to interactive to regain high priority.
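
The decay-usage idea can be sketched in a few lines. The constants and scaling below are invented for illustration; they are not FreeBSD's actual tables or formula.

// decay-usage-sketch.ts -- illustrative only; invented constants.
let cpuUsage = 0; // accumulated recent CPU usage for one process

function chargeTick() {
  cpuUsage += 1; // running raises usage, which lowers priority
}

function decay() {
  cpuUsage = Math.floor(cpuUsage / 2); // the passage of time decays usage
}

function priority(): number {
  // Larger value = lower priority. Decay plays the role of the periodic
  // boost: once a job stops hogging the CPU, its priority recovers.
  return Math.floor(cpuUsage / 4);
}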


TLA+ model  

I used Gemini to write a TLA+ model of the MLFQ algorithm here. To run this MLFQ TLA+ model in Spectacle for visualization, click this link and it will open the model in your browser, no installation or plugin required. What you will see is the initial state. Click on any enabled action to take it; you can go back and forward in the right pane to explore the execution. You can also share a URL with anyone to point to an interesting state or trace, just like I did here.

February 10, 2026

{ w: 1 } Asynchronous Writes and Conflict Resolution in MongoDB

By default, MongoDB guarantees durability (the D in ACID) over the network, together with strong consistency (the C in the CAP theorem). It still maintains high availability: in the event of a network partition, the majority of nodes continue to serve consistent reads and writes transparently, without raising errors to the application.

A consensus protocol based on Raft is used to achieve this at two levels:

  • Writes are directed to the shard's primary, which coordinates consistency between the collection and the indexes. Raft is used to elect one replica as primary, with the others acting as secondaries.
  • Writes to the shard's primary are replicated to the secondaries and acknowledged once a majority has guaranteed durability on persistent storage. The equivalent of the Raft log is the data itself—the transaction oplogs.

It's important to distinguish the two types of consensus involved: one for controlling replica roles and one for the replication of data itself. By comparison, failover automation around monolithic databases like PostgreSQL can use a consensus protocol to elect a primary (as Patroni does), but replication itself is built into PostgreSQL and does not rely on a consensus protocol—a failure in the middle may leave inconsistency between replicas.

Trade-offs between performance and consistency

Consensus on writes increases latency, especially in multi-region deployments, because it requires synchronous replication and waiting on the network, but it guarantees no data loss in disaster recovery scenarios (RPO = 0). Some workloads may prefer lower latency and accept limited data loss (for example, a couple of seconds of RPO after a datacenter burns). If you ingest data from IoT devices, you may favor fast ingestion at the risk of losing some data in such a disaster. Similarly, when migrating from another database, you might prefer fast synchronization and, in case of infrastructure failure, simply restart the migration from before the failure point. In such cases, you can use {w:1} write concern in MongoDB instead of the default {w:"majority"}.
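
For reference, here is a minimal sketch of opting into {w:1} per operation with the Node.js driver; the connection string, database, and collection names are placeholders, not part of the original post.

// w1-ingest.ts -- minimal sketch, not a production configuration.
import { MongoClient } from "mongodb";

async function main() {
  const client = new MongoClient(
    "mongodb://m1:27017,m2:27017,m3:27017/?replicaSet=rs0"
  );
  await client.connect();
  const events = client.db("iot").collection("events");

  // Default behavior: wait for majority acknowledgment (RPO = 0).
  await events.insertOne(
    { sensor: "s1", value: 21.5, at: new Date() },
    { writeConcern: { w: "majority" } }
  );

  // Low-latency ingestion: acknowledged as soon as the primary has the write,
  // accepting a small data-loss window if the primary fails before replicating.
  await events.insertOne(
    { sensor: "s1", value: 21.7, at: new Date() },
    { writeConcern: { w: 1 } }
  );

  await client.close();
}

main().catch(console.error);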

Most failures are not full-scale disasters where an entire data center is lost, but transient issues with short network disconnections. With {w:1}, the primary risk is not data loss—because writes can be synchronized eventually—but split brain, where both sides of a network partition continue to accept writes. This is where the two levels of consensus matter:

  • A new primary is elected, and the old primary steps down, limiting the split-brain window to a few seconds.
  • With the default {w:"majority"}, writes that cannot reach a majority are not acknowledged on the side of the partition without a quorum. This prevents split brain. However, with {w:1}, those writes are acknowledged until the old primary steps down.

Because the failure is transient, when the old primary rejoins, no data is physically lost: writes from both sides still exist. However, these writes may conflict, resulting in a diverging database state with two branches. As with any asynchronous replication, this requires conflict resolution. MongoDB handles this as follows:

  • Writes from the new primary are preserved, as this is where the application has continued to make progress.
  • Writes that occurred on the old primary during the brief split-brain window are rolled back, and the old primary then pulls the more recent writes from the new primary.

Thus, when you use {w:1}, you accept the possibility of limited data loss in the event of a failure. Once the node is back, these writes are not entirely lost, but they cannot be merged automatically. MongoDB stores them as BSON files in a rollback directory so you can inspect them and perform manual conflict resolution if needed.

This conflict resolution relies on MongoDB's Recover To a Timestamp (RTT) rollback algorithm.

Demo on a Docker lab

Let's try it. I start 3 containers as a replica set:

docker network create lab
docker run --network lab --name m1 --hostname m1 -d mongo --replSet rs0
docker run --network lab --name m2 --hostname m2 -d mongo --replSet rs0
docker run --network lab --name m3 --hostname m3 -d mongo --replSet rs0
docker exec -it m1 mongosh --eval '
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "m1:27017", priority: 3 },
    { _id: 1, host: "m2:27017", priority: 2 },
    { _id: 2, host: "m3:27017", priority: 1 }
  ]
})
'
until
docker exec -it m1 mongosh --eval "rs.status().members.forEach(m => print(m.name, m.stateStr))" |
 grep -C3 "m1:27017 PRIMARY"
do sleep 1 ; done

The last command waits until m1 is the primary, as set by its priority. I do that to make the demo reproducible with a simple copy-paste.

I insert "XXX-10" when connected to m1:

docker exec -it m1 mongosh --eval '
  db.demo.insertOne(
   { _id:"XXX-10" , date:new Date() },
   { writeConcern: {w: "1"}    }
)
'

{ acknowledged: true, insertedId: 'XXX-10' }

I disconnect the secondary m2:

docker network disconnect lab m2

With a replication factor of 3, the cluster is resilient to one failure, so I insert "XXX-11" while connected to the primary:

docker exec -it m1 mongosh --eval '
  db.demo.insertOne(
   { _id:"XXX-11" , date:new Date() },
   { writeConcern: {w: "1"}    }
)
'

{ acknowledged: true, insertedId: 'XXX-11' }

I disconnect m1, the current primary, reconnect m2, and immediately insert "XXX-12" while still connected to m1:

docker network disconnect lab m1
docker network    connect lab m2

docker exec -it m1 mongosh --eval '
  db.demo.insertOne(
   { _id:"XXX-12" , date:new Date() },
   { writeConcern: {w: "1"}    }
)
'

{ acknowledged: true, insertedId: 'XXX-12' }

Here, m1 is still primary for a short period before it detects that it cannot reach a majority of replicas and steps down. If the write concern were {w: "majority"}, the insert would have waited and failed, unable to reach a quorum, but with {w: "1"} replication is asynchronous and the write is acknowledged as soon as it is written to the local disk.

Two seconds later, a similar write fails because the primary stepped down:

sleep 2

docker exec -it m1 mongosh --eval '
  db.demo.insertOne(
   { _id:"XXX-13" , date:new Date() },
   { writeConcern: {w: "1"}    }
)
'

MongoServerError: not primary

I wait until m2 is the new primary, as set by its priority, then connect to it to insert "XXX-20":

until
docker exec -it m2 mongosh --eval "rs.status().members.forEach(m => print(m.name, m.stateStr))" |
 grep -C3 "m2:27017 PRIMARY"
do sleep 1 ; done

docker exec -it m2 mongosh --eval '
  db.demo.insertOne(
   { _id:"XXX-20" , date:new Date() },
   { writeConcern: {w: "1"}    }
)
'

{ acknowledged: true, insertedId: 'XXX-20' }

No node is down; this is only a network partition, and I can still read from every node by connecting locally with docker exec rather than over the partitioned network. I query the collection on each side:

docker exec -it m1 mongosh --eval 'db.demo.find()'
docker exec -it m2 mongosh --eval 'db.demo.find()'
docker exec -it m3 mongosh --eval 'db.demo.find()'

The inconsistency is visible: "XXX-12" is present only on m1, while "XXX-20" is present only on m2 and m3.

I reconnect m1 so that all nodes can communicate and synchronize their state:

docker network    connect lab m1

I query again and all nodes show the same values.

"XXX-12" has disappeared and all nodes are now synchronized to the current state. When it rejoined, m1 rolled back the operations that occurred during the split-brain window. This is expected and acceptable, since the write used a { w: 1 } write concern, which explicitly allows limited data loss in case of failure in order to avoid cross-network latency on each write.

The rolled-back operations are not lost: MongoDB logged them in a rollback directory in BSON format, with the rolled-back documents as well as the related oplog entries.

I read and decode all BSON in the rollback directory:


docker exec -i m1 bash -c '
# dump every rolled-back BSON file, printing its path first
for f in /data/db/rollback/*/removed.*.bson
do
 echo "$f"
 bsondump "$f" --pretty
done
' | egrep --color=auto '^|^/.*|.*("op":|"XXX-..").*'

The deleted document is in /data/db/rollback/0ae03154-0a51-4276-ac62-50d73ad31fe0/removed.2026-02-10T10-40-58.1.bson:

{
        "_id": "XXX-12",
        "date": {
                "$date": {
                        "$numberLong": "1770719868965"
                }
        }
}

The rolled-back oplog entry for the related insert is in /data/db/rollback/local.oplog.rs/removed.2026-02-10T10-40-58.0.bson:

{
        "lsid": {
                "id": {
                        "$binary": {
                                "base64": "erR2AoFXS3mbcX4BJSiWjw==",
                                "subType": "04"
                        }
                },
                "uid": {
                        "$binary": {
                                "base64": "47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=",
                                "subType": "00"
                        }
                }
        },
        "txnNumber": {
                "$numberLong": "1"
        },
        "op": "i",
        "ns": "test.demo",
        "ui": {
                "$binary": {
                        "base64": "CuAxVApRQnasYlDXOtMf4A==",
                        "subType": "04"
                }
        },
        "o": {
                "_id": "XXX-12",
                "date": {
                        "$date": {
                                "$numberLong": "1770719868965"
                        }
                }
        },
        "o2": {
                "_id": "XXX-12"
        },
        "stmtId": {
                "$numberInt": "0"
        },
        "ts": {
                "$timestamp": {
                        "t": 1770719868,
                        "i": 1
                }
        },
        "t": {
                "$numberLong": "1"
        },
        "v": {
                "$numberLong": "2"
        },
        "wall": {
                "$date": {
                        "$numberLong": "1770719868983"
                }
        },
        "prevOpTime": {
                "ts": {
                        "$timestamp": {
                                "t": 0,
                                "i": 0
                        }
                },
                "t": {
                        "$numberLong": "-1"
                }
        }
}

Conclusion: beyond Raft

By default, MongoDB favors strong consistency and durability: writes use { w: "majority" }, are majority-committed and never rolled back, and reads with readConcern: "majority" never observe rolled-back data. In this mode, MongoDB behaves like a classic Raft system: once an operation is committed, it is final.

MongoDB also lets you explicitly relax that guarantee by choosing a weaker write concern such as { w: 1 }. In doing so, you tell the system: "Prioritize availability and latency over immediate global consistency." The demo shows what that implies:

  • During a transient network partition, two primaries can briefly accept writes.
  • Both branches of history are durably written to disk.
  • When the partition heals, MongoDB deterministically chooses the majority branch.
  • Operations from the losing branch are rolled back—but not discarded. They are preserved as BSON files with their oplog entries.
  • The node then recovers to a majority-committed timestamp (RTT) and rolls forward.

This rollback behavior is where MongoDB intentionally diverges from vanilla Raft.

In classic Raft, the replicated log is the source of truth, and committed log entries are never rolled back. Raft assumes a linearizable, strongly consistent state machine where the application does not expect divergence. MongoDB, by contrast, comes from a NoSQL and event-driven background, where asynchronous replication, eventual consistency, and application-level reconciliation are sometimes acceptable trade-offs.

As a result:

  • MongoDB still uses Raft semantics for leader election and terms, so two primaries are never elected in the same term.
  • For data replication, MongoDB extends the model with Recover To a Timestamp (RTT) rollback.
  • This allows MongoDB to safely support lower write concerns, fast ingestion, multi-region latency optimization, and migration workloads—without silently corrupting state.

In short, MongoDB replication is based on Raft, but adds rollback semantics to support real-world distributed application patterns. Rollbacks happen only when you explicitly allow them, never with majority writes, and they are fully auditable and recoverable.

Hydra joins Supabase

The Hydra team, maintainers of pg_duckdb, is joining Supabase to focus on Postgres + Analytics and Open Warehouse Architecture.