A curated list of database news from authoritative sources

August 04, 2025

Why MongoDB skips indexes when flattening or renaming sub-document fields in $project before $match in an aggregation pipeline

MongoDB is a general-purpose database that offers more than just a JSON datastore. It is recognized for its efficient indexing and aggregation capabilities within flexible schemas. The query planner optimizes the aggregation pipeline to filter data early and effectively utilize indexes.
Although this process looks similar to SQL databases, MongoDB stands out by supporting advanced indexing on nested objects and arrays.

It's important to note that some optimizer transformations applicable to tabular data in relational databases may not work the same way with documents. Keeping this in mind helps in understanding why an index is employed or not, as the query planner must ensure consistent results across all possible documents in the flexible schema.

As an example, a projection renaming a JSON dotted path before filtering can alter its semantics and hinder index usage.

Example

I store customer contact information like this:

db.customers.insertMany([
  {
    "customer_id": "C001",
    "contact": {
      "email": "marla.singer@fightclub.com",
      "phone": "555-1010"
    }
  }
]);

{
  acknowledged: true,
  insertedIds: { '0': ObjectId('6890a5530400e901b3d4b0e1') }
}

The subdocument offers flexibility, accommodating customers who may have multiple emails, phone numbers or other contact information. However, I prefer to begin with a simple document that does not include arrays.

To enable efficient lookup by email, I create the following index:

db.customers.createIndex({ "contact.email": 1 })

contact.email_1

In a normalized SQL database, it is necessary to establish upfront that each user can only have one email address. If this is the case, the email should be stored in a column within the customers table or in a separate "contacts" table, with a unique foreign key to the customers table.

In contrast, MongoDB's document model allows flexibility: you do not need to decide between One-to-One and One-to-Many relationships up front for all future insertions. The data model remains accurate even if a customer has multiple contacts or a contact has several email addresses, avoiding significant refactoring when business rules change.

If you want to flatten the information in your aggregation pipeline, you may rename the contact's email to a simple top-level "email" field in a $project stage. You can achieve this by writing the following projection in an aggregation pipeline:

  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  }

You may then search for a single email by executing the following aggregation pipeline:

db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: "marla.singer@fightclub.com" }
  }
])

[
  {
    _id: ObjectId('6890a5530400e901b3d4b0e1'),
    customer_id: 'C001',
    email: 'marla.singer@fightclub.com'
  }
]

Intuitively, this seems reasonable, and you test it on your data sample, assuming one email per customer. However, you realize that your index is not used:

db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: "marla.singer@fightclub.com" }
  }
]).explain().stages[0]['$cursor'].queryPlanner.winningPlan

{
  isCached: false,
  stage: 'PROJECTION_DEFAULT',
  transformBy: { _id: true, customer_id: true, email: '$contact.email' },
  inputStage: { stage: 'COLLSCAN', direction: 'forward' }
}

The $match stage was not pushed ahead of the $project stage, resulting in a COLLSCAN instead of an IXSCAN.

If you are used to SQL databases and tabular data, you might assume that the projection only renames "contact.email" to "email". This could lead to an expectation that the query planner would transform the expression $match: { "email": ... } to $match: { "contact.email": ... } and use the index on "contact.email".
However, this is not the case in a document database. The reason lies in the potential existence of arrays and embedded documents. Renaming dotted paths to a top-level field can alter the semantics based on the data's structure.

To use your index, you must start with the $match stage and apply $project afterward:

db.customers.aggregate([
  {
    $match: { "contact.email": "marla.singer@fightclub.com" }
  },
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  }
]).explain().queryPlanner.winningPlan

{
  isCached: false,
  stage: 'PROJECTION_DEFAULT',
  transformBy: { _id: true, customer_id: true, email: '$contact.email' },
  inputStage: {
    stage: 'FETCH',
    inputStage: {
      stage: 'IXSCAN',
      keyPattern: { 'contact.email': 1 },
      indexName: 'contact.email_1',
      isMultiKey: false,
      multiKeyPaths: { 'contact.email': [] },
      isUnique: false,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: {
        'contact.email': [ '["marla.singer@fightclub.com", "marla.singer@fightclub.com"]' ]
      }
    }
  }
}

The index was used, and the result appears to be the same. However, the two pipelines are semantically different, and the results could diverge with array fields, which is why the query planner doesn't perform the transformation.

With MongoDB's flexible schema, I can insert sub-documents with different shapes, like an array of multiple contacts, or even arrays of arrays:

db.customers.insertMany([
  {
    customer_id: "C002",
    contact: [
      { email: "robert.paulson@fightclub.com" },
      { phone: "555-2020" }
    ]
  },
  {
    customer_id: "C003",
    contact: [
      { email: ["narrator@fightclub.com", "tyler.durden@fightclub.com"] },
      { email: "jack@fightclub.com" },
    ]
  }
]);

{
  acknowledged: true,
  insertedIds: {
    '0': ObjectId('6890a74d0400e901b3d4b0e2'),
    '1': ObjectId('6890a74d0400e901b3d4b0e3')
  }
}

MongoDB accommodates changing business requirements and increasingly complex data structures throughout the application lifecycle. In C002, I introduced an array of contact elements, including email addresses and phone numbers. In C003, I added multiple values for a single element, enabling users to store numerous emails under a single "contact.email" path.

The projection of "contact.email" to the top level still allows the filter to be applied to array elements in C002. However, the projection of C003 results in an array of arrays.

db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  }
])

[
  {
    _id: ObjectId('6890a5530400e901b3d4b0e1'),
    customer_id: 'C001',
    email: 'marla.singer@fightclub.com'
  },
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e2'),
    customer_id: 'C002',
    email: [ 'robert.paulson@fightclub.com' ]
  },
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e3'),
    customer_id: 'C003',
    email: [
      [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ],
      'jack@fightclub.com'
    ]
  }
]

With such a projection, $match: { "email": "robert.paulson@fightclub.com" } finds the one document whose projected array contains an item equal to "robert.paulson@fightclub.com". However, $match: { "email": "tyler.durden@fightclub.com" } finds no document. The projection of C003 is an array whose first item is itself an array, so it appears in the result of $match: { "email": [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ] } but not of $match: { "email": 'tyler.durden@fightclub.com' } alone.
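To make these array semantics concrete, here is a small standalone sketch in plain JavaScript. It is a simplified model of the matching rules, not MongoDB internals, and the helper names are illustrative:

```javascript
// Simplified model of MongoDB equality matching (not the real
// implementation): a field matches v if its value equals v, or if
// its value is an array containing an element equal to v.
function matchesField(value, v) {
  const eq = (a, b) => JSON.stringify(a) === JSON.stringify(b);
  return eq(value, v) || (Array.isArray(value) && value.some((el) => eq(el, v)));
}

// Dotted-path resolution descends through arrays, collecting the
// nested values, as "contact.email" does in a $match filter.
function resolvePath(doc, path) {
  return path.split(".").reduce((values, key) => {
    const out = [];
    for (const val of values) {
      const items = Array.isArray(val) ? val : [val];
      for (const el of items) {
        if (el && typeof el === "object" && el[key] !== undefined) out.push(el[key]);
      }
    }
    return out;
  }, [doc]);
}

const c003 = {
  customer_id: "C003",
  contact: [
    { email: ["narrator@fightclub.com", "tyler.durden@fightclub.com"] },
    { email: "jack@fightclub.com" },
  ],
};

// The dotted-path filter sees each nested email value, including the
// elements reached through the outer contact array:
const dotted = resolvePath(c003, "contact.email");
const dottedMatch = dotted.some((v) => matchesField(v, "tyler.durden@fightclub.com"));

// After $project, email is an array of arrays, and equality does not
// descend into the inner array:
const projected = c003.contact.map((c) => c.email).filter((e) => e !== undefined);
const topLevelMatch = matchesField(projected, "tyler.durden@fightclub.com");

console.log(dottedMatch);   // true: { "contact.email": ... } matches C003
console.log(topLevelMatch); // false: { email: ... } after $project does not
```

This is exactly the asymmetry that prevents the query planner from rewriting the post-projection filter into a dotted-path filter.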

With the projection first, I get a result for C002:


db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: "robert.paulson@fightclub.com" }
  }
]);

[
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e2'),
    customer_id: 'C002',
    email: [ 'robert.paulson@fightclub.com' ]
  }
]

However, still with the projection first, I get no result for C003:

db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: "tyler.durden@fightclub.com" }
  }
]);

(no result)

To get C003, I need to match one of the first-level array items exactly:


db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ] }
  }
]);

[
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e3'),
    customer_id: 'C003',
    email: [
      [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ],
      'jack@fightclub.com'
    ]
  }
]

db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: 'jack@fightclub.com' }
  }
]);

[
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e3'),
    customer_id: 'C003',
    email: [
      [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ],
      'jack@fightclub.com'
    ]
  }
]

This is different from a query that filters first, using a dotted path:

db.customers.aggregate([
  {
    $match: { "contact.email": 'tyler.durden@fightclub.com' }
  },
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  }
]);

[
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e3'),
    customer_id: 'C003',
    email: [
      [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ],
      'jack@fightclub.com'
    ]
  }
]

When building pipelines or views in MongoDB, avoid renaming dotted fields (e.g., "contact.email" to "email") if you intend to filter on the flattened field downstream, especially in views or multi-stage aggregations. MongoDB's query optimizer cannot safely push a $match ahead of that $project, because with arrays, embedded documents, and arrays of arrays, the rewritten filter could return incorrect matches.

This differs from SQL, where the schema is fixed and column renaming in views or subqueries is safe. With MongoDB’s flexible schema and support for arrays at any level, such rewrite optimizations are not always correct.

For best performance (index usage), always filter on the original dotted path before projecting/renaming it. This may require placing filtering stages early, or restructuring the pipeline to avoid flattening nested fields you want to use for indexed lookups.

Integrating Citus with Patroni: Sharding and High Availability Together

Citus is a robust PostgreSQL extension that aids in scaling data distribution and provides a solid sharding mechanism. It provides features such as distributed tables, reference tables, columnar storage, schema-based sharding, and more. We have already covered the basics of Citus and the initial setup in earlier blog posts: How To Scale a Single-Host PostgreSQL […]

How hard is it to migrate AWAY from Convex?

The video walks through an experiment in “de-lock-in-ifying” a small Convex app: starting with the basic TanStack Start template, the author recreates Convex queries, mutations and actions as TanStack Start server functions; swaps Convex’s reactive data layer for React Query (with manual cache invalidation); and replaces Convex’s built-in cloud database with a self-hosted Postgres instance accessed via Drizzle ORM—eventually wrapping Drizzle in a Convex-style API so most original code can be copy-pasted. They also bolt on transactions, discuss substitutes for other Convex features (file storage, realtime, auth, scheduling, search, etc.), and note that exporting Convex data is straightforward. The upshot: you can migrate off Convex without huge code changes, but you trade Convex’s “batteries-included” simplicity for extra infrastructure to manage—so the easiest escape hatch is still running Convex in self-hosted mode.

August 02, 2025

MongoDB High Availability: Replica Set in a Docker Lab

MongoDB guarantees consistent and durable write operations through write-ahead logging, which protects data from instance crashes by flushing the journal to disk upon commit. It also protects against network partitions and storage failures with synchronous replication to a quorum of replicas. Replication and failover are built-in and do not require external tools or extensions. To set up a replica set, start three mongod instances as members of the same replica set using the --replSet option with the same name. To initiate the replica set, connect to one of the nodes and specify all members along with their priorities to become primary for the Raft election.

To experiment with replication, I run it in a lab with Docker Compose, where each node is a container. However, the network and disk latencies are too small compared to real deployments. I use Linux utilities tc and strace to inject some artificial latencies and test the setup in terms of latency, consistency, and resilience.

For this post, I write to the primary and read from each node to explain the write concern and its consequences for latency. Take this as an introduction. The examples don't show all the details, which also depend on read concerns, sharding, and resilience to failures.

Replica Set

I use the following Dockerfile to add some utilities to the MongoDB image:

FROM mongodb/mongodb-community-server
USER root
RUN apt-get update && apt-get install -y iproute2 strace

I start 3 replicas with the following Docker Compose service:

  mongo:
    build: .
    volumes:
      - .:/scripts:ro
    # inject 100ms network latency and 50ms disk sync latency 
    cap_add:
      - NET_ADMIN   # for tc
      - SYS_PTRACE  # for strace
    command: |
     bash -xc '
     tc qdisc add dev eth0 root netem delay 100ms ;
     strace -e inject=fdatasync:delay_enter=50000 -f -Te trace=fdatasync -o /dev/null mongod --bind_ip_all --replSet rs0 --logpath /var/log/mongod
     '
    deploy:
      replicas: 3

The command injects 100ms of network latency with tc qdisc add dev eth0 root netem delay 100ms (which requires the NET_ADMIN capability). The MongoDB server is started under strace (which requires the SYS_PTRACE capability), which injects a delay of 50,000 microseconds (delay_enter=50000) on each call to fdatasync().

I declared a service to initiate the replica set:

  init-replica-set:
    build: .
    depends_on:
      mongo:
        condition: service_started
    entrypoint: |
      bash -xc '
        sleep 3 ; 
        mongosh --host mongo --eval "
         rs.initiate( {_id: \"rs0\", members: [
          {_id: 0, priority: 3, host: \"${COMPOSE_PROJECT_NAME}-mongo-1:27017\"},
          {_id: 1, priority: 2, host: \"${COMPOSE_PROJECT_NAME}-mongo-2:27017\"},
          {_id: 2, priority: 1, host: \"${COMPOSE_PROJECT_NAME}-mongo-3:27017\"}]
         });
        ";
        sleep 1
      '

Read after Write application

I use a service to run the client application:

  client:
    build: .
    depends_on:
      init-replica-set:
        condition: service_completed_successfully
    volumes:
      - .:/scripts:ro
    entrypoint: |
      bash -xc '
        mongosh --host mongo -f /scripts/read-and-write.js
      '

The read-and-write.js script connects to each node with direct connection, labeled 1️⃣, 2️⃣, and 3️⃣, and also connects to the replica set, labeled 🔢, which writes to the primary and can read from secondary nodes:

const connections = {    
  "🔢": 'mongodb://rs-mongo-1:27017,rs-mongo-2:27017,rs-mongo-3:27017/test?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=true&w=majority&journal=true',    
  "1️⃣": 'mongodb://rs-mongo-1:27017/test?directConnection=true&connectTimeoutMS=900&serverSelectionTimeoutMS=500&socketTimeoutMS=300',    
  "2️⃣": 'mongodb://rs-mongo-2:27017/test?directConnection=true&connectTimeoutMS=900&serverSelectionTimeoutMS=500&socketTimeoutMS=300',    
  "3️⃣": 'mongodb://rs-mongo-3:27017/test?directConnection=true&connectTimeoutMS=900&serverSelectionTimeoutMS=500&socketTimeoutMS=300',    
};    

After defining the connection strings, the script attempts to establish separate connections to each MongoDB node in the replica set, as well as a connection using the replica set URI that can send reads to secondaries. It continuously retries connections until at least one node responds and a primary is detected. The script keeps references to all active connections.

Once the environment is ready, the script enters an infinite loop to perform and monitor read and write operations. On each loop iteration, it first determines the current primary node. It then writes a counter value, which is a simple incrementing integer, to the primary node by updating a document identified by the client’s hostname. After performing the write call, it reads the same document from all nodes—primary, secondaries, and the replica set URI—recording the value retrieved from each and the time it took for the read to return.

For every read and write, the script logs details, including the value read or written, the node that handled the operation, the time it took, and whether the results match expectations. It uses checkmarks to indicate success and issues mismatch warnings if a value is stale. If an operation fails (such as when a node is temporarily unavailable), the script automatically attempts to reconnect to that node in the background for future operations.
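The per-read check described above boils down to comparing the value read from a node with the value just written. Here is a minimal sketch (the helper name and format are illustrative, not taken from read-and-write.js):

```javascript
// Sketch of the staleness check: compare the value read from a node
// with the value just written and flag mismatches, mimicking the
// "19 from 1️⃣ ✅(  105ms)" entries in the log output below.
function formatRead(written, read, node, elapsedMs) {
  const mark = read === written ? "✅" : "🚫";
  return `${read} from ${node} ${mark}(${String(elapsedMs).padStart(5)}ms)`;
}

console.log(formatRead(19, 19, "1️⃣", 105));   // fresh read, flagged ✅
console.log(formatRead(113, 112, "2️⃣", 103)); // stale read, flagged 🚫
```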

I made all this available in the following repo:

https://github.com/FranckPachot/lab-mongodb-replicaset/tree/blog-202507-mongodb-high-availability-replicaset-in-a-docker-lab

Just start it with:


docker compose up --build

Write Concern: majority - wait for network and disk

The connection string specifies w=majority.

Once initialized, each line shows the value that is written to the replica set connection 🔢 and read from each connection 🔢, 1️⃣, 2️⃣, 3️⃣:

Here is a sample output:

client-1            | 2025-07-08T20:19:01.044Z Write 19 to 🔢 ✅(  358ms) Read 19 from 🔢 ✅(  104ms) 19 from 1️⃣ ✅(  105ms) 19 from 2️⃣ ✅(  105ms) 19 from 3️⃣ ✅(  105ms) client e0edde683498
client-1            | 2025-07-08T20:19:02.111Z Write 20 to 🔢 ✅(  357ms) Read 20 from 🔢 ✅(  104ms) 20 from 1️⃣ ✅(  104ms) 20 from 2️⃣ ✅(  105ms) 20 from 3️⃣ ✅(  104ms) client e0edde683498
client-1            | 2025-07-08T20:19:03.179Z Write 21 to 🔢 ✅(  357ms) Read 21 from 🔢 ✅(  103ms) 21 from 1️⃣ ✅(  104ms) 21 from 2️⃣ ✅(  103ms) 21 from 3️⃣ ✅(  104ms) client e0edde683498
client-1            | 2025-07-08T20:19:04.244Z Write 22 to 🔢 ✅(  357ms) Read 22 from 🔢 ✅(  103ms) 22 from 1️⃣ ✅(  103ms) 22 from 2️⃣ ✅(  104ms) 22 from 3️⃣ ✅(  104ms) client e0edde683498
client-1            | 2025-07-08T20:19:05.310Z Write 23 to 🔢 ✅(  357ms) Read 23 from 🔢 ✅(  105ms) 23 from 1️⃣ ✅(  105ms) 23 from 2️⃣ ✅(  104ms) 23 from 3️⃣ ✅(  104ms) client e0edde683498
client-1            | 2025-07-08T20:19:06.377Z Write 24 to 🔢 ✅(  357ms) Read 24 from 🔢 ✅(  105ms) 24 from 1️⃣ ✅(  105ms) 24 from 2️⃣ ✅(  104ms) 24 from 3️⃣ ✅(  104ms) client e0edde683498
client-1            | 2025-07-08T20:19:07.443Z Write 25 to 🔢 ✅(  357ms) Read 25 from 🔢 ✅(  104ms) 25 from 1️⃣ ✅(  104ms) 25 from 2️⃣ ✅(  104ms) 25 from 3️⃣ ✅(  104ms) client e0edde683498
client-1            | 2025-07-08T20:19:08.508Z Write 26 to 🔢 ✅(  357ms) Read 26 from 🔢 ✅(  104ms) 26 from 1️⃣ ✅(  104ms) 26 from 2️⃣ ✅(  104ms) 26 from 3️⃣ ✅(  105ms) client e0edde683498

The program verifies that the read gets the latest write (✅), but keep in mind that this is not guaranteed. The default write concern is 'majority', which serves as a durability guarantee: it ensures that a write operation is saved to the journal on persistent storage on a majority of replicas. However, it does not wait for the write to be applied to the database and become visible to reads. The goal here is to measure the latency involved in acknowledging durability.

With an artificial latency of 100ms on the network and 50ms on the disk, we observe a connection time of 100ms to a node for both read and write operations.
For writes, it adds 250ms for the majority write concern:

  • 100ms for a secondary to pull the write operation (oplog)
  • 50ms to sync the journal to disk on the secondary
  • 100ms for the secondary to update the sync state to the primary

The total duration is 350ms. It also includes syncing to disk on the primary, which occurs in parallel with the replication.
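The 350ms figure can be reconstructed from the injected delays. This is a back-of-the-envelope sketch of the latency budget, with the variable names as assumptions:

```javascript
// Latency budget for a w:majority write in this lab, built from the
// injected delays (100ms network via tc, 50ms disk via strace).
const networkMs = 100;
const diskMs = 50;

const clientToPrimary = networkMs; // the write call reaches the primary
const secondaryPull   = networkMs; // a secondary pulls the oplog entry
const secondarySync   = diskMs;    // the secondary syncs its journal
const ackToPrimary    = networkMs; // the secondary reports its sync state

// The primary's own journal sync (50ms) overlaps with replication,
// so it does not add to the total.
const total = clientToPrimary + secondaryPull + secondarySync + ackToPrimary;
console.log(total); // 350
```

This matches the ~357ms write times in the sample output, with a few milliseconds of real processing on top.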

MongoDB replication differs from many databases in that it employs a mechanism similar to Raft to achieve consistency across multiple nodes. However, changes are pulled by the secondary nodes rather than pushed by the primary. The primary node waits for a commit state, indicated by a Hybrid Logical Clock timestamp, sent by the secondary.

Write Concern: 0 - do not wait for durability

Another difference compared with traditional databases is that the client driver is part of the consensus protocol. To demonstrate this, I changed w=majority to w=0 so as not to wait for any acknowledgment of the write call, and restarted the client with five replicas of it:


 docker compose up --scale client=5

The write is faster, not waiting on the network or disk, but the value that is read is stale:

client-5            | 2025-07-08T20:48:50.823Z Write 113 to 🔢 🚫(    1ms) Read 112 from 🔢 🚫(  103ms) 113 from 1️⃣ ✅(  103ms) 112 from 2️⃣ 🚫(  103ms) 112 from 3️⃣ 🚫(  103ms) client e0e3c8b1bafd
client-3            | 2025-07-08T20:48:50.824Z Write 113 to 🔢 🚫(    1ms) Read 112 from 🔢 🚫(  104ms) 113 from 1️⃣ ✅(  104ms) 112 from 2️⃣ 🚫(  104ms) 112 from 3️⃣ 🚫(  104ms) client 787c2676d17e
client-2            | 2025-07-08T20:48:51.459Z Write 114 to 🔢 🚫(    1ms) Read 113 from 🔢 🚫(  105ms) 114 from 1️⃣ ✅(  104ms) 113 from 2️⃣ 🚫(  105ms) 113 from 3️⃣ 🚫(  104ms) client 9fd577504268
client-1            | 2025-07-08T20:48:51.520Z Write 114 to 🔢 🚫(    1ms) Read 113 from 🔢 🚫(  105ms) 114 from 1️⃣ ✅(  105ms) 113 from 2️⃣ 🚫(  104ms) 113 from 3️⃣ 🚫(  104ms) client e0edde683498
client-4            | 2025-07-08T20:48:51.522Z Write 114 to 🔢 🚫(    1ms) Read 113 from 🔢 🚫(  103ms) 114 from 1️⃣ ✅(  103ms) 113 from 2️⃣ 🚫(  103ms) 113 from 3️⃣ 🚫(  103ms) client a6c1eaab69a7
client-5            | 2025-07-08T20:48:51.530Z Write 114 to 🔢 🚫(    0ms) Read 113 from 🔢 🚫(  103ms) 114 from 1️⃣ ✅(  103ms) 113 from 2️⃣ 🚫(  103ms) 113 from 3️⃣ 🚫(  103ms) client e0e3c8b1bafd
client-3            | 2025-07-08T20:48:51.532Z Write 114 to 🔢 🚫(    1ms) Read 113 from 🔢 🚫(  104ms) 114 from 1️⃣ ✅(  103ms) 113 from 2️⃣ 🚫(  103ms) 113 from 3️⃣ 🚫(  103ms) client 787c2676d17e
client-2            | 2025-07-08T20:48:52.168Z Write 115 to 🔢 🚫(    1ms) Read 114 from 🔢 🚫(  103ms) 115 from 1️⃣ ✅(  103ms) 114 from 2️⃣ 🚫(  103ms) 114 from 3️⃣ 🚫(  103ms) client 9fd577504268
client-4            | 2025-07-08T20:48:52.230Z Write 115 to 🔢 🚫(    1ms) Read 114 from 🔢 🚫(  103ms) 115 from 1️⃣ ✅(  103ms) 114 from 2️⃣ 🚫(  103ms) 114 from 3️⃣ 🚫(  103ms) client a6c1eaab69a7
client-1            | 2025-07-08T20:48:52.229Z Write 115 to 🔢 🚫(    1ms) Read 114 from 🔢 🚫(  104ms) 115 from 1️⃣ ✅(  104ms) 114 from 2️⃣ 🚫(  103ms) 114 from 3️⃣ 🚫(  103ms) client e0edde683498
client-5            | 2025-07-08T20:48:52.237Z Write 115 to 🔢 🚫(    2ms) Read 114 from 🔢 🚫(  103ms) 115 from 1️⃣ ✅(  103ms) 114 from 2️⃣ 🚫(  103ms) 114 from 3️⃣ 🚫(  103ms) client e0e3c8b1bafd
client-3            | 2025-07-08T20:48:52.240Z Write 115 to 🔢 🚫(    1ms) Read 114 from 🔢 🚫(  103ms) 115 from 1️⃣ ✅(  103ms) 114 from 2️⃣ 🚫(  103ms) 114 from 3️⃣ 🚫(  103ms) client 787c2676d17e
client-2            | 2025-07-08T20:48:52.876Z Write 116 to 🔢 🚫(    1ms) Read 115 from 🔢 🚫(  103ms) 116 from 1️⃣ ✅(  104ms) 115 from 2️⃣ 🚫(  104ms) 115 from 3️⃣ 🚫(  103ms) client 9fd577504268
client-4            | 2025-07-08T20:48:52.936Z Write 116 to 🔢 🚫(    1ms) Read 115 from 🔢 🚫(  103ms) 116 from 1️⃣ ✅(  104ms) 115 from 2️⃣ 🚫(  103ms) 115 from 3️⃣ 🚫(  103ms) client a6c1eaab69a7

The write occurs immediately, succeeding as soon as it is buffered on the driver. While this doesn't guarantee the durability of the acknowledged writes, it does avoid the costs associated with any network latency. In scenarios such as IoT, prioritizing throughput is crucial, even if it means accepting potential data loss during failures.

Because the write is acknowledged immediately but still has to be replicated and applied on the other nodes, I read stale values (indicated by 🚫), except when the time to read was longer than the time to replicate and apply, though there is no guarantee of that.

Write Concern: 1, journal: false

I adjusted the write concern to w=1, which means that the system will wait for acknowledgment from the primary node. By default, this acknowledgment ensures that the journal recording the write operation is saved to persistent storage. However, I disabled it by setting journal=false, allowing the write latency to be reduced to just the network time to the primary, which is approximately 100ms:

client-2            | 2025-07-08T20:50:08.756Z Write 10 to 🔢 (  104ms) Read 10 from 🔢 (  105ms) 10 from 1️⃣ (  105ms) 10 from 2️⃣ (  104ms) 10 from 3️⃣ (  104ms) client 9fd577504268
client-4            | 2025-07-08T20:50:08.949Z Write 10 to 🔢 (  103ms) Read 10 from 🔢 (  105ms) 10 from 1️⃣ (  105ms) 10 from 2️⃣ (  106ms) 10 from 3️⃣ (  105ms) client a6c1eaab69a7
client-1            | 2025-07-08T20:50:08.952Z Write 10 to 🔢 (  103ms) Read 10 from 🔢 (  104ms) 10 from 1️⃣ (  104ms) 10 from 2️⃣ (  104ms) 10 from 3️⃣ (  105ms) client e0edde683498
client-3            | 2025-07-08T20:50:08.966Z Write 10 to 🔢 (  103ms) Read 10 from 🔢 (  104ms) 10 from 1️⃣ (  105ms) 10 from 2️⃣ (  104ms) 10 from 3️⃣ (  104ms) client 787c2676d17e
client-5            | 2025-07-08T20:50:08.970Z Write 10 to 🔢 (  103ms) Read 10 from 🔢 (  105ms) 10 from 1️⃣ (  105ms) 10 from 2️⃣ (  105ms) 10 from 3️⃣ (  105ms) client e0e3c8b1bafd
client-2            | 2025-07-08T20:50:09.569Z Write 11 to 🔢 (  103ms) Read 11 from 🔢 (  104ms) 11 from 1️⃣ (  104ms) 11 from 2️⃣ (  104ms) 11 from 3️⃣ (  104ms) client 9fd577504268
client-4            | 2025-07-08T20:50:09.762Z Write 11 to 🔢 (  104ms) Read 10 from 🔢 🚫(  105ms) 11 from 1️⃣ (  106ms) 11 from 2️⃣ (  105ms) 11 from 3️⃣ (  105ms) client a6c1eaab69a7
client-1            | 2025-07-08T20:50:09.765Z Write 11 to 🔢 (  103ms) Read 11 from 🔢 (  107ms) 10 from 1️⃣ 🚫(  104ms) 11 from 2️⃣ (  105ms) 11 from 3️⃣ (  106ms) client e0edde683498
client-3            | 2025-07-08T20:50:09.778Z Write 11 to 🔢 (  105ms) Read 11 from 🔢 (  104ms) 11 from 1️⃣ (  105ms) 11 from 2️⃣ (  105ms) 11 from 3️⃣ (  104ms) client 787c2676d17e
client-5            | 2025-07-08T20:50:09.782Z Write 11 to 🔢 (  103ms) Read 11 from 🔢 (  105ms) 11 from 1️⃣ (  104ms) 11 from 2️⃣ (  105ms) 11 from 3️⃣ (  105ms) client e0e3c8b1bafd
client-2            | 2025-07-08T20:50:10.381Z Write 12 to 🔢 (  103ms) Read 11 from 🔢 🚫(  105ms) 11 from 1️⃣ 🚫(  105ms) 12 from 2️⃣ (  105ms) 12 from 3️⃣ (  105ms) client 9fd577504268
client-1            | 2025-07-08T20:50:10.578Z Write 12 to 🔢 (  104ms) Read 12 from 🔢 (  106ms) 12 from 1️⃣ (  105ms) 12 from 2️⃣ (  105ms) 12 from 3️⃣ (  106ms) client e0edde683498
client-4            | 2025-07-08T20:50:10.579Z Write 12 to 🔢 (  104ms) Read 12 from 🔢 (  106ms) 12 from 1️⃣ (  106ms) 12 from 2️⃣ (  105ms) 12 from 3️⃣ (  105ms) client a6c1eaab69a7
client-5            | 2025-07-08T20:50:10.594Z Write 12 to 🔢 (11751ms) Read 11 from 🔢 🚫(  106ms) 12 from 1️⃣ (  106ms) 11 from 2️⃣ 🚫(  106ms) 11 from 3️⃣ 🚫(  105ms) client e0e3c8b1bafd
client-3            | 2025-07-08T20:50:10.592Z Write 12 to 🔢 (11753ms) Read 11 from 🔢 🚫(  105ms) 12 from 1️⃣ (  105ms) 11 from 2️⃣ 🚫(  105ms) 11 from 3️⃣ 🚫(  105ms) client 787c2676d17e

It is important to understand the consequences of failure. The change is written to the filesystem buffers but may not have been fully committed to disk since fdatasync() is called asynchronously every 100 milliseconds. This means that if the Linux instance crashes, up to 100 milliseconds of acknowledged transactions could be lost. However, if the MongoDB instance fails, there is no data loss, as the filesystem buffers remain intact.

Write Concern: 1, journal: true

Still with w=1, but the default journal=true, an fdatasync() is run before the acknowledgment of the write, to guarantee durability on that node. With my injected latency, it adds 50 milliseconds:

client-1            | 2025-07-08T20:52:34.922Z Write 48 to 🔢 ✅(  155ms) Read 48 from 🔢 ✅(  105ms) 48 from 1️⃣ ✅(  105ms) 47 from 2️⃣ 🚫(  105ms) 48 from 3️⃣ ✅(  105ms) client e0edde683498
client-3            | 2025-07-08T20:52:35.223Z Write 50 to 🔢 ✅(  154ms) Read 50 from 🔢 ✅(  104ms) 50 from 1️⃣ ✅(  105ms) 49 from 2️⃣ 🚫(  105ms) 50 from 3️⃣ ✅(  105ms) client 787c2676d17e
client-2            | 2025-07-08T20:52:35.276Z Write 49 to 🔢 ✅(  155ms) Read 49 from 🔢 ✅(  104ms) 49 from 1️⃣ ✅(  105ms) 48 from 2️⃣ 🚫(  105ms) 49 from 3️⃣ ✅(  105ms) client 9fd577504268
client-5            | 2025-07-08T20:52:35.377Z Write 49 to 🔢 ✅(  155ms) Read 49 from 🔢 ✅(  105ms) 49 from 1️⃣ ✅(  104ms) 48 from 2️⃣ 🚫(  105ms) 49 from 3️⃣ ✅(  104ms) client e0e3c8b1bafd
client-4            | 2025-07-08T20:52:35.430Z Write 50 to 🔢 ✅(  154ms) Read 50 from 🔢 ✅(  104ms) 50 from 1️⃣ ✅(  105ms) 49 from 2️⃣ 🚫(  105ms) 50 from 3️⃣ ✅(  105ms) client a6c1eaab69a7
client-1            | 2025-07-08T20:52:35.785Z Write 49 to 🔢 ✅(  154ms) Read 49 from 🔢 ✅(  103ms) 49 from 1️⃣ ✅(  103ms) 48 from 2️⃣ 🚫(  103ms) 49 from 3️⃣ ✅(  103ms) client e0edde683498
client-3            | 2025-07-08T20:52:36.086Z Write 51 to 🔢 ✅(  154ms) Read 51 from 🔢 ✅(  104ms) 51 from 1️⃣ ✅(  105ms) 50 from 2️⃣ 🚫(  104ms) 51 from 3️⃣ ✅(  104ms) client 787c2676d17e
client-2            | 2025-07-08T20:52:36.140Z Write 50 to 🔢 ✅(  154ms) Read 50 from 🔢 ✅(  105ms) 50 from 1️⃣ ✅(  104ms) 49 from 2️⃣ 🚫(  104ms) 50 from 3️⃣ ✅(  105ms) client 9fd577504268
client-5            | 2025-07-08T20:52:36.241Z Write 50 to 🔢 ✅(  155ms) Read 50 from 🔢 ✅(  104ms) 50 from 1️⃣ ✅(  103ms) 49 from 2️⃣ 🚫(  103ms) 50 from 3️⃣ ✅(  104ms) client e0e3c8b1bafd
client-4            | 2025-07-08T20:52:36.294Z Write 51 to 🔢 ✅(  154ms) Read 51 from 🔢 ✅(  102ms) 51 from 1️⃣ ✅(  103ms) 50 from 2️⃣ 🚫(  103ms) 51 from 3️⃣ ✅(  103ms) client a6c1eaab69a7
client-1            | 2025-07-08T20:52:36.645Z Write 50 to 🔢 ✅(  154ms) Read 50 from 🔢 ✅(  103ms) 50 from 1️⃣ ✅(  103ms) 49 from 2️⃣ 🚫(  103ms) 50 from 3️⃣ ✅(  103ms) client e0edde683498
client-3            | 2025-07-08T20:52:36.950Z Write 52 to 🔢 ✅(  154ms) Read 52 from 🔢 ✅(  104ms) 52 from 1️⃣ ✅(  103ms) 51 from 2️⃣ 🚫(  103ms) 52 from 3️⃣ ✅(  104ms) client 787c2676d17e
client-2            | 2025-07-08T20:52:37.003Z Write 51 to 🔢 ✅(  154ms) Read 51 from 🔢 ✅(  105ms) 51 from 1️⃣ ✅(  105ms) 50 from 2️⃣ 🚫(  105ms) 51 from 3️⃣ ✅(  104ms) client 9fd577504268
client-5            | 2025-07-08T20:52:37.103Z Write 51 to 🔢 ✅(  155ms) Read 51 from 🔢 ✅(  103ms) 51 from 1️⃣ ✅(  104ms) 50 from 2️⃣ 🚫(  104ms) 51 from 3️⃣ ✅(  104ms) client e0e3c8b1bafd
client-4            | 2025-07-08T20:52:37.155Z Write 52 to 🔢 ✅(  155ms) Read 52 from 🔢 ✅(  104ms) 52 from 1️⃣ ✅(  104ms) 51 from 2️⃣ 🚫(  104ms) 52 from 3️⃣ ✅(  103ms) client a6c1eaab69a7
client-1            | 2025-07-08T20:52:37.508Z Write 51 to 🔢 ✅(  154ms) Read 51 from 🔢 ✅(  104ms) 51 from 1️⃣ ✅(  104ms) 50 from 2️⃣ 🚫(  104ms) 51 from 3️⃣ ✅(  104ms) client e0edde683498

In summary, MongoDB allows applications to balance performance (lower latency) and durability (resilience to failures) rather than relying on a one-size-fits-all configuration that waits even when it is not necessary for the business requirements. For any given setup, the choice must consider the business requirements as well as the infrastructure: resilience of compute and storage services, local or remote storage, and network latency between nodes. In a lab, injecting network and disk latency can help simulate scenarios that illustrate the consequences of reading from secondary nodes or recovering from a failure.

To fully understand how it works, I recommend checking your understanding by reading the documentation on Write Concern and practicing in a lab. The defaults may vary per driver and version, and the consequences may not be visible without a high load or a failure. In current versions, MongoDB favors data protection, with the write concern defaulting to "majority" and journaling to true (writeConcernMajorityJournalDefault), but if you set w:1, journaling defaults to false.
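As a recap of the trade-off, here are the approximate write latencies observed in the runs above (values rounded from the sample outputs; a lab-specific illustration, not general figures):

```javascript
// Approximate write latencies measured in this lab, per write concern;
// each stronger durability guarantee adds the corresponding wait.
const writeLatencyMs = {
  "w:0":                  1, // acknowledged in the driver buffer
  "w:1, journal:false": 104, // one network round trip to the primary
  "w:1, journal:true":  155, // plus the journal fdatasync on the primary
  "w:majority":         357, // plus replication and sync on a secondary
};

// In this setup, durability and latency increase together:
const latencies = Object.values(writeLatencyMs);
const monotonic = latencies.every((v, i) => i === 0 || v > latencies[i - 1]);
console.log(monotonic); // true
```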

August 01, 2025

Postgres 18 beta2: large server, Insert Benchmark, part 2

I repeated the benchmark for one of the workloads used in a recent blog post on Postgres 18 beta2 performance. The workload used 1 client and 1 table with 50M rows that fits in the Postgres buffer pool. In the result I explain here, one of the benchmark steps was run for ~10X more time. Figuring out how long to run the steps in the Insert Benchmark is always a work in progress -- I want to test more things, so I don't want to run steps for too long, but there will be odd results if the run times are too short.

tl;dr

  • up to 2% less throughput on range queries in the qr100 benchmark step. This is similar to what I saw in my previous report.
  • up to 12% more throughput on the l.i2 benchmark step in PG beta1 and beta2. This is much better than what I saw in my previous report.

Details

Details on the benchmark are in my previous post.

The benchmark is explained here and was run for one workload -- 1 client, cached.

  • run with 1 client, 1 table and a cached database
  • load 50M rows in step l.i0, do 160M writes in step l.i1 and 40M in l.i2. Note that here the l.i1 and l.i2 steps run for ~10X longer than in my previous post.
The benchmark steps are:

  • l.i0
    • insert X million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts Y million rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and Z million rows are inserted and deleted per table.
    • Wait for N seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of N is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: overview

The performance report is here.

The summary section has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS (rQPS) to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from Postgres 17.4.

When rQPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. When it is 0.90 then I claim there is a 10% regression. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.97, green for >= 1.03 and grey for values between 0.98 and 1.02.
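As a small sketch (the QPS numbers are made up for illustration), the rQPS computation and color buckets above amount to:

```python
# Sketch of the relative QPS (rQPS) metric and the color thresholds
# described above; the example QPS values are hypothetical.
def rqps(qps_me: float, qps_base: float) -> float:
    return qps_me / qps_base

def color(r: float) -> str:
    if r <= 0.97:
        return "red"    # regression
    if r >= 1.03:
        return "green"  # improvement
    return "grey"       # within noise

# A step that drops from 1000 QPS on the base version to 900 QPS
# has rQPS 0.90, which I would call a 10% regression.
```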

Results: 1 client, cached

Normally I summarize the summary but I don't do that here to save space.

But the tl;dr is:
  • up to 2% less throughput on range queries in the qr100 benchmark step. This is similar to what I saw in my previous report.
  • up to 12% more throughput on the l.i2 benchmark step in PG beta1 and beta2. This is much better than what I saw in my previous report.

July 31, 2025

Security Advisory: CVE Affecting Percona Monitoring and Management (PMM)

A vulnerability has been discovered in all versions of Percona Monitoring and Management (PMM). There is no evidence this vulnerability has been exploited in the wild, and no customer data has been exposed. Vulnerability details This vulnerability stems from the way PMM handles input for MySQL services and agent actions. By abusing specific API endpoints, […]

Lessons from Building an AI App Builder on Convex

Over the past few months, we have built and grown Chef into the only AI app builder that knows backend. But this process didn’t start with Chef. It started with building Convex, the database that Chef is built on top of.

Scalability for the Large-Scale: File Copy-Based Initial Sync for Percona Server for MongoDB

On behalf of the entire Percona product team for MongoDB,  I’m excited to announce a significant enhancement to Percona Server for MongoDB: File Copy-Based Initial Sync (FCBIS). It is designed to accelerate your large-scale database deployment with a more efficient method for initial data synchronization. FCBIS reduces the time and resources required by the initial […]

Recent reads (July 2025)

 I know I should call this recent listens, but I am stuck with the series name. So here it goes. These are some recent "reads" this month.


Billion Dollar Whale

Reading the Billion Dollar Whale was exhausting. I am not talking about the writing, which was well-paced and packed with detail. The problem is the subject, Jho Low, a slippery and soulless character who conned Malaysia out of billions via the 1Malaysia Development Berhad sovereign wealth fund.

Jho Low is a Wharton grad. He is a big spender and party boy, dropping millions of dollars a night on gambling and partying. His party buddies included Leonardo DiCaprio, Paris Hilton, and Jamie Foxx. Jho was a showoff and a pretentious ass. What does Wharton teach these people? Do they actively recruit this type of person?

Jho was aided by the complicity of Prime Minister Najib Razak and his luxury-addicted wife. We are talking entire stores shut down for private shopping, flights hauling nothing but shoes, and cash bleeding out of a developing nation. Maybe the corruption isn't that surprising if you have followed similar stories. Turkey has been going through similar deals with Qatar, Gulf royalty, and procurement grifts. What was shocking in the book was the scale, and how easily it worked. Everyone looked away: banks, regulators, and governments.

Najib Razak is serving time in prison, but Jho Low still walks free, probably still convinced he is a legit and bright businessman. This book left me fuming throughout.


The Shepherd's Crown

The Shepherd’s Crown is Terry Pratchett’s last Discworld book, and you can feel it through the themes of death, legacy, and transition in the book.

Granny Weatherwax, the formidable witch and moral center of the series, dies early. Her death isn't dramatic but quiet and peaceful. She leaves her cottage and her responsibilities to Tiffany Aching, the young witch she mentored. (It was later revealed that Pratchett intended one last twist in the book: Granny Weatherwax had hidden her soul in a cat, delaying her meeting with Death until she could say, "I'm leaving on my own terms". But Pratchett didn't get to write that scene.)

The story seemed rushed (obviously given the circumstances), but the writing is still high quality. There were several sentences that made me laugh out loud. Even weakened by Alzheimer's, Pratchett was still sharper than most writers at their best. He also had an awesome command of the English language. Borrowing what Douglas Adams said about Wodehouse, Pratchett was one of the greatest musicians of the English language.

Stephen Briggs, the longtime narrator, does a beautiful job with the audiobook. It feels like a goodbye letter to his friend.


David Heinemeier Hansson (DHH) interview

DHH, who now looks like a young Schwarzenegger with perfect curls, just did a six-hour interview with Lex Fridman. Yes, six hours and eight minutes. Here’s the link (also to the transcript), if you dare. Halfway along the interview, I noticed DHH sounds a lot like Bill Burr, especially when he is ranting, and he rants a lot. It is not just the voice, but the delivery, the takes, the contrarianism, and the angst. It's uncanny.

I remember reading DHH years ago and thinking, "This guy's basically a communist." But now he talks like a proper capitalist, maybe even a conservative. And yet, I still think DHH is authentic. He has "strong opinions, loosely held". He is loud and often smug. He is not afraid to throw elbows and get into fights. But I get the sense that he is persuadable, and if he saw he was wrong, he would change course. 

Of course the conversation spans Ruby, Rails, AI, and the philosophy of programming. DHH argues "Ruby does scale" citing that Shopify runs on Rails and handles over a million dynamic requests per second. He says Ruby is a "luxury language" that is human-friendly. Sure, it’s not the fastest. But it lets developers move fast, stay happy, and write expressive code. DHH argues the performance bottlenecks are  usually at the database. And in most businesses, developer time costs more than servers.

DHH says he uses AI daily as a tutor, a pair programmer, and a sounding board. But he draws a hard line at "vibe coding". He said vibe coding felt hollow, and  worse, it felt like his skills were evaporating. His rule is to always keep hands on the keyboard. He argues convincingly that programming is learned by doing, not by watching. Like playing guitar, the muscle memory is the knowledge. DHH sees programming not just as a job, but as a craft, something worth doing for its own sake.

July 30, 2025

Improve PostgreSQL performance: Diagnose and mitigate lock manager contention

Are your database read operations unexpectedly slowing down as your workload scales? Many organizations running PostgreSQL-based systems encounter performance bottlenecks that aren’t immediately obvious. When many concurrent read operations access tables with numerous partitions or indexes, they can even exhaust PostgreSQL’s fast path locking mechanism, forcing the system to use shared memory locks. The switch […]

Real Life Is Uncertain. Consensus Should Be Too!

Aleksey and I sat down to read this paper on Monday night. This was an experiment that aimed to share how experts read papers in real time. We hadn't read this paper beforehand, to keep things raw. As it is with research, we ended up arguing with the paper (and with each other) back and forth. It was messy, and it was also awesome. We had a lot of fun. Check out our discussion video below (please listen at 1.5x, I sound less horrible at that speed; ah, also this thing is 2 hours long). The paper I annotated during our discussion is also available here.

This paper appeared in HotOS 2025, so it is very recent. It's a position paper arguing that the traditional F-threshold fault model in consensus protocols is outdated and even misleading.

Yes, the F-threshold fault model does feel like training wheels we never took off. In his essay "the joy of sects", Pat Helland brings up this topic to tease distributed systems folks: "Distributed systems folks. These people vacillate between philosophers and lawyers. No one else can talk so fluently about total order without discussing the scope of totality. Availability is always couched by assuming a loss of, at most, F servers, while never veering into what happens when an operator logs into the system. Integral to their psyche is the belief in deterministic outcomes in the face of nondeterministic environments. Sometimes these discussions get seriously F'd up!" I am proud to report that, during a meeting with Pat, I came up with the pun that this is "F'ed up, for F=1 and N=3". Although I must concede that Pat is the real A-tell-a The Pun!

So, the premise is agreeable: the clean abstraction of "up to f faults" is too simplistic for how real distributed systems fail. They argue that treating all faults equally (as the f-model does) hides real-world complexities. Machines don’t fail uniformly. Failures evolve over time (bathtub curve), cluster around software updates, and depend on hardware and even location in the datacenter (temperature). This 6-page paper has 77 references, so the authors seem to have done an extensive literature search on this topic.

Building on this observation, the paper advocates a paradigm shift: replacing the fixed F-model with a probabilistic approach based on per-node failure probabilities, derived from telemetry and predictive modeling. While the paper doesn't propose a new algorithm, it suggests that such a model enables cost-reliability tradeoffs. For instance, it mentions that running Raft on nine unreliable nodes could match the reliability of three high-end ones at a third of the cost. (It is unclear whether this accounts/prorates for throughput differences.)
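To make the cost-reliability tradeoff concrete, here is a sketch of the binomial-tail computation a probabilistic model would substitute for a fixed F. The per-node probabilities below are made-up numbers, not figures from the paper, and the independence assumption is one the paper itself questions:

```python
from math import comb

# Probability that MORE than f of n nodes are faulty, assuming independent
# per-node failure probability p (real failures are often correlated).
def p_too_many_faults(n: int, f: int, p: float) -> float:
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(f + 1, n + 1))

# Nine flaky nodes tolerating f=4 vs. three solid nodes tolerating f=1:
flaky = p_too_many_faults(9, 4, 0.05)
solid = p_too_many_faults(3, 1, 0.01)
```

With these hypothetical numbers the nine-node cluster comes out more reliable, which is the direction of the paper's Raft cost claim, though it says nothing about the throughput differences noted above.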

But the deeper we read into the paper, the more we found ourselves asking: what exactly is a fault curve (p_u), and what is this new model? Is this p_u = 1% per year, per month, per quorum formation? The paper never quite defines fault curves, even though they are central to the argument.

We got even more confused by the paper's conflation of safety and liveness for Raft. FLP (1985) taught us to keep safety and liveness separate. Raft and Paxos are known for prioritizing safety above all. Even when there are more crash failures than F, safety is not violated. So when the paper reports “Raft is only 99.97% safe and live,” the precision is misleading. What does that number even mean? How was it calculated? Also, there is a big difference between "safe OR live" and "safe AND live". Why were the two bunched together in Table 2 for Raft? What is meant here?

The paper says: "As faults are probabilistic, it is always possible for the number of faults to exceed F. Thus no consensus protocol can offer a guarantee stronger than probabilistic safety or liveness." Again, I suspect that "or" (in this case) between safety and liveness is carrying a lot of load. The safety of the Paxos family of protocols relies on the quorum intersection property, so even when F is exceeded, safety is not violated, although liveness could be lost. The paper says "Violating quorum intersection invariants triggers safety violations." But quorum intersection is calculated a priori: the sum of two quorum sizes has to be bigger than N, so this is guaranteed by arithmetic, not probabilistically. We had to hypothesize a lot about why the paper seems to claim some safety violation: Is it maybe some durability loss? Is this assuming Byzantine failure? We still don't have an answer.
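The arithmetic guarantee is easy to state in code; this is a sketch of the textbook majority-quorum argument, not anything from the paper:

```python
# Two quorums over n nodes must share at least one member whenever their
# sizes sum to more than n (pigeonhole). Majority quorums satisfy this
# for every n, which is why the guarantee is deterministic.
def must_intersect(q1_size: int, q2_size: int, n: int) -> bool:
    return q1_size + q2_size > n

def majority(n: int) -> int:
    return n // 2 + 1
```

This holds no matter which nodes crash, which is why exceeding F costs liveness, not safety, in the Paxos family.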

The paper does better with PBFT, separating safety and liveness in their reliability table. But even there, the model feels underspecified. There's a leap from "fault curves exist" to "this quorum configuration gives X nines of safety" without laying out a mathematical foundation.

Another counterintuitive point in the paper was the idea that more nodes can make things worse, probabilistically. For instance, the paper claims that a 5-node PBFT deployment could be significantly more reliable than a 7-node one, because quorum intersection becomes more fragile as the system scales with unreliable nodes. Again, we couldn't really make sense of this claim either, as there was not much explanation for it. 


Is giving up the F faults abstraction worth it?

This is a position paper, and it plays that role well. It surfaces important observations, challenges sacred abstractions, and provokes discussion. It aims to bring consensus modeling into a more probabilistic/complex (real?) world where failure rates vary, telemetry exists, and tradeoffs matter. It advocates for getting rid of the rigid F upper-bound for fault-tolerance. But complexity cuts both ways. A richer, more complex model may capture more nuance, but it can also make reasoning and safety proofs much harder. And clarity, simplicity, and guaranteed fault-tolerance are essential for consensus.

Actually, the paper made me appreciate the F abstraction for faults even more. It is simple, but it makes reasoning simpler in return. It is possible to still be probabilistic and do all that analysis when selecting the F number. These days, due to constant software rollovers, many systems go with F=2 and N=5, or even higher numbers. And the nice thing about the Paxos family of protocols is that, due to quorum intersection, safety is always guaranteed, non-probabilistically, even when the F limit is exceeded by extra crash faults (in reality, network faults and partitions also bunch in here). And there has been good progress in decoupling F from N (thanks to the flexible quorums result), which addresses some of the complaints in the paper (e.g., "Linear size quorums can be overkill"). Moreover, heterogeneous deployments are already considered, and leader selection heuristics exist.

If the goal is to replace the F abstraction, there should be more thought put into what new abstraction is proposed to take over. Abstractions are at the core of Computer Science and Distributed Systems. As one of my favorite Dijkstra quotes says: "The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise."

A Practical Guide to PostgreSQL Replication with Both Asynchronous and Synchronous Standbys

PostgreSQL streaming replication allows a standby server to continuously replicate data from a primary server. It operates by streaming Write-Ahead Log (WAL) segments from a primary server to one or more standby (replica) servers. The WAL segments contain a record of all changes made to the database, including data modifications and schema alterations (specifically, the […]
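As a minimal illustration of such a mixed setup (the standby names here are hypothetical), the primary's configuration might look like:

```ini
# postgresql.conf on the primary: 's1' must confirm each commit
# synchronously; any standby not listed (e.g. 's2') streams the same WAL
# but replicates asynchronously.
synchronous_standby_names = 'FIRST 1 (s1)'
synchronous_commit = on
```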

July 29, 2025

Postgres 18 beta2: large server, sysbench

This has performance results for Postgres 17.4, 17.5, 18 beta1 and 18 beta2 on a large server with sysbench microbenchmarks. Results like this from me are usually boring because Postgres has done a great job at avoiding performance regressions over time. This work was done by Small Datum LLC and not sponsored. Previous work from me for Postgres 17.4 and 18 beta1 is here.

The workload here is cached by Postgres and my focus is on regressions from new CPU overhead or mutex contention.

tl;dr

  • there might be small regressions (~2%) for range queries on the benchmark with 1 client. One cause is more CPU in BuildCachedPlan. 
  • there might be small regressions (~2%) for range queries on the benchmark with 40 clients. One cause is more CPU in PortalRunSelect.
  • otherwise things look great

Builds, configuration and hardware

I compiled Postgres from source using -O2 -fno-omit-frame-pointer.

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4. More details on it are here.

The config file for 17.4 and 17.5 is conf.diff.cx10a_c32r128.

The config files for 18 beta 1 are:
  • conf.diff.cx10b_c8r32
    • uses io_method='sync' to match Postgres 17 behavior
  • conf.diff.cx10c_c8r32
    • uses io_method='worker' and io_workers=32 to do async IO via a thread pool. I eventually learned that 32 is too large but I don't think it matters much on this workload.
  • conf.diff.cx10d_c8r32
    • uses io_method='io_uring' to do async IO via io_uring
Benchmark

I used sysbench and my usage is explained here. To save time I only run 27 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres.

The tests are run using two workloads. For both the read-heavy microbenchmarks run for 300 seconds and write-heavy run for 600 seconds.
  • 1-client
    • run with 1 client and 1 table with 50M rows
  • 40-clients
    • run with 40 clients and 8 tables with 10M rows per table
The command lines to run all tests with my helper scripts are:
  • bash r.sh 1 50000000 300 600 $deviceName 1 1 1
  • bash r.sh 8 10000000 300 600 $deviceName 1 1 40
Results

All files I saved from the benchmark are here.

I don't provide graphs in this post to save time and because there are few to no regressions from Postgres 17.4 to 18 beta2. The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I do provide tables below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for PG 17.4)
When the relative QPS is > 1 then some version is faster than 17.4. When it is < 1 then there might be a regression. Values from iostat and vmstat divided by QPS are also provided. These can help to explain why something is faster or slower because they show how much HW is used per request.

Results: 1-client

Tables with QPS per microbenchmark are here using absolute and relative QPS. All of the files I saved for this workload are here.

For point queries
  • QPS is mostly ~2% better in PG 18 beta2 relative to 17.4 (see here), but the same is true for 17.5. Regardless, this is good news.
For range queries without aggregation
  • full table scan is ~6% faster in PG 18 beta2 and ~4% faster in 17.5, both relative to 17.4
  • but for the other microbenchmarks, PG 18 beta2, 18 beta1 and 17.5 are 1% to 5% slower than 17.4. 
    • From vmstat and iostat metrics for range-[not]covered-pk and range-[not]covered-si this is explained by an increase in CPU/query (see the cpu/o column in the previous links). I also see a few cases where CPU/query is much larger but only for 18 beta2 with configs that use io_method =worker and =io_uring. 
    • I measured CPU using vmstat which includes all CPU on the host so perhaps something odd happens with other Postgres processes or some rogue process is burning CPU. I checked more results from vmstat and iostat and don't see storage IO during the tests.
    • Code that does the vacuum and checkpoint is here, output from the vacuum work is here, and the Postgres logfiles are here. This work is done prior to the range query tests.
For range queries with aggregation
  • there are regressions (see here), but here they are smaller than what I see above for range queries without aggregation
  • the interesting result is for the same query, but run with different selectivity to go from a larger to a smaller range and the regression increases as the range gets smaller (see here). To me this implies the likely problem is the fixed cost -- either in the optimizer or query setup (allocating memory, etc).
For writes
  • there are small regressions, mostly from 1% to 3% (see here).
  • the regressions are largest for the 18beta configs that use io_method=io_uring, that might be expected given the benefits it provides
Then I used Flamegraphs (see here) to try and explain the regressions. My benchmark helper scripts collect a Flamegraph about once per minute for each microbenchmark and the microbenchmarks were run for 5 minutes if read-mostly or 10 minutes if write-heavy. Then the ~5 or ~10 samples from perf (per microbenchmark) are combined to produce one Flamegraph per microbenchmark. My focus is on the distribution of time across thread stacks where there are stacks for parse, optimize, execute and network.
  • For range-covered-pk there is a small (2% to 3%) increase from PG 17.4 to 18 beta2 in BuildCachedPlan (see here for 17.4 and 18 beta2).
  • The increase in CPU for BuildCachedPlan also appears in Flamegraphs for other range query microbenchmarks
Results: 40-clients

Tables with QPS per microbenchmark are here using absolute and relative QPS. All of the files I saved for this workload are here, Postgres logfiles are here, output from vacuum is here and Flamegraphs are here.

For point queries
  • QPS is similar from PG 17.4 through 18 beta1 (see here).
For range queries without aggregation
  • full table scan is mostly ~2% faster after 17.4 (see here)
  • for the other microbenchmarks, 3 of the 4 have small regressions of ~2% (see here). The worst is range-covered-pk and the problem appears to be more CPU per query (see here). Unlike above where the new overhead was in BuildCachedPlan, here it is in the stack with PortalRunSelect.
For range queries with aggregation
  • QPS is similar from PG 17.4 through 18 beta2 (see here)
For writes
  • QPS drops by 1% to 5% for many microbenchmarks, but this problem starts in 17.5 (see here)
  • From vmstat and iostat metrics for update-one (which suffers the most, see here), the CPU per operation overhead does not increase (see the cpu/o column), and the number of context switches per operation also does not increase (see the cs/o column).
  • Also from iostat, the amount of data written to storage doesn't change much.


Morty: Scaling Concurrency Control with Re-Execution

This EuroSys '23 paper reads like an SOSP best paper. Maybe it helped that EuroSys 2023 was in Rome. Academic conferences are more enjoyable when the venue doubles as a vacation.

The Problem

Morty tackles a fundamental question: how can we improve concurrency under serializable isolation (SER), especially without giving up on interactive transactions? Unlike deterministic databases (e.g., Calvin) that require transactions to declare read and write sets upfront, Morty supports transactions that issue dynamic reads and writes based on earlier results.

Transactional systems, particularly in geo-replicated settings, struggle under contention. High WAN latency stretches transaction durations, increasing the window for conflicts. The traditional answer is blind exponential backoff, but that leads to low CPU utilization: TAPIR and Spanner replicas often idle below 17% utilization under contention, as Morty's evaluation experiments show.

Morty's approach is to start from first principles and ask: what limits concurrency under serializability? For this, Morty formalizes serialization windows, defined per transaction and per object, stretching from the commit of the value read to the commit of the value written. Serializability requires these windows not to overlap. (Note that avoiding overlaps is necessary but not sufficient for SER: you also need to prevent cycles through read-only transactions or indirect dependencies, which Morty addresses with commit-time validation checks.)

Figures 1 and 2 illustrate these serialization windows. With re-execution, Morty commits T2 with minimal delay, instead of aborting and retrying. Figure 3 shows another case where re-execution avoids wasted work.
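The non-overlap requirement is just an interval condition; a toy version (with integer commit timestamps standing in for real ones, not Morty's actual representation) looks like:

```python
# Sketch: serialization windows as intervals
# (commit of the value read, commit of the value written).
# Two transactions conflict on an object if their windows overlap.
def windows_overlap(w1: tuple, w2: tuple) -> bool:
    (s1, e1), (s2, e2) = w1, w2
    return s1 < e2 and s2 < e1
```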

Morty Design

Morty's re-execution mechanism hinges on two ideas: read unrolling and a priori ordering.

Read unrolling allows Morty to selectively rewind and recompute parts of a transaction that depended on outdated reads. Rather than aborting the entire transaction, Morty re-executes just the stale portion. This is possible because transactions are written in continuation-passing style (CPS), which makes control flow and data dependencies explicit. CPS is common in asynchronous programming (JavaScript, Go, Java, Python, and libraries like NodeJS, LibEvent, and Tokio) and it maps well to networked databases like Morty.

Morty’s CPS API supports re-execution directly. Get(ctx, key, cont) reads a key and resumes at cont. Put(ctx, key, val) tentatively writes. Commit(ctx, cont) initiates prepare and finalize. Abort drops the execution. Abandon means the transaction is re-executed (often partially) using a continuation from an earlier point. Re-execution reuses the context, shifts the stale read forward, and resumes execution.
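To make the CPS style concrete, here is a toy, single-process sketch using the paper's operation names; the in-memory store and the withdraw logic are my own illustration, not Morty's implementation:

```python
# Toy continuation-passing-style transaction. Only the continuation passed
# to get() would need re-running if the read later turns out to be stale.
store = {"balance": 100}

def get(ctx, key, cont):
    ctx[key] = store[key]           # record the read in the context
    return cont(ctx, store[key])    # resume at the continuation

def put(ctx, key, val):
    ctx.setdefault("writes", {})[key] = val   # tentative write

def commit(ctx):
    store.update(ctx.get("writes", {}))       # finalize tentative writes
    return "committed"

def withdraw(ctx, amount):
    def cont(ctx, balance):         # everything after the read lives here
        if balance < amount:
            return "aborted"
        put(ctx, "balance", balance - amount)
        return commit(ctx)
    return get(ctx, "balance", cont)
```

Because the post-read logic is packaged as `cont`, a stale read can be repaired by re-running just that continuation with a fresher value, instead of restarting the whole transaction from scratch.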

A priori ordering assigns each transaction a speculative timestamp at arrival, defining a total order a la MVTSO. This order is not revised, even if clocks are skewed or messages are delayed. Instead, if execution violates the speculative order (e.g., a read misses a write that should've come earlier), Morty detects the conflict and re-executes the transaction to realign with the original order. The system adapts execution to fit the speculative schedule, not vice versa. The paper claims aborts are rare since re-execution usually succeeds.


I think a key idea in Morty is that, contrary to most approaches, Morty ignores read validity (the guarantee that committed transactions only observe committed data) during execution to expose more concurrency to transactions. It exposes both committed and uncommitted writes to transactions by leveraging MVTSO and allows reads from uncommitted versions. These speculative reads are later validated at prepare time, prior to commit. If a read depended on a write that never committed, or missed a newer write, Morty re-executes the transaction (through an Abandon call) or aborts it as a last resort.

In addition to serialization windows, Morty defines validity windows to measure how long a transaction waits for its inputs to commit. A transaction Ti's validity window on object x starts when its dependency commits and ends when Ti commits. Like serialization windows, overlapping validity windows are disallowed. But unlike serialization windows, Morty doesn't try to align validity windows, and instead focuses on minimizing their span. Long validity windows mean low throughput. Morty shortens validity windows by avoiding unnecessary delays between reads and commits, preventing cascading speculative reads, and favoring re-execution over abort-and-retry.

Re-execution typically occurs during the commit protocol, when replicas must check commit status across a quorum. If they detect a stale read or violated serialization window, they trigger re-execution before finalizing. Validation checks include:

  • Reads didn't miss writes.
  • Other transactions didn't miss our writes.
  • Reads match committed state (no dirty reads).
  • No reads from garbage-collected transactions.

If all pass, replicas vote to commit. Otherwise, they vote to abandon and may supply a new value to trigger re-execution.

But why do replicas vote at all? Because Morty doesn't use Raft-style replica groups with a leader calling the shots. In contrast to the Raft-groups approach, Morty doesn't have a central log or a leader for serializing/ordering all commands. It is closer to TAPIR, and it uses timestamps to assign a speculative order. By integrating concurrency control with replication, Morty aims to improve throughput under contention and achieve low-latency geo-replication. So quorum-based voting ensures consistency and fault-tolerance, as in TAPIR.

Voting ensures that a commit is durable across failures, visible to a majority, and recoverable even if the coordinator crashes. Without this, there's no way to guarantee correctness in a crash or partition.

Recovery is still tricky. Morty replicates across 2f+1 nodes and tolerates f failures. Coordinators may stall, so Morty uses a Paxos-style recovery protocol with view changes: any replica can step up and finalize the commit decision for a failed coordinator. This isn't trivial as it requires care to avoid split-brain and maintain consistency.

Morty's re-execution resembles CockroachDB’s read refresh a bit. CRDB refreshes read timestamps if read spans haven't been overwritten, but it doesn't re-execute application logic and must restart the whole transaction if the refresh fails. Morty, in contrast, semantically rewinds and reruns logic with new values: if one key's value changes, Morty rewinds only the dependent continuation.


Evaluation

The results are impressive. On TPC-C, Morty achieves 7.4x the throughput of Spanner-style-TAPIR, 4.4x of TAPIR, and 1.7x of MVTSO. On the high-contention Retwis benchmark, Morty delivers 96x throughput over TAPIR.

Morty scales with CPU. On Zipfian Retwis, it grows from 7.8k to 35.3k txn/s with 20 cores. Spanner and TAPIR plateau early (at 17% CPU utilization) due to frequent aborts and exponential backoff.


Conclusion

Morty is one of the most technically rich papers on serializability in recent years. It's dense and demanding. It assumes deep familiarity with concurrency control, replication, and async programming. But for those in the distributed systems and databases intersection, Morty is a very rewarding read.

One gripe: the code link is broken. https://github.com/matthelb/morty/


July 28, 2025

How to Perform Rolling Index Builds with Percona Operator for MongoDB

This post explains how to perform a Rolling Index Build on a Kubernetes environment running Percona Operator for MongoDB. Why and when to perform a Rolling Index Build? Building an index requires: CPU and I/O resources, database locks (even if brief), and network bandwidth. If you have very tight SLAs or systems that are already operating […]

Postgres 18 beta2: large server, Insert Benchmark

This has results for the Insert Benchmark with Postgres on a large server. 

There might be small regressions, but I have more work in progress to explain that:

  • for a workload with 1 client and a cached database I see a small increase in CPU/operation (~10%) during the l.i2 benchmark step. I am repeating that benchmark.
  • for a workload with 20 clients and an IO-bound database I see a small decrease in QPS (typically 2% to 4%) during read+write benchmark steps.

Builds, configuration and hardware

I compiled Postgres from source using -O2 -fno-omit-frame-pointer for versions 18 beta2, 18 beta1, 17.5 and 17.4.

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4. More details on it are here.

The config file for Postgres 17.4 and 17.5 is here and named conf.diff.cx10a_c32r128.

For 18 beta1 and beta2 I tested 3 configuration files, and they are here:
  • conf.diff.cx10b_c32r128 (x10b) - uses io_method=sync
  • conf.diff.cx10cw4_c32r128 (x10cw4) - uses io_method=worker with io_workers=4
  • conf.diff.cx10d_c32r128 (x10d) - uses io_method=io_uring
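As a sketch, the three variants could look like this in postgresql.conf (only one variant would be active at a time; `io_method` and `io_workers` are Postgres 18 settings, and the actual config diffs are linked above):

```ini
# x10b: synchronous I/O (matches pre-18 behavior)
io_method = sync

# x10cw4: asynchronous I/O via background I/O worker processes
#io_method = worker
#io_workers = 4

# x10d: asynchronous I/O via Linux io_uring
#io_method = io_uring
```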

The Benchmark

The benchmark is explained here and was run for three workloads:
  • 1 client, cached
    • run with 1 client, 1 table and a cached database
    • load 50M rows in step l.i0, do 16M writes in step l.i1 and 4M in l.i2
  • 20 clients, cached
    • run with 20 clients, 20 tables (table per client) and a cached database
    • for each client/table - load 10M rows in step l.i0, do 16M writes in step l.i1 and 4M in l.i2
  • 20 clients, IO-bound
    • run with 20 clients, 20 tables (table per client) and a database larger than RAM
    • for each client/table - load 200M rows in step l.i0, do 4M writes in step l.i1 and 1M in l.i2
    • for the qr100, qr500 and qr1000 steps the working set is cached, otherwise it is not

The benchmark steps are:

  • l.i0
    • insert X million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts Y million rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and Z million rows are inserted and deleted per table.
    • Wait for N seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of N is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s

Results: overview

The performance reports are here:
The summary section has 3 tables. The first shows absolute throughput for each DBMS tested and benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA. The summary sections are here:
Below I use relative QPS (rQPS) to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from Postgres 17.4.

When rQPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. When it is 0.90 then I claim there is a 10% regression. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with red for <= 0.97, green for >= 1.03 and grey for values between 0.98 and 1.02.
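The rQPS arithmetic and the color cutoffs described above are simple enough to sketch in a few lines of Python (helper names are mine, not from the benchmark scripts):

```python
# Relative QPS and the color buckets used in the report tables.
def rqps(qps_me: float, qps_base: float) -> float:
    return qps_me / qps_base

def bucket(r: float) -> str:
    if r <= 0.97:
        return "red"    # regression of ~3% or more
    if r >= 1.03:
        return "green"  # improvement of ~3% or more
    return "grey"       # within noise

print(bucket(rqps(900, 1000)))   # 10% regression -> red
print(bucket(rqps(1000, 1000)))  # unchanged -> grey
```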

Results: 1 client, cached

Normally I summarize the summary but I don't do that here to save space.

There might be regressions on the l.i2 benchmark step that does inserts+deletes with smaller transactions while l.i1 does the same but with larger transactions. These first arrive with Postgres 17.5 but I will ignore that because it sustained a higher rate on the preceding benchmark step (l.i1) and might suffer from more vacuum debt during l.i2.

From the response time table, 18 beta2 does better than 18 beta1 based on the 256us and 1ms columns.

From the vmstat and iostat metrics, there is a ~10% increase in CPU/operation starting in Postgres 17.5 -- the value in the cpupq column increases from 596 for PG 17.4 to ~660 starting in 17.5. With one client the l.i2 step finishes in ~500 seconds and that might be too short. I am repeating the benchmark to run that step for 4X longer.

Results: 20 clients, cached

Normally I summarize the summary but I don't do that here to save space. Regardless, this is easy to summarize - there are small improvements (~4%) on the l.i1 and l.i2 benchmark steps and no regressions elsewhere.

Results: 20 clients, IO-bound

Normally I summarize the summary but I don't do that here to save space.

From the summary, Postgres did not sustain the target write rates during qp1000 and qr1000 but I won't claim that it should have been able to -- perhaps I need faster IO. The first table in the summary section uses a grey background to indicate that. Fortunately, all versions were able to sustain a similar write rate. This was also a problem for some versions on the qp500 step.

For the l.i2 step there is an odd outlier with PG 18beta2 and the cx10cw4_c32r128 config (that uses io_method=worker). I will ignore that for now.

For many of the read+write tests (qp100, qr100, qp500, qr500, qp1000, qr1000) throughput with PG 18 beta1 and beta2 is up to 5% less than for PG 17.4. The regression might be explained by a small increase in CPU/operation.