A curated list of database news from authoritative sources

May 13, 2025

Amazon CloudWatch Database Insights applied in real scenarios

In this post, we show how you can use Amazon CloudWatch Database Insights for troubleshooting your Amazon RDS and Amazon Aurora resources. CloudWatch Database Insights serves as a database observability solution offering a tailored experience for DevOps engineers, application developers, and database administrators. This tool is designed to accelerate database troubleshooting processes and address issues across entire database fleets, enhancing overall operational efficiency.

May 12, 2025

InnoDB Cluster Setup: Building a 3-Node High Availability Architecture

Modern applications need to be highly available and easy to scale. A three-node MySQL InnoDB Cluster—built on MySQL Group Replication and connected through MySQL Router—provides a reliable way to support critical workloads. To set up this architecture, you start by deploying three MySQL server instances. In this example, the nodes are assigned the following hostname-to-IP […]

May 09, 2025

What is Convex & Why Should Developers Care?

Looking for a backend that feels built for how modern developers actually code? Convex is quickly becoming a go-to choice in 2025 for full-stack devs shipping real-time, collaborative apps without the backend overhead. This video breaks down what Convex is and how it works—from schema design to reactivity. You’ll see how Convex blends TypeScript safety, live queries, serverless functions, and a schema-flexible database into one cohesive developer experience. Whether you’re building a multiplayer game, a live doc editor, or a chat app, this walkthrough helps you figure out if Convex is the right fit—and how it stacks up to Firebase, Supabase, or Lambda-based setups.

Patroni: The Key PostgreSQL Component for Enterprise High Availability

You don’t have time for database downtime. Whether you’re managing a high-traffic application, supporting business-critical transactions, or ensuring real-time analytics stay online, the last thing you need is a PostgreSQL failure that grinds everything to a halt. But let’s be real; high availability (HA) in PostgreSQL isn’t something you can just switch on. Setting up […]

May 08, 2025

Modular verification of MongoDB Transactions using TLA+

Joint work with Will Schultz.

A transaction groups multiple operations into an all-or-nothing logical box to reduce the surface area exposed to concurrency control and fault recovery, simplifying the application programmer's job. Transactions support the ACID guarantees: atomicity, consistency, isolation, durability. Popular isolation levels include Read Committed (RC), Snapshot Isolation (SI), and Serializability (SER), which offer increasing protection against concurrency anomalies.


MongoDB Transactions

MongoDB’s transaction model has evolved incrementally.  

  • v3.2 (2015): Introduced single-document transactions using MVCC in the WiredTiger storage engine.  
  • v4.0 (2018): Extended support to multi-document transactions within a replica set (aka shard).  
  • v4.2 (2019): Enabled fully distributed transactions across shards.

Replica Set Transactions. All transaction operations are first performed on the primary using the WiredTiger transaction workflow. Before the transaction commits, all of its updates are Raft-replicated to the secondaries using the assigned timestamp, to ensure consistent ordering. MongoDB uses Hybrid Logical Clocks (HLCs). The read timestamp reflects the latest stable snapshot. The commit timestamp is issued atomically and advances the cluster time. The default ReadConcern="Snapshot" ensures reads reflect a majority-committed snapshot at a given timestamp without yielding, and WriteConcern="Majority" guarantees writes are durably replicated to a majority.
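As a rough sketch of the general HLC algorithm the post alludes to (an illustration of the standard technique, not MongoDB's actual implementation), an HLC pairs the highest physical time seen with a logical counter, so timestamps respect causality while staying close to wall-clock time:

```python
# Minimal Hybrid Logical Clock (HLC) sketch -- the generic algorithm,
# not MongoDB's implementation. A timestamp is (l, c): l tracks the
# highest physical time seen, c is a logical tiebreaker within l.

class HLC:
    def __init__(self):
        self.l = 0
        self.c = 0

    def now(self, physical_time):
        """Timestamp for a local or send event."""
        if physical_time > self.l:
            self.l, self.c = physical_time, 0
        else:
            self.c += 1  # physical clock hasn't advanced; bump counter
        return (self.l, self.c)

    def recv(self, physical_time, remote):
        """Merge a timestamp received from another node."""
        rl, rc = remote
        m = max(self.l, rl, physical_time)
        if m == self.l == rl:
            self.c = max(self.c, rc) + 1
        elif m == self.l:
            self.c += 1
        elif m == rl:
            self.c = rc + 1
        else:
            self.c = 0
        self.l = m
        return (self.l, self.c)

clock = HLC()
t1 = clock.now(100)            # fresh physical time -> (100, 0)
t2 = clock.now(100)            # time stalled -> counter bumps: (100, 1)
t3 = clock.recv(99, (105, 3))  # remote clock is ahead -> adopt it: (105, 4)
```

The key property is that a message's timestamp is always merged forward, so the receiver's next timestamp dominates the sender's, which is what lets a cluster-wide "cluster time" advance monotonically.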

Distributed Transactions. MongoDB transactions are general interactive transactions, rather than the limited one-shot transactions of DynamoDB. Clients execute a transaction through mongos, the transaction router. Mongos assigns the transaction a cluster-wide read timestamp and dispatches operations to the relevant shard primaries. Each shard primary sets its local WiredTiger read timestamp and handles operations. If a conflict (e.g., a write-write race) occurs, the shard aborts the transaction and informs mongos.

If the transaction is not aborted and the client asks for a commit, the mongos asks the first shard contacted for the transaction to coordinate the 2-phase commit (2PC). Handing off the commit-coordination to a Raft-replicated primary helps ensure durability/recoverability of the transaction. The shard primary launches a standard 2PC:

  • Sends "prepare" to all participant shards.
  • Each shard Raft-replicates a prepare oplog entry with a local prepare timestamp.
  • The coordinator picks the max prepare timestamp returned from the shards as the global commit timestamp.
  • Participant shards Raft-replicate commit at this timestamp and acknowledge back.
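The coordinator's timestamp rule from the steps above can be sketched as follows, with trivial in-memory dictionaries standing in for real shards (which would Raft-replicate their prepare and commit oplog entries):

```python
# Sketch of the 2PC commit-timestamp rule described above: each participant
# returns a local prepare timestamp, and the coordinator picks the max as
# the global commit timestamp. Hypothetical stand-ins for real shards.

def prepare(shard, txn_id):
    """Simulate a shard recording a prepare entry and returning
    its local prepare timestamp."""
    shard["clock"] += 1
    ts = shard["clock"]
    shard["prepared"][txn_id] = ts
    return ts

def two_phase_commit(shards, txn_id):
    # Phase 1: send "prepare" to all participants, collect timestamps.
    prepare_ts = [prepare(s, txn_id) for s in shards]
    # Global commit timestamp = max prepare timestamp, so it is >= every
    # participant's local prepare point.
    commit_ts = max(prepare_ts)
    # Phase 2: participants record the commit at this shared timestamp.
    for s in shards:
        s["committed"][txn_id] = commit_ts
    return commit_ts

shards = [{"clock": c, "prepared": {}, "committed": {}} for c in (10, 42, 7)]
commit_ts = two_phase_commit(shards, "t1")
print(commit_ts)  # 43: the max of prepare timestamps 11, 43, 8
```

Choosing the max guarantees the commit point is not earlier than any shard's prepare point, which is what lets every shard expose the transaction's effects atomically at one timestamp.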

Using Alex Miller's Execution, Validation, Ordering, Persistence (EVOP) framework to describe MongoDB's distributed transactions, we get the following figure. MongoDB overlaps execution and validation. Execution stages local changes at each participant shard. All the while, Validation checks for write-write and prepare conflicts. Ordering and Persistence come later. The global commit timestamp provides the ordering for the transactions. Shards expose changes atomically using this timestamp.

MongoDB provides an MVCC+OCC flavor that prefers aborts over blocking. WiredTiger acquires locks on keys at first write access, causing later conflicting transactions to abort. This avoids delays from waiting, reducing contention and improving throughput under high concurrency.
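A toy model of this abort-over-block policy (assumed semantics for illustration, not WiredTiger's code): the first writer of a key holds it until commit, and a later conflicting writer aborts immediately instead of waiting.

```python
# Toy model of the "prefer aborts over blocking" write-conflict policy
# described above. First writer to touch a key wins; a later conflicting
# writer aborts right away rather than queueing behind the lock.

class WriteConflict(Exception):
    pass

class Store:
    def __init__(self):
        self.locks = {}  # key -> txn currently holding the write lock

    def write(self, txn, key):
        holder = self.locks.get(key)
        if holder is not None and holder != txn:
            # No waiting: the later conflicting transaction aborts.
            raise WriteConflict(f"{txn} conflicts with {holder} on {key}")
        self.locks[key] = txn  # lock acquired at first write access

    def commit(self, txn):
        self.locks = {k: t for k, t in self.locks.items() if t != txn}

store = Store()
store.write("T1", "k1")      # T1 locks k1 at first write access
try:
    store.write("T2", "k1")  # T2 conflicts -> aborts immediately
except WriteConflict as e:
    print("aborted:", e)
store.commit("T1")
store.write("T2", "k1")      # after T1 commits, a retry of T2 succeeds
```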


TLA+ Modeling

Distributed transactions are difficult to reason about. MongoDB’s protocol evolved incrementally, tightly coupling the WiredTiger storage layer, replication, and sharding infrastructure. Other sources of complexity include aligning time across clusters, speculative majority reads, the recovery protocol upon router failure, chunk migration by the catalog, interactions with DDL operations, and fault tolerance.

To reason about MongoDB's distributed transaction protocol formally, we developed the first TLA+ specification of multi-shard database transactions at scale. Our spec is available publicly on GitHub. This spec captures the transaction behavior and isolation guarantees precisely.

Our TLA+ model is modular. It consists of MultiShardTxn, which encodes the high-level sharded transaction protocol, and Storage, which models replication and storage behavior at each shard. This modularity pays off big-time as we discuss in the model-based verification section below.

We validate isolation using a state-based checker based on the theory built by Crooks et al., PODC’17 and the TLA+ library implemented by Soethout. The library takes a log of operations and verifies whether the transactions satisfy snapshot isolation, read committed, etc. This is a huge boost for checking/validating transaction isolation.

Our TLA+ model helps us explore how the choice of read and write concerns (RC/WC) among MongoDB's tunable consistency levels affects transaction isolation guarantees. As MongoDB already tells its customers, "ReadConcern: majority" does not guarantee snapshot isolation. If you use it instead of "ReadConcern: Snapshot", you may get fractured reads: a transaction may observe some, but not all, of another transaction's effects.

Let's illustrate this with a simplified two-shard, two-transaction model from an earlier spec. T1 writes to K1 and K2 (sharded to S1 and S2, respectively) and commits via two-phase commit. T2 reads K1 before T1 writes it and K2 after T1 has committed. Due to `readConcern: majority`, it reads the old value of K1 and the new value of K2, violating snapshot isolation. The read is fractured.
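The anomaly can be reproduced in a few lines: with majority reads each shard independently serves its latest committed value, with no cluster-wide snapshot timestamp tying the two reads together. This is a hand-rolled illustration of the scenario, not the TLA+ model itself:

```python
# Hand-rolled illustration of the fractured read described above:
# T2 reads K1 (on shard 1) before T1's writes land, and K2 (on shard 2)
# after T1 commits. "majority" reads return each shard's latest committed
# value independently, so T2 observes only part of T1's effects.

shard1 = {"K1": "old"}
shard2 = {"K2": "old"}

t2_reads = {}
t2_reads["K1"] = shard1["K1"]   # T2 reads K1 -> "old"

# T1 writes K1 and K2, then commits on both shards via 2PC.
shard1["K1"] = "new"
shard2["K2"] = "new"

t2_reads["K2"] = shard2["K2"]   # T2 reads K2 -> "new"

print(t2_reads)  # {'K1': 'old', 'K2': 'new'} -- a fractured read
```

Under "ReadConcern: Snapshot", both reads would instead be served from a single majority-committed snapshot timestamp, so T2 would see either both old values or both new ones.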

You can explore this violation trace using a browser-based TLA+ trace explorer that Will Schultz built by following this link. The Spectacle tool loads the TLA+ spec from GitHub, interprets it using a JavaScript interpreter, and visualizes the step-by-step state changes. You can step backwards and forwards using the buttons, and explore enabled actions. This makes model outputs accessible to engineers unfamiliar with TLA+. You can share a violation trace simply by sending a link.

Modeling helped us to clarify another subtlety: handling prepared but uncommitted transactions in the storage engine. If the transaction protocol ignores a prepare conflict, T2's read at the t-start snapshot might see a value that appears valid at its timestamp, but is later overwritten by T1's commit at an earlier timestamp, violating snapshot semantics. That means a read must wait on prepared transactions to avoid this problem. This is an example of cross-layer interaction between the transaction protocol and the underlying WiredTiger storage we mentioned earlier.

Finally, some performance stats. Checking Snapshot Isolation and Read Committed with the TLA+ model on an EC2 `m6g.2xlarge` instance takes around 10 minutes for small instances of the problem. With just two transactions and two keys, the space is large but manageable. Bugs, if they exist, tend to show up even in small instances.


Model-based Verification

We invested early in modularizing the storage model (a discipline Will proposed) which paid off. With a clean storage API between the transaction layer and storage engine, we can generate test cases from TLA+ traces that exercise the real WiredTiger code, not just validate traces. This bridges the gap between model and implementation.

WiredTiger, being a single-node embedded store with a clean API, is easy to steer and test. We exploit this by generating unit tests directly from the model. Will built a Python tool that:

  • Parses the state graph from TLC,
  • Computes a minimal path cover,
  • Translates each path into Python unit tests,
  • Verifies that implementation values conform to the model.
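The pipeline in those steps might look roughly like this. It is a simplified sketch: the state graph is hypothetical, a greedy edge-covering walk stands in for a true minimal path cover, and the "implementation" is a trivial stand-in for the real WiredTiger-backed unit tests.

```python
# Simplified sketch of the trace-to-test pipeline described above:
# walk a (hypothetical) state graph like the one TLC can dump, extract
# paths covering every edge, and turn each path into an executable check.

state_graph = {  # hypothetical TLC output: state -> successor states
    "init": ["begin"],
    "begin": ["write", "abort"],
    "write": ["prepare"],
    "prepare": ["commit"],
}

def paths_covering_all_edges(graph, root):
    """Greedily extract root-to-leaf paths until every edge is covered
    (a stand-in for a minimal path cover)."""
    uncovered = {(u, v) for u, vs in graph.items() for v in vs}
    paths = []
    while uncovered:
        path, node = [root], root
        while graph.get(node):
            # Prefer an uncovered edge; otherwise take any successor.
            nxt = next((v for v in graph[node] if (node, v) in uncovered),
                       graph[node][0])
            uncovered.discard((node, nxt))
            path.append(nxt)
            node = nxt
        paths.append(path)
    return paths

def path_to_test(path):
    """Translate one path into a test that replays it against an impl."""
    def test(implementation):
        for step in path:
            assert implementation(step), f"implementation diverged at {step}"
    return test

paths = paths_covering_all_edges(state_graph, "init")
tests = [path_to_test(p) for p in paths]
for t in tests:
    t(lambda step: True)  # stand-in implementation accepts every step
print(len(tests), "generated tests")
```

The real tool's value comes from the last step: each model state carries expected values, so the generated test asserts that the storage engine's observable state matches the model's at every step of the path.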

This approach is powerful: our handwritten unit tests number in the thousands, but the generator produces over 87,000 tests in 30 minutes. Each test exercises the precise contract defined in the model, systematically linking high-level protocol behavior to the low-level storage layer. These tests bridge model and code, turning formal traces into executable validations.


Permissiveness

We use TLA+ to also evaluate the permissiveness of MongoDB’s transaction protocol—the degree of concurrency it allows under a given isolation level without violating correctness. Higher permissiveness translates to fewer unnecessary aborts and better throughput. Modeling lets us quantify how much concurrency is sacrificed for safety, and where the implementation might be overly conservative.

To do this, we compare our protocol's accepted behaviors to abstract commit tests from Crooks et al PODC'17 paper. By comparing the transaction protocol behavior to idealized isolation specs, we can locate overly strict choices and explore safe relaxations.

For example, for read committed, MongoDB's transaction protocol (computed over our finite model/configurations) accepts around 76% of the behaviors allowed by the isolation spec. One source of restriction for Read Committed is prepare conflicts, a mechanism that prevents certain races. Disabling it raises permissiveness to 79%. In one such case, a transaction reads the same key twice and sees different values: a non-repeatable read. Snapshot isolation forbids this, but read committed allows it. MongoDB blocks it, maybe unnecessarily. If relaxing this constraint improves performance without violating correctness, it may be worth reconsidering.


Future Work

Our modular TLA+ specification brings formal clarity to a complex, distributed transaction system. But work remains on the following fronts:

  • Model catalog behavior to ensure correctness during chunk migrations.
  • Extend multi-grain modeling to other protocols.
  • Generate test cases directly from TLA+ to bridge spec and code.
  • Analyze and optimize permissiveness to improve concurrency.

Deploying External Read Replica in MySQL InnoDB Cluster

InnoDB Cluster or ClusterSet topologies already have secondary instances that can act as a failover for the primary or offload read requests. However, with MySQL 8.4, we can now add a separate async replica to the cluster for serving various special/ad-hoc queries or reporting purposes. This also helps offload read traffic away […]

May 07, 2025

Concurrency Control and Recovery in Database Systems: Preface and Chapter 1

I'm catching up on Phil Eaton's book club and just finished the preface and Chapter 1 of Concurrency Control and Recovery in Database Systems by Bernstein, Hadzilacos, and Goodman.

This book came out in 1987: two years before the fall of the Berlin Wall, five years before Windows 3.1, and before the Gray/Reuter book on Transaction Processing.

I was first surprised that "Recovery" featured so prominently in the book title, but then realized that in 1987 there was no solid solution for recovery. The ARIES paper on write-ahead logging came out in 1992.

Once I realized that, the structure made sense: concurrency control (Chapters 1–5), recovery (Chapters 6–7), and a forward-looking chapter on replication (Chapter 8), where they candidly admit: "No database systems that we know of support general purpose access to replicated distributed data."

  1. The Problem
  2. Serializability Theory
  3. Two Phase Locking
  4. Non-Locking Schedulers
  5. Multiversion Concurrency Control
  6. Centralized Recovery
  7. Distributed Recovery
  8. Replicated Data


Chapter 1: The Problem

Chapter 1 motivates the twin problems of concurrency control and recovery. It defines correct transaction behavior from the user’s point of view and introduces an internal system model that will be used throughout the book.

"User's point of view" here means an operational one, focused on histories of reads/writesm rather than the state-based, client-centric view I am fond of. (See my write-up of Seeing is Believing, Crooks et al., PODC 2017.)

With 40 years of hindsight, the definitions in Chapter 1 read as straightforward: transactions, commit/abort, recoverability, cascading aborts, lost updates, serializability. The chapter uses many scenarios to illustrate how transactions interfere and how their effects must sometimes be undone upon abort.

I was initially surprised by the level of detail around undoing writes, until I remembered that this is a single-version model. No MVCC yet. Chapter 5 eventually gets there, explaining its benefits and trade-offs. But in 1987, MVCC was too costly in terms of storage/CPU, and OCC was not considered because of the high-contention workloads of that era. Instead, transactions were implemented with 2PL on resource-constrained single computers. Bernstein himself contributed a lot to the popularization of both MVCC and OCC.

The chapter closes with a model of a database system with four components: transaction manager (TM), scheduler, recovery manager (RM), and cache manager (CM). Transactions issue operations to the TM, which forwards them to the scheduler for concurrency control. The scheduler ensures serializability (and often strictness) by delaying, executing, or rejecting operations. The RM guarantees atomicity and durability, and uses Fetch/Flush from the CM to read/write persistent state. The CM handles movement between volatile and stable storage.

Coordination among modules relies on handshaking: if a module wants two operations ordered (say, write before flush), it must wait for an acknowledgment before issuing the next operation. This is a minimalist model of layered interaction that earns points for clarity and separation of concerns, although real systems today bear little resemblance to it.
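The handshaking discipline is easy to render concretely. Below is a toy rendering of the book's four-module idea with hypothetical Python stand-ins (only the RM and CM are modeled): the recovery manager waits for the ack of a write before issuing the flush that must follow it.

```python
# Toy rendering of the book's layered model and its handshaking rule:
# a module that wants two operations ordered (here, write before flush)
# waits for the first operation's ack before issuing the second.
# Hypothetical stand-ins, not a faithful model of the book's components.

log = []  # global order in which operations take effect

class CacheManager:
    def __init__(self):
        self.volatile, self.stable = {}, {}

    def flush(self, key):
        # Move a value from volatile to stable storage.
        self.stable[key] = self.volatile[key]
        log.append(("flush", key))
        return "ack"  # handshake: caller may now issue its next operation

class RecoveryManager:
    def __init__(self, cm):
        self.cm = cm

    def write(self, key, value):
        self.cm.volatile[key] = value
        log.append(("write", key))
        return "ack"

    def commit(self, key):
        # Ordering via handshake: the write must have been acked
        # (i.e., appear in the log) before the flush is issued.
        assert ("write", key) in log
        return self.cm.flush(key)

cm = CacheManager()
rm = RecoveryManager(cm)
assert rm.write("x", 1) == "ack"   # wait for the write's ack...
assert rm.commit("x") == "ack"     # ...before ordering the flush
print(log)  # [('write', 'x'), ('flush', 'x')]
```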

RocksDB 10.2 benchmarks: large server

This post has benchmark results for RocksDB 10.x, 9.x, 8.11, 7.10 and 6.29 on a large server.

tl;dr

  • There are several big improvements
  • There are no new regressions
  • For the block cache, hyperclock does much better than LRU on CPU-bound tests

Software

I used RocksDB versions 6.0.2, 6.29.5, 7.10.2, 8.11.4, 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.2, 9.7.4, 9.8.4, 9.9.3, 9.10.0, 9.11.2, 10.0.1, 10.1.3, 10.2.1. Everything was compiled with gcc 11.4.0.

For 8.x, 9.x and 10.x the benchmark was repeated using both the LRU block cache (older code) and hyperclock (newer code). That was done by setting the --cache_type argument:

  • lru_cache was used for versions 7.6 and earlier
  • hyper_clock_cache was used for versions 7.7 through 8.5
  • auto_hyper_clock_cache was used for versions 8.5+

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

Benchmark

Overviews on how I use db_bench are here and here.

All of my tests here use 36 threads.

Tests were repeated for 3 workload+configuration setups:

  • byrx - database is cached by RocksDB
  • iobuf - database is larger than RAM and RocksDB uses buffered IO
  • iodir - database is larger than RAM and RocksDB uses O_DIRECT

The benchmark steps named on the charts are:
  • fillseq
    • load RocksDB in key order with 1 thread
  • revrangeww, fwdrangeww
    • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries
  • readww
    • do point queries with a rate-limited writer. Report performance for the point queries.
  • overwrite
    • overwrite (via Put) random keys using many threads

Results: byrx

Performance summaries are here for: LRU block cache, hyperclock and LRU vs hyperclock. A spreadsheet with relative QPS and charts is here.

The graphs below show relative QPS, which is: (QPS for a given version / QPS for the base case). When the relative QPS is greater than one, performance improved relative to the base case. The y-axis doesn't start at zero in most graphs, to make it easier to see changes.

This chart has results for the LRU block cache and the base case is RocksDB 6.29.5:
  • overwrite
    • ~1.2X faster in modern RocksDB
  • revrangeww, fwdrangeww, readww
    • slightly faster in modern RocksDB
  • fillseq
    • ~15% slower in modern RocksDB most likely from new code added for correctness checks

This chart has results for the hyperclock block cache and the base case is RocksDB 8.11.4:
  • there are approximately zero regressions. The changes are small and might be normal variance.

This chart has results from RocksDB 10.2.1. The base case uses the LRU block cache and that is compared with hyperclock:
  • readww
    • almost 3X faster with hyperclock because it suffers the most from block cache contention
  • revrangeww, fwdrangeww
    • almost 2X faster with hyperclock
  • fillseq
    • no change with hyperclock because the workload uses only 1 thread
  • overwrite
    • no benefit from hyperclock because write stalls are the bottleneck

Results: iobuf

Performance summaries are here for: LRU block cache, hyperclock and LRU vs hyperclock. A spreadsheet with relative QPS and charts is here.

The graphs below show relative QPS, which is: (QPS for a given version / QPS for the base case). When the relative QPS is greater than one, performance improved relative to the base case. The y-axis doesn't start at zero in most graphs, to make it easier to see changes.

This chart has results for the LRU block cache and the base case is RocksDB 6.29.5.
  • fillseq
    • ~1.6X faster since RocksDB 7.x
  • readww
    • ~6% faster in modern RocksDB
  • overwrite
  • revrangeww, fwdrangeww
    • ~5% slower since early 8.x

This chart has results for the hyperclock block cache and the base case is RocksDB 8.11.4.
  • overwrite
    • suffered from issue 12038 in versions 8.6 through 9.8. The line would be similar to what I show above had the base case been 8.5 or earlier
  • fillseq
    • ~7% faster in 10.2 relative to 8.11
  • revrangeww, fwdrangeww, readww
    • unchanged from 8.11 to 10.2

This chart has results from RocksDB 10.2.1. The base case uses the LRU block cache and that is compared with hyperclock.

  • readww
    • ~8% faster with hyperclock. The benefit here is smaller than above for byrx because the workload here is less CPU-bound
  • revrangeww, fwdrangeww, overwrite
    • slightly faster with hyperclock
  • fillseq
    • no change with hyperclock because the workload uses only 1 thread

Results: iodir

Performance summaries are here for: LRU block cache, hyperclock and LRU vs hyperclock. A spreadsheet with relative QPS and charts is here.

The graphs below show relative QPS, which is: (QPS for a given version / QPS for the base case). When the relative QPS is greater than one, performance improved relative to the base case. The y-axis doesn't start at zero in most graphs, to make it easier to see changes.

This chart has results for the LRU block cache and the base case is RocksDB 6.29.5.

  • fillseq
    • ~1.6X faster since RocksDB 7.x (see results above for iobuf)
  • overwrite
    • ~1.2X faster in modern RocksDB
  • revrangeww, fwdrangeww, readww
    • unchanged from 6.29 to 10.2

This chart has results for the hyperclock block cache and the base case is RocksDB 8.11.4.

  • overwrite
    • might have a small regression (~3%) from 8.11 to 10.2
  • revrangeww, fwdrangeww, readww, fillseq
    • unchanged from 8.11 to 10.2

This chart has results from RocksDB 10.2.1. The base case uses the LRU block cache and that is compared with hyperclock.

  • there are small regressions and/or small improvements and/or normal variance



May 06, 2025

HTAP Using a Star Query on MongoDB Atlas Search Index

MongoDB is used for its strength in managing online transaction processing (OLTP) with a document model that naturally aligns with domain-specific transactions and their access patterns. In addition to these capabilities, MongoDB supports advanced search techniques through its Atlas Search index, based on Apache Lucene. This can be used for near-real-time analytics and, combined with the aggregation pipeline, adds some online analytical processing (OLAP) capabilities. Thanks to the document model, this analytics capability doesn't require a different data structure, enabling MongoDB to execute hybrid transactional/analytical processing (HTAP) workloads efficiently, as demonstrated in this article with an example from the healthcare domain.

Traditional relational databases employ a complex query optimization method known as "star transformation" and rely on multiple single-column indexes, along with bitmap operations, for efficient ad-hoc queries. This requires a dimensional schema, or star schema, which differs from the normalized operational schema updated by transactions. In contrast, MongoDB can be queried with a similar strategy using its document schema for operational use cases, simply requiring the addition of an Atlas Search index on the collection that stores transaction facts.

To demonstrate how a single index on a fact collection enables efficient queries even when filters are applied to other dimension collections, I utilize the MedSynora DW dataset, which is similar to a star schema with dimensions and facts. This dataset, published by M. Ebrar Küçük on Kaggle, is a synthetic hospital data warehouse covering patient encounters, treatments, and lab tests, and is compliant with privacy standards for healthcare data science and machine learning.

Import the dataset

The dataset is accessible on Kaggle as a folder of comma-separated values (CSV) files for dimensions and facts compressed into a 730MB zip file. The largest fact table that I'll use holds 10 million records.

I download the CSV files and uncompress them:

curl -L -o medsynora-dw.zip "https://www.kaggle.com/api/v1/datasets/download/mebrar21/medsynora-dw"

unzip medsynora-dw.zip

I import each file into a collection, using mongoimport from the MongoDB Database Tools:


for i in "MedSynora DW"/*.csv
do
 mongoimport -d "MedSynoraDW" --file="$i" --type=csv --headerline -c "$(basename "$i" .csv)" -j 8
done

For this demo, I'm interested in two fact tables: FactEncounter and FactLabTest. Here are the fields described in the file headers:

# head -1 "MedSynora DW"/Fact{Encounter,LabTests}.csv

==> MedSynora DW/FactEncounter.csv <==
Encounter_ID,Patient_ID,Disease_ID,ResponsibleDoctorID,InsuranceKey,RoomKey,CheckinDate,CheckoutDate,CheckinDateKey,CheckoutDateKey,Patient_Severity_Score,RadiologyType,RadiologyProcedureCount,EndoscopyType,EndoscopyProcedureCount,CompanionPresent

==> MedSynora DW/FactLabTests.csv <==
Encounter_ID,Patient_ID,Phase,LabType,TestName,TestValue

The fact tables reference the following dimensions:


# head -1 "MedSynora DW"/Dim{Disease,Doctor,Insurance,Patient,Room}.csv

==> MedSynora DW/DimDisease.csv <==
Disease_ID,Admission Diagnosis,Disease Type,Disease Severity,Medical Unit

==> MedSynora DW/DimDoctor.csv <==
Doctor_ID,Doctor Name,Doctor Surname,Doctor Title,Doctor Nationality,Medical Unit,Max Patient Count

==> MedSynora DW/DimInsurance.csv <==
InsuranceKey,Insurance Plan Name,Coverage Limit,Deductible,Excluded Treatments,Partial Coverage Treatments

==> MedSynora DW/DimPatient.csv <==
Patient_ID,First Name,Last Name,Gender,Birth Date,Height,Weight,Marital Status,Nationality,Blood Type

==> MedSynora DW/DimRoom.csv <==
RoomKey,Care_Level,Room Type

Here is the dimensional model, often referred to as a "star schema" because the fact tables are located at the center, referencing the dimensions. Because of normalization, a one-to-many composition in the facts is described in two CSV files, to fit into two SQL tables:

Star schema with facts and dimensions. The facts are stored in two tables in CSV files or a SQL database, but on a single collection in MongoDB. It holds the fact measures and dimension keys, which reference the key of the dimension collections.

MongoDB allows the storage of one-to-many compositions, such as Encounters and LabTests, within a single collection. By embedding LabTests as an array in Encounter documents, this design pattern promotes data colocation to reduce disk access and increase cache locality, minimizes duplication to improve storage efficiency, maintains data integrity without requiring additional foreign key processing, and enables more indexing possibilities. The document model also circumvents a common issue in SQL analytic queries, where joining prior to aggregation may yield inaccurate results due to the repetition of parent values in a one-to-many relationship.

As this would be the right data model for an operational database with such data, I create a new collection, using an aggregation pipeline, that I'll use instead of the two that were imported from the normalized CSV:


db.FactLabTests.createIndex({ Encounter_ID: 1, Patient_ID: 1 });

db.FactEncounter.aggregate([
  {
    $lookup: {
      from: "FactLabTests",
      localField: "Encounter_ID",
      foreignField: "Encounter_ID",
      as: "LabTests"
    }
  },
  {
    $addFields: {
      LabTests: {
        $map: {
          input: "$LabTests",
          as: "test",
          in: {
            Phase: "$$test.Phase",
            LabType: "$$test.LabType",
            TestName: "$$test.TestName",
            TestValue: "$$test.TestValue"
          }
        }
      }
    }
  },
  {
    $out: "FactEncounterLabTests"
  }
]);

Here is how one document looks:

AtlasLocalDev atlas [direct: primary] MedSynoraDW> db.FactEncounterLabTests.find().limit(1)
[
  {
    _id: ObjectId('67fc3d2f40d2b3c843949c97'),
    Encounter_ID: 2158,
    Patient_ID: 'TR479',
    Disease_ID: 1632,
    ResponsibleDoctorID: 905,
    InsuranceKey: 82,
    RoomKey: 203,
    CheckinDate: '2024-01-23 11:09:00',
    CheckoutDate: '2024-03-29 17:00:00',
    CheckinDateKey: 20240123,
    CheckoutDateKey: 20240329,
    Patient_Severity_Score: 63.2,
    RadiologyType: 'None',
    RadiologyProcedureCount: 0,
    EndoscopyType: 'None',
    EndoscopyProcedureCount: 0,
    CompanionPresent: 'True',
    LabTests: [
      {
        Phase: 'Admission',
        LabType: 'CBC',
        TestName: 'Lymphocytes_abs (10^3/µl)',
        TestValue: 1.34
      },
      {
        Phase: 'Admission',
        LabType: 'Chem',
        TestName: 'ALT (U/l)',
        TestValue: 20.5
      },
      {
        Phase: 'Admission',
        LabType: 'Lipids',
        TestName: 'Triglycerides (mg/dl)',
        TestValue: 129.1
      },
      {
        Phase: 'Discharge',
        LabType: 'CBC',
        TestName: 'RBC (10^6/µl)',
        TestValue: 4.08
      },
...

In MongoDB, the document model utilizes embedding and reference design patterns, resembling a star schema with a primary fact collection and references to various dimension collections. It is crucial to ensure that the dimension references are properly indexed before querying these collections.

Atlas Search index

Search indexes are distinct from regular indexes, which rely on a single composite key, as they can index multiple fields without requiring a specific order to establish a key. This feature makes them perfect for ad-hoc queries, where the filtering dimensions are not predetermined.

I create a single Atlas Search index that encompasses all dimensions or measures that I might use in predicates, including those found in an embedded document:

db.FactEncounterLabTests.createSearchIndex(
  "SearchFactEncounterLabTests", {
    mappings: {
      dynamic: false,
      fields: {
        "Encounter_ID":        { "type": "number"  },
        "Patient_ID":          { "type": "token"  },
        "Disease_ID":          { "type": "number"  },
        "InsuranceKey":        { "type": "number"  },
        "RoomKey":             { "type": "number"  },
        "ResponsibleDoctorID": { "type": "number" },
        "CheckinDate":         { "type": "token"  },
        "CheckoutDate":        { "type": "token"  },
        "LabTests":            {
          "type": "document" , fields: {
            "Phase":           { "type": "token"  },
            "LabType":         { "type": "token"  },
            "TestName":        { "type": "token"  },
            "TestValue":       { "type": "number" }
          }
        }
      }
    }
  }
);

Since I don't need extra text searching on the keys, I designate the character string ones as token. I label the integer keys as number. Generally, the keys are utilized for equality predicates. However, some can be employed for ranges when the format permits, such as check-in and check-out dates formatted as YYYY-MM-DD.

In a relational database, the star schema approach emphasizes limiting the number of columns in the fact tables, as they typically contain numerous rows. Smaller dimension tables can hold more columns and are typically denormalized in SQL databases (favoring a star schema over a snowflake schema). Likewise, in document modeling, incorporating all dimension fields would unnecessarily increase the size of the fact collection documents, so it is more straightforward to reference the dimension collections. The general principles of data modeling in MongoDB make it possible to query a star schema without extra consideration, as MongoDB databases are designed around the application's access patterns.

Star query

A star schema allows processing queries which filter fields within dimension collections in several stages:

  1. In the first stage, filters are applied to the dimension collections to extract all dimension keys. These keys typically do not require additional indexes, as the dimensions are generally small in size.
  2. In the second stage, a search is conducted using all previously obtained dimension keys on the fact collection. This process utilizes the search index built on those keys, allowing for quick access to the required documents.
  3. A third stage may retrieve additional dimensions to gather the necessary fields for aggregation or projection. This multi-stage process ensures that the applied filter reduces the dataset from the large fact collection before any further operations are conducted.

For an example query, I aim to analyze lab test records for female patients who are over 170 cm tall, underwent lipid lab tests, have insurance coverage exceeding 80%, and were treated by Japanese doctors in deluxe rooms for hematological conditions.

Search aggregation pipeline

To optimize access to the fact collection and apply all filters there, I begin with a simple aggregation pipeline that starts with a search on the search index. Filters on the fact collection's own fields are applied directly, while additional filters will be incorporated in Stage 1 of the star query. I use a local variable with a compound operator so that more filters can be added for each dimension in Stage 1.

Before going through the star query stages to add filters on dimensions, my query has a filter on the lab type, which is in the fact collection and indexed.

const search =  {
    "$search": {
      "index": "SearchFactEncounterLabTests",
      "compound": {
        "must": [
          { "in":    { "path": "LabTests.LabType" , "value": "Lipids"   } },
        ]
      },
      "sort": { CheckoutDate: -1 }
    }
  }

I have added a "sort" operation to sort the result by check-out date in descending order. This illustrates the advantage of sorting during the index search rather than in later steps of the aggregation pipeline, particularly when a "limit" is applied.

I'll use this local variable to add more filters during Stage 1 of the star query, execute it in Stage 2, and collect documents in Stage 3.

Stage 1: Query the dimension collections

In the first phase of the star query, I obtain the dimension keys from the dimension collections. For every dimension with a filter, I get the dimension keys with a find() on the dimension collection and append a "must" condition to the "compound" of the fact index search.

The following adds the conditions on the Patient (female patients over 170 cm):

search["$search"]["compound"]["must"].push( { in: {
 path: "Patient_ID",                         // Foreign Key in Fact
 value: db.DimPatient.find(                  // Dimension collection
  {Gender: "Female", Height: { "$gt": 170 }} // filter on Dimension
 ).map(doc => doc["Patient_ID"]).toArray() } // Primary Key in Dimension
})

The following adds the conditions on the Doctor (Japanese):

search["$search"]["compound"]["must"].push( { in: {
 path: "ResponsibleDoctorID",               // Foreign Key in Fact
 value: db.DimDoctor.find(                  // Dimension collection
  {"Doctor Nationality": "Japanese" }       // filter on Dimension
 ).map(doc => doc["Doctor_ID"]).toArray() } // Primary Key in Dimension
})

The following adds the condition on the Room (Deluxe):

search["$search"]["compound"]["must"].push( { in: {
 path: "RoomKey",                         // Foreign Key in Fact
 value: db.DimRoom.find(                  // Dimension collection
  {"Room Type": "Deluxe" }                // filter on Dimension
 ).map(doc => doc["RoomKey"]).toArray() } // Primary Key in Dimension
})

The following adds the condition on the Disease (Hematology):

search["$search"]["compound"]["must"].push( { in: {
 path: "Disease_ID",                         // Foreign Key in Fact
 value: db.DimDisease.find(                  // Dimension collection
  {"Disease Type": "Hematology" } // filter on Dimension
 ).map(doc => doc["Disease_ID"]).toArray() } // Primary Key in Dimension
})

Finally, the condition on the Insurance coverage (greater than 80%):

search["$search"]["compound"]["must"].push( { in: {
 path: "InsuranceKey",                         // Foreign Key in Fact
 value: db.DimInsurance.find(                  // Dimension collection
  {"Coverage Limit": { "$gt": 0.8 } }          // filter on Dimension
 ).map(doc => doc["InsuranceKey"]).toArray() } // Primary Key in Dimension
})

All these search criteria have the same shape: a find() on the dimension collection, with the filters from the query, resulting in an array of dimension keys (like a primary key in a dimension table) that is then used to search the fact documents, where those keys act as references (like a foreign key in a fact table).

Each of those steps queried a dimension collection to obtain a simple array of dimension keys, which is added to the aggregation pipeline. Rather than joining tables as in a relational database, the criteria on the dimensions are pushed down to the query on the fact collection.
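
Since every dimension filter has the same shape, the repeated push can be factored into a small helper. This is a hypothetical refactoring, not part of the original pipeline; dimKeys stands for the array that a find().map().toArray() on the dimension would return:

```javascript
// Append one dimension's keys as an "in" clause on the fact search.
// factPath: foreign-key field in the fact collection.
// dimKeys:  array of keys obtained from a find() on the dimension collection.
function addDimensionFilter(search, factPath, dimKeys) {
  search["$search"]["compound"]["must"].push(
    { in: { path: factPath, value: dimKeys } }
  );
}

// Example with precomputed keys (in mongosh these would come from
// db.DimRoom.find({"Room Type": "Deluxe"}).map(doc => doc["RoomKey"]).toArray()):
const search = {
  "$search": {
    index: "SearchFactEncounterLabTests",
    compound: { must: [] },
    sort: { CheckoutDate: -1 }
  }
};
addDimensionFilter(search, "RoomKey", [203]);
```

Each dimension then adds one line instead of one code block, while producing the same "compound" structure as above.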

Stage 2: Query the fact search index

With short queries on the dimensions, I have built the following pipeline search step:

AtlasLocalDev atlas [direct: primary] MedSynoraDW> print(search)
{
  '$search': {
    index: 'SearchFactEncounterLabTests',
    compound: {
      must: [
        { in: { path: 'LabTests.LabType', value: 'Lipids' } },
        {
          in: {
            path: 'Patient_ID',
            value: [
              'TR551',    'TR751',    'TR897',    'TRGT201',  'TRJB261',
              'TRQG448',  'TRSQ510',  'TRTP535',  'TRUC548',  'TRVT591',
              'TRABU748', 'TRADD783', 'TRAZG358', 'TRBCI438', 'TRBTY896',
              'TRBUH905', 'TRBXU996', 'TRCAJ063', 'TRCIM274', 'TRCXU672',
              'TRDAB731', 'TRDFZ885', 'TRDGE890', 'TRDJK974', 'TRDKN003',
              'TRE004',   'TRMN351',  'TRRY492',  'TRTI528',  'TRAKA962',
              'TRANM052', 'TRAOY090', 'TRARY168', 'TRASU190', 'TRBAG384',
              'TRBYT021', 'TRBZO042', 'TRCAS072', 'TRCBF085', 'TRCOB419',
              'TRDMD045', 'TRDPE124', 'TRDWV323', 'TREUA926', 'TREZX079',
              'TR663',    'TR808',    'TR849',    'TRKA286',  'TRLC314',
              'TRMG344',  'TRPT435',  'TRVZ597',  'TRXC626',  'TRACT773',
              'TRAHG890', 'TRAKW984', 'TRAMX037', 'TRAQR135', 'TRARX167',
              'TRARZ169', 'TRASW192', 'TRAZN365', 'TRBDW478', 'TRBFG514',
              'TRBOU762', 'TRBSA846', 'TRBXR993', 'TRCRL507', 'TRDKA990',
              'TRDKD993', 'TRDTO238', 'TRDSO212', 'TRDXA328', 'TRDYU374',
              'TRDZS398', 'TREEB511', 'TREVT971', 'TREWZ003', 'TREXW026',
              'TRFVL639', 'TRFWE658', 'TRGIZ991', 'TRGVK314', 'TRGWY354',
              'TRHHV637', 'TRHNS790', 'TRIMV443', 'TRIQR543', 'TRISL589',
              'TRIWQ698', 'TRIWL693', 'TRJDT883', 'TRJHH975', 'TRJHT987',
              'TRJIM006', 'TRFVZ653', 'TRFYQ722', 'TRFZY756', 'TRGNZ121',
              ... 6184 more items
            ]
          }
        },
        {
          in: {
            path: 'ResponsibleDoctorID',
            value: [ 830, 844, 862, 921 ]
          }
        },
        { in: { path: 'RoomKey', value: [ 203 ] } },
        {
          in: {
            path: 'Disease_ID',
            value: [
              1519, 1506, 1504, 1510,
              1515, 1507, 1503, 1502,
              1518, 1517, 1508, 1513,
              1509, 1512, 1516, 1511,
              1505, 1514
            ]
          }
        },
        { in: { path: 'InsuranceKey', value: [ 83, 84 ] } }
      ]
    },
    sort: { CheckoutDate: -1 }
  }
}

MongoDB Atlas Search indexes, built on Apache Lucene, efficiently handle complex queries with multiple conditions and manage long arrays of values. In this example, a search operation integrates the compound operator with the "must" clause to apply filters across attributes. This capability simplifies query design after resolving complex filters into lists of dimension keys.

With the "search" operation created above, I can run an aggregation pipeline to get the document I'm interested in:

db.FactEncounterLabTests.aggregate([
 search,
])

With my example, nine documents are returned in 50 milliseconds.

Estimate the count

This approach is ideal for queries with filters on many conditions, where none are very selective alone, but the combination is highly selective. Using queries on dimensions and a search index on facts avoids reading more documents than necessary. However, depending on the operations you will add to the aggregation pipeline, it is a good idea to estimate the number of records returned by the search index to avoid runaway queries.

Typically, an application that allows users to execute multi-criteria queries may define a threshold and return an error or warning when the estimated number of documents exceeds it, prompting the user to add more filters. For such cases, you can run a "$searchMeta" on the index before a "$search" operation. For example, the following checks that the number of documents returned by the filter is lower than 10,000:
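
A sketch of such a check, reusing the compound filter assembled during Stage 1 (the exact code was not shown above, so this is my reconstruction using the documented "$searchMeta" count option):

```javascript
// Reuse the compound filter assembled in Stage 1 (abbreviated here):
const compound = {
  must: [
    { in: { path: "LabTests.LabType", value: "Lipids" } }
    // plus the "in" clauses pushed for each dimension
  ]
};

// $searchMeta returns index metadata (here a count) without fetching
// documents; the "lowerBound" type stops counting once the threshold
// is reached, keeping the estimation cheap.
const searchMeta = {
  "$searchMeta": {
    index: "SearchFactEncounterLabTests",
    compound: compound,
    count: { type: "lowerBound", threshold: 10000 }
  }
};
```

In mongosh, `db.FactEncounterLabTests.aggregate([searchMeta])` returns a document like `{ count: { lowerBound: ... } }`; if the lower bound reaches the threshold, the application can ask the user for more filters before running the full "$search".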


Notes from the TLA+ Community Event

I attended the TLA+ Community Event in Hamilton, Ontario on Sunday. Several talks pushed the boundaries of formal methods in the real world by incorporating testing, conformance, model translation, and performance estimation. The common thread: TLA+ isn't just for specs anymore. It's being integrated into tooling: fuzzers, trace validators, and compilers. The community is building bridges from models to reality, and it's a good time to be in the loop.

Below is a summary of selected talks, followed by some miscellaneous observations. This is just a teaser; the recordings will be posted soon on the TLA+ webpage.


Model-Based Fuzzing for Distributed Systems — Srinidhi Nagendra

Traditional fuzzing relies on random inputs and coverage-guided mutation, and works well for single-process software. But it fails for distributed systems due to too many concurrent programs, too many interleavings, and no clear notion of global coverage.

Srinidhi's work brings model-based fuzzing to distributed protocols using TLA+ models for coverage feedback. The approach, ModelFuzz, samples test cases from the implementation (e.g., Raft), simulates them on the model, and uses coverage data to guide mutation. Test cases are not sequences of messages, but of scheduling choices and failure events. This avoids over-generating invalid traces (e.g., a non-leader sending an AppendEntries).

The model acts as a coverage oracle. But direct enumeration of model executions is infeasible because of too many traces, too much instrumentation, too much divergence from optimized implementations (e.g., snapshotting in etcd). Instead, ModelFuzz extracts traces with lightweight instrumentation as mentioned above, simulates them on the model, and mutates schedules in simple ways: swapping events, crashes, and message deliveries. This turns out to be surprisingly effective. They found 1 new bug in etcd, 2 known and 12 new bugs in RedisRaft. They also showed faster bug-finding compared to prior techniques.


TraceLink: Automating Trace Validation with PGo — Finn Hackett & Ivan Beschastnikh

Validating implementation traces against TLA+ specs is still hard. Distributed systems don't hand you a total order. Logs are huge. Instrumentation is brittle. This talk introduced TraceLink, a toolchain that builds on PGo (a compiler from PlusCal to Go) to automate trace validation.

There are three key ideas. First, compress traces by grouping symbolically and using the binary format. Second, track causality using vector clocks, and either explore all possible event orderings (breadth-first) or just one (depth-first). Third, generate diverse traces via controlled randomness (e.g., injecting exponential backoffs between high-level PlusCal steps).

TraceLink is currently tied to PGo-compiled code, but they plan to support plain TLA+ models. Markus asked: instead of instrumenting with vector clocks, why not just log with a high-resolution global clock? That might work too. 

Finn is a MongoDB PhD fellow, and will be doing his summer internship with us at MongoDB Research in the Distributed Systems Research Group.


Translating C to PlusCal for Model Checking — Asterios Tech

Asterios Tech (a Safran defense group subsidiary) works on embedded systems with tiny kernels and tight constraints. They need to verify their scheduler, and manual testing doesn't cut it. So the idea they explore is to translate a simplified C subset to PlusCal automatically, then model check the algorithms for safety in the face of concurrency.

The translator, c2pluscal, is built as a Frama-C plugin. Due to the nature of the embedded programming domain, the C code is limited: no libc, no malloc, no dynamic structures. This simplicity helps in the translation process. Pointers are modeled using a TLA+ record with fields for memory location, frame pointer, and offset. Load/store/macros are mapped to PlusCal constructs. Arrays become sequences. Structs become records. Loops and pointer arithmetic are handled conservatively.

I am impressed that they model pointer arithmetic. This is a promising approach for analyzing legacy embedded C code formally, without rewriting it by hand.


More talks

The "TLA+ for All: Notebook-Based Teaching" talk introduced Jupyter-style TLA+ notebooks for pedagogy supporting inline explanations, executable specs, and immediate feedback.

I presented the talk "TLA+ Modeling of MongoDB Transactions" (joint work with Will Schultz). We will post a writeup soon.

Jesse J. Davis presented "Are We Serious About Using TLA+ For Statistical Properties?". He plans to blog about it.

Andrew Helwer presented "It’s never been easier to write TLA⁺ tooling!", and I defer to his upcoming blog post as well.

Markus Kuppe, who did most of the work organizing the event, demonstrated that GitHub Copilot can solve the Diehard problem with TLA+ in 4 minutes of reasoning, with some human intervention. He said that the TLA+ Foundation and NVidia are funding the "TLAI" challenge, exploring novel AI augmentation of TLA+ modeling.


Miscellaneous

The 90-minute lunch breaks were very European. A U.S. conference would cap it at an hour, and DARPA or NSF would eliminate it entirely: brown-bag lunch through the talks. The long break was great for conversation.

In our workshop, audience questions were frequent and sharp. We are a curious bunch.

The venue was McMaster University in Hamilton, a 90-minute drive from home. Border crossings at the Lewiston-Queenston bridge were smooth, without delays. But questions from border officers still stressed my daughters (ages 9 and 13). I reminded them how much worse we had it when we traveled on visas instead of U.S. citizenship.

My daughters also noticed how everything (roads, buildings, parks) is called Queen's this and Queen's that. My 9-year-old tried to argue that since Canada is so close to the US, and looks so similar to it, it feels more like a U.S. state than a separate country. Strong Trump vibes with that one.

The USD to CAD exchange rate is 1.38. I remembered the two currencies being pretty much at par, so I was surprised. We hadn't visited Canada since 2020. A Canadian friend told me there's widespread discontent about the economy, rent, and housing prices.

Canadians are reputed to be very nice. But drivers were aggressive: cutting in, speeding in mall lots. I also had tense, passive-aggressive encounters with two cashiers and a McMaster staff member. Eh.

MongoDB Benchmarking Made Effortless With MongoDB Workload Generator

Recognizing a gap in the availability of straightforward tools for MongoDB benchmarking, particularly those that do not require complex compilation, configuration, and intricate setups, I developed MongoDB Workload Generator. The aim was to provide MongoDB users with an accessible solution for effortlessly generating realistic data and simulating application workloads on both sharded and non-sharded clusters.  […]

May 05, 2025

Amazon DocumentDB and multi-key indexing

In previous posts, I discussed the limitations of MongoDB emulations on databases like Oracle and PostgreSQL. AWS offers Amazon DocumentDB, which provides compatibility with MongoDB 5.0. It may run on top of Aurora PostgreSQL (a guess based on some similarities, never officially confirmed), but the storage capabilities are those of Aurora.

MongoDB's strength is not just in document storage like a key-value store, but also in its efficient indexing for queries with equality, sorting, and range filtering on flexible schemas with embedded arrays and sub-documents.
None of the compatible options I've tested can execute a simple .find().sort().limit() effectively without excessive document scanning, because of the limitations of the underlying inverted indexes. Is Amazon DocumentDB better?

I created a single-node instance with the highest version available, which is compatible with MongoDB 5.0, an old version (I'm testing this in May 2025).

I created the same collection as in the previous post:

for (let i = 0; i < 10000; i++) {
  db.demo.insertOne({
    a: 1,
    b: Math.random(),
    ts: new Date()
  });
}

db.demo.createIndex({ "a": 1 , ts: -1 }) ;

db.demo.countDocuments( { a: 1 } );

db.demo.find( { a: 1 } ).explain("executionStats")

The index I created is not used; the whole collection is read (COLLSCAN returning 10,000 documents):

rs0 [direct: primary] test> db.demo.countDocuments( { a: 1 } )

10000

rs0 [direct: primary] test> db.demo.find( { a: 1 } 
).explain("executionStats")

{
  queryPlanner: {
    plannerVersion: 1,
    namespace: 'test.demo',
    winningPlan: { stage: 'COLLSCAN' }
  },
  executionStats: {
    executionSuccess: true,
    executionTimeMillis: '10.268',
    planningTimeMillis: '0.124',
    executionStages: {
      stage: 'COLLSCAN',
      nReturned: '10000',
      executionTimeMillisEstimate: '9.803'
    }
  },
  serverInfo: { host: 'docdb-2025-05-02-12-45-45', port: 27017, version: '5.0.0' },
  ok: 1,
  operationTime: Timestamp({ t: 1746190938, i: 1 })
}

All documents match my filter, so perhaps Amazon DocumentDB's query planner is different and considers a full scan valid here.

To validate that the index can be used, I can add a hint:

rs0 [direct: primary] test> db.demo.find( { a: 1 }
).hint({a:1,ts:-1}).explain("executionStats")

{
  queryPlanner: {
    plannerVersion: 1,
    namespace: 'test.demo',
    winningPlan: {
      stage: 'SUBSCAN',
      inputStage: { stage: 'IXSCAN', indexName: 'a_1_ts_-1', direction: 'forward' }
    }
  },
  executionStats: {
    executionSuccess: true,
    executionTimeMillis: '73.495',
    planningTimeMillis: '16.987',
    executionStages: {
      stage: 'SUBSCAN',
      nReturned: '10000',
      executionTimeMillisEstimate: '55.689',
      inputStage: {
        stage: 'IXSCAN',
        nReturned: '10000',
        executionTimeMillisEstimate: '42.151',
        indexName: 'a_1_ts_-1',
        direction: 'forward'
      }
    }
  },
  serverInfo: { host: 'docdb-2025-05-02-12-45-45', port: 27017, version: '5.0.0' },
  ok: 1,
  operationTime: Timestamp({ t: 1746190980, i: 1 })
}

The execution plan is not as verbose as MongoDB's, so this doesn't give much information about the seek and index keys, but at least I know that my index can be used. Note that SUBSCAN is not a MongoDB execution plan stage, and there's no information telling me whether { a: 1 } was filtered efficiently by the index. The next test will tell more.

Time to test what failed on CosmosDB - a simple compound index used for equality and sort:

rs0 [direct: primary] test> db.demo.find( { a: 1 } 
).sort({ts:-1}).limit(10).explain("executionStats")

{
  queryPlanner: {
    plannerVersion: 1,
    namespace: 'test.demo',
    winningPlan: {
      stage: 'SUBSCAN',
      inputStage: {
        stage: 'LIMIT_SKIP',
        inputStage: {
          stage: 'IXSCAN',
          indexName: 'a_1_ts_-1',
          direction: 'forward'
        }
      }
    }
  },
  executionStats: {
    executionSuccess: true,
    executionTimeMillis: '0.398',
    planningTimeMillis: '0.161',
    executionStages: {
      stage: 'SUBSCAN',
      nReturned: '10',
      executionTimeMillisEstimate: '0.200',
      inputStage: {
        stage: 'LIMIT_SKIP',
        nReturned: '10',
        executionTimeMillisEstimate: '0.195',
        inputStage: {
          stage: 'IXSCAN',
          nReturned: '10',
          executionTimeMillisEstimate: '0.193',
          indexName: 'a_1_ts_-1',
          direction: 'forward'
        }
      }
    }
  },
  serverInfo: { host: 'docdb-2025-05-02-12-45-45', port: 27017, version: '5.0.0' },
  ok: 1,
  operationTime: Timestamp({ t: 1746191080, i: 1 })
}

This looks good. Finally, is there a MongoDB-compatible API that can use an index to optimize pagination queries?

This was simple, as the indexed values are all scalar. However, MongoDB's flexible schema allows arrays when a document holds a one-to-many relationship instead of a one-to-one.
In my opinion, this is the main advantage of a document database: when a business rule evolves, you don't have to change the complete data model, which would carry risks for existing data not concerned by the change. I explained an example of this, a multi-city airport, in a previous article.

I added similar documents but with an array of values in "a":

for (let i = 0; i < 10000; i++) {
  db.demo.insertOne({
    a: [0, 1, 2],
    b: Math.random(),
    ts: new Date()
  });
}

In parallel, I did the same on a MongoDB Atlas database, running the genuine document database, version 8.0, to illustrate the expected outcome: the index is utilized efficiently, reading just 10 index keys and fetching 10 documents without subsequent filtering or sorting.

But it is not the same on Amazon DocumentDB where the full collection is scanned, and sorted, before picking the 10 documents of the result:

rs0 [direct: primary] test> db.demo.find( { a: 1 }
).sort({ts:-1}).limit(10).explain("executionStats")

{
  queryPlanner: {
    plannerVersion: 1,
    namespace: 'test.demo',
    winningPlan: {
      stage: 'SUBSCAN',
      inputStage: {
        stage: 'LIMIT_SKIP',
        inputStage: {
          stage: 'SORT',
          sortPattern: { ts: -1 },
          inputStage: { stage: 'COLLSCAN' }
        }
      }
    }
  },
  executionStats: {
    executionSuccess: true,
    executionTimeMillis: '53.092',
    planningTimeMillis: '0.479',
    executionStages: {
      stage: 'SUBSCAN',
      nReturned: '10',
      executionTimeMillisEstimate: '52.572',
      inputStage: {
        stage: 'LIMIT_SKIP',
        nReturned: '10',
        executionTimeMillisEstimate: '52.564',
        inputStage: {
          stage: 'SORT',
          nReturned: '10',
          executionTimeMillisEstimate: '52.561',
          sortPattern: { ts: -1 },
          inputStage: {
            stage: 'COLLSCAN',
            nReturned: '20000',
            executionTimeMillisEstimate: '37.111'
          }
        }
      }
    }
  },
  serverInfo: { host: 'docdb-2025-05-02-12-45-45', port: 27017, version: '5.0.0' },
  ok: 1,
  operationTime: Timestamp({ t: 1746193272, i: 1 })
}

Although I can force the index with a hint, the process results in a full scan of all index entries. This method fails to apply the equality filter on the key and does not utilize the ordering of entries effectively:

rs0 [direct: primary] test> db.demo.find( { a: 1 }
).sort({ts:-1}).limit(10).hint({a:1,ts:-1}).explain("executionStats")

{
  queryPlanner: {
    plannerVersion: 1,
    namespace: 'test.demo',
    winningPlan: {
      stage: 'SUBSCAN',
      inputStage: {
        stage: 'LIMIT_SKIP',
        inputStage: {
          stage: 'SORT',
          sortPattern: { ts: -1 },
          inputStage: {
            stage: 'FETCH',
            inputStage: { stage: 'IXSCAN', indexName: 'a_1_ts_-1' }
          }
        }
      }
    }
  },
  executionStats: {
    executionSuccess: true,
    executionTimeMillis: '27.382',
    planningTimeMillis: '0.241',
    executionStages: {
      stage: 'SUBSCAN',
      nReturned: '10',
      executionTimeMillisEstimate: '27.102',
      inputStage: {
        stage: 'LIMIT_SKIP',
        nReturned: '10',
        executionTimeMillisEstimate: '27.094',
        inputStage: {
          stage: 'SORT',
          nReturned: '10',
          executionTimeMillisEstimate: '27.091',
          sortPattern: { ts: -1 },
          inputStage: {
            stage: 'FETCH',
            nReturned: '20000',
            executionTimeMillisEstimate: '15.592',
            inputStage: {
              stage: 'IXSCAN',
              nReturned: '20000',
              executionTimeMillisEstimate: '4.338',
              indexName: 'a_1_ts_-1'
            }
          }
        }
      }
    }
  },
  serverInfo: { host: 'docdb-2025-05-02-12-45-45', port: 27017, version: '5.0.0' },
  ok: 1,
  operationTime: Timestamp({ t: 1746193437, i: 1 })
}

Surprisingly, the result appears to be ordered when I use an index hint. I verified this on my result, although it may be a side effect of the implementation:

db.demo.find({ a: 1 }).hint({ a: 1, ts: -1 }).toArray().forEach((doc, index, docsArray) => {
    if (index > 0) {
        const previousDoc = docsArray[index - 1];
        if (doc.ts > previousDoc.ts) {
            console.log("The results are not ordered by 'ts' in descending order.");
        }
    }
});

But wait, I inserted a batch of scalars and then a batch of arrays. It is possible that each has a different partial index in the underlying PostgreSQL. I started with only scalar values, so one index, and was able to use its full filtering and ordering capabilities. In the second batch, with all arrays, a second index may have been created for them. The planner can use both indexes, as I've seen when forcing it with a hint, but with two branches in the execution plan, one for each index. With a concatenation, similar to a UNION ALL, PostgreSQL can preserve the order with a merge sort. However, if a bitmap scan is used, the ordering is lost. As my sort() is on the insertion timestamp, it is possible that, by chance, the indexes were scanned in the right order even though the ordering is not guaranteed.
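
This hypothesis can be illustrated with a small standalone simulation (plain JavaScript, not DocumentDB internals): concatenating two individually ordered index scans preserves the global sort order only while their key ranges do not interleave:

```javascript
// Two "index scans", each ordered by ts descending, standing in for two
// partial indexes: one for scalar values of "a", one for arrays.
const scalarScan = [{ ts: 9 }, { ts: 8 }, { ts: 7 }]; // first insert batch
const arrayScan  = [{ ts: 6 }, { ts: 5 }, { ts: 4 }]; // second insert batch

// Helper: is the array ordered by ts descending?
const isDesc = arr => arr.every((d, i) => i === 0 || arr[i - 1].ts >= d.ts);

// Concatenating the two scans (like an append of two index scans):
const concat = [...scalarScan, ...arrayScan];
console.log(isDesc(concat)); // true: ranges don't overlap, order preserved

// Insert one more scalar AFTER the array batch: ranges now interleave.
const scalarScan2 = [...scalarScan, { ts: 1 }];
const concat2 = [...scalarScan2, ...arrayScan];
console.log(isDesc(concat2)); // false: a merge (or sort stage) is now required
```

This matches the observed behavior: the first batches happened to come back ordered, and a single later insert broke the ordering.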

To validate that the ordering can be lost (the reason the query planner adds a sort stage), I inserted one more document with a scalar value and ran my ordering test again:

rs0 [direct: primary] test> db.demo.insertOne( { a: 1 , b: Math.random(), ts: new Date() } )

{
  acknowledged: true,
  insertedId: ObjectId('68151d8420044b5ec3d9aea2')
}

rs0 [direct: primary] test> db.demo.find({ a: 1 }).hint({ a: 1, ts: -1 }).toArray().forEach((doc, index, docsArray) => {
...     if (index > 0) {
...         const previousDoc = docsArray[index - 1];
...         if (doc.ts > previousDoc.ts) {
...             console.log("The results are not ordered by 'ts' in descending order.");
...         }
...     }
... });

The results are not ordered by 'ts' in descending order.

Now that scalar values and arrays are interleaved, reading from two distinct indexes does not preserve the order. That would explain why the query planner cannot optimize the query.

The document model's primary advantage lies in having a single index over a flexible schema, as opposed to two indexes on separate tables in a normalized relational database. If my hypothesis holds true, the existence of multiple indexes due to the flexible schema negates this key benefit of a document database.

I tried multiple combinations of index creation and insertions. If I delete all documents (db.demo.deleteMany({})), the index still behaves like a multi-key index that cannot be used for equality and sort.
Dropping and re-creating the index resets this state. However, if I insert documents with scalars and arrays before creating the index, I get a case where I cannot even force the index with a hint:

rs0 [direct: primary] test> db.demo.dropIndex({ "a": 1 , ts: -1 }) ;

{
  nIndexesWas: 2,
  ok: 1,
  operationTime: Timestamp({ t: 1746215608, i: 1 })
}

rs0 [direct: primary] test> db.demo.insertOne( { a: 1 , b: Math.random(), ts: new Date() } )

{
  acknowledged: true,
  insertedId: ObjectId('681522ba20044b5ec3d9fcc8')
}

rs0 [direct: primary] test> db.demo.insertOne( { a: [ 1,2,3 ] , b: Math.random(), ts: new Date() } )

{
  acknowledged: true,
  insertedId: ObjectId('681522bb20044b5ec3d9fcc9')
}

rs0 [direct: primary] test> db.demo.createIndex({ "a": 1 , ts: -1 }) ;

a_1_ts_-1

rs0 [direct: primary] test> db.demo.find( { a: 1 }
).hint({a:1,ts:-1}).explain("executionStats")  

MongoServerError: Cannot use Hint for this Query. 
Index is multi key index , partial index or sparse index and query is not optimized to use this index.

The available statistics and logs are insufficient to identify the type of index in use. It is neither a MongoDB index, as it doesn't have the same capabilities, nor does it behave like a single PostgreSQL index.
In PostgreSQL, an inverted index such as GIN uses bitmaps for scanning and does not maintain order, even with scalar values. An expression index can support equality, sorting, and range queries, but it cannot be built on a flexible schema with arrays in the key.
I suspect that multiple partial expression indexes are created, and that their full capabilities are only available before a multi-key case is detected. Maybe the SUBSCAN is the operation that combines results from multiple partial indexes to emulate MongoDB's flexible schema, but it loses the optimization of pagination queries.

To effectively compare databases with a compatible API, focus on indexing options and execution plans for your OLTP query patterns. Benchmarks must run over extended periods with increasing data, testing critical query patterns like find().sort().limit(). Always examine execution plans first, to avoid wasting cloud resources on long tests, and compare them with plans that scan MongoDB compound indexes.
Note that Amazon DocumentDB may have other advantages, like integration with other Amazon services and an Aurora-like separation of compute and storage, but when it comes to taking full advantage of a document database, MongoDB is more advanced, and it is available on AWS, as on other clouds, with Atlas.
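
As a closing sketch, one way to automate that first sanity check on a MongoDB explain("executionStats") result (the field names are standard MongoDB executionStats fields; the 2x tolerance is an arbitrary choice of mine):

```javascript
// Rough pagination-efficiency check on explain("executionStats") output:
// an efficient find().sort().limit(n) should examine about n index keys
// and n documents, not the whole collection.
function looksEfficient(explain, limit, tolerance = 2) {
  const s = explain.executionStats;
  return s.totalKeysExamined <= limit * tolerance &&
         s.totalDocsExamined <= limit * tolerance;
}

// Example with a made-up efficient plan: 10 keys and 10 docs for limit 10.
const goodPlan = { executionStats: { totalKeysExamined: 10, totalDocsExamined: 10 } };
// And a made-up inefficient one: a scan over 20,000 documents.
const badPlan  = { executionStats: { totalKeysExamined: 0, totalDocsExamined: 20000 } };

console.log(looksEfficient(goodPlan, 10)); // true
console.log(looksEfficient(badPlan, 10));  // false
```

Note that DocumentDB's explain output, as seen above, only exposes nReturned per stage, so this particular check needs the genuine MongoDB executionStats counters.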