You can understand how MongoDB stores documents internally with simple queries that rely on the physical storage ordering. Some databases store records (called rows or tuples) in heap tables, using their physical location in the data files, such as ROWID in Oracle or CTID in PostgreSQL, to reference those records from index entries. In contrast, databases like MySQL's InnoDB or YugabyteDB store records in the primary key index ("clustered index", or "index organized table"), storing them by the logical order of their primary key values, so that secondary indexes point to these logical locations with the primary key, or an encoded version of it.
MongoDB default collections are similar to heap tables because their documents are stored independently of the primary key ("_id") exposed to the application. Internally, the WiredTiger storage engine organizes collection documents using a B+Tree structure, with an internal RecordId as the key, assigned by MongoDB. This structure resembles a clustered index, but it is clustered on an internal key rather than the primary key.
MongoDB’s approach improves on traditional heap tables, especially for storing variable-size documents, because WiredTiger uses B+Tree nodes for efficient space management, reusing space and splitting pages as needed, rather than relying on settings like PCTFREE or FILLFACTOR to reserve space for updates, or SHRINK/VACUUM operations to defragment after deletes.
To cover all cases: with clustered collections, MongoDB can generate the RecordId from "_id", the primary key exposed to the application, making storage similar to how some databases organize tables in clustered indexes ("_id" can be a generated ObjectId or defined by the application at insert time; a creation example follows the list below). So, when looking at the storage internals, there are two keys:
- "_id" is the application's primary key, a generated surrogate key, or a natural key. It is always indexed, and this unique index, like other secondary indexes, references the document with a RecordId.
- RecordId is the internal key. It can be generated from "_id" (in clustered collections), but is more generally generated as a monotonically increasing 64-bit integer during inserts. It can be considered a physical address by the query layer, but it is not directly mapped to an address in the filesystem because files are B+Tree structures.
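As a minimal sketch of the clustered-collection variant mentioned above (the collection name is arbitrary; clustered collections are available since MongoDB 5.3):
// The RecordId of a clustered collection is derived from "_id", so documents
// are stored in "_id" order rather than in insertion order.
db.createCollection("clusteredExample", {
  clusteredIndex: { key: { _id: 1 }, unique: true, name: "clustered on _id" }
});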
This offers physical data independence since the primary key generation pattern does not affect the storage organization. However, it is helpful to understand how it functions when reviewing execution plans.
Another perspective is that, aside from clustered collections, all indexes in MongoDB, including the primary key index on "_id", are essentially secondary indexes, similar to those found in heap table databases (such as Db2, Oracle, and PostgreSQL). However, instead of a heap table with a fixed block size and row identification tied to their physical location, MongoDB documents are stored within an internal B+Tree index using the WiredTiger engine. Both approaches have their rationale:
- SQL databases are primarily optimized for fixed, normalized schemas with small, uniform row lengths, where a PCTFREE or FILLFACTOR can be set according to the expected updates. Storing larger types involves row chaining or slicing (like Oracle LOB chunks or PostgreSQL TOAST).
- MongoDB is designed for flexible schemas, and collections can contain documents of any size up to the BSON limit. Typically, documents are tens to hundreds of kilobytes, with some larger ones reaching a few megabytes. This flexibility requires adaptable storage management and efforts to minimize fragmentation beyond a small, fixed page size. A B+Tree with a flexible leaf block size is a suitable structure for this purpose.
The document size is flexible thanks to the storage described above, but the ideal document size is a frequent question. Until I write a blog post on this, here's a slide: the green area indicates where access is most efficient; the red side is acceptable for outliers if they don't grow further, but may be a sign of embedding too much; the blue side works for small documents inserted and queried together, but may also be a sign of unnecessary normalization that should be embedded to avoid scattered lookups at runtime.
An example to understand the internal ordering
To demonstrate how it works, I generate ten documents and insert them asynchronously, so they may be written to the database in a random order:
db.collection.drop();
Array.from({ length: 10 }, (_, i) => {
  db.collection.insertOne({
    _id: `#${String(i).padStart(5, '0')}`,
    val: Math.random()
  });
});
If I query without any filter or sort, the query planner chooses a COLLSCAN, which reads the records in RecordId order, which here is the order they were inserted:
test> db.collection.find();
[
{ _id: '#00002', val: 0.07658988613973294 },
{ _id: '#00008', val: 0.39893981577036675 },
{ _id: '#00009', val: 0.5279631881196858 },
{ _id: '#00007', val: 0.8445363162277748 },
{ _id: '#00006', val: 0.01935050813731909 },
{ _id: '#00004', val: 0.0732484258238264 },
{ _id: '#00005', val: 0.7733464850237388 },
{ _id: '#00003', val: 0.3356001641172073 },
{ _id: '#00000', val: 0.8956753135566624 },
{ _id: '#00001', val: 0.4952318922619017 }
]
Keep in mind that I'm working with just a single node here. Sharding and parallel processing might retrieve rows in a different order than how they're stored. You should not rely on any "natural" order. Instead, unless you're conducting this type of investigation, where you're guessing the physical ordering from the query layer, ensure that you use an explicit sort operation to specify the expected order of results.
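For example, a minimal explicit sort on the primary key, instead of relying on natural order:
// Deterministic result order requested at the query layer; the planner may
// use the "_id" index or sort the COLLSCAN output, but the order is guaranteed.
db.collection.find().sort({ _id: 1 });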
I can display the RecordId with .showRecordId(), which adds it to the cursor projection:
test> db.collection.find().showRecordId();
[
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') }
]
Documentation: showRecordId()
Forcing an index with a hint
I can force an index with a hint, for example the index on "_id" which was created automatically:
test> db.collection.find().hint( { _id: 1} ).showRecordId();
[
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]
This runs an IXSCAN instead of a COLLSCAN and returns the documents in the order of the index. You can verify it with .explain(), but it is also perceptible from the order of the documents fetched, which follows the order of "_id" rather than the order of insertion (also called "natural" order) as before.
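As a quick sketch of that verification, assuming the same plan shape as the explain() outputs shown later in this post (a FETCH over an IXSCAN):
// Expect 'IXSCAN' as the input stage of the winning plan when the hint applies
db.collection.find().hint({ _id: 1 }).explain()
  .queryPlanner.winningPlan.inputStage.stage;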
Rather than using a hint, I can add a filter, and the query planner chooses the index. A filter like {$gt:MinKey} or {$lt:MaxKey} does not change the result, but changes the execution plan to an IXSCAN:
test> db.collection.find({_id:{$gt:MinKey}}).showRecordId();
[
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]
An inequality filter such as {$ne:null} also runs an IXSCAN, and we observe the results fetched in that order:
test> db.collection.find({_id:{$ne:null}}).showRecordId();
[
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]
This technique is used to add an unbounded range predicate on the indexed sort field so that the index is used for the sort in the absence of an equality predicate: MongoDB Equality, Sort, Range (ESR) without Equality (SR).
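As a minimal sketch of that pattern on this collection (here a sort on the single-field "_id" index could use the index anyway; the trick matters most with compound indexes, as discussed in that post):
// The unbounded range predicate on the sort field keeps the result unchanged,
// and the planner can run an IXSCAN that returns documents already in "_id"
// order, avoiding a blocking sort stage.
db.collection.find({ _id: { $gt: MinKey } }).sort({ _id: 1 });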
Forcing a full scan with a hint for natural order
Hints specify the index definition, and you may wonder how to force a full scan instead of the index scan chosen by the query planner. Remember that it's an index on RecordId that stores the documents, so you can hint this internal index using the $natural operator, asking for the natural order of the collection documents:
test> db.collection.find({_id:{$ne:null}}).hint({$natural:1}).showRecordId();
[
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') }
]
The documents are fetched in order of RecordId from a COLLSCAN. The hint syntax allows an ascending or descending option to start at the beginning or end of the collection. I'm showing this to explain how records are stored internally. However, if you need a specific order, you should use sort() and let the query planner decide whether to use the index to avoid a sort operation.
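For example, the descending variant starts from the end of the collection and returns the highest RecordId first:
// COLLSCAN in reverse natural (RecordId) order
db.collection.find().hint({ $natural: -1 }).showRecordId();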
MongoDB is more than a NoSQL database:
- Like many NoSQL databases, it allows you to query the indexes directly with .hint(), forcing the access path.
- Like all SQL databases, it has a query planner offering data independence, allowing you to declare the collection and expected order with .sort() and let the database optimize the access path.
Avoid combining storage-level instructions, such as .hint(), .min(), or .max(), with declarative query filters in find() or $match, as this can undermine the query planner's guarantees that results match the query predicates. For example, hinting at a partial index might lead to incomplete results.
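As a minimal sketch of that last pitfall (the partial index, its name, and its filter are hypothetical, and it is dropped afterwards so it does not interfere with the examples below):
// Partial index that only contains documents where val > 0.5
db.collection.createIndex(
  { val: 1 },
  { name: "val_partial", partialFilterExpression: { val: { $gt: 0.5 } } }
);
// Forcing it for a broader predicate may return only the documents
// present in the partial index, i.e. incomplete results
db.collection.find({ val: { $gt: 0 } }).hint("val_partial");
// Clean up the hypothetical index
db.collection.dropIndex("val_partial");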
Covering indexes and "_id" projection
Understanding what is stored in the index entries helps optimize queries to use an index-only scan (covering index).
For example, the following query reads the index on "_id" and projects only "_id" (which is included by default) and "val":
test> db.collection.find(
{ _id: { $ne: null } },
{ val: 1 }
).explain().queryPlanner.winningPlan;
{
isCached: false,
stage: 'PROJECTION_SIMPLE',
transformBy: { val: 1 },
inputStage: {
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { _id: 1 },
indexName: '_id_',
isMultiKey: false,
isUnique: true,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { _id: [ '[MinKey, null)', '(null, MaxKey]' ] }
}
}
}
Because the index on "_id" holds only the key ("_id") and RecordId, it must fetch the document (FETCH) before the projection (PROJECTION_SIMPLE). Even if it is a primary index from the application's point of view, it is physically equivalent to a secondary index.
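Conversely, a projection limited to the index key alone can be covered by the "_id" index; a minimal sketch, expecting a PROJECTION_COVERED stage with no FETCH:
// Only "_id" is projected, and it is the key of the '_id_' index,
// so the documents themselves do not need to be fetched.
db.collection.find(
  { _id: { $ne: null } },
  { _id: 1 }
).explain().queryPlanner.winningPlan.stage;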
I can see the same FETCH with another secondary index:
test> db.collection.createIndex( { val: 1 } );
val_1
test> db.collection.find(
{ val:{$gt: 0} },
{ val: 1 }
).explain().queryPlanner.winningPlan;
{
isCached: false,
stage: 'PROJECTION_SIMPLE',
transformBy: { val: 1 },
inputStage: {
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { val: 1 },
indexName: 'val_1',
isMultiKey: false,
multiKeyPaths: { val: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { val: [ '(0, inf.0]' ] }
}
}
}
Such a query projects "_id" because it is included by default, so the index on "val" does not cover all projected fields. To avoid the FETCH, I need to exclude "_id" from the projection explicitly:
test> db.collection.find(
{ val:{$gt: 0} },
{ val: 1 , _id: 0 }
).explain().queryPlanner.winningPlan;
{
isCached: false,
stage: 'PROJECTION_COVERED',
transformBy: { val: 1, _id: 0 },
inputStage: {
stage: 'IXSCAN',
keyPattern: { val: 1 },
indexName: 'val_1',
isMultiKey: false,
multiKeyPaths: { val: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { val: [ '(0, inf.0]' ] }
}
}
Another possibility: if I need to project "_id", I can add it to the index definition, making it a covering index for my query:
test> db.collection.createIndex( { val: 1 , _id: 1 } );
val_1__id_1
test> db.collection.find(
{
by Franck Pachot
Percona Database Performance Blog
As enterprise software vendors race toward proprietary cloud ecosystems, some features long relied upon by businesses are being quietly deprecated. One recent example is MongoDB Enterprise Advanced and Atlas dropping support for LDAP authentication, a foundational identity protocol for countless organizations. At Percona, we’re taking a different path. We’ve supported LDAP in Percona Server for MongoDB for […]
by Radoslaw Szulgo
August 06, 2025
Murat Demirbas
This paper from HotStorage'25 presents OrcaCache, a design proposal for a coordinated caching framework tailored to disaggregated storage systems. In a disaggregated architecture, compute and storage resources are physically separated and connected via high-speed networks. These have become increasingly common in modern data centers, as they enable flexible resource scaling and improved fault isolation. (Follow the money, as they say!) But accessing remote storage introduces serious latency and efficiency challenges. The paper positions OrcaCache as a solution to mitigate these challenges by orchestrating caching logic across clients and servers. Important note: in the paper's terminology, the server means the storage node, and the client means the compute node.
As we did last week for another paper, Aleksey and I live-recorded our reading/discussion of this paper. We do this to teach the thought-process and mechanics of how experts read papers in real time. Check our discussion video below (please listen at 1.5x, I sound less horrible at that speed). The paper I annotated during our discussion is also available here.
The problem
Caching plays a crucial role in reducing the overheads of disaggregated storage, but the paper claims that current strategies (client-local caching, server-only caching, and independent client-server caching) fall short. Client-local caching is simple and avoids server overhead but underutilizes memory on the server. Server-only caching can reduce backend I/O pressure but comes at the cost of network round-trips and significant server CPU load. Independent client-server caching combines the two but lacks coordination between the caches, leading to data duplication, inefficient eviction and prefetching policies, and causes fairness issues in multi-client environments.
The proposed design
OrcaCache proposes to address these shortcomings by shifting the cache index and coordination responsibilities to the client side. Clients maintain a global view of the cache and communicate directly with the server-side cache using RDMA, which enables bypassing the server CPU in the common case. Server-side components are minimized to a daemon that tracks resource usage and allocates memory based on fairness and pressure.
Discussion
OrcaCache stops short of addressing the core system-level challenges in a realistic multi-client deployment. A single-server, single-client setup is used in the experiments in Figure 1, and also for most of the description in the paper. The paper's solution for dealing with multiple clients is to use a separate namespace for each client, but then, on the server side, this uses up a lot of resources and causes duplication of cached items. There is no mutual benefit or collaboration among clients in this setup.
The paper also mentions how clients could interact with a server-side daemon, how RDMA-based lookups and cache updates would be issued, and how resources might be allocated based on monitored pressure, but many of these mechanisms remain speculative. The authors mention flexible eviction and prefetching but do not explore the complexity of maintaining consistency or fairness across diverse workloads. AI/ML workloads are mentioned or alluded to, but not actually tested in the paper.
In the end, the paper's contribution lies more in reopening a line of thought from 1990s cooperative caching and global memory management research: how to make cache coherence across disaggregated compute and storage both efficient and scalable. The idea OrcaCache seems to lean on is that rather than burden the server, it makes the client responsible for coordination, enabled by fast networks and abundant memory.
Also despite the title, there was not much Tango in the paper. It was mostly cache.
by Murat (noreply@blogger.com)
Percona Database Performance Blog
If you’re running MySQL 8.0 databases, you need to know this: Oracle will stop supporting them in April 2026. That means no more security patches, bug fixes, or help when things go wrong. Maybe you’re thinking, “But April 2026 feels far away!“. But once that date hits, every day you keep running MySQL 8.0 makes […]
by David Quilty
Murat Demirbas
This paper from SIGMOD 2016 proposes a transaction healing approach to improve the scalability of Optimistic Concurrency Control (OCC) in main-memory OLTP systems running on multicore architectures. Instead of discarding the entire execution when validation fails, the system repairs only the inconsistent operations to improve throughput in high-contention scenarios.
If this sounds familiar, it's because we recently reviewed the Morty paper from EuroSys 2023, which applied healing ideas to interactive transactions using continuations to support re-execution. This 2016 Transaction Healing paper is scoped to static stored procedures, and focuses more on integrating healing into OCC for stored procedures.
Key Ideas
OCC works well under low contention because it separates reads from writes and keeps critical sections short (only for validation). But under high contention, especially in workloads with skewed access patterns (like Zipfian distributions), transactions are frequently invalidated by concurrent updates. The naive OCC response of abort and restart leads to wasting CPU cycles and degrading cache locality.
Transaction healing aims to address this problem by observing/betting that most validation failures affect only a subset of a transaction's operations. If only the affected operations can be detected and recovered, the system can avoid redoing the entire transaction. They implement this by leveraging two components.
First, a static analysis phase extracts operation dependencies from the stored procedure a priori. The dependency analysis distinguishes between two types of relations: key-dependencies, where the result of one operation determines the lookup key for another; and value-dependencies, where the value produced by one operation is used in a subsequent one. With this graph in hand, transaction healing can surgically repair any non-serializable operation at runtime.
Second, a runtime access cache, maintained per thread, tracks the behavior of each executed operation (its inputs, outputs, effects, and the memory addresses it accessed) and identifies conflicted parts of a transaction at runtime. The access cache supports this by recording memory addresses (avoiding repeated index lookups) and allowing efficient reuse of unaffected results.
Transaction healing
The healing process is triggered during the validation phase, when an inconsistency is detected in the read/write set. Rather than aborting immediately, the system identifies the earliest affected operation (using its dependency graph), and restores it. If the operation is value-dependent, healing updates its effects based on cached inputs and outputs. If it's key-dependent, a re-execution is necessary since the accessed record may change. The healing propagates forward through the dependency graph, recursively restoring all operations affected by the initial inconsistency.
The healing mechanism is built to preserve serializability. Validation acquires locks in a globally consistent order (e.g., sorted by memory address) to avoid deadlocks. If during healing a lock must be acquired out of order (e.g., due to new dependencies introduced by re-executed operations), the transaction is aborted in order not to risk a deadlock. The paper says this situation is rare due to validation-order optimizations. Despite occasional aborts, transaction healing guarantees forward progress and eventual termination: each transaction's read/write set is finite and every element is validated at most once, which ensures that healing either succeeds or fails definitively.
Evaluation Highlights
They implemented a C++ in-memory database engine, THEDB, to test these ideas. THEDB employs LLVM to perform static dependency analysis on stored procedures and includes support for standard database features like inserts, deletes, and range queries (the latter protected against phantoms via B+-tree versioning, as in Silo). The authors evaluate THEDB on a 48-core AMD machine using two common benchmarks: TPC-C and Smallbank. THEDB is compared against five systems: variants of OCC (including Silo-style), 2PL, a hybrid OCC-2PL approach, and a deterministic partitioned system.
The results show that, under high contention, THEDB significantly outperforms the alternatives, achieving up to 6.2x higher throughput than Silo and approaching the performance of an idealized OCC system with validation disabled. This shows that transaction healing adds minimal overhead and successfully eliminates the restart costs that dominate OCC's performance under load. Moreover, THEDB maintains stable throughput as contention increases (e.g., under more skewed Zipfian distributions), while traditional OCC and Silo degrade rapidly. Scalability is also great up to 48 cores.
Discussion
What are the limitations of the static analysis used?
Transaction healing proposed here is limited to stored procedures because it relies on static dependency extraction. Unlike Morty, which handles interactive transactions using runtime continuations, this work cannot deal with dynamic control flow or unknown transaction logic at runtime. As a result, ad-hoc queries revert to standard OCC, where any healing benefit is lost.
On the other hand, there is some subtlety here. Transaction healing does not require read/write sets to be declared in advance as the deterministic systems like Calvin do. Deterministic systems must know the exact records a transaction will access before it begins execution, so they can assign transactions to partitions and establish a global execution order. Transaction healing avoids this rigidity. It doesn't need to know which specific records a transaction will access ahead of time. Instead, it relies on static analysis to extract the structure of the transaction logic, namely which operations depend on which others. These dependencies, such as key or value dependencies between operations, are known statically because the transaction logic is written as a stored procedure. But the actual keys and values involved are discovered dynamically as the transaction executes. The system uses an access cache to record which memory locations were read or written, and validation happens afterward. This flexibility allows transaction healing to support dynamic, cross-partition access patterns without prior declaration.
How does this compare with Morty?
Transaction Healing is designed for in-memory OLTP systems running with OCC on multicore machines, where the workload consists of static stored procedures. Morty, in contrast, is built for a distributed geo-replicated system and handles interactive transactions with dynamic control flow. It uses MVTSO, with speculative execution and a priori ordering. Unlike THEDB, Morty allows transactions to read from uncommitted versions, exposing concurrency that traditional systems suppress. It tracks execution through continuation-passing style (CPS) in order to make control dependencies explicit and enable partial re-execution of logic branches. While transaction healing employed LLVM to automatically perform static dependency analysis on stored procedures, Morty did not automate the translation of transaction programs into CPS programs. Finally, since it is distributed and deployed over WAN, Morty integrates concurrency control with replication to reduce latency and uses quorum voting to maintain fault-tolerant correctness without centralized logging.
by Murat (noreply@blogger.com)
Tinybird Engineering Blog
Learn ClickHouse® ReplacingMergeTree with examples and real-world use cases. Master deduplication, upserts, and streaming data performance tuning.
by Cameron Archer
August 05, 2025
Percona Database Performance Blog
PostgreSQL 18 is on the way, bringing a set of improvements that many organizations will find useful. It’s not a revolutionary release, but it does move things in a good direction, especially in performance, replication, and simplifying daily operations. For teams already using PostgreSQL, it’s a good time to look into what’s new. For others […]
by Jan Wieremjewicz
Murat Demirbas
This paper (PODC'2016) presents a clean and declarative treatment of Snapshot Isolation (SI) using dependency graphs. It builds on the foundation laid by prior work, including the SSI paper we reviewed recently, which had already identified that SI permits cycles with two adjacent anti-dependency (RW) edges, the so-called inConflict and outConflict edges. While the SSI work focused on algorithmic results and implementation, this paper focuses more on the theory (this is PODC after all) of defining a declarative dependency-graph-based model for SI. It strips away implementation details such as commit timestamps and lock management, and provides a purely symbolic framework. It also proves a soundness result (Theorem 10), and leverages the model for two practical static analyses: transaction chopping and robustness under isolation-level weakening.
Soundness result and dependency graph model
Let's begin with Theorem 10, which establishes both the soundness and completeness of the dependency graph characterization of SI. The soundness direction states that any dependency graph satisfying the SI condition (i.e., every cycle contains at least two adjacent RW edges) corresponds to a valid SI execution. The completeness direction, which follows from prior work, asserts that every valid SI execution induces such a dependency graph. The proof of soundness is technically involved, requiring the authors to construct valid SI executions from dependency graphs by solving a system of relational constraints that preserve the required visibility and ordering properties.
Building on foundational work by Adya, this model represents executions as graphs whose nodes are transactions and whose edges capture observable transactional dependencies in terms of 3 edge types: write-read (WR), write-write (WW), and the anti-dependency capturing read-write (RW) edges. The SI, Serializability (SER), and Parallel SI (PSI) isolation levels are then defined in terms of the structural properties in these graphs, specifically by the presence or absence of certain cycles. This abstraction supports symbolic reasoning about anomalies like write skew or long fork manifest as specific, checkable subgraphs. Informally, a WR edge from T to S means that S reads T’s write to some object x; a WW edge means that S overwrites T’s write; and a RW edge indicates that S overwrites the value of x read by T, introducing an anti-dependency.
Definition 4 and Figure 1 provide an elegant axiomatization of abstract executions. The visibility relation (VIS) must always be a subset of the commit order (CO), and in the case of Serializability, the two are equal. In my mind, this captures the key conceptual divide between SER and SI: Serializability enforces a total order over committed transactions, whereas SI permits partial orders.
Figure 2 illustrates the anomalies that differentiate SER, SI, and PSI. Figure 2(d) captures the classic write skew anomaly, which SI allows but SER prohibits. This scenario arises when two transactions read disjoint keys and then write disjoint values based on those reads, each unaware of the other's effects. SI permits this since it allows partial visibility so long as snapshots are consistent. On the other hand, the long fork anomaly shown in Figure 2(c) is prohibited by SI but allowed by PSI, which weakens the snapshot guarantees further.
Applications of the model
The second half of the paper shows applications of the model for static analyses. The first application is transaction chopping, where large transactions are split into smaller subtransactions to improve performance. The challenge here is to ensure that the interleaving of chopped pieces does not introduce new behaviors/anomalies that the original monolithic transaction would have prevented. This is captured through spliceability: whether an execution of chopped transactions can be "stitched back" into an execution that would have been legal under SI for the unchopped program. Spliceability is formulated through a chopping graph, which augments standard dependencies with session-local ordering among chopped subtransactions. A cycle in the chopping graph DCG(G) is considered critical if (1) it does not contain two occurrences of the same vertex, (2) it includes a sequence of three edges where a conflict edge is followed by a session (predecessor) edge and then another conflict edge, and (3) any two RW (anti-dependency) edges in the cycle are not adjacent. Such critical cycles represent dependency patterns that cannot be reconciled with the atomicity guarantees expected by the original transaction, and thus cannot be realized under SI. Figures 4, 5 and 6 illustrate how small structural differences in the chop can lead to either results that are sound (Figure 6) or unsound (Figure 5 creates a critical cycle). Compared to serializability, SI's more relaxed visibility rules allow for a wider range of safe chops, but care must still be taken to avoid dependency structures that violate snapshot consistency.
The second application of the dependency graph model is in analyzing robustness across isolation levels. The central question is whether a program behaves identically under SI and a weaker or stronger model. An interesting case here is the relation between SI and Parallel SI (PSI). We covered PSI in our earlier review of Walter (SOSP 2011). PSI weakens SI by discarding the prefix requirement on snapshots: it ensures only that visibility is transitive, not that it forms a prefix of the commit order. Thus, PSI admits behaviors that SI prohibits. Theorem 22 formalizes one such divergence. It shows that if a cycle in the dependency graph contains at least two RW edges and no two of them are adjacent, then this cycle is allowed under PSI but not under SI. This captures the long fork anomaly, in which concurrent writers are seen inconsistently by different readers (each reader forming a different branch of the history).
To illustrate the long fork, consider a cycle where T1 and T2 are concurrent writers, and two readers, T3 and T4, observe them inconsistently.
- T1 --WR--> T3
- T2 --WR--> T4
- T3 --RW--> T2
- T4 --RW--> T1
In this scenario, T3 sees T1's write but not T2's, and T4 sees T2's write but not T1's. Both readers construct transitive but incompatible snapshots that fork the timeline. SI prohibits this because it cannot construct prefix-closed snapshots that explain both T3 and T4's observations. But since PSI lacks the prefix constraint, it allows this behavior, while still disallowing anomalies like lost update (through its NOCONFLICT axiom).
Robustness from SI to PSI therefore requires ruling out that specific structural pattern: cycles with multiple RW edges where none are adjacent. If such a cycle appears in the dependency graph, PSI will admit the behavior, while SI will not, and robustness would fail.
Discussion
This does invite comparison to the Seeing is Believing (SiB) paper (PODC'17), one of my favorite papers, and its state-centric formulation of isolation guarantees. In SiB, executions are modeled as sequences of global states and snapshots. Transactions observe one of these states and transition the system to a new one. Isolation models are defined in terms of whether there exists a sequence of global states consistent with the observations and effects of each transaction.
While structurally different, the two models are not in conflict. It appears feasible to translate between the dependency graph and state-centric views. The SI model used in this PODC 2016 paper adopts a declarative, axiomatic approach centered on visibility and commit order that is already close to SiB.
For static program analysis, the dependency graph model seems to offer advantages. By abstracting away from global states, it allows symbolic reasoning directly over transactional dependencies. This makes it well-suited to analyses like transaction chopping and robustness checking, which rely on detecting structural patterns such as cycles with certain edge configurations. While the SiB model is semantically expressive and well-suited to observational reasoning, it may be less conducive to structural checks like cycle-freedom or anti-dependency adjacency.
by Murat (noreply@blogger.com)
Franck Pachot
A benchmark sponsored by EDB, a PostgreSQL company, in 2019 contributed to the myth that MongoDB transactions are slow. Even though the work was done by the reputable OnGres team, the code wasn't properly designed to test MongoDB's scalability. At that time, the feature was new, likely not well-documented, and some demos overlooked the retry logic. In this context, no one is to blame for past publications, but analyzing this benchmark will help prevent the spread of these myths.
MongoDB uses lock-free optimistic concurrency control (OCC), failing on conflict as soon as a write detects concurrent changes to the MVCC snapshot. This requires applications to manage transient errors differently than traditional RDBMS with pessimistic locking and wait-on-conflict behavior. The benchmark developers, PostgreSQL experts, likely missed this because they based the benchmark on a MongoDB demo focused on capabilities, not performance, which neglected proper concurrency control.
We should disregard this benchmark today, but this blog post series offers an opportunity to analyze its flaws, debunk myths, and educate readers on effective transaction handling in MongoDB applications.
The problematic code in the MongoDB 4.0 demo from 7 years ago was:
def run_transaction_with_retry(functor, session):
    assert (isinstance(functor, Transaction_Functor))
    while True:
        try:
            with session.start_transaction():
                result = functor(session)  # performs transaction
                commit_with_retry(session)
            break
        except (pymongo.errors.ConnectionFailure, pymongo.errors.OperationFailure) as exc:
            # If transient error, retry the whole transaction
            if exc.has_error_label("TransientTransactionError"):
                print("TransientTransactionError, retrying "
                      "transaction ...")
                continue
            else:
                raise
    return result
It was translated to Java in the benchmark code as:
private void runWithRetry() {
while (true) {
try {
runnable.run();
break;
} catch (RetryUserOperationException ex) {
retryMeter.mark();
continue;
}
}
}
If you are familiar with Optimistic or Fail-on-Conflict Concurrency Control, you may recognize a significant issue: there is no wait (backoff) before retry. With such an infinite loop, high concurrency access acts like a DDoS attack on the database, rather than resolving contention.
A typical retry loop implements exponential backoff, and here is an example:
private void runWithRetry() {
final long initialDelayMillis = 5; // start with 5ms
final long maxDelayMillis = 1000; // max wait of 1s
long delay = initialDelayMillis;
while (true) {
try {
runnable.run();
break;
} catch (RetryUserOperationException ex) {
retryMeter.mark();
try {
// jitter by up to 50% to avoid thundering herd
long jitter = (long) (Math.random() * delay / 2);
long sleep = delay + jitter;
Thread.sleep(sleep);
} catch (InterruptedException ie) {
Thread.currentThread().interrupt();
// Optionally log or handle interruption here
throw new RuntimeException("Retry loop interrupted", ie);
}
delay = Math.min(delay * 2, maxDelayMillis);
continue;
}
}
}
This code makes the first retry wait 5–7.5 ms, then 10–15 ms, 20–30 ms, and so on, up to 1000–1500 ms. If you use Spring Data, you can simply annotate your @Transactional method with:
@Retryable(
value = RetryUserOperationException.class,
maxAttempts = 10,
backoff = @Backoff(
delay = 5, // first delay in ms
maxDelay = 1000, // max delay in ms
multiplier = 2, // exponential
random = true // adds jitter
)
)
MongoDB employs a similar approach for auto-commit single-document transactions, transparently, so that it appears as if the application is waiting for a lock to be acquired. However, it cannot automatically cancel and retry explicit transactions where the application might perform non-transactional work, such as writing to a file, pushing to a queue, or sending an email. For transactions involving multiple statements, no database can automatically retry the process. The application itself must handle retries.
In PostgreSQL, a conflict might cause a serialization error under higher isolation levels, and even under the Read Committed isolation level, deadlocks can still occur. PostgreSQL locks data while writing during a transaction using two-phase locking and typically waits for the lock to be released. In this case, the impact of an inefficient retry loop is minimal.
However, MongoDB is optimized for high concurrency, allowing it to avoid holding locks between database calls. Instead of waiting, it detects write conflicts instantly and raises a retriable error. Therefore, implementing an efficient retry mechanism is essential.
As I mentioned earlier, there's no one to blame for the benchmark's flaws, as it was created when transactions in MongoDB were relatively new and perhaps not well documented. The problem is that people still reference this benchmark without understanding what was wrong. The poor performance was due to unnecessary retries because there was no backoff implemented in the retry loop.
The authors of the benchmark have been looking for documentation that they believe explains this behavior, which likely contributed to their decision not to implement backoff in the application, mistakenly thinking it was handled by the database:
Since the probability of collision increases (possibly exponentially) with the effective number of transactions processed, it follows that MongoDB is more eager to retry transactions. This is consistent with the expectation set on MongoDB’s documentation about transactions and locking, which states that “by default, transactions waits up to 5 milliseconds to acquire locks required by the operations in the transaction. If the transaction cannot acquire its required locks within the 5 milliseconds, the transaction aborts”. This behavior can be changed by setting the maxTransactionLockRequestTimeoutMillis parameter.
What is called "lock" here is different from what SQL databases call "lock" in two-phase locking transactions, where locks are acquired for the duration of the transaction. MongoDB is lock-free in that sense, using optimistic concurrency control rather than locking. What is called "lock" here is more similar to what SQL databases call a "latch" or "lightweight lock", which is held for a short duration and does not span multiple database calls. For such a wait, five milliseconds is a good default. But this is not what the benchmark experienced.
Such a timeout would raise the following exception: Unable to acquire lock ... within a max lock request timeout of '5ms' milliseconds.
What the benchmark catches in the retry loop is a write conflict, which happens before trying to acquire such a short lock: Command failed with error 112 (WriteConflict): 'Caused by :: Write conflict during plan execution and yielding is disabled. :: Please retry your operation or multi-document transaction.'
Such a write conflict has nothing to do with maxTransactionLockRequestTimeoutMillis: the operation doesn't even try to acquire such a lock, because there is nothing to write once transaction isolation (the 'I' in ACID) can no longer be guaranteed. When reading, it has detected that the read snapshot, the state as of the beginning of the transaction, has been modified by another transaction. It doesn't wait, because the snapshot would be stale if the other transaction commits, and it immediately returns the error to the application. The application must compensate or roll back what it did during the transaction, wait a small amount of time (the exponential backoff), and retry.
In PostgreSQL, when operating under the Read Committed isolation level, it can wait because it allows reading a state that may not be consistent with the transaction's start time. If a concurrent transaction commits during this time, PostgreSQL simply continues to read the committed data, mixing data from different states. This is not permitted in higher isolation levels, like serializable, and a transient error must be raised, like in MongoDB, to guarantee ACID properties. However, PostgreSQL uses a pessimistic locking approach, waits to determine whether the other transaction commits, allowing it to retry immediately once the conflict is resolved. This is why the retry logic without backoff does not have the same consequences.
You may wonder why MongoDB doesn't implement waiting like PostgreSQL does. PostgreSQL is designed for a single-writer instance, which cannot scale horizontally, making it simple to use a wait queue in shared memory. However, when sharding PostgreSQL using the Citus extension, this design breaks down, leading to eventual consistency for cross-shard reads. In contrast, MongoDB is built for horizontal scalability and opts for optimistic concurrency control instead of a distributed wait queue, providing consistent cross shard reads across nodes (when the read concern is set to majority).
I prefer not to link the benchmark paper to avoid helping search engines or LLM crawlers find outdated content, but it is easy to find. The benchmark code is available in a repository. I prefer to link to the MongoDB transaction documentation instead. Now you know where the myth about slow transactions comes from: incorrect understanding of MongoDB lock-free ACID transactions.
There's more to say about this benchmark. In the benchmark code, there's a hotspot on the "audit" table which is indexed for the PostgreSQL definition, but not for the MongoDB definition. This is visible as MongoDB logs slow queries by default:
mongo-1 | {"t":{"$date":"2025-08-05T21:31:22.655+00:00"},"s":"I", "c":"WRITE", "id":51803, "ctx":"conn31"
,"msg":"Slow query","attr":{"type":"update","isFromUserConnection":true,"ns":"postgres.audit","collectionType":"normal"
,"command":{"q":{"schedule_id":4778,"day":{"$date":"2025-08-05T00:00:00.000Z"}},"u":{"$set":{"date":{"$date":"2025-08-05T21:31:22.533Z"}},"$inc":{"seats_occupied":1}},"multi":false,"upsert":true}
,"planSummary":"COLLSCAN","planningTimeMicros":255,"keysExamined":0,"docsExamined":5962,"nMatched":1,"nModified":1,"nUpserted":0,"keysInserted":0,"keysDeleted":0,"numYields":0,"planCacheShapeHash":"99470B66","queryHash":"99470B66","planCacheKey":"031BFB16"
,"locks":{"MultiDocumentTransactionsBarrier":{"acquireCount":{"w":1}},"ReplicationStateTransition":{"acquireCount":{"w":3}},"Global":{"acquireCount":{"w":1}},"Database":{"acquireCount":{"w":1}},"Collection":{"acquireCount":{"w":5}}},"flowControl":{"acquireCount":1},"readConcern":{"level":"snapshot","provenance":"clientSupplied"},"storage":{"data":{"txnBytesDirty":128}},"cpuNanos":32601533,"remote":"172.18.0.6:42292","queues":{"execution":{"admissions":1},"ingress":{"admissions":1}},"workingMillis":121,"durationMillis":121}}
To improve performance and scalability, create indexes and avoid hotspots. If hotspots are unavoidable, fail fast by performing the operations subject to write conflicts early in the transaction, rather than at the end as done here. The data model should allow critical transactions to be single-document, avoiding the need for normalization across multiple tables, but this benchmark uses the same normalized data model on both databases. Finally, no real application would perform a business transaction like this: reserving a flight seat, recording payment, and incrementing an audit counter all in one database transaction. You don't want to hold locks on database state while waiting for a payment validation that typically depends on an external service.
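Based on the slow-query log above, a hedged fix on the MongoDB side would be to index the fields of the update filter (collection and field names are taken from that log; the exact index depends on the schema and workload):
// The namespace in the log is "postgres.audit"; indexing the equality filter
// fields of the upsert turns the reported COLLSCAN into an index seek.
db.getSiblingDB("postgres").audit.createIndex({ schedule_id: 1, day: 1 });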
by Franck Pachot