a curated list of database news from authoritative sources

June 19, 2025

One million $lookup challenge

If you have read my previous post $lookup: more than just a SQL join, you understand that $lookup is not designed to join scalar values from thousands of documents. $lookup is useful at the end of an aggregation pipeline, not before the aggregation (examples in Comparison of JOINS πŸ‘‰πŸ» aggregation pipeline and CTEs) of a million-document collection. However, such a collection should not require a join, as documents are designed to aggregate multiple related objects, unlike relational databases, which normalize business data into multiple tables.

In a many-to-one relationship, it is common in a document model to embed fields, even when they are duplicated. Normalization plays a crucial role in relational databases to prevent these duplicates, as RDBMS were designed for interactive users executing SQL statements, where a missed update can lead to data integrity issues. While triggers can help manage updates to duplicated values and prevent anomalies, they introduce new challenges because they operate behind the update statement.

When updates originate from well-reviewed and tested programs, it is manageable to modify data in multiple locations, particularly when such updates are infrequent. Let's illustrate joins and the absence of joins with a simple test.

To join multiple documents with a small lookup table, you can cache the lookup table in your application. In this post, I tested several methods for retrieving a value from a lookup table: using a collection, a map, and an array. I integrated these methods into an aggregation pipeline, but keep in mind that this can also be accomplished within the application itself.

I created a dimension table with one thousand documents, and a fact table with one million. The fact table has a "ref" field that references the "_id" of the dimension table:

db.dim.drop();
db.fact.drop();

db.dim.insertMany(
    Array.from({ length: 1000 }, (_, i) => ({
        _id: i + 1,
        value: Math.random()
    }))
);

db.fact.insertMany(
    Array.from({ length: 1000000 }, () => ({
        ref: Math.ceil(Math.random() * 1000),
        value: Math.random()
    }))
);

Lookup (IndexedLoopJoin): 10 seconds

Here is an aggregation pipeline with a lookup.

x=db.fact.aggregate([
    {
        $lookup: {
            from: "dim",
            localField: "ref",
            foreignField: "_id",
            as: "dimData" ,
        }
    },
]).explain("executionStats")
;
print(x["executionStats"]["executionTimeMillis"]/1000+" seconds")

On this data, it runs in ten seconds. The query planner chooses an Index Nested Loop Join because there is an index. Without an index it could use a hash join.

Map to object and $getField: 61 seconds

To avoid the lookup, I read the dimension table into an object with one field per dimension value, the field name being the "_id", and get the value with $getField:


const dimMap = {};  
db.dim.find().forEach(doc => {  
    dimMap[doc._id] = doc.value;  
});  
print( dimMap )


x=db.fact.aggregate([
    {  
        $addFields: {  
            dimValue: {  
                $getField: {  
                    field: { $toString: "$ref" },  
                    input: dimMap  
                }  
            }  
        }  
    }  
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")

On this data, it runs in one minute. Accessing a field by name is not an optimal operation, it is O(n), so it is a viable solution only for very small lookup tables.

Map to $switch branches: 23 seconds

Instead of using that map, I build a $switch statement to use in the aggregation pipeline.


const dimMap = {};  
db.dim.find().forEach(doc => {  
    dimMap[doc._id] = doc.value;  
});  
print( dimMap )

const switchBranches = Object.entries(dimMap).map(([id, value]) => ({  
    case: { $eq: ["$ref", parseInt(id)] },  
    then: value  
}));  
print( switchBranches )

x=db.fact.aggregate([  
    {  
        $addFields: {  
            dimValue: {  
                $switch: {  
                    branches: switchBranches,  
                    default: null  
                }  
            }  
        }  
    }  
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")

On this data, it runs in twenty seconds, and given that it puts the logic into the query, it is acceptable only for small lookup tables.

Map to array and $arrayElemAt: 1 second

Instead of a map, I use an array where the index is the dimension "_id". As I have no guarantee that the "_id" values are sequential with no gaps, I build a sparse array that I fill with the existing values.


// Get the maximum ID
const maxId = db.dim.aggregate([
 {$group:{_id:null,max:{$max:"$_id"}}}
]).toArray()[0].max;  
// Create a sparse array for all values
const dimValues = new Array(maxId + 1).fill(null);  
// store the values at the right ID
db.dim.find({},{_id:1,value:1}).forEach(
 d => dimValues[d._id] = d.value
);  
print(dimValues)

//
x=db.fact.aggregate([  
    { $addFields: { dimValue: { $arrayElemAt: [dimValues, "$ref"] } } }  
]).explain("executionStats")
;
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")

This is fast and runs in one second. However, it works only when the lookup identifiers are under control, ideally starting from one and forming a gap-free sequence.

Embed rather than join (denormalization)

Finally, as recommended for a document model (Model One-to-Many Relationships with Embedded Documents), I duplicate the dimension value into each fact document. I run this update with an aggregation pipeline.


const startTime = new Date(); 
db.fact.aggregate([  
    {  
        $lookup: {  
            from: "dim",  
            localField: "ref",  
            foreignField: "_id",  
            as: "dimData"  
        }  
    },  
    {  
        // copy the scalar dimension value into each fact document  
        $set: { dimValue: { $first: "$dimData.value" } }  
    },  
    {  
        $out: "fact"  
    }  
])  

const endTime = new Date(); 
const executionTime = (endTime - startTime) / 1000;  
print(`Update execution time: ${executionTime} seconds`);

This should be executed once, and then only the updated dimension values should be synchronized. This update took 16 seconds on my data.
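
When a dimension value does change, the duplicated copies can be refreshed with a targeted update. Here is a minimal sketch, assuming the change is applied by the application (the collection and field names are the ones used above):

// hypothetical change of one dimension value
const dimId = 42;
const newValue = Math.random();
db.dim.updateOne({ _id: dimId }, { $set: { value: newValue } });
// propagate the new value to the fact documents that reference it
db.fact.updateMany({ ref: dimId }, { $set: { dimValue: newValue } });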

To compare, I can simply read the document and project the embedded value:

x=db.fact.aggregate([  
    {  
        $project: {  
            _id: 1,  
            ref: 1,  
            dimValue: 1,  // Simply project the pre-computed field  
            // Add any other fact fields you need  
            someFactField: 1  
        }  
    }  
]).explain("executionStats");  
print(x["executionStats"]["executionTimeMillis"]/1000+" seconds")

This query takes around 0.5 seconds on my data. This approach is advisable unless you are dealing with frequently updated lookup tables. Additionally, in MongoDB, a single compound index can cover fields from anywhere within a document, including the embedded values. Typically, when filtering on a dimension, lookup, or reference table, the filter is applied to a business field rather than the internal "_id".
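
To illustrate the point about compound indexes, here is a minimal sketch (the index and the query shape are only examples, not part of the benchmark): after embedding, the duplicated value can be indexed together with other fact fields, so a filter on the business value is covered by the index alone.

// index the embedded dimension value together with a fact field
db.fact.createIndex({ dimValue: 1, ref: 1 });
// a covered query: filter and project only indexed fields
db.fact.find(
  { dimValue: { $gt: 0.99 } },
  { _id: 0, dimValue: 1, ref: 1 }
).explain("executionStats");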

Conclusion

I have tested my example using various cardinalities for both the fact table and the dimension lookup table. Below is the raw data.

dim    fact      lookup    $getField  $switch   $arrayElemAt  update    single doc
10     1000      0.008s    0.002s     0.001s    0.001s        0.08s     0s
100    1000      0.008s    0.006s     0.005s    0.001s        0.078s    0s
1000   1000      0.011s    0.062s     0.033s    0.001s        0.082s    0s
10000  1000      0.013s    0.754s     0.067s    0.003s        0.08s     0s
10     10000     0.075s    0.021s     0.016s    0.012s        0.199s    0.005s
100    10000     0.078s    0.066s     0.055s    0.013s        0.191s    0.005s
1000   10000     0.105s    0.62s      0.292s    0.013s        0.229s    0.005s
10000  10000     0.104s    6.94s      0.305s    0.015s        0.237s    0.005s
10     100000    0.738s    0.215s     0.171s    0.129s        1.306s    0.052s
100    100000    0.781s    0.673s     0.571s    0.131s        1.359s    0.052s
1000   100000    1.044s    6.259s     2.71s     0.141s        1.756s    0.054s
10000  100000    1.068s    73.205s    2.702s    0.144s        1.769s    0.059s
10     1000000   7.583s    2.199s     1.761s    1.332s        12.524s   0.559s
100    1000000   7.992s    6.634s     5.741s    1.346s        13.03s    0.557s
1000   1000000   10.551s   62.385s    26.4s     1.398s        16.771s   0.557s
10000  1000000   10.794s   742.086s   26.039s   1.437s        17.008s   0.578s
10     10000000  76.225s   22.127s    17.795s   13.196s       124.922s  5.789s
100    10000000  80.828s   67.602s    57.981s   13.695s       131.738s  5.714s
1000   10000000  106.194s  622.382s   267.555s  14.054s       168.854s  5.778s
10000  10000000  107.211s  7351.675s  265.404s  14.046s       171.13s   5.767s

An array, when accessed with $arrayElemAt, retrieves a value in O(1), while the other data structures tested here are O(n). However, an array is indexed by position and holds a fixed set of values, which limits its flexibility compared to tables or collections. You may find more suitable structures, such as hash maps, in your application language. These structures resemble how SQL databases use hash tables for joins. MongoDB can utilize a hash join for $lookup when the lookup collection is small, when spilling to disk is permissible, and when there is no index on the foreign field.
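
As a sketch of the application-side alternative, the dimension values can be cached in a JavaScript Map, which offers average O(1) access (shown here from mongosh only to illustrate the idea):

// cache the dimension collection in a Map keyed by _id
const dimCache = new Map(
  db.dim.find({}, { _id: 1, value: 1 }).toArray().map(d => [d._id, d.value])
);
// resolve the reference in application code while reading facts
db.fact.find().limit(3).forEach(f => print(f.ref, dimCache.get(f.ref)));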

When the lookup table is infrequently updated, applying updates to the embedded values is generally preferable: you pay the price once at write time and get faster reads. MongoDB offers developers greater control over data access patterns and cardinalities, rather than relying solely on the query planner, which can lead to plan instability and runaway queries. In contrast, SQL databases must leave such optimizations to the query planner in order to follow Codd's rules on data independence for relational databases.

A key distinction between MongoDB and SQL databases, including those that expose a MongoDB API on top of an RDBMS, is their physical data model capabilities. RDBMS prioritize normalization and utilize efficient join algorithms for relational data models. In contrast, MongoDB provides flexible schemas for application objects and supports joins where the join key can be an array ($lookup: more than just a SQL join) as part of an aggregation pipeline. While this may be less efficient for simple many-to-one relationships with scalar values, MongoDB's document data model can often eliminate the need for joins altogether. Additionally, caching lookup values in the application is a viable option.

Looking Ahead: A Confident New Chapter for Percona

Percona has always been more than just a companyβ€”it’s a mission-driven community built on a strong foundation of innovation, resilience, and open source excellence. For nearly two decades, Percona has been a trusted partner to organizations around the world, remaining steadfast in our mission while continuing to grow, improve, and evolve. Today, we mark the […]

June 17, 2025

$lookup: more than just a SQL join

For developers transitioning from a SQL background to MongoDB, the $lookup stage in an aggregation pipeline resembles a LEFT OUTER JOIN. For instance, the following MongoDB query:

        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }

is often compared to the following SQL statement:

        SELECT *  
        FROM xxa  
        LEFT JOIN xxb
        ON xxa.xxa1 = xxb.xxb1;  

If you assume that these two operations are identical across various databases, you may misinterpret the expected outcomes and their performance:

  1. In MongoDB, any field can be a scalar or an array. The $lookup function operates as an intersection of two sets, unlike SQL's equality operator (=) which operates on two values. It is more similar to PostgreSQL's && operator for arrays.
  2. The results in MongoDB do not duplicate outer rows for each inner row as in SQL. Instead, an array of matching inner documents is projected to the outer document. If an $unwind stage follows, the $lookup will then unnest this array into multiple documents.

To illustrate the semantics, consider this simple example: I will begin with scalars from two collections ("xxa" and "xxb") and then use arrays in their fields instead of scalars.

Scalar on the outer and inner collections

I define two small collections with some values in common and some which do not match:

db.xxa.drop();
db.xxa.insert([
  { xxa1: 101 },
  { xxa1: 102 },
]);

db.xxb.drop();
db.xxb.insert([
  { xxb1: 102 },
  { xxb1: 103 },
]);

The following query "joins" the two with a $lookup:

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  { _id: ObjectId('68504d1df99599b54cd4b118'), xxa1: 101, xxb: [] },
  {
    _id: ObjectId('68504d1df99599b54cd4b119'),
    xxa1: 102,
    xxb: [ { _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 } ]
  }
]

For each document from the outer collection, an array is created. If there is no matching document in the inner collection, the array remains empty. If there is one matching document, it will be included in the array. This looks like a LEFT OUTER JOIN in SQL that returns a structured one-to-many rather than a tabular result with duplicated values. However, this is a special case where we are joining on scalar fields.
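
To get the tabular, SQL-like result where the outer document is repeated for each match, an $unwind stage can follow the $lookup. A minimal sketch on the same collections (preserveNullAndEmptyArrays keeps unmatched documents, like a LEFT OUTER JOIN):

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    },
    // unnest the array: one output document per matching inner document
    { $unwind: { path: "$xxb", preserveNullAndEmptyArrays: true } }
]);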

Array on the outer, scalar in the inner

I replace the outer collection with some arrays:

db.xxa.drop();
db.xxa.insert([
  { xxa1: [100,101] },
  { xxa1: [101,102] },
  { xxa1: [102,103] },
]);

I run the same query:

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  {
    _id: ObjectId('68504d43f99599b54cd4b11c'),
    xxa1: [ 100, 101 ],
    xxb: []
  },
  {
    _id: ObjectId('68504d43f99599b54cd4b11d'),
    xxa1: [ 101, 102 ],
    xxb: [ { _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 } ]
  },
  {
    _id: ObjectId('68504d43f99599b54cd4b11e'),
    xxa1: [ 102, 103 ],
    xxb: [
      { _id: ObjectId('68504d26f99599b54cd4b11a'), xxb1: 102 },
      { _id: ObjectId('68504d26f99599b54cd4b11b'), xxb1: 103 }
    ]
  }
]

The semantics are similar, except that a document from the inner collection matches as soon as its value exists in the outer array. When no value matches, the added array is empty. When multiple values from the outer document find a matching document in the inner collection, all of those documents are added to the array in the result.

The array in the outer collection acts as a one-to-many reference. In SQL, such a one-to-many reference cannot be stored in a single column; it is transformed into a many-to-one relationship on the opposite side, and an index is created on the foreign key to allow navigating in the other direction.
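
As a side note, on collections larger than these toy examples, the $lookup benefits from an index on the foreignField, and it becomes a multi-key index automatically when the field holds arrays. A minimal sketch:

db.xxb.createIndex({ xxb1: 1 });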

Scalar on the outer, array in the inner

I do the opposite:

db.xxa.drop();
db.xxa.insert([
  { xxa1: 101 },
  { xxa1: 102 },
]);

db.xxb.drop();
db.xxb.insert([
  { xxb1: [100,101] },
  { xxb1: [101,102] },
  { xxb1: [102,103] },
]);

I run the same query:

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  {
    _id: ObjectId('68504d5ef99599b54cd4b11f'),
    xxa1: 101,
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] }
    ]
  },
  {
    _id: ObjectId('68504d5ef99599b54cd4b120'),
    xxa1: 102,
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
      { _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
    ]
  }
]

A document from the inner collection matches as soon as one of its values equals the value from the outer document. If multiple inner documents match, they will all appear in the resulting array.

Arrays on both sides

The general case can handle arrays on both sides:

db.xxa.drop();
db.xxa.insert([
  { xxa1: [100,101] },
  { xxa1: [101,102] },
  { xxa1: [102,103] },
]);

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            localField: "xxa1",
            foreignField: "xxb1",
            as: "xxb"
        }
    }
]);

[
  {
    _id: ObjectId('68504d7ef99599b54cd4b124'),
    xxa1: [ 100, 101 ],
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] }
    ]
  },
  {
    _id: ObjectId('68504d7ef99599b54cd4b125'),
    xxa1: [ 101, 102 ],
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b121'), xxb1: [ 100, 101 ] },
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
      { _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
    ]
  },
  {
    _id: ObjectId('68504d7ef99599b54cd4b126'),
    xxa1: [ 102, 103 ],
    xxb: [
      { _id: ObjectId('68504d65f99599b54cd4b122'), xxb1: [ 101, 102 ] },
      { _id: ObjectId('68504d65f99599b54cd4b123'), xxb1: [ 102, 103 ] }
    ]
  }
]

In this context, 'matching' refers to the intersection between the set of keys from the outer document and the set of keys found in an inner document.
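
To make that rule explicit, the same matching can be sketched with the pipeline form of $lookup and $setIntersection. This is only an illustration of the semantics for the arrays-on-both-sides case, not what the query planner executes internally:

db.xxa.aggregate([
    {
        $lookup: {
            from: "xxb",
            let: { outerKeys: "$xxa1" },
            pipeline: [
                // match when the two key sets share at least one value
                { $match: { $expr: { $gt: [
                    { $size: { $setIntersection: [ "$xxb1", "$$outerKeys" ] } }, 0
                ] } } }
            ],
            as: "xxb"
        }
    }
]);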

Representing the same in a SQL database is much more complex than two tables, as it needs additional tables instead of arrays:

-- Main entities  
CREATE TABLE xxa (  
    id SERIAL PRIMARY KEY  
    -- other scalar fields if any  
);  

CREATE TABLE xxb (  
    id SERIAL PRIMARY KEY  
    -- other scalar fields if any  
);  

-- Junction tables to represent the arrays  
CREATE TABLE xxa_values (  
    xxa_id INT REFERENCES xxa(id),  
    value INT,  
    PRIMARY KEY (xxa_id, value)  
);  

CREATE TABLE xxb_values (  
    xxb_id INT REFERENCES xxb(id),  
    value INT,  
    PRIMARY KEY (xxb_id, value)  
);  

Joining all tables and deduplicating the results serves as the equivalent of a MongoDB lookup:

SELECT   
    xxa.id as xxa_id,  
    ARRAY_AGG(DISTINCT xxa_vals.value) as xxa1,  
    ARRAY_AGG(  
        DISTINCT jsonb_build_object(  
            'id', xxb.id,  
            'xxb1', ARRAY(  
                SELECT value   
                FROM xxb_values   
                WHERE xxb_id = xxb.id   
                ORDER BY value  
            )  
        )  
    ) FILTER (WHERE xxb.id IS NOT NULL) as xxb  
FROM xxa  
LEFT JOIN xxa_values xxa_vals ON xxa.id = xxa_vals.xxa_id  
LEFT JOIN xxb_values xxb_vals ON xxa_vals.value = xxb_vals.value  
LEFT JOIN xxb ON xxb_vals.xxb_id = xxb.id  
GROUP BY xxa.id  
ORDER BY xxa.id;  

The next time you hear that a lookup in MongoDB is the same as a SQL join and you don't understand the result or the performance, remember that it is a different, higher-level operation on a flexible schema.
A simple left outer join in SQL often disappears in MongoDB, as the interconnected objects belong to a single document, like an aggregate in Domain-Driven Design.

Conclusion

I used abstract names for collections and fields in order to explain the behavior, but many modern applications benefit from a small array, where an additional association table would complicate things unnecessarily. In contemporary apps, users are identified by one or more email addresses, resources by tags, blogs by categories, and videos by interests, and many queries look for common interests, categories, or tags.
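
As a concrete illustration of that last point, here is a hedged sketch with hypothetical "users" and "videos" collections, where the join key is an array of interests on one side and an array of tags on the other:

db.users.insertOne({ _id: "u1", interests: [ "mongodb", "indexing" ] });
db.videos.insertMany([
  { _id: "v1", tags: [ "mongodb", "aggregation" ] },
  { _id: "v2", tags: [ "cooking" ] }
]);
// videos sharing at least one tag with the user's interests
db.users.aggregate([
  {
    $lookup: {
      from: "videos",
      localField: "interests",
      foreignField: "tags",
      as: "suggestedVideos"
    }
  }
]);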

Percona Software for MongoDB Release Plan Updates

Percona commits to delivering robust, enterprise-grade, and community-driven software for MongoDB. We build our databases and tools to meet evolving data needs. As MongoDB continues to evolve, we’ve updated and clarified our release strategy for the Percona Software for MongoDB suite. This suite includes: Percona Server for MongoDB (PSMDB) Percona Backup for MongoDB (PBM) Percona […]

June 16, 2025

Implement row-level security in Amazon Aurora MySQL and Amazon RDS for MySQL

Row-level security (RLS) is a security mechanism that enhances data protection in scalable applications by controlling access at the individual row level. It enables organizations to implement fine-grained access controls based on user attributes, so users can only view and modify data they’re authorized to access. This post focuses on implementing a cost-effective custom RLS solution using native MySQL features, making it suitable for a wide range of use cases without requiring additional software dependencies. This solution is applicable for both Amazon Relational Database Service (Amazon RDS) for MySQL and Amazon Aurora MySQL-Compatible Edition, providing flexibility for users of either service.

Integrating Repmgr with Barman: Managing Backups During Switchovers

Repmgr is an open source tool suite for managing replication and failover in a cluster of PostgreSQL servers. It enhances PostgreSQL’s built-in hot-standby capabilities with tools to set up standby servers, monitor replication, and perform administrative tasks such as failover or manual switchover operations. If you’re already familiar with Patroni, think of repmgr as a […]

June 15, 2025

Queries on JSON πŸ‘‰πŸ» compound indexes (Equality, Sort, Range)

In a blog post titled New Benchmarks Show Postgres Dominating MongoDB in Varied Workloads, EDB claims that Postgres outperforms MongoDB in document-based data tests. While I generally find such marketing benchmarks useless, they highlight mistakes made by vendors who compare a database where they are experts with one they don't know and are unwilling to learn. This provides an opportunity for educating on efficient index and query design.

There are four queries in this benchmark, with data loaded from the GitHub Archive for 2015. This data is in JSON, which makes it suitable for testing queries on documents, and the queries are OLAP style, using aggregation pipelines in MongoDB.

Load documents to a collection

I used the same method to load data in a small lab to reproduce the queries:

for file in http://data.gharchive.org/2015-{01..12}-{01..31}-{0..23}.json.gz
 do
  wget -q -o /dev/null -O - $file |
  gunzip -c |
  mongoimport --collection="github2015" 
 done

Here is an example of one document:

db.github2015.find().limit(1);

[
  {
    _id: ObjectId('684ee281d8d90a383a078112'),
    id: '2489368070',
    type: 'PushEvent',
    actor: {
      id: 9152315,
      login: 'davidjhulse',
      gravatar_id: '',
      url: 'https://api.github.com/users/davidjhulse',
      avatar_url: 'https://avatars.githubusercontent.com/u/9152315?'
    },
    repo: {
      id: 28635890,
      name: 'davidjhulse/davesbingrewardsbot',
      url: 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot'                                                                      
    },
    payload: {
      push_id: 536740396,
      size: 1,
      distinct_size: 1,
      ref: 'refs/heads/master',
      head: 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
      before: '86ffa724b4d70fce46e760f8cc080f5ec3d7d85f',
      commits: [
        {
          sha: 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
          author: {
            email: 'da8d7d1118ca5befd4d0d3e4f449c76ba6f1ee7e@live.com',
            name: 'davidjhulse'
          },
          message: 'Altered BingBot.jar\n\nFixed issue with multiple account support',                                                         
          distinct: true,
          url: 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81'                 
        }
      ]
    },
    public: true,
    created_at: '2015-01-01T00:00:00Z'
  }
]

This dataset is ideal for testing a document database because:

  • documents have an average size of three kilobytes, with some up to two megabytes.
  • it features a structured format with sub-documents like "actor" and "payload."
  • it contains arrays, such as "payload.commits," necessitating multi-key or inverted indexes.

I will test the four queries used by the benchmark and include an additional one that queries the array "payload.commits," which the benchmark overlooked. MongoDB's multi-key indexes significantly outperform all SQL databases in this regard (see the Multi-key Indexes series). While many vendor benchmarks limit their test coverage to create a favorable impression, my objective is to transparently showcase indexing best practices.

Single-field indexes

Here are the indexes that were created for the vendor benchmark:

 db.github2015.createIndex( {type:1} )
 db.github2015.createIndex( {"repo.name":1} )
 db.github2015.createIndex( {"payload.action":1} )
 db.github2015.createIndex( {"actor.login":1} )
 db.github2015.createIndex( {"payload.issue.comments":1} )

Seeing the index definitions, I already know why they got better results on PostgreSQL for some queries. Complex queries rarely filter or sort on a single field, and a well-designed database should have compound indexes. Without the right compound indexes, PostgreSQL can combine multiple indexes with bitmap scans, which is not ideal, as it cannot cover range filters or sort orders. However, combining indexes can reduce the number of indexes to create, which matters because additional indexes negatively impact the vacuum process.

The document model in MongoDB offers the advantage of having all important fields consolidated within a single document. This allows for the use of a compound index that can effectively handle equality, range, and sort order. These indexes can be applied to both scalar values and arrays.

Benchmark queries

I ran benchmark queries in my lab to educate on indexing practices. I created the appropriate indexes, and the best starting point for documentation is The ESR (Equality, Sort, Range) Guideline.

Query a) Repositories order by most open issues quantity

The query is:

db.github2015.aggregate([  
  { $match: { $and: [ { type: "IssuesEvent"} , { "payload.action" : "opened" } ] }},  
  { $group: { _id: "$repo.name", total: { $sum: 1 } } },  
  { $sort: { total: -1 } }  
]).explain("executionStats")

Index according to the ESR guideline:

  • Equality: "type" and "payload.action", with the less selective first (better compression)
  • Sort: "repo.name", as grouping is faster on sorted keys

db.github2015.createIndex(
 { "payload.action": 1, "type": 1, "repo.name": 1 }
)  

In my test, the query scanned nearly three hundred thousand index keys and returned in 700 milliseconds:

        executionStats: {                                                                                                                       
          executionSuccess: true,                                                                                                               
          nReturned: 72361,                                                                                                                     
          executionTimeMillis: 777,                                                                                                             
          totalKeysExamined: 278489,                                                                                                            
          totalDocsExamined: 0,

In the vendor's benchmark results, MongoDB outperformed PostgreSQL for this query, even though it had to fetch half of the documents due to an index being used for a single filter. This suggests how MongoDB can be significantly faster than PostgreSQL with the appropriate indexing.

Query b) Return git event type order by quantity from major to minor

The query is:

db.github2015.aggregate([  
  { $group: { _id: "$type", total: { $sum: 1 } } },  
  { $sort: { total: -1 } }  
])

Index according to the ESR guideline:

  • Sort: "type", as grouping is faster on sorted keys

db.github2015.createIndex({ "type": 1 })  

This index was created by the benchmark; however, it is important to know that the query planner will not choose the index without an equality or range predicate on its key prefix, even if it could help with sorting or grouping (this may be improved in the future if SERVER-13197 is implemented).

The query should simply add an unbounded range on the sort key:

db.github2015.aggregate([  
  { $match: { "type": { $gt: MinKey}  }},  
  { $group: { _id: "$type", total: { $sum: 1 } } },  
  { $sort: { total: -1 } }  
])

In my test, the query aggregated ten million keys in 4.5 seconds:

        executionStats: {
          executionSuccess: true,
          nReturned: 14,
          executionTimeMillis: 4585,
          totalKeysExamined: 9480600,
          totalDocsExamined: 0, 

Query c) Return the top 10 most active actors

The query is:

db.github2015.aggregate([  
  { $group: { _id: "$actor.login", events: { $sum: 1 } } },  
  { $sort: { events: -1 } },  
  { $limit: 10 }  
])

Index according to the ESR guideline:

  • Sort: "actor.login", as grouping is faster on sorted keys

db.github2015.createIndex({ "actor.login": 1 })  

As seen above, we need to add an unbounded range:

db.github2015.aggregate([  
  { $match: { "actor.login": { $gt: MinKey}  }}, 
  { $group: { _id: "$actor.login", events: { $sum: 1 } } },  
  { $sort: { events: -1 } },  
  { $limit: 10 }  
])

In my test, the query aggregated ten million keys in 13 seconds:

        executionStats: {                                                                                                                       
          executionSuccess: true,                                                                                                               
          nReturned: 727312,
          executionTimeMillis: 13498,
          totalKeysExamined: 9480600,
          totalDocsExamined: 0,

If you need to optimize this access pattern further, use the computed design pattern, and increment a login counter in an actor's collection, leveraging the atomic $inc operator.
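
A minimal sketch of that computed pattern, assuming a hypothetical "actors" collection keyed by the login and updated at ingestion time:

// maintain the per-actor event count at write time
db.actors.updateOne(
  { _id: "davidjhulse" },          // the actor login of the ingested event
  { $inc: { events: 1 } },
  { upsert: true }
);
// the top 10 most active actors then becomes a simple indexed read
db.actors.createIndex({ events: -1 });
db.actors.find().sort({ events: -1 }).limit(10);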

Query d) Return repositories that have more than two comments and a specific event type, order by average comments from major to minor

The query is:

db.github2015.aggregate([  
  { $match: { "type": "PushEvent", "payload.issue.comments": { $gt : 2 } } },  
  { $group: { _id: "$repo.name", avg: { $avg: "$payload.issue.comments" } } },  
  { $sort: { avg: -1 } }  
])

Index according to the ESR guideline:

  • Equality: "type"
  • Sort: "repo.name", as grouping is faster on sorted keys
  • Range: "payload.issue.comments"

db.github2015.createIndex(
 { "type": 1, "repo.name": 1, "payload.issue.comments": 1  }
)

I don't know if it was done on purpose for the benchmark, but the dataset has no "PushEvent" with a "payload.issue", so the query returns an empty result.
For this particular case, as there are no keys to sort, it is better to place the range key before the sort key:

db.github2015.createIndex(
 { "type": 1, "payload.issue.comments": 1, "repo.name": 1 }
)  

The index immediately finds that there are no keys for those bounds:


        executionStats: {
          executionSuccess: true,
          nReturned: 0,
          executionTimeMillis: 1,
          totalKeysExamined: 0,
          totalDocsExamined: 0,

Ad-Hoc queries for OLAP

While acceptable performance can be achieved with the right index, it may not be the optimal solution for OLAP use cases. Instead, consider creating a single Atlas Search index to handle all related queries. For further guidance, refer to my previous post: Search Index for Reporting.

Atlas Search Indexes are maintained asynchronously on a dedicated node, ensuring no impact on the operational database. They provide a near-real-time state without the complexity of streaming changes to another database.
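
As a hedged sketch, such a search index can be created from mongosh on an Atlas cluster with a dynamic mapping, so that all document fields become searchable (the index name and the dynamic mapping are only example choices):

db.github2015.createSearchIndex(
  "default",
  { mappings: { dynamic: true } }
);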

Another Query: Recent push events by user's commit

Finding the most recent push events by a user's commits is a relevant use case for this dataset. However, the benchmark did not run any queries on "commits" since it is an array. While PostgreSQL supports JSON, it cannot be directly compared to a document database like MongoDB, which handles non-scalar fields natively. If you run the benchmark on PostgreSQL, try this query:

SELECT   
 data->'repo'->>'name' as repo_name,  
 data->'payload'->'commits' as commits_info  
FROM "github2015"  
WHERE   
 data->>'type' = 'PushEvent'  
 AND data->'payload'->'commits' @> '[{"author": {"name": "ggolden@umich.edu"}}]'  
ORDER BY   
 data->>'created_at' DESC  
LIMIT 5;  

You can explore various indexes, such as a GIN index, but you will never find one that directly retrieves the five documents needed for the result. If you do, please correct me and show the execution plan in a comment.

On MongoDB, the same index guideline applies:

  • Equality: "type" (low selectivity) and "payload.commits.author.name" (will be multi-key)
  • Sort: "created_at" (a must for pagination queries)

db.github2015.createIndex({  
  "type": 1,  
  "payload.commits.author.name": 1,  
  "created_at": -1  
}) 

The query is simple and doesn't even need an aggregation pipeline:

db.github2015.find({      
  "type": "PushEvent",      
  "payload.commits.author.name": "ggolden@umich.edu"      
}, {  
  "repo.name": 1,  
  "payload.commits.author.name": 1,  
  "payload.commits.message": 1,  
  _id: 0  
}).sort({      
  "created_at": -1      
}).limit(5) 

The execution statistics indicate that only 5 documents have been read, which is the minimum required for the results:

  executionStats: {                                                                                                                                                                                          
    executionSuccess: true,                                                                                                                                                                                  
    nReturned: 5,                                                                                                                                                                                            
    executionTimeMillis: 0,                                                                                                                                                                                  
    totalKeysExamined: 5,                                                                                                                                                                                    
    totalDocsExamined: 5,                                                                                                                                                                                    
    executionStages: {                                                                                                                                                                                       

The execution plan illustrates how a single seek has read five index entries by utilizing a multi-key index. It applies the index bounds for both the equality filters and the sort order, ensuring an efficient retrieval process:

  stage: 'IXSCAN',          
  nReturned: 5,          
  executionTimeMillisEstimate: 0,          
...       
  isMultiKey: true,          
  multiKeyPaths: {          
    type: [],          
    'payload.commits.author.name': [ 'payload.commits' ],          
    created_at: []          
  },          
...                                                                                                                                                                                        
  direction: 'forward',
  indexBounds: {
    type: [ '["PushEvent", "PushEvent"]' ],
    'payload.commits.author.name': [ '["ggolden@umich.edu", "ggolden@umich.edu"]' ],
    created_at: [ '[MaxKey, MinKey]' ]
  },
  keysExamined: 5,
  seeks: 1, 
  dupsTested: 5,

Conclusion

Vendor benchmarks can be misleading with their time comparisons and performance claims. However, when analyzed critically, they serve an educational role by highlighting common design mistakes in the database they compare against.

In this post, I applied The ESR (Equality, Sort, Range) Guideline, looked at the execution plans, and tested queries that are relevant to a document model, to show the power of MongoDB for querying JSON data.

Queries on JSON πŸ‘‰πŸ» compound indexes (Equality, Sort, Range)

In a blog post titled New Benchmarks Show Postgres Dominating MongoDB in Varied Workloads, EDB claims that Postgres outperforms MongoDB in document-based data tests. While I generally find such marketing benchmarks useless, they highlight mistakes made by vendors who compare a database where they are experts with one they don't know and have no desire to learn. This provides an opportunity to remind some basics of on efficient index and query design.

There are four queries in this benchmark, with data loaded from github archive from 2015. This data is in JSON, which makes it suitable to test queries on documents, and queries are OLAP style, using aggregation pipelines in MongoDB.

Load documents to a collection

I used the same method to load data in a small lab to reproduce the queries:

for file in http://data.gharchive.org/2015-{01..12}-{01..31}-{0..23}.json.gz
 do
  wget -q -o /dev/null -O - $file |
  gunzip -c |
  mongoimport --collection="github2015" 
 done

Here is an example of one document:

db.github2015.find().limit(1);

[
  {
    _id: ObjectId('684ee281d8d90a383a078112'),
    id: '2489368070',
    type: 'PushEvent',
    actor: {
      id: 9152315,
      login: 'davidjhulse',
      gravatar_id: '',
      url: 'https://api.github.com/users/davidjhulse',
      avatar_url: 'https://avatars.githubusercontent.com/u/9152315?'
    },
    repo: {
      id: 28635890,
      name: 'davidjhulse/davesbingrewardsbot',
      url: 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot'                                                                      
    },
    payload: {
      push_id: 536740396,
      size: 1,
      distinct_size: 1,
      ref: 'refs/heads/master',
      head: 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
      before: '86ffa724b4d70fce46e760f8cc080f5ec3d7d85f',
      commits: [
        {
          sha: 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
          author: {
            email: 'da8d7d1118ca5befd4d0d3e4f449c76ba6f1ee7e@live.com',
            name: 'davidjhulse'
          },
          message: 'Altered BingBot.jar\n\nFixed issue with multiple account support',                                                         
          distinct: true,
          url: 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81'                 
        }
      ]
    },
    public: true,
    created_at: '2015-01-01T00:00:00Z'
  }
]

This dataset is ideal for testing a document database because:

  • documents have an average size of three kilobytes, with some up to two megabytes.
  • it features a structured format with sub-documents like "actor" and "payload."
  • it contains arrays, such as "payload.commit," necessitating multi-key or inverted indexes.

I will test the four queries used by the benchmark and include an additional one that queries the array "payload.commit," which the benchmark overlooked. MongoDB's multi-key indexes significantly outperform all SQL databases in this regard (see the Multi-key Indexes series). While many vendor benchmarks limit their test coverage to create a favorable impression, my objective is to transparently showcase indexing best practices.

Single-field indexes

Here are the indexes that were created for the vendor benchmark:

 db.github2015.createIndex( {type:1} )
 db.github2015.createIndex( {"repo.name":1} )
 db.github2015.createIndex( {"payload.action":1} )
 db.github2015.createIndex( {"actor.login":1} )
 db.github2015.createIndex( {"payload.issue.comments":1} )

Seeing the index definitions, I already know why they got better results on PostgreSQL for some queries. Complex queries rarely filter or sort on a single field, and a well-designed database should have compound indexes. Without the right compound indexes, PostgreSQL can combine multiple single-field indexes with bitmap scans, which is not ideal, as a bitmap scan cannot cover range filters or sort orders. However, this combination can reduce the number of indexes to create, which matters because additional indexes add overhead to the vacuum process.

The document model in MongoDB offers the advantage of having all important fields consolidated within a single document. This allows for the use of a compound index that can effectively handle equality, range, and sort order. These indexes can be applied to both scalar values and arrays.

Benchmark queries

I ran benchmark queries in my lab to educate on indexing practices. I created the appropriate indexes, and the best starting point for documentation is The ESR (Equality, Sort, Range) Guideline.

Query a) Repositories order by most open issues quantity

The query is:

db.github2015.aggregate([  
  { $match: { $and: [ { type: "IssuesEvent"} , { "payload.action" : "opened" } ] }},  
  { $group: { _id: "$repo.name", total: { $sum: 1 } } },  
  { $sort: { total: -1 } }  
]).explain("executionStats")

Index according to the ESR guideline:

  • Equality: "type" and "payload.action", with the less selective first (better compression)
  • Sort: "repo.name", as grouping is faster on sorted keys
db.github2015.createIndex(
 { "payload.action": 1, "type": 1, "repo.name": 1 }
)  

In my test, the query scanned nearly three hundred thousand index keys and returned in about 780 milliseconds:

        executionStats: {                                                                                                                       
          executionSuccess: true,                                                                                                               
          nReturned: 72361,                                                                                                                     
          executionTimeMillis: 777,                                                                                                             
          totalKeysExamined: 278489,                                                                                                            
          totalDocsExamined: 0,    

In the vendor's benchmark results, MongoDB outperformed PostgreSQL for this query, even though it had to fetch half of the documents because the index covered only a single filter. This suggests that MongoDB can be significantly faster than PostgreSQL with the appropriate indexing.

Query b) Return git event type order by quantity from major to minor

The query is:

db.github2015.aggregate([  
  { $group: { _id: "$type", total: { $sum: 1 } } },  
  { $sort: { total: -1 } }  
])

Index according to the ESR guideline:

  • Sort: "type", as grouping is faster on sorted keys
db.github2015.createIndex({ "type": 1 })  

This index was created by the benchmark; however, it is important to know that the query planner will not consider the index without a filter on its key prefix (this may be improved in the future if SERVER-13197 is implemented).

The query should simply add an unbounded range on the sort key:

db.github2015.aggregate([  
  { $match: { "type": { $gt: MinKey}  }},  
  { $group: { _id: "$type", total: { $sum: 1 } } },  
  { $sort: { total: -1 } }  
])

In my test, the query aggregated nearly ten million keys in about 4.6 seconds:

        executionStats: {
          executionSuccess: true,
          nReturned: 14,
          executionTimeMillis: 4585,
          totalKeysExamined: 9480600,
          totalDocsExamined: 0, 

I enhanced performance significantly by optimizing the query without creating a new index. MongoDB provides users with greater control over data access than PostgreSQL, even allowing query planner hints to improve efficiency. In this case, it is sufficient to add a { $gt: MinKey } or { $lt: MaxKey } predicate.

Query c) Return the top 10 most active actors

The query is:

db.github2015.aggregate([  
  { $group: { _id: "$actor.login", events: { $sum: 1 } } },  
  { $sort: { events: -1 } },  
  { $limit: 10 }  
])

Index according to the ESR guideline:

  • Sort: "actor.login", as grouping is faster on sorted keys
db.github2015.createIndex({ "actor.login": 1 })  

As seen above, we need to add an unbounded range filter to get the index considered by the query planner:

db.github2015.aggregate([  
  { $match: { "actor.login": { $gt: MinKey}  }}, 
  { $group: { _id: "$actor.login", events: { $sum: 1 } } },  
  { $sort: { events: -1 } },  
  { $limit: 10 }  
])

In my test, the query aggregated nearly ten million keys in about 13.5 seconds:

        executionStats: {                                                                                                                       
          executionSuccess: true,                                                                                                               
          nReturned: 727312,
          executionTimeMillis: 13498,
          totalKeysExamined: 9480600,
          totalDocsExamined: 0,

If you need to optimize this access pattern further, use the computed design pattern and increment a per-actor event counter in an actors collection, leveraging the atomic $inc operator.
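
For illustration, here is a minimal sketch of that pattern, assuming a hypothetical "actors" collection keyed by the actor login (it is not part of the benchmark schema):

// increment the counter atomically for each incoming event
db.actors.updateOne(
  { _id: "davidjhulse" },       // one counter document per actor login (hypothetical)
  { $inc: { events: 1 } },
  { upsert: true }              // create the counter document on the first event
)
// with an index on { events: -1 }, the top 10 becomes a simple indexed query:
db.actors.find().sort({ events: -1 }).limit(10)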

Query d) Return repositories that have more than two comments and a specific event type, order by average comments from major to minor

The query is:

db.github2015.aggregate([  
  { $match: { "type": "PushEvent", "payload.issue.comments": { $gt : 2 } } },  
  { $group: { _id: "$repo.name", avg: { $avg: "$payload.issue.comments" } } },  
  { $sort: { avg: -1 } }  
])

Index according to the ESR guideline:

  • Equality: "type"
  • Sort: "repo.name", as grouping is faster on sorted keys
  • Range: "payload.issue.comments"
db.github2015.createIndex(
 { "type": 1, "repo.name": 1, "payload.issue.comments": 1  }
)

I don't know if it was done on purpose for the benchmark, but the dataset has no "PushEvent" with a "payload.issue", so the query returns an empty result.
For this particular case, as there are no keys to sort, it is better to place the range key before the sort key:

db.github2015.createIndex(
 { "type": 1, "payload.issue.comments": 1, "repo.name": 1 }
)  

The index immediately finds that there are no keys for those bounds:


        executionStats: {
          executionSuccess: true,
          nReturned: 0,
          executionTimeMillis: 1,
          totalKeysExamined: 0,
          totalDocsExamined: 0,

The query may mistakenly expect a "PushEvent" to be an "IssuesEvent," as it looks for "payload.issue" comments. To eliminate this ambiguity, MongoDB offers schema validation (see "Schema Later" considered harmful).
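
As an illustration (my sketch, not something from the benchmark), a validator can forbid a "payload.issue" field on "PushEvent" documents; the validation level and action below are arbitrary choices:

db.runCommand({
  collMod: "github2015",
  validator: {
    // a PushEvent must not carry an issue payload
    $or: [
      { type: { $ne: "PushEvent" } },
      { "payload.issue": { $exists: false } }
    ]
  },
  validationLevel: "moderate",   // apply only to inserts and to updates of already-valid documents
  validationAction: "warn"       // log violations instead of rejecting them
})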

Ad-Hoc queries for OLAP

While acceptable performance can be achieved with the right index, it may not be the optimal solution for OLAP use cases. Instead, consider creating a single Atlas Search index to handle all related queries. For further guidance, refer to my previous post: Search Index for Reporting.

Atlas Search Indexes are maintained asynchronously on a dedicated node, ensuring no impact on the operational database. They provide a near-real-time state without the complexity of streaming changes to another database.
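
As a sketch, assuming an Atlas deployment (or a cluster with search nodes) where Atlas Search is available, a single dynamic index over the collection could look like this, with an arbitrary index name:

db.github2015.createSearchIndex(
  "reporting",                      // arbitrary index name
  { mappings: { dynamic: true } }   // index all fields dynamically
)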

Another Query: Recent push events by user's commit

Finding the recent push events that contain a commit by a given author is a relevant use case for this dataset. However, the benchmark did not run any queries on "commits" since it is an array. While PostgreSQL supports JSON, it cannot be directly compared to a document database like MongoDB, which handles non-scalar fields natively. If you run the benchmark on PostgreSQL, try this query:

SELECT   
 data->'repo'->>'name' as repo_name,  
 data->'payload'->'commits' as commits_info  
FROM "github2015"  
WHERE   
 data->>'type' = 'PushEvent'  
 AND data->'payload'->'commits' @> '[{"author": {"name": "ggolden@umich.edu"}}]'  
ORDER BY   
 data->>'created_at' DESC  
LIMIT 5;  

You can explore various indexes, such as a GIN index, but you will never find one that directly retrieves the five documents needed for the result. If you do, please correct me and show the execution plan in a comment.

On MongoDB, the same index guideline applies:

  • Equality: "type" (low selectivity) and "payload.commits.author.name" (will be multi-key)
  • Sort: "created_at" (a must for pagination query)
db.github2015.createIndex({  
  "type": 1,  
  "payload.commits.author.name": 1,  
  "created_at": -1  
}) 

The query is simple and doesn't even need an aggregation pipeline:

db.github2015.find({      
  "type": "PushEvent",      
  "payload.commits.author.name": "ggolden@umich.edu"      
}, {  
  "repo.name": 1,  
  "payload.commits.author.name": 1,  
  "payload.commits.message": 1,  
  _id: 0  
}).sort({      
  "created_at": -1      
}).limit(5) 

The execution statistics indicate that only 5 documents have been read, which is the minimum required for the results:

  executionStats: {                                                                                                                                                                                          
    executionSuccess: true,                                                                                                                                                                                  
    nReturned: 5,                                                                                                                                                                                            
    executionTimeMillis: 0,                                                                                                                                                                                  
    totalKeysExamined: 5,                                                                                                                                                                                    
    totalDocsExamined: 5,                                                                                                                                                                                    
    executionStages: {                                                                                                                                                                                       

The execution plan illustrates how a single seek has read five index entries by utilizing a multi-key index. It applies the index bounds for both the equality filters and the sort order, ensuring an efficient retrieval process:

  stage: 'IXSCAN',          
  nReturned: 5,          
  executionTimeMillisEstimate: 0,          
...       
  isMultiKey: true,          
  multiKeyPaths: {          
    type: [],          
    'payload.commits.author.name': [ 'payload.commits' ],          
    created_at: []          
  },          
...                                                                                                                                                                                        
  direction: 'forward',
  indexBounds: {
    type: [ '["PushEvent", "PushEvent"]' ],
    'payload.commits.author.name': [ '["ggolden@umich.edu", "ggolden@umich.edu"]' ],
    created_at: [ '[MaxKey, MinKey]' ]
  },
  keysExamined: 5,
  seeks: 1, 
  dupsTested: 5,

Conclusion

Benchmarks are often biased because vendors optimize tests for their own databases, neglecting others. While their performance claims and time comparisons lack value, looking at the queries and execution plans may serve an educational role by highlighting common design mistakes made with the databases they compare against.

In this post, I applied The ESR (Equality, Sort, Range) Guideline, examined the execution plans, and tested queries relevant to a document model to demonstrate the power of MongoDB in querying JSON data. After replacing two indexes and improving two queries, you should see that MongoDB performs significantly faster than PostgreSQL on those four queries.

June 13, 2025

DuckDB to query MongoDB

DuckDB is a high-performance analytical database, but it is single-process, so it cannot replace an operational database. MongoDB is a general-purpose database for operational data in a flexible format. DuckDB lacks a native connector to MongoDB, but you can combine pg_duckdb and mongo_fdw, two PostgreSQL extensions, to query MongoDB with DuckDB.

A PostgreSQL Docker container with DuckDB and Mongo_FDW

Here is an example. I built a pg_duckdb image, which is PostgreSQL with DuckDB as an extension, and installed mongo_fdw, the Foreign Data Wrapper used to access MongoDB from PostgreSQL. Here is my Dockerfile:

ARG pg_version=17

# build stage to compile the MongoDB Foreign Data Wrapper

FROM pgduckdb/pgduckdb:${pg_version}-main as build

ARG pg_version

ARG MONGO_FDW_VERSION=5_5_2
ARG MONGO_FDW_URL=https://github.com/EnterpriseDB/mongo_fdw/archive/REL-${MONGO_FDW_VERSION}.tar.gz
ARG SOURCE_FILES=/tmp/mongo_fdw

ENV PKG_CONFIG_PATH=${SOURCE_FILES}/mongo-c-driver/src/libmongoc/src:${SOURCE_FILES}/mongo-c-driver/src/libbson/src
ENV LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu
ENV MONGOC_INSTALL_DIR=${LD_LIBRARY_PATH}
ENV JSONC_INSTALL_DIR=${LD_LIBRARY_PATH}

USER root

# dependencies (compilation and runtime)
RUN apt-get update && apt-get install -y --no-install-recommends wget ca-certificates make gcc cmake pkg-config postgresql-server-dev-${pg_version} libssl-dev libzstd-dev libmongoc-dev libjson-c-dev libsnappy1v5 libmongocrypt0 
# compile mongo_fdw
WORKDIR /var/tmp
ADD ${MONGO_FDW_URL} /var/tmp
WORKDIR /var/tmp
RUN tar -zxvf $(basename ${MONGO_FDW_URL})
# install mongo_fdw
WORKDIR /var/tmp/mongo_fdw-REL-${MONGO_FDW_VERSION}
RUN ./autogen.sh && make USE_PGXS=1 && make USE_PGXS=1 install;

# add mongosh because it can be useful
WORKDIR /var/tmp
ADD https://downloads.mongodb.com/compass/mongosh-2.5.2-linux-x64.tgz /tmp
RUN tar -xvf /tmp/mongosh-2.5.2-linux-x64.tgz
RUN cp ./mongosh-2.5.2-linux-x64/bin/mongosh /usr/local/bin

# final stage to add mongo_fdw to pgduckdb

FROM pgduckdb/pgduckdb:${pg_version}-main

USER root

ARG pg_version
ARG extdir=/usr/share/postgresql/${pg_version}/extension
ARG extlibdir=/usr/lib/postgresql/${pg_version}/lib
ARG libdir=/usr/lib/x86_64-linux-gnu

COPY --from=build ${extdir}/mongo_fdw* ${extdir}/
COPY --from=build ${extlibdir}/mongo_fdw.so ${extlibdir}/
COPY --from=build ${libdir}/libmongoc-1.0.so.0.0.0 \
                  ${libdir}/libbson-1.0.so.0.0.0 \
                  ${libdir}/libmongocrypt.so.0.0.0 \
                  ${libdir}/libsnappy.so.1.1.9 \
                  ${libdir}/ 

RUN cd ${libdir} && \
    ln -sf libmongoc-1.0.so.0.0.0 libmongoc-1.0.so.0 && \
    ln -sf libmongoc-1.0.so.0 libmongoc-1.0.so && \
    ln -sf libbson-1.0.so.0.0.0 libbson-1.0.so.0 && \
    ln -sf libbson-1.0.so.0 libbson-1.0.so && \
    ln -sf libmongocrypt.so.0.0.0 libmongocrypt.so.0 && \
    ln -sf libmongocrypt.so.0 libmongocrypt.so && \
    ln -sf libsnappy.so.1.1.9 libsnappy.so.1 && \
    ln -sf libsnappy.so.1 libsnappy.so;

COPY --from=build /usr/local/bin /usr/local/bin

USER postgres

Start the databases

I started a MongoDB container:

docker run --name mongodb $e -d mongodb/mongodb-community-server:latest

I built the image and started a container from my image, with network link to mongodb:

docker build -t duckduckmongo . 

docker run --name duckduckmongo -d --link mongodb:mongodb -e POSTGRES_PASSWORD=postgres4mongo duckduckmongo

Create MongoDB collections

From this container, where I've installed the MongoDB Shell, I connected to verify access to the MongoDB database:

docker exec -it duckduckmongo mongosh mongodb:27017/test

While there, I created a user:

db.createUser( { 
  user: "duckduckmongo",
  pwd: "mongo4postgres",
  roles: [ { role: "readWrite", db: "test" } ] 
} );

I created two collections with some data:

db.dim.insertMany( Array.from({ length: 42 }, (_, i) => (
 { 
   dimid: i + 1, 
   dimvalue: Math.random(),
   tags: [ "fr" , "de" , "it" ],
   coord: { x:1, y:-11 } 
})) );

db.fact.insertMany( Array.from({ length: 100000 }, () => (
 { 
   ref: Math.ceil(Math.random() * 42), 
   value: Math.random(),
}
)) );

From the field names, you can guess that I wanted to test some joins, with one collection referencing the other.

Create the Foreign Tables

I connected to PostgreSQL:

docker exec -it duckduckmongo psql -U postgres 

I enabled the extensions:

create extension if not exists mongo_fdw;
create extension if not exists pg_duckdb;

I declared the Foreign Data Wrapper to connect to MongoDB:


CREATE SERVER "MongoDB server" FOREIGN DATA WRAPPER mongo_fdw
  OPTIONS ( address 'mongodb', port '27017' )
;

GRANT USAGE ON FOREIGN SERVER "MongoDB server" TO postgres;

CREATE USER MAPPING FOR postgres SERVER "MongoDB server"
  OPTIONS ( username 'duckduckmongo', password 'mongo4postgres' )
;

I declared the foreign tables that map to the MongoDB collections:


-- Note: first column of the table must be "_id" of type "name" according to the doc

CREATE FOREIGN TABLE dim ( 
   _id name,
   dimid int,
   dimvalue float,
   "coord.x" int,    -- mapping nested object fields to column
   "coord.y" int,
   tags text[]       -- mappint arrays as arrays
 ) SERVER "MongoDB server"
 OPTIONS ( database 'test', collection 'dim' )
;

CREATE FOREIGN TABLE fact ( 
   _id name, 
   ref int, 
   value float 
 ) SERVER "MongoDB server"
 OPTIONS ( database 'test', collection 'fact' )
;

Those foreign tables are like views and can be queried from PostgreSQL, and the Foreign Data Wrapper will query the MongoDB database server.

Query MongoDB from PostgreSQL with join

I ran the following query, with a filter and a join:

select * from fact join dim on fact.ref=dim.dimid
 where fact.value>0.5
;

It took several seconds, and I checked the execution plan:

explain (analyze, costs off, verbose)
select * from fact join dim on fact.ref=dim.dimid
 where fact.value>0.5
;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Foreign Scan (actual time=16.379..7545.432 rows=49889 loops=1)
   Output: fact._id, fact.ref, fact.value, dim._id, dim.dimid, dim.dimvalue, dim."coord.x", dim."coord.y", dim.tags
   Foreign Namespace: (test.fact) INNER JOIN (test.dim)
 Planning Time: 2.226 ms
 Execution Time: 7548.817 ms

The slowness occurs because the join in MongoDB uses a $lookup within an aggregation pipeline, which isn't optimized for fifty thousand documents.

In MongoDB, the $lookup operation can be slower than joins in an RDBMS due to its document model, which is designed for embedding related data and reducing the need for joins. The flexible schema allows fields to be scalars, arrays, or nested documents, making $lookup operations more complex than traditional equi-joins. A lookup is a key-intersection join, which requires implicit array traversal and matching across two sets of keys, resembling a semijoin but including all matching elements from the inner collection, typically as an array that can also be unwound to multiple documents.

When querying thousands of documents, it's more effective to run two separate queries instead of one $lookup. This can be done transparently through the Foreign Data Wrapper, which can avoid pushing down the join. A hash join is preferable for my query as it requires most rows from the "dim" collection. Note that MongoDB can also use a hash join for lookups, but it is only effective if the foreign collection has fewer than ten thousand documents and temporary disk usage is allowed.
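
As a sketch, reusing the "dim" and "fact" collections created above: allowing disk use is one prerequisite for the hash-join strategy of $lookup, though whether the planner actually picks it still depends on the server version and collection sizes:

db.fact.aggregate([
  { $lookup: {
      from: "dim",               // small dimension collection (42 documents)
      localField: "ref",
      foreignField: "dimid",
      as: "dimData"
  } }
], { allowDiskUse: true })       // permit spilling to disk during the aggregation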

Query MongoDB collection but join in PostgreSQL

To execute the join in PostgreSQL, I disabled the join pushdown:

set mongo_fdw.enable_join_pushdown to false;

explain (analyze, costs off, verbose)
select * from fact join dim on fact.ref=dim.dimid
 where fact.value>0.5
;

                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Hash Join (actual time=1.524..102.797 rows=49889 loops=1)
   Output: fact._id, fact.ref, fact.value, dim._id, dim.dimid, dim.dimvalue, dim."coord.x", dim."coord.y", dim.tags
   Hash Cond: (fact.ref = dim.dimid)
   ->  Foreign Scan on public.fact (actual time=0.923..90.954 rows=49889 loops=1)
         Output: fact._id, fact.ref, fact.value
         Foreign Namespace: test.fact
   ->  Hash (actual time=0.575..0.575 rows=42 loops=1)
         Output: dim._id, dim.dimid, dim.dimvalue, dim."coord.x", dim."coord.y", dim.tags
         Buckets: 1024  Batches: 1  Memory Usage: 15kB
         ->  Foreign Scan on public.dim (actual time=0.488..0.552 rows=42 loops=1)
               Output: dim._id, dim.dimid, dim.dimvalue, dim."coord.x", dim."coord.y", dim.tags
               Foreign Namespace: test.dim
 Planning Time: 2.341 ms
 Execution Time: 105.252 ms

This has pushed the filter to MongoDB but not the join, so it read all 42 documents from "dim" in 0.6 milliseconds to build the hash table, and fifty thousand documents from "fact" in about 91 milliseconds, serving as the probe side of the hash join.

I disabled join pushdown for my session, but it can be an option of the foreign table declaration.

Query planner cardinalities

When the join is not pushed down, it is important to get good cardinality estimations because the join strategy is chosen by the PostgreSQL cost-based optimizer. By default, the MongoDB Foreign Data Wrapper uses the same constant row estimate (1000 rows) for all foreign tables:

explain
select * from fact join dim on fact.ref=dim.dimid
 where fact.value>0.5
;

                                 QUERY PLAN                                 
----------------------------------------------------------------------------
 Hash Join  (cost=1062.50..2240.00 rows=5000 width=192)
   Hash Cond: (fact.ref = dim.dimid)
   ->  Foreign Scan on fact  (cost=25.00..1025.00 rows=1000 width=76)
         Foreign Namespace: test.fact
   ->  Hash  (cost=1025.00..1025.00 rows=1000 width=116)
         ->  Foreign Scan on dim  (cost=25.00..1025.00 rows=1000 width=116)
               Foreign Namespace: test.dim

I was fortunate to have the small table hashed, but it is not a guarantee, as both options have the same cost. To get more accurate cardinality estimations, I enabled the remote estimate feature, which requests a count estimation from the MongoDB database:


ALTER SERVER "MongoDB server" 
 OPTIONS (ADD use_remote_estimate 'true')
;

explain
select * from fact join dim on fact.ref=dim.dimid
 where fact.value>0.5
;

                              QUERY PLAN                              
----------------------------------------------------------------------
 Hash Join  (cost=14.04..3887.03 rows=7000 width=192)
   Hash Cond: (fact.ref = dim.dimid)
   ->  Foreign Scan on fact  (cost=5.00..3682.98 rows=33333 width=76)
         Foreign Namespace: test.fact
   ->  Hash  (cost=8.52..8.52 rows=42 width=116)
         ->  Foreign Scan on dim  (cost=5.00..8.52 rows=42 width=116)
               Foreign Namespace: test.dim

To determine the correct join order, even an estimated selectivity is sufficient. In a hash join, the smaller table should be the build table, while the larger one serves as the probe table.

Invoke DuckDB to query the foreign tables

This didn't go through DuckDB because DuckDB, here, is an extension within PostgreSQL. To run the same query through DuckDB, I forced it:

set duckdb.force_execution to true;

explain (analyze, costs off, verbose)
select * from fact join dim on fact.ref=dim.dimid
 where fact.value>0.5
;

WARNING:  Unsupported Postgres type: No conversion to DuckDB available for type with oid=19)

                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Hash Join (actual time=1.598..102.117 rows=49889 loops=1)
   Output: fact._id, fact.ref, fact.value, dim._id, dim.dimid, dim.dimvalue, dim."coord.x", dim."coord.y", dim.tags
   Hash Cond: (fact.ref = dim.dimid)
   ->  Foreign Scan on public.fact (actual time=1.018..90.344 rows=49889 loops=1)
         Output: fact._id, fact.ref, fact.value
         Foreign Namespace: test.fact
   ->  Hash (actual time=0.558..0.558 rows=42 loops=1)
         Output: dim._id, dim.dimid, dim.dimvalue, dim."coord.x", dim."coord.y", dim.tags
         Buckets: 1024  Batches: 1  Memory Usage: 15kB
         ->  Foreign Scan on public.dim (actual time=0.464..0.527 rows=42 loops=1)
               Output: dim._id, dim.dimid, dim.dimvalue, dim."coord.x", dim."coord.y", dim.tags
               Foreign Namespace: test.dim
 Planning Time: 25.667 ms
 Execution Time: 104.390 ms

The execution looks the same because it has actually been run by PostgreSQL - it is a PostgreSQL execution plan. Note that I got the following warning:

WARNING:  Unsupported Postgres type: No conversion to DuckDB available for type with oid=19)

To analyze the query prepared by the DuckDB extension, I enabled DuckDB debug messages:

set client_min_messages   to debug;
set duckdb.log_pg_explain to true;

explain (analyze, costs off, verbose)
select * from fact join dim on fact.ref=dim.dimid
 where fact.value>0.5
;

DEBUG:  (PGDuckDB/DuckdbPrepare) Preparing: 
 SELECT fact._id, fact.ref, fact.value, dim._id, dim.dimid, dim.dimvalue , dim."coord.x", dim."coord.y", dim.tags
 FROM (pgduckdb.public.fact 
 JOIN pgduckdb.public.dim ON ((fact.ref = dim.dimid))) 
 WHERE (fact.value > (0.5)::double precision)

DEBUG:  (DuckDB/SetTableInfo) Column name: '_id', Type: "No conversion to DuckDB available for type with oid=19)"
DEBUG:  (DuckDB/SetTableInfo) Column name: 'ref', Type: INTEGER
DEBUG:  (DuckDB/SetTableInfo) Column name: 'value', Type: DOUBLE
DEBUG:  (DuckDB/SetTableInfo) Column name: '_id', Type: "No conversion to DuckDB available for type with oid=19)"
DEBUG:  (DuckDB/SetTableInfo) Column name: 'dimid', Type: INTEGER
DEBUG:  (DuckDB/SetTableInfo) Column name: 'dimvalue', Type: DOUBLE
DEBUG:  (DuckDB/SetTableInfo) Column name: 'coord.x', Type: INTEGER
DEBUG:  (DuckDB/SetTableInfo) Column name: 'coord.y', Type: INTEGER
DEBUG:  (DuckDB/SetTableInfo) Column name: 'tags', Type: VARCHAR[]

I ran the same query without the "_id" columns, which I don't need, and got a DuckDB execution plan:

explain (analyze, costs off, verbose)
select fact.ref, fact.value, dim.dimid, dim.dimvalue, dim.dimvalue 
 , dim."coord.x", dim."coord.y", dim.tags
 from fact join dim on fact.ref=dim.dimid
 where fact.value>0.5
;
                                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Custom Scan (DuckDBScan) (actual time=0.001..0.002 rows=0 loops=1)
   Output: duckdb_scan.ref, duckdb_scan.value, duckdb_scan.dimid, duckdb_scan.dimvalue, duckdb_scan.dimvalue_1, duckdb_scan."coord.x", duckdb_scan."coord.y", duckdb_scan.tags
   DuckDB Execution Plan: 

 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 β”‚β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
 β”‚β”‚    Query Profiling Information    β”‚β”‚
 β”‚β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
 EXPLAIN ANALYZE  SELECT fact.ref, fact.value, dim.dimid, dim.dimvalue, dim.dimvalue, dim.

Percona Earns Kubernetes Certified Services Provider Status for All Three Major Open Source Databases

As a member of the Kubernetes Certified Services Provider program, Percona is now part of a select, β€œpre-qualified tier of vetted service providers who have deep experience helping enterprises successfully adopt Kubernetes…” Kubernetes (frequently abbreviated as K8s) has come a long way over the past decade. From being used almost exclusively for orchestrating stateless container […]

June 11, 2025

Postgres 18 beta1: small server, IO-bound Insert Benchmark (v2)

This is my second attempt at IO-bound Insert Benchmark results with a small server. The first attempt is here and has been deprecated because sloppy programming by me meant the benchmark client was creating too many connections, which hurt results in some cases for Postgres 18 beta1.

There might be regressions from 17.5 to 18 beta1

  • QPS decreases by ~5% and CPU increases by ~5% on the l.i2 (write-only) step
  • QPS decreases by <= 2% and CPU increases by ~2% on the qr* (range query) steps

There might be regressions from 14.0 to 18 beta1

  • QPS decreases by ~6% and ~18% on the write-heavy steps (l.i1, l.i2)

Builds, configuration and hardware

I compiled Postgres from source using -O2 -fno-omit-frame-pointer for versions  14.0, 14.18, 15.0, 15.13, 16.0, 16.9, 17.0, 17.5 and 18 beta1.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, 32G of RAM and one NVMe device for the database. The OS has been updated to Ubuntu 24.04. More details on it are here.

For Postgres versions 14.0 through 17.5 the configuration files are in the pg* subdirectories here with the name conf.diff.cx10a_c8r32. For Postgres 18 beta1 I used 3 variations, which are here:
  • conf.diff.cx10b_c8r32
    • uses io_method='sync' to match Postgres 17 behavior
  • conf.diff.cx10c_c8r32
    • uses io_method='worker' and io_workers=16 to do async IO via a thread pool. I eventually learned that 16 is too large.
  • conf.diff.cx10d_c8r32
    • uses io_method='io_uring' to do async IO via io_uring

The Benchmark

The benchmark is explained here and is run with 1 client and 1 table with 800M rows. I provide two performance reports:
  • one to compare Postgres 14.0 through 18 beta1, all using synchronous IO
  • one to compare Postgres 17.5 with 18 beta1 using 3 configurations for 18 beta1 -- one for each of io_method = sync, worker and io_uring.

The benchmark steps are:

  • l.i0
    • insert 20 million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts 4M rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and 1M rows are inserted and deleted per table.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s

Results: overview

The performance report is here for Postgres 14 through 18 and here for Postgres 18 configurations.

The summary sections (here, here and here) have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for the benchmark steps. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS (rQPS) to explain how performance changes. It is: (QPS for $me / QPS for $base), where $me is the result for some version and $base is the result for the base version.

When rQPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. When it is 0.90 then I claim there is a 10% regression. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000

Below I use colors to highlight the relative QPS values with red for <= 0.97, green for >= 1.03 and grey for values between 0.98 and 1.02.

Results: Postgres 14.0 through 18 beta1

The performance summary is here

See the previous section for the definition of relative QPS (rQPS). For the rQPS formula, Postgres 14.0 is the base version and that is compared with more recent Postgres versions. The results here are similar to what I reported prior to fixing the too many connections problem in the benchmark client.

For 14.0 through 18 beta1, QPS on ...
  • the initial load (l.i0)
    • Performance is stable across versions
    • 18 beta1 and 17.5 have similar performance
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (1.00, 0.99)
  • create index (l.x)
    • ~10% faster starting in 15.0
    • 18 beta1 and 17.5 have similar performance
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (1.11, 1.12)
  • first write-only step (l.i1)
    • Performance decreases ~7% from version 16.9 to 17.0. CPU overhead (see cpupq here) increases by ~5% in 17.0.
    • 18 beta1 and 17.5 have similar performance
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (0.93, 0.94)
  • second write-only step (l.i2)
    • Performance decreases ~6% in 15.0, ~8% in 17.0 and then ~5% in 18.0. CPU overhead (see cpupq here) increases ~5%, ~6% and ~5% in 15.0, 17.0 and 18beta1. Of all benchmark steps, this has the largest perf regression from 14.0 through 18 beta1 which is ~20%.
    • 18 beta1 is ~4% slower than 17.5
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (0.86, 0.82)
  • range query steps (qr100, qr500, qr1000)
    • 18 beta1 and 17.5 have similar performance, but 18 beta1 is slightly slower
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (1.00, 0.99) for qr100, (0.97, 0.98) for qr500 and (0.97, 0.95) for qr1000. The issue is new CPU overhead, see cpupq here.
  • point query steps (qp100, qp500, qp1000)
    • 18 beta1 and 17.5 have similar performance but 18 beta1 is slightly slower
    • rQPS for (17.5, 18 beta1 with io_method=sync) is (1.00, 0.98) for qp100, (0.99, 0.98) for qp500 and (0.97, 0.96) for qp1000. The issue is new CPU overhead, see cpupq here.

Results: Postgres 17.5 vs 18 beta1

The performance summary is here.

See the previous section for the definition of relative QPS (rQPS). For the rQPS formula, Postgres 17.5 is the base version and that is compared with results from 18 beta1 using the three configurations explained above:
  • x10b with io_method=sync
  • x10c with io_method=worker and io_workers=16
  • x10d with io_method=io_uring

The summary is:
  • initial load step (l.i0)
    • rQPS for (x10b, x10c, x10d) was (0.99, 1.00, 1.00)
  • create index step (l.x)
    • rQPS for (x10b, x10c, x10d) was (1.01, 1.02, 1.02)
  • first write-heavy step (l.i1)
    • for l.i1 the rQPS for (x10b, x10c, x10d) was (1.00, 0.99, 1.01)
  • second write-heavy step (l.i2)
    • for l.i2 the rQPS for (x10b, x10c, x10d) was (0.96, 0.93, 0.94)
    • CPU overhead (see cpupq here) increases by ~5% in 18 beta1
  • range query steps (qr100, qr500, qr1000)
    • for qr100 the rQPS for (x10b, x10c, x10d) was (1.00, 0.99, 0.99)
    • for qr500 the rQPS for (x10b, x10c, x10d) was (1.00, 0.97, 0.99)
    • for qr1000 the rQPS for (x10b, x10c, x10d) was (0.99, 0.98, 0.97)
    • CPU overhead (see cpupq here, here and here) increases by ~2% in 18 beta1
  • point query steps (qp100, qp500, qp1000)
    • for qp100 the rQPS for (x10b, x10c, x10d) was (0.98, 0.99, 0.99)
    • for qp500 the rQPS for (x10b, x10c, x10d) was (0.99, 1.00, 1.00)
    • for qp1000 the rQPS for (x10b, x10c, x10d) was (0.99, 0.99, 0.99)

June 10, 2025

Percona Named to DBTA’s 2025 List of 100 Companies That Matter Most in Data

We’re proud to share that Percona has been named to the 2025 DBTA 100β€”Database Trends and Applications’ annual list of “The Companies That Matter Most in Data.” This recognition highlights our success in empowering organizations to build, scale, and optimize open source database environments for today’s most demanding applications. At Percona, we believe open source […]