Filter on Children, Sort by Parent: One-to-Many Compound Index Strategies in PostgreSQL
Before looking at PostgreSQL, I'll first introduce the problem by showing how MongoDB's document model and multi-key indexes handle compound indexes across a one-to-many relationship. Then we will see how to approximate this in PostgreSQL by denormalizing on the "many" side or the "one" side, how to maintain consistency with cascade foreign keys or triggers, and how to accelerate filtering with sorted pagination using B-tree, GIN, or RUM indexes.
Multi-key indexes in MongoDB allow compound indexes on one-to-many relationships. For example, the following index covers fields from both the children (child_value) and the parent (parent_value) when children are embedded as an array (in a children field) within the parent document:
db.parent.createIndex(
{ "children.child_value": 1, parent_value: 1 }
);
Such an index can efficiently support an equality filter on child_value combined with a sort and pagination on parent_value. For instance, finding the top 10 parents, ordered by parent_value, where at least one child has a child_value of 0.9:
db.parent.find(
{ "children.child_value": 0.9 }
).sort({ "parent_value": 1 }).limit(10);
It is important to understand the semantics of predicates on embedded arrays. This query does not return the top 10 children with this value — a parent may have multiple matching children, but we get one result per parent. This differs from a relational join (or $unwind in a MongoDB aggregation pipeline), which would produce one row per child. Here, the predicate tests only for the existence of at least one child with the specified value. The result set contains distinct parents, not (parent, child) pairs.
A multi-key index stores multiple index entries per parent document — one for each distinct child_value in the array — paired with the parent's parent_value. The number of index entries per parent equals the number of distinct child values, which can be less than the total number of children when multiple children share the same value. For example, given these documents:
{
_id: 24,
parent_value: "Y",
children: [
{ child_num: 1, child_value: 0.9 },
{ child_num: 2, child_value: 0.8 }
]
},
{
_id: 42,
parent_value: "X",
children: [
{ child_num: 1, child_value: 0.5 },
{ child_num: 2, child_value: 0.9 },
{ child_num: 3, child_value: 0.9 },
{ child_num: 4, child_value: 0.7 }
]
}
The index contains three entries for the second document, because the two children with child_value: 0.9 produce only one index entry:
(0.5, "X", 42) → record ID of { _id: 42 }
(0.7, "X", 42) → record ID of { _id: 42 }
(0.8, "Y", 24) → record ID of { _id: 24 }
(0.9, "X", 42) → record ID of { _id: 42 }
(0.9, "Y", 24) → record ID of { _id: 24 }
The index scan for {child_value: 0.9} finds two entries, already ordered by the second field of the compound key, parent_value. MongoDB can read them in order — "X" before "Y" — and fetch the corresponding documents without an additional sort. With a .limit(10), the scan stops after 10 distinct parents are found, making the query efficient regardless of how many total documents exist in the collection.
There is no direct equivalent of this in PostgreSQL. If the one-to-many relationship is stored as JSONB, GIN indexes can locate documents containing {child_value: 0.9} but do not preserve ordering on other fields. The query must fetch all matching documents, sort them on parent_value, and only then return the top 10. If the one-to-many relationship is normalized into two tables, no single index can span columns from both tables — a join is required first, and the sort optimization depends on the join strategy chosen by the planner.
Normalized One-to-Many
Now that the problem is defined, let's see how we can work around this limitation in a normalized model:
drop table if exists parent, child cascade;
create table parent (
parent_id bigserial primary key,
parent_value float
);
create table child (
child_id bigserial primary key,
parent_id int references parent (parent_id),
child_value float
);
insert into parent (parent_value)
select random() from generate_series(1, 1000)
;
insert into child (parent_id, child_value)
select parent_id, round(random()::numeric, 4) from parent, generate_series(1, 1000)
;
create index on parent (parent_value);
create index on child (child_value);
vacuum analyze parent, child;
The following query finds the top 10 parents, by parent_value, that have at least one child with a child_value of 0.9999:
-- explain (analyze, buffers, verbose, costs off)
select *
from parent p
where exists (
select 1 from child c
where c.parent_id = p.parent_id
and c.child_value = 0.9999
)
order by parent_value
limit 10;
parent_id | parent_value
-----------+----------------------
956 | 0.013518349069161273
789 | 0.022813950892476287
307 | 0.024740344416860793
833 | 0.041118650516225985
493 | 0.07587347477845374
402 | 0.08400127026866477
256 | 0.0980166832909124
342 | 0.11175505348300807
846 | 0.12047992766758941
590 | 0.16840580135419425
(10 rows)
The execution plan shows that:
- The index on
child_valuewas scanned to find all index entries with the expected value — hererows=99.0— which are then used to locate the corresponding blocks in thechildtable (this is a Bitmap Index Scan). - The Bitmap Heap Scan on the
childtable returned all children withchild_value = 0.9999(there are 95 if we count the actual matches — a bitmap scan can have some false positives). The results were stored in memory (Materialize). - The index on
parent_valuewas scanned to retrieve parents in the desired order, checking for each whether itsparent_idappears in the materialized buffer. It had to readrows=169.00parents before findingrows=10.00that satisfied the condition, with lookups on the materialized result (loops=169).
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Limit (actual time=0.366..2.272 rows=10.00 loops=1)
Output: p.parent_id, p.parent_value
Buffers: shared hit=245
-> Nested Loop Semi Join (actual time=0.365..2.270 rows=10.00 loops=1)
Output: p.parent_id, p.parent_value
Join Filter: (p.parent_id = c.parent_id)
Rows Removed by Join Filter: 16246
Buffers: shared hit=245
-> Index Scan using parent_parent_value_idx on public.parent p (actual time=0.012..0.080 rows=169.00 loops=1)
Output: p.parent_id, p.parent_value
Index Searches: 1
Buffers: shared hit=142
-> Materialize (actual time=0.000..0.006 rows=96.19 loops=169)
Output: c.parent_id
Storage: Memory Maximum Storage: 20kB
Buffers: shared hit=103
-> Bitmap Heap Scan on public.child c (actual time=0.032..0.151 rows=99.00 loops=1)
Output: c.parent_id
Recheck Cond: (c.child_value = '0.9999'::double precision)
Heap Blocks: exact=99
Buffers: shared hit=103
-> Bitmap Index Scan on child_child_value_idx (actual time=0.016..0.016 rows=99.00 loops=1)
Index Cond: (c.child_value = '0.9999'::double precision)
Index Searches: 1
Buffers: shared hit=4
The other index on parent_value can serve the sort but not the filter on child_value. In both cases, many more rows must be read to combine the two conditions after a Nested Loop Join. That is the price of normalization.
In a SQL database, if we want a single index to cover both fields, they must exist in the same table.
Denormalize into the "Many" Side (Children)
A common solution is to duplicate the parent value into the child table and create a compound index:
alter table child add parent_value float;
update child
set parent_value = parent.parent_value
from parent
where child.parent_id = parent.parent_id;
alter table child alter parent_value set not null;
create index on child (child_value, parent_value);
The query can now operate on the child table only:
-- explain (analyze, buffers, verbose, costs off)
select distinct parent_id, parent_value
from child
where child_value = 0.9999
order by parent_value
limit 10;
parent_id | parent_value
-----------+----------------------
956 | 0.013518349069161273
789 | 0.022813950892476287
307 | 0.024740344416860793
833 | 0.041118650516225985
493 | 0.07587347477845374
402 | 0.08400127026866477
256 | 0.0980166832909124
342 | 0.11175505348300807
846 | 0.12047992766758941
590 | 0.16840580135419425
(10 rows)
The execution plan shows a single index scan which covers:
- the filter:
Index Cond: child_value = 0.9999 - the order:
Presorted Key: child.parent_value
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Limit (actual time=0.067..0.072 rows=10.00 loops=1)
Output: parent_id, parent_value
Buffers: shared hit=36
-> Unique (actual time=0.066..0.070 rows=10.00 loops=1)
Output: parent_id, parent_value
Buffers: shared hit=36
-> Incremental Sort (actual time=0.066..0.066 rows=11.00 loops=1)
Output: parent_id, parent_value
Sort Key: child.parent_value, child.parent_id
Presorted Key: child.parent_value
Full-sort Groups: 1 Sort Method: quicksort Average Memory: 26kB Peak Memory: 26kB
Buffers: shared hit=36
-> Index Scan using child_child_value_parent_value_idx on public.child (actual time=0.014..0.054 rows=33.00 loops=1)
Output: parent_id, parent_value
Index Cond: (child.child_value = '0.9999'::double precision)
Index Searches: 1
Buffers: shared hit=36
Planning:
Buffers: shared hit=3
Planning Time: 0.132 ms
Execution Time: 0.088 ms
There is an additional sort because the result may contain duplicate parents when they have multiple children with the same value, but it is an incremental sort so it does not have to read all children — here only rows=33.00 before producing rows=10.00. This gets the top ten parents by parent_value without reading more child rows than necessary.
This is nearly optimal. The new index resembles a multi-key index, except that it may contain duplicate entries for the same parent — unnecessary for this query, but potentially useful for others that need the detail of individual children. Like the multi-key index, it must be maintained when the parent value changes. We have two solutions: referential integrity constraint and triggers.
Update with Cascade Constraint (Recommended)
One way to keep this column automatically updated is declarative. We add an additional foreign key that enforces that the parent_value in the child table matches the one in the parent, and declare it on update cascade:
alter table parent add unique (parent_id, parent_value);
alter table child
add constraint fk_child_parent
foreign key (parent_id, parent_value)
references parent (parent_id, parent_value)
on update cascade;
Unfortunately, PostgreSQL cannot use the same index for two unique constraints that share a prefix, so two indexes are maintained on the parent table. This solution is intended for cases where updates are not frequent.
I test the performance overhead when updating one parent:
\timing on
explain (analyze, buffers, costs off, verbose on)
update parent
set parent_value = parent_value - 0.000000000000000042
where parent_id = 789
;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Update on public.parent (actual time=0.063..0.064 rows=0.00 loops=1)
Buffers: shared hit=16
-> Index Scan using parent_parent_id_parent_value_key on public.parent (actual time=0.022..0.025 rows=1.00 loops=1)
Output: (parent_value - '4.2e-17'::double precision), ctid
Index Cond: (parent.parent_id = 789)
Index Searches: 1
Buffers: shared hit=5
Planning Time: 0.080 ms
Trigger RI_ConstraintTrigger_a_18059 for constraint fk_child_parent on parent: time=37.950 calls=1
Trigger RI_ConstraintTrigger_c_18061 for constraint fk_child_parent on child: time=6.767 calls=1000
Execution Time: 44.874 ms
There is an overhead: 37.9 milliseconds to cascade the update to a thousand of children (parent action), and 6.7 milliseconds to validate each _c_hild row (child check).
This is not problematic for occasional updates. Of course, if we update all parents, the row-by-row cascading constraints can be slow:
\timing on
explain (analyze, buffers, costs off, verbose on)
update parent
set parent_value = parent_value - 0.000000000000000042
;
vacuum parent, child;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Update on public.parent (actual time=5.944..5.945 rows=0.00 loops=1)
Buffers: shared hit=10975 dirtied=1
-> Seq Scan on public.parent (actual time=0.010..0.156 rows=1000.00 loops=1)
Output: (parent_value - '4.2e-17'::double precision), ctid
Buffers: shared hit=11
Planning:
Buffers: shared hit=86
Planning Time: 0.235 ms
Trigger RI_ConstraintTrigger_a_18059 for constraint fk_child_parent on parent: time=42659.655 calls=506
Trigger RI_ConstraintTrigger_c_18061 for constraint fk_child_parent on child: time=3398.155 calls=506000
Execution Time: 46103.577 ms
(11 rows)
Time: 46105.778 ms (00:46.106)
Obviously this is not a solution for frequently updated columns. However, the parent side of a one-to-many relationship often consists of static references, slowly changing dimensions, or immutable event headers, making this overhead acceptable. I also tested with the constraint set as DEFERRABLE. The elapsed time is similar, but you do not see it in the EXPLAIN output because the validation occurs at commit time.
Update with Trigger (Not Recommended)
The cascade constraint has the advantage of being declarative, but it executes internally as a trigger. We can instead create a custom trigger that cascades the update without re-validating the foreign key — assuming that nobody updates the denormalized column directly, and that it is set correctly on insert.
-- function to update the parent value in the children
create or replace function sync_parent_value() returns trigger as $$
begin
update child set parent_value = new.parent_value
where parent_id = new.parent_id;
return new;
end;
$$ language plpgsql;
-- trigger raised for each update
create trigger trg_sync_parent_value
after update of parent_value on parent
for each row
when (old.parent_value is distinct from new.parent_value)
execute function sync_parent_value();
-- drop the previous constraint
alter table child drop constraint fk_child_parent;
alter table parent drop constraint parent_parent_id_parent_value_key;
I test the update of all parents:
\timing on
explain (analyze, buffers, costs off, verbose on)
update parent
set parent_value = parent_value - 0.000000000000000042
;
vacuum parent, child;
QUERY PLAN
---------------------------------------------------------------------------------
Update on public.parent (actual time=4.547..4.547 rows=0.00 loops=1)
Buffers: shared hit=8453 dirtied=26
-> Seq Scan on public.parent (actual time=0.009..0.124 rows=1000.00 loops=1)
Output: (parent_value - '4.2e-17'::double precision), ctid
Buffers: shared hit=11
Planning:
Buffers: shared hit=8 dirtied=2
Planning Time: 0.108 ms
Trigger trg_sync_parent_value: time=42719.154 calls=506
Execution Time: 42724.042 ms
(10 rows)
Time: 42729.150 ms (00:42.729)
postgres=# vacuum parent, child;
VACUUM
Time: 176.585 ms
The performance advantage over the cascade constraint is minimal, and the risk is higher because parent_value in the child table could be updated independently without any check that it still matches the parent. We would also need an insert trigger to set the correct value, and an update trigger to prevent direct modifications. Ultimately, the cascade constraint solution is strongly preferable.
Denormalize into the "One" Side (Parent)
If we want to more closely mimic a MongoDB multi-key index, denormalizing into the child table is not an exact equivalent because there is no deduplication. It has its advantages — it can serve queries that need individual child rows — but a structure analogous to a multi-key index should be stored on the parent side, where each parent holds a summary of its children's values.
One approach is to add an array of child values to the parent table so that we can filter on it:
-- add array column
alter table parent add child_values float[];
-- populate from child
update parent p set child_values = (
select array_agg(distinct child_value)
from child where parent_id = p.parent_id
);
A trigger must maintain the array when children change:
create or replace function sync_child_values_array() returns trigger as $$
begin
update parent set child_values = (
select array_agg(distinct child_value)
from child where parent_id = coalesce(new.parent_id, old.parent_id)
) where parent_id = coalesce(new.parent_id, old.parent_id);
May 30, 2026
BSON and OSON: documents are designed to be nested, not flat
I like vendors benchmarks as they are often good illustrations of worst practice. Rather than focusing on real implementation trade-offs, they pick an extreme case that favors their own implementation by chance but is not optimized in the competitor's — because it is simply not how that technology is meant to be used. For example, Oracle published a "2x" benchmark results using YCSB, a key-value benchmark (slide 14 here), and a "529x" test misusing the raw BSON purpose (here). Both have something in common: they compare a relational database against a document database by using neither technology as it was designed to be used — no normalized tables, no nested documents, just flat key-value fields. The latter test is even worse: it uses 1,000 top-level fields in a single document. Don't do that!
While having thousands of columns in an SQL table is usually undesirable, it's acceptable to have hundreds or even thousands of fields in a document, as they represent multiple entities and value objects. However, nobody creates a flat structure with 1000 top-level fields. The advantage of JSON is its ability to organize entities into nested sub-documents. For instance, instead of storing first_name and last_name as separate fields, you can have a name field containing a sub-object with first and last. During queries, using dot notation to reference name.first and name.last makes no difference with first_name or last_name. It simplifies reading and displaying the document, and, as it is how it should be used, the binary JSON formats are optimized for it.
To illustrate this, I create two collections, flat with one thousand fields at the top level, and nest with ten top-level fields, each containing a sub-object with ten sub-objects.
{
echo field{0..9}{0..9}{0..9}
for i in {1..100000} ; do echo {0..9}{0..9}{0..9} ; done
} | sed 's/ /,/g' |
mongoimport --collection "flat" --type=csv --headerline --drop
{
echo field{0..9}.sub{0..9}.sub{0..9}
for i in {1..100000} ; do echo {0..9}{0..9}{0..9} ; done
} | sed 's/ /,/g' |
mongoimport --collection "nest" --type=csv --headerline --drop
The flat documents are like:
{ _id: ObjectId('6a1a1ed41d55219677a79c22'),
field000: 0, field001: 1, field002: 2, field003: 3, field004: 4, field005: 5, field006: 6, field007: 7, field008: 8, field009: 9,
field010: 10, field011: 11, field012: 12, field013: 13, field014: 14, field015: 15, field016: 16, field017: 17, field018: 18, field019: 19,
field020: 20, field021: 21, field022: 22, field023: 23, field024: 24, field025: 25, field026: 26, field027: 27, field028: 28, field029: 29,
field030: 30, field031: 31, field032: 32, field033: 33, field034: 34, field035: 35, field036: 36, field037: 37, field038: 38,...
The nested documents are like:
{
_id: ObjectId('6a1a160041c5538a7e93a9c5'),
field0: {
sub0: { sub0: 0, sub1: 1, sub2: 2, sub3: 3, sub4: 4, sub5: 5, sub6: 6, sub7: 7, sub8: 8, sub9: 9 },
sub1: { sub0: 10, sub1: 11, sub2: 12, sub3: 13, sub4: 14, sub5: 15, sub6: 16, sub7: 17, sub8: 18, sub9: 19 },
sub2: { sub0: 20, sub1: 21, sub2: 22, sub3: 23, sub4: 24, sub5: 25, sub6: 26, sub7: 27, sub8: 28, sub9: 29 },
sub3: { sub0: 30, sub1: 31, sub2: 32, sub3: 33, sub4: 34, sub5: 35, ...
BSON format (https://bsonspec.org/)
BSON is designed for efficient sequential scanning through network or disk access. Each field is stored one after another, with a small header containing the field type and name, followed by the value. Because every value type has either a fixed size (like a 64-bit integer or a double) or an explicit length prefix (like a string or a binary blob), the parser can skip over any field it doesn't care about without reading its value at all — it simply jumps forward by the declared length.
Nested sub-documents and arrays are stored the same way: the type is object or array, followed by the total byte length of the entire nested structure, followed by the nested fields themselves (for arrays, the field names are simply "0", "1", "2", and so on). That length prefix is the key optimization: if the parser is looking for a top-level field and the current field is a sub-document that doesn't match, it can skip the entire sub-document in one jump — all its nested fields, however deep — without reading a single byte inside it.
When you query a field using dot notation, such as "name.first", the parser works level by level. It scans the top-level fields looking for name. Any top-level field that doesn't match name is skipped, including any sub-documents, in a single jump. Once name is found and confirmed to be of type object, the parser steps into that sub-document and begins scanning its fields for first, again skipping anything that doesn't match. It never needs to read the sibling sub-documents of name at all.
The worst case is a field that doesn't exist: the parser must scan every field at the relevant nesting level before concluding it isn't there. But even then, sub-documents and arrays that are clearly irrelevant — because their parent name doesn't match — are each skipped in a single operation regardless of how large or deeply nested they are.
This is precisely why organizing many fields into a nested hierarchy is not just a cosmetic preference. A flat document with 1,000 top-level fields forces the parser to read 1,000 field names in the worst case. A document with 10 top-level fields, each containing 10 sub-fields, each containing 10 values, holds the same 1,000 values but the parser only ever reads at most 10 + 10 + 10 = 30 field names to locate any one of them. The BSON length prefix turns the nested structure into an implicit index, making the format genuinely faster to navigate at scale.
With the collection flat that has one thousand top-level fields from "field000" to "field999", looking for the last field "field999" takes more than one second:
db.flat.find(
{ "field999" : {$lt:42} }
).explain("executionStats").executionStats
;
{
executionSuccess: true,
nReturned: 0,
executionTimeMillis: 1081,
totalKeysExamined: 0,
totalDocsExamined: 100000,
executionStages: {
isCached: false,
stage: 'COLLSCAN',
filter: { field999: { '$lt': 42 } },
nReturned: 0,
executionTimeMillisEstimate: 1070,
works: 100001,
advanced: 0,
needTime: 100000,
needYield: 0,
saveState: 64,
restoreState: 64,
isEOF: 1,
direction: 'forward',
docsExamined: 100000
}
}
With the collection nest that has ten top-level fields from "field0" to "field9", and two levels of ten sub-objects from "sub0" to "sub9", looking for the last field "field9.sub9.sub9" is two times faster:
db.nest.find(
{ "field9.sub9.sub9" : {$lt:42} }
).explain("executionStats").executionStats
;
{
executionSuccess: true,
nReturned: 0,
executionTimeMillis: 424,
totalKeysExamined: 0,
totalDocsExamined: 100000,
executionStages: {
isCached: false,
stage: 'COLLSCAN',
filter: { 'field9.sub9.sub9': { '$lt': 42 } },
nReturned: 0,
executionTimeMillisEstimate: 420,
works: 100001,
advanced: 0,
needTime: 100000,
needYield: 0,
saveState: 21,
restoreState: 21,
isEOF: 1,
direction: 'forward',
docsExamined: 100000
}
}
A major part of the time is fetching the document from storage. Navigating into it is a smaller part, but the difference is still clearly visible: 1,081 ms versus 424 ms for 100,000 documents, with the same number of values, and the same worst-case field to find: the last one. The only difference is how those fields are arranged. Nesting them three levels deep, with ten branches at each level, makes the collection scan faster — with no index, no schema change, and no query rewrite beyond the dot notation that you would use anyway.
This is not a micro-optimization or an edge case. Any collection scan, whether triggered by a missing index, a low-selectivity filter, or a background analytics query, pays this cost on every document it reads. The deeper and wider your documents are, the more the BSON length-prefix trick pays off, because the parser can leap over entire branches of the document tree in a single bounds check.
What about Oracle's OSON format?
Oracle Database stores JSON documents in its proprietary binary format called OSON. At first glance, OSON appears to take a fundamentally different approach: rather than storing field names inline with each value as BSON does, OSON builds a field name dictionary at the beginning of the document. Think of it like a mini-datastore with it's catalog and indexes, rather than a protocol buffer. Each field in the document body then refers to its name by a short numeric ID rather than repeating the full string. This makes individual field names cheaper to compare and reduces document size when the same field name appears many times.
Given that design, you might expect nesting to make no difference in Oracle: if every field is just a numeric ID anyway, scanning 1,000 flat fields should cost the same as scanning 10 + 10 + 10 fields spread across three levels. The dictionary lookup cost is the same either way.
In practice, however, the same experiment on Oracle shows a very similar result to MongoDB.
Querying field999 on the flat table takes about 1.6 seconds:
set autotrace traceonly
select /*+ gather_plan_statistics */ "DATA" from "ORA"."flat"
where JSON_EXISTS("DATA",'$?(@.field999.numberOnly() < $B0)'
passing 42 as "B0" type(strict))
;
PLAN_TABLE_OUTPUT
________________________________________________________________________________________________
SQL_ID 1h98k751w7dws, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ "DATA" from "ORA"."flat" where
JSON_EXISTS("DATA",'$?(@.field999.numberOnly() < $B0)' passing 42 as
"B0" type(strict))
Plan hash value: 4073748891
---------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 0 |00:00:01.60 | 301K| 739 |
|* 1 | TABLE ACCESS FULL| flat | 1 | 12 | 0 |00:00:01.60 | 301K| 739 |
---------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(JSON_EXISTS2( /*+ QJSNMD_USE_SHR_XCTX */ "DATA" /*+ LOB_BY_VALUE */
FORMAT OSON , '$?(@.field999.numberOnly() < $B0)' /* json_path_str
$?(@.field999.numberOnly() < $B0) */ PASSING 42 AS "B0" FALSE ON ERROR
TYPE(STRICT) )=1)
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
Statistics
-----------------------------------------------------------
117 CPU used by this session
117 CPU used when call started
160 DB time
1174164 RM usage by this session
3 Requests to/from client
738 Session total flash IO requests
6053888 cell physical IO interconnect bytes
301074 consistent gets
301074 consistent gets from cache
301074 consistent gets pin
1068 consistent gets pin (fastpath)
739 gcs data block access records
1 messages sent
747 non-idle wait count
48 non-idle wait time
2 opened cursors cumulative
1 opened cursors current
738 physical read total IO requests
6053888 physical read total bytes
2 process last non-idle time
301074 session logical reads
48 user I/O wait time
4 user calls
Querying field9.sub9.sub9 on the nested table takes about 1.0 second:
select /*+ gather_plan_statistics */ "DATA" from "ORA"."nest"
where JSON_EXISTS("DATA",'$?(@.field9.sub9.sub9.numberOnly() < $B0)'
passing 42 as "B0" type(strict))
;
PLAN_TABLE_OUTPUT
__________________________________________________________________________________________________
SQL_ID 7yuwrup6rscrc, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ "DATA" from "ORA"."nest" where
JSON_EXISTS("DATA",'$?(@.field9.sub9.sub9.numberOnly() < $B0)'
passing 42 as "B0" type(strict))
Plan hash value: 3225864993
---------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 0 |00:00:01.04 | 100K| 1143 |
|* 1 | TABLE ACCESS FULL| nest | 1 | 1000 | 0 |00:00:01.04 | 100K| 1143 |
---------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(JSON_EXISTS2( /*+ QJSNMD_USE_SHR_XCTX */ "DATA" /*+ LOB_BY_VALUE */
FORMAT OSON , '$?(@.field9.sub9.sub9.numberOnly() < $B0)' /* json_path_str
$?(@.field9.sub9.sub9.numberOnly() < $B0) */ PASSING 42 AS "B0" FALSE ON ERROR
TYPE(STRICT) )=1)
Statistics
-----------------------------------------------------------
40 CPU used by this session
40 CPU used when call started
106 DB time
404934 RM usage by this session
4 Requests to/from client
1125 Session total flash IO requests
9363456 cell physical IO interconnect bytes
100221 consistent gets
100221 consistent gets from cache
100221 consistent gets pin
99096 consistent gets pin (fastpath)
1142 gcs data block access records
158 global enqueue gets sync
158 global enqueue releases
1134 non-idle wait count
71 non-idle wait time
2 opened cursors cumulative
1 opened cursors current
1125 physical read total IO requests
9363456 physical read total bytes
61 process last non-idle time
100221 session logical reads
71 user I/O wait time
4 user calls
The most telling number is not the execution time but the Buffers column: 301,074 consistent gets for the flat table versus 100,221 for the nested table. Oracle is reading three times as many data blocks for the flat documents. To understand why, I check the actual segment sizes:
with
table_seg (owner,table_name, table_blocks) as ( select owner, segment_name, sum(blocks) from dba_segments group by owner,segment_name),
lob_seg (owner,table_name, lob_blocks) as ( select owner, table_name, sum(blocks) from dba_lobs join dba_segments using (owner, segment_name) group by owner,table_name),
tab_stats (owner,table_name, avg_row_len) as ( select owner, table_name, avg_row_len from dba_tables)
select owner, table_name, table_blocks, nvl(lob_blocks, 0) as lob_blocks, table_blocks + nvl(lob_blocks, 0) as total_blocks, avg_row_len
from tab_stats natural join table_seg natural left join lob_seg
where owner = 'ORA' and table_name in ('flat', 'nest')
;
OWNER TABLE_NAME TABLE_BLOCKS LOB_BLOCKS TOTAL_BLOCKS AVG_ROW_LEN
________ _____________ _______________ _____________ _______________ ______________
ORA flat 1,152 317,608 318,760 166
ORA nest 100,992 32 101,024 7,724
The segment query reveals the full story. The nested documents have an average row length of 7,724 bytes and are stored entirely inline in the table segment, with almost no LOB blocks. The flat documents have an average row length of only 166 bytes in the table segment, but they spill into 317,608 LOB blocks. That tiny average row length for flat is not a sign of small documents — it is the sign of a pointer. The actual OSON bytes have been pushed out of the row into a separate LOB segment managed by Oracle's securefile infrastructure, and every document access requires an additional pointer dereference to fetch them from there.
The root cause is the OSON dictionary. OSON stores each distinct field name only once in a per-document dictionary and replaces every occurrence in the document body with a short numeric ID. The dictionary itself is the mechanism that compresses the field names. A flat document with 1,000 entirely unique field names like field000 through field999 requires a dictionary with 1,000 entries, one per distinct string, with no repetition to exploit. The resulting OSON document is large enough to exceed Oracle's block size and gets stored out of row as a LOB. A nested document whose entire structure uses only 20 distinct names — field0 through field9 and sub0 through sub9 — has a dictionary with just 20 short entries. Furthermore, objects that share the same field structure can reuse each other's field ID array entirely rather than repeating it. The resulting document is compact enough to be stored inline, directly in the table block, with no LOB indirection at all.
The consequence is dramatic. Scanning the flat collection means fetching a pointer from the table block, then chasing it to a LOB block, for every single one of the 100,000 documents. Scanning the nested collection means reading the document directly from the table block. That is the difference between 301,074 consistent gets and 100,221.
So the two formats reach the same conclusion by different paths. BSON benefits from nesting because the length-prefix on sub-documents lets the parser skip entire branches with a single jump, reducing the number of field names it reads. OSON benefits from nesting because the dictionary compresses repeated field names, shrinking the documents enough to keep them inline in the table and avoiding the cost of LOB storage entirely. The mechanism differs, but the advice is the same. Documents are nested structures to aggregate multiple entities and object values.
Conclusion
The practical takeaway is straightforward: model your documents the way you would naturally think about the data — as a hierarchy of related objects — and the binary JSON format your database uses will reward you for it, whether that is BSON in MongoDB or OSON in Oracle. Flat documents with hundreds of top-level fields are not just harder to read. They are measurably slower to query and heavier on storage when compression is applied. Nesting is not only good document design.
May 29, 2026
$exists and non-sparse indexes in MongoDB and in other DocumentDB
In SQL databases, NULL represents an unknown value — not the absence of a value. When a value is simply non-applicable for a given entity, the correct relational modeling approach is normalization: the entity gets no row in the relevant table at all, rather than a NULL in a column. This distinction becomes tricky with OUTER JOIN results, where the absence of a row is surfaced as NULL across all columns of the unmatched side, including key columns — making it easy to confuse "unknown value" with "no row existed."
MongoDB has its own subtlety: a field can be explicitly set to null or simply not exist in the document at all. In the BSON representation, these are distinct — one is a key with a null-typed value, the other is the absence of the key entirely. The schema is flexible: you can define a field or not. But in indexes, this distinction disappears. Except for partial indexes, indexes must have a key value for every document it covers. For documents where the field is missing, MongoDB uses null as a stand-in — the same key value used for explicit nulls. This means an index scan cannot distinguish between the two states, and resolving null vs. missing requires fetching the full document to apply a residual filter.
Consequently, a standard index scan with a filter on null or $exists is inexact: the query planner performs an index scan on the null key and then fetches the full document to verify whether the field is truly null or simply absent.
An example: { $exists: true } filter
When you query with { num: { $exists: true } }, you expect MongoDB to use an index on num. Let's test it on MongoDB, as well as some emulations: Oracle Database, Amazon DocumentDB (AWS), and DocumentDB extension on PostgreSQL (Microsoft).
Here is my test collection:
db.test.insertMany([
{ _id: 1, num: 42 },
{ _id: 2, num: 7 },
{ _id: 3, num: null },
{ _id: 4 },
{ _id: 5, num: 99 },
{ _id: 6 },
{ _id: 7, num: null },
{ _id: 8, num: 15 }
])
I have inserted eight documents:
- four with real values (
_id1, 2, 5, 8), - two with the field explicitly set to
null(_id3, 7), and - two where the field is entirely absent (
_id4, 6).
The query { num: { $exists: true } } should return six documents — everything except _id 4 and 6.
Before touching indexes, notice that $exists is not the same as a null check:
db.test.find({ num: null })
[
{ _id: 3, num: null },
{ _id: 4 },
{ _id: 6 },
{ _id: 7, num: null }
]
db.test.find({ num: { $exists: false } })
[
{ _id: 4 },
{ _id: 6 }
]
db.test.find({ num: { $exists: true } })
[
{ _id: 1, num: 42 },
{ _id: 2, num: 7 },
{ _id: 3, num: null },
{ _id: 5, num: 99 },
{ _id: 7, num: null },
{ _id: 8, num: 15 }
]
A field set to null exists. A field not written into the document does not. This distinction is perfectly clear at the document level. At the index level, it is not.
Null in the index key is ambiguous
When MongoDB builds a B-tree index on num, it must create an entry for every document. For documents with no num field, the index key exists with a null value. For documents where num is explicitly set to null, it also stores null. Both cases produce the same index key.
Here is how the non-sparse index looks:
Non-sparse index on { num: 1 }:
null → _id:3 { num: null } explicit null
null → _id:4 { } missing field
null → _id:6 { } missing field
null → _id:7 { num: null } explicit null
7 → _id:2
15 → _id:8
42 → _id:1
99 → _id:5
The four entries under the null key are indistinguishable from the index alone. To evaluate $exists, the engine must read the actual document. This is called a residual predicate — a filter condition the index cannot resolve, deferred to a later fetch stage.
Another way to look at it: the document schema is flexible, with no structure declared upfront and fields that may or may not exist, whereas indexes are different—their schema is declared, and the key fields always exist.
MongoDB with a non-sparse index
I create a regular index, which is by default non-sparse and has one index entry per document (or more for multi-key indexes).
db.test.createIndex({ num: 1 })
db.test.find({ num: { $exists: true } }).explain("executionStats")
The execution plan shows what happens across the IXSCAN and FETCH stages:
executionStats: {
nReturned: 6,
totalKeysExamined: 8,
totalDocsExamined: 8,
executionStages: {
stage: 'FETCH',
filter: { num: { '$exists': true } },
nReturned: 6,
docsExamined: 8,
inputStage: {
stage: 'IXSCAN',
nReturned: 8,
isSparse: false,
indexBounds: { num: [ '[MinKey, MaxKey]' ] },
keysExamined: 8
}
}
}
The IXSCAN returns all 8 index entries across the full [MinKey, MaxKey] range. The FETCH stage then reads all 8 documents and applies filter: { num: { $exists: true } } as a residual predicate, discarding _id 4 and 6. Notice docsExamined: 8 but nReturned: 6 — two fetches were wasted. The index was used, but the null bucket forced unnecessary work.
MongoDB with a sparse index
A sparse index excludes documents where the indexed field is entirely absent. It does not exclude explicit null values. Documents _id 3 and 7 have num: null and are still indexed.
db.test.createIndex({ num: 1 }, { sparse: true })
db.test.find({ num: { $exists: true } }).explain("executionStats")
As I have no projection, there is still a FETCH, but only for the documents in the final result:
executionStats: {
nReturned: 6,
totalKeysExamined: 6,
totalDocsExamined: 6,
executionStages: {
stage: 'FETCH',
nReturned: 6,
docsExamined: 6,
inputStage: {
stage: 'IXSCAN',
nReturned: 6,
isSparse: true,
indexBounds: { num: [ '[MinKey, MaxKey]' ] },
keysExamined: 6
}
}
}
keysExamined dropped from 8 to 6 — the two missing-field documents are not in the index. More importantly, the FETCH stage has no filter. There is no residual predicate. Every document pointed to by the sparse index either has a real value or has an explicit null — both satisfy $exists: true. The index structure itself proves the condition. The FETCH still happens because find() needs to return the documents, but it is doing useful work only, not wasted disambiguation.
Here is how the sparse index looks:
Sparse index on { num: 1 }:
null → _id:3 { num: null } explicit null — indexed
null → _id:7 { num: null } explicit null — indexed
7 → _id:2
15 → _id:8
42 → _id:1
99 → _id:5
_id:4 { } — not indexed
_id:6 { } — not indexed
The null bucket still exists in a sparse index, but it contains only explicit nulls. The ambiguity is gone.
Oracle Database
I reproduced the same on Oracle Database with the MongoDB emulation:
ora> db.test.createIndex({ num: 1 })
num_1
ora> db.test.find({ num: { $exists: true } }).explain("executionStats")
{
queryPlanner: {
namespace: 'ora.test',
parsedQuery: { num: { '$exists': true } },
rewrittenQuery: { num: { '$exists': true } },
generatedSql: `select "DATA",rawtohex("RESID"),"ETAG" from "ORA"."test" where JSON_EXISTS("DATA",'$?(exists(@.num)) ' type(strict))`,
winningPlan: ' Plan Hash Value : 3552627291 \n' +
'\n' +
'--------------------------------------------------------------------------------------------------\n' +
'| Id | Operation | Name | Rows | Bytes | Cost | Time |\n' +
'--------------------------------------------------------------------------------------------------\n' +
'| 0 | SELECT STATEMENT | | 1 | 24501 | 2 | 00:00:01 |\n' +
'| 1 | TABLE ACCESS BY INDEX ROWID BATCHED | test | 1 | 24501 | 2 | 00:00:01 |\n' +
'| 2 | HASH UNIQUE | | 1 | 24501 | | |\n' +
'| * 3 | INDEX RANGE SCAN (MULTI VALUE) | $ora:test.num_1 | 1 | | 1 | 00:00:01 |\n' +
'--------------------------------------------------------------------------------------------------\n' +
'\n' +
'Predicate Information (identified by operation id):\n' +
'------------------------------------------\n' +
`* 3 - access(JSON_QUERY("DATA" /*+ LOB_BY_VALUE */ FORMAT OSON , '$."num"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)\n` +
" MULTIVALUE)>HEXTORAW('01'))\n" +
'\n' +
'\n' +
'Notes\n' +
'-----\n' +
'- Dynamic sampling used for this statement ( level = 2 )\n' +
'\n'
},
serverInfo: { host: 'localhost', port: 27017, version: '7.0.22' },
ok: 1
}
ora>
It doesn't display the execution statistics, but I can get it from the SQL endpoint:
sql> select /*+ gather_plan_statistics */ "DATA",rawtohex("RESID"),"ETAG" from "ORA"."test" where JSON_EXISTS("DATA",'$?(exists(@.num)) ' type(strict));
DATA RAWTOHEX("RESID") ETAG
_______________________ ____________________ ___________________________________
{"_id":3,"num":null} 03C104 523160F3D2777CB2E0637B5B000A71CD
{"_id":7,"num":null} 03C108 523160F3D27F7CB2E0637B5B000A71CD
{"_id":2,"num":7} 03C103 523160F3D2757CB2E0637B5B000A71CD
{"_id":8,"num":15} 03C109 523160F3D2817CB2E0637B5B000A71CD
{"_id":1,"num":42} 03C102 523160F3D2737CB2E0637B5B000A71CD
{"_id":5,"num":99} 03C106 523160F3D27B7CB2E0637B5B000A71CD
6 rows selected.
sql> select * from dbms_xplan.display_cursor(format=>'allstats last');
PLAN_TABLE_OUTPUT
____________________________________________________________________________________________________________________
SQL_ID c08vsvqpn75vw, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ "DATA",rawtohex("RESID"),"ETAG"
from "ORA"."test" where JSON_EXISTS("DATA",'$?(exists(@.num)) '
type(strict))
Plan hash value: 3552627291
-----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
-----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 6 |00:00:00.01 | 2 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| test | 1 | 1 | 6 |00:00:00.01 | 2 |
| 2 | HASH UNIQUE | | 1 | 1 | 6 |00:00:00.01 | 1 |
|* 3 | INDEX RANGE SCAN (MULTI VALUE) | $ora:test.num_1 | 1 | 1 | 6 |00:00:00.01 | 1 |
-----------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("test"."SYS_NC00005$">HEXTORAW('01'))
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
26 rows selected.
The index range scan returned 6 entries, as if it were a sparse index. We cannot create a sparse index on Oracle Database:
ora> db.test.dropIndex({ num: 1 })
{ nIndexesWas: 2, ok: 1 }
ora> db.test.createIndex({ num: 1 } , { sparse: 1 })
MongoServerError[MONGO-67]: Unsupported index option: sparse
Amazon DocumentDB (AWS)
AWS DocumentDB speaks the MongoDB wire protocol but is built on a completely different architecture. The storage layer is distributed like Aurora, replicated across three availability zones. The query planner and storage engine are specific to Amazon DocumentDB and deliver performance characteristics that differ from both MongoDB and standard PostgreSQL.
Non-sparse index on Amazon DocumentDB
The { num: { $exists: true } } query does not use the non-sparse index (created as db.test.createIndex({ num: 1 })) on Amazon DocumentDB (tested on version 8, planner version 3):
queryPlanner: {
plannerVersion: 3,
winningPlan: { stage: 'COLLSCAN', filter: { num: { '$exists': true } } }
},
executionStats: {
nReturned: '6',
executionTimeMillis: '14.121',
planningTimeMillis: '14.019',
executionStages: {
stage: 'COLLSCAN',
nReturned: '6',
executionTimeMillisEstimate: '0.025'
}
}
The index is completely abandoned. The planner chose a full collection scan.
Sparse index on Amazon DocumentDB
With the index created as db.test.createIndex({ num: 1 }, { sparse: true }), the index is used:
queryPlanner: {
plannerVersion: 3,
winningPlan: { stage: 'IXSCAN', indexName: 'num_1', direction: 'forward' }
},
executionStats: {
nReturned: '6',
executionTimeMillis: '10.034',
planningTimeMillis: '8.128',
executionStages: {
stage: 'IXSCAN',
nReturned: '6',
executionTimeMillisEstimate: '1.842',
indexName: 'num_1',
direction: 'forward'
}
}
Every entry in the sparse index provably satisfies $exists: true. It scans 6 index entries and returns 6 documents. While a sparse index is optional in MongoDB, it is mandatory in Amazon DocumentDB to use an index for this query at all.
Microsoft DocumentDB on PostgreSQL
Microsoft DocumentDB is implemented as an open-source PostgreSQL extension, accessed via the MongoDB wire protocol through a compatible endpoint.
With DocumentDB on PostgreSQL, a sparse index is not required for an optimal access path. I created the index as db.test.createIndex({ num: 1 }) and used a hint to force the index, since on a small collection the cost-based planner would otherwise prefer a sequential scan:
db.test.find(
{ num: { $exists: true } }
).hint("num_1").explain("executionStats")
(Be careful when using a hint with MongoDB queries, as it may change the result, limiting the scan to what is indexed)
The execution plan reads only the necessary entries from the index:
executionStats: {
nReturned: Long('6'),
executionTimeMillis: 0.093,
executionStartAtTimeMillis: 0.089,
totalDocsExamined: Long('6'),
totalKeysExamined: Long('6'),
executionStages: {
stage: 'FETCH',
nReturned: Long('6'),
executionTimeMillis: 0.093,
executionStartAtTimeMillis: 0.089,
totalKeysExamined: 6,
numBlocksFromCache: 24,
inputStage: {
stage: 'IXSCAN',
nReturned: Long('6'),
executionTimeMillis: 0.093,
executionStartAtTimeMillis: 0.089,
indexName: 'num_1',
totalKeysExamined: 6,
numBlocksFromCache: 24
}
}
}
It shows the same count for index entries (totalKeysExamined: 6) and documents fetched (totalDocsExamined: 6).
Here we can go further: we can bypass the MongoDB layer entirely and query PostgreSQL directly, seeing exactly what the database engine sees.
To understand why, we can look at the underlying implementation from the PostgreSQL catalog:
\d documentdb_data.documents_15
Table "documentdb_data.documents_15"
Column | Type | Collation | Nullable | Default
-----------------+--------+-----------+----------+---------
shard_key_value | bigint | | not null |
object_id | bson | | not null |
document | bson | | not null |
Indexes:
"collection_pk_15" PRIMARY KEY, btree (shard_key_value, object_id)
"documents_rum_index_47" documentdb_extended_rum
(document bson_extended_rum_composite_path_ops
(pathspec='[ "num" ]', tl='2691'))
There are no individual columns for num, name, or any other document field. The entire document is stored as a single bson blob in the document column. PostgreSQL has no native knowledge of what is inside it. The collection name test maps to documents_15, where 15 is the collection's internal identifier.
The index is not a standard PostgreSQL B-tree. It is an Extended RUM index — documentdb_extended_rum — with a custom operator class: bson_extended_rum_composite_path_ops. RUM is an extension of GIN (Generalized Inverted Index) that adds support for ordering, range scans, and additional per-entry metadata. The operator class is the critical piece: it knows how to extract the num field from the opaque BSON blob and store it in a structure PostgreSQL can search. pathspec='[ "num" ]' tells it which field to index.
We can obtain the PostgreSQL execution plan directly using the DocumentDB API. I disabled sequential scans to override the cost-based planner's preference on this small table:
postgres=# set enable_seqscan to off;
explain (analyze, buffers, verbose, costs off)
select document from bson_aggregation_find(
'test',
'{
"find": "test",
"filter": { "num": { "$exists": true } }
}'::documentdb_core.bson
);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Index Scan using num_1 on documentdb_data.documents_15 collection (actual time=0.036..0.040 rows=6 loops=1)
Output: document
Index Cond: (collection.document @>= '{ "num" : { "$minKey" : 1 } }'::bson)
Buffers: shared hit=3
Planning:
Buffers: shared hit=26
Planning Time: 0.416 ms
Execution Time: 0.052 ms
(8 rows)
The $exists: true predicate has been translated into a PostgreSQL index condition: document @>= '{ "num": { "$minKey": 1 } }'::bson. This uses a custom BSON operator @>= meaning "document has field num with a value greater than or equal to MinKey."
MinKey is a special BSON sentinel value that sits below every other BSON value in the type ordering. The condition @>= MinKey therefore means "field num exists and has any BSON value at all" — which is exactly $exists: true. Existence becomes a range scan from the minimum possible value: an elegant encoding.
The RUM index is path-based and only creates entries for paths that actually exist in documents. However, documents where num is absent also have their index entry, that are scanned by the opposite filter { num: {$exists": false} }:
postgres=# explain (analyze, buffers, verbose, costs off)
select document from bson_aggregation_find(
'test',
'{
"find": "test",
"filter": { "num": { "$exists": false } }
}'::documentdb_core.bson
);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Index Scan using num_1 on documentdb_data.documents_15 collection (actual time=0.030..0.033 rows=2 loops=1)
Output: document
Index Cond: (collection.document @? '{ "num" : false }'::bson)
Buffers: shared hit=3
Planning:
Buffers: shared hit=80
Planning Time: 0.282 ms
Execution Time: 0.070 ms
(8 rows)
This has read the two rows (rows=2) without the num field.
The complete picture
Here is the summary for { $exists: true } queries on the example above:
| Non-sparse index | Sparse index | |
|---|---|---|
| MongoDB | FETCH ← IXSCAN, 8 keys, 8 docs, residual filter, 2 wasted fetches | FETCH ← IXSCAN, 6 keys, 6 docs, no residual filter |
| Amazon DocumentDB (AWS) | COLLSCAN, no index, 8 docs | IXSCAN, 6 keys |
| DocumentDB on PostgreSQL (Microsoft) | FETCH ← IXSCAN, 6 keys, 6 docs, no residua... (truncated)
May 28, 2026Porting PostgreSQL Extensions to MySQL using Agent Skills
Discover how to build and port PostgreSQL extensions to MySQL using VillageSQL's Extension Framework and our new automated CLI agent skill.
Percona Operator for PostgreSQL 3.0.0: Hard Fork, OLM Scoping, Major UpgradesThe Percona Operator for PostgreSQL 3.0.0 is here. This is the release that completes the hard fork of the operator from the Crunchy Data PostgreSQL Operator into a fully independent project, with a dedicated upstream.pgv2.percona.com API group for the inherited CRDs, an automatic CRD-rename rollout for existing 2.x installs on upgrade, and a public roadmap … Continued The post Percona Operator for PostgreSQL 3.0.0: Hard Fork, OLM Scoping, Major Upgrades appeared first on Percona. Guide your Amazon Aurora MySQL migration with Kiro powers
Today, we announce the Amazon Aurora MySQL power for Kiro. The power connects Kiro’s AI agent to Aurora MySQL and pairs live database access with curated best-practice guidance. You describe what you need in natural language. The agent generates the API calls, SQL, and configuration for you to review and run. In this post, we walk through how the power guides a production migration from Amazon Relational Database Service (Amazon RDS) for MySQL 8.0 to Aurora MySQL through four phases: assessment, replica creation, promotion, and post-cutover validation.
May 27, 2026AI-native, full-stack web apps with Vercel and AWS Databases
In this post, we show how the integration between Vercel and AWS Databases solves this and invite you to participate in the H0 hackathon.
Optimize costs in Amazon Aurora
By implementing modern optimization techniques for Aurora, you can achieve additional cost reduction beyond traditional methods alone. This isn’t only about spending less—it’s about building a more efficient, scalable, and resilient database environment. In this post, we show you a structured approach to optimizing Amazon Aurora database costs. It outlines specific strategies, implementation steps, and best practices across different optimization areas.
Migrate from Crunchy Data PostgreSQL Operator to Percona PostgreSQL Operator: Backup-Restore and PV ReuseA Percona PostgreSQL operator pgBackRest restore is the simplest way to move off the Crunchy Data PostgreSQL Operator: take a full Crunchy backup, point the new Percona cluster’s dataSource at the existing pgBackRest archive, and the cluster bootstraps from it before its first start. This post covers that path, plus a second option, persistent-volume reuse, for cases … Continued The post Migrate from Crunchy Data PostgreSQL Operator to Percona PostgreSQL Operator: Backup-Restore and PV Reuse appeared first on Percona. CedarDB: Features of April 2026This post takes a closer look at some of the most impactful features we have shipped in CedarDB across our recent releases. Whether you have been following along closely or are just catching up, here is a deeper look at the additions we are most excited about. Set-Returning Functions: Lock-Step Evaluationv2026-04-20 When handling bulk data transformations or speeding up database inserts, a popular developer trick is to use multiple set-returning functions side-by-side in the To guarantee seamless compatibility and keep your queries lightning-fast, CedarDB evaluates multiple set-returning functions in the Instead of exploding into 27 rows of useless, cross-joined data, CedarDB cleanly steps through the arrays row-by-row to return exactly 3 perfectly paired rows. If you rely on array unnesting to batch your application’s database inserts, you can now enjoy highly scalable performance and behavior that is completely identical to modern PostgreSQL.
ON UPDATE CASCADE: Keep Your Data in Sync Automaticallyv2026-04-20 Changing core identifiers, like a user’s handle or a department code, used to mean manually updating every referencing row to avoid breaking foreign key constraints. To make your life easier, CedarDB now supports Say you have a platform where posts reference an author’s username. If an author changes their handle, a single
pg_stat_database and pg_stat_activity: Observability Out of the Boxv2026-04-20 CedarDB now implements
VACUUM (TRUNCATE): Release Disk Space Back to the OSv2026-04-20 CedarDB’s storage footprint grows as your data grows, but until now, the main storage file never shrank. Dropped indexes, truncated tables, and deleted data all freed up pages internally, but the underlying file stayed the same size on disk. In some cases this could leave you with a much larger file than your actual data warrants, for example after building and then dropping a large index, or after rewriting
CedarDB also now properly returns pages to the free pool after
json_agg and json_build_array: JSON Aggregation in SQLv2026-04-27 Two commonly used JSON aggregation functions are now available in CedarDB:
Together, these two functions cover the most common patterns for producing JSON output directly in SQL, without needing to post-process results in application code. That’s it for now Questions or feedback? Join us on Slack or reach out directly. Do you want to try CedarDB straight away? Sign up for our free Enterprise Trial below. No credit card required. May 26, 2026Announcing VillageSQL Server 0.0.4
Explore VillageSQL Server 0.0.4: now featuring VEF v3, custom aggregates, parameter inference, and preview capabilities like background threads.
The Autovacuum Scale Factor Problem at Scale - Know Your DefaultsIn PostgreSQL, autovacuum and autoanalyze exist to clean up dead tuples (old versions of updated/deleted rows) and update query planner statistics, respectively. The challenge is running them frequently enough so that query plans and execution do not degrade after data modifications, but not so frequently as to cause excessive I/O overhead. Databases often maintain a counter of the number of modifications to trigger these background jobs. Oracle Database and MySQL use a stale percentage (the ratio of modifications to total rows) for statistics gathering. SQL Server uses a dynamically decreasing percentage to ensure statistics do not remain stale for too long on massive tables. PostgreSQL uses a hybrid approach: a fixed base threshold combined with a scale factor (a percentage) that grows proportionally with the table size. This hybrid approach hits the sweet spot for most workloads, but it often requires tuning based on your specific data. The key factor to watch is the amount of static, "cold" data in your tables. Because the scale factor is calculated against the total table size, a large volume of cold data will significantly inflate the threshold. This can delay maintenance on the active working set—the "hot" data actually used by your queries—leaving it vulnerable to stale statistics or bloat. Here are the default base thresholds:
At first glance, this suggests tables are analyzed when 50 rows are modified, and vacuumed when 50 dead tuples accumulate (from deletes or updates) or 1,000 rows are inserted. But this is only true without the scale factor—10% for statistics, 20% for vacuum:
Because of the scale factor, the actual trigger thresholds increase with the size of the table. For the default settings, the formulas are:
As these formulas show, a larger table requires a much larger accumulation of changes before maintenance fires. This is perfectly acceptable if data churn is uniformly distributed, as small changes across a massive dataset will not drastically impact query cost estimations. However, data distribution is rarely uniform and evolves over time (e.g., seasonal sales spikes, market expanding to new countries). Because static data inflates the table row count in the formulas above, your database waits too long to trigger maintenance on the active working set. This is the core problem with default autovacuum settings at scale: a table with 5 million rows can accumulate half a million stale modifications before the planner statistics are refreshed, and over a million dead tuples before bloat is cleaned up. The larger the table grows, the longer it waits, and the worse the situation becomes:
To demonstrate this, I have run the following script to simulate this kind of activity, constantly inserting 100 rows and then updating them. We delete nothing because we want to keep the history, but queries operate on those recent rows. Think of it like orders being entered, then processed, and remaining stored:
For each iteration, the total number of rows inserted ( The Y-axis shows the staleness of statistics ( With 5 million rows, the last million inserted rows accumulated dead tuples. That is 20% of the total table, as defined by the default vacuum scale factor, but it likely represents 100% of the data actively read by your queries (for example, if the application processes the last year or less of a 5-year history). Furthermore, the last 500,000 rows have completely stale statistics, the 10% default analyze scale factor, and the past months may not have the same data distribution as the previous years. Think about the impact this has on the maximum value for an ID sequence or a created_at timestamp. It also completely skews the query planner's understanding of your data distribution (such as querying by country or day of the week). I have seen this cause severe performance issues in the real world: a retail company where shops only open on Sundays during the summer, or a trading platform suddenly processing entirely new market trends. Because the statistics are stale, the planner assumes your new, active data looks exactly like your old, historical data. As the table grows, the impact of this bloat and staleness compounds, and performance will no longer scale. Eventually, your execution plans will flip—not because the queries changed, but simply because the estimations of the query planner are completely wrong. For very large tables where the total size increases but the active working set is a small, predictable number of rows, you can effectively disable the scale factor and rely almost entirely on the fixed threshold:
This sets a nearly flat threshold that does not grow with the table size. The right threshold value depends on how many rows your active working set changes per hour and how much staleness you can tolerate. However, you must monitor the consequences of running autovacuum frequently on a growing table to ensure it does not cause localized I/O spikes. Here is how the same run starts with the new table settings: Auto analyze never left more than ~10,000 modified rows without refreshing statistics. This threshold grows slightly with the table (at 10 million rows it doubles to 20,000), but remains vastly better than the default. Auto vacuum follows the same pattern for dead tuples, but runs more frequently here because the insert-specific vacuum trigger was left at its default (1,000 rows + 20% scale factor), which only triggers the analyze threshold beyond 45,000 rows. To address this unbounded growth natively, recent PostgreSQL versions introduced Naturally, enforcing stricter thresholds—whether through these new maximum caps or manual table-level tuning—means autovacuum will run more frequently, which demands more background worker capacity. Historically, increasing Alternatively, if your table has a clear time-based or categorical boundary between hot and cold data, partitioning is worth considering. Autovacuum operates per partition, so a current_year partition with 100,000 rows will trigger maintenance far sooner than a monolithic 5-million-row table, meaning the default scale factor will naturally behave exactly as intended. Protecting your Supabase projects from npm supply chain attacks
How Supabase is responding to npm supply chain attacks and practical steps you should take today to reduce your risk.
May 25, 2026Running TidesDB as a MySQL 9.7 storage enginetidesdb-mysql is an experimental build that was developed to verify how TidesDB, the LSM-tree key/value engine, can work with MySQL 9.7 as a storage engine. The current build is v0.2.4, and it’s an experiment, not a finished product. So you can use it in your tests if you also want to try TidesDB with MySQL … Continued The post Running TidesDB as a MySQL 9.7 storage engine appeared first on Percona. Migrate from Crunchy Data PostgreSQL Operator to Percona PostgreSQL Operator: Standby Cluster MethodA Crunchy to Percona PostgreSQL migration is more straightforward than most cross-operator moves on Kubernetes, because the Percona PostgreSQL Operator is a hard fork of the Crunchy Data PostgreSQL Operator. Same Patroni HA, same pgBackRest backups, same overall CRD shape. This post walks through the safest of the three migration paths: a standby cluster method … Continued The post Migrate from Crunchy Data PostgreSQL Operator to Percona PostgreSQL Operator: Standby Cluster Method appeared first on Percona. |