a curated list of database news from authoritative sources

June 07, 2026

Extended RUM in DocumentDB extension for PostgreSQL: Efficient ESR (Equality, Sort, Range) Queries

Last year, I examined RUM indexes within this series on multi-key indexing, demonstrating that they cannot substitute MongoDB's compound indexes for sorted queries. A year later, the DocumentDB extension has evolved to incorporate an Extended RUM index, which allows for an ordered scan rather than a bitmap scan. Let's revisit our pagination query to see how it performs now.

I start a container with the latest DocumentDB (version v0.112-0 from May 26, 2026):

docker run -d --name documentdb-local -p 10260:10260 -p 9712:9712 ghcr.io/documentdb/documentdb/documentdb-local:latest  --username franck --password franck --start-pg

I can connect to PostgreSQL on port 9712, where many extensions are installed, including the extended RUM index:

docker exec -it documentdb-local psql -p 9712 postgres

psql (17.10 (Debian 17.10-1.pgdg13+1))
Type "help" for help.

postgres=# \dx
                                        List of installed extensions
Name           | Version |   Schema   |                        Description                         
-------------------------+---------+------------+------------------------------------------------------------
 documentdb              | 0.112-0 | public     | API surface for DocumentDB for PostgreSQL
 documentdb_core         | 0.112-0 | public     | Core API surface for DocumentDB on PostgreSQL
 documentdb_extended_rum | 0.112-0 | public     | DocumentDB Extended RUM index access method
 pg_cron                 | 1.6     | pg_catalog | Job scheduler for PostgreSQL
 plpgsql                 | 1.0     | pg_catalog | PL/pgSQL procedural language
 postgis                 | 3.6.3   | public     | PostGIS geometry and geography spatial types and functions
 tsm_system_rows         | 1.0     | public     | TABLESAMPLE method which accepts number of rows as a limit
 vector                  | 0.8.2   | public     | vector data type and ivfflat and hnsw access methods
(8 rows)

postgres=#

I can also connect to the MongoDB-compatible API:

docker exec -it documentdb-local mongosh -u franck -p franck 'mongodb://localhost:10260/?tls=true&tlsAllowInvalidCertificates=true'

Current Mongosh Log ID: 6a0b3b537d2a1c3471d1a7ba
Connecting to:          mongodb://<credentials>@localhost:10260/?tls=true&tlsAllowInvalidCertificates=true&directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+2.8.3
Using MongoDB:          7.0.0
Using Mongosh:          2.8.3

For mongosh info see: https://www.mongodb.com/docs/mongodb-shell/

[direct: mongos] test>

Like in the previous post, I created a simple collection with 10,000 documents:

[direct: mongos] test>
 for (let i = 0; i < 10000; i++) {
  db.demo.insertOne({
    a: 1,
    b: Math.random(),
    ts: new Date()
  });
}

I create a compound index that follows the MongoDB Equality, Sort, Range rule—designed for queries with an equality filter on a and a sort on ts:

[direct: mongos] test>
 db.demo.createIndex({ "a": 1, "ts": -1 });

I run the same query as in the previous post, which, with the standard RUM indexes, produced a Bitmap Index Scan followed by a Sort of all documents matching a: 1 before returning the top 10:

[direct: mongos] test>
 db.demo.find(
 { a: 1 }
).sort(
 { ts: -1 }
).limit(10).explain("executionStats");

The good surprise is that with the current version of DocumentDB, the execution plan looks like MongoDB's native IXSCAN with no additional sort step:

[direct: mongos] test> db.demo.find(
  { a: 1 }).sort({ts:-1}).limit(10).explain("executionStats")
;

{
  explainVersion: 2,
...
  executionStats: {
    nReturned: Long('10'),
    executionTimeMillis: 0.286,
    executionStartAtTimeMillis: 0.256,
    totalDocsExamined: Long('10'),
    totalKeysExamined: Long('10'),
    executionStages: {
      stage: 'LIMIT',
      nReturned: Long('10'),
      executionTimeMillis: 0.286,
      executionStartAtTimeMillis: 0.256,
      totalDocsExamined: 10,
      totalKeysExamined: 10,
      numBlocksFromCache: 25,
      inputStage: {
        stage: 'FETCH',
        nReturned: Long('10'),
        executionTimeMillis: 0.267,
        executionStartAtTimeMillis: 0.253,
        totalKeysExamined: 10,
        numBlocksFromCache: 25,
        inputStage: {
          stage: 'IXSCAN',
          nReturned: Long('10'),
          executionTimeMillis: 0.267,
          executionStartAtTimeMillis: 0.253,
          indexName: 'a_1_ts_-1',
          totalKeysExamined: 10,
          numBlocksFromCache: 25
        }
      }
    }
  },
  ok: 1
}

It read only the minimum necessary to get the result: ten index entries (totalKeysExamined: 10) in the expected order and fetched only ten documents (totalDocsExamined: 10). This is the most efficient execution plan.

Comparing the Two RUM Index Definitions

I connect to PostgreSQL to describe the table that stores my collection documents (you will see later how I obtained the name):

postgres=# \d documentdb_data.documents_7

            Table "documentdb_data.documents_7"

     Column      |  Type  | Collation | Nullable | Default 
-----------------+--------+-----------+----------+---------
 shard_key_value | bigint |           | not null | 
 object_id       | bson   |           | not null | 
 document        | bson   |           | not null | 

Indexes:

    "collection_pk_7" PRIMARY KEY, btree (shard_key_value, object_id)

    "documents_rum_index_25" documentdb_extended_rum (document documentdb_extended_rum_catalog.bson_extended_rum_composite_path_ops (pathspec='[ "a", { "ts" : -1 } ]', tl='2691'))

Check constraints:

    "shard_key_value_check" CHECK (shard_key_value = '7'::bigint)

postgres=#

What was a standard RUM index in the previous post is now an extended RUM index:

Attribute Previous post Current test
Index Type documentdb_rum documentdb_extended_rum
Operator Class bson_rum_single_path_ops (×2) bson_extended_rum_composite_path_ops
Fields a (asc), ts (implicit asc) a (asc), ts (desc)
Sort Direction on ts Not specified / default ascending Explicitly -1 (descending)
Path Encoding Two separate path= entries Single JSON pathspec array

The extended RUM index acts as a sort-order-aware composite index, embedding the descending direction directly into the pathspec. Unlike the previous approach, which stored each path independently, this approach encodes all indexed fields as a single composite pathspec and generates a single composite index entry per document, preserving the relative ordering between fields. An index scan (RumOrderedScan) efficiently covers both filtering and sorting, eliminating the need for a separate Sort node in the PostgreSQL execution plan. This benefit is evident when executing the same query via the DocumentDB API in PostgreSQL:

postgres=# explain (analyze, buffers, verbose, costs off)
select document from bson_aggregation_find(
  'test',
  '{
    "find": "demo",
    "filter": { "a": 1 },
    "sort":   { "ts": -1 },
    "limit": 10
  }'::documentdb_core.bson
);
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Limit (actual time=0.108..0.138 rows=10 loops=1)
   Output: document, (bson_orderby(document, '{ "ts" : { "$numberInt" : "-1" } }'::bson))
   Buffers: shared hit=4
   ->  Index Scan using "a_1_ts_-1" on documentdb_data.documents_7 collection (actual time=0.104..0.117 rows=10 loops=1)
         Output: document, bson_orderby(document, '{ "ts" : { "$numberInt" : "-1" } }'::bson)
         Index Cond: (collection.document @= '{ "a" : { "$numberInt" : "1" } }'::bson)
         Order By: (collection.document |-<> '{ "ts" : { "$numberInt" : "-1" } }'::bson)
         Buffers: shared hit=4
 Planning:
   Buffers: shared hit=2
 Planning Time: 0.453 ms
 Execution Time: 0.192 ms
(12 rows)

postgres=#

Note: I got the name of the internal table I described above from this execution plan, which uses the collection name in the query. The MongoDB API's explain() shows a MongoDB-compatible execution plan, and EXPLAIN in PostgreSQL shows the PostgreSQL version of it.

Comparison of Execution Plans

Here is how the new behavior with an ordered index scan compares to the previous bitmap scan.

Feature Ordered Index Scan Bitmap Index Scan
PostgreSQL Node Index Scan Bitmap Index Scan
Ordering Handled by sort direction in index Lost—requires a Sort node
Scan Type scanType: RumOrderedScan scanType: RumFastScan / RumRegularScan
Efficiency Supports early termination (LIMIT) Must scan all matching TIDs into bitmap
RUM Entry Point rumgettuple() rumgetbitmap()
Sort Step None — useSimpleScan = true rum_tuplesort_performsort() required
Memory Usage Low—one tuple at a time High—full TIDBitmap + sort state
Index Structure Used B-tree walk via orderStack Posting list / posting tree dump
Filter Evaluation Inline via ValidateIndexEntry() Post-collection in keyGetItem()
Seek Optimization Yes—advances queryKey as entries exhaust No
Multi-column Support Multi-column via composite pathspec Multi-column via separate entries
LIMIT benefit ✅ Full — stops after N rows ❌ None — bitmap built before LIMIT applies
Recheck Behavior xs_recheckorderby per tuple xs_recheck on bitmap result
Trigger Condition RumEnableOrderedOperatorScans + willSort + norderbys > 0 Default path

Index Only Scan

Other improvements are coming to Extended RUM, like Index Only Scan, currently supported for COUNT:

postgres=# explain (analyze, buffers, verbose, costs off)  
SELECT document FROM bson_aggregation_count(  
  'test',  
  '{  
    "count": "demo",  
    "query": { "a": 1 }  
  }'::documentdb_core.bson  
);
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=22.760..22.763 rows=1 loops=1)
   Output: documentdb_api_internal.bsoncommandcount(1)
   Buffers: shared hit=109
   ->  Index Only Scan using "a_1_ts_-1" on documentdb_data.documents_7 collection (actual time=0.089..14.292 rows=10000 loops=1)
         Output: collection.document
         Index Cond: (collection.document @= '{ "a" : { "$numberInt" : "1" } }'::bson)
         Heap Fetches: 0
         Buffers: shared hit=109
 Planning:
   Buffers: shared hit=4
 Planning Time: 0.441 ms
 Execution Time: 22.883 ms
(12 rows)

Index Only Scan will be supported in the future (see IsQueryValidForIndexOnlyScan)

Conclusion

A year ago, DocumentDB's RUM indexes had a significant limitation for pagination queries: even with the right compound index, the planner would fall back to a Bitmap Index Scan followed by a full Sort, meaning every matching document had to be collected and sorted before the first result could be returned. A LIMIT 10 query on 10,000 documents would examine all 10,000—defeating the purpose of the compound index.

With v0.112-0, this is fixed. The new documentdb_extended_rum index type, combined with the RumOrderedScan execution path, reduces the gap with native MongoDB behavior:

  • The index encodes sort direction directly in the pathspec ({ "ts": -1 })
  • The planner chooses an Index Scan instead of a Bitmap Index Scan
  • No Sort node appears in the plan
  • LIMIT 10 examines exactly 10 index entries and 10 documents

This is more than just a cosmetic change. In time-series queries—such as filtering on a low-cardinality field, sorting by timestamp descending, and retrieving the first page—the difference between the two plans ranges from O(result) to O(total size). For OLTP systems, pagination queries are common and need to be quick and reliable, since they show results to the user before the user takes any action, selects, refines filters, or moves to the next page.

This ordered scan is also essential for TTL indexes to efficiently identify expiration candidates.

The key ingredients that make this work together are visible from the DocumentDB open-source code:

  1. documentdb_extended_rum—composite pathspec with explicit sort direction
  2. bson_extended_rum_composite_path_ops — single operator class covering all fields
  3. RumOrderedScan — B-tree walk in index order via orderStack, bypassing rumgetbitmap()
  4. useSimpleScan — returns one tuple at a time, enabling true LIMIT pushdown
  5. RumAllowOrderByRawKeys — the GUC that enables this path, now on by default

This is all enabled by default:

postgres=# \dconfig *rum*order*
           List of configuration parameters
                  Parameter                   | Value
----------------------------------------------+-------
 documentdb_rum.enable_ordered_operator_scans | on
 documentdb_rum.forceRumOrderedIndexScan      | off
(2 rows)

In under a year, DocumentDB evolved from "RUM instead of GIN, but with the same pagination limitations" to "RUM with ordered scan, aligning more with MongoDB's IXSCAN behavior for ESR-pattern indexes". For developers implementing cursor-based pagination or queries with a selective filter and sorting on a time or sequence field, this marks the version at which it begins to function as expected.

June 05, 2026

Getting Started with pg_durable: Durable Workflows Inside PostgreSQL

Modern applications demand long‑lasting workflows that can withstand crashes, restarts, and failures, often relying on external platforms. pg_durable integrates durable workflow orchestration directly into PostgreSQL, enabling workflows to be created in SQL and reliably executed, monitored, inspected, and recovered from disruptions.

This tool is particularly useful for ETL processes, data pipelines, background tasks, scheduled jobs, long‑running business procedures, and internal workflow management.

To truly grasp its capabilities, the best approach is to try it yourself. I did so by building and running it locally, which provided practical insight.

Installing pg_durable

I installed Rust to compile the extension:

curl https://sh.rustup.rs -sSf | sh

. "$HOME/.cargo/env"

I installed the PostgreSQL build dependencies.

sudo dnf install -y openssl-devel pkg-config libicu-devel readline-devel flex

I installed pgrx for PostgreSQL 17:

cargo install cargo-pgrx --version 0.16.1 --locked

cargo pgrx init --pg17=download

export PG_VERSION=pg17

I enabled the extension preload library:

sed -e "/shared_preload_libraries/s/^.*$/shared_preload_libraries='pg_durable'/" \
    -i ~/.pgrx/data-17/postgresql.conf

I declared the superuser that I'll create to run the background worker:

cat >> ~/.pgrx/data-17/postgresql.conf <<<"pg_durable.worker_role='superworker'"

I cloned pg_durable from the Microsoft repository:

git clone https://github.com/microsoft/pg_durable.git

cd pg_durable

I built the extension, started PostgreSQL, and got a psql prompt:


cargo pgrx run

I checked the pg_durable configuration, connected to the pg_durable.database, created the pg_durable.worker_role user, and installed the extension:

pg_durable=# \dconfig pg_durable.*

          List of configuration parameters
               Parameter               |    Value
---------------------------------------+-------------
 pg_durable.database                   | postgres
 pg_durable.enable_superuser_instances | off
 pg_durable.execution_acquire_timeout  | 30
 pg_durable.max_duroxide_connections   | 10
 pg_durable.max_management_connections | 6
 pg_durable.max_user_connections       | 10
 pg_durable.worker_role                | superworker
(7 rows)

\c postgres

CREATE ROLE superworker SUPERUSER LOGIN;

CREATE EXTENSION pg_durable;

My PostgreSQL instance is now ready to use durable functions.

Hello World

I create a user that will execute workflows and grant access to pg_durable:

CREATE USER franck;

GRANT CREATE ON DATABASE postgres TO franck;

SELECT df.grant_usage('franck');

I reconnect as that user:

postgres=# \c postgres franck

You are now connected to the database "postgres" as user "franck".

postgres=>

I start a durable workflow:

postgres=> SELECT df.start(
           $$SELECT ''Hello, durable world!'' AS message$$
          );

  start
----------
 393f09e0
(1 row)

One second later, it is completed:

postgres=> SELECT df.status('393f09e0');

  status
-----------
 completed

(1 row)

The result can be fetched as a document with a list of rows:

postgres=> SELECT df.result('393f09e0');

                              result
------------------------------------------------------------------
 {"rows": [{"message": "Hello, durable world!"}], "row_count": 1}

(1 row)

I used the same "Hello, durable world!" example as in the HorizonDB durable functions documentation, because this is where pg_durable comes from: Microsoft provides it to PostgreSQL as an open-source extension.

Here is more information about this execution:

postgres=> \x
Expanded display is on.

postgres=> SELECT * FROM df.list_instances();

-[ RECORD 1 ]---+-------------------------------------------------------------
instance_id     | 393f09e0
label           |
function_name   | pg_durable::orchestration::execute-function-graph
status          | completed
execution_count | 1
output          | {"rows":[{"message":"Hello, durable world!"}],"row_count":1}

postgres=> \x
Expanded display is off.
postgres=>

The execution graph is basic, as there's a single operation:

postgres=> SELECT df.explain('393f09e0');

                                explain
------------------------------------------------------------------------
 Instance: 393f09e0                                                    +
 Status:    Completed                                                 +
 Output:   {"rows":[{"message":"Hello, durable world!"}],"row_count":1}+
                                                                       +
 SQL: SELECT 'Hello, durable world!' AS message 
(1 row)

There is a single node:

postgres=> \x
Expanded display is on.

postgres=> SELECT * from df.instance_nodes('393f09e0');

-[ RECORD 1 ]+-----------------------------------------------------------------
execution_id | 1
node_id      | 60e433bb
node_type    | SQL
query        | SELECT 'Hello, durable world!' AS message
result_name  |
left_node    |
right_node   |
status       | completed
result       | {"rows": [{"message": "Hello, durable world!"}], "row_count": 1}
updated_at   | 2026-06-04 16:36:38.594906+00

postgres=> \x
Expanded display is off.

It was executed once:


postgres=> SELECT * from df.instance_executions('393f09e0');

 execution_id |  status   | event_count | duration_ms |                            output
--------------+-----------+-------------+-------------+--------------------------------------------------------------
            1 | Completed |          14 |         692 | {"rows":[{"message":"Hello, durable world!"}],"row_count":1}
(1 row)

Here are the global metrics:

postgres=> SELECT * FROM df.metrics();

 total_instances | running_instances | completed_instances | failed_instances | total_executions | total_events
-----------------+-------------------+---------------------+------------------+------------------+--------------
               1 |                 0 |                   1 |                0 |                1 |           14
(1 row)

This simply looks like ADBA—Asynchronous Database Access, but it can do more. We will see that it can define complex workflows with a declarative syntax. The examples start the workflow immediately, but it can also wait for a signal or a specific time using a cron syntax:

Beyond a Single Query

While a simple SQL statement demonstrates the basics, pg_durable becomes much more powerful when orchestrating multiple operations.

The SQL workflow DSL supports:

Feature Operator
Variable capture =>
Sequential execution ~>
Parallel execution (wait) &
Parallel execution (race) `
Conditional branching {% raw %}?> and !>
Infinite loops @>

Here are simple table-free examples demonstrating pg_durable's key features. Let's start with a sequential execution (~>):

postgres=> SELECT df.start(
       'SELECT now() as step1' |=> 't1'
    ~> 'SELECT pg_sleep(5)'
    ~> 'SELECT now() as step2' |=> 't2'
    ~> 'SELECT now() as step3' |=> 't3',
    'sequential-timing'
) as i \gset

-- Check the results after five seconds
postgres=> SELECT df.result(:'i');

                                  result
---------------------------------------------------------------------------
 {"rows": [{"step3": "2026-06-04T21:33:59.939785+00:00"}], "row_count": 1}

(1 row)

postgres=> SELECT node_type, result_name, result, node_id, left_node, right_node FROM df.instance_nodes(:'i');

 node_type | result_name |                                  result                                   | node_id  | left_node | right_node
-----------+-------------+---------------------------------------------------------------------------+----------+-----------+------------
 THEN      | t3          | {"rows": [{"step3": "2026-06-04T21:33:59.939785+00:00"}], "row_count": 1} | 49c9c8b4 | 48fcec43  | 9f818758
 THEN      | t2          | {"rows": [{"step2": "2026-06-04T21:33:59.507283+00:00"}], "row_count": 1} | 48fcec43 | 348c1948  | c78f2200
 THEN      |             | {"rows": [{"pg_sleep": null}], "row_count": 1}                            | 348c1948 | c7317ea8  | 200f8a52
 SQL       | t1          | {"rows": [{"step1": "2026-06-04T21:33:53.746030+00:00"}], "row_count": 1} | c7317ea8 |           |
 SQL       |             | {"rows": [{"pg_sleep": null}], "row_count": 1}                            | 200f8a52 |           |
 SQL       |             | {"rows": [{"step2": "2026-06-04T21:33:59.507283+00:00"}], "row_count": 1} | c78f2200 |           |
 SQL       |             | {"rows": [{"step3": "2026-06-04T21:33:59.939785+00:00"}], "row_count": 1} | 9f818758 |           |

(7 rows)

I displayed the timestamps to show that steps 2 and 3 occur 5 seconds after step 1 because I used pg_sleep(5) in the sequential execution.

The node_type column indicates the type of operation each node represents in the function graph. Here, THEN denotes sequential execution: running the left child, then the right child, and SQL is the leaf node with no children, executing the SQL query.

The complete list of valid node types in pg_durable is:

node_type Description Operator/Function
SQL Execute SQL query Plain string or df.sql()
THEN Sequential execution ~> operator or df.seq()
IF Conditional branch ?> !> operators or df.if()
JOIN Parallel execution (wait all) & operator or df.join()
RACE Parallel execution (first wins) \ operator or df.race()
LOOP Infinite loop @> operator or df.loop()
BREAK Exit loop df.break()
SLEEP Pause execution df.sleep()
WAIT_SCHEDULE Wait for cron schedule df.wait_for_schedule()
HTTP Make HTTP request df.http()
SIGNAL Wait for external signal df.wait_for_signal()

We can execute steps in parallel (&):

SELECT df.start(
    ('SELECT now() as branch1' |=> 'b1' ~> 'SELECT pg_sleep(20)')
    &
    ('SELECT now() as branch2' |=> 'b2' ~> 'SELECT pg_sleep(10)')
    ~> 'SELECT now() as after_join' |=> 'final',
    'parallel-sleep'
) as i \gset

postgres=> SELECT node_type, result_name, result, node_id, left_node, right_node FROM df.instance_nodes(:'i');

 node_type | result_name |                                              result                                              | node_id  | left_node | right_node
-----------+-------------+--------------------------------------------------------------------------------------------------+----------+-----------+------------
 SQL       | b1          | {"rows": [{"branch1": "2026-06-04T21:53:47.870539+00:00"}], "row_count": 1}                      | 9d479c96 |           |
 SQL       | b2          | {"rows": [{"branch2": "2026-06-04T21:53:47.870606+00:00"}], "row_count": 1}                      | cb844cd8 |           |
 SQL       |             | {"rows": [{"pg_sleep": null}], "row_count": 1}                                                   | ad8ec143 |           |
 THEN      |             | {"rows": [{"pg_sleep": null}], "row_count": 1}                                                   | 1ff3adc8 | cb844cd8  | ad8ec143
 SQL       |             | {"rows": [{"pg_sleep": null}], "row_count": 1}                                                   | 9db57c15 |           |
 THEN      |             | {"rows": [{"pg_sleep": null}], "row_count": 1}                                                   | 5e4d8069 | 9d479c96  | 9db57c15
 JOIN      |             | [{"rows": [{"pg_sleep": null}], "row_count": 1}, {"rows": [{"pg_sleep": null}], "row_count": 1}] | 6f95c84d | 5e4d8069  | 1ff3adc8
 SQL       |             | {"rows": [{"after_join": "2026-06-04T21:54:08.752247+00:00"}], "row_count": 1}                   | d4be28af |           |
 THEN      | final       | {"rows": [{"after_join": "2026-06-04T21:54:08.752247+00:00"}], "row_count": 1}                   | df59e250 | 6f95c84d  | d4be28af

(9 rows)

Here, b1 and b2 started at the same time, in parallel. One branch completed after 20 seconds and the other after 10 seconds, so the final step started after 20 seconds, and the total duration was 20 seconds—it would have been more than 30 seconds if executed serially:

postgres=> SELECT * from df.instance_executions(:'i');

 execution_id |  status   | event_count | duration_ms |                                   output
--------------+-----------+-------------+-------------+----------------------------------------------------------------------------
            1 | Completed |          26 |       22107 | {"rows":[{"after_join":"2026-06-04T21:54:08.752247+00:00"}],"row_count":1}

(1 row)

Parallel nodes are used when we need the result from all operations. The next step awaits all. If we want to continue as soon as one branch gets a result, we can run a race, where the first one wins (|):

SELECT df.start(
    'SELECT pg_sleep(30)' | 'SELECT pg_sleep(90)' ~> 'SELECT ''fast'' as winner',
    'race-test'
) as i \gset

select usename,backend_start,query_start,wait_event,state,query,application_name from pg_stat_activity where usename=user and pid!=pg_backend_pid(
)
\watch 10


                                              Fri 05 Jun 2026 04:08:30 AM GMT (every 10s)

 usename |         backend_start         |          query_start          | wait_event | state  |        query        | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
 franck  | 2026-06-05 04:08:29.873785+00 | 2026-06-05 04:08:29.877368+00 | PgSleep    | active | SELECT pg_sleep(30) |
 franck  | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep    | active | SELECT pg_sleep(90) |
(2 rows)

                                              Fri 05 Jun 2026 04:08:40 AM GMT (every 10s)

 usename |         backend_start         |          query_start          | wait_event | state  |        query        | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
 franck  | 2026-06-05 04:08:29.873785+00 | 2026-06-05 04:08:29.877368+00 | PgSleep    | active | SELECT pg_sleep(30) |
 franck  | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep    | active | SELECT pg_sleep(90) |
(2 rows)

                                              Fri 05 Jun 2026 04:08:50 AM GMT (every 10s)

 usename |         backend_start         |          query_start          | wait_event | state  |        query        | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
 franck  | 2026-06-05 04:08:29.873785+00 | 2026-06-05 04:08:29.877368+00 | PgSleep    | active | SELECT pg_sleep(30) |
 franck  | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep    | active | SELECT pg_sleep(90) |
(2 rows)

                                              Fri 05 Jun 2026 04:09:00 AM GMT (every 10s)

 usename |         backend_start         |          query_start          | wait_event | state  |        query        | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
 franck  | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep    | active | SELECT pg_sleep(90) |
(1 row)

                                              Fri 05 Jun 2026 04:09:10 AM GMT (every 10s)

 usename |         backend_start         |          query_start          | wait_event | state  |        query        | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
 franck  | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep    | active | SELECT pg_sleep(90) |
(1 row)

                                              Fri 05 Jun 2026 04:09:20 AM GMT (every 10s)

 usename |         backend_start         |          query_start          | wait_event | state  |        query        | application_name
---------+-------------------------------+-------------------------------+------------+--------+---------------------+------------------
 franck  | 2026-06-05 04:08:29.875738+00 | 2026-06-05 04:08:29.879296+00 | PgSleep    | active | SELECT pg_sleep(90) |
(1 row)

                                              Fri... (truncated)
                                    

June 04, 2026

Understanding how backups work in Amazon Aurora

In this post, we dive deep into the Aurora backup architecture, how it differs from Amazon RDS backups, and the Amazon CloudWatch metrics available to monitor your backup storage usage. Through detailed scenarios and visualizations, we demonstrate how workload patterns and retention periods impact backup costs. We also explore cross-Region backup options and share recommended practices to optimize your backup storage consumption.

Migrating from MongoDB 6.0 to 8.0: How Percona ClusterSync Handles Cross-Version Replication

Percona ClusterSync for MongoDB (PCSM) replicates data between MongoDB clusters to keep migrations with near-zero downtime. Prior to version 0.9.0 it required the source and target to run the same major version, which ruled out the lift-and-shift move most migrations want: going from an older major like 6.0 straight onto a newer one like 8.0. … Continued

The post Migrating from MongoDB 6.0 to 8.0: How Percona ClusterSync Handles Cross-Version Replication appeared first on Percona.

Supabase Series F

Supabase has raised a $500M Series F at a $10B pre-money valuation, led by GIC.

June 03, 2026

Index types supported in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL using extensions (Bloom, pg_trgm, and pg_bigm)

In Part 1, Part 2, and Part 3 of this series, we explored PostgreSQL’s native indexes (B-tree, GIN, GiST, HASH, BRIN) and specialized extension-based index types (SP-GiST, btree_gin, btree_gist). In this post, we dive into three additional extensions: Bloom (for space-efficient multi-column equality filtering), pg_trgm (for fuzzy text matching and similarity searches), and pg_bigm (for full-text search optimized for Asian languages)

Index types supported in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL using extensions (SP-GiST, Btree_Gin and Btree_Gist)

In this post, the third in the series, we dive into three extension-based index types: SP-GiST, btree_gin, and btree_gist. These are available in Amazon Aurora PostgreSQL-Compatible Edition and Amazon Relational Database Service (Amazon RDS) for PostgreSQL. PostgreSQL’s index infrastructure is extensible. Operator classes define how indexes behave for specific data types and operations. The SP-GiST, btree_gin, and btree_gist extensions take advantage of this extensibility to give you additional indexing strategies beyond the native options. We walk through when to use each extension, the data types they support, and practical examples that demonstrate their performance benefits.

Migrating data from Oracle to Amazon Aurora DSQL

This post walks through migrating data from an Oracle source to Amazon Aurora DSQL, using AWS DMS, Amazon S3, AWS Glue, and AWS Step Functions to create an automated, cost-effective migration pipeline suitable for enterprise-scale deployments.

Implementing real-time change data capture with Debezium for Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL

In this post, we demonstrate how to implement a production-ready CDC solution by using Amazon Aurora for PostgreSQL, Debezium connectors, and Amazon Managed Streaming for Apache Kafka (Amazon MSK). This solution captures database changes in real time and streams them to Kafka topics so that downstream consumers can process the same data for different business purposes.

June 02, 2026

ACM CAIS: Conference on AI and Agentic Systems

Last week, I traveled to San Jose to attend the ACM CAIS conference. On Day 0, I gave a short talk at the Supporting our AI Overlords (SAO) workshop. And yes, I promise to write a summary of our paper, "A Case for Simulation-Driven Resilience in Agent-First Data Systems" soon! 

To start with an overall impression of the conference: much of the work presented felt exploratory and anecdotal. Since the compound AI space is still so new, many work seemed to share on-the-ground best practices that worked for them rather than principled results. Some talks really leaned into the "agent, act like a senior engineer and don't make mistakes" vibe. This was especially apparent in the "Agent Skills Workshop". I am not saying this is a bad thing, I learned some valuable lessons from that workshop, which I'll share below.

CAIS defines the conference's scope broadly as "research on compound AI architectures, optimization, and deployment". Unfortunately, this broadness seemed to work against the main track. Attendance at the primary conference talks was low, and it often felt like attendees were there solely to present their own work rather than engage with others, likely because the subject matter was spread too thin.

In contrast, the workshops were highly focused, which led to much better engagement and active listening from the audience. Moving forward, I think CAIS would benefit greatly from narrowing its focus, maybe specifically focusing more on data systems and infrastructure in support of AI.

On that note, what happened to our collective attention span? CAIS limited paper presentations to 7 minutes with just 2 minutes for questions. This year, SIGMOD also shifted to 9-minute talks. Our own paper, "LeaseGuard: Raft Leases Done Right!", got a mere 9 minutes in the spotlight after Jesse traveled all the way to Bangalore to present it. I've even heard that USENIX Security is down to 3-minute talks now. Should we maybe consider slowing down? After all, isn't attention all we need?


Agent Skills workshop (Day 0)

In the first talk, Graham Neubig discussed OpenHands, their open-source AI developer agent platform that's getting a lot of traction in highly regulated fields like finance and healthcare to speed up software development. A big theme of his talk was skill induction: "the process of inducing/verifying programmatic or prompt-based capabilities through testing/evaluation to enable single-agent systems complete complex long-horizon tasks". Through leveraging offline human-annotated examples or online user feedback, skill induction kicks off a rapid learning phase. In web navigation tests like WebArena, an agent's success rate can ramp up dramatically over a small number of trials before settling into a robust repeatable skill set.

Later on, Kanav Garg (co-founder of Core Automation) walked us through the lifecycle of a Reinforcement Learning (RL) environment. He defined it as a continuous loop made up of an actionable prompt, a starting state, a runtime environment (like a Docker container), configuration, and a reward system. The main takeaway here was that successful RL needs careful difficulty calibration and precise reward shaping to keep agents from hacking the reward system. To get this right, engineers have to actually look at the agent's traces instead of blindly trusting that the numbers on a chart are going up. Kanav also said that data environments are living projects with a shelf life of at most two months, and this means continuous learning from task failures and automated data pipelines are far more important/effective than relying on static expensive human data.


SAO: Supporting our AI overlords (Day 0)

The core theme of the SAO workshop was that agents are rapidly becoming both the primary users and the builders of data systems, and this shift is creating a vast new design space demanding entirely new abstractions. The workshop featured three keynote speakers and was incredibly well attended. With no seating left, people were standing in the doorway just to watch the talks.

Aaron Katz from Clickhouse gave a nice talk on this transition from human to agent users. He said that agents aren't just querying data anymore; they are actively provisioning services, so they need their own identities and budgets. Because agents drive such massive concurrency and require interactive latencies, traditional per-query pricing models are becoming way too punitive. To keep up, platforms have to adapt to headless API-first experiences. He said that Clickhouse is launching the ability to build agents directly inside the database for lower latency and for blending structured and full-text search.

Next, Andy Pavlo talked about databases being the "final boss" for agents. Check out his slides. It is classic Andy humor and style, though a Wu-Tang Clan reference was sorely missing this time. Andy focused heavily on automated database tuning and development, comparing their Proto-X tuning agent with ChatGPT's Lambda-tune. While Proto-X gets better optimization results, it takes 12 hours to train per database, whereas ChatGPT is fast (14 minutes) but performs terribly. To bridge this gap, they adopted LLMs to boost automatic tuning algorithms by leveraging prior history. In the second (and shorter) part of the talk, Andy also noted that while coding agents are making progress, they still completely fail at building complex database components like query optimizers, which require much more support. Although they have successfully "vibed" and manually verified a couple of optimization passes (like DPHyp and Unnesting v2), blindly accepting an LLM's output is fraught with problems because verifying query plan equivalence is notoriously difficult (despite solver efforts from UW, Berkeley, and Microsoft). It seems like coding agents love to add special-case code when an optimizer actually needs to be as general as possible.

Finally, Nikita Shamgunov (ex-Neon, now Databricks) discussed the infrastructure needed for agents, categorizing this down into three pillars: state, compute, and middleware. He argued that because agents speed up the dev loop by 1000x, true serverless architectures are now absolutely critical to avoid the insane costs of overprovisioning. He also described Neon's architecture, highlighting how separating compute from storage allows for instant database branching for agents using microVMs. While he shared some genuinely interesting technical points, the presentation itself was pretty dry and felt like it was missing a clear focus.

The workshop ended with a panel. The organizers tried to spark a debate, but there wasn't much disagreement. The general consensus was that while traditional OLTP and OLAP boundaries will remain, agents will increasingly be the ones conducting tasks that span seamlessly across them. After the panel, we headed to a MongoDB-sponsored happy hour, where the workshop audience had more time to have relaxed conversations about the breakneck hellscape transition our industry is currently going through.


Wednesday (Day 1)

Since CAIS is an AI systems conference, the focus is more on building the systems that surround AI models rather training individual AI models. The dominant theme this year seemed to be [multi-]agent architectures, coordination protocols, and workflow design. The emphasis is on composition: how do we organize these agents, manage their contexts, coordinate their interactions, and handle their lifecycles?

A second major trend is the growing focus on day-to-day operations. Entire sessions were dedicated to evaluation, trace analysis, failure detection, routing, and cost optimization. With so many papers covering efficiency, scheduling, and economic tradeoffs, it is clear that the industry is shifting from just maximizing capability to maximizing capability per dollar.

My shortlist from Day 1:

Here a brief overview of the TraceFix paper, which is most relevant to my interests.

TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples

TraceFix tackles the coordination failures that happen when multiple LLM agents try to work together concurrently on shared tasks. It is crucial to understand that this paper is not about generating classical distributed computing algorithms; rather, it focuses on scaffolding the formal "rules of engagement" for multi-agent LLM systems. When agents collaborate on domain tasks requiring fine-grained mutual exclusion over shared mutable resources (such as editing the same codebase or scheduling access to a simulated lab instrument) they naturally run into interleaving-sensitive bugs like deadlocks, missed handshakes, and race conditions. TraceFix solves this by isolating the coordination layer, formally verifying the protocol for how agents use shared locks and message channels, and then allowing the agents to remain completely autonomous in executing their actual domain-specific work.

To achieve this, TraceFix introduces a verification-first pipeline where an orchestration agent first synthesizes a declarative protocol topology and writes the behavioral logic in PlusCal. Before any agent takes action, the TLA+ model checker (TLC) exhaustively searches this proposed protocol for safety violations and feeds concrete counterexample traces back to the agent for iterative repair until the code is fully verified. At runtime, these verified process bodies are compiled into agent prompts, and a monitor strictly enforces the approved topology by rejecting any invalid or out-of-bounds coordination attempts. TraceFix is evaluated across a benchmark of 48 complex tasks. The system achieved a 100% verification success rate within four repair iterations and significantly improved runtime task completion, proving that formal model-checker feedback can effectively eliminate the deadlocks and resource clashes in multi-agent workflows.

Counterintuitively, the introduction of this formal coordination scaffolding did not bog the system down with unnecessary overhead, but rather, it significantly accelerated execution. By structurally preventing agents from colliding over resources (a problem that caused a massive 61.2% contention rate and endless retry loops in unstructured chat-only setups) TraceFix eliminated wasted trial-and-error steps. As a result, the verified topology-monitored protocol executed in an average of just 93 seconds using 62 tool-call steps. This vastly outperformed both the chaotic chat-only baseline (229 seconds and 203 steps) and a sequential single-agent baseline (~304 seconds).


Thursday (Day 2)

On Thursday, the papers were more about making agent systems manageable. Many of these papers are about memory, planning, governance, verification, security, and control. Researchers are increasingly treating agents as long-running software systems that require architecture, interfaces, observability, safety mechanisms, and operational discipline. This resembles the evolution of distributed systems from clever protocols toward operational concerns such as consistency models, monitoring, fault tolerance, and standards.

A second trend was the emergence of agents as compound systems rather than monolithic models. Instead of expecting a single model to solve everything, people arenow  building ecosystems of interacting components with explicit roles. The vibe here is more like systems engineering for AI. 

My shortlist:


Friday (Day 3)

In the Friday program, multiple papers study agent societies, debate, persuasion, socialization, consensus formation, and safety in multi-agent environments. Also, several papers focus on practical concerns such as routing requests across models, optimizing energy consumption, serving multi-agent systems, evaluating agent frameworks, and integrating AI into production workflows. 

My shortlist:

Extend MySQL Using Rust

Discover how the VillageSQL Rust SDK brings native, memory-safe Rust extension development to MySQL, bypassing C++ entirely.

June 01, 2026

Accelerating developer productivity in the agentic AI era with Amazon Aurora PostgreSQL

In this post, you learn how Amazon Aurora PostgreSQL-Compatible Edition accelerates developer productivity in the agentic AI era. We explore three core design convictions: meet developers where they work, absorb workload variability, and grow with the application from prototype to global scale.

Filter on Children, Sort by Parent: One-to-Many Compound Index Strategies in PostgreSQL

Before looking at PostgreSQL, I'll first introduce the problem by showing how MongoDB's document model and multi-key indexes handle compound indexes across a one-to-many relationship. Then we will see how to approximate this in PostgreSQL by denormalizing on the "many" side or the "one" side, how to maintain consistency with cascade foreign keys or triggers, and how to accelerate filtering with sorted pagination using B-tree, GIN, or RUM indexes.

Multi-key indexes in MongoDB allow compound indexes on one-to-many relationships. For example, the following index covers fields from both the children (child_value) and the parent (parent_value) when children are embedded as an array (in a children field) within the parent document:

db.parent.createIndex(
 { "children.child_value": 1, parent_value: 1 }
);

Such an index can efficiently support an equality filter on child_value combined with a sort and pagination on parent_value. For instance, finding the top 10 parents, ordered by parent_value, where at least one child has a child_value of 0.9:

db.parent.find(
 { "children.child_value": 0.9 }
).sort({ "parent_value": 1 }).limit(10);

It is important to understand the semantics of predicates on embedded arrays. This query does not return the top 10 children with this value — a parent may have multiple matching children, but we get one result per parent. This differs from a relational join (or $unwind in a MongoDB aggregation pipeline), which would produce one row per child. Here, the predicate tests only for the existence of at least one child with the specified value. The result set contains distinct parents, not (parent, child) pairs.

A multi-key index stores multiple index entries per parent document — one for each distinct child_value in the array — paired with the parent's parent_value. The number of index entries per parent equals the number of distinct child values, which can be less than the total number of children when multiple children share the same value. For example, given these documents:

{
  _id: 24,
  parent_value: "Y",
  children: [
    { child_num: 1, child_value: 0.9 },
    { child_num: 2, child_value: 0.8 }
  ]
},
{
  _id: 42,
  parent_value: "X",
  children: [
    { child_num: 1, child_value: 0.5 },
    { child_num: 2, child_value: 0.9 },
    { child_num: 3, child_value: 0.9 },
    { child_num: 4, child_value: 0.7 }
  ]
}

The index contains three entries for the second document, because the two children with child_value: 0.9 produce only one index entry:

  (0.5, "X", 42) → record ID of { _id: 42 }
  (0.7, "X", 42) → record ID of { _id: 42 }
  (0.8, "Y", 24) → record ID of { _id: 24 }
  (0.9, "X", 42) → record ID of { _id: 42 }
  (0.9, "Y", 24) → record ID of { _id: 24 }

The index scan for {child_value: 0.9} finds two entries, already ordered by the second field of the compound key, parent_value. MongoDB can read them in order — "X" before "Y" — and fetch the corresponding documents without an additional sort. With a .limit(10), the scan stops after 10 distinct parents are found, making the query efficient regardless of how many total documents exist in the collection.

There is no direct equivalent of this in PostgreSQL. If the one-to-many relationship is stored as JSONB, GIN indexes can locate documents containing {child_value: 0.9} but do not preserve ordering on other fields. The query must fetch all matching documents, sort them on parent_value, and only then return the top 10. If the one-to-many relationship is normalized into two tables, no single index can span columns from both tables — a join is required first, and the sort optimization depends on the join strategy chosen by the planner.

Normalized One-to-Many

Now that the problem is defined, let's see how we can work around this limitation in a normalized model:

drop table if exists parent, child cascade;

create table parent (
 parent_id    bigserial primary key,
 parent_value float
);

create table child (
 child_id     bigserial primary key,
 parent_id    int references parent (parent_id),
 child_value  float
);

insert into parent (parent_value)
 select random() from generate_series(1, 1000)
;

insert into child (parent_id, child_value)
 select parent_id, round(random()::numeric, 4) from parent, generate_series(1, 1000)
;

create index on parent (parent_value);
create index on child (child_value);

vacuum analyze parent, child;

The following query finds the top 10 parents, by parent_value, that have at least one child with a child_value of 0.9999:

-- explain (analyze, buffers, verbose, costs off)
select *
from parent p
where exists (
  select 1 from child c
  where c.parent_id = p.parent_id
  and c.child_value = 0.9999
)
order by parent_value
limit 10;

 parent_id |     parent_value
-----------+----------------------
       956 | 0.013518349069161273
       789 | 0.022813950892476287
       307 | 0.024740344416860793
       833 | 0.041118650516225985
       493 |  0.07587347477845374
       402 |  0.08400127026866477
       256 |   0.0980166832909124
       342 |  0.11175505348300807
       846 |  0.12047992766758941
       590 |  0.16840580135419425
(10 rows)

The execution plan shows that:

  1. The index on child_value was scanned to find all index entries with the expected value — here rows=99.0 — which are then used to locate the corresponding blocks in the child table (this is a Bitmap Index Scan).
  2. The Bitmap Heap Scan on the child table returned all children with child_value = 0.9999 (there are 95 if we count the actual matches — a bitmap scan can have some false positives). The results were stored in memory (Materialize).
  3. The index on parent_value was scanned to retrieve parents in the desired order, checking for each whether its parent_id appears in the materialized buffer. It had to read rows=169.00 parents before finding rows=10.00 that satisfied the condition, with lookups on the materialized result (loops=169).
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit (actual time=0.366..2.272 rows=10.00 loops=1)
   Output: p.parent_id, p.parent_value
   Buffers: shared hit=245
   ->  Nested Loop Semi Join (actual time=0.365..2.270 rows=10.00 loops=1)
         Output: p.parent_id, p.parent_value
         Join Filter: (p.parent_id = c.parent_id)
         Rows Removed by Join Filter: 16246
         Buffers: shared hit=245
         ->  Index Scan using parent_parent_value_idx on public.parent p (actual time=0.012..0.080 rows=169.00 loops=1)
               Output: p.parent_id, p.parent_value
               Index Searches: 1
               Buffers: shared hit=142
         ->  Materialize (actual time=0.000..0.006 rows=96.19 loops=169)
               Output: c.parent_id
               Storage: Memory  Maximum Storage: 20kB
               Buffers: shared hit=103
               ->  Bitmap Heap Scan on public.child c (actual time=0.032..0.151 rows=99.00 loops=1)
                     Output: c.parent_id
                     Recheck Cond: (c.child_value = '0.9999'::double precision)
                     Heap Blocks: exact=99
                     Buffers: shared hit=103
                     ->  Bitmap Index Scan on child_child_value_idx (actual time=0.016..0.016 rows=99.00 loops=1)
                           Index Cond: (c.child_value = '0.9999'::double precision)
                           Index Searches: 1
                           Buffers: shared hit=4

The other index on parent_value can serve the sort but not the filter on child_value. In both cases, many more rows must be read to combine the two conditions after a Nested Loop Join. That is the price of normalization.

In a SQL database, if we want a single index to cover both fields, they must exist in the same table.

Denormalize into the "Many" Side (Children)

A common solution is to duplicate the parent value into the child table and create a compound index:

alter table child add parent_value float;

update child
  set parent_value = parent.parent_value
  from parent
  where child.parent_id = parent.parent_id;

alter table child alter parent_value set not null;

create index on child (child_value, parent_value);

The query can now operate on the child table only:

-- explain (analyze, buffers, verbose, costs off)
select distinct parent_id, parent_value
 from child
 where child_value = 0.9999
 order by parent_value
 limit 10;

 parent_id |     parent_value
-----------+----------------------
       956 | 0.013518349069161273
       789 | 0.022813950892476287
       307 | 0.024740344416860793
       833 | 0.041118650516225985
       493 |  0.07587347477845374
       402 |  0.08400127026866477
       256 |   0.0980166832909124
       342 |  0.11175505348300807
       846 |  0.12047992766758941
       590 |  0.16840580135419425
(10 rows)

The execution plan shows a single index scan which covers:

  • the filter: Index Cond: child_value = 0.9999
  • the order: Presorted Key: child.parent_value
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Limit (actual time=0.067..0.072 rows=10.00 loops=1)
   Output: parent_id, parent_value
   Buffers: shared hit=36
   ->  Unique (actual time=0.066..0.070 rows=10.00 loops=1)
         Output: parent_id, parent_value
         Buffers: shared hit=36
         ->  Incremental Sort (actual time=0.066..0.066 rows=11.00 loops=1)
               Output: parent_id, parent_value
               Sort Key: child.parent_value, child.parent_id
               Presorted Key: child.parent_value
               Full-sort Groups: 1  Sort Method: quicksort  Average Memory: 26kB  Peak Memory: 26kB
               Buffers: shared hit=36
               ->  Index Scan using child_child_value_parent_value_idx on public.child (actual time=0.014..0.054 rows=33.00 loops=1)
                     Output: parent_id, parent_value
                     Index Cond: (child.child_value = '0.9999'::double precision)
                     Index Searches: 1
                     Buffers: shared hit=36
 Planning:
   Buffers: shared hit=3
 Planning Time: 0.132 ms
 Execution Time: 0.088 ms

There is an additional sort because the result may contain duplicate parents when they have multiple children with the same value, but it is an incremental sort so it does not have to read all children — here only rows=33.00 before producing rows=10.00. This gets the top ten parents by parent_value without reading more child rows than necessary.

This is nearly optimal. The new index resembles a multi-key index, except that it may contain duplicate entries for the same parent — unnecessary for this query, but potentially useful for others that need the detail of individual children. Like the multi-key index, it must be maintained when the parent value changes. We have two solutions: referential integrity constraint and triggers.

Update with Cascade Constraint (Recommended)

One way to keep this column automatically updated is declarative. We add an additional foreign key that enforces that the parent_value in the child table matches the one in the parent, and declare it on update cascade:

alter table parent add unique (parent_id, parent_value);

alter table child
 add constraint fk_child_parent
 foreign key (parent_id, parent_value)
 references parent (parent_id, parent_value)
 on update cascade;

Unfortunately, PostgreSQL cannot use the same index for two unique constraints that share a prefix, so two indexes are maintained on the parent table. This solution is intended for cases where updates are not frequent.

I test the performance overhead when updating one parent:

\timing on
explain (analyze, buffers, costs off, verbose on)
 update parent
 set parent_value = parent_value - 0.000000000000000042
 where parent_id = 789
;

                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Update on public.parent (actual time=0.063..0.064 rows=0.00 loops=1)
   Buffers: shared hit=16
   ->  Index Scan using parent_parent_id_parent_value_key on public.parent (actual time=0.022..0.025 rows=1.00 loops=1)
         Output: (parent_value - '4.2e-17'::double precision), ctid
         Index Cond: (parent.parent_id = 789)
         Index Searches: 1
         Buffers: shared hit=5
 Planning Time: 0.080 ms
 Trigger RI_ConstraintTrigger_a_18059 for constraint fk_child_parent on parent: time=37.950 calls=1
 Trigger RI_ConstraintTrigger_c_18061 for constraint fk_child_parent on child: time=6.767 calls=1000
 Execution Time: 44.874 ms

There is an overhead: 37.9 milliseconds to cascade the update to a thousand of children (parent action), and 6.7 milliseconds to validate each _c_hild row (child check).

This is not problematic for occasional updates. Of course, if we update all parents, the row-by-row cascading constraints can be slow:

\timing on
explain (analyze, buffers, costs off, verbose on)
 update parent
 set parent_value = parent_value - 0.000000000000000042
;
vacuum parent, child;

                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Update on public.parent (actual time=5.944..5.945 rows=0.00 loops=1)
   Buffers: shared hit=10975 dirtied=1
   ->  Seq Scan on public.parent (actual time=0.010..0.156 rows=1000.00 loops=1)
         Output: (parent_value - '4.2e-17'::double precision), ctid
         Buffers: shared hit=11
 Planning:
   Buffers: shared hit=86
 Planning Time: 0.235 ms
 Trigger RI_ConstraintTrigger_a_18059 for constraint fk_child_parent on parent: time=42659.655 calls=506
 Trigger RI_ConstraintTrigger_c_18061 for constraint fk_child_parent on child: time=3398.155 calls=506000
 Execution Time: 46103.577 ms
(11 rows)

Time: 46105.778 ms (00:46.106)

Obviously this is not a solution for frequently updated columns. However, the parent side of a one-to-many relationship often consists of static references, slowly changing dimensions, or immutable event headers, making this overhead acceptable. I also tested with the constraint set as DEFERRABLE. The elapsed time is similar, but you do not see it in the EXPLAIN output because the validation occurs at commit time.

Update with Trigger (Not Recommended)

The cascade constraint has the advantage of being declarative, but it executes internally as a trigger. We can instead create a custom trigger that cascades the update without re-validating the foreign key — assuming that nobody updates the denormalized column directly, and that it is set correctly on insert.

-- function to update the parent value in the children
create or replace function sync_parent_value() returns trigger as $$
begin
  update child set parent_value = new.parent_value
  where parent_id = new.parent_id;
  return new;
end;
$$ language plpgsql;

-- trigger raised for each update
create trigger trg_sync_parent_value
after update of parent_value on parent
for each row
when (old.parent_value is distinct from new.parent_value)
execute function sync_parent_value();

-- drop the previous constraint
alter table child drop constraint fk_child_parent;
alter table parent drop constraint parent_parent_id_parent_value_key;

I test the update of all parents:

\timing on
explain (analyze, buffers, costs off, verbose on)
 update parent
 set parent_value = parent_value - 0.000000000000000042
;
vacuum parent, child;

                                   QUERY PLAN
---------------------------------------------------------------------------------
 Update on public.parent (actual time=4.547..4.547 rows=0.00 loops=1)
   Buffers: shared hit=8453 dirtied=26
   ->  Seq Scan on public.parent (actual time=0.009..0.124 rows=1000.00 loops=1)
         Output: (parent_value - '4.2e-17'::double precision), ctid
         Buffers: shared hit=11
 Planning:
   Buffers: shared hit=8 dirtied=2
 Planning Time: 0.108 ms
 Trigger trg_sync_parent_value: time=42719.154 calls=506
 Execution Time: 42724.042 ms
(10 rows)

Time: 42729.150 ms (00:42.729)

postgres=# vacuum parent, child;

VACUUM
Time: 176.585 ms

The performance advantage over the cascade constraint is minimal, and the risk is higher because parent_value in the child table could be updated independently without any check that it still matches the parent. We would also need an insert trigger to set the correct value, and an update trigger to prevent direct modifications. Ultimately, the cascade constraint solution is strongly preferable.

Denormalize into the "One" Side (Parent)

If we want to more closely mimic a MongoDB multi-key index, denormalizing into the child table is not an exact equivalent because there is no deduplication. It has its advantages — it can serve queries that need individual child rows — but a structure analogous to a multi-key index should be stored on the parent side, where each parent holds a summary of its children's values.

One approach is to add an array of child values to the parent table so that we can filter on it:

-- add array column
alter table parent add child_values float[];

-- populate from child
update parent p set child_values = (
  select array_agg(distinct child_value)
  from child where parent_id = p.parent_id
);

A trigger must maintain the array when children change:

create or replace function sync_child_values_array() returns trigger as $$
begin
  update parent set child_values = (
    select array_agg(distinct child_value)
    from child where parent_id = coalesce(new.parent_id, old.parent_id)
  ) where parent_id = coalesce(new.parent_id, old.parent_id);
  
                                    
                                    
                                    
                                    
                                

May 30, 2026

BSON and OSON: documents are designed to be nested, not flat

I like vendors benchmarks as they are often good illustrations of worst practice. Rather than focusing on real implementation trade-offs, they pick an extreme case that favors their own implementation by chance but is not optimized in the competitor's — because it is simply not how that technology is meant to be used. For example, Oracle published a "2x" benchmark results using YCSB, a key-value benchmark (slide 14 here), and a "529x" test misusing the raw BSON purpose (here). Both have something in common: they compare a relational database against a document database by using neither technology as it was designed to be used — no normalized tables, no nested documents, just flat key-value fields. The latter test is even worse: it uses 1,000 top-level fields in a single document. Don't do that!

While having thousands of columns in an SQL table is usually undesirable, it's acceptable to have hundreds or even thousands of fields in a document, as they represent multiple entities and value objects. However, nobody creates a flat structure with 1000 top-level fields. The advantage of JSON is its ability to organize entities into nested sub-documents. For instance, instead of storing first_name and last_name as separate fields, you can have a name field containing a sub-object with first and last. During queries, using dot notation to reference name.first and name.last makes no difference with first_name or last_name. It simplifies reading and displaying the document, and, as it is how it should be used, the binary JSON formats are optimized for it.

To illustrate this, I create two collections, flat with one thousand fields at the top level, and nest with ten top-level fields, each containing a sub-object with ten sub-objects.


{
echo field{0..9}{0..9}{0..9}
for i in {1..100000} ; do echo {0..9}{0..9}{0..9} ; done
} | sed 's/ /,/g' |
 mongoimport --collection "flat" --type=csv --headerline --drop

{
echo field{0..9}.sub{0..9}.sub{0..9}
for i in {1..100000} ; do echo {0..9}{0..9}{0..9} ; done
} | sed 's/ /,/g' |
 mongoimport --collection "nest" --type=csv --headerline --drop

The flat documents are like:

  { _id: ObjectId('6a1a1ed41d55219677a79c22'),
  field000: 0, field001: 1, field002: 2, field003: 3, field004: 4, field005: 5, field006: 6, field007: 7, field008: 8, field009: 9,
  field010: 10, field011: 11, field012: 12, field013: 13, field014: 14, field015: 15, field016: 16, field017: 17, field018: 18, field019: 19,
  field020: 20, field021: 21, field022: 22, field023: 23, field024: 24, field025: 25, field026: 26, field027: 27, field028: 28, field029: 29,
  field030: 30, field031: 31, field032: 32, field033: 33, field034: 34, field035: 35, field036: 36, field037: 37, field038: 38,...

The nested documents are like:

  {                                                                                                                                                                                               
    _id: ObjectId('6a1a160041c5538a7e93a9c5'),
    field0: {
      sub0: { sub0: 0, sub1: 1, sub2: 2, sub3: 3, sub4: 4, sub5: 5, sub6: 6, sub7: 7, sub8: 8, sub9: 9 },
      sub1: { sub0: 10, sub1: 11, sub2: 12, sub3: 13, sub4: 14, sub5: 15, sub6: 16, sub7: 17, sub8: 18, sub9: 19 },
      sub2: { sub0: 20, sub1: 21, sub2: 22, sub3: 23, sub4: 24, sub5: 25, sub6: 26, sub7: 27, sub8: 28, sub9: 29 },
      sub3: { sub0: 30, sub1: 31, sub2: 32, sub3: 33, sub4: 34, sub5: 35, ...

BSON format (https://bsonspec.org/)

BSON is designed for efficient sequential scanning through network or disk access. Each field is stored one after another, with a small header containing the field type and name, followed by the value. Because every value type has either a fixed size (like a 64-bit integer or a double) or an explicit length prefix (like a string or a binary blob), the parser can skip over any field it doesn't care about without reading its value at all — it simply jumps forward by the declared length.

Nested sub-documents and arrays are stored the same way: the type is object or array, followed by the total byte length of the entire nested structure, followed by the nested fields themselves (for arrays, the field names are simply "0", "1", "2", and so on). That length prefix is the key optimization: if the parser is looking for a top-level field and the current field is a sub-document that doesn't match, it can skip the entire sub-document in one jump — all its nested fields, however deep — without reading a single byte inside it.

When you query a field using dot notation, such as "name.first", the parser works level by level. It scans the top-level fields looking for name. Any top-level field that doesn't match name is skipped, including any sub-documents, in a single jump. Once name is found and confirmed to be of type object, the parser steps into that sub-document and begins scanning its fields for first, again skipping anything that doesn't match. It never needs to read the sibling sub-documents of name at all.

The worst case is a field that doesn't exist: the parser must scan every field at the relevant nesting level before concluding it isn't there. But even then, sub-documents and arrays that are clearly irrelevant — because their parent name doesn't match — are each skipped in a single operation regardless of how large or deeply nested they are.

This is precisely why organizing many fields into a nested hierarchy is not just a cosmetic preference. A flat document with 1,000 top-level fields forces the parser to read 1,000 field names in the worst case. A document with 10 top-level fields, each containing 10 sub-fields, each containing 10 values, holds the same 1,000 values but the parser only ever reads at most 10 + 10 + 10 = 30 field names to locate any one of them. The BSON length prefix turns the nested structure into an implicit index, making the format genuinely faster to navigate at scale.

With the collection flat that has one thousand top-level fields from "field000" to "field999", looking for the last field "field999" takes more than one second:


db.flat.find(
 { "field999" : {$lt:42} }
).explain("executionStats").executionStats
;

{
  executionSuccess: true,
  nReturned: 0,
  executionTimeMillis: 1081,
  totalKeysExamined: 0,
  totalDocsExamined: 100000,
  executionStages: {
    isCached: false,
    stage: 'COLLSCAN',
    filter: { field999: { '$lt': 42 } },
    nReturned: 0,
    executionTimeMillisEstimate: 1070,
    works: 100001,
    advanced: 0,
    needTime: 100000,
    needYield: 0,
    saveState: 64,
    restoreState: 64,
    isEOF: 1,
    direction: 'forward',
    docsExamined: 100000
  }
}

With the collection nest that has ten top-level fields from "field0" to "field9", and two levels of ten sub-objects from "sub0" to "sub9", looking for the last field "field9.sub9.sub9" is two times faster:


db.nest.find(
 { "field9.sub9.sub9" : {$lt:42} }
).explain("executionStats").executionStats
;

{
  executionSuccess: true,
  nReturned: 0,
  executionTimeMillis: 424,
  totalKeysExamined: 0,
  totalDocsExamined: 100000,
  executionStages: {
    isCached: false,
    stage: 'COLLSCAN',
    filter: { 'field9.sub9.sub9': { '$lt': 42 } },
    nReturned: 0,
    executionTimeMillisEstimate: 420,
    works: 100001,
    advanced: 0,
    needTime: 100000,
    needYield: 0,
    saveState: 21,
    restoreState: 21,
    isEOF: 1,
    direction: 'forward',
    docsExamined: 100000
  }
}

A major part of the time is fetching the document from storage. Navigating into it is a smaller part, but the difference is still clearly visible: 1,081 ms versus 424 ms for 100,000 documents, with the same number of values, and the same worst-case field to find: the last one. The only difference is how those fields are arranged. Nesting them three levels deep, with ten branches at each level, makes the collection scan faster — with no index, no schema change, and no query rewrite beyond the dot notation that you would use anyway.

This is not a micro-optimization or an edge case. Any collection scan, whether triggered by a missing index, a low-selectivity filter, or a background analytics query, pays this cost on every document it reads. The deeper and wider your documents are, the more the BSON length-prefix trick pays off, because the parser can leap over entire branches of the document tree in a single bounds check.

What about Oracle's OSON format?

Oracle Database stores JSON documents in its proprietary binary format called OSON. At first glance, OSON appears to take a fundamentally different approach: rather than storing field names inline with each value as BSON does, OSON builds a field name dictionary at the beginning of the document. Think of it like a mini-datastore with it's catalog and indexes, rather than a protocol buffer. Each field in the document body then refers to its name by a short numeric ID rather than repeating the full string. This makes individual field names cheaper to compare and reduces document size when the same field name appears many times.

Given that design, you might expect nesting to make no difference in Oracle: if every field is just a numeric ID anyway, scanning 1,000 flat fields should cost the same as scanning 10 + 10 + 10 fields spread across three levels. The dictionary lookup cost is the same either way.

In practice, however, the same experiment on Oracle shows a very similar result to MongoDB.

Querying field999 on the flat table takes about 1.6 seconds:


set autotrace traceonly

select /*+ gather_plan_statistics */ "DATA" from "ORA"."flat"
 where JSON_EXISTS("DATA",'$?(@.field999.numberOnly() < $B0)'
 passing 42 as "B0" type(strict))
;

PLAN_TABLE_OUTPUT
________________________________________________________________________________________________
SQL_ID  1h98k751w7dws, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ "DATA" from "ORA"."flat"  where
JSON_EXISTS("DATA",'$?(@.field999.numberOnly() < $B0)'  passing 42 as
"B0" type(strict))

Plan hash value: 4073748891
---------------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |      0 |00:00:01.60 |     301K|    739 |
|*  1 |  TABLE ACCESS FULL| flat |      1 |     12 |      0 |00:00:01.60 |     301K|    739 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(JSON_EXISTS2( /*+ QJSNMD_USE_SHR_XCTX */ "DATA" /*+ LOB_BY_VALUE */
              FORMAT OSON , '$?(@.field999.numberOnly() < $B0)' /* json_path_str
              $?(@.field999.numberOnly() < $B0)  */  PASSING 42 AS "B0" FALSE ON ERROR
              TYPE(STRICT) )=1)

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)


Statistics
-----------------------------------------------------------
             117  CPU used by this session
             117  CPU used when call started
             160  DB time
         1174164  RM usage by this session
               3  Requests to/from client
             738  Session total flash IO requests
         6053888  cell physical IO interconnect bytes
          301074  consistent gets
          301074  consistent gets from cache
          301074  consistent gets pin
            1068  consistent gets pin (fastpath)
             739  gcs data block access records
               1  messages sent
             747  non-idle wait count
              48  non-idle wait time
               2  opened cursors cumulative
               1  opened cursors current
             738  physical read total IO requests
         6053888  physical read total bytes
               2  process last non-idle time
          301074  session logical reads
              48  user I/O wait time
               4  user calls

Querying field9.sub9.sub9 on the nested table takes about 1.0 second:


select /*+ gather_plan_statistics */ "DATA" from "ORA"."nest"
 where JSON_EXISTS("DATA",'$?(@.field9.sub9.sub9.numberOnly() < $B0)'
 passing 42 as "B0" type(strict))
;

PLAN_TABLE_OUTPUT
__________________________________________________________________________________________________
SQL_ID  7yuwrup6rscrc, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ "DATA" from "ORA"."nest"   where
JSON_EXISTS("DATA",'$?(@.field9.sub9.sub9.numberOnly() < $B0)'
passing 42 as "B0" type(strict))

Plan hash value: 3225864993
---------------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |      0 |00:00:01.04 |     100K|   1143 |
|*  1 |  TABLE ACCESS FULL| nest |      1 |   1000 |      0 |00:00:01.04 |     100K|   1143 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(JSON_EXISTS2( /*+ QJSNMD_USE_SHR_XCTX */ "DATA" /*+ LOB_BY_VALUE */
              FORMAT OSON , '$?(@.field9.sub9.sub9.numberOnly() < $B0)' /* json_path_str
              $?(@.field9.sub9.sub9.numberOnly() < $B0)  */  PASSING 42 AS "B0" FALSE ON ERROR
              TYPE(STRICT) )=1)


Statistics
-----------------------------------------------------------
              40  CPU used by this session
              40  CPU used when call started
             106  DB time
          404934  RM usage by this session
               4  Requests to/from client
            1125  Session total flash IO requests
         9363456  cell physical IO interconnect bytes
          100221  consistent gets
          100221  consistent gets from cache
          100221  consistent gets pin
           99096  consistent gets pin (fastpath)
            1142  gcs data block access records
             158  global enqueue gets sync
             158  global enqueue releases
            1134  non-idle wait count
              71  non-idle wait time
               2  opened cursors cumulative
               1  opened cursors current
            1125  physical read total IO requests
         9363456  physical read total bytes
              61  process last non-idle time
          100221  session logical reads
              71  user I/O wait time
               4  user calls

The most telling number is not the execution time but the Buffers column: 301,074 consistent gets for the flat table versus 100,221 for the nested table. Oracle is reading three times as many data blocks for the flat documents. To understand why, I check the actual segment sizes:


with
table_seg (owner,table_name, table_blocks) as ( select owner, segment_name, sum(blocks) from dba_segments group by owner,segment_name),
lob_seg   (owner,table_name,   lob_blocks) as ( select owner, table_name,   sum(blocks) from dba_lobs join dba_segments using (owner, segment_name) group by owner,table_name),
tab_stats (owner,table_name,  avg_row_len) as ( select owner, table_name,   avg_row_len from dba_tables)
select owner, table_name, table_blocks, nvl(lob_blocks, 0) as lob_blocks, table_blocks + nvl(lob_blocks, 0) as total_blocks, avg_row_len
 from tab_stats natural join table_seg natural left join lob_seg
 where owner = 'ORA' and table_name in ('flat', 'nest')
;

OWNER    TABLE_NAME       TABLE_BLOCKS    LOB_BLOCKS    TOTAL_BLOCKS    AVG_ROW_LEN
________ _____________ _______________ _____________ _______________ ______________
ORA      flat                    1,152       317,608         318,760            166
ORA      nest                  100,992            32         101,024          7,724

The segment query reveals the full story. The nested documents have an average row length of 7,724 bytes and are stored entirely inline in the table segment, with almost no LOB blocks. The flat documents have an average row length of only 166 bytes in the table segment, but they spill into 317,608 LOB blocks. That tiny average row length for flat is not a sign of small documents — it is the sign of a pointer. The actual OSON bytes have been pushed out of the row into a separate LOB segment managed by Oracle's securefile infrastructure, and every document access requires an additional pointer dereference to fetch them from there.

The root cause is the OSON dictionary. OSON stores each distinct field name only once in a per-document dictionary and replaces every occurrence in the document body with a short numeric ID. The dictionary itself is the mechanism that compresses the field names. A flat document with 1,000 entirely unique field names like field000 through field999 requires a dictionary with 1,000 entries, one per distinct string, with no repetition to exploit. The resulting OSON document is large enough to exceed Oracle's block size and gets stored out of row as a LOB. A nested document whose entire structure uses only 20 distinct names — field0 through field9 and sub0 through sub9 — has a dictionary with just 20 short entries. Furthermore, objects that share the same field structure can reuse each other's field ID array entirely rather than repeating it. The resulting document is compact enough to be stored inline, directly in the table block, with no LOB indirection at all.

The consequence is dramatic. Scanning the flat collection means fetching a pointer from the table block, then chasing it to a LOB block, for every single one of the 100,000 documents. Scanning the nested collection means reading the document directly from the table block. That is the difference between 301,074 consistent gets and 100,221.

So the two formats reach the same conclusion by different paths. BSON benefits from nesting because the length-prefix on sub-documents lets the parser skip entire branches with a single jump, reducing the number of field names it reads. OSON benefits from nesting because the dictionary compresses repeated field names, shrinking the documents enough to keep them inline in the table and avoiding the cost of LOB storage entirely. The mechanism differs, but the advice is the same. Documents are nested structures to aggregate multiple entities and object values.

Conclusion

The practical takeaway is straightforward: model your documents the way you would naturally think about the data — as a hierarchy of related objects — and the binary JSON format your database uses will reward you for it, whether that is BSON in MongoDB or OSON in Oracle. Flat documents with hundreds of top-level fields are not just harder to read. They are measurably slower to query and heavier on storage when compression is applied. Nesting is not only good document design.