a curated list of database news from authoritative sources

August 15, 2025

Why doesn't Oracle Multi-Value Index optimize .sort() like MongoDB does with its multi-key index? RecordId deduplication.

Until recently, Oracle Database offered very limited options for creating inverted indexes, primarily restricted to special indexing for text and spatial data. Unlike PostgreSQL with its Generalized Inverted Index (GIN), Oracle lacked a general solution. However, in response to the growing demand for document databases and to provide a MongoDB emulation, Oracle introduced the Multi-Value Index (MVI) in version 23ai. Let's create one and examine its internal structure and how it differs from MongoDB indexes.

Lab setup

I start a Docker container with Oracle 23ai running and connect with SQL*Plus:

docker run --name ora -d -e ORACLE_PWD=OracleDatabase_23ai container-registry.oracle.com/database/free:latest

docker exec -it ora bash -c 'until grep -B100 "Completed: ALTER DATABASE OPEN" /opt/oracle/diag/rdbms/free/FREE/trace/alert_FREE.log ; do sleep 1 ; done'

docker exec -it ora sqlplus / as sysdba

I create a table to store some JSON documents:

DROP TABLE if exists franck;
CREATE TABLE franck (data JSON);

INSERT INTO franck VALUES
  ('{ field1: "x", field2: [ "jkl" , "mno" , "jkl" ] }') -- 78 , [ 6a 6b 6c , 6d 6e 6f , 6a 6b 6c ]
 ,('{ field1: "y", field2: [ "mno" , "jkl"         ] }') -- 79 , [ 6d 6e 6f , 6a 6b 6c ]
 ,('{ field1: "z", field2:   "jkl"                   }') -- 7a ,   6a 6b 6c
;

The structure is flexible: for the second field, I inserted a scalar (in the third document) and arrays, some with duplicate values and in different orders.

Multi-Value Index

I create a multi-value index similar to what the MongoDB API emulation would create. It uses JSON_TABLE to project JSON fields into columns to be indexed, and JSON_MKMVI to build an index entry for the multi-value index:

CREATE MULTIVALUE INDEX FRANCK_MVI ON FRANCK (
 JSON_MKMVI(
  JSON_TABLE(
    "DATA", '$' COLUMNS (
     "SCALAR" PATH '$."field1"' ERROR ON ERROR NULL ON EMPTY NULL ON MISMATCH,
     NESTED PATH '$."field2"[*]' COLUMNS (
       "K0" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR NULL ON EMPTY NULL ON MISMATCH
     )
    )
  )
 )
);

ORA_RAWCOMPARE is a clue that the ordering of nested values is binary. It would be nice to run those functions and see how they make index entries from JSON, but we cannot use them directly:

  select JSON_TABLE(
         *
ERROR at line 2:
ORA-61712: Can't use json_table syntax to create non-MVI(MultiValue Index): .
Help: https://docs.oracle.com/error-help/db/ora-61712/

The index entries will have a compound key with values extracted from the DATA column:

  • field1 as a scalar value (not used by the MongoDB emulation as all fields can be an array)
  • field2[*] as the values within an array, and NESTED PATH unnests the values (like $unwind)

The indexes created by the MongoDB API emulation are all NESTED PATH to be compatible with MongoDB's flexible schema, where any field in a document can be a scalar or an array, but I created one scalar to see the behavior.
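The construction of these index entries can be pictured with a short Python sketch (an illustration only, not Oracle's actual code; the "rowid-N" labels are made up). It unnests field2 like NESTED PATH (or $unwind) does, pairs each value with field1 and the document's rowid, and deduplicates:

```python
docs = [
    ("rowid-0", {"field1": "x", "field2": ["jkl", "mno", "jkl"]}),
    ("rowid-1", {"field1": "y", "field2": ["mno", "jkl"]}),
    ("rowid-2", {"field1": "z", "field2": "jkl"}),
]

entries = set()                      # a set deduplicates repeated array values
for rowid, doc in docs:
    values = doc["field2"]
    if not isinstance(values, list): # a scalar is indexed as a single value
        values = [values]
    for v in values:
        entries.add((doc["field1"], v, rowid))

for e in sorted(entries):
    print(e)
```

The seven raw (field1, field2 value, rowid) combinations collapse to five entries because "jkl" appears twice in the first document, matching the block dump below.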

Index internals

The index is created, making it easy to dump what's inside. I get the block identifier from DBA_SEGMENTS and dump the block:

SQL> -- get the block offset in datafile 1 ( I run this as sysdba, so it goes to the SYSTEM tablespace )
SQL> column block new_value block
SQL> select header_block+1 as block from dba_segments where segment_name='FRANCK_MVI' and owner=user;

     BLOCK
----------
    114553

SQL> -- set tracefile identifier and dump the block
SQL> alter session set tracefile_identifier = 'franck&block';
old   1: alter session set tracefile_identifier = 'franck&block'
new   1: alter session set tracefile_identifier = 'franck    114553'

Session altered.

SQL> -- get tracefile name and print
SQL> column tracefile new_value tracefile
SQL> select tracefile from v$process where addr=(select paddr from v$session where sid=sys_context('userenv','sid'));

TRACEFILE
--------------------------------------------------------------------------------
/opt/oracle/diag/rdbms/free/FREE/trace/FREE_ora_3518_franck    114553.trc

SQL> host cat "&tracefile."
Trace file /opt/oracle/diag/rdbms/free/FREE/trace/FREE_ora_3518_franck    114553.trc
Oracle Database 23ai Free Release 23.0.0.0.0 - Develop, Learn, and Run for Free
Version 23.8.0.25.04
Build label:    RDBMS_23.8.0.25.04DBRU_LINUX.X64_250423
ORACLE_HOME:    /opt/oracle/product/23ai/dbhomeFree
System name:    Linux
Node name:      6fc22939038c
Release:        4.14.35-2047.540.4.1.el7uek.x86_64
Version:        #2 SMP Thu Sep 5 12:06:13 PDT 2024
Machine:        x86_64
VM name:        KVM CPUID feature flags: 0x01003afb
CLID:   P
Instance name: FREE
Instance number: 1
Database name: FREE
Database unique name: FREE
Database id: 1471336774
Database role: PRIMARY
Redo thread mounted by this instance: 1
Oracle binary inode:    1036542
Oracle process number: 63
Unix process pid:       3518
             NID:       4026532223
             created:   2025-08-15T09:12:36.-2739213311+00:00
             image:     oracle@6fc22939038c (TNS V1-V3)


*** 2025-08-15T09:13:47.601001+00:00 (CDB$ROOT(1))
*** SESSION ID:(213.17644) 2025-08-15T09:13:47.601014+00:00
*** CLIENT ID:() 2025-08-15T09:13:47.601024+00:00
*** SERVICE NAME:(SYS$USERS) 2025-08-15T09:13:47.601033+00:00
*** MODULE NAME:(sqlplus@6fc22939038c (TNS V1-V3)) 2025-08-15T09:13:47.601043+00:00
*** ACTION NAME:() 2025-08-15T09:13:47.601052+00:00
*** CLIENT DRIVER:(SQL*PLUS) 2025-08-15T09:13:47.601060+00:00
*** CONTAINER ID:(1) 2025-08-15T09:13:47.601069+00:00
*** CLIENT IP:(N/A) 2025-08-15T09:13:47.601078+00:00
*** CONNECTION ID:(PGTKhibdDbfgYwIAEaztFA==) 2025-08-15T09:13:47.601088+00:00

Start dump data blocks tsn: 0 file#:1 minblk 114553 maxblk 114553
Block dump from cache:

Dump of buffer cache at level 3 for pdb=1 tsn=0 rdba=114553

BH (0xbdf5b740) file#: 1 rdba: 0x0001bf79 (1024/114553) class: 1 ba: 0xbd134000
  set: 6 pool: 3 bsz: 8192 bsi: 0 sflg: 2 pwc: 0,0
  dbwrid: 0 obj: 72614 objn: 72614 tsn: [1/0] afn: 1 hint: 0xf
  hash: [0xc3f52b50,0x77de4520] lru: [0xbdf5d2e0,0xbdf5b990]
  ckptq: [NULL] fileq: [NULL]
  objq: [0x76c0c790,0x76c0c790] objaq: [0x76c0c780,0x76c0c780] qhead: 0x76c0c770
  st: XCURRENT md: NULL fpin: 'ktswh23: ktsfbkl' fscn: 0x2f14be fcur_vldr: 0x3 tch: 1
  flags: block_written_once
  LRBA: [0x0.0.0] LSCN: [0x0] HSCN: [0x2f14c0] HSUB: [1]
  Printing buffer operation history (latest change first): (cnt 9)
  01. sid:11 L192:kcbbic2:bic:FBD     02. sid:11 L191:kcbbic2:bic:FBW
  03. sid:11 L602:bic1_int:bis:FWC    04. sid:11 L822:bic1_int:ent:rtn
  05. sid:11 L832:oswmqbg1:clr:WRT    06. sid:11 L930:kubc:sw:mq
  07. sid:11 L913:bxsv:sw:objq        08. sid:11 L608:bxsv:bis:FBW
  09. sid:11 L607:bxsv:bis:FFW        10. sid:05 L464:chg1_mn:bic:FMS
  11. sid:05 L778:chg1_mn:bis:FMS     12. sid:05 L552:chg_main:bic:CLN
  13. sid:05 L353:gcur:set:MEXCL      14. sid:05 L353:gcur:set:MEXCL

  buffer tsn: 0 rdba: 0x0001bf79 (1024/114553)
  scn: 0x2f14c0 seq: 0x01 flg: 0x04 tail: 0x14c00601
  frmt: 0x02 chkval: 0x2a1a type: 0x06=trans data
Hex dump of block: st=0, typ_found=1
Dumping 'block' addr=0xbd134000 size=8192 bytes
Dump of memory from 0xbd134000 to 0xbd136000
0BD134000 0000A206 0001BF79 002F14C0 04010000  [....y...../.....]
0BD134010 00002A1A 00400102 00011BA6 002F14BE  [.*....@......./.]
0BD134020 00008000 00020002 00000000 00000000  [................]
0BD134030 00000000 00000000 00000000 00000000  [................]
0BD134040 00000000 0000FFFF 00000000 00000000  [................]
0BD134050 00000000 80008000 002F14BE 03800000  [........../.....]
0BD134060 00000000 002E0005 1ED31F01 00000000  [................]
0BD134070 00000000 00000000 00000002 00001F60  [............`...]
0BD134080 1F3A1F4D 1F141F27 00001F01 00000000  [M.:.'...........]
0BD134090 00000000 00000000 00000000 00000000  [................]
        Repeat 492 times
0BD135F60 7A040201 6B6A0404 0000066C 02004108  [...z..jkl....A..]
0BD135F70 02000000 04790402 6F6E6D04 08000006  [......y..mno....]
0BD135F80 00010041 02020000 04047904 066C6B6A  [A........y..jkl.]
0BD135F90 41080000 00000100 04020200 6D040478  [...A........x..m]
0BD135FA0 00066F6E 00410800 00000000 78040202  [no....A........x]
0BD135FB0 6B6A0404 0000066C 00004108 00000000  [..jkl....A......]
0BD135FC0 00000000 00000000 00000000 00000000  [................]
        Repeat 2 times
0BD135FF0 00000000 00000000 00000000 14C00601  [................]

Block scn: 0x2f14c0
Block header dump:  0x0001bf79
 Object id on Block? Y
 seg/obj: 0x11ba6  csc:  0x00000000002f14be  itc: 2  flg: --  typ: 2 - INDEX
     fsl: 0  fnx: 0x0 ver: 0x01

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x02   0xffff.000.00000000  0x00000000.0000.00  C---    0  scn  0x00000000002f14be
Leaf block dump
===============
header address 3172155484=0xbd13405c
kdxcolev 0
KDXCOLEV Flags = - - -
kdxcolok 0
kdxcoopc 0x80: opcode=0: iot flags=--- is converted=Y
kdxconco 3
kdxcosdc 0
kdxconro 5
kdxcofbo 46=0x2e
kdxcofeo 7937=0x1f01
kdxcoavs 7891
kdxlespl 0
kdxlende 0
kdxlenxt 0=0x0
kdxleprv 0=0x0
kdxledsz 2
kdxlebksz 8032
row#0[8013] flag: -------, lock: 0, len=19, data:(2):  00 02
col 0; len 2; (2):  04 78
col 1; len 4; (4):  04 6a 6b 6c
col 2; len 6; (6):  00 00 08 41 00 00
row#1[7994] flag: -------, lock: 0, len=19, data:(2):  00 02
col 0; len 2; (2):  04 78
col 1; len 4; (4):  04 6d 6e 6f
col 2; len 6; (6):  00 00 08 41 00 00
row#2[7975] flag: -------, lock: 0, len=19, data:(2):  00 02
col 0; len 2; (2):  04 79
col 1; len 4; (4):  04 6a 6b 6c
col 2; len 6; (6):  00 00 08 41 00 01
row#3[7956] flag: -------, lock: 0, len=19, data:(2):  00 02
col 0; len 2; (2):  04 79
col 1; len 4; (4):  04 6d 6e 6f
col 2; len 6; (6):  00 00 08 41 00 01
row#4[7937] flag: -------, lock: 0, len=19, data:(2):  00 01
col 0; len 2; (2):  04 7a
col 1; len 4; (4):  04 6a 6b 6c
col 2; len 6; (6):  00 00 08 41 00 02
----- end of leaf block Logical dump -----
----- end of leaf block dump -----

There is one index entry per unique value in the array, so five entries in total:

entry  key 0 (hex)  field1  key 1 (hex)  field2  rowid (hex)        row
    0  04 78        'x'     04 6a 6b 6c  'jkl'   00 00 08 41 00 00  1
    1  04 78        'x'     04 6d 6e 6f  'mno'   00 00 08 41 00 00  1
    2  04 79        'y'     04 6a 6b 6c  'jkl'   00 00 08 41 00 01  2
    3  04 79        'y'     04 6d 6e 6f  'mno'   00 00 08 41 00 01  2
    4  04 7a        'z'     04 6a 6b 6c  'jkl'   00 00 08 41 00 02  3

The index appears to be suitable for ordered entries in MongoDB, which sorts the documents in ascending order by the minimum value of an array. However, Oracle multi-value indexes do not preserve the sort as they add a deduplication step. We will see that in execution plans.

Execution plan in MongoDB

Here is an example, with result and execution plan, in MongoDB:

db.franck.drop();

db.franck.createIndex({ field1: 1, field2: 2});

db.franck.insertMany([
     { field1: "x", field2: [ "jkl" , "mno" , "jkl" ] },
     { field1: "y", field2: [ "mno" , "jkl"         ] },
     { field1: "z", field2:   "jkl"                   },
    ]);

db.franck.find({ field1:"x" }).sort({ field1: 1, field2: 1 });

[
  {
    _id: ObjectId('689f03ed160c7ec59cd4b0ce'),
    field1: 'x',
    field2: [ 'jkl', 'mno', 'jkl' ]
  }
]

db.franck.find(
 { field1:"x" }
).sort(
 { field1: 1, field2: 1 }
).explain("executionStats").executionStats;

{
  executionSuccess: true,
  nReturned: 1,
  executionTimeMillis: 0,
  totalKeysExamined: 2,
  totalDocsExamined: 1,
  executionStages: {
    isCached: false,
    stage: 'FETCH',
    nReturned: 1,
    executionTimeMillisEstimate: 0,
    works: 3,
    advanced: 1,
    needTime: 1,
    needYield: 0,
    saveState: 0,
    restoreState: 0,
    isEOF: 1,
    docsExamined: 1,
    alreadyHasObj: 0,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: 1,
      executionTimeMillisEstimate: 0,
      works: 3,
      advanced: 1,
      needTime: 1,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 1,
      keyPattern: { field1: 1, field2: 2 },
      indexName: 'field1_1_field2_2',
      isMultiKey: true,
      multiKeyPaths: { field1: [], field2: [ 'field2' ] },
      isUnique: false,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: { field1: [ '["x", "x"]' ], field2: [ '[MinKey, MaxKey]' ] },
      keysExamined: 2,
      seeks: 1,
      dupsTested: 2,
      dupsDropped: 1
    }
  }
}

The execution plan shows an IXSCAN and a FETCH, but no SORT operation because the index entries are scanned in the expected order.

Important statistics to note are: dupsTested: 2 and dupsDropped: 1. A multi-key index contains multiple keys for one document, but only one key is needed to fetch the document, resulting in the dropping of duplicates. MongoDB IXSCAN can eliminate duplicate keys from the same document while maintaining key order, keeping the key that matters (the minimum for ascending key order) and eliminating the need for an additional sort.
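The dedup-while-scanning idea can be modeled in a few lines of Python (a toy sketch, not MongoDB's implementation; entry values mirror the example above). Because the scan visits entries in key order, the first entry seen for a RecordId carries its minimum key, so dropping later duplicates preserves the sort:

```python
index_entries = [            # already in (field1, field2) key order
    (("x", "jkl"), "rid1"),
    (("x", "mno"), "rid1"),  # same RecordId -> duplicate, dropped
]

seen, out = set(), []
dups_tested, dups_dropped = 0, 0
for key, rid in index_entries:
    dups_tested += 1
    if rid in seen:          # already returned this document
        dups_dropped += 1
        continue
    seen.add(rid)
    out.append((key, rid))   # first entry per document = its minimum key

print(out, dups_tested, dups_dropped)
```

This reproduces the plan's statistics for the example query: two entries tested, one dropped, one document returned, and the output is still in sorted order.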

Execution plan in MongoDB emulation

I run the same on the MongoDB API emulation of Oracle Autonomous Database 23ai:

ora> db.franck.drop();

true

ora> db.franck.createIndex({ field1: 1, field2: 2});

field1_1_field2_2

ora> db.franck.insertMany([
         { field1: "x", field2: [ "jkl" , "mno" , "jkl" ] },
         { field1: "y", field2: [ "mno" , "jkl"         ] },
         { field1: "z", field2:   "jkl"                   },
        ]);

{
  acknowledged: true,
  insertedIds: {
    '0': ObjectId('689f041f23cd9d33b9d4b0c5'),
    '1': ObjectId('689f041f23cd9d33b9d4b0c6'),
    '2': ObjectId('689f041f23cd9d33b9d4b0c7')
  }
}

ora> db.franck.find(
 { field1:"x" }
).sort(
 { field1: 1, field2: 1 }
).explain("executionStats")

{
  queryPlanner: {
    plannerVersion: 1,
    namespace: 'ora.franck',
    indexFilterSet: false,
    parsedQuery: {
      '$query': { field1: { '$stringOnly': 'x' } },
      '$orderby': {
        '$fields': [
          { path: 'field1', order: 'asc', sortByMinMax: true },
          { path: 'field2', order: 'asc', sortByMinMax: true }
        ],
        '$lax': true
      }
    },
    rewrittenQuery: {
'

August 14, 2025

Top 5 Security Risks of Running MySQL 8.0 After Its EOL

Your MySQL database has been running smoothly for years. Your team knows it inside and out. Everything just… works. Why rock the boat with an upgrade? Here’s why: MySQL 8.0 reaches its end-of-life date in April 2026. After this date, there’s no safety net; staying on end-of-life software means you’re taking on all the responsibility […]

August 13, 2025

Vive la Difference: Practical Diff Testing of Stateful Applications

This Google paper (to appear in VLDB'25) is about not blowing up your production system. That is harder than it sounds, especially with stateful applications with memories. When rolling out new versions of stateful applications, the "shared, persistent, mutable" data means bugs can easily propagate across versions. Modern rollout tricks (canaries, blue/green deployments) don't save you from this. Subtle cross-version issues often slip through pre-production testing and surface in production, sometimes days or weeks later.

These bugs can be severe, and the paper categorizes them as data corruption, data incompatibility, and false data assumptions. The paper mentions real-world incidents from Google and open-source projects to emphasize these bugs' long detection and resolution times, and the production outages and revenue loss they cause.

So, we need tooling that directly tests v1/v2 interactions on realistic data before a rollout. The paper delivers a prototype of a stateful diff testing framework, which extends the familiar idea of diff testing (comparing outputs between two implementations) to handle persistent state. So, you take v1 (the old code) and v2 (the shiny new code). You feed them the same requests on the same kind of data you'll see in production. Then you compare two things: the answers they give, and the state they leave behind. This framework aims to solve three problems:

  • Comparing versions on production-scale data without risking live systems.
  • Testing subtle cross-version interactions, including request interleavings that occur during rollouts.
  • Capturing and efficiently reasoning about changes to persistent state.


The framework (code available on GitHub) takes three inputs: an initial database state (preferably a production snapshot restored in a test environment), a set of representative requests (sampled, synthetic, or hand-crafted), and two binaries: v1 (current production) and v2 (candidate release).

The framework operates across three stages: branching, replaying, diffing.


Branching

They bolt a branching layer on PostgreSQL to create instant, isolated, writable branches of the database without copying the entire dataset. Two special tables (R⁺, R⁻) track changes and a derived view (R′) is used as the new database. Postgres triggers on views intercept every insert, delete, and update, keeping the base table untouched. Common integrity constraints (primary key, uniqueness, non-null, foreign key) are enforced per branch, but certain features (CHECK constraints, pre-existing triggers) are unsupported, limiting coverage for some schemas. The approach avoids changes to Postgres internals or application queries, making it minimally invasive.
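A minimal Python model of the delta-table idea (the R⁺/R⁻/R′ names follow the paper; the rows and keys are illustrative). The branch view R′ is derived as the base table minus deleted or overwritten keys, plus the branch's own rows, so branching costs nothing up front and the base table is never touched:

```python
base    = {1: "alice", 2: "bob", 3: "carol"}   # R  (shared base table, read-only)
r_plus  = {2: "bob-v2", 4: "dave"}             # R+ (rows inserted or rewritten on the branch)
r_minus = {3}                                  # R- (keys deleted on the branch)

def branch_view(base, r_plus, r_minus):
    # R' = (R minus deleted/overwritten keys) plus the branch's own rows
    view = {k: v for k, v in base.items() if k not in r_minus and k not in r_plus}
    view.update(r_plus)
    return view

view = branch_view(base, r_plus, r_minus)
print(view)
```

In the real system the same derivation is a SQL view over the base table and the two delta tables, with triggers routing writes into R⁺ and R⁻.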

I liked this approach a lot. It uses monotonicity principles to branch the database instantly and enables branching in time/work proportional to the size of the diffs rather than the size of the whole table. This is a bolt-on approach, so they don't have the full benefits/features of versioned databases, but they can also boast that this bolt-on approach is applicable to other databases as well (MySQL, Oracle, MongoDB). 


Replaying

Requests are executed serially, in three patterns: entirely on v1, entirely on v2, and in interleaved sequences between the two versions. Interleaving is crucial for catching compatibility issues where one version reads or writes data in a way the other cannot handle. The framework generates multiple random interleavings for each request sequence, trading exhaustive coverage for practical runtime. Concurrency bugs are explicitly out of scope in this framework, as it is hard to deal with the heavy/exponential nondeterminism they induce.
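Generating interleavings can be sketched as follows (an illustrative Python model; request names are made up). Each request keeps its position in the sequence, and only the version executing it varies, which mimics what clients see mid-rollout:

```python
import random

requests = ["r1", "r2", "r3", "r4"]

def random_interleaving(requests, rng):
    # assign each request to v1 or v2; order of requests is preserved
    return [(req, rng.choice(["v1", "v2"])) for req in requests]

rng = random.Random(42)          # seeded so the sketch is reproducible
schedules = [random_interleaving(requests, rng) for _ in range(3)]
for s in schedules:
    print(s)
```

The all-v1 and all-v2 patterns are just the two degenerate assignments; sampling a handful of random ones trades exhaustive coverage (2^n assignments) for practical runtime, as the paper describes.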


Diffing

The database diffing algorithm is designed for efficiency: rather than scanning full tables, it inspects only the delta tables (R⁺, R⁻), making runtime proportional to the number of changed rows. It produces three-way diffs: original state, v1 result, v2 result. This gives crucial context absent from two-way comparisons, showing not only what changed but how each version diverged from the common starting point.
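The three-way diff can be sketched in Python (illustrative only; deltas are modeled as key-to-new-value maps, with None for a delete). Only keys touched by either version are examined, so the cost tracks the size of the change set:

```python
original = {1: "a", 2: "b", 3: "c"}
v1_delta = {2: "b1"}            # v1 rewrote row 2
v2_delta = {2: "b2", 3: None}   # v2 rewrote row 2 and deleted row 3 (None = deleted)

def three_way_diff(original, v1_delta, v2_delta):
    # report (original, v1, v2) for every key changed by either version
    diffs = {}
    for k in set(v1_delta) | set(v2_delta):
        diffs[k] = (original.get(k),
                    v1_delta.get(k, original.get(k)),
                    v2_delta.get(k, original.get(k)))
    return diffs

diffs = three_way_diff(original, v1_delta, v2_delta)
print(diffs)
```

Row 2 shows both versions diverging from the same starting value, and row 3 shows a v2-only deletion: exactly the context a two-way comparison would lose.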


Evaluation

The authors implemented the framework in Go and tested it using synthetic workloads and the Bank of Anthos, a friendly little demo application. In these tests, they introduced the three bug categories described earlier and showed that the framework can catch them and produce clear diffs that explained the discrepancies.

They had a lot of performance benchmarks to show that branch creation is ~100 ms and independent of database size, read/write overheads are modest, and diff computation time scales with the size of the diff rather than the full dataset. They show that compared to naive Postgres SQL queries, diffing is up to 100x faster. They also show that compared to Dolt (a versioned database), branching is slower but read/write performance is better for large tables.


Discussion

As in the past two weeks, Aleksey and I live-recorded our reading and discussion of this paper to show the thought process and mechanics of how experts read papers in real time. You can watch the discussion video here (1.5x speed is best). The paper I annotated during our live-read is also available here.

Below are some of the main points from our discussion.

We initially expected the paper to be more of an experience report on bug-catching at Google scale. Instead, the evaluation presented a toy example and focused on microbenchmarks measuring latency of framework components. In that sense, the evaluation did not connect to the production-safety problem tightly. Something to look forward to in a follow-up paper.

The approach applies to a single-node database, but Google operates large-scale distributed databases. How would this method translate to Spanner? Although the current framework does not address concurrency, if one could capture the exact sequence of production inputs and replay them in a test setting, could that provide meaningful results?

The paper says: "The exact procedure by which inputs (the sequence of client requests) are generated is out of scope for this paper. Typically, they are generated randomly or drawn from a pool of hand-crafted inputs." But this sidesteps the important question: how do you generate the right inputs to catch bugs? And how do you measure coverage, improve it, and gain confidence before rolling out to production?

We also wanted a more explicit, qualitative comparison between their delta-table branching method and alternative techniques such as copy-on-write branching.

Another concern was that manual diff inspection could become a bottleneck. Could LLMs help triage diffs and surface the most suspicious ones?

Overall, the paper is in a strong place. It bridges the database and testing fields, and even touches cloud operations, making it well-positioned for long-term impact. It would be good to check on this prediction in 5-10 years.

Finally, a shout out to a great guy, Michael Whittaker. We had collaborated on the "Compartmentalized Paxos" paper, and we were impressed with his clear on-the-fly iPad illustrations of protocols/ideas in our Zoom meetings. He is likely responsible for the clean figures in this paper. Fun fact: he skipped formal slides for his PhD thesis defense, instead drawing the protocols live on his iPad (a skill that comes from hundreds of hours of prior practice in meetings).

Top 5 Real-Time Databases Compared

Compare top real-time databases: Convex for TypeScript teams, Firebase for MVPs, Supabase for SQL, Realm for mobile, InfluxDB for analytics.

Webinar Q&A: No More Workarounds: Open Source PostgreSQL TDE Is Here

I would like to thank everyone who attended my webinar, “No More Workarounds: Open Source PostgreSQL TDE Is Here,” on July 23, 2025. If you missed it, you can view the recording and slides. Here are the questions I had no time to answer during the live session. If you have more questions, post them in […]

Towards Optimal Transaction Scheduling

This paper (VLDB'2024) looks at boosting transaction throughput through better scheduling. The idea is to explore the schedule-space more systematically and pick execution orders that reduce conflicts.

The paper's two main contributions are a scheduling policy called Shortest Makespan First (SMF) and an MVTSO concurrency-control variant called MVSchedO.

SMF uses a greedy heuristic to pick low-conflict schedules that minimize the increase in total execution time (makespan) at each step. MVSchedO enforces the chosen schedule at a fine-grained level by adapting multi-version timestamp ordering (MVTSO). The authors implement SMF and MVSchedO in RocksDB (R-SMF) and show up to 3.9x higher throughput and 3.2x lower tail latency on benchmarks as well as Meta's TAO workload.

I mean, this is a stellar systems work, and it makes a convincing case that search-based scheduling is a promising direction for extracting higher throughput from database systems.

Motivation

Reducing contention is a big deal. Throughput drops when transactions contend for the same data. But finding the optimal-throughput schedule is NP-complete (Papadimitriou, 1979), so database systems mostly use FIFO arrival order among pending transactions (i.e., the transactions made available to the scheduler at a given time). How much can we reduce conflicts by reordering, and is it worth going to all that trouble?

This paper explores the full schedule space more systematically, even when access patterns are only partially known. It tries to find good schedules quickly enough to be practical, and then execute transactions exactly in that schedule order.

They use SMF for the first challenge and MVSchedO for the second, and implement both on RocksDB to show substantial gains.


Searching for Fast Schedules

To define a framework for measuring the effect of transaction ordering, the paper proposes makespan: the total time for a batch to complete under idealized execution (equal-length operations, no failures, no bottlenecks). Minimizing makespan maximizes throughput for a finite batch. A toy example with four transactions under MVTSO shows that the FIFO schedule has makespan 8, while another that groups compatible transactions reduces makespan to 6, a 25% reduction (and correspondingly higher throughput for the batch).


Shortest Makespan First (SMF) is a greedy scheduling algorithm to reduce makespan. You start with one transaction, then repeatedly add the one from a random sample of k unscheduled transactions that causes the smallest increase in makespan. This gives O(nk) complexity. Small k (like 5) works well in real workloads because there's a good chance of picking a low-conflict transaction.
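The greedy loop can be sketched under a toy cost model (a Python illustration, not the paper's implementation): each transaction takes one time unit, transactions sharing a key serialize, and disjoint ones run in parallel. At each step we sample up to k unscheduled transactions and pick the one whose placement increases the makespan least:

```python
import random

txns = {                 # txn -> keys it touches ("A" is the contended key)
    "t1": {"A"}, "t2": {"A"}, "t3": {"B"}, "t4": {"C"},
}

def smf(txns, k=2, seed=1):
    rng = random.Random(seed)
    free = {}                          # key -> time at which it becomes free
    order, remaining = [], set(txns)
    while remaining:
        sample = rng.sample(sorted(remaining), min(k, len(remaining)))
        def finish(t):                 # completion time if t is scheduled next
            return max((free.get(key, 0) for key in txns[t]), default=0) + 1
        best = min(sample, key=finish) # smallest makespan increase wins
        end = finish(best)
        for key in txns[best]:
            free[key] = end
        order.append(best)
        remaining.discard(best)
    return order, max(free.values())

order, makespan = smf(txns)
print(order, makespan)
```

Here t1 and t2 must serialize on key A while t3 and t4 run alongside them, so the makespan is 2 regardless of order; with longer transactions and more shared keys, the greedy choice of which transaction to place next is what separates good schedules from bad ones.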

The key insight is that most contention comes from a few hot keys. By separating transactions that hit those keys, SMF gets most of the benefit without needing full access-set knowledge. The paper illustrates this with TPC-C: it shows that knowing only Warehouse/District keys gets within 6% of the performance of having complete read/write knowledge.


An online scheduling-first database

To bring SMF into an online database, R-SMF splits the work into three steps:

  1. predict the small set of hot-key accesses using a classifier,
  2. schedule transactions using SMF based on those predictions, and
  3. enforce the order during execution using MVSchedO, which modifies MVTSO.


The classifier is simple and efficient. To produce canonical hot-key sets, it employs kNN to cluster transaction metadata, such as the transaction type (new order, transfer funds) and explicit key parameters (warehouse ID, account number). At runtime, a transaction's hints (derived from metadata) are mapped to a predicted hot-key set. This sacrifices some precision but keeps overhead low, and works well in OLTP workloads where most contention comes from a few obvious keys. The model is periodically retrained to adapt to workload drift.
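The mapping step can be sketched as a nearest-neighbor lookup (a Python illustration with a hypothetical feature encoding; the metadata vectors and hot-key names are made up, not from the paper):

```python
canonical = {                          # metadata vector -> canonical hot-key set
    ("new_order", "wh1"): {"warehouse:1", "district:1"},
    ("payment",   "wh2"): {"warehouse:2"},
}

def predict_hot_keys(hint):
    # 1-nearest-neighbor: score candidates by count of matching metadata fields
    best = max(canonical, key=lambda c: sum(a == b for a, b in zip(c, hint)))
    return canonical[best]

print(predict_hot_keys(("new_order", "wh1")))
```

A transaction whose hint doesn't exactly match any canonical vector still gets mapped to the closest one, which is where the precision loss (and the tolerance to it) comes from.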

The online SMF only tracks in-flight transactions and ignores non-hot keys for makespan. For each hot key, it only tracks the latest conflicting op. This cuts memory and compute cost while keeping most of the scheduling benefit. The scheduler runs in O(nk) per batch, where n = in-flight txns, k is a small constant like 5.

MVSchedO adapts MVTSO to enforce SMF's chosen ordering. MVTSO already assigns timestamps and allows maximal concurrency, but typically assigns timestamps by arrival and does not prevent later-timestamp operations from running before earlier ones (causing aborts). To integrate SMF into MVTSO, MVSchedO enforces SMF's order by giving SMF's timestamps to MVTSO and delays hot-key ops until prior scheduled ops finish. The rest of MVTSO's read/write/commit logic remains unchanged. (In Algorithm 1, only the asterisk-marked lines are updated from the original MVTSO algorithm.) Non-hot keys run as usual. If predictions are wrong, blocked transactions resume when the awaited ops finish or abort. This keeps enforcement overhead low and ensures maximum concurrency possible under the chosen schedule.

MVSchedO preserves serializability because any execution it allows is a valid MVTSO execution under some arrival order. SMF only changes timestamp assignment. The paper argues MVSchedO achieves optimal concurrency for the given schedule with early write visibility.


Evaluation

The evaluation tries to address these questions: (1) how much does scheduling help, (2) what are the overheads, and (3) how does SMF compare to other search methods. They implement R-SMF in RocksDB, plus bolt-on versions (SMF-OCC, SMF-Lock) and a TAO prototype. Workloads include Epinions, SmallBank, TPC-C, TAOBench, and YCSB. Experiments use separate client/server VMs, closed-loop clients, and averaged results.

R-SMF shows big gains on skewed and mixed workloads: up to 3.9x throughput over RocksDB OCC and 3.2x lower tail latency on benchmarks and TAO. SMF reduces wasted work and aborts by picking low-conflict orders; MVSchedO prevents aborts from out-of-order execution after timestamp assignment. The bolt-on SMF variants also help, though less. Abort rates drop sharply across workloads.

Overhead is small when contention is low (<5% throughput drop) since most transactions skip the hot-key logic. At medium and high contention, R-SMF improves throughput significantly. I like this "pay as you need" approach. Classifier accuracy seems to be the key: when hints are good, performance improves; with poor hints (50% wrong), performance can drop. The paper argues the system can fall back to FIFO if schedule quality degrades.



Comparison to Morty

Morty (which we had reviewed recently) and R-SMF both integrate ordering into concurrency control and build on MVTSO ideas, but they differ in approach. R-SMF focuses on avoiding conflicts up front: it predicts hot keys, uses SMF to pick a low-conflict order, and enforces it for those keys via MVSchedO. Execution sticks to the schedule from the start, so there's no speculative work. Mispredictions just cause some idle waiting. This works best when hot-key prediction is accurate and contention is skewed, and it can fall back to FIFO if needed.

Morty instead uses a speculate-and-repair model for geo-replicated, interactive workloads. It assigns a speculative total order at arrival but allows reads from uncommitted data to boost concurrency. At commit-time validation, if a read was stale, Morty rewinds to that point and re-executes only the affected part. This handles mispredictions naturally and works well in high-latency WAN settings, where speculative overlap hides delays.

I think a natural extension would be to combine the two: use R-SMF's SMF+MVSchedO to avoid most conflicts through proactive hot-key scheduling, but when unexpected conflicts slip through (due to mispredictions or non-hot-key contention) apply Morty-style selective re-execution instead of full aborts. This hybrid could cut abort rates further and preserve more work per transaction when hot-key prediction accuracy degrades.

August 12, 2025

Does PostgreSQL support as much "schema flexibility" as MongoDB? Not for indexing!

Another day, another myth. Someone on Medium claims PostgreSQL offers as much "schema flexibility" as MongoDB, and tells me that I'm misinformed when I warn about this misconception: you-are-misinformed. I think I'm informed, but I always test, and none of his examples work: wrong results, and the index is not used.
Let's go through it. Others' mistakes can be valuable learning material, and PostgreSQL with JSON can be tricky.

The article provides the following example of a document:

{
  name: "Amit",
  email: "amit@example.com",
  order: [
    { amount: 1200, status: "paid"  , createdAt:  "2025-08-15" },
    { amount:  300, status: "failed", createdAt: "2025-08-16" }
  ]
}

I create a PostgreSQL table to store this document, as well as another with older dates, to validate the query filtering:

postgres=# CREATE TABLE my_table (
    id BIGSERIAL PRIMARY KEY,
    jsonb_col JSONB
);
CREATE TABLE

postgres=# INSERT INTO my_table(jsonb_col) VALUES  ('
{
  "name": "Amit",
  "order": [
    { "amount": 1200, "status": "paid",   "createdAt":  "2025-08-15" },
    { "amount":  300, "status": "failed", "createdAt": "2025-08-16" }
  ]
}
'),('
{
  "name": "Older",
  "order": [
    { "amount": 1200, "status": "paid",   "createdAt":  "2024-08-15" },
    { "amount":  300, "status": "failed", "createdAt": "2024-08-16" }
  ]
}
');

INSERT 0 2

I used a string for the date because JSONB data types are limited to object, array, string, number, boolean, and null, unlike MongoDB's BSON types, which include a native date type.

1. Incorrect operator

The response on Medium starts like this:

You are misinformed.
CREATE INDEX my_index ON my_table USING GIN (jsonb_col);
SELECT * FROM my_table WHERE jsonb_col @? '$.order.createdAt >= "2025-08-12"';

I copy/paste the example, and it returns a wrong result: it includes a user whose orders were all created before "2025-08-12":

postgres=# CREATE INDEX my_index ON my_table USING GIN (jsonb_col);

CREATE INDEX

postgres=# SELECT * FROM my_table 
           WHERE jsonb_col @? '$.order.createdAt >= "2025-08-12"';

 id |                                                                           jsonb_col
----+----------------------------------------------------------------------------------------------------------------------------------------------------------------
  1 | {"name": "Amit", "order": [{"amount": 1200, "status": "paid", "createdAt": "2025-08-15"}, {"amount": 300, "status": "failed", "createdAt": "2025-08-16"}]}
  2 | {"name": "Older", "order": [{"amount": 1200, "status": "paid", "createdAt": "2024-08-15"}, {"amount": 300, "status": "failed", "createdAt": "2024-08-16"}]}
(2 rows)

Did you spot the error? PostgreSQL JSON operators are tricky:

  • @? is the jsonb_path_exists operator: it checks whether the path yields any item
  • @@ is the jsonb_path_match operator: it checks whether the path predicate returns true

The query returned all rows because the path exists in both documents. What the author wanted was to filter on the date at this path. Here is the corrected query:

postgres=# SELECT * FROM my_table 
           WHERE jsonb_col @@ '$.order.createdAt >= "2025-08-12"';
 id |                                                                         jsonb_col
----+------------------------------------------------------------------------------------------------------------------------------------------------------------
  1 | {"name": "Amit", "order": [{"amount": 1200, "status": "paid", "createdAt": "2025-08-15"}, {"amount": 300, "status": "failed", "createdAt": "2025-08-16"}]}
(1 row)

The result is the expected one: only one user ordered after "2025-08-12". Let's check if the index was used.
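The distinction can be restated with a tiny jsonpath-like helper in Python (my own simplification of lax-mode path evaluation, not PostgreSQL's implementation):

```python
def values_at(node, parts):
    """Yield every value reachable at a dotted path, unwrapping arrays
    along the way, as lax-mode jsonpath does."""
    if not parts:
        yield node
        return
    if isinstance(node, list):
        for item in node:
            yield from values_at(item, parts)
    elif isinstance(node, dict) and parts[0] in node:
        yield from values_at(node[parts[0]], parts[1:])

def path_exists(doc, path):            # analogue of @? / jsonb_path_exists
    return any(True for _ in values_at(doc, path.split(".")))

def path_match(doc, path, pred):       # analogue of @@ / jsonb_path_match
    return any(pred(v) for v in values_at(doc, path.split(".")))
```

Both sample documents satisfy `path_exists(doc, "order.createdAt")`, but only the first satisfies `path_match` with the date predicate, which is why the two queries return different row counts.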

2. Index unused

With such a small dataset, the query runs a full table scan, but that may be a cost-based decision:

postgres=# explain (analyze, costs off, summary off)
           SELECT * FROM my_table
           WHERE jsonb_col @@ '$.order.createdAt >= "2025-08-12"'
;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Seq Scan on my_table (actual time=0.016..0.018 rows=1.00 loops=1)
   Filter: (jsonb_col @@ '($."order"."createdAt" >= "2025-08-12")'::jsonpath)
   Rows Removed by Filter: 1
   Buffers: shared hit=1
 Planning:
   Buffers: shared hit=1

I disable sequential scan and the index is used:

postgres=# set enable_seqscan to off;
SET

postgres=# explain (analyze, costs off, summary off)
           SELECT * FROM my_table
           WHERE jsonb_col @@ '$.order.createdAt >= "2025-08-12"'
;

                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Bitmap Heap Scan on my_table (actual time=0.023..0.025 rows=1.00 loops=1)
   Recheck Cond: (jsonb_col @@ '($."order"."createdAt" >= "2025-08-12")'::jsonpath)
   Rows Removed by Index Recheck: 1
   Heap Blocks: exact=1
   Buffers: shared hit=4
   ->  Bitmap Index Scan on my_index (actual time=0.009..0.009 rows=2.00 loops=1)
         Index Cond: (jsonb_col @@ '($."order"."createdAt" >= "2025-08-12")'::jsonpath)
         Index Searches: 1
         Buffers: shared hit=3
 Planning:
   Buffers: shared hit=1

However, the index was not used to find the rows, despite "createdAt" >= "2025-08-12" appearing in the Index Cond. Look at the number of rows returned by the index scan: rows=2.00.
All rows were read: this is an index full scan. The index condition was not applied, because a range condition cannot be used with a GIN index. The rows were filtered later, when accessing the table (Rows Removed by Index Recheck), to return rows=1. With ten million rows in the table and only a few recent orders, it would have done the same: read all index entries, read all rows from the table, and only then reduce the result.

Reading PostgreSQL execution plans is tricky, but the numbers from explain analyze do not lie. The GIN index was not used. You can also read the documentation:

The standard distribution of PostgreSQL includes a GIN operator class for arrays, which supports indexed queries using these operators:
<@ @> = &&

Those are containment and equality operators, not range operators:

  • @> - Contains: arrayA @> arrayB means that Array A contains all elements of Array B.
  • <@ - Is contained by: arrayA <@ arrayB means that Array A is contained in Array B (all elements of A are in B).
  • = - Equals: arrayA = arrayB means that arrays are equal (same elements, same order).
  • && - Overlap: arrayA && arrayB means that arrays have any elements in common (intersection is not empty).

GIN indexes cannot access ranges of values or return ordered values. They are used to find exact values.
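The semantics of these four operators can be restated in Python (a set-based simplification I'm using for illustration; jsonb containment also recurses into objects):

```python
def contains(a, b):        # a @> b : a contains every element of b
    return set(b) <= set(a)

def contained_by(a, b):    # a <@ b : every element of a is in b
    return set(a) <= set(b)

def equals(a, b):          # a = b : same elements, same order
    return a == b

def overlap(a, b):         # a && b : at least one element in common
    return bool(set(a) & set(b))
```

Note that none of these can express a range predicate like >= "2025-08-12", which is why the GIN index can only recheck rows, never seek to a range.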

3. Expression-based index: wrong result

The author adds another solution: use an expression-based index rather than a GIN index:

Alternatively you could just make a more specific (and faster) index just on the creation time.
CREATE INDEX my_index ON my_table ((jsonb_col #>> '{order,createdAt}')::timestamptz);
SELECT * FROM my_table WHERE (jsonb_col #>> '{order,createdAt}')::timestamptz > now() - '2 days'::interval;

I copy/paste the index creation but it fails:

postgres=# CREATE INDEX my_index ON my_table ((jsonb_col #>> '{order,createdAt}')::timestamptz);

ERROR:  syntax error at or near "::"
LINE 1: ...x ON my_table ((jsonb_col #>> '{order,createdAt}')::timestam...

Expression-based indexes must enclose the whole expression in an extra pair of parentheses:

postgres=# CREATE INDEX my_index ON my_table ( ( (jsonb_col #>> '{order,createdAt}')::timestamptz) );

ERROR:  functions in index expression must be marked IMMUTABLE

Casting to a timestamp is non-deterministic, as it depends on the timezone, so I keep it as a character string in the index:

postgres=# CREATE INDEX my_index2 ON my_table ((jsonb_col #>> '{order,createdAt}'));

CREATE INDEX

The index is created, let's run the query:

postgres=# SELECT * FROM my_table 
           WHERE (jsonb_col #>> '{order,createdAt}') > '2025-08-12'
;

 id | jsonb_col
----+-----------
(0 rows)

The result is empty because jsonb_col #>> '{order,createdAt}' reads a scalar value at "order.createdAt", and none exists. This is different from the fields inside the array, which, with two items, are:

  • "order.0.createdAt" or jsonb_col #>> '{order,0,createdAt}'
  • "order.1.createdAt" or jsonb_col #>> '{order,1,createdAt}'

In MongoDB, if you create an index on "order.createdAt" it will transparently index all those possible paths, because MongoDB is a document database that supports flexible schema. However, in PostgreSQL, you need to use different index types for scalars and arrays.
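The difference is easy to restate in Python (my illustration, not either engine's actual code): a strict path, like #>>, stops when it hits an array, while a multikey-style traversal fans out over the array's elements.

```python
def strict_path(doc, parts):
    """PostgreSQL-#>>-like: each step must be a dict key or an explicit
    array index, so '{order,createdAt}' finds nothing under an array."""
    node = doc
    for part in parts:
        if isinstance(node, dict) and part in node:
            node = node[part]
        elif isinstance(node, list) and part.isdigit() and int(part) < len(node):
            node = node[int(part)]
        else:
            return None
    return node

def multikey_values(node, parts):
    """MongoDB-multikey-like: arrays along the path are unwrapped, so one
    index definition covers order.0.createdAt, order.1.createdAt, ..."""
    if not parts:
        return [node]
    if isinstance(node, list):
        return [v for item in node for v in multikey_values(item, parts)]
    if isinstance(node, dict) and parts[0] in node:
        return multikey_values(node[parts[0]], parts[1:])
    return []
```

On the sample document, the strict path returns nothing for {order,createdAt} and needs an explicit index like {order,0,createdAt}, while the multikey traversal returns both dates from one path definition.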

In short, the solution proposed by the author uses the index but returns a wrong result (rows=0.00):

postgres=# explain (analyze, costs off, summary off)
           SELECT * FROM my_table
           WHERE (jsonb_col #>> '{order,createdAt}') > '2025-08-12'
;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Index Scan using my_index2 on my_table (actual time=0.012..0.012 rows=0.00 loops=1)
   Index Cond: ((jsonb_col #>> '{order,createdAt}'::text[]) > '2025-08-12'::text)
   Index Searches: 1
   Buffers: shared hit=1
(4 rows)

With expression-based indexes, you can get equality, sort, or range access, but only when there's no array in the path.

4. Immutable function-based index: wrong result

The author proposes an alternative solution: force the cast to timestamp to be considered immutable (which it is not) and index it:

Making a function for that lookup cleans things up nicely.
CREATE FUNCTION order_created(jsonb) RETURNS timestamptz IMMUTABLE STRICT LANGUAGE sql AS $$
SELECT ($1 #>> '{order,createdAt}')::timestamptz;
$$;
Then your index and query look like:
CREATE INDEX my_index ON my_table (order_created(jsonb_col));
SELECT * FROM my_table WHERE order_created(jsonb_col) > now() - '2 days'::interval;

Of course, this is not better, as the function still reads the "order" array as if it were a scalar value. In addition, the function-based index materializes a value calculated for a specific timezone, but the session that uses it may run in a different timezone, so the comparison will be wrong.
To demonstrate that it is not immutable, despite the declaration, I create the function for the first item of the array, adding .0 to the path:

postgres=# CREATE FUNCTION order_created(jsonb) 
           RETURNS timestamptz IMMUTABLE STRICT LANGUAGE sql AS $$
            SELECT ($1 #>> '{order,0,createdAt}')::timestamptz;
           $$;

CREATE FUNCTION

For the same value, the one in the table, the function returns different values:

postgres=# SET TIMEZONE TO 'UTC';
SET
postgres=# select now() - order_created( jsonb_col ) from my_table
;
         ?column?
--------------------------
 -2 days -02:19:56.097701
 362 days 21:40:03.902299
(2 rows)

postgres=# SET TIMEZONE TO 'Pacific/Apia';
SET
postgres=# select now() - order_created( jsonb_col ) from my_table
;
         ?column?
--------------------------
 -1 days -13:19:42.152896
 363 days 10:40:17.847104
(2 rows)

If you ever declare a function IMMUTABLE when its expression is not, you must add enough tests to be sure it raises an error, rather than returning wrong results, when used in a different context. In any case, expression-based and function-based indexes cannot be used on documents with an array in the path.
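The same instability can be shown outside the database. In Python (assuming the tz database is available to zoneinfo), the identical input string maps to two different instants depending on the session timezone:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # needs the system tz database (or the tzdata package)

naive = datetime.fromisoformat("2025-08-15")           # the stored string
as_utc = naive.replace(tzinfo=ZoneInfo("UTC"))
as_apia = naive.replace(tzinfo=ZoneInfo("Pacific/Apia"))

# Same input, two different points in time: nothing immutable about this cast.
assert as_utc.timestamp() != as_apia.timestamp()
```

This is exactly why PostgreSQL refuses the ::timestamptz cast in an index expression unless you lie to it with an IMMUTABLE wrapper.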

Conclusion

PostgreSQL provides extended datatypes and indexing options, with JSONB enabling some non-relational data modeling. However, attempting MongoDB-like schema flexibility in PostgreSQL reveals limitations in query result reliability, index efficiency, and type safety.

For data models with arrays, nested fields, and complex types, especially dates, MongoDB’s BSON and indexing ensure correctness and performance. In contrast, PostgreSQL’s JSONB often requires cumbersome path expressions and workarounds, leading to potential errors from operator confusion, array/scalar mismatches, or timezone issues. The JSON and JSONB datatypes help store and retrieve documents within a PostgreSQL transaction, but PostgreSQL remains a SQL database rather than a document database.

  • For data-centric, normalized schemas: use PostgreSQL with a relational data model, even though you can add JSON and JSONB columns for occasional flexibility.
  • For flexible, application-centric document schemas: use MongoDB. Its design is built for rich, evolving structures, arrays, and nested data, even though you can create references and perform lookups between documents.

Be cautious of what you read on Medium or elsewhere if it lacks demos and documentation. Many people accept information as correct without verification. By fact-checking, you’ll make better decisions and learn from others' mistakes. I have a blog post series about learning from the Myths about MongoDB.

How we built our own Claude Code

Or, rather, how we built Tinybird Code, a command line agent inspired by Claude Code, but optimized for complex real-time data engineering problems with ClickHouse.

August 11, 2025

How Wiz achieved near-zero downtime for Amazon Aurora PostgreSQL major version upgrades at scale using Aurora Blue/Green Deployments

Wiz, a leading cloud security company, identifies and removes risks across major cloud platforms. Our agent-less scanner processes tens of billions of daily cloud resource metadata entries. This demands high-performance, low-latency processing, making our Amazon Aurora PostgreSQL-Compatible Edition database, serving hundreds of microservices at scale, a critical component of our architecture. In this post, we share how we upgraded our Aurora PostgreSQL database from version 14 to 16 with near-zero downtime using Amazon Aurora Blue/Green Deployments.

PostgreSQL UUID: Bulk insert with UUIDv7 vs UUIDv4

PostgreSQL 18 (currently in beta) introduces a UUID version 7 generation function that features a time-ordered prefix, reducing the scattering seen in random UUID version 4. I was surprised by the enthusiasm for this. For time-incrementing identifiers, I prefer cached sequences. I think the primary advantage of UUID is the ability to generate it from the application, before hitting the database. I discussed this in a previous post: UUID in PostgreSQL.

There's one case where generating a UUID version 7 in the database is beneficial: bulk loads, as it shares the same uuid datatype as UUIDv4, and both can be present in the same column.
For OLTP, UUIDv4 minimizes hotspots in the B-Tree index during concurrent inserts, and is often generated by the application. However, for bulk loads, UUIDv7 improves cache efficiency by updating fewer blocks. When data comes from another table and needs a new identifier, a built-in generation function is useful.
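To see why UUIDv7 clusters inserts, here is a minimal construction following the RFC 9562 layout (a sketch for illustration, not the engine's implementation; PostgreSQL 18's uuidv7() provides this natively):

```python
import os
import uuid

def uuidv7(unix_ms: int) -> uuid.UUID:
    """Sketch of the RFC 9562 UUIDv7 layout: a 48-bit millisecond
    timestamp leads, so values generated over time sort like a sequence."""
    value = (unix_ms & ((1 << 48) - 1)) << 80        # timestamp in the top bits
    value |= int.from_bytes(os.urandom(10), "big")   # 80 random low bits
    value &= ~(0xF << 76); value |= 0x7 << 76        # version = 7
    value &= ~(0x3 << 62); value |= 0x2 << 62        # variant = 10
    return uuid.UUID(int=value)
```

Because the timestamp leads, rows loaded later land on the right-hand side of the B-Tree and touch few index blocks, while UUIDv4 scatters inserts across the whole index.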

Bulk Insert throughput

For this demo, I ran a bulk ingest job, inserting 10 million rows, and compared identifiers generated by uuidv7() and uuidv4(). The purpose is to illustrate the comparison, not to provide a rigorous benchmark.

Here is the full script:


-- reset (you are in a lab)
\! pkill -f "postgres: .* COPY"
drop table if exists demo;
drop table if exists progress;

-- create table and insert 10M rows in background
create table demo ( 
    id uuid default uuidv7() primary key,
 -- id uuid default uuidv4() primary key,
 value text
);

\! psql -c "copy demo(value) from program 'base64 -w 100 /dev/urandom | head -10000000'" &

-- monitor size and progress

create table progress as
select now() ts,*,pg_indexes_size('demo')/8192 index_pages, pg_table_size('demo') tablesize , pg_indexes_size('demo') indexsize, pg_current_wal_lsn()
 from pg_stat_progress_copy;
create index on progress(ts asc);

-- record last progress snapshot
insert into progress
select now(),*,pg_indexes_size('demo')/8192 index_pages, pg_table_size('demo') tablesize , pg_indexes_size('demo') indexsize, pg_current_wal_lsn()
 from pg_stat_progress_copy
returning *
\;
-- display rate over last snapshots
select
  pg_size_pretty((bytes_processed - lag(bytes_processed) over w) / extract(epoch from (ts - lag(ts) over w))) || '/s' as "COPY bytes/s",  
  (tuples_processed - lag(tuples_processed) over w) / extract(epoch from (ts - lag(ts) over w))::int as "rows/s",  
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn , lag(pg_current_wal_lsn) over w) / extract(epoch from (ts - lag(ts) over w))) || '/s' as "WAL bytes/s",   
round(pg_wal_lsn_diff(pg_current_wal_lsn , lag(pg_current_wal_lsn) over w) / (bytes_processed - lag(bytes_processed) over w),2) || ' %' as "WAL bytes/COPY bytes", 
 pg_size_pretty(indexsize) "index",
 pg_size_pretty(tablesize) "table",
 current_setting('shared_buffers') shared_buffers,
 ts
from progress
window w as (order by ts asc)
order by ts asc -- limit 30
-- every 10 second
\watch 10

The script first resets the lab environment to avoid leftovers from previous runs. It then creates a demo table and populates it via COPY with pseudo-random values, running in the background. The foreground session queries pg_stat_progress_copy every ten seconds and calculates some throughput statistics. Each snapshot records:

  • Insert rate (rows per second)
  • COPY throughput (input bytes per second)
  • Index size and table size over time
  • WAL (Write-Ahead Log) usage

UUIDv7

Here is the result with id uuid default uuidv7() primary key:

 COPY bytes/s | rows/s | WAL bytes/s | WAL bytes/COPY bytes |  index  |  table  | shared_buffers |              ts
--------------+--------+-------------+----------------------+---------+---------+----------------+-------------------------------
              |        |             |                      | 8232 kB | 38 MB   | 128MB          | 2025-08-11 15:44:56.858214+00
 9030 kB/s    |  91601 | 22 MB/s     | 2.46 x               | 36 MB   | 169 MB  | 128MB          | 2025-08-11 15:45:06.858344+00
 8979 kB/s    |  91036 | 22 MB/s     | 2.50 x               | 63 MB   | 298 MB  | 128MB          | 2025-08-11 15:45:16.858316+00
 819 kB/s     |   8314 | 2046 kB/s   | 2.50 x               | 65 MB   | 310 MB  | 128MB          | 2025-08-11 15:45:26.858289+00
 826 kB/s     |   8318 | 2048 kB/s   | 2.48 x               | 68 MB   | 321 MB  | 128MB          | 2025-08-11 15:45:36.858318+00
 7693 kB/s    |  77984 | 19 MB/s     | 2.50 x               | 91 MB   | 432 MB  | 128MB          | 2025-08-11 15:45:46.858495+00
 7296 kB/s    |  74032 | 18 MB/s     | 2.50 x               | 114 MB  | 537 MB  | 128MB          | 2025-08-11 15:45:56.858338+00
 5184 kB/s    |  52531 | 13 MB/s     | 2.51 x               | 130 MB  | 612 MB  | 128MB          | 2025-08-11 15:46:06.858545+00
 832 kB/s     |   8453 | 2017 kB/s   | 2.42 x               | 132 MB  | 624 MB  | 128MB          | 2025-08-11 15:46:16.858854+00
 1389 kB/s    |  14074 | 3528 kB/s   | 2.54 x               | 136 MB  | 644 MB  | 128MB          | 2025-08-11 15:46:26.858837+00
 7642 kB/s    |  77440 | 19 MB/s     | 2.49 x               | 160 MB  | 754 MB  | 128MB          | 2025-08-11 15:46:36.858441+00
 7776 kB/s    |  78848 | 19 MB/s     | 2.50 x               | 183 MB  | 866 MB  | 128MB          | 2025-08-11 15:46:46.858377+00
 4826 kB/s    |  48928 | 12 MB/s     | 2.50 x               | 198 MB  | 935 MB  | 128MB          | 2025-08-11 15:46:56.858401+00
 1350 kB/s    |  13675 | 3366 kB/s   | 2.49 x               | 202 MB  | 955 MB  | 128MB          | 2025-08-11 15:47:06.858323+00
 1216 kB/s    |  12372 | 3047 kB/s   | 2.51 x               | 206 MB  | 972 MB  | 128MB          | 2025-08-11 15:47:16.858334+00
 7731 kB/s    |  78382 | 19 MB/s     | 2.50 x               | 229 MB  | 1084 MB | 128MB          | 2025-08-11 15:47:26.859178+00
 6996 kB/s    |  70923 | 17 MB/s     | 2.50 x               | 251 MB  | 1185 MB | 128MB          | 2025-08-11 15:47:36.858273+00
 4928 kB/s    |  49925 | 12 MB/s     | 2.55 x               | 266 MB  | 1255 MB | 128MB          | 2025-08-11 15:47:46.858335+00
 934 kB/s     |   9504 | 2166 kB/s   | 2.32 x               | 269 MB  | 1269 MB | 128MB          | 2025-08-11 15:47:56.858356+00
 2259 kB/s    |  22880 | 5806 kB/s   | 2.57 x               | 276 MB  | 1301 MB | 128MB          | 2025-08-11 15:48:06.858839+00
 7744 kB/s    |  78496 | 19 MB/s     | 2.45 x               | 299 MB  | 1413 MB | 128MB          | 2025-08-11 15:48:16.858306+00
(21 rows)

The UUIDv7 ingest run shows consistently high throughput, with brief dips likely due to vacuum, background I/O or checkpoints, followed by quick recovery. Index growth is smooth and compact, while WAL overhead stays stable at 2.50 times the input data. The sequential, time-sortable nature of UUIDv7 enables fast and predictable bulk load performance, completing the 10M row job in just over 3 minutes and maintaining tight disk usage.

UUIDv4

Here is the result with id uuid default uuidv4() primary key:

 COPY bytes/s | rows/s | WAL bytes/s | WAL bytes/COPY bytes |  index  |  table  | shared_buffers |              ts
--------------+--------+-------------+----------------------+---------+---------+----------------+-------------------------------
 8698 kB/s    |  88199 | 24 MB/s     | 2.82 x               | 37 MB   | 142 MB  | 128MB          | 2025-08-11 15:37:08.184794+00
 7802 kB/s    |  79130 | 20 MB/s     | 2.66 x               | 71 MB   | 254 MB  | 128MB          | 2025-08-11 15:37:18.184817+00
 5920 kB/s    |  59959 | 15 MB/s     | 2.58 x               | 90 MB   | 339 MB  | 128MB          | 2025-08-11 15:37:28.184804+00
 1248 kB/s    |  12664 | 3364 kB/s   | 2.70 x               | 96 MB   | 357 MB  | 128MB          | 2025-08-11 15:37:38.184869+00
 877 kB/s     |   8924 | 2391 kB/s   | 2.73 x               | 101 MB  | 370 MB  | 128MB          | 2025-08-11 15:37:48.184803+00
 1882 kB/s    |  19083 | 12 MB/s     | 6.60 x               | 112 MB  | 397 MB  | 128MB          | 2025-08-11 15:37:58.184795+00
 4384 kB/s    |  44420 | 12 MB/s     | 2.71 x               | 131 MB  | 460 MB  | 128MB          | 2025-08-11 15:38:08.184808+00
 3821 kB/s    |  38720 | 10006 kB/s  | 2.62 x               | 142 MB  | 515 MB  | 128MB          | 2025-08-11 15:38:18.184814+00
 2778 kB/s    |  28160 | 7144 kB/s   | 2.57 x               | 149 MB  | 555 MB  | 128MB          | 2025-08-11 15:38:28.184822+00
 1971 kB/s    |  19982 | 5152 kB/s   | 2.61 x               | 155 MB  | 583 MB  | 128MB          | 2025-08-11 15:38:38.184814+00
 1427 kB/s    |  14513 | 3750 kB/s   | 2.63 x               | 159 MB  | 604 MB  | 128MB          | 2025-08-11 15:38:48.184858+00
 1152 kB/s    |  11658 | 3069 kB/s   | 2.66 x               | 164 MB  | 621 MB  | 128MB          | 2025-08-11 15:38:58.184933+00
 1312 kB/s    |  13333 | 15 MB/s     | 11.83 x              | 169 MB  | 639 MB  | 128MB          | 2025-08-11 15:39:08.184786+00
 3373 kB/s    |  34144 | 9108 kB/s   | 2.70 x               | 184 MB  | 688 MB  | 128MB          | 2025-08-11 15:39:18.184791+00
 2221 kB/s    |  22528 | 5985 kB/s   | 2.69 x               | 196 MB  | 720 MB  | 128MB          | 2025-08-11 15:39:28.18487+00
 3885 kB/s    |  39424 | 10 MB/s     | 2.72 x               | 217 MB  | 776 MB  | 128MB          | 2025-08-11 15:39:38.184842+00
 3232 kB/s    |  32736 | 8685 kB/s   | 2.69 x               | 233 MB  | 822 MB  | 128MB          | 2025-08-11 15:39:48.18481+00
 832 kB/s     |   8448 | 2233 kB/s   | 2.68 x               | 237 MB  | 834 MB  | 128MB          | 2025-08-11 15:39:58.184806+00
 800 kB/s     |   8096 | 2136 kB/s   | 2.67 x               | 241 MB  | 846 MB  | 128MB          | 2025-08-11 15:40:08.184789+00
 1107 kB/s    |  11264 | 2962 kB/s   | 2.68 x               | 246 MB  | 862 MB  | 128MB          | 2025-08-11 15:40:18.184786+00
 768 kB/s     |   7744 | 15 MB/s     | 19.64 x              | 249 MB  | 873 MB  | 128MB          | 2025-08-11 15:40:28.184804+00
 2912 kB/s    |  29568 | 11 MB/s     | 3.99 x               | 260 MB  | 915 MB  | 128MB          | 2025-08-11 15:40:38.184825+00
 3475 kB/s    |  35200 | 9083 kB/s   | 2.61 x               | 271 MB  | 965 MB  | 128MB          | 2025-08-11 15:40:48.184843+00
 1491 kB/s    |  15136 | 3886 kB/
by Franck Pachot

Announcing Neki

Sharded Postgres by the team behind Vitess

NULL BITMAP on SIMD

A NULL BITMAP guest post on loop parallelism transformations.

August 09, 2025

What even is distributed systems

Distributed systems is simply the study of interactions between processes. Every two interacting processes form a distributed system, whether they are on the same host or not. Distributed systems create new challenges (compared to single-process systems) in terms of correctness (i.e. consistency), reliability, and performance (i.e. latency and throughput).

The best way to learn about the principles and fundamentals of distributed systems is to 1) read Designing Data Intensive Applications and 2) read through the papers and follow the notes in the MIT Distributed Systems course.

For Designing Data Intensive Applications (DDIA), I strongly encourage you to find buddies at work or online who will read it through with you. You can also always join the Software Internals Discord's #distsys channel to ask questions as you go. But it's still best if you have some partners to go through the book with, even if they are as new to it as you.

I also used to think you might want to wait a few years into your career before reading DDIA, but when you have friends to read it with, there is no need to wait.

If you have only skimmed the book you should definitely go back and give it a thorough read. I have read it three times already and I will read it again as part of the Software Internals Book Club next year after the 2nd Edition is published.

Keep in mind that every chapter of DDIA provides references to papers you can keep reading should you end up memorizing DDIA itself.

When you've read parts of DDIA or the MIT Distributed Systems course and you want practice, the Fly.io x Jepsen Distributed Systems Challenge is one guided option. Other options might include simply implementing (in somewhat ascending complexity):

  • two-phase commit
  • three-phase commit
  • single-decree Paxos
  • chain replication (or CRAQ), using a 3rd-party consensus library
  • Raft
  • EPaxos

And if you get bored there you can see Alex Miller's Data Replication Design Spectrum for more ideas and variants.
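As a taste of the first item on that list, the core of a two-phase commit round fits in a few lines. This is a toy in-memory simulation (the participant objects and vote logic are invented for illustration), not a production protocol with durable logging, retries, or timeouts:

```javascript
// Minimal two-phase commit: the coordinator commits only if every
// participant votes "yes" in the prepare phase; otherwise it aborts everywhere.
function twoPhaseCommit(participants) {
  // Phase 1: prepare — collect votes from all participants
  const votes = participants.map((p) => p.prepare());
  // Phase 2: decide, then broadcast the decision
  const decision = votes.every((v) => v === "yes") ? "commit" : "abort";
  participants.forEach((p) => p.finish(decision));
  return decision;
}

// Toy participant: votes "yes" unless marked as failing.
const participant = (failing) => ({
  prepare: () => (failing ? "no" : "yes"),
  finish: (decision) => decision, // a real participant would persist this durably
});

console.log(twoPhaseCommit([participant(false), participant(false)])); // "commit"
console.log(twoPhaseCommit([participant(false), participant(true)]));  // "abort"
```

The hard parts that make the real protocol interesting — coordinator crashes between phases, participants blocking on an in-doubt transaction — are exactly what the papers above address.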

If these projects and papers sound arcane or intimidating, know that you will run into the problems these projects/papers solve whether or not you know and understand these solutions. Developers often end up reinventing hacky versions of these, which are more likely to have subtle bugs, when instead you could recognize and use one of these well-known building blocks. Or at least have the background to better reason about correctness should you find yourself working with a novel distributed system or designing a new one yourself.

And again, if you want folks to bounce ideas off of or ask questions to, I strongly encourage you to join the Software Internals Discord and ask there!

August 08, 2025

Joining and grouping on array fields in MongoDB may require using $unwind before applying $group or $lookup

Working with nested data in MongoDB simplifies mapping between application objects and database structures. However, challenges can arise when grouping or joining values within sub-document arrays, particularly for developers coming from SQL databases with normalized data, where results are always flattened into a tabular form.

I'll go through an example with the following collection, and link to MongoDB Playground for each explanation.

[
  {
    "projectName": "Troubleshooting PostgreSQL issues",
    "team": [
      { "memberId": "cyclops", "name": "Cyclops", "role": "Postgres Expert" },
      { "memberId": "wolverine", "name": "Wolverine", "role": "Consultant" },
      { "memberId": "storm", "name": "Storm", "role": "DBA" },
      { "memberId": "beast", "name": "Beast", "role": "Developer" },
      { "memberId": "tony", "name": "Tony", "role": "Architect" }
    ],
    "status": "active"
  },
  {
    "projectName": "Build new apps with MongoDB",
    "team": [
      { "memberId": "tony", "name": "Tony", "role": "Developer" }
    ],
    "status": "planned"
  }
]

Suppose you want a report of which projects each person was involved in, and their role(s) for each.

In PostgreSQL, you’d normalize into three tables because of the many-to-many relationship. The join returns one row per project and team member, with information about projects and members duplicated. You can then aggregate per team member to get each person's projects as an array:

SELECT    
  tm.member_id,    
  tm.name,    
  ARRAY_AGG(p.project_name ORDER BY p.project_id) AS projects    
FROM    
  team_members tm    
  JOIN project_team pt ON tm.member_id = pt.member_id    
  JOIN projects p ON pt.project_id = p.project_id    
GROUP BY    
  tm.member_id, tm.name    
ORDER BY    
  tm.member_id;    

Here's the example on db<>fiddle: https://dbfiddle.uk/FhsA9DpU

In MongoDB, relationships are embedded. I created a "team" array for project members. Let's explore how to aggregate it, starting with some common mistakes.

The Wrong Way: Grouping Directly on an Array

Suppose you try to group directly on the array field:

db.projects.aggregate([  
  { $group: { _id: "$team.memberId", projects: { $push: "$projectName" } } }  
])

What you get:

[
  {
    "_id": [
      "cyclops", "wolverine", "storm", "beast", "tony"
    ],
    "projects": ["Troubleshooting PostgreSQL issues"]
  },
  {
    "_id": [
      "tony"
    ],
    "projects": ["Build new apps with MongoDB"]
  }
]

Here's the example on MongoDB Playground: https://mongoplayground.net/p/bn7xsRMQhu2

What went wrong?

MongoDB used the whole array as the grouping key. Instead of grouping by each individual member, you get one group for each unique full-team combination.

Tip for SQL users:

In SQL, GROUP BY splits rows by scalar values, but in MongoDB, grouping on an array field groups by the entire array as a single value.

The Wrong Way: Lookup Directly on an Array

Similarly, you may try to enrich team data with $lookup on the array field:

{  
  $lookup: {  
    from: "members",  
    localField: "team.memberId",  
    foreignField: "memberId",  
    as: "memberInfo"  
  }  
}

Here's the example on MongoDB Playground:

https://mongoplayground.net/p/RVXoY5z7ke1

What you get:

  • The entire array is used as the lookup key, so memberInfo will usually just be an empty array (no matches), unless a member’s memberId somehow matches an array, which it never will.

Key for SQL folks:

Unlike SQL, where the JOIN is applied per row value, MongoDB $lookup expects scalar fields, not arrays. If your join keys are inside an array, you need to flatten that array first.

The Right Way: Flatten with $unwind, Then $group

First, flatten out your team array so each person/project combination gets its own pipeline document:

{ $unwind: "$team" }

Here's how it looks on MongoDB Playground:

https://mongoplayground.net/p/yGvvM-FZM5p
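Outside the database, the effect of { $unwind: "$team" } can be sketched in plain JavaScript as a flatMap that emits one document per array element (collection trimmed to one project for brevity):

```javascript
// Sketch of $unwind semantics: one output document per element of the array field.
const projects = [
  {
    projectName: "Troubleshooting PostgreSQL issues",
    team: [
      { memberId: "beast", name: "Beast", role: "Developer" },
      { memberId: "tony", name: "Tony", role: "Architect" },
    ],
    status: "active",
  },
];

const unwind = (docs, field) =>
  docs.flatMap((doc) => doc[field].map((elem) => ({ ...doc, [field]: elem })));

const result = unwind(projects, "team");
console.log(result.length);           // 2 — one pipeline document per team member
console.log(result[0].team.memberId); // "beast" — team is now a single object
```

The real $unwind stage also handles missing fields and options like preserveNullAndEmptyArrays; this sketch only covers the happy path.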

Then group by the member ID. Don’t forget to include the role for each project; here, we also collect all roles for completeness:

{
  $group: {
    _id: "$team.memberId",
    memberName: { $first: "$team.name" },
    roles: { $addToSet: "$team.role" },
    projects: {
      $push: {
        name: "$projectName",
        id: "$_id",
        role: "$team.role"
      }
    }
  }
}

The full working pipeline looks like this:

db.projects.aggregate([
  { $unwind: "$team" },
  {
    $group: {
      _id: "$team.memberId",
      memberName: { $first: "$team.name" },
      roles: { $addToSet: "$team.role" },
      projects: {
        $push: {
          name: "$projectName",
          id: "$_id",
          role: "$team.role"
        }
      }
    }
  }
])

Here's the result on MongoDB Playground:

https://mongoplayground.net/p/oMOfvrZXa2a

This is the expected output, with one document per member and their array of projects:

[
  {
    "_id": "cyclops",
    "memberName": "Cyclops",
    "roles": ["Postgres Expert"],
    "projects": [
      { "name": "Troubleshooting PostgreSQL issues", "id": ObjectId("60f...a1"), "role": "Postgres Expert" }
    ]
  },
  {
    "_id": "wolverine",
    "memberName": "Wolverine",
    "roles": ["Consultant"],
    "projects": [
      { "name": "Troubleshooting PostgreSQL issues", "id": ObjectId("60f...a1"), "role": "Consultant" }
    ]
  },
  {
    "_id": "storm",
    "memberName": "Storm",
    "roles": ["DBA"],
    "projects": [
      { "name": "Troubleshooting PostgreSQL issues", "id": ObjectId("60f...a1"), "role": "DBA" }
    ]
  },
  {
    "_id": "beast",
    "memberName": "Beast",
    "roles": ["Developer"],
    "projects": [
      { "name": "Troubleshooting PostgreSQL issues", "id": ObjectId("60f...a1"), "role": "Developer" }
    ]
  },
  {
    "_id": "tony",
    "memberName": "Tony",
    "roles": ["Architect", "Developer"],
    "projects": [
      { "name": "Troubleshooting PostgreSQL issues", "id": ObjectId("60f...a1"), "role": "Architect" },
      { "name": "Build new apps with MongoDB", "id": ObjectId("60f...a2"), "role": "Developer" }
    ]
  }
]

Now you get the right result: Tony appears in both projects, and each is listed with the correct role.
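To see why this regrouping works, the same unwind-then-group logic can be emulated with flatMap and reduce in plain JavaScript (data trimmed to Tony's two projects for brevity):

```javascript
// Emulate { $unwind: "$team" } followed by { $group: { _id: "$team.memberId", ... } }.
const projects = [
  { projectName: "Troubleshooting PostgreSQL issues",
    team: [{ memberId: "tony", name: "Tony", role: "Architect" }] },
  { projectName: "Build new apps with MongoDB",
    team: [{ memberId: "tony", name: "Tony", role: "Developer" }] },
];

// $unwind: one document per team member
const unwound = projects.flatMap((p) => p.team.map((m) => ({ ...p, team: m })));

// $group keyed on team.memberId
const grouped = unwound.reduce((acc, doc) => {
  const key = doc.team.memberId;
  acc[key] ??= { memberName: doc.team.name, roles: new Set(), projects: [] };
  acc[key].roles.add(doc.team.role);                                        // like $addToSet
  acc[key].projects.push({ name: doc.projectName, role: doc.team.role });   // like $push
  return acc;
}, {});

console.log(grouped.tony.projects.length); // 2 — Tony appears in both projects
```

The Set mirrors $addToSet's deduplication, while the array push mirrors $push's order-preserving accumulation.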

Practical reminder:

MongoDB’s $unwind makes array data behave more like SQL’s flattened rows, so aggregation and joins work as SQL users expect.

Here is an example with $lookup: https://mongoplayground.net/p/I3__p2EE9ps

What About Filtering Instead of Grouping?

If you just want to filter out all but the developers in each project, no unwinding is needed:

db.projects.aggregate([
  {
    $addFields: {
      team: {
        $filter: {
          input: "$team",
          as: "member",
          cond: { $eq: ["$$member.role", "Developer"] }
        }
      }
    }
  }
])

Here's the MongoDB playground:

https://mongoplayground.net/p/bhDQJVKu7Rp

[
  {
    "_id": ObjectId("60f...a1"),
    "projectName": "Troubleshooting PostgreSQL issues",
    "team": [
      { "memberId": "beast", "name": "Beast", "role": "Developer" }
    ],
    "status": "active"
  },
  {
    "_id": ObjectId("60f...a2"),
    "projectName": "Build new apps with MongoDB",
    "team": [
      { "memberId": "tony", "name": "Tony", "role": "Developer" }
    ],
    "status": "planned"
  }
]
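For SQL users, $filter is essentially Array.prototype.filter applied inside each document; a plain-JavaScript sketch of the stage above:

```javascript
// Sketch of { $filter: { input: "$team", as: "member",
//                        cond: { $eq: ["$$member.role", "Developer"] } } }
const project = {
  projectName: "Troubleshooting PostgreSQL issues",
  team: [
    { memberId: "beast", name: "Beast", role: "Developer" },
    { memberId: "storm", name: "Storm", role: "DBA" },
  ],
};

// $addFields replaces "team" with the filtered array; the document keeps its shape.
const filtered = {
  ...project,
  team: project.team.filter((m) => m.role === "Developer"),
};

console.log(filtered.team.map((m) => m.memberId)); // [ 'beast' ]
```

Because the filtering happens inside each document, no $unwind is needed: the array stays an array, just shorter.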

I used the following aggregation operators:

  • $unwind: Flattens array fields, creating a separate document for each element in the array.
  • $group: Similar to SQL’s GROUP BY, especially effective after arrays have been unwound.
  • $lookup: Similar to a SQL LEFT OUTER JOIN, but by default MongoDB does not flatten joined results into multiple documents with repeated data. Instead, matched documents are returned as an array field, preserving the nested, non-first-normal-form structure.

SQL databases store their data normalized, often in third normal form, with multiple tables for one-to-many and many-to-many relationships to avoid duplicated data. The query result requires a join and is always in first normal form (1NF), as the result of a SQL query is a single flat table. Since 1NF cannot have arrays, SQL databases have to unnest groups, flattening relationships to multiple rows and introducing redundancy in a denormalized tabular result set.

MongoDB is not constrained by normal forms and supports rich document models, with arrays for repeating groups and nested objects directly in each document. When you use the aggregation pipeline, results can keep this nested structure. But if you want to group or join on values nested inside arrays, you’ll need to flatten the array to multiple documents using $unwind, so further aggregation stages work as expected. In practice, $lookup in MongoDB is often compared to JOINs in SQL, but if your fields live inside arrays, a join operation is really $unwind followed by $lookup.

Key takeaways for SQL users moving to MongoDB:

  • Arrays are not automatically expanded or unnested.
  • Whenever your “join key” or “group by value” is inside an array, always unwind first.
  • In MongoDB, aggregation pipeline results can be nested and contain arrays, unlike the flat results of SQL queries.
  • $unwind is the document-model equivalent of SQL’s UNNEST: it’s your bridge from nested arrays to flat, row-like documents for further aggregation.
  • When joining with $lookup, always check whether your localField or foreignField is an array and flatten as needed.

Neurosymbolic AI: The 3rd Wave

The paper (arXiv 2020, also AI Review 2023) opens with a discussion of recent high-profile AI debates: the Montréal AI Debate and the AAAI 2020 fireside chat with Kahneman, Hinton, LeCun, and Bengio. A consensus seems to be emerging: for AI to be robust and trustworthy, it must combine learning with reasoning. Kahneman's "System 1 vs. System 2" dual framing of cognition maps well onto deep learning and symbolic reasoning, and AI needs both.

Neurosymbolic AI promises to combine data-driven learning with structured reasoning, and to provide modularity, interpretability, and measurable explanations. The paper moves from philosophical context to representation, then to system design and technical challenges in neurosymbolic AI.


Neurons and Symbols: Context and Current Debate

This section lays out the historic divide between symbolic AI and neural AI. The symbolic approach supports logic, reasoning, and explanation. The neural approach excels at perception and learning from data. Symbolic systems are good at thinking, but not learning. Deep learning is good at learning, but not thinking. Despite the great progress recently, deep learning still lacks transparency and remains energy-hungry. Critics like Gary Marcus argue that symbolic manipulation is needed for generalization and commonsense.

The authors here appeal to Valiant's call for a "semantics of knowledge" and say that neural-symbolic computing aims to answer this call. Symbolic logic can be embedded in neural systems, and neural representations can be interpreted in symbolic terms. Logic Tensor Networks (LTNs) are presented as a concrete solution. They embed first-order logic formulas into tensors and sneak logic into the loss function, so the model learns not just from data but also from rules. For this, logical formulas are relaxed into differentiable constraints, which are then used during training to guide the model toward satisfying logical relationships while learning from data. I was surprised to find concrete work and software for LTNs on GitHub; there is also a paper explaining the principles.


Distributed and Localist Representation

This section reframes the debate around representation. Neural networks use distributed representations: knowledge is encoded in continuous vectors, over which the concepts are smeared. This works well for learning and optimization. Symbolic systems use localist representations: discrete identifiers for concepts. These are better for reasoning and abstraction.

The challenge is to bridge the two. LTNs do this by grounding symbolic logic into tensor-based representations. Logic formulas are mapped to constraints over continuous embeddings. This enables symbolic querying over learned neural structures, while preserving the strengths of gradient-based learning. LTNs also allow symbolic structure to emerge during learning.

There is an interesting contrast here with the Neurosymbolic AI paper we reviewed yesterday. That paper favored Option 2 approaches, which begin with the neural representation and lift it into symbolic form. In other words, it advocates extracting structured symbolic patterns, explanations, or logical chains of reasoning from the output of neural systems. This paper, through its advocacy of LTNs, seems to favor Option 1: embedding symbolic structures into neural vector spaces.


Neurosymbolic Computing Systems: Technical Aspects

Symbolic models use rules: decision trees, logic programs, structured knowledge. They are interpretable but brittle to change and new information. Deep nets, on the other hand, learn vector patterns using gradient descent. They are great with fuzzy input, but awful at generalizing rules. They speak linear algebra, not logic.

The first approach to combine them is to bake logic into the network's structure. The second approach is to encode logic in the loss function, but otherwise keep it separate from the network's architecture. The authors seem to lean toward the second approach for its flexibility, modularity, and scalability.

LTNs also seem to fall into the second approach. LTNs represent logical formulas as differentiable constraints, which are added to the loss function during training. The network learns to satisfy logic, but the logic is not hardwired into its structure. So the logic guides learning, but it is not embedded in the weights.
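As a toy illustration of this second approach (a sketch of the general technique, not LTN's actual implementation), an implication A → B over truth values in [0, 1] can be relaxed with the Reichenbach form 1 − a + a·b, and its degree of violation used as a loss term:

```javascript
// Relax the rule "A implies B" into a differentiable penalty.
// Truth values live in [0, 1]; Reichenbach implication: I(a, b) = 1 - a + a*b.
const implies = (a, b) => 1 - a + a * b;

// Loss term: how strongly one example violates the rule.
// Zero when the rule holds; added to the data loss during training.
const ruleLoss = (a, b) => 1 - implies(a, b);

console.log(ruleLoss(1.0, 1.0)); // 0 — antecedent and consequent both true
console.log(ruleLoss(1.0, 0.0)); // 1 — maximal violation
console.log(ruleLoss(0.0, 0.3)); // 0 — vacuously satisfied
```

Because the penalty is a smooth function of the predicted truth values, gradient descent can push the model toward rule-satisfying predictions without the logic ever being hardwired into the architecture.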


Challenges for the Principled Combination of Reasoning and Learning

Combining reasoning and learning introduces new challenges. One is how to handle quantifiers. Symbolic systems handle universal quantifiers (∀) well. Neural networks are better at spotting existential patterns (∃). This asymmetry makes hybrid systems attractive: let each side do what it does best.
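A common differentiable relaxation of quantifiers in this line of work (a sketch; the paper's exact aggregation operators may differ) treats ∀ as a mean and ∃ as a max over a batch of truth values:

```javascript
// Aggregate per-example truth values in [0, 1] into soft quantifier truths.
const forall = (truths) => truths.reduce((s, t) => s + t, 0) / truths.length; // soft ∀
const exists = (truths) => Math.max(...truths);                               // soft ∃

const batch = [0.9, 0.8, 0.1];
console.log(forall(batch)); // ≈ 0.6 — the universal claim is dragged down by the 0.1 outlier
console.log(exists(batch)); // 0.9 — one strong witness is enough
```

The mean keeps gradients flowing to every example, while the max concentrates credit on the strongest witness, which matches the intuition that universal claims are fragile and existential ones need only one supporting case.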

Restricted Boltzmann Machines (RBMs) are discussed as early examples of hybrid models. They learn probability distributions over visible and hidden variables. With modular design, rules can be extracted from trained RBMs. But as models grow deeper, they lose modularity and interpretability. Autoencoders, GANs, and model-based reinforcement learning may offer ways to address this.