a curated list of database news from authoritative sources

September 08, 2025

Our Myrtle Beach vacation

This year was a bastard. Not from work. God, no, I find work relaxing. Reading papers, inventing algorithms, ripping apart distributed systems with TLA+ models -- that's how I have fun. I can do that all day with a grin on my face. But the minute I need to do adulting (like simply calling the cable company to ask why they keep increasing our bill when I'm not looking), I stress and procrastinate for weeks. And this year I had a lot of adulting to do: putting our house on the market and planning a move to California, all while juggling three kids' school and EC activities. I was pretty stressed most of the time, and I've been grinding my teeth at night like a mule chewing rocks.

Anywho, we botched our great escape to California. The house didn't sell as quickly as we hoped it would, and we are stuck in Buffalo for another year. Summer disappeared in a cloud of errands and disappointment, and suddenly it was late August, with our kids twitching with back-to-school nerves. There was still some time left to salvage the wreck. We needed a beach, any beach. Myrtle Beach! We had heard good things about it. One of my friends called it the Redneck Riviera. Good enough for me, and far enough from Buffalo, so we decided to give it a try.

Planning is not my strong suit. My wife took the reins. She scoured hotel reviews like a CIA interrogator, picking through tales of bedbugs, mold, broken elevators. She has a radar for doom. Without her, I'd have booked us straight into some cockroach casino on the boardwalk. But she nailed it. Ten days before departure, she locked down an Airbnb room inside a proper resort hotel, facing the ocean. We chose Airbnb for better customer service and because the photos showed the exact floor and view we would get. There was no guessing which floor or room we would get if we went with the resort directly.

The best thing about a vacation is the anticipation, the wait. We counted down the days, giddy with excitement.

And then, the drive. Always the drive. That is how the Demirbas family rolls: no planes if at all possible. Planes are a scam. For five people, it's bankruptcy on wings. You waste a whole day shuffling through TSA lines, just to pray nervously that your flight doesn't get canceled on you and, if you are lucky, to sit in a recycled-air canister for hours. We once drove from Buffalo to Seattle and back. Another time to Florida and back. For us, the seven hours of asphalt from Buffalo to Boston is a warm-up lap. Myrtle Beach was thirteen. Still doable. Just load the audiobooks, clamp the Bose headphones on my head, and hit the highway. Driving is my meditation: the road pours itself under my car, like some childhood arcade game where the scenery and the other cars scroll past for hours while I nudge the steering wheel left and right to accommodate.

We left Buffalo at 8:30 in our tightly packed Highlander. By noon the youngest announced that she hadn't hit the bathroom that morning, so we stopped at a McDonald's a little south of Pittsburgh. We mostly pass as a normal American family, but at this stop we stood out like a sore thumb. We received a lot of cold blond stares. I understood why when we drove another 30 minutes: the barns started shouting TRUMP in dripping paint, and we entered West Virginia. God's Country, they call it. Heaven on earth. But it was just some green hills, and the scenery didn't impress me much.

Our next stop was a rest area in North Carolina, which turned out to be the cleanest, most immaculate rest area I'd ever seen. Somebody in Raleigh must be laundering money through landscaping contracts, but damn if it didn't impress us. Even the butterflies were impressive!

Then Myrtle Beach: 85-degree weather, ocean air, a great view from our flat, and a nice waterpark at the resort. Southern hospitality is real; everyone was smiling. Compared to the winter-scarred faces in Buffalo, it felt like stepping onto another planet. The Carolinas had already shoved their kids back into classrooms, so we owned the pools and ruled the lazy river. The kids tore through the slides. I soaked in the jacuzzi like a tired warrior. At night we binge-watched Quantum Leap. We would have watched during the day too, but my wife dragged us out for beach walks and waterpark raids. Sometimes we need the push.

By the third day, the spell had taken hold. I started to relax. Sleep came easy, deeper than at home. The wave sounds and the sea view worked their magic. Staying right on the beach was great. No hauling gear, no logistics. Step out the door, fall into the ocean, and crawl back to the flat when you're cooked. The flat was clean, spacious, and blessed with a kitchen, so we could gorge on comfort food without shame.

We had wondered if we made a mistake by booking a resort 4-5 miles north of the Boardwalk. When we visited the Boardwalk on the third day, we realized it was overrated anyway. It was full of tourist-trap shops and neon lights, and featured a SkyWheel, which we didn't bother to try. We didn't need the Boardwalk. Myrtle Beach itself is the show: the waves, the horizon, and the beach.

Of course, I had to ruin myself. The kids used sunscreen like sensible citizens, and I, an idiot heir to Turkish tanning lore, slathered on olive oil (which I swiped from our kitchen). If it fries an egg, it'll bronze a body, right? Well... I roasted into a lobster, alright... But I ended up slowly shedding skin like a reptile for days afterwards.

The drive back was clean. Salem's Grill in Pittsburgh was our mandatory detour. You go to great food, if great food doesn't get to you. We hit it before 7pm and dined like kings until closing at 8pm. We were back home before midnight. Uneventful driving, the way I like it.

But vacations are lies, sweet lies. Within days the teeth grinding returned. The adult machinery reloaded with forms to sign, kids to shuttle, bills to pay. Adulting feels like having to deal with a constant deluge of junk mail and random chores from the universe.

And then, the saddest part... we will be shipping Ahmet off to college. He leaves for Caltech soon (must be his mother's genes). I am proud, of course, but I will miss him a lot. I bought the plane ticket yesterday after weeks of pretending I didn't have to. Kids grow fast. Too fast... It isn't fair.

Building a DOOM-like multiplayer shooter in pure SQL

DOOMQL: A DOOM-like multiplayer shooter in pure SQL

I recently stumbled across Patrick’s excellent DOOM clone running in a browser, powered by DuckDB-WASM. Ever since I read that, I have wanted to push his awesome idea to its logical extreme: build a multiplayer DOOM-like shooter entirely in SQL, with CedarDB doing all the heavy lifting. During a month of parental leave (i.e., a lot of sleepless nights), I tried exactly that.
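The post doesn't show its schema here, but as a rough, hypothetical sketch of the "game state lives in tables, game logic lives in queries" idea (my illustration, not the actual DOOMQL code), it might look something like this:

-- Hypothetical sketch, not DOOMQL's actual schema.
CREATE TABLE players (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    x      DOUBLE PRECISION NOT NULL,   -- position on the map
    y      DOUBLE PRECISION NOT NULL,
    angle  DOUBLE PRECISION NOT NULL,   -- view direction in radians
    health INTEGER NOT NULL DEFAULT 100
);

CREATE TABLE shots (
    shooter_id INTEGER REFERENCES players(id),
    fired_at   TIMESTAMP NOT NULL,
    angle      DOUBLE PRECISION NOT NULL
);

-- One "tick": move a player forward along its view direction.
UPDATE players
SET x = x + 0.1 * cos(angle),
    y = y + 0.1 * sin(angle)
WHERE id = 1;

-- A crude hit test: opponents within 1 unit of a shot's ray, measured as the
-- perpendicular distance from the line through the shooter.
SELECT p.id AS hit_player
FROM shots s
JOIN players shooter ON shooter.id = s.shooter_id
JOIN players p       ON p.id <> s.shooter_id
WHERE abs((p.x - shooter.x) * sin(s.angle)
        - (p.y - shooter.y) * cos(s.angle)) < 1.0;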

Here’s a sneak peek at DOOMQL:

September 06, 2025

September 05, 2025

Automating vector embedding generation in Amazon Aurora PostgreSQL with Amazon Bedrock

In this post, we explore several approaches for automating the generation of vector embeddings in Amazon Aurora PostgreSQL-Compatible Edition when data is inserted or modified in the database. Each approach offers different trade-offs in terms of complexity, latency, reliability, and scalability, allowing you to choose the best fit for your specific application needs.
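For flavor, here is a minimal sketch of the simplest such approach, a synchronous trigger. It assumes pgvector is installed and a helper function generate_embedding(text) that wraps the Bedrock model invocation; the helper name is a placeholder for illustration, not an actual extension API. The trade-off is that every write now pays the embedding-generation latency, which is exactly the kind of consideration the post weighs.

-- Sketch only: generate_embedding() is a hypothetical wrapper around the
-- Bedrock call; the post discusses the real integration options.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        BIGSERIAL PRIMARY KEY,
    body      TEXT NOT NULL,
    embedding VECTOR(1536)              -- sized for the chosen embedding model
);

CREATE OR REPLACE FUNCTION documents_embed() RETURNS trigger AS $$
BEGIN
    -- Recompute the embedding whenever the text is inserted or changed.
    NEW.embedding := generate_embedding(NEW.body);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER documents_embed_trg
    BEFORE INSERT OR UPDATE OF body ON documents
    FOR EACH ROW EXECUTE FUNCTION documents_embed();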

Group database tables under AWS Database Migration Service tasks for PostgreSQL source engine

AWS DMS accommodates a broad range of source and target data repositories, such as relational databases, data warehouses, and NoSQL databases. Proper preparation and design are vital for a successful migration, especially when it comes to optimizing performance and addressing potential delays. In this blog post, we offer guidance on recognizing potential root causes of full load and CDC delays early in the process, and provide suggestions for optimally grouping tables to achieve the best performance for an AWS DMS task.

September 03, 2025

Recent Reads (September 25)

Small Gods (1992)

I absolutely loved Small Gods. Pratchett takes on religion, faith, and power. The story follows Om, a god trapped in the body of a tortoise, who has only one true believer left: a novice named Brutha. The central premise is that gods and mythical beings exist because people believe in them, and their power fades as belief fades.

    There’s no point in believing in things that exist.

The book is funny, clever, and surprisingly philosophical. Pratchett skewers organized religion, but he also asks bigger questions: What is faith? What is belief? How do institutions shape people, and how do people shape institutions? It's satire, but not heavy-handed. Like Vonnegut, he writes with a wink, yet there's real depth under the jokes.

    Gods don't like people not doing much work. People who aren't busy all the time might start to think.

    Why not? If enough people believe, you can be god of anything….

    The figures looked more or less human. And they were engaged in religion. You could tell by the knives (it's not murder if you do it for a god).

    The trouble with being a god is that you've got no one to pray to.

I came across Small Gods after reading Gaiman's American Gods (2001), which credits it as inspiration. Both explore gods and belief, but Pratchett's is lighter, sharper, and full of characters you actually care about. The audiobook is narrated by Andy Serkis, and he's brilliant. My precious!

Some more quotes from the book:

    "But is all this true?" said Brutha. / Didactylos shrugged. "Could be. Could be. We are here and it is now. The way I see it is, after that, everything tends towards guesswork." / "You mean you don't KNOW it's true?" said Brutha. / "I THINK it might be," said Didactylos. "I could be wrong. Not being certain is what being a philosopher is all about."

    What have I always believed? That on the whole, and by and large, if a man lived properly, not according to what any priests said, but according to what seemed decent and honest inside, then it would, at the end, more or less, turn out all right.

    "Sometimes the crime follows the punishment, which only serves to prove the foresight of the Great God." / "That's what my grandmother used to say," said Brutha automatically. / "Indeed? I would like to know more about this formidable lady." / "She used to give me a thrashing every morning because I would certainly do something to deserve it during the day," said Brutha. / "A most complete understanding of the nature of mankind."

    Probably the last man who knew how it worked had been tortured to death years before. Or as soon as it was installed. Killing the creator was a traditional method of patent protection.

    Last night there seemed to be a chance. Anything was possible last night. That was the trouble with last nights. They were always followed by this mornings.

    The Turtle Moves!


I Shall Wear Midnight (2010)

I had read The Shepherd's Crown earlier, and it made me a witch lover. I loved this book too. Pratchett's witches are sharp, strong, and unforgettable. Tiffany Aching, the protagonist, is brave, clever, and endlessly practical: “I make it my business. I'm a witch. It's what we do. When it's nobody else's business, it's my business.”

Pratchett's prose is excellent and witty. He makes you laugh and think at the same time. Lines like “You've taken the first step.” / “There's a second step?” / “No; there's another first step. Every step is a first step if it's a step in the right direction” stayed with me. There’s so much care in how he builds characters, their choices, and their world. The book is about people, not the fantasy world, and that is Pratchett's genius. The witches, the townsfolk, even the Nac Mac Feegles ... they all feel alive.

I also loved how the story quietly mirrors what researchers like us do: “We look to the edges. There’s a lot of edges, more than people know. Between life and death, this world and the next, night and day, right and wrong … an’ they need watchin’. We watch ’em, we guard the sum of things.” That is the witches' charter, but also a motto for formal methods researchers, or anyone keeping an eye on the boundaries of a complex system.

Another great line: “Well, as a lawyer I can tell you that something that looks very simple indeed can be incredibly complicated, especially if I'm being paid by the hour. The sun is simple. A sword is simple. A storm is simple. Behind everything simple is a huge tail of complicated.”


Quantum Leap (TV series 2022)

As a kid growing up in Turkey, I watched the original Quantum Leap, and it felt magical. We were filled with wildly incorrect optimism about science and technology, but it felt inspiring. Some lies are useful. The reboot was fun to watch with my kids too. Raymond Lee as Ben Song is a strong lead: he's the physicist who ends up stuck leaping into the bodies of other people. Caitlin Bassett guides him as his fiancée and observer.

The show is well-paced and fun for family viewing. Many episodes lean into socially conscious themes, which I appreciated, but at times it becomes unintentionally offensive: portraying “dumb Americans” with exaggerated Southern accents or mocking Salem settlers in ways that feel worse than silly. The writers clearly aimed for humor or commentary, but the execution backfired. I won't spoil the full story, but the second season tries a pivot/twist, only to throw it out again. What were they thinking? Moments like this make me want to be a TV writer.

September 02, 2025

Postgres 18 beta3, large server, sysbench

This post has performance results for Postgres 18 beta3, beta2, beta1, 17.5 and 17.4 using the sysbench benchmark and a large server. The working set is cached and the benchmark is run with high concurrency (40 connections). The goal is to search for CPU and mutex regressions. This work was done by Small Datum LLC and was not sponsored.

tl;dr

  • There might be small regressions (~2%) for several range queries that don't do aggregation. This is similar to what I reported for 18 beta3 on a small server, but here it only occurs for 3 of the 4 microbenchmarks and on the small server it occurs on all 4. I am still uncertain about whether this really is a regression.
Builds, configuration and hardware

I compiled Postgres versions 17.4, 17.5, 18 beta1, 18 beta2 and 18 beta3 from source.

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4. More details on it are here.

The config file for Postgres 17.4 and 17.5 is x10a_c32r128.

The config files for Postgres 18 are:
  • x10b_c32r128 is functionally the same as x10a_c32r128 but adds io_method=sync
  • x10d_c32r128 starts with x10a_c32r128 and adds io_method=io_uring

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres.

The tests are run using 8 tables with 10M rows per table. The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for Postgres 17.4)
When the relative QPS is > 1 then some version is faster than PG 17.4. When it is < 1 then there might be a regression. Values from iostat and vmstat divided by QPS are also provided here. These can help to explain why something is faster or slower because they show how much HW is used per request.
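For example (made-up numbers), if the base version 17.4 sustains 1000 QPS on a microbenchmark and 18 beta3 sustains 980 QPS on the same microbenchmark, the relative QPS in the 18 beta3 column is 980 / 1000 = 0.98, which would show up as a possible ~2% regression.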

Relative to: pg174_o2nofp.x10a_c32r128
col-1 : pg175_o2nofp.x10a_c32r128
col-2 : pg18beta1_o2nofp.x10b_c32r128
col-3 : pg18beta1_o2nofp.x10d_c32r128
col-4 : pg18beta2_o2nofp.x10d_c32r128
col-5 : pg18beta3_o2nofp.x10d_c32r128

col-1   col-2   col-3   col-4   col-5
0.98    0.99    0.99    1.00    0.99    hot-points_range=100
1.01    1.01    1.00    1.01    1.01    point-query_range=100
1.00    1.00    0.99    1.00    1.00    points-covered-pk
1.00    1.01    1.00    1.02    1.00    points-covered-si
1.00    1.01    1.00    1.00    1.00    points-notcovered-pk
1.00    1.00    1.01    1.02    1.00    points-notcovered-si
1.00    1.00    1.00    1.00    1.00    random-points_range=1000
1.00    1.01    1.00    1.00    1.00    random-points_range=100
1.00    1.00    1.00    1.00    1.00    random-points_range=10
1.00    0.97    0.96    0.98    0.97    range-covered-pk
1.00    0.97    0.97    0.98    0.97    range-covered-si
0.99    0.99    0.99    0.99    0.98    range-notcovered-pk
1.00    1.01    1.01    1.00    1.01    range-notcovered-si
1.00    1.02    1.03    1.03    1.02    read-only-count
1.00    1.00    1.00    1.01    1.01    read-only-distinct
1.00    1.00    1.00    1.00    1.00    read-only-order
1.01    1.01    1.02    1.02    1.01    read-only_range=10000
1.00    0.99    0.99    0.99    1.00    read-only_range=100
1.01    0.99    0.99    1.00    0.99    read-only_range=10
1.00    1.01    1.01    1.01    1.01    read-only-simple
1.00    1.02    1.03    1.03    1.02    read-only-sum
1.00    1.13    1.14    1.02    0.91    scan_range=100
1.00    1.13    1.13    1.02    0.90    scan.warm_range=100
1.00    0.99    0.99    0.99    0.99    delete_range=100
0.99    1.00    1.02    0.99    1.00    insert_range=100
1.01    1.00    1.00    1.00    0.99    read-write_range=100
1.00    0.98    1.00    1.01    0.99    read-write_range=10
0.99    0.99    1.02    0.98    0.96    update-index
1.00    1.01    1.00    1.00    1.01    update-inlist
0.98    0.98    0.99    0.98    0.97    update-nonindex
0.95    0.95    0.94    0.93    0.95    update-one_range=100
0.97    0.98    0.98    0.97    0.95    update-zipf_range=100
0.98    0.99    0.99    0.98    0.98    write-only_range=10000

Asymmetric Linearizable Local Reads

People want data fast. They also want it consistent. Those two wants pull in opposite directions. This VLDB'25 paper takes another crack at this conundrum. Rather than assuming a symmetric network environment where all replicas face similar latencies, the paper emphasizes that in practice some replicas are close to the leader, while others are stranded halfway across the globe. By embracing this asymmetry, the authors propose two new algorithms: Pairwise-Leader (PL) and Pairwise-All (PA). Both cut read latency compared to prior approaches. PL could even achieve 50x latency improvements in some cases.

Aleksey and I did our usual thing. We recorded our first blind read of the paper. You can watch it here (link to come soon), if you like seeing two people puzzle through a paper in real time. I also annotated a copy while reading which you can access here.

We liked the ideas, even though the protocols themselves didn't thrill us particularly. I particularly liked finding another good example of the use of synchronized time in distributed database systems. Another study to add to my survey.


Background

At the heart of the problem is linearizability (see my explanation for a primer), a strong consistency condition that ensures operations on a distributed object appear to occur atomically in a single total order consistent with real time. If one operation finishes before another begins, then all replicas must reflect this order. This strong model spares developers from reasoning about concurrency anomalies: if you read after a write, you are guaranteed to observe it. That means there are no stale reads.

While linearizability makes life simple for the developers, it makes it harder for the system, which has to make sure every replica behaves like one machine. This is typically enforced by state machine replication (SMR) protocols such as Paxos or Raft, which order all operations through a leader. This works fine, except for reads. Reads dominate workloads in practice, and forcing every read to consult the leader or coordinate with a quorum introduces unnecessary WAN round trips. To improve efficiency, many systems have proposed linearizable local read algorithms, which allow followers to serve reads locally under certain conditions. However, this is tricky ground. Local reads introduce the risk of staleness, because that replica/follower has not yet applied the latest writes.

And here's the rub: WANs exacerbate the problem. Some replicas are close. Others are hopelessly far. The close ones see fresh data. The far ones lag. Prior work noticed this and tried various tricks to smooth out the unfairness. This paper doesn’t smooth it out. It embraces it as we will see in the coming sections.

The key problem addressed by the paper is: How can linearizable local reads be achieved in a system where replicas are asymmetric in their ability to keep up with the updates?


The Blocking Problem

Over the years, many protocols have been proposed for enabling linearizable local reads. The paper reviews these under three broad categories.

Invalidation-based algorithms (e.g., Megastore, PQL, Hermes): Replicas mark themselves invalid when they receive prepares and regain validity only after commits. But this can lead to unbounded blocking: If the leader issues prepares faster than it gathers acknowledgments in a write-heavy workload, the replicas stay perpetually invalid. 

Eager Stamping algorithms (e.g., CHT): Reads are stamped with the latest prepare index, then block until the corresponding commits arrive. This avoids perpetual invalidation thanks to instance-based tracking of prepares, but it still results in blocking proportional to twice the leader's eccentricity.

Delayed Stamping algorithms (e.g., CockroachDB Global Tables): These use synchronized clocks to assign visibility windows. Here, blocking time depends on the clock skew bound Δ, which in the theoretical worst case is bounded by the relative network diameter. This worst-case bound does not apply if GPS clock synchronization (e.g., Google's TrueTime or AWS Timesync) is available, so the paper, in order to make its case, assumes GPS-based synchronization is not available (even though it is in fact widely available). https://muratbuffalo.blogspot.com/2024/12/utilizing-highly-synchronized-clocks-in.html

Ok, I owe you an explanation of the delayed stamping approach. But first let me set it up using the paper's unifying stop/go events framework. I like this framing: when you name something, you own it. In this model, each index i on each replica has two events:

  • A stop event at which the replica stops assigning reads to indices less than i.
  • A go event at which the replica can safely serve reads at i.

This abstraction ultimately guarantees linearizability by ensuring that, across all replicas, all go events for i occur at or after all stop events for i. The framework is clean and clarifies why blocking occurs.

Now let's walk through Fig. 2 and explain the two approaches using the stop/go framework.

Eager Stamping (Fig. 2a). A follower stamps a read with the highest index i it has seen a prepare for (the stop event). It can only apply the read after it has received all commits up to i (the go event). The leader is special here: since it is always up to date, its stop and go events collapse into one.

Delayed Stamping (Fig. 2b). This approach uses synchronized clocks to decouple stop and go events from message arrival. When the leader accepts an update, it assigns it a future visibility time t+α, where the visibility delay α is derived from the estimated commit time of the update. Followers stop stamping reads with indices less than i once that visibility time has passed on their clocks. They then apply the read after an additional Δ (the clock skew/uncertainty) has elapsed. Unlike Eager Stamping, the leader does not have special stop and go events; it also follows this visibility rule. I talked about this earlier when explaining CockroachDB's global transactions and the future timestamping in Aurora Limitless and DSQL.

Table 2 summarizes the worst-case blocking time at followers for the three categories of algorithms mentioned above and for the two new algorithms introduced in this work (which we explain next).

I would be remiss (as is the paper) if I did not emphasize a common flaw shared across all these algorithms: the leader requires acknowledgments from all nodes (rather than just a quorum) before it can commit a write! If you want a local linearizable read from a single node, the write protocol is forced into a write-all model to ensure that the one-node read quorum intersects with the write quorum. This design hurts both availability and tail latency for the algorithms in Table 2. In contrast, in our Paxos Quorum Reads (PQR, 2019) work we avoided the write-all model: PQR used only LSN tracking and quorum acknowledgments, and needed no clocks. I discuss PQR at the end of this post.


Pairwise-Leader (PL)

The central idea of PL is that blocking time should depend on each replica's distance from the leader: nearby replicas get near-zero latency, while distant ones may fare worse than before. Figure 3 provides the stepping stone for explaining the PL algorithm in Figure 4.

The stepping-stone algorithm's trick is to deliberately time/delay the prepare messages so that acknowledgments from all replicas reach the leader at the same time, which reduces blocking time for followers closer to the leader. Specifically, this ensures that for any replica, the gap between its prepare and commit messages is just its round-trip distance to the leader. That alone already improves over older algorithms that tied blocking to the full network diameter.

PL then builds on this stepping-stone by further decoupling stop and go events, borrowing the spirit of Delayed Stamping but applying it pairwise. Instead of relying on synchronized clocks, PL introduces a new event scheduling primitive that ensures a replica's stop event happens just before a visibility time, and its go event just after. Figure 4 illustrates this: each replica's worst-case blocking time becomes exactly twice the relative message delay between the leader and itself (see Table 1 for notation). In other words, nearby replicas get fast, almost instant reads, while the cost for distant ones reflects only their distance from the leader.

To make this work, PL's new pairwise event scheduling/synchronization primitive operates as follows: instead of requiring global clock synchronization, the leader coordinates stop and go events directly with each follower. This scheduling ensures that stop/go events happen at predictable real-time offsets relative to the leader's visibility time, while exploiting known lower bounds on delays to followers to maintain correctness. Unfortunately, the drawback is that the delays to followers need to be reliable/predictable for correctness to hold. I discuss this problem at the end of the post.


Pairwise-All (PA)

PL optimizes aggressively for leader-adjacent replicas, but it penalizes distant ones. PA extends the pairwise trick to all replicas using all-to-all communication as shown in Figure 6.

PA's stepping-stone in Figure 6 works like PL's but shifts the synchronization target: instead of aligning acknowledgments at the leader, it delays prepare messages so that they all arrive at every replica at the same time. Each process also sends acknowledgments to all others, not just the leader. The effect is that every replica can commit once its own eccentricity time has passed since receiving a prepare. As a result, the worst-case read blocking time for each replica is its relative eccentricity.

To further reduce blocking, PA applies the same decoupling (delayed stamping) idea as PL but with all replicas aligned at the visibility time (as shown in Figure 7). The leader schedules stop events near this visibility point, and each process waits for stopped events from others before issuing its own go. Since a go event is the maximum of all stopped events, correctness holds regardless of scheduling accuracy. This ensures that every replica's worst-case blocking time is bounded by its eccentricity rather than its leader-distance. In practice, that means nearby replicas don't get PL's extreme gains, but distant ones aren't punished.

That's it, that's the story. Asymmetry is real. PL exploits it ruthlessly. PA makes it fair. If you're close to the leader, PL is your friend. If you're far, PA keeps you from suffering.


Discussion

Both PL and PA assume stable latencies and non-faulty processes. The paper sketches how to tolerate failures, but variance and reconfiguration remain open issues. The funny thing is that, to justify predictable network latencies, the authors cite Aleksey's paper: "Cloudy Forecast: How Predictable is Communication Latency in the Cloud?" Ironically, that paper shows the opposite: the variance can be massive, up to 10x the median in a WAN setup and 3000x in a same-AZ setup! So either they didn't read it carefully, or they cited it for sport. Cloud networks aren't that tame and predictable for tenants.

The treatment of clock skew Δ is also odd. The paper insists Δ must be proportional to the network diameter, but that's a theoretical result, and I don't know how much it applies even to NTP-based synchronization. Moreover, in practice, GPS clocks exist, and AWS Timesync provides 50-microsecond clock uncertainty. Why not use these? The paper explicitly disallows GPS clocks to make the results from PL and PA look more favorable. A comparison against synchronized clocks would have been valuable. With clocks, blocking could be under a millisecond (as we designed in AWS DSQL) with just delayed timestamping, and that would not only be a lot simpler but also beat anything from PL and PA significantly.

Our PQR work (2019) also tackled linearizable non-leader reads. You can also frame it as local reads, though PQR used multiple nodes. The key idea in PQR is to involve the client: The client contacts a quorum of nodes, usually gets a linearizable read in one shot, and in the rare case of an ongoing update, waits briefly and completes with a callback. PQR required no synchronized clocks and worked in a fully asynchronous model using only LSNs. It fits naturally in this space.

Postgres 18 beta3, small server, sysbench

This post has performance results for Postgres 18 beta3, beta2, beta1 and 17.6 using the sysbench benchmark and a small server. The working set is cached and the benchmark is run with low concurrency (1 connection). The goal is to search for CPU regressions. This work was done by Small Datum LLC and was not sponsored.

tl;dr

  • There might be small regressions (~2%) for several range queries that don't do aggregation. This is similar to what I reported for 18 beta1.
  • Vacuum continues to be a problem for me and I had to repeat the benchmark a few times to get a stable result. It appears to be a big source of non-deterministic behavior leading to false alarms for CPU regressions in read-heavy tests that run after vacuum. In some ways, RocksDB compaction causes similar problems. Fortunately, InnoDB MVCC GC (purge) does not cause such problems.
Builds, configuration and hardware

I compiled Postgres versions 17.6, 18 beta1, 18 beta2 and 18 beta3 from source.

The server is a Beelink SER7 with a Ryzen 7 7840HS CPU, 32G of RAM, 8 cores with AMD SMT disabled, Ubuntu 24.04 and an NVMe device with discard enabled and ext4 for the database.

The config file for Postgres 17.6 is x10a_c8r32.

The config files for Postgres 18 are:
  • x10b_c8r32 is functionally the same as x10a_c8r32 but adds io_method=sync
  • x10b1_c8r32 starts with x10b_c8r32 and adds vacuum_max_eager_freeze_failure_rate=0
  • x10b2_c8r32 starts with x10b_c8r32 and adds vacuum_max_eager_freeze_failure_rate=0.99

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres.

The tests are run using 1 table with 50M rows. The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for Postgres 17.6)
When the relative QPS is > 1 then some version is faster than PG 17.6.  When it is < 1 then there might be a regression. Values from iostat and vmstat divided by QPS are also provided here. These can help to explain why something is faster or slower because it shows how much HW is used per request.

The numbers below for range queries that don't do aggregation might be from a small regression. But note that this does not reproduce for the full table scan microbenchmark (scan). I am not certain it is a regression, as this might be from non-deterministic CPU overheads for read-heavy workloads that are run after vacuum. I hope to look at CPU flamegraphs soon.
  • the mapping from microbenchmark name to Lua script is here
  • the range query without aggregation microbenchmarks use oltp_range_covered.lua with various flags set and the SQL statements it uses are here. All of these return 100 rows.
  • the scan microbenchmark uses oltp_scan.lua which is a SELECT with a WHERE clause that filters all rows (empty result set)
Relative to: x.pg176_o2nofp.x10a_c8r32.pk1
col-1 : x.pg18beta1_o2nofp.x10b_c8r32.pk1
col-2 : x.pg18beta2_o2nofp.x10b_c8r32.pk1
col-3 : x.pg18beta3_o2nofp.x10b_c8r32.pk1
col-4 : x.pg18beta3_o2nofp.x10b1_c8r32.pk1
col-5 : x.pg18beta3_o2nofp.x10b2_c8r32.pk1

col-1   col-2   col-3   col-4   col-5 -> point queries
1.00    1.00    0.98    0.99    0.99    hot-points_range=100
1.00    1.01    1.00    1.00    0.99    point-query_range=100
1.00    1.02    1.01    1.01    1.01    points-covered-pk
1.00    1.00    1.00    1.00    1.00    points-covered-si
1.01    1.01    1.00    1.00    1.00    points-notcovered-pk
1.01    1.00    1.00    1.00    1.00    points-notcovered-si
0.99    1.00    0.99    1.00    1.00    random-points_range=1000
1.01    1.00    1.00    1.00    1.00    random-points_range=100
1.01    1.01    1.00    1.00    0.99    random-points_range=10

col-1   col-2   col-3   col-4   col-5 -> range queries w/o agg
0.98    0.99    0.97    0.98    0.96    range-covered-pk_range=100
0.98    0.99    0.96    0.98    0.97    range-covered-si_range=100
0.98    0.98    0.98    0.97    0.98    range-notcovered-pk
0.99    0.99    0.98    0.98    0.98    range-notcovered-si
1.01    1.02    1.00    1.00    1.00    scan

col-1   col-2   col-3   col-4   col-5 -> range queries with agg
1.02    1.01    1.02    1.01    0.98    read-only-count_range=1000
0.98    1.01    1.01    1.00    1.03    read-only-distinct
0.99    0.99    0.99    0.99    0.99    read-only-order_range=1000
1.00    1.00    1.01    1.00    1.01    read-only_range=10000
0.99    0.99    0.99    0.99    0.99    read-only_range=100
0.99    0.99    0.99    0.98    0.99    read-only_range=10
1.01    1.00    1.00    1.00    1.01    read-only-simple
1.01    1.00    1.01    1.00    1.00    read-only-sum_range=1000

col-1   col-2   col-3   col-4   col-5 -> writes
0.99    1.00    0.98    0.98    0.98    delete_range=100
0.99    0.98    0.98    1.00    0.98    insert_range=100
0.99    0.99    0.99    0.98    0.99    read-write_range=100
0.98    0.99    0.99    0.98    0.99    read-write_range=10
1.00    0.99    0.98    0.97    0.99    update-index_range=100
1.01    1.00    0.99    1.01    1.00    update-inlist_range=100
1.00    1.00    0.99    0.96    0.99    update-nonindex_range=100
1.01    1.01    0.99    0.97    0.99    update-one_range=100
1.00    1.00    0.99    0.98    0.99    update-zipf_range=100
1.00    0.99    0.98    0.98    1.00    write-only_range=10000

September 01, 2025

DocumentDB: Comparing Emulation Internals with MongoDB

MongoDB is the leading database for document data modeling, with its Atlas service available on AWS, Azure, and Google Cloud. Its popularity has led to the development of compatible APIs by other vendors, like Amazon DocumentDB (with MongoDB compatibility), highlighting MongoDB's importance in modern applications. Microsoft did the same for CosmosDB and developed a MongoDB emulation on PostgreSQL called DocumentDB, now part of the Linux Foundation.
AWS has joined the project. While today Amazon DocumentDB uses its own Aurora‑based proprietary engine, AWS’s participation opens the possibility that, in the future, the managed service could leverage this PostgreSQL‑based extension.

An emulation cannot truly replace MongoDB, which was designed to natively store, index, and process documents with flexible schemas rather than fixed-size blocks and relational tables, but it may help in a transition. This article tests a simple query across three options: native MongoDB, PostgreSQL with the DocumentDB extension, and Oracle Database's emulation - another emulation on top of another RDBMS. They encounter similar challenges: implementing document semantics on top of a row-based engine. The aim is to demonstrate an evaluation method, including execution plans, to assess the pros and cons of each platform in relation to relevant application patterns, rather than specific use cases.

MongoDB

I create a simple collection with one field "nnn", indexed, and insert random values between 0 and 100:

db.franck.drop();

db.franck.createIndex({nnn:1});

void db.franck.insertMany(
 Array.from({ length: 100000 }, () => (
  { nnn: (Math.random() * 100)}
 ))
);

I query values between 20 and 80, displaying the first five for pagination:

db.franck.find(
  { nnn: { $gte: 20, $lt: 80 } }
).sort({ nnn: 1 }).limit(5)

[
  { _id: ObjectId('68b37f883c2e2550c0d51c0c'), nnn: 20.00087217392812 },
  { _id: ObjectId('68b37f883c2e2550c0d5dd3c'), nnn: 20.000927538131542 },
  { _id: ObjectId('68b37f883c2e2550c0d5f1e7'), nnn: 20.000979995906974 },
  { _id: ObjectId('68b37f883c2e2550c0d59dc4'), nnn: 20.001754428025208 },
  { _id: ObjectId('68b37f883c2e2550c0d66c4f'), nnn: 20.002357317589414 }
]

Here is the execution plan with execution statistics:

db.franck.find(
  { nnn: { $gte: 20, $lt: 80 } }
).sort({ nnn: 1 }).limit(5).explain("executionStats").executionStats

{
  executionSuccess: true,
  nReturned: 5,
  executionTimeMillis: 0,
  totalKeysExamined: 5,
  totalDocsExamined: 5,
  executionStages: {
    isCached: false,
    stage: 'LIMIT',
    nReturned: 5,
    executionTimeMillisEstimate: 0,
    works: 6,
    advanced: 5,
    needTime: 0,
    needYield: 0,
    saveState: 0,
    restoreState: 0,
    isEOF: 1,
    limitAmount: 5,
    inputStage: {
      stage: 'FETCH',
      nReturned: 5,
      executionTimeMillisEstimate: 0,
      works: 5,
      advanced: 5,
      needTime: 0,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 0,
      docsExamined: 5,
      alreadyHasObj: 0,
      inputStage: {
        stage: 'IXSCAN',
        nReturned: 5,
        executionTimeMillisEstimate: 0,
        works: 5,
        advanced: 5,
        needTime: 0,
        needYield: 0,
        saveState: 0,
        restoreState: 0,
        isEOF: 0,
        keyPattern: { nnn: 1 },
        indexName: 'nnn_1',
        isMultiKey: false,
        multiKeyPaths: { nnn: [] },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: { nnn: [ '[20, 80)' ] },
        keysExamined: 5,
        seeks: 1,
        dupsTested: 0,
        dupsDropped: 0
      }
    }
  }
}

MongoDB scanned the index (stage: 'IXSCAN') on "nnn" (keyPattern: { nnn: 1 }) for values between 20 and 80 (indexBounds: { nnn: [ '[20, 80)' ] }). It examined 5 index entries (keysExamined: 5) and fetched the corresponding documents, resulting in 5 documents read (docsExamined: 5). It stopped (LIMIT) after returning the documents for the result (nReturned: 5).

We achieved exactly what we needed without any unnecessary work, so no further tuning is required. We could go further, like with a covering index to avoid the FETCH stage, but it's not needed as the number of documents fetched is low and bounded.

PostgreSQL with DocumentDB

To gain a comprehensive understanding of the emulation, I examine both the execution plan from the emulation and the execution plan in the underlying database.

I start a container for my lab using the DocumentDB image from the Microsoft repo, which will later move to the Linux Foundation, and I use the default ports.

docker run -d -p 10260:10260 -p 9712:9712 --name pgddb \
ghcr.io/microsoft/documentdb/documentdb-local:latest  \
--username ddb --password ddb

I add auto-explain to show all execution plans for my lab:

## add auto-explain extension to be loaded
docker exec -it pgddb sed -e '/shared_preload_libraries/s/,/, auto_explain,/' -i /home/documentdb/postgresql/data/postgresql.conf

## bounce the instance
docker restart -t 5 pgddb

# set auto-explain by default for the emulation gateway
psql -e 'postgres://documentdb@localhost:9712/postgres' <<'SQL'
\dconfig shared_preload_libraries
alter user ddb set auto_explain.log_analyze=on;
alter user ddb set auto_explain.log_buffers=on;
alter user ddb set auto_explain.log_format=text;
alter user ddb set auto_explain.log_min_duration=0;
alter user ddb set auto_explain.log_nested_statements=on;
alter user ddb set auto_explain.log_settings=on;
alter user ddb set auto_explain.log_timing=on;
alter user ddb set auto_explain.log_triggers=on;
alter user ddb set auto_explain.log_verbose=on;
alter user ddb set auto_explain.log_wal=on;
SQL

# tail the PostgreSQL log in the background to see the execution plan
docker exec -it pgddb tail -f /home/documentdb/postgresql/data/pglog.log | grep -v " LOG: cron job" &

# connect to the MongoDB emulation gateway
mongosh 'mongodb://ddb:ddb@localhost:10260/?tls=true&tlsAllowInvalidCertificates=true'

I create the same collection and index as in my MongoDB test, run the same query, and check the execution plan:

db.franck.find(
  { nnn: { $gte: 20, $lt: 80 } }
).sort({ nnn: 1 }).limit(5).explain("executionStats").executionStats

{
  nReturned: Long('5'),
  executionTimeMillis: Long('154'),
  totalDocsExamined: Long('5'),
  totalKeysExamined: Long('5'),
  executionStages: {
    stage: 'LIMIT',
    nReturned: Long('5'),
    executionTimeMillis: Long('154'),
    totalKeysExamined: Long('5'),
    totalDocsExamined: 5,
    numBlocksFromCache: 4415,
    numBlocksFromDisk: 0,
    inputStage: {
      stage: 'SORT',
      nReturned: Long('5'),
      executionTimeMillis: Long('154'),
      totalKeysExamined: Long('5'),
      totalDocsExamined: 5,
      sortMethod: 'top-N heapsort',
      totalDataSizeSortedBytesEstimate: 26,
      numBlocksFromCache: 4415,
      numBlocksFromDisk: 0,
      inputStage: {
        stage: 'FETCH',
        nReturned: Long('59935'),
        executionTimeMillis: Long('133'),
        totalKeysExamined: Long('59935'),
        indexName: 'nnn_1',
        totalDocsRemovedByIndexRechecks: 0,
        numBlocksFromCache: 4415,
        numBlocksFromDisk: 0,
        inputStage: {
          stage: 'IXSCAN',
          nReturned: Long('59935'),
          executionTimeMillis: Long('133'),
          totalKeysExamined: Long('59935'),
          indexName: 'nnn_1',
          totalDocsRemovedByIndexRechecks: 0,
          numBlocksFromCache: 4415,
          numBlocksFromDisk: 0
        }
      }
    }
  }
}

DocumentDB scanned the index (stage: 'IXSCAN'), identifying the index name but lacking details on key patterns or index bounds. It appears to have read the correct range (from 20 to 80) but did not apply pagination efficiently, as evidenced by the high volume of index entries read (totalKeysExamined: Long('59935'), nReturned: Long('59935')). All documents were retrieved (stage: 'FETCH', nReturned: Long('59935')) and sorted for pagination (stage: 'SORT', sortMethod: 'top-N heapsort'). Ultimately, this process returned the final result of 5 documents (stage: 'LIMIT', nReturned: Long('5')), discarding the thousands of documents read.

While the query and result are similar to MongoDB, the execution differs significantly. MongoDB avoids reading all documents and sorting them because its index not only helps find a range but also returns results in order.

To grasp the underlying reasons for this difference, we need more than just the execution plan of the emulation. I installed auto-explain in my lab to analyze the execution plan in PostgreSQL:

2025-08-31 17:20:47.765 UTC [416] LOG:  duration: 160.621 ms  plan:
        Query Text: EXPLAIN (FORMAT JSON, ANALYZE True, VERBOSE True, BUFFERS True, TIMING True) SELECT document FROM documentdb_api_catalog.bson_aggregation_find($1, $2)
        Query Parameters: $1 = 'test', $2 = '\x5f0000000266696e6400070000006672616e636b000366696c7465720022000000036e6e6e00180000001024677465001400000010246c740050000000000003736f7274000e000000106e6e6e000100000000106c696d6974000500000000'
        Limit  (cost=517.87..517.88 rows=5 width=68) (actual time=160.617..160.618 rows=5 loops=1)
          Output: document, (bson_orderby(document, 'BSONHEX0e000000106e6e6e000100000000'::documentdb_core.bson))
          Buffers: shared hit=4447, temp read=1860 written=2002
          ->  Sort  (cost=517.87..580.37 rows=25000 width=68) (actual time=160.615..160.616 rows=5 loops=1)
                Output: document, (bson_orderby(document, 'BSONHEX0e000000106e6e6e000100000000'::documentdb_core.bson))
                Sort Key: (bson_orderby(collection.document, 'BSONHEX0e000000106e6e6e000100000000'::documentdb_core.bson)) NULLS FIRST
                Sort Method: top-N heapsort  Memory: 26kB
                Buffers: shared hit=4447, temp read=1860 written=2002
                ->  Index Scan using nnn_1 on documentdb_data.documents_2 collection  (cost=0.00..102.62 rows=25000 width=68) (actual time=98.698..138.723 rows=59973 loops=1)
                      Output: document, bson_orderby(document, 'BSONHEX0e000000106e6e6e000100000000'::documentdb_core.bson)
                      Index Cond: (collection.document @<> 'BSONHEX3f000000036e6e6e0035000000106d696e0014000000106d61780050000000086d696e496e636c75736976650001086d6178496e636c757369766500000000'::documentdb_core.bson)
                      Buffers: shared hit=4439, temp read=1860 written=2002
        Settings: search_path = 'documentdb_api_catalog, documentdb_api, public'

The steps are similar but more detailed at the PostgreSQL level. The index access uses the @<> operator, which checks whether a BSON value is within a specified range via a RUM index (DocumentDB uses an extended version of the RUM index that provides more metadata than a GIN index). In this case, the index was applied solely for filtering, while a separate Sort step handled the final ordering. This method requires reading all matching documents before they can be ordered. Although the example is simple, indexed fields may contain arrays, which means a forward scan must return entries in order based on the smallest value in the array. This behavior is native to MongoDB but not to PostgreSQL, and it likely explains why entries can't be retrieved in the desired order in the current version of DocumentDB.

Note that you might see a Bitmap Scan before auto-vacuum runs, but it's important to recognize that an Index Scan is also possible, which is a key distinction from GIN indexes.

The definition of the table and index is visible from PostgreSQL:

postgres=# \d documentdb_data.documents_2

                     Table "documentdb_data.documents_2"
     Column      |           Type           | Collation | Nullable | Default
-----------------+--------------------------+-----------+----------+---------
 shard_key_value | bigint                   |           | not null |
 object_id       | documentdb_core.bson     |           | not null |
 document        | documentdb_core.bson     |           | not null |
 creation_time   | timestamp with time zone |           |          |
Indexes:
    "collection_pk_2" PRIMARY KEY, btree (shard_key_value, object_id)
    "documents_rum_index_3" documentdb_rum (document documentdb_api_catalog.bson_rum_single_path_ops (path=nnn, tl='2699'))
Check constraints:
    "shard_key_value_check" CHECK (shard_key_value = '2'::bigint)

DocumentDB uses Citus for sharding, and an extended version of RUM indexes, documentdb_rum, for indexing:

postgres=# select indexdef, tablename, indexname 
           from pg_indexes 
           where schemaname='documentdb_data'
;
                                                                                indexdef                                                                                 |  tablename  |       indexname
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-----------------------
 CREATE UNIQUE INDEX collection_pk_1 ON documentdb_data.documents_1 USING btree (shard_key_value, object_id)                                                             | documents_1 | collection_pk_1
 CREATE UNIQUE INDEX retry_1_pkey ON documentdb_data.retry_1 USING btree (shard_key_value, transaction_id)                                                               | retry_1     | retry_1_pkey
 CREATE INDEX retry_1_object_id_idx ON documentdb_data.retry_1 USING btree (object_id)                                                                                   | retry_1     | retry_1_object_id_idx
 CREATE UNIQUE INDEX collection_pk_2 ON documentdb_data.documents_2 USING btree (shard_key_value, object_id)                                                             | documents_2 | collection_pk_2
 CREATE UNIQUE INDEX retry_2_pkey ON documentdb_data.retry_2 USING btree (shard_key_value, transaction_id)                                                               | retry_2     | retry_2_pkey
 CREATE INDEX retry_2_object_id_idx ON documentdb_data.retry_2 USING btree (object_id)                                                                                   | retry_2     | retry_2_object_id_idx
 CREATE INDEX documents_rum_index_3 ON documentdb_data.documents_2 USING documentdb_rum (document documentdb_api_catalog.bson_rum_single_path_ops (path=nnn, tl='2699')) | documents_2 | documents_rum_index_3
(7 rows)

Here is the list of extensions installed:

postgres=# \dx
                                    List of installed extensions
      Name       | Version |   Schema   |                        Description
-----------------+---------+------------+------------------------------------------------------------
 documentdb      | 0.104-0 | public     | API surface for DocumentDB for PostgreSQL
 documentdb_core | 0.104-0 | public     | Core API surface for DocumentDB on PostgreSQL
 pg_cron         | 1.6     | pg_catalog | Job scheduler for PostgreSQL
 plpgsql         | 1.0     | pg_catalog | PL/pgSQL procedural language
 postgis         | 3.5.3   | public     | PostGIS geometry and geography spatial types and functions
 rum             | 1.3     | public     | RUM index access method
 tsm_system_rows | 1.0     | public     | TABLESAMPLE method which accepts number of rows as a limit
 vector          | 0.8.0   | public     | vector data type and ivfflat and hnsw access methods
(8 rows)

In my example, DocumentDB performed more work than MongoDB because the sort operation was not pushed down to the index scan. Achieving ordered results from a multi-key index is challenging due to multiple index entries per document. The scan must deduplicate these entries and arrange them correctly: the lowest array value for a forward scan and the greatest for a backward scan. MongoDB implemented this functionality from the get-go in its multi-key indexes. Emulations on top of SQL databases still require further development to match that performance and scalability, because RDBMS were not designed for multi-key indexes: one-to-many relationships are typically managed in separate tables according to the first normal form. This remains a TODO in the code for DocumentDB's RUM index access method.
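For contrast, here is a rough sketch (generic, with table names of my own choosing, not DocumentDB's internal schema) of how the same data would be modeled the classic relational way: one row per array element in a child table with a plain btree index. Even then, returning the first five documents by their smallest matching value requires a deduplication step, which is exactly the work MongoDB's multi-key index does inside the scan:

-- Generic illustration: first-normal-form modeling of a possibly-multivalued field.
CREATE TABLE doc (
    id BIGSERIAL PRIMARY KEY
);

CREATE TABLE doc_nnn (
    doc_id BIGINT REFERENCES doc(id),
    nnn    DOUBLE PRECISION NOT NULL
);

CREATE INDEX doc_nnn_idx ON doc_nnn (nnn);

-- The btree returns entries in nnn order, but a document appears once per
-- array element, so per-document pagination still needs an aggregation step:
SELECT doc_id, min(nnn) AS min_nnn
FROM doc_nnn
WHERE nnn >= 20 AND nnn < 80
GROUP BY doc_id
ORDER BY min_nnn
LIMIT 5;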

The order-by pushdown is not the only limitation. If the engine cannot read only the five index entries needed for the query, it should at least avoid fetching thousands of documents. In MongoDB, using a covering index replaces the FETCH stage with a PROJECTION_COVERED stage. I attempted the same in DocumentDB by omitting "_id" from the projection:

db.franck.find(
  { nnn: { $gte: 20, $lt: 80 } },
  { _id: 0, nnn: 1 }
).sort({ nnn: 1 }).limit(5).hint({nnn:1, _id:1 }).explain("executionStats").executionStats

{
  nReturned: Long('5'),
  executionTimeMillis: Long('170'),
  totalDocsExamined: Long('5'),
  totalKeysExamined: Long('5'),
  executionStages: {
    stage: 'LIMIT',
    nReturned: Long('5'),
    executionTimeMillis: Long('170'),
    totalKeysExamined: Long('5'),
    totalDocsExamined: 5,
    numBlocksFromCache: 4607,
    numBlocksFromDisk: 0,
    inputStage: {
      stage: 'PROJECT',
      nReturned: Long('5'),
      executionTimeMillis: Long('170'),
      totalKeysExamined: Long('5'),
      totalDocsExamined: 5,
      numBlocksFromCache: 4607,
      numBlocksFromDisk: 0,
      inputStage: {
        stage: 'SORT',
        nReturned: Long('5'),
        executionTimeMillis: Long('170'),
        totalKeysExamined: Long('5'),
        totalDocsExamined: 5,
        sortMethod: 'top-N heapsort',
        totalDataSizeSortedBytesEstimate: 25,
        numBlocksFromCache: 4607,

August 31, 2025