database.news | database technology news aggregator

February 22, 2026

Phil Eaton

We have pgvector at home

This is an external post of mine. Click here if you are not redirected.

February 21, 2026

Franck Pachot

Read‑your‑writes on replicas: PostgreSQL WAIT FOR LSN and MongoDB Causal Consistency

In databases designed for high availability and scalability, secondary nodes can fall behind the primary. Typically, a quorum of nodes is updated synchronously to guarantee durability while maintaining availability, while remaining standby instances are eventually consistent to handle partial failures. To balance availability with performance, synchronous replicas acknowledge a write only when it is durable and recoverable, even if it is not yet readable.

As a result, if your application writes data and then immediately queries another node, it may still see stale data.

Here’s a common anomaly: you commit an order on the primary and then try to retrieve it from a reporting system. The order is missing because the read replica has not yet applied the write.

PostgreSQL and MongoDB tackle this problem in different ways:

PostgreSQL 19 introduces a WAIT FOR LSN command, allowing applications to explicitly coordinate reads after writes.
MongoDB provides causal consistency within sessions using the afterClusterTime read concern.

Both approaches track when your write occurred and ensure subsequent reads observe at least that point. Let’s look at how each database does this.

PostgreSQL: `WAIT FOR LSN` (PG19)

PostgreSQL records every change in the Write‑Ahead Log (WAL). Each WAL record has a Log Sequence Number (LSN): a 64‑bit position, typically displayed as two hexadecimal halves such as 0/40002A0 (high/low 32 bits).

Streaming replication ships WAL records from the primary to standbys, which then:

Write WAL records to disk
Flush them to durable storage
Replay them, applying changes to data files

The write position determines what can be recovered after a database crash. The flush position defines the recovery point for a compute instance failure. The replay position determines what queries can see on a standby.

WAIT FOR LSN allows a session to block until one of these points reaches a target LSN:

standby_write → WAL written to disk on the standby (not yet flushed)
standby_flush → WAL flushed to durable storage on the standby
standby_replay (default) → WAL replayed into data files and visible to readers
primary_flush → WAL flushed on the primary (useful when synchronous_commit = off and a durability barrier is needed)

A typical flow is to write on the primary, commit, and then fetch the current WAL insert LSN:

pg19rw=*# BEGIN;

BEGIN

pg19rw=*# INSERT INTO orders VALUES (123, 'widget');

INSERT 0 1

pg19rw=*# COMMIT;

COMMIT

pg19rw=# SELECT pg_current_wal_insert_lsn();

 pg_current_wal_insert_lsn
---------------------------
 0/18724C0

(1 row)

That LSN is then used to block reads on a replica until it has caught up:


pg19ro=# WAIT FOR LSN '0/18724C0'
  WITH (MODE 'standby_replay', TIMEOUT '2s');

This LSN‑based read‑your‑writes pattern in PostgreSQL requires extra round‑trips: capturing the LSN on the primary and explicitly waiting on the standby. For many workloads, reading from the primary is simpler and faster.

The pattern becomes valuable when expensive reads must be offloaded to replicas while still preserving read‑your‑writes semantics, or in event‑driven and CQRS designs where the LSN itself serves as a change marker for downstream consumers.

MongoDB: Causal Consistency

While PostgreSQL reasons in WAL positions, MongoDB tracks causality using oplog timestamps and a hybrid logical clock.

In a replica set, each write on the primary produces an entry in local.oplog.rs, a capped collection. These entries are rewritten to be idempotent (for example, $inc becomes $set) so they can be safely reapplied. Each entry carries a Hybrid Logical Clock (HLC) timestamp that combines physical time with a logical counter, producing a monotonically increasing cluster time. Replica set members apply oplog entries in timestamp order.

Because MongoDB allows concurrent writes, temporary “oplog holes” can appear: a write with a later timestamp may commit before another write with an earlier timestamp. A naïve reader scanning the oplog could skip the earlier operation.

MongoDB prevents this by tracking an oplogReadTimestamp, the highest hole‑free point in the oplog. Secondaries are prevented from reading past this point until all prior operations are visible, ensuring causal consistency even in the presence of concurrent commits.

Causal consistency in MongoDB is enforced by attaching an afterClusterTime to reads:

Drivers track the operationTime of the last operation in a session.
When a session is created with causalConsistency: true, the driver automatically includes an afterClusterTime equal to the highest known cluster time on subsequent reads.
The server blocks the read until its cluster time has advanced beyond afterClusterTime.

With any read preference that allows reading from secondaries as well as the primary, this guarantees read‑your‑writes behavior:


// Start a causally consistent session
const session = client.startSession({ causalConsistency: true });

const coll = db.collection("orders");

// Write in this session
await coll.insertOne({ id: 123, product: "widget" }, { session });

// The driver automatically injects afterClusterTime into the read concern
const order = await coll.findOne({ id: 123 }, { session });

Causal consistency is not limited to snapshot reads. It applies across read concern levels. The key point is that the session ensures later reads observe at least the effects of earlier writes, regardless of which replica serves the read.

Conclusion

Here is a simplified comparison:

Feature	PostgreSQL `WAIT FOR LSN`	MongoDB Causal Consistency
Clock type	Physical byte offset in the WAL (LSN)	Hybrid Logical Clock (HLC)
Mechanism	Block until replay/write/flush LSN reached	Block until `afterClusterTime` is visible
Tracking	Application captures LSN	Driver tracks `operationTime`
Granularity	WAL record position	Oplog timestamp
Replication model	Physical streaming	Logical oplog application
Hole handling	N/A (serialized WAL)	`oplogReadTimestamp`
Failover handling	Error unless `NO_THROW`	Session continues, bounded by replication state

Both PostgreSQL’s WAIT FOR LSN and MongoDB’s causal consistency ensure reads can observe prior writes, but at different layers:

PostgreSQL offers manual, WAL‑level precision.
MongoDB provides automatic, session‑level guarantees.

If you want read‑your‑writes semantics to “just work” without additional coordination calls, MongoDB’s session‑based model is a strong fit. Despite persistent myths about consistency, MongoDB delivers strong consistency in a horizontally scalable system with a simple developer experience.

by Franck Pachot

Murat Demirbas

End of Productivity Theater

I remember the early 2010s as the golden age of productivity hacking. Lifehacker, 37signals, and their ilk were everywhere, and it felt like everyone was working on jury-rigging color-coded Moleskine task-trackers and web apps into the perfect Getting Things Done system.

So recently I found myself wondering: what happened to all that excitement? Did I just outgrow the productivity movement, or did the movement itself lose stream?

After poking around a bit, I think it's both. We collectively grew out of that phase, and productivity itself fundamentally changed.

The Trap of Micro-Optimizations

Back then, the underlying promise of productivity culture was about outputmaxxing (as we would now call it). We obsessed over efficiency at the margins: how to auto-sync this app with that one, or how to shave 5 seconds off an email reply. We accumulated systems, hacks, and integrations like collectors.

Eventually, the whole thing got exhausting. I think we all realized that tweaking task managers wasn't helping the bottom line. We were doing a lot of organizing, but that organizing wasn't reflecting in actually getting the work done.

The reason is simple: not all tasks matter equally. Making some tasks faster does not move the bottomline if the core task remains the serial bottleneck. Amdahl’s Law says that speeding up one part of a system improves overall performance only in proportion to the time that part consumes. If the hard, irreducible core is untouched, optimizations elsewhere are just noise.

Painting the deck of a sinking ship faster doesn't help anyone. Productivity should be about making sure we are working on the right things in the first place. The main thing is to keep the main thing the main thing.

Away From the Glowing Rectangle

For more than 15 years, I've relied on Emacs org-mode to run my life. It's the ultimate organization system, that has survived every software trend of the past decade and a half. But despite having this powerful writing system at my fingertips, my best ideas never arrive while I'm staring at a screen. Almost without exception, my hard thinking happens away from the screen. That's where the ideas come from.

If I'm being rational about it: I should be paid for the time I spend thinking hard, not for the time I spend managing my inbox, or doing trivial office work, or wrangling text on a screen.

So that's how I try to work now. I do my deep thinking, messy brainstorming, and wrestling-with-ideas completely away from the screen. Then I plan my next 45 minutes or so (what I'm going to do, in what order, and why) and only then do I go to my laptop to execute it. In other words, I arrive at the screen with a plan.

(OK, let's first take a moment to appreciate my self-restraint for not mentioning AI until this late in to the post. But here it comes.)

What does productivity even mean in the age of AI? What are we actually here to contribute? Are we supposed to be architects or butlers to LLMs?

If AI absorbs all the shallow work, the only things left that genuinely require a human are the core parts that demands genuine creativity, judgment, taste, and the type of thinking that can't be prompted away. This raises the stakes considerably, and changes what "a productive day" even means.

That kind of deep creative work is best done away from the glowing rectangle.

I recently launched a free email newsletter for the blog. Subscribe here to get these essays delivered to your inbox, along with behind-the-scenes commentary.

by Murat (noreply@blogger.com)

February 19, 2026

Franck Pachot

Top-K queries with MongoDB search indexes (BM25)

A document database is more than a JSON datastore. It must also support efficient storage and advanced search: equality and range predicates, fuzzy text search, ranking, pagination, and limited sorted results (top‑k). BM25 indexes, which combine an inverted index and columnar doc values, are ideal for this, with mature open‑source implementations like Lucene (used by MongoDB) and Tantivy (used by ParadeDB).

ParadeDB brings Tantivy indexing to PostgreSQL via the pg_search extension and recently published an excellent article showing where GIN indexes fall short and how BM25 bridges the gap. Here, I’ll present the MongoDB equivalent using its Lucene‑based search indexes. I suggest reading ParadeDB’s post first, as it clearly explains the problem and the solution:

How We Optimized Top K in Postgres | ParadeDB

How ParadeDB uses principles from search engines to optimize Postgres' Top K performance.

paradedb.com

I'll be lazy and use the same dataset, index and query.

MongoDB with search indexes

You can use BM25 indexes on MongoDB in several environments: the cloud-managed service (MongoDB Atlas), its local deployment (Atlas Local), on-premises MongoDB Enterprise Server, and the open-source MongoDB Community edition. The mongot engine that powers MongoDB Search is in public preview, with its source available at github.com/mongodb/mongot.

I started a local Atlas deployment on my laptop with Atlas CLI and connected automatically:


atlas deployments setup  mongo --type local --connectWith mongosh --force

Dataset generation

I generated 100,000,000 documents similar to ParadeDB's benchmark:


const batchSize = 10000;
const batches   = 10000;

const rows      = batches * batchSize;
print(`Generating ${rows.toLocaleString()} documents`);

db.benchmark_logs.drop();

const messages = [ 'The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.', 'The research facility analyzed samples from ancient artifacts, revealing breakthrough findings about civilizations lost to the depths of time.', 'The research station monitored weather patterns across mountain peaks, collecting data about atmospheric changes in the remote depths below.', 'The research observatory captured images of stellar phenomena, peering into the cosmic depths to understand the mysteries of distant galaxies.', 'The research laboratory processed vast amounts of genetic data, exploring the molecular depths of DNA to unlock biological secrets.', 'The research center studied rare organisms found in ocean depths, documenting new species thriving in extreme underwater environments.', 'The research institute developed quantum systems to probe subatomic depths, advancing our understanding of fundamental particle physics.', 'The research expedition explored underwater depths near volcanic vents, discovering unique ecosystems adapted to extreme conditions.', 'The research facility conducted experiments in the depths of space, testing how different materials behave in zero gravity environments.', 'The research team engineered crops that could grow in the depths of drought conditions, helping communities facing climate challenges.' ];

const countries = [ 'United States', 'Canada', 'United Kingdom', 'France', 'Germany', 'Japan', 'Australia', 'Brazil', 'India', 'China' ];

const labels = [ 'critical system alert', 'routine maintenance', 'security notification', 'performance metric', 'user activity', 'system status', 'network event', 'application log', 'database operation', 'authentication event' ];

let batch = [];
const startDate = new Date("2020-01-01T00:00:00Z");
for (let i = 0; i < rows; i++) {

  batch.push({
    message: messages[i % 10],
    country: countries[i % 10],
    severity: (i % 5) + 1,
    timestamp: new Date(startDate.getTime() + (i % 731) * 24 * 60 * 60 * 1000),
    metadata: {
      value: (i % 1000) + 1,
      label: labels[i % 10]
    }
  });

  if (batch.length === batchSize) {
    db.benchmark_logs.insertMany(batch);
    batch = [];
  }

}

I checked the document schema and counts:

print(`Done!
 \nSample: ${EJSON.stringify(  db.benchmark_logs.find().limit(1).toArray(),  null,  2  )}
 \nDocument count: ${db.benchmark_logs.countDocuments().toLocaleString()}
`);

Sample: [
  {
    "_id": {
      "$oid": "6997580679ab8450f81ff93c"
    },
    "message": "The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.",
    "country": "United States",
    "severity": 1,
    "timestamp": {
      "$date": "2020-01-01T00:00:00Z"
    },
    "metadata": {
      "value": 1,
      "label": "critical system alert"
    }
  }
]

Document count: 100,000,000

With 100 million documents, this is a large dataset. Because many fields can be queried, we can’t create every compound index combination. A single search index will make queries on this collection efficient.

Search index creation

I created the search index similar to the one used on ParadeDB (here):

const mapping = {
  mappings: {
    // Equivalent to: USING bm25 Atlas Search uses Lucene BM25 by default
    dynamic: false,
    fields: {
      // Equivalent to: bm25(id, message, ...) Standard full-text field scored by BM25
      message: { type: "string" },
      // Equivalent to: text_fields = {  "country": {  fast: true, tokenizer: { type: "raw", lowercase: true } } } // fast = true → implicit in Atlas Search; docValues optional in cloud
      country: { type: "string", analyzer: "keywordLowercase" },
      // Equivalent to:numeric field indexed for filtering
      severity: { type: "number", representation: "int64" },
      // Equivalent to:timestamp field included in the BM25 index
      timestamp: { type: "date" },
      // Equivalent to: json_fields = { "metadata": { fast: true, tokenizer: raw } }
      metadata: {
        type: "document",
        fields: {
          value: {
            type: "number",
            representation: "int64"
          },
          // Equivalent to: metadata tokenizer = raw + lowercase
          label: {
            type: "string",
            analyzer: "keywordLowercase"
          }
        }
      }
    }
  },
  analyzers: [
    {
      // Equivalent to: tokenizer = raw, lowercase = true
      name: "keywordLowercase",
      tokenizer: { type: "keyword" },
      tokenFilters: [{ type: "lowercase" }]
    }
  ]
};

db.benchmark_logs.createSearchIndex(
  "benchmark_logs_idx",
  mapping
);

The index is created asynchronously and updated via change stream operations.

Query and result

The query combines text search, range filter, sort by score, and limit for Top-K:

query = [
  {
    $search: {
      index: "benchmark_logs_idx",
      compound: {
        must: [{ text: { query: "research team", path: "message" } }],
        filter: [{ range: { path: "severity", lt: 3 } }]
      },
      sort: { score: { $meta: "searchScore" } }
    }
  },
  { $limit: 10 },
  {
    $project: {
      message: 1,
      country: 1,
      severity: 1,
      timestamp: 1,
      metadata: 1,
      rank: { $meta: "searchScore" }
    }
  }
]

const start = Date.now();

print(EJSON.stringify(db.benchmark_logs.aggregate(query).toArray(),null,2));

const end = Date.now();
print(`\nExecution time: ${end - start} ms`);

It is important that the sort is part of $search because an additional $sort stage would not be pushed down. This allows Atlas Search to run the query in Lucene’s Top‑K mode, enabling block‑max WAND (BMW) pruning via competitive score feedback during collection.

Here is the result and timing:


[{"_id":{"$oid":"699757049ce6a7c42c65d105"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-01-11T00:00:00Z"},"metadata":{"value":11,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d10f"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-01-21T00:00:00Z"},"metadata":{"value":21,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d119"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-01-31T00:00:00Z"},"metadata":{"value":31,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d123"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-02-10T00:00:00Z"},"metadata":{"value":41,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d12d"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-02-20T00:00:00Z"},"metadata":{"value":51,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d137"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-03-01T00:00:00Z"},"metadata":{"value":61,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d141"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-03-11T00:00:00Z"},"metadata":{"value":71,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d14b"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-03-21T00:00:00Z"},"metadata":{"value":81,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d155"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-03-31T00:00:00Z"},"metadata":{"value":91,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d15f"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-04-10T00:00:00Z"},"metadata":{"value":101,"label":"critical system alert"},"rank":0.6839379072189331}]

Execution time: 1850 ms

On my laptop, this search over 100 million documents returns results in under two seconds, with no tuning. It performs a broad text match, and the high‑frequency terms "research" and "team" generate tens of millions of candidate documents. The additional severity filter and scoring require comparing tens of millions of scores, which has been heavily parallelized to stay within the two‑second budget.

Performance breakdown (explain)

Because the execution plan is long, I’ve packed it into a short string that you can easily copy and paste into your preferred AI chatbot:

EJSON.stringify(
db.benchmark_logs.aggregate(query).explain("executionStats")
);

{"explainVersion":"1","stages":[{"$_internalSearchMongotRemote":{"mongotQuery":{"index":"benchmark_logs_idx","compound":{"must":[{"text":{"query":"research team","path":"message"}}],"filter":[{"range":{"path":"severity","lt":3}}]},"sort":{"score":{"$meta":"searchScore"}}},"explain":{"query":{"type":"BooleanQuery","args":{"must":[{"path":"compound.must","type":"BooleanQuery","args":{"must":[],"mustNot":[],"should":[{"type":"TermQuery","args":{"path":"message","value":"research"},"stats":{"context":{"millisElapsed":1.273251,"invocationCounts":{"createWeight":2,"createScorer":87}},"match":{"millisElapsed":0},"score":{"millisElapsed":1292.607756,"invocationCounts":{"score":40000011}}}},{"type":"TermQuery","args":{"path":"message","value":"team"},"stats":{"context":{"millisElapsed":0.292666,"invocationCounts":{"createWeight":2,"createScorer":87}},"match":{"millisElapsed":0},"score":{"millisElapsed":379.190071,"invocationCounts":{"score":10000011}}}}],"filter":[],"minimumShouldMatch":0},"stats":{"context":{"millisElapsed":2.268162,"invocationCounts":{"createWeight":2,"createScorer":87}},"match":{"millisElapsed":0},"score":{"millisElapsed":3838.859709,"invocationCounts":{"score":40000011}}}}],"mustNot
                                    
                                    
                                    
                                        by Franck Pachot


                            
                                
                                    
                                        Percona Database Performance Blog
                                    
                                    
                                        A Guide to Accelerating Your Application with Valkey: Caching Database Queries and Sessions
                                    
                                    
                                    
                                        Modern applications often rely on multiple services to provide fast, reliable, and scalable responses. A common and highly effective architecture involves an application, a persistent database (like MySQL), and a high-speed cache service (like Valkey). In this guide, we’ll explore how to integrate these components effectively using Python to dramatically improve your application’s performance. Understanding […]
                                    
                                    
                                    
                                    
                                        by Arunjith Aravindan
                                    
                                    
                                
                            
                                
                                    
                                        Tinybird Engineering Blog
                                    
                                    
                                        How We Built Tinybird's TypeScript SDK for ClickHouse
                                    
                                    
                                    
                                        How we built the Tinybird TypeScript SDK: phantom types for compile-time inference, esbuild for schema loading, and a dev workflow that connects your app and data layer.
                                    
                                    
                                    
                                    
                                        by Rafael Moreno Higueras
                                    
                                    
                                
                            
                                
                                    
                                        PlanetScale Blog
                                    
                                    
                                        Faster PlanetScale Postgres connections with Cloudflare Hyperdrive
                                    
                                    
                                    
                                        Build a real-time application with PlanetScale and the Cloudflare global network. Infrastructure choices you won't need to migrate away from once you hit scale.
                                    
                                    
                                    
                                    
                                        by Simeon Griggs


                    
                        
                            February 18, 2026
                            
                            
                                
                                    
                                        Murat Demirbas
                                    
                                    
                                        OSTEP Chapter 9: Proportional Share Scheduling
                                    
                                    
                                    
                                        The Crux: Fairness Over Speed. Unlike the schedulers we explored in Chapter 8 (like Shortest Job First or Multi-Level Feedback Queues) that optimize for "turnaround time"  or "response time", proportional-share schedulers introduced in this Chapter aim to guarantee that each job receives a specific percentage of CPU time.
(This is part of our series going through OSTEP book chapters.)

Basic Concept: Tickets
Lottery Scheduling serves as the foundational example of proportional-share schedulers. It uses a randomized mechanism to achieve fairness probabilistically. The central concept of Lottery Scheduling is the ticket. Tickets represent the share of the resource a process should receive.
The scheduler holds a lottery every time slice. If Job A has 75 tickets and Job B has 25 (100 total), the scheduler picks a random number between 0 and 99. Statistically, Job A will win 75% of the time. The implementation is incredibly simple. It requires a random number generator, a list of processes, and a loop that sums ticket values until the randomly picked counter (300 in the below example) exceeds the winning number.

Advanced Ticket Mechanisms
1.  Ticket Currency: Users can allocate tickets among their own jobs in local currency (e.g., 500 "Alice-tickets"), which the system converts to global currency. This delegates the "fairness" decision to the user.
2.  Ticket Transfer: A client can temporarily hand its tickets to a server process to maximize performance while a specific request is being handled.
3.  Ticket Inflation: In trusted environments, a process can unilaterally boost its ticket count to reflect a higher need for CPU. In competitive settings this is unsafe, since a greedy process could grant itself excessive tickets and monopolize the machine. In practice, modern systems prevent this with control groups (cgroups), which act as an external regulator that assigns fixed resource weights so untrusted processes cannot simply print more tickets to override the scheduler.









Lottery Scheduling depends on randomness to decide which job runs next. This randomness helps avoid the tricky cases that can trip up traditional algorithms, like LRU on cyclic workloads, and keeps the scheduler simple with minimal state to track. However, fairness is only achieved over time. In the short term, a job might get unlucky and lose more often than its share of tickets. Studies show that fairness is low for short jobs and only approaches perfect fairness as the total runtime increases.

Lottery vs. Stride vs. CFS scheduling








Stride Scheduling emerged to address the probabilistic quirks of Lottery Scheduling. It assigns each process a stride, inversely proportional to its tickets, and maintains a pass value tracking how much CPU time the process has received. At each decision point, the scheduler selects the process with the lowest pass value.
This guarantees exact fairness each cycle, but it introduces challenges with global state. When a new process arrives, assigning it a fair initial pass value is tricky: set it too low, and it can dominate the CPU; too high, and it risks starvation. In contrast, Lottery Scheduling handles new arrivals seamlessly, since it requires no global state.


The Linux Completely Fair Scheduler (CFS) builds on these earlier proportional schedulers but removes randomness by using Virtual Runtime (vruntime) to track each process’s CPU usage. At every scheduling decision, CFS selects the job with the smallest vruntime, ensuring a fair distribution of CPU time. To prevent excessive context-switching overhead when there are many tasks (each receiving only a tiny slice of CPU time), CFS enforces a min_granularity. This ensures every process runs for at least a minimum time slice, and it balances fairness with efficient CPU utilization.


To prioritize specific processes, CFS uses the classic UNIX "nice" level, which allows users to assign values between -20 (highest priority) and +19 (lowest priority). CFS maps these values to geometric weights; a process with a higher priority (lower nice value) is assigned a larger weight. This weight directly alters the rate at which vruntime accumulates: high-priority processes add to their vruntime much more slowly than low-priority ones. When determining exactly how long a process should run within the target scheduling latency (sched_latency), instead of simply dividing the target latency equally among all tasks (e.g., 48ms/n), CFS calculates the time slice for a specific process k as a fraction of the total weight of all currently running processes.
Consequently, a high-priority job can run for a longer physical time while only "charging" a small amount of virtual time, allowing it to claim a larger proportional share of the CPU compared to "nicer" low-priority tasks.
Finally, because modern systems handle thousands of processes, CFS replaces the simple lists of Lottery Scheduling with Red-Black Trees, giving $O(\log n)$ efficiency for insertion and selection.

Challenges in Proportional Sharing
The I/O Problem. Proportional schedulers face a challenge when jobs sleep, such as waiting for I/O. In a straightforward model, a sleeping job lags behind, and when it resumes, it can monopolize the CPU to catch up, potentially starving other processes. CFS addresses this by resetting the waking job’s vruntime to the minimum value in the tree. This ensures no process starves, but it can penalize the interactive job, leading to slower response times.
The Ticket Assignment Problem. Assigning tickets is still an open challenge. In general-purpose computing, such as browsers or editors, it’s unclear how many tickets each application deserves, making fairness difficult to enforce. The situation is a bit more clear in virtualization and cloud computing, where ticket allocation aligns naturally with resource usage: if a client pays for 25% of a server, it can be assigned 25% of the tickets, providing a clear and effective proportional share.



I’ve been experimenting with Gemini Pro and NotebookLM. What a time to be alive! The fancy slides above all came from these tools, and for the final summary infographic, it produced a solid visual mind map of everything. Decades ago, as a secondary school student, I created similar visual mind maps as mnemonic devices for exams. The Turkish education system relied heavily on memorization, but I needed to understand concepts first to be able to memorize them. So I connected ideas, contextualized them through these visual mind maps, and it worked wonders. I even sold these mind maps to friends before exams. Looking back, those experiences were formative for my later blogging and explanation efforts. Fun times.
                                    
                                    
                                    
                                    
                                        by Murat (noreply@blogger.com)
                                    
                                    
                                
                            
                                
                                    
                                        Small Datum - Mark Callaghan
                                    
                                    
                                        Explaining why throughput varies for Postgres with a CPU-bound Insert Benchmark
                                    
                                    
                                    
                                        Throughput for the write-heavy steps of the Insert Benchmark look like a distorted sine wave with Postgres on CPU-bound workloads but not on IO-bound workloads. For the CPU-bound workloads the chart for max response time at N-second intervals for inserts is flat but for deletes it looks like the distorted sine wave. To see the chart for deletes, scroll down from here. So this looks like a problem for deletes and this post starts to explain that.

tl;dr
Once again, blame vacuum
History of the Insert Benchmark
Long ago (prior to 2010) the Insert Benchmark was published by Tokutek to highlight things that the TokuDB storage engine was great at. I was working on MySQL at Google at the time and the benchmark was useful to me, however it was written in C++. While the Insert Benchmark is great at showing the benefits of an LSM storage engine, this was years before MyRocks and I was only doing InnoDB at the time, on spinning disks. So I rewrote it in Python to make it easier to modify, and then the Tokutek team improved a few things about my rewrite, and I have been enhancing it slowly since then.
Until a few years ago the steps of the benchmark were:
load - insert in PK order
create 3 secondary indexes
do more inserts as fast as possible
do rate-limited inserts concurrent with range and point queries
The problem with this approach is that the database size grows forever and that limited for how long I could run the benchmark before running out of storage. So I changed it and the new approach keeps the database at a fixed size after the load. The new workflow is:
load - insert in PK order
create 3 secondary indexes
do inserts+deletes at the same rate, as fast as possible
do rate-limited inserts+deletes at the same rate concurrent with range and point queries
The benchmark treats the table like a queue, and when ordered by PK (transactionid) there are inserts at the high end and deletes at the low end. The delete statement currently looks like:
    delete from %s where transactionid in
        (select transactionid from %s where transactionid >= %d order by transactionid asc limit %d)

The delete statement is written like that because it must delete the oldest rows -- the ones that have the smallest value for transactionid. While the process that does deletes has some idea of what that smallest value is, it doesn't know it for sure, thus the query. To improve performance it maintains a guess for the value that will be <= the real minimum and it updates that guess over time.

I encountered other performance problems with Postgres while figuring out how to maintain that guess and get_actual_variable_range() in Postgres was the problem. Maintaining that guess requires a resync query every N seconds where the resync query is: select min(transactionid) from %s. The problem for this query in general is that is scans the low end of the PK index on transactionid and when vacuum hasn't been done recently, then it will scan and skip many entries that aren't visible (wasting much CPU and some IO) before finding visible rows. Unfortunately, there will be some time between consecutive vacuums to the same table and this problem can't be avoided. The result is that the response time for the query increases a lot in between vacuums. For more on how get_actual_variable_range() contributes to this problem, see this post.

I assume the sine wave for delete response time is caused by one or both of:
get_actual_varable_range() CPU overhead while planning the delete statement
CPU overhead from scanning and skipping tombstones while executing the select subquery
The structure of the delete statement above reduces the number of tombstones that the select subquery might encounter by specifying where transactionid >= %d. Perhaps that isn't sufficient. Perhaps the Postgres query planner still has too much CPU overhead from get_actual_variable_range() while planning that delete statement. I have yet to figure that out. But I have figured out that vacuum is a frequent source of problems.


                                    
                                    
                                    
                                    
                                        by Mark Callaghan (noreply@blogger.com)
                                    
                                    
                                
                            
                                
                                    
                                        Small Datum - Mark Callaghan
                                    
                                    
                                        MariaDB innovation: binlog_storage_engine, small server, Insert Benchmark
                                    
                                    
                                    
                                         MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB).

My previous post had results for sysbench with a small server. This post has results for the Insert Benchmark with a similar small server. Both servers use an SSD that has has high fsync latency. This is probably a best-case comparison for the feature. If you really care, then get enterprise SSDs with power loss protection. But you might encounter high fsync latency on public cloud servers.
tl;dr for a CPU-bound workload
Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2.
When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 1.75X larger for l.i0 (load) and 4X or more larger on the random write steps (l.i1, l.i2)
tl;dr for an IO-bound workload
Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2. It also has a large impact on qp1000, which is the most write-heavy of the query+write steps.
When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 4.74X larger for l.i0 (load), 1.50X larger for l.i1 (random writes) and 2.99X larger for l.i2 (random writes)
Builds, configuration and hardware

I compiled MariaDB 12.3.0 from source.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here. The storage device has high fsync latency.

I used 4 my.cnf files:
z12b
my.cnf.cz12b_c8r32 is my default configuration. Sync-on-commit is disabled for both the binlog and InnoDB so that write-heavy benchmarks create more stress.
z12c
my.cnf.cz12c_c8r32 is like z12b except it enables binlog_storage_engine
z12b_sync
my.cnf.cz12b_sync_c8r32 is like z12b except it enables sync-on-commit for the binlog and InnoDB
z12c_sync
my.cnf.cz12c_sync_c8r32 is like cz12c except it enables sync-on-commit for InnoDB. Note that InnoDB is used to store the binlog so there is nothing else to sync on commit.
The Benchmark

The benchmark is explained here. It was run with 1 client for two workloads:
CPU-bound - the database is cached by InnoDB, but there is still much write IO
IO-bound - most, but not all, benchmark steps are IO-bound
The benchmark steps are:
l.i0
insert XM rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 30M for CPU-bound and 800M for IO-bound.
l.x
create 3 secondary indexes per table. There is one connection per client.
l.i1
use 2 connections/client. One inserts XM rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. X is 40M for CPU-bound and 4M for IO-bound.
l.i2
like l.i1 but each transaction modifies 5 rows (small transactions) and YM rows are inserted and deleted per table. Y is 10M for CPU-bound and 1M for IO-bound.
Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
qr100
use 3 connections/client. One does range queries and performance is reported for this. The second does does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload. This step runs for 1800 seconds.
qp100
like qr100 except uses point queries on the PK index
qr500
like qr100 but the insert and delete rates are increased from 100/s to 500/s
qp500
like qp100 but the insert and delete rates are increased from 100/s to 500/s
qr1000
like qr100 but the insert and delete rates are increased from 100/s to 1000/s
qp1000
like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: summary

Results: summary

The performance reports are here for:
CPU-bound
all-versions - results for z12b, z12c, z12b_sync and z12c_sync
sync-only - results for z12b_sync vs 12c_sync
IO-bound
all-versions - results for z12b, z12c, z12b_sync and z12c_sync
sync-only - results for z12b_sync vs 12c_sync
The summary sections from the performance reports have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version $base is the result from the base version. The base version is Postgres 12.22.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
insert/s for l.i0, l.i1, l.i2
indexed rows/s for l.x
range queries/s for qr100, qr500, qr1000
point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.

I often use context switch rates as a proxy for mutex contention.

Results: CPU-bound

The summaries are here for all-versions and sync-only.
Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2.
When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 1.75X larger for l.i0 (load) and 4X or more larger on the random write steps (l.i1, l.i2)
The second table from the summary section has been inlined below. That table shows relative throughput which is:
all-versions: (QPS for my config / QPS for z12b)
sync-only: (QPS for my z12c / QPS for z12b)
For all-versions
dbms l.i0 l.x l.i1 l.i2 qr100 qp100 qr500 qp500 qr1000 qp1000
ma120300_rel_withdbg.cz12b_c8r32 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
ma120300_rel_withdbg.cz12c_c8r32 1.03 1.01 1.00 1.03 1.00 0.99 1.00 1.00 1.01 1.00
ma120300_rel_withdbg.cz12b_sync_c8r32 0.04 1.02 0.07 0.01 1.01 1.01 1.00 1.01 1.00 1.00
ma120300_rel_withdbg.cz12c_sync_c8r32 0.08 1.03 0.28 0.06 1.02 1.01 1.01 1.02 1.02 1.01

For sync-only
dbms l.i0 l.x l.i1 l.i2 qr100 qp100 qr500 qp500 qr1000 qp1000
ma120300_rel_withdbg.cz12b_sync_c8r32 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
ma120300_rel_withdbg.cz12c_sync_c8r32 1.75 1.01 3.99 6.83 1.01 1.01 1.01 1.01 1.03 1.01

Results: IO-bound

The summaries are here for all-versions and sync-only.
Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2. It also has a large impact on qp1000, which is the most write-heavy of the query+write steps.
When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 4.74X larger for l.i0 (load), 1.50X larger for l.i1 (random writes) and 2.99X larger for l.i2 (random writes)
The second table from the summary section has been inlined below. That table shows relative throughput which is:
all-versions: (QPS for my config / QPS for z12b)
sync-only: (QPS for my z12c / QPS for z12b)
For all-versions
dbms l.i0 l.x l.i1 l.i2 qr100 qp100 qr500 qp500 qr1000 qp1000
ma120300_rel_withdbg.cz12b_c8r32 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
ma120300_rel_withdbg.cz12c_c8r32 1.01 0.99 0.99 1.01 1.01 1.01 1.01 1.07 1.01 1.04
ma120300_rel_withdbg.cz12b_sync_c8r32 0.04 1.00 0.55 0.10 1.02 0.97 1.00 0.80 0.95 0.55
ma120300_rel_withdbg.cz12c_sync_c8r32 0.18 1.00 0.83 0.31 1.02 1.01 1.02 0.96 1.02 0.86

For sync-only
dbms l.i0 l.x l.i1 l.i2 qr100 qp100 qr500 qp500 qr1000 qp1000
ma120300_rel_withdbg.cz12b_sync_c8r32 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
ma120300_rel_withdbg.cz12c_sync_c8r32 4.74 1.00 1.50 2.99 1.00 1.04 1.02 1.20 1.08 1.57













                                    
                                    
                                    
                                    
                                        by Mark Callaghan (noreply@blogger.com)
                                    
                                    
                                
                            
                        
                    
                        
                            February 17, 2026
                            
                            
                                
                                    
                                        AWS Database Blog - Amazon Aurora
                                    
                                    
                                        Use default encryption at rest for new Amazon Aurora clusters
                                    
                                    
                                    
                                        In this post, you learn how Amazon Aurora now provides encryption at rest by default for all new database clusters using AWS owned keys. You’ll see how to verify encryption status using the new StorageEncryptionType field, understand the impact on new and existing clusters, and explore migration options for unencrypted databases.
                                    
                                    
                                    
                                    
                                        by Pratibha Shivnani
                                    
                                    
                                
                            
                                
                                    
                                        Percona Database Performance Blog
                                    
                                    
                                        An Open Letter to Oracle: Let’s Talk About MySQL’s Future
                                    
                                    
                                    
                                        What Happened at the Summits We just wrapped up two MySQL Community Summits – one in San Francisco in January, and one in Brussels right before FOSDEM. The energy in the rooms: a lot of people who care deeply about MySQL got together, exchanged ideas, and left with a clear sense that we need to […]
                                    
                                    
                                    
                                    
                                        by Vadim Tkachenko
                                    
                                    
                                
                            
                        
                    
                        
                            February 16, 2026
                            
                            
                                
                                    
                                        Small Datum - Mark Callaghan
                                    
                                    
                                        MariaDB innovation: binlog_storage_engine
                                    
                                    
                                    
                                        MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB).
In this post I have results for the performance benefit from this when using storage that has a high fsync latency. This is probably a best-case comparison for the feature. A future post will cover the benefit on servers that don't have high fsync latency.
tl;dr
the performance benefit from this is excellent when storage has a high fsync latency
my mental performance model needs to be improved. I gussed that throughput would increase by ~2X when using binlog_storage_engine relative to not using it but using sync_binlog=1 and innodb_flush_log_at_trx_commit=1. However the improvement is larger than 4X.
Builds, configuration and hardware

I compiled MariaDB 12.3.0 from source.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here. The storage device has high fsync latency.

I used 4 my.cnf files:
z12b
my.cnf.cz12b_c8r32 is my default configuration. Sync-on-commit is disabled for both the binlog and InnoDB so that write-heavy benchmarks create more stress.
z12c
my.cnf.cz12c_c8r32 is like z12b except it enables binlog_storage_engine
z12b_sync
my.cnf.cz12b_sync_c8r32 is like z12b except it enables sync-on-commit for the binlog and InnoDB
z12c_sync
my.cnf.cz12c_sync_c8r32 is like cz12c except it enables sync-on-commit for InnoDB. Note that InnoDB is used to store the binlog so there is nothing else to sync on commit.
Benchmark
I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks 
and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres.

The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.

The benchmark is run with 1 client, 1 table and 50M rows. 

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation.  

But here I only report results for the write-heavy tests.

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for base version)
When the relative QPS is > 1 then some version is faster than base version.  When it is < 1 then there might be a regression. 

I present results for:
z12b, z12c, z12b_sync and z12c_sync with z12b as the base version
 z12b_sync and z12c_sync with z12b_sync as the base version
Results: z12b, z12c, z12b_sync, z12c_sync

Summary:
z12b_sync has the worst performance thanks to 2 fsyncs per commit
z12c_sync gets more than 4X the throughput vs z12b_sync. If fsync latency were the only thing that determined performance then I would expect the difference to be ~2X. There is more going on here and in the next section I mention that enabling binlog_storage_engine also reduces the CPU overhead.
some per-test data from iostat and vmstat is here
a representative sample of iostat collected at 1-second intervals during the update-inlist test is here. When comparing z12b_sync with z12c_sync
the fsync rate (f/s) is ~2.5X larger for z12c_sync vs z12b_sync (~690/s vs ~275/s) but fsync latency (f_await) is similar. So with binlog_storage_engine enabled MySQL is more efficient, and perhaps thanks to a lower CPU overhead, there is less work to do in between calls to fsync
Relative to: z12b
col-1 : z12c
col-2 : z12b_sync
col-3 : z12c_sync

col-1   col-2   col-3
1.06    0.01    0.05    delete
1.05    0.01    0.05    insert
1.01    0.12    0.47    read-write_range=100
1.01    0.10    0.44    read-write_range=10
1.03    0.01    0.11    update-index
1.02    0.02    0.12    update-inlist
1.05    0.01    0.06    update-nonindex
1.05    0.01    0.06    update-one
1.05    0.01    0.06    update-zipf
1.01    0.03    0.20    write-only

Results: z12b_sync, z12c_sync

Summary:
z12c_sync gets more than 4X the throughput vs z12b_sync. If fsync latency were the only thing that determined performance then I would expect the difference to be ~2X. There is more going on here and below I mention that enabling binlog_storage_engine also reduces the CPU overhead.
some per-test data from iostat and vmstat is here and the CPU overhead per operation is much smaller with binlog_storage_engine -- see here for the update-inlist test. In general, when sync-on-commit is enabled then the CPU overhead with binlog_storage_engine enabled is between 1/3 and 2/3 of the overhead without it enabled.
Relative to: z12b_sync
col-1 : z12c_sync

col-1
6.40    delete
5.64    insert
4.06    read-write_range=100
4.40    read-write_range=10
7.64    update-index
7.17    update-inlist
5.73    update-nonindex
5.82    update-one
5.80    update-zipf
6.61    write-only


                                    
                                    
                                    
                                    
                                        by Mark Callaghan (noreply@blogger.com)
                                    
                                    
                                
                            
                                
                                    
                                        Franck Pachot
                                    
                                    
                                        Relational composition and Codd's "connection trap" in PostgreSQL and MongoDB
                                    
                                    
                                    
                                        Relational composition is to joins what the cartesian product is to tables: it produces every result that could be true, not just what is true. This often leads to SQL mistakes and can often be suspected when a SELECT DISTINCT is added after a query starts returning more rows than expected, without the root cause being understood.  

In its mathematical definition, relational composition is the derived relation obtained by existentially joining two relations on a shared attribute and projecting away that attribute. In a database, it is meaningful only when a real‑world invariant ensures that the resulting pairs reflect actual facts. Otherwise, the result illustrates what E. F. Codd, in his 1970 paper A Relational Model of Data for Large Shared Data Banks, called the connection trap.  

Codd uses two relations in his example: a supplier supplies parts, and a project uses parts. At an intuitive level, this connection trap mirrors a syllogism: if a supplier supplies a part and a project uses that part, a join can derive that the supplier supplies the project—even when that conclusion was never stated as a fact.  

Codd observed that the connection trap was common in pre‑relational network data models, where users navigated data by following physical pointers. Path existence was often mistaken for semantic relationship. The relational model solved this problem by replacing navigational access with declarative queries over explicitly defined relations, and modern document models now do the same.

However, while the relational model removes pointer‑based navigation, it does not eliminate the trap entirely. Joins can still compute relational compositions, and without appropriate cardinality constraints or business invariants, such compositions may represent only possible relationships rather than actual ones. In this way, the connection trap can be reintroduced at query time, even in modern relational systems such as PostgreSQL, and similarly through $lookup operations in MongoDB.





  
  
  PostgreSQL — reproducing the connection trap


This model declares suppliers, parts, projects, and two independent many‑to‑many relationships:




CREATE TABLE suppliers (
    supplier_id TEXT PRIMARY KEY
);

CREATE TABLE parts (
    part_id TEXT PRIMARY KEY
);

CREATE TABLE projects (
    project_id TEXT PRIMARY KEY
);

-- Supplier supplies parts
CREATE TABLE supplier_part (
    supplier_id TEXT REFERENCES suppliers,
    part_id     TEXT REFERENCES parts,
    PRIMARY KEY (supplier_id, part_id)
);

-- Project uses parts
CREATE TABLE project_part (
    project_id TEXT REFERENCES projects,
    part_id    TEXT REFERENCES parts,
    PRIMARY KEY (project_id, part_id)
);






This follows Codd’s classic suppliers–parts–projects example, where suppliers supply parts and projects use parts as independent relationships.

The following data asserts that project Alpha uses parts P1 and P2, that supplier S1 supplies parts P1 and P2, and that supplier S2 supplies parts P2 and P3:




INSERT INTO suppliers VALUES ('S1'), ('S2');
INSERT INTO parts     VALUES ('P1'), ('P2'), ('P3');
INSERT INTO projects  VALUES ('Alpha');

-- Supplier capabilities
INSERT INTO supplier_part VALUES
('S1', 'P1'),
('S1', 'P2'),
('S2', 'P2'),
('S2', 'P3');

-- Project uses parts P1 and P2
INSERT INTO project_part VALUES
('Alpha', 'P1'),
('Alpha', 'P2');






The following query is valid SQL:




SELECT DISTINCT
    sp.supplier_id,
    pp.project_id
FROM supplier_part sp
JOIN project_part pp
  ON sp.part_id = pp.part_id;






However, this query falls into the connection trap:




 supplier_id | project_id
-------------+------------
 S2          | Alpha
 S1          | Alpha
(2 rows)






As we defined only supplier–part and project–part relationships, any derived supplier–project relationship is not a fact but a relational composition. We know that Alpha uses P1 and P2, and that part P2 can be supplied by either S1 or S2, but we have no record of which supplier actually supplies Alpha.

This query asserts “Supplier S1 supplies project Alpha”, but the data only says: “S1 and S2 supply P2” and “Alpha uses P2”.

This is the connection trap, expressed purely in SQL.


  
  
  PostgreSQL — the correct relational solution


If a supplier actually supplies a part to a project, that fact must be represented directly. We need a new table:




CREATE TABLE supply (

    supplier_id TEXT,
    project_id  TEXT,
    part_id     TEXT,

    PRIMARY KEY (supplier_id, project_id, part_id),

    FOREIGN KEY (supplier_id, part_id)
        REFERENCES supplier_part (supplier_id, part_id),

    FOREIGN KEY (project_id, part_id)
        REFERENCES project_part (project_id, part_id)
);






These foreign keys encode subset constraints between relations and prevent inserting supplies of parts not supplied by the supplier or not used by the project.

This relation explicitly states who supplies what to which project. We assume that the real‑world fact is “Alpha gets part P2 from supplier S1”:




INSERT INTO supply VALUES
('S1', 'Alpha', 'P2');






The correct query reads from this relation:




SELECT supplier_id, project_id
FROM supply;

 supplier_id | project_id
-------------+------------
 S1          | Alpha
(1 row)






The relationship is now real and asserted, not inferred. In total, we have six tables:




postgres=# \d
             List of relations
 Schema |     Name      | Type  |  Owner
--------+---------------+-------+----------
 public | parts         | table | postgres
 public | project_part  | table | postgres
 public | projects      | table | postgres
 public | supplier_part | table | postgres
 public | suppliers     | table | postgres
 public | supply        | table | postgres
(6 rows)






In practice, you should either store the relationship explicitly or avoid claiming it exists. Although the relational model avoids pointers, it is still possible to join through an incorrect path, so the application must enforce the correct one.

In ad-hoc query environments such as data warehouses, data is typically organized into domains and modeled using a dimensional ("star schema") approach. Relationships like project–supplier are represented as fact tables within a single data mart, exposing only semantically valid join paths and preventing invalid joins.





  
  
  MongoDB — reproducing the connection trap


The following MongoDB data mirrors the PostgreSQL example. MongoDB allows representing relationships either as separate collections or by embedding, depending on the bounded context. Here we start with separate collections to mirror the relational model:




db.suppliers.insertMany([
  { _id: "S1" },
  { _id: "S2" }
]);

db.parts.insertMany([
  { _id: "P1" },
  { _id: "P2" },
  { _id: "P3" }
]);

db.projects.insertMany([
  { _id: "Alpha" }
]);

// Supplier capabilities
db.supplier_parts.insertMany([
  { supplier: "S1", part: "P1" },
  { supplier: "S1", part: "P2" },
  { supplier: "S2", part: "P2" },
  { supplier: "S2", part: "P3" }
]);

// Project uses parts P1 and P2
db.project_parts.insertMany([
  { project: "Alpha", part: "P1" },
  { project: "Alpha", part: "P2" }
]);






Using the simple find() API, we cannot fall into the trap directly because there is no implicit connection between suppliers and projects. The application must issue two independent queries and combine the results explicitly.

Simulating the connection trap in a single query therefore requires explicit composition at the application level:




const partsUsedByAlpha = db.project_parts.find(
  { project: "Alpha" },
  { _id: 0, part: 1 }
).toArray();

const suppliersForParts = db.supplier_parts.find(
  { part: { $in: partsUsedByAlpha.map(p => p.part) } },
  { _id: 0, supplier: 1, part: 1 }
).toArray();

const supplierProjectPairs = suppliersForParts.map(sp => ({
  supplier: sp.supplier,
  project: "Alpha"
}));

print(supplierProjectPairs);






When forced by the application logic, here is the connection trap associating suppliers and projects:




[
  { supplier: 'S1', project: 'Alpha' },
  { supplier: 'S1', project: 'Alpha' },
  { supplier: 'S2', project: 'Alpha' }
]






As with SQL joins, a $lookup in an aggregation pipeline can fall in the same connection trap:




db.supplier_parts.aggregate([
  {
    $lookup: {
      from: "project_parts",
      localField: "part",
      foreignField: "part",
      as: "projects"
    }
  },
  { $unwind: "$projects" },
  {
    $project: {
      _id: 0,
      supplier: "$supplier",
      project: "$projects.project"
    }
  }
]);






The result is similar and the projection removed the intermediate attributes:




{ "supplier": "S1", "project": "Alpha" }
{ "supplier": "S1", "project": "Alpha" }
{ "supplier": "S2", "project": "Alpha" }






We reproduced the connection trap by ignoring that $lookup produces a derived relationship, not a real one, and that matching keys does not carry business meaning.


  
  
  MongoDB — normalized solution


As with SQL, we can add an explicit supplies collection that stores the relationship between projects and suppliers:




db.supplies.insertOne({
  project: "Alpha",
  supplier: "S1",
  part: "P2"
});






Then we simply query this collection:




db.supplies.find(
  { project: "Alpha" },
  { _id: 0, supplier: 1, part: 1 }
);

[ { supplier: 'S1', part: 'P2' } ]







The document model is a superset of the relational model as relations can be stored as flat collections.  The difference is that referential integrity is enforced by the application rather than in-database. To enforce relationships in the database, they must be embedded as sub-documents and arrays.


  
  
  MongoDB — domain-driven solution


It's not the only solution in a document database, as we can store a schema based on the domain model rather than normalized. MongoDB allows representing this relationship as part of an aggregate. In a project‑centric bounded context, the project is the aggregate root, and the supplier information can be embedded as part of the supply fact:




db.projects.updateOne(
  { _id: "Alpha" },
  {
    $set: {
      parts: [
        { part: "P2", supplier: "S1" },
        { part: "P1", supplier: null }
      ]
    }
  },
  { upsert: true }
);






The query doesn't need a join and cannot fall into the connection trap:




db.projects.find(
  { _id: "Alpha" },
  { _id: 1, "parts.supplier": 1 }
);

[
  {
    _id: 'Alpha',
    parts: [ 
      { supplier: 'S1' },
      { supplier: null }
    ]
  }
]






This avoids the connection trap by construction. It may look like data duplication—the same supplier name may appear in multiple project documents—and indeed this would be undesirable in a fully normalized model shared across all business domains. However, this structure represents a valid aggregate within a bounded context.

In this context, the embedded supplier information is part of the supply fact, not a reference to a global supplier record. If a supplier’s name changes, it is a business decision, not a database decision, whether that change should be propagated to existing projects or whether historical data should retain the supplier name as it was at the time of supply.

Even when propagation is desired, MongoDB allows updating embedded data efficiently:




db.projects.updateMany(
  // document filter
  { "parts.supplier": "S1" },
  // document update using the array's item from array filter
  {
    $set: {
      "parts.$[p].supplier": "Supplier One"
    }
  },
  // array filter defining the array's item for the update
  {
    arrayFilters: [{ "p.supplier": "S1" }]
  }
);






This update is not atomic across documents, but each document update is atomic and the operation is idempotent and can be safely retried or executed within an explicit transaction if full atomicity is required.





  
  
  Conclusion


The connection trap occurs whenever relationships are inferred from shared keys, at query time, instead of being explicitly represented as facts, at write time. In SQL, this means introducing explicit association tables and enforcing integrity constraints, rather than deriving then though joins. In MongoDB, it means modeling relationships as explicit documents or embedded subdocuments rather than deriving them through lookups.

In a relational database, the schema is designed to be normalized and independent of specific use cases. All many‑to‑many and fact‑bearing relationships must be declared explicitly, and queries must follow the correct relational path. Referential and cardinality constraints are essential to restrict to only actual facts.

In MongoDB, the data model is typically driven by the domain and the application’s use cases. In a domain-driven design (DDD), strong relationships are modeled as aggregates, embedding related data directly within a document in MongoDB collections. This makes the intended semantics explicit and avoids inferring relationships at query time. Apparent duplication is not a flaw here, but a deliberate modeling choice within a bounded context.

Ultimately, the connection trap is not fully avoided by the data model, but can be a query-time error with joins and projections: deriving relationships that were never asserted. Whether using normalized relations or domain‑driven documents, the rule is the same—if a relationship is a fact, it must be stored as one.
                                    
                                    
                                    
                                    
                                        by Franck Pachot
                                    
                                    
                                
                            
                        
                    
                        
                            February 15, 2026
                            
                            
                                
                                    
                                        Small Datum - Mark Callaghan
                                    
                                    
                                        HammerDB tproc-c on a large server, Postgres and MySQL
                                    
                                    
                                    
                                        This has results for HammerDB tproc-c on a small server using MySQL and Postgres. I am new to HammerDB and still figuring out how to explain and present results so I will keep this simple and just share graphs without explaining the results.

The comparison might favor Postgres for the IO-bound workloads because I used smaller buffer pools than normal to avoid OOM. I have to do this because RSS for the HammerDB client grows over time as it buffers more response time stats. And while I used buffered IO for Postgres, I use O_DIRECT for InnoDB. So Postgres might have avoided some read IO thanks to the OS page cache while InnoDB did not.
tl;dr for MySQL
With vu=40 MySQL 8.4.8 uses about 2X more CPU per transaction and does more than 2X more context switches per transaction compared to Postgres 18.1. I will get CPU profiles soon.
Modern MySQL brings us great improvements to concurrency and too many new CPU overheads
MySQL 5.6 and 8.4 have similar throughput at the lowest concurrency (vu=10)
MySQl 8.4 is a lot faster than 5.6 at the highest concurrency (vu=40)
tl;dr for Postgres
Modern Postgres has regressions relative to old Postgres
The regressions increase with the warehouse count, at wh=4000 the NOPM drops between 3% and 13% depending on the virtual user count (vu).
tl;dr for Postgres vs MySQL
Postgres and MySQL have similar throughput for the largest warehouse count (wh=4000)
Otherwise Postgres gets between 1.4X and 2X more throughput (NOPM)
Builds, configuration and hardware

I compiled Postgres versions from source: 12.22, 13.23, 14.20, 15.15, 16.11, 17.7 and 18.1.

I compiled MySQL versions from source: 5.6.51, 5.7.44, 8.0.45, 8.4.8, 9.4.0 and 9.6.0.

I used a 48-core server from Hetzner
an ax162s with an AMD EPYC 9454P 48-Core Processor with SMT disabled
2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
128G RAM
Ubuntu 22.04 running the non-HWE kernel (5.5.0-118-generic)
Postgres configuration files:
prior to v18 the config file is named conf.diff.cx10a50g_c32r128 (x10a_c32r128) and is here for versions 12, 13, 14, 15, 16 and 17.
for Postgres 18 I used conf.diff.cx10b_c32r128 (x10b_c32r128) with io_method=sync to be similar to the config used for versions 12 through 17.
MySQL configuration files
prior to 9.6 the config file is named my.cnf.cz12a50g_c32r128 (z12a50g_c32r128 or z12a50g) and is here for versions 5.6, 5.7, 8.0 and 8.4
for 9.6 it is named my.cnf.cz13a50g_c32r128 (z13a50g_c32r128 or z13a50g) and is here
For both Postgres and MySQL fsync on commit is disabled to avoid turning this into an fsync benchmark. The server has 2 SSDs with SW RAID and low fsync latency.

Benchmark

The benchmark is tproc-c from HammerDB. The tproc-c benchmark is derived from TPC-C.

The benchmark was run for several workloads:
vu=10, wh=1000 - 10 virtual users, 1000 warehouses
vu=20, wh=1000 - 20 virtual users, 1000 warehouses
vu=40, wh=1000 - 40 virtual users, 1000 warehouses
vu=10, wh=2000 - 10 virtual users, 2000 warehouses
vu=20, wh=2000 - 20 virtual users, 2000 warehouses
vu=40, wh=2000 - 40 virtual users, 2000 warehouses
vu=10, wh=4000 - 10 virtual users, 4000 warehouses
vu=20, wh=4000 - 20 virtual users, 4000 warehouses
vu=40, wh=4000 - 40 virtual users, 4000 warehouses
The wh=1000 workloads are less heavy on IO. The wh=4000 workloads are more heavy on IO.

The benchmark for Postgres is run by a variant of this script which depends on scripts here. The MySQL scripts are similar.
stored procedures are enabled
partitioning is used because the warehouse count is >= 1000
a 5 minute rampup is used
then performance is measured for 60 minutes
Basic metrics: iostat

I am still improving my helper scripts to report various performance metrics. The table here has average values from iostat during the benchmark run phase for MySQL 8.4.8 and Postgres 18.1. For these configurations the NOPM values for Postgres and MySQL were similar so I won't present normalized values (average value / NOPM) and NOPM is throughput.
average wMB/s increases with the warehouse count for Postgres but not for MySQL
r/s increases with the warehouse count for Postgres and MySQL
iostat metrics
* r/s = average rate of reads/s from storage
* wMB/s = average MB/s written to storage

my8408
r/s     wMB/s
22833.0 906.2   vu=40, wh=1000
63079.8 1428.5  vu=40, wh=2000
82282.3 1398.2  vu=40, wh=4000

pg181
r/s     wMB/s
30394.9 1261.9  vu=40, wh=1000
59770.4 1267.8  vu=40, wh=2000
78052.3 1272.9  vu=40, wh=4000

Basic metrics: vmstat

I am still improving my helper scripts to report various performance metrics. The table here has average values from vmstat during the benchmark run phase for MySQL 8.4.8 and Postgres 18.1. For these configurations the NOPM values for Postgres and MySQL were similar so I won't present normalized values (average value / NOPM).
CPU utilization is almost 2X larger for MySQL
Context switch rates are more than 2X larger for MySQL
In the future I hope to learn why MySQL uses almost 2X more CPU per transaction and has more than 2X more context switches per transaction relative to Postgres
vmstat metrics
* cs - average value for cs (context switches/s)
* us - average value for us (user CPU)
* sy - average value for sy (system CPU)
* id - average value for id (idle)
* wa - average value for wa (waiting for IO)
* us+sy - sum of us and sy

my8408
cs      us      sy      id      wa      us+sy
455648  61.9    8.2     24.2    5.7     70.1    vu=40, wh=1000
484955  50.4    9.2     19.5    21.0    59.6    vu=40, wh=2000
487410  39.5    8.4     19.4    32.6    48.0    vu=40, wh=4000

pg181
cs      us      sy      id      wa      us+sy
127486  23.5    10.1    63.3    3.0     33.6    vu=40, wh=1000
166257  17.2    11.1    62.5    9.1     28.3    vu=40, wh=2000
203578  13.9    11.3    59.2    15.6    25.2    vu=40, wh=4000

Results

My analysis at this point is simple -- I only consider average throughput. Eventually I will examine throughput over time and efficiency (CPU and IO).

On the charts that follow y-axis does not start at 0 to improve readability at the risk of overstating the differences. The y-axis shows relative throughput. There might be a regression when the relative throughput is less than 1.0. There might be an improvement when it is > 1.0. The relative throughput is:
(NOPM for some-version / NOPM for base-version)
I provide three charts below:
only MySQL - base-version is MySQL 5.6.51
only Postgres - base-version is Postgres 12.22
Postgres vs MySQL - base-version is Postgres 18.1, some-version is MySQL 8.4.8
Results: MySQL 5.6 to 9.6
Legend:
my5651.z12a is MySQL 5.6.51 with the z12a50g config
my5744.z12a is MySQL 5.7.44 with the z12a50g config
my8045.z12a is MySQL 8.0.45 with the z12a50g config
my8408.z12a is MySQL 8.4.8 with the z12a50g config
my9500.z13a is MySQL 9.6.0 with the z13a50g config
Summary
At the lowest concurrency (vu=10) MySQL 8.4.8 has similar throughput as 5.6.51 because CPU regressions in modern MySQL offset the concurrency improvements.
At the highest concurrency (vu=40) MySQL 8.4.8 is much faster than 5.6.51 and the regressions after 5.7 are small. This matches what I have seen elsewhere -- while modern MySQL suffers from CPU regressions it benefits from concurrency improvements. Imagine if we could get those concurrency improvements without the CPU regressions.
And the absolute NOPM values are here:
my5651 my5744 my8045 my8408 my9600
vu=10, wh=1000 163059 183268 156039 155194 151748
vu=20, wh=1000 210506 321670 283282 281038 279269
vu=40, wh=1000 216677 454743 439589 435095 433618
vu=10, wh=2000 107492 130229 111798 110161 108386
vu=20, wh=2000 155398 225068 193658 190717 189847
vu=40, wh=2000 178278 302723 297236 307504 293217
vu=10, wh=4000 81242 103406 89414 89316 88458
vu=20, wh=4000 131241 179112 155134 152998 152301
vu=40, wh=4000 146809 228554 234922 229511 230557
Results: Postgres 12 to 18
Legend:
pg1222 is Postgres 12.22 with the x10a50g config
pg1323 is Postgres 13.23 with the x10a50g config
pg1420 is Postgres 14.20 with the x10a50g config
pg1515 is Postgres 15.15 with the x10a50g config
pg1611 is Postgres 16.11 with the x10a50g config
pg177 is Postgres 17.7 with the x10a50g config
pg181 is Postgres 18.1 with the x10b50g config
Summary
Modern Postgres has regressions relative to old Postgres
The regressions increase with the warehouse count, at wh=4000 the NOPM drops between 3% and 13% depending on the virtual user count (vu).

The relative NOPM values are here:

pg1222 pg1323 pg1420 pg1515 pg1611 pg177 pg181
vu=10, wh=1000 1.000 1.000 1.054 1.042 1.004 1.010 0.968
vu=20, wh=1000 1.000 1.035 1.037 1.028 1.028 1.001 0.997
vu=40, wh=1000 1.000 1.040 0.988 1.000 1.027 0.998 0.970
vu=10, wh=2000 1.000 1.026 1.059 1.075 1.068 1.081 1.029
vu=20, wh=2000 1.000 1.022 1.046 1.043 0.979 0.972 0.934
vu=40, wh=2000 1.000 1.014 1.032 1.036 0.979 1.010 0.947
vu=10, wh=4000 1.000 1.027 1.032 1.035 0.993 0.998 0.974
vu=20, wh=4000 1.000 1.005 1.049 1.048 0.940 0.927 0.876
vu=40, wh=4000 1.000 0.991 1.019 0.983 1.001 0.979 0.937
The absolute NOPM values are here:

pg1222 pg1323 pg1420 pg1515 pg1611 pg177 pg181
vu=10, wh=1000 353077 353048 372015 367933 354513 356469 341688
vu=20, wh=1000 423565 438456 439398 435454 435288 423986 422397
vu=40, wh=1000 445114 462851 439728 445144 457110 444364 431648
vu=10, wh=2000 223048 228914 236231 239868 238117 241185 229549
vu=20, wh=2000 314380 321380 328688 328044 307728 305452 293627
vu=40, wh=2000 320347 324769 330444 331896 313553 323454 303403
vu=10, wh=4000 162054 166461 167320 167761 160962 161716 157872
vu=20, wh=4000 244598 245804 256593 256231 230037 226844 214309
vu=40, wh=4000 252931 250634 257820 248584 253059 247610 236986
Results: MySQL vs Postgres
Legend:
pg181 is Postgres 18.1 with the x10b50g config
my8408 is MySQL 8.4.8 with the z12a50g config
Summary
Postgres and MySQL have similar throughput for the largest warehouse count (wh=4000)
Otherwise Postgres gets between 1.4X and 2X more throughput (NOPM)
The absolute NOPM values are here:

pg181 my8408
vu=10, wh=1000 341688 155194
vu=20, wh=1000 422397 281038
vu=40, wh=1000 431648 435095
vu=10, wh=2000 229549 110161
vu=20, wh=2000 293627 190717
vu=40, wh=2000 303403 307504
vu=10, wh=4000 157872 89316
vu=20, wh=4000 214309 152998

                                        by Mark Callaghan (noreply@blogger.com)
                                    
                                    
                                
                            
                                
                                    
                                        Murat Demirbas
                                    
                                    
                                        Butlers or Architects?
                                    
                                    
                                    
                                        In a recent viral post, Matt Shumer declares dramatically that we've crossed an irreversible threshold. He asserts that the latest AI models now exercise independent judgment, that he simply gives an AI plain-English instructions, steps away for a few hours, and returns to a flawlessly finished product that surpasses his own capabilities. In the near future, he claims, AI will autonomously handle all knowledge work and even build the next generation of AI itself, leaving human creators completely blindsided by the exponential curve.
This was a depressing read. The dramatic tone lands well. And by extrapolating from progress in the last six years, it's hard to argue against what AI might achieve in the next six.
I forwarded this to a friend of mine, who had the misfortune of reading it before bed. He told me he had a nightmare about it, dreaming of himself as an Uber driver, completely displaced from his high-tech career.

Someone on Twitter had a come back: "The thing I don't get is: Claude Code is writing 100% of Claude's code now. But Anthropic has 100+ open dev positions on their jobs page?" Boris Cherny of Anthropic replied: "The reality is that someone has to prompt the Claudes, talk to customers, coordinate with other teams, and decide what to build next. Engineering is changing, and great engineers are more important than ever."
This is strongly reminiscent of the Shell Game podcast I wrote about recently. And it connects to my arguments in "Agentic AI and The Mythical Agent-Month" about the mathematical laws of scaling coordination. Throwing thousands of AI agents at a project does not magically bypass Brooks' Law. Agents can dramatically scale the volume of code generated, but they do not scale insight. Coordination complexity and verification bottlenecks remain firmly in place. Until you solve the epistemic gap of distributed knowledge, adding more agents simply produces a faster, more expensive way to generate merge conflicts. Design, at its core, is still very human.
Trung Phan's recent piece on how Docusign still employs 7,000 people in the age of AI provides useful context as well. Complex organizations don't dissolve overnight. Societal constructs, institutional inertia, regulatory frameworks, and the deeply human texture of business relationships all act as buffers. The world changes slower than the benchmarks suggest.

So we are nowhere near a fully autonomous AI that sweeps up all knowledge work and solves everything. When we step back, two ways of reading the situation come into view.
The first is that we are all becoming butlers for LLMs: priming the model, feeding it context in careful portions, adding constraints, nudging tone, coaxing the trajectory. Then stepping back to watch it cook. We do the setup and it does the real work.
But as a perennial optimist, I think we are becoming architects. Deep work will not disappear, rather it will become the only work that matters. We get to design the blueprint, break down logic in high-level parts, set the vision, dictate strategy, and chart trajectory. We do the real thinking, and then we make the model grind.








In anyway, this shift brings a real danger. If we delegate execution, it becomes tempting to delegate thought gradually. LLMs make thinking feel optional. People were already reluctant to think; now they can bypass it entirely. It is unsettling to watch a statistical prediction machine stand in for reasoning. Humbling, too. Maybe we're not as special as we assumed. 
This reminds me Ted Chiang's story "Catching Crumbs from the Table" where humanity is reduced to interpreting the outputs of a vastly superior intellect. Human scientists no longer produce breakthroughs themselves; they spend their careers reverse-engineering discoveries made by "metahumans". The tragedy is that humans are no longer the source of the insight, they are merely trying to explain metahumans' genius. The title captures the feeling really well. We're not at the table anymore. We're just gathering what falls from it.
Even if things come to that, I know I'll keep thinking, keep learning, keep striving to build things. As I reflected in an earlier post on finding one's true calling, this pursuit of knowledge and creation is my dharma. That basic human drive to understand things and build things is not something an LLM can automate away. This I believe.


I recently launched a free email newsletter for the blog. Subscribe here to get these essays delivered to your inbox, along with behind-the-scenes commentary and curated links on distributed systems, technology, and other curiosities. 
                                    
                                    
                                    
                                    
                                        by Murat (noreply@blogger.com)
                                    
                                    
                                
                            
                                
                                    
                                        Phil Eaton
                                    
                                    
                                        Generating vector embeddings for semantic search locally
                                    
                                    
                                    
                                        This is an external post of mine. Click
here
if you are not redirected.
                                    
                                    
                                    
                                
                            
                        
                    
                    
                    
                    
                        
                        
                            ←
                        
                        
                        
                        
                        
                            
                                1
                            
                        
                            
                                2
                            
                        
                            
                                3
                            
                        
                            
                                4
                            
                        
                            
                                5
                            
                        
                            
                                6
                            
                        
                            
                                7
                            
                        
                            
                                8
                            
                        
                            
                                9
                            
                        
                            
                                10
                            
                        
                        
                        
                            
                                ...
                            
                            87
                        
                        
                        
                        
                            →
                        
                    
                    
                
            
        
    

    
        
            
                
                    notice a problem? @isamlambert
                
                
                    rss feed

dbms	l.i0	l.x	l.i1	l.i2	qr100	qp100	qr500	qp500	qr1000	qp1000
ma120300_rel_withdbg.cz12b_c8r32	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
ma120300_rel_withdbg.cz12c_c8r32	1.03	1.01	1.00	1.03	1.00	0.99	1.00	1.00	1.01	1.00
ma120300_rel_withdbg.cz12b_sync_c8r32	0.04	1.02	0.07	0.01	1.01	1.01	1.00	1.01	1.00	1.00
ma120300_rel_withdbg.cz12c_sync_c8r32	0.08	1.03	0.28	0.06	1.02	1.01	1.01	1.02	1.02	1.01

	my5651	my5744	my8045	my8408	my9600
vu=10, wh=1000	163059	183268	156039	155194	151748
vu=20, wh=1000	210506	321670	283282	281038	279269
vu=40, wh=1000	216677	454743	439589	435095	433618
vu=10, wh=2000	107492	130229	111798	110161	108386
vu=20, wh=2000	155398	225068	193658	190717	189847
vu=40, wh=2000	178278	302723	297236	307504	293217
vu=10, wh=4000	81242	103406	89414	89316	88458
vu=20, wh=4000	131241	179112	155134	152998	152301
vu=40, wh=4000	146809	228554	234922	229511	230557

	pg1222	pg1323	pg1420	pg1515	pg1611	pg177	pg181
vu=10, wh=1000	1.000	1.000	1.054	1.042	1.004	1.010	0.968
vu=20, wh=1000	1.000	1.035	1.037	1.028	1.028	1.001	0.997
vu=40, wh=1000	1.000	1.040	0.988	1.000	1.027	0.998	0.970
vu=10, wh=2000	1.000	1.026	1.059	1.075	1.068	1.081	1.029
vu=20, wh=2000	1.000	1.022	1.046	1.043	0.979	0.972	0.934
vu=40, wh=2000	1.000	1.014	1.032	1.036	0.979	1.010	0.947
vu=10, wh=4000	1.000	1.027	1.032	1.035	0.993	0.998	0.974
vu=20, wh=4000	1.000	1.005	1.049	1.048	0.940	0.927	0.876
vu=40, wh=4000	1.000	0.991	1.019	0.983	1.001	0.979	0.937

February 22, 2026

February 21, 2026

PostgreSQL: WAIT FOR LSN (PG19)

MongoDB: Causal Consistency

Conclusion

The Trap of Micro-Optimizations

Away From the Glowing Rectangle

February 19, 2026

MongoDB with search indexes

Dataset generation

Search index creation

Query and result

Performance breakdown (explain)

February 18, 2026

Basic Concept: Tickets

Lottery vs. Stride vs. CFS scheduling

Challenges in Proportional Sharing

February 17, 2026

February 16, 2026

PostgreSQL — reproducing the connection trap

PostgreSQL — the correct relational solution

MongoDB — reproducing the connection trap

MongoDB — normalized solution

MongoDB — domain-driven solution

Conclusion

February 15, 2026

PostgreSQL: `WAIT FOR LSN` (PG19)