October 22, 2025
Hybrid Search in PostgreSQL: The Missing Manual
October 21, 2025
Monitoring multithreaded replication in Amazon RDS for MySQL, Amazon RDS for MariaDB, and Aurora MySQL
Overview and best practices of multithreaded replication in Amazon RDS for MySQL, Amazon RDS for MariaDB, and Amazon Aurora MySQL
October 20, 2025
Determine how much concurrency to use on a benchmark for small, medium and large servers
What I describe here works for me given my goal, which is to find performance regressions. A benchmark run at low concurrency is used to find regressions from CPU overhead. A benchmark run at high concurrency is used to find regressions from mutex contention. A benchmark run at medium concurrency might help find both.
My informal way of classifying servers by size:
- small - has less than 10 cores
- medium - has between 10 and 20 cores
- large - has more than 20 cores
I almost always co-locate the benchmark clients and the DBMS on the same server. This comes at a cost (less CPU and RAM are available for the DBMS) and might create odd artifacts because clients in the real world are usually not co-located. But it has benefits that matter to me. First, I don't have to worry about variance from changes in network latency. Second, it is much easier to set up.
Even for benchmarks where some steps have IO waits, I still limit the amount of concurrency unless all of the steps that I measure have IO waits.
Finally, I usually set the benchmark concurrency level to be less than the number of CPU cores because I want to leave some cores for the DBMS to do the important background work, which is mostly MVCC garbage collection -- MyRocks compaction, InnoDB purge and dirty page writeback, Postgres vacuum.
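Here is a minimal sketch, in Python, of how these rules of thumb could be turned into concrete client counts. The size classes match the list above; the reserved core count and the low/medium/high fractions are illustrative assumptions, not exact values from my benchmarks.

# Sketch: pick benchmark concurrency levels from the CPU core count.
# The size classes match the post; the reserved core count and the
# low/medium/high fractions are illustrative assumptions.

def server_size(cores: int) -> str:
    if cores < 10:
        return "small"
    if cores <= 20:
        return "medium"
    return "large"

def concurrency_levels(cores: int, reserved_for_dbms: int = 4) -> dict:
    # Leave some cores for DBMS background work (compaction, purge, vacuum).
    usable = max(1, cores - reserved_for_dbms)
    return {
        "low": max(1, usable // 4),     # targets CPU-overhead regressions
        "medium": max(1, usable // 2),  # might catch both kinds
        "high": usable,                 # targets mutex-contention regressions
    }

print(server_size(32), concurrency_levels(32))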
October 16, 2025
Why is RocksDB spending so much time handling page faults?
This week I was running benchmarks to understand how fast RocksDB can do IO, and then compared that to fio to measure the CPU overhead added by RocksDB. While looking at flamegraphs taken during the benchmark, I was surprised that about 20% of the samples were from page fault handling.
The lesson here is to run your benchmark long enough to reach a steady state before you measure things, or there will be confusion -- and I was definitely confused when I first saw this. Perhaps this post saves time for the next person who spots it.
The workload is db_bench with a database that is much larger than memory, using read-only microbenchmarks for point lookups and range scans.
Then I wondered if this was a transient issue that occurs while RocksDB is warming up the block cache and growing process RSS until the block cache has been fully allocated.
While b-trees as used by Postgres and MySQL do a large allocation at process start, RocksDB does an allocation per block read, and when the block is evicted that allocation is freed. This can be a stress test for a memory allocator, which is why jemalloc and tcmalloc work better than glibc malloc for RocksDB. I revisit the mallocator topic every few years and my most recent post is here.
In this case I use RocksDB with jemalloc. Even though per-block allocations are transient, the memory used by jemalloc is mostly not transient. While there are cases where jemalloc can return memory to the OS, with my usage that is unlikely to happen.
If I let the benchmark run long enough, jemalloc would eventually stop asking the OS for more memory. However, my tests ran for about 10 minutes and did about 10,000 block reads per second, while RocksDB was configured with a block cache of at least 36G and a block size of 8kb. So my tests weren't running long enough for the block cache to fill, which means that during the measurement period:
- jemalloc was still asking for memory
- block cache eviction wasn't needed and after each block read a new entry was added to the block cache
When I run the benchmark for more time, the CPU overhead from page fault handling goes away.
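To make the "not long enough for the block cache to fill" reasoning concrete, here is a small Python sketch that estimates the minimum warm-up time, assuming every block read inserts a new entry into the cache and nothing is evicted. The inputs are the approximate numbers from this post; the cache size quoted is a lower bound and the read rate is an estimate, so treat the output as an order-of-magnitude figure rather than a precise warm-up time.

# Sketch: minimum time to fully populate the block cache, assuming each
# block read inserts a new entry and nothing is evicted during warm-up.
# Inputs are approximate values from this post; the cache size is a lower
# bound ("at least 36G"), so the real warm-up can take longer.
GiB = 1024 ** 3
KiB = 1024

cache_bytes = 36 * GiB      # block cache size (lower bound)
block_bytes = 8 * KiB       # RocksDB block size used here
reads_per_sec = 10_000      # approximate block reads per second

warmup_secs = cache_bytes / (block_bytes * reads_per_sec)
print(f"~{warmup_secs:.0f} seconds (~{warmup_secs / 60:.0f} minutes) at minimum")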
ClickHouse® CREATE TABLE example: Follow these steps
ClickHouse® common table expression (CTE) example: how to use WITH in ClickHouse
ClickHouse® ReplacingMergeTree engine: Examples and use cases
ClickHouse® left join examples: step-by-step guide for ClickHouse
ClickHouse materialized view example: how to materialize data in ClickHouse®
Snap, Inc. Launches Snap Cloud, Powered by Supabase
October 14, 2025
Security Advisory: CVE Affecting Percona Monitoring and Management (PMM)
Amazon Aurora MySQL zero-ETL integration with Amazon SageMaker Lakehouse
Is it time for TPC-BLOB?
If you want to store vectors in your database, then what you store as a row, KV pair or document is likely to be larger than the fixed page size (when your DBMS uses fixed-size pages) and you will soon care about efficient and performant support for large objects. I assume this support hasn't been the top priority for many DBMS implementations, so there will be some performance bugs.
In a SQL DBMS, support for large objects will use the plumbing created to handle LOB (Large OBject) datatypes. We should define what the L in LOB means here; I will wave my hands and claim larger than a fixed-size page in your favorite DBMS but smaller than 512kb, because I limit my focus to online workloads.
Perhaps now is the time for industry standard benchmarks for workloads with large objects. Should it be TPC-LOB or TPC-BLOB?
Most popular DBMS use fixed-size pages, whether storage is index-organized via an update-in-place b-tree (InnoDB) or heap-organized (Postgres, Oracle). For rows that are larger than the page size, which is usually between 4kb and 16kb, the entire row or the largest columns will be stored out of line and likely split across several pages in the out-of-line storage. When the row is read, additional reads are done to gather all of the too-large parts from their out-of-line locations.
This approach is far from optimal as there will be more CPU overhead, more random IO and possibly more wasted space. But it was good enough because support for LOBs wasn't a priority for these DBMS, as their focus was on OLTP where rows were likely to be smaller than a fixed-size page.
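Here is a back-of-the-envelope sketch, in Python, of that read amplification: it estimates how many pages an out-of-line column occupies and how many page reads a point lookup might need. The usable fraction of a page, the b-tree height and the example sizes are illustrative assumptions, not constants from any particular DBMS.

import math

# Sketch: estimate page reads for a row whose large column is stored
# out of line and split across fixed-size pages. The usable fraction of
# a page, the b-tree height and the example sizes are illustrative.

def pages_for_lob(lob_bytes: int, page_bytes: int = 8 * 1024,
                  usable_fraction: float = 0.9) -> int:
    usable = int(page_bytes * usable_fraction)   # space left after page headers
    return math.ceil(lob_bytes / usable)

def reads_for_point_lookup(lob_bytes: int, btree_height: int = 3) -> int:
    # Traverse the index to the base row, then fetch every out-of-line page.
    return btree_height + pages_for_lob(lob_bytes)

for size_kb in (16, 64, 256, 512):
    print(f"{size_kb:>4} KB LOB -> ~{reads_for_point_lookup(size_kb * 1024)} page reads")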
Perhaps by luck, perhaps it was fate, but WiredTiger is a great fit for MongoDB because it is more flexible about page sizes. And it is more flexible because it isn't an update-in-place b-tree; instead it is a copy-on-write random (CoW-R) b-tree that doesn't need or use out-of-line storage, although for extra-large documents there might be a benefit from it.
MyRocks, and other LSM-based DBMS, also don't require out-of-line storage but they can benefit from it as shown by WiscKey and other engines that do key-value separation. Even the mighty RocksDB has an implementation of key-value separation via BlobDB.
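For readers unfamiliar with key-value separation, here is a minimal Python sketch of the idea behind WiscKey and BlobDB, assuming a simple size threshold: large values are appended to a blob log and the index stores a small pointer instead. It illustrates the technique only; it is not the RocksDB BlobDB API.

# Sketch of key-value separation: values above a threshold go to a blob
# log and the index stores (offset, length) instead of the value itself.
# This is an illustration of the idea, not the RocksDB BlobDB API.

BLOB_THRESHOLD = 4 * 1024  # bytes; illustrative

class KVSeparatedStore:
    def __init__(self):
        self.index = {}        # stands in for the LSM tree
        self.blob_log = bytearray()

    def put(self, key: bytes, value: bytes) -> None:
        if len(value) >= BLOB_THRESHOLD:
            offset = len(self.blob_log)
            self.blob_log += value
            self.index[key] = ("blob", offset, len(value))
        else:
            self.index[key] = ("inline", value)

    def get(self, key: bytes) -> bytes:
        entry = self.index[key]
        if entry[0] == "inline":
            return entry[1]
        _, offset, length = entry
        return bytes(self.blob_log[offset:offset + length])

store = KVSeparatedStore()
store.put(b"k1", b"x" * 100)        # small value stays inline
store.put(b"k2", b"y" * 100_000)    # large value goes to the blob log
assert store.get(b"k2") == b"y" * 100_000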
Benchmarking Postgres 17 vs 18
Here are some ClickHouse® Cloud alternatives to consider
ClickHouse® vs BigQuery for real-time analytics
ClickHouse® vs Databricks: Architecture, performance, cost & real-time analytics
ClickHouse® vs Druid: A battle between two solid real-time analytics engines
ClickHouse® vs Firebolt for real-time data warehousing
ClickHouse® vs PostgreSQL (with extensions)
ClickHouse® vs Snowflake: Speed, pricing, features, migration
October 13, 2025
Postgres 18.0 vs sysbench on a 32-core server
This is yet another great result for Postgres 18.0 vs sysbench. This time I used a 32-core server. Results for a 24-core server are here. The goal for this benchmark is to check for regressions from new CPU overhead and mutex contention.
I repeated the benchmark twice because I had some uncertainty about platform variance (HW and SW) on the first run.
tl;dr, from Postgres 17.6 to 18.0
- There might be regressions from 17.6 to 18.0 but they are small (usually <= 3%)
tl;dr, from Postgres 12.22 through 18.0
- the hot-points test is almost 2X faster starting in 17.6
- scan is ~1.2X faster starting in 14.19
- all write tests are much faster starting in 17.6
For 18.0 I tried 3 configuration files:
- conf.diff.cx10b_c32r128 (x10b) - uses io_method=sync
- conf.diff.cx10c_c32r128 (x10c) - uses io_method=worker
- conf.diff.cx10d_c32r128 (x10d) - uses io_method=io_uring
Benchmark
The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.
The benchmark is run with 24 clients and 8 tables with 10M rows per table. The purpose is to search for regressions from new CPU overhead and mutex contention.
In the second run, all results were collected within 7 days and I am less concerned about variance there.
I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for base version)
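A tiny example with hypothetical numbers, just to fix the direction of the ratio: values below 1.0 indicate a regression relative to the base version and values above 1.0 indicate an improvement.

# Hypothetical numbers: relative QPS below 1.0 means a regression.
qps_base, qps_new = 1000.0, 970.0
print(qps_new / qps_base)   # 0.97 -> roughly a 3% regression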
I present results for:
- versions 12 through 18 using 12.22 as the base version
- versions 17.6 and 18.0 using 17.6 as the base version
Results from Postgres 17.6 to 18.0:
- 18.0 looks better relative to 17.6 in the second run and I explain my uncertainty about the first run above
- But I am skeptical about the great result for 18.0 on the full scan test (scan_range=100) in the second run. That might be variance induced by vacuum.
- There might be regressions from 17.6 to 18.0 but they are small (usually <= 3%)
- The small regression in read-only_range=10 might be from new optimizer overhead, because it doesn't reproduce when the length of the range query is increased -- see read-only_range=100 and read-only_range=10000.
Results from Postgres 12.22 through 18.0:
- the hot-points test is almost 2X faster starting in 17.6
- scan is ~1.2X faster starting in 14.19
- all write tests are much faster starting in 17.6
Copy-and-Patch: How It Works
Copy-and-Patch: A Copy-and-Patch Tutorial
October 11, 2025
Geoblocking Multiple Localities With Nginx
A few months back I wound up concluding, based on conversations with Ofcom, that aphyr.com might be illegal in the UK due to the UK Online Safety Act. I wrote a short tutorial on geoblocking a single country using Nginx on Debian.
Now Mississippi’s 2024 HB 1126 has made it illegal for essentially any web site to know a user’s e-mail address, or other “personal identifying information”, unless that site also takes steps to “verify the age of the person creating an account”. Bluesky wound up geoblocking Mississippi. Over on a small forum I help run, we paid our lawyers to look into HB 1126, and the conclusion was that we were likely in the same boat. Collecting email addresses put us in scope of the bill, and it wasn’t clear whether the LLC would shield officers (hi) from personal liability.
This blog has the same problem: people use email addresses to post and confirm their comments. I think my personal blog is probably at low risk, but a.) I’d like to draw attention to this legislation, and b.) my risk is elevated by being gay online, and having written and called a whole bunch of Mississippi legislators about HB 1126. Long story short, I’d like to block both a country and an individual state. Here’s how:
First, set up geoipupdate as before. Then, in /etc/nginx/conf.d/geoblock.conf, pull in the country and city databases, and map the countries and states you’d like to block to short strings explaining the applicable law. This creates variables $geoblock_country_law and $geoblock_state_law.
geoip2 /var/lib/GeoIP/GeoLite2-Country.mmdb {
    $geoip2_data_country_iso_code country iso_code;
}

geoip2 /var/lib/GeoIP/GeoLite2-City.mmdb {
    $geoip2_data_state_name subdivisions 0 names en;
}

map $geoip2_data_country_iso_code $geoblock_country_law {
    GB      "the UK Online Safety Act";
    default "";
}

map $geoip2_data_state_name $geoblock_state_law {
    Mississippi "Mississippi HB 1126";
    default     "";
}
Create an HTML page to show to geoblocked IPs. I’ve put mine in /var/www/custom_errors/451.html. The special comments here are Server-Side Include (SSI) directives; they’ll insert the contents of the $geoblock_law variable from nginx, which we’ll set shortly.
<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Unavailable Due to
      <!--# echo var="geoblock_law" default=""-->
    </title>
  </head>
  <body>
    <h1>Unavailable Due to
      <!--# echo var="geoblock_law" default=""-->
    </h1>
  </body>
</html>
Then, in /etc/nginx/sites-enabled/whatever.conf, add an error page for status code 451 (unavailable for legal reasons). In the main location block, check the $geoblock_country_law and $geoblock_state_law variables, and use them to return status 451, and set the $geoblock_law variable for the SSI template:
server {
    ...

    # Status 451 renders this page
    error_page 451 /451.html;
    location /451.html {
        ssi on;
        internal;
        root /var/www/custom_errors/;
    }

    location / {
        # If either geoblock variable is set, return status 451
        if ($geoblock_state_law != "") {
            set $geoblock_law $geoblock_state_law;
            return 451;
        }
        if ($geoblock_country_law != "") {
            set $geoblock_law $geoblock_country_law;
            return 451;
        }
    }
}
Test with nginx -t, and reload with service nginx reload, as usual.
Geoblocking is a bad experience in general. In Amsterdam and Frankfurt, I’ve seen my cell phone’s 5G connection and hotel WiFi improperly identified as being in the UK. I’m certain this is going to block people who aren’t in Mississippi, too. If you don’t want to live in this world either, start calling your representatives to demand better legislation.