November 12, 2024
November 09, 2024
Fixing some of the InnoDB scan perf regressions in a MySQL fork
I recently learned of Advanced MySQL, a MySQL fork, and ran my sysbench benchmarks for it. It fixed some, but not all, of the regressions for write heavy workloads that landed in InnoDB after MySQL 8.0.28.
In response to my results, the project lead filed a bug for performance regressions and then quickly came up with a diff. The bug in this case is for regressions that are most obvious during full table scans and the problems arrived in MySQL 8.0.29 and 8.0.30 -- see bug 111538 and this post. The bug is closed for upstream but the perf regressions remain so I am excited to see the community working to solve this problem.
tl;dr
- Advanced MySQL with the fix removes much of the regression in scan performance
I tried 4 builds
- my8028 - upstream MySQL 8.0.28
- my8040 - upstream MySQL 8.0.40
- my8040adv_pre - Advanced MySQL 8.0.40 without the fix (without d347cdb)
- my8040adv_post - Advanced MySQL 8.0.40 with the fix (at d347cdb)
- dell32
- Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX with 32-Cores, 2 m.2 SSD (each 2TB, RAID SW 0, ext4).
- ax162-s
- AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G RAM, Ubuntu 22.04 and ext4 on 2 NVMe devices with SW RAID 1. This is in the Hetzner cloud.
- bee
- Beelink SER 4700u with Ryzen 7 4700u, 16G RAM, Ubuntu 22.04 and ext4 on NVMe
Benchmark
- dell32 - 8 tables, 10M rows per table and 24 threads
- ax162-s - 8 tables, 10M rows per table and 40 threads
- bee - 1 table, 30M rows and 1 thread
- rQPS is: (QPS for my version / QPS for base version)
- base version is the QPS from MySQL 8.0.28
- my version is one of the other versions
- Summary
- QPS with the fix in Advanced MySQL is ~9% better than without the fix
- QPS with the fix in Advanced MySQL is ~2% better than my8040.
- I am not sure why my8040adv_pre did much worse than my8040
- QPS is ~18% larger with the fix in Advanced MySQL
- CPU overhead is ~15% smaller with the fix
- QPS is ~17% larger with the fix in Advanced MySQL
- CPU overhead is ~15% smaller with the fix
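The rQPS definition used above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `relative_qps` is hypothetical and the QPS values are made-up round numbers, not measured results.

```python
def relative_qps(qps_by_build: dict[str, float], base: str = "my8028") -> dict[str, float]:
    """Return rQPS = (QPS for my version / QPS for base version) per build."""
    base_qps = qps_by_build[base]
    if base_qps <= 0:
        raise ValueError("base QPS must be positive")
    # The base build is skipped because its rQPS is 1.0 by definition.
    return {build: qps / base_qps for build, qps in qps_by_build.items() if build != base}


# Hypothetical QPS values, for illustration only (not benchmark output).
example = {
    "my8028": 100.0,
    "my8040": 80.0,
    "my8040adv_pre": 75.0,
    "my8040adv_post": 82.0,
}

for build, rqps in relative_qps(example).items():
    print(f"{build}: rQPS = {rqps:.2f}")
```

An rQPS below 1.0 means the build is slower than MySQL 8.0.28 and a value above 1.0 means it is faster.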
RocksDB benchmarks: large server, leveled compaction
I shared benchmark results for RocksDB a few weeks ago for both leveled and universal compaction on a small server. This post has results from a large server with leveled compaction.
tl;dr
- there are a few regressions from bug 12038
- QPS for overwrite is ~1.5X to ~2X better in 9.x than 6.0 (ignoring bug 12038)
- otherwise QPS in 9.x is similar to 6.x
Hardware
The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.
- 6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
- 7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
- 8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
- 9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1 and 9.7.3
- fillseq -- load in key order with the WAL disabled
- revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
- fwdrangeww -- like revrangeww except do short forward range scans
- readww - like revrangeww except do point queries
- overwrite - do overwrites (Put) as fast as possible
There are three workloads, all of which use 40 threads:
- byrx - the database is cached by RocksDB (100M KV pairs)
- iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
- iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)
- fillseq is worse from 6.0 to 8.0 but stable since then
- overwrite has large improvements late in 6.0 and small improvements since then
- fwdrangeww has small improvements in early 7.0 and is stable since then
- revrangeww and readww are stable from 6.0 through 9.x
- bug 12038 explains the drop in throughput for overwrite since 8.6.7
- otherwise QPS in 9.x is similar to 6.0
- the QPS drop for overwrite in 8.6.7 occurs because the db_bench client wasn't updated to use the new default value for compaction readahead size
- QPS for overwrite is ~2X better in 9.x relative to 6.0
- otherwise QPS in 9.x is similar to 6.0
Efficient MySQL Performance In 10 Sentences
Don’t have time to read Efficient MySQL Performance? Here’s the book (10 chapters) in one-liners.
- Performance is query response time.
- Proper left-most indexing is required for performance.
- The less data, the better.
- Access patterns (part of the workload) help or hinder performance.
- Sharding is how to scale writes when single-node performance is truly reached.
- Server metrics reflect how the app workload causes MySQL to work.
- Replication lag is data loss.
- Locks are held until a transaction commits, so commit quickly.
- There are many other challenges that you might need to address—sorry.
- MySQL in the cloud is slower and more expensive, so performance is more important than ever.
PSA: Most databases do not do checksums by default
PSA: SQLite does not do checksums
November 07, 2024
Introducing sharding on PlanetScale with workflows
November 06, 2024
Application Architecture: Combining DynamoDB and Tinybird
RocksDB on a big server: LRU vs hyperclock
This has benchmark results for RocksDB using a big (48-core) server. I ran tests to document the impact of the block cache type (LRU vs hyperclock) and a few other configuration choices for a CPU-bound workload. A previous post with great results for the hyperclock block cache is here.
tl;dr
- read QPS is up to ~3X better with auto_hyper_clock_cache vs LRU
- read QPS is up to ~1.3X better with the per-level fanout set to 32 vs 8
- read QPS drops by ~15% as the background write rate increases from 2 to 32 M/s
I used RocksDB 9.6, compiled with gcc 11.4.0.
Hardware
The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.
Benchmark
Overviews on how I use db_bench are here and here.
All of my tests here use a CPU-bound workload with a database that is cached by RocksDB and are repeated for 1, 10, 20 and 40 threads.
I focus on the readwhilewriting benchmark where performance is reported for the reads (point queries) while there is a fixed rate for writes done in the background. I prefer to measure read performance when there are concurrent writes because read-only benchmarks with an LSM suffer from non-determinism as the state (shape) of the LSM tree has a large impact on CPU overhead and throughput.
To save time I did not run the fwdrangewhilewriting benchmark. Were I to repeat this work I would include it because the results from it would be interesting for a few of the configuration options I compared.
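As a rough sketch of how such a test matrix might be enumerated, the Python below builds one db_bench command line per combination of thread count, cache type, fanout and background write rate. Several things here are assumptions rather than the setup actually used: the `./db_bench` path, `lru_cache` as the `--cache_type` value for LRU, converting MB/s to bytes per second with 1024 * 1024, and running the full cross product. Options for database size, cache size and test duration are omitted.

```python
from itertools import product

THREADS = [1, 10, 20, 40]
CACHE_TYPES = ["lru_cache", "auto_hyper_clock_cache"]  # "lru_cache" is an assumed value
FANOUTS = [8, 32]
WRITE_RATES_MB = [2, 8, 32]


def db_bench_args(threads: int, cache_type: str, fanout: int, write_mb_per_sec: int) -> list[str]:
    """Build an argument list for one readwhilewriting run (nothing is executed)."""
    return [
        "./db_bench",  # assumed location of the binary
        "--benchmarks=readwhilewriting",
        f"--threads={threads}",
        f"--cache_type={cache_type}",
        f"--max_bytes_for_level_multiplier={fanout}",
        # Assumes the limit is expressed in bytes per second.
        f"--benchmark_write_rate_limit={write_mb_per_sec * 1024 * 1024}",
    ]


def test_matrix() -> list[list[str]]:
    """Enumerate every combination of the four dimensions."""
    combos = product(THREADS, CACHE_TYPES, FANOUTS, WRITE_RATES_MB)
    return [db_bench_args(*combo) for combo in combos]


if __name__ == "__main__":
    matrix = test_matrix()
    print(len(matrix), "configurations")
    print(" ".join(matrix[0]))
```

Keeping the argument builder separate from the enumeration makes it easy to drop a dimension, or to add one such as `--level0_file_num_compaction_trigger`.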
I did tests to understand the following:
- LRU vs auto_hyper_clock_cache for the block cache implementation
- LRU is the original implementation. The code was simple, which is nice. The implementation for LRU is sharded with a mutex per shard and that mutex can become a hot spot. The hyperclock implementation is much better at avoiding hot spots.
- per level fanout (8 vs 32)
- By per level fanout I mean the value of --max_bytes_for_level_multiplier which determines the target size difference between adjacent levels. By default I use 8, while 10 is also a common choice. Here I compare 8 vs 32. When the fanout is larger the LSM tree has fewer levels -- meaning there are fewer places to check for data which should reduce CPU overhead and increase QPS.
- background write rate
- I repeated tests with the background write rate (--benchmark_write_rate_limit) set to 2, 8 and 32 MB/s. With a higher write rate there is more chance for interference between reads and writes. The interference might be from mutex contention, compaction threads using more CPU, more L0 files to check or more data in levels L1 and larger.
- target size for L0
- By target size I mean the number of files in the L0 that trigger compaction. The db_bench option for this is --level0_file_num_compaction_trigger. When the value is larger there will be more L0 files on average that a query might have to check and that means there is more CPU overhead. Unfortunately, I configured RocksDB incorrectly so I don't have results to share. The issue is that when the L0 is configured to be larger, the L1 should be configured to be at least as large as the L0 (L1 target size should be >= sizeof(SST) * num(L0 files)). If not, then L0->L1 compaction will happen sooner than expected.
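Two of the points above, that a larger fanout means fewer levels and that the L1 target should be at least as large as the L0, can be illustrated with a small Python sketch. It uses a simplified model in which each level's target is the previous level's target times the fanout and the largest level must be able to hold the whole database. The function names and all sizes are hypothetical, chosen only to make the arithmetic easy to follow.

```python
MiB = 1024 ** 2
GiB = 1024 ** 3


def levels_needed(db_bytes: int, l1_target_bytes: int, fanout: int) -> int:
    """Count levels L1..Ln until the largest level's target covers the database."""
    if db_bytes <= 0 or l1_target_bytes <= 0 or fanout < 2:
        raise ValueError("sizes must be positive and fanout must be >= 2")
    levels = 1
    capacity = l1_target_bytes
    while capacity < db_bytes:
        capacity *= fanout  # the next level is fanout times larger
        levels += 1
    return levels


def l1_large_enough(l1_target_bytes: int, sst_bytes: int, l0_trigger: int) -> bool:
    """Check that L1 target size >= sizeof(SST) * num(L0 files)."""
    return l1_target_bytes >= sst_bytes * l0_trigger


# Hypothetical sizes, for illustration only.
db_size = 64 * GiB
l1_target = 256 * MiB
for fanout in (8, 32):
    print(f"fanout={fanout}: {levels_needed(db_size, l1_target, fanout)} levels")

sst_size = 64 * MiB
for trigger in (4, 16):
    ok = l1_large_enough(l1_target, sst_size, trigger)
    print(f"L0 trigger={trigger}: L1 large enough = {ok}")
```

With these made-up sizes, raising the L0 trigger from 4 to 16 without also raising the L1 target fails the check, which is the kind of misconfiguration described above.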
These graphs have QPS from the readwhilewriting benchmark for the LRU and AHCC block cache implementations where LRU is the original version with a sharded hash table and a mutex per shard while AHCC is the hyper clock cache (--cache_type=auto_hyper_clock_cache).
- QPS is much better with AHCC than LRU (~3.3X faster at 40 threads)
- QPS with AHCC scales linearly with the thread count
- QPS with LRU does not scale linearly and suffers from mutex contention
- There are some odd effects in the results for 1 thread
- QPS is often 1.1X to 1.3X larger with fanout=32 vs fanout=8
With an 8M/s background write rate and LRU, fanout=8 is faster at 1 thread but then fanout=32 is from 1.1X to 1.3X faster at 10 to 40 threads.
With a 32M/s background write rate and LRU, fanout=8 is ~2X faster at 1 thread but then fanout=32 is from 1.1X to 1.2X faster at 10 to 40 threads.
- With LRU
- QPS drops by up to ~15% as the background write rate grows from 2M/s to 32M/s
- QPS does not scale linearly and suffers from mutex contention
- With AHCC
- QPS drops by up to 13% as the background write rate grows from 2M/s to 32M/s
- QPS scales linearly with the thread count
- There are some odd effects in the results for 1 thread
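One way to put a number on "scales linearly" is to compare per-thread QPS at each thread count with per-thread QPS at a baseline thread count. The sketch below does that in Python. The function name is hypothetical and the QPS values are invented for illustration, not taken from the graphs. The baseline here is 10 threads rather than 1, as an assumption motivated by the odd 1-thread results.

```python
def scaling_efficiency(qps_by_threads: dict[int, float], base_threads: int) -> dict[int, float]:
    """Per-thread QPS at each thread count divided by per-thread QPS at the baseline.

    A value near 1.0 means QPS grows in proportion to the thread count.
    """
    base_per_thread = qps_by_threads[base_threads] / base_threads
    return {
        threads: (qps / threads) / base_per_thread
        for threads, qps in sorted(qps_by_threads.items())
    }


# Hypothetical QPS values, for illustration only (not benchmark output).
ahcc = {10: 1000.0, 20: 2000.0, 40: 3900.0}
lru = {10: 1000.0, 20: 1500.0, 40: 1200.0}

for name, results in (("AHCC", ahcc), ("LRU", lru)):
    eff = scaling_efficiency(results, base_threads=10)
    print(name, {threads: round(value, 2) for threads, value in eff.items()})
```

An efficiency that falls well below 1.0 as threads are added is what contention on a shared resource, such as a per-shard mutex, tends to look like in this kind of summary.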
Exploring Postgres's arena allocator by writing an HTTP server from scratch
This is an external post of mine.
How to Learn: Userland Disk I/O
November 05, 2024
Effective unemployment and social media
Being unemployed can be incredibly depressing. So much rejection. Everything seems to be out of your control. Everything except for one thing: what you produce.
You might know that repeatedly posting on social media that you are looking for work is ineffective. That it looks (or at least feels) worse each time you say so. But there is at least one major caveat to this.
Every single time you create something and share it publicly is a chance to also reiterate that you are looking for work. And people actually appreciate and value this!
Whether you write a blog post or build some project, you are seen as working on yourself and contributing to the community. Positive things! And it is no problem at all to mention with each new post you write and each new project you publish that you are also looking for work.
Moreover, dynamics of the internet and social media basically require that you be regularly producing something new. Either regularly producing a new version of some existing project or regularly producing new projects (or blog posts) entirely.
What you did a week ago is old news on social media. What will you do next week?
This could itself feel depressing, except that it's probably a fairly healthy thing for you anyway! It is a motivation to keep your skills sharp as time goes on.
So while you're unemployed and able to muster the motivation, write about things that are interesting to you! Build projects that intrigue you. Leave a little note on every post and project that you are looking for work. And share every post and project on social media.
You'll expose yourself to opportunities and referrals. And even if no post or project "takes off" you will still be working on yourself and contributing back knowledge to the community.
I wrote a short post on some ideas for effective unemployment and social media. https://t.co/jmiJCOe2Nk pic.twitter.com/pK9AySNdHR
— Phil Eaton (@eatonphil) November 5, 2024
Optimizing query planning in Vitess: a step-by-step approach
November 04, 2024
RocksDB benchmarks: small server, universal compaction
I shared benchmark results for RocksDB a few weeks ago using leveled compaction and a small server. Here I have results for universal compaction and the same small server.
tl;dr
- in general there are some improvements and some small regressions with one exception (see bug 12038)
- for a cached database
- From RocksDB 6.0.2 to 9.x QPS drops by ~10% for fillseq and ~15% for other tests
- Performance has been stable since 7.x
- for an IO-bound database with buffered IO
- bug 12038 hurts QPS for overwrite (will be fixed soon in 9.7)
- QPS is otherwise stable
- for an IO-bound database with O_DIRECT
- QPS for fillseq and overwrite is ~10% less in 9.7 vs 6.0.2 and has been stable since 7.0
- QPS for read-heavy tests is ~5% better in RocksDB 9.7.2 vs 6.0.2
- 6.x - 6.0.2, 6.10.4, 6.20.4, 6.29.5
- 7.x - 7.0.4, 7.3.2, 7.6.0, 7.10.2
- 8.x - 8.0.0, 8.3.3, 8.6.7, 8.9.2, 8.11.4
- 9.x - 9.0.1, 9.1.2, 9.2.2, 9.3.2, 9.4.1, 9.5.2, 9.6.1, 9.6.2, 9.7.2, 9.7.4 and 9.8.1
- fillseq -- load in key order with the WAL disabled
- revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
- fwdrangeww -- like revrangeww except do short forward range scans
- readww - like revrangeww except do point queries
- overwrite - do overwrites (Put) as fast as possible
There are three workloads, all of which use one client (thread):
- byrx - the database is cached by RocksDB
- iobuf - the database is larger than memory and RocksDB uses buffered IO
- iodir - the database is larger than memory and RocksDB uses O_DIRECT
The charts show the QPS for a given version relative to RocksDB 6.0.2. There are two charts with the same data and the y-axis on the second doesn't start at 0 to improve readability.
- bug 12038 explains the regression for overwrite (fixed soon in 9.7)
- QPS for fillseq has been stable
- QPS for revrangeww, fwdrangeww and readww is stable. I am not sure about the variance in 9.6 and 9.7 releases. The cause might be that universal (tiered) is more prone to variance. I will revisit that when I run tests again in a few months.