January 29, 2025
A guide to Tinybird's new pricing model
Edit for clarity
I have the fortune to review a few important blog posts every year and the biggest value I add is to call out sentences or sections that make no sense. It is quite simple and you can do it too.
Without clarity only those at your company in marketing and sales (whose job it is to work with what they get) will give you the courtesy of a cursory read and a like on LinkedIn. This is all that most corporate writing achieves. It is the norm and it is understandable.
But if you want to reach an audience beyond those folks, you have to make sure you're not writing nonsense. And you, as reviewer and editor, have the chance to call out nonsense if you can get yourself to recognize it.
Immune to nonsense
But especially when editing blog posts at work, it is easy to gloss over things that make no sense because we are so constantly bombarded by things that make no sense. Maybe it's buzzwords or cliches, or simply lack of rapport. We become immune to nonsense.
And even worse, without care, as we become more experienced, we become more fearful to say "I have no idea what you are talking about". We're afraid to look incompetent by admitting our confusion. This fear is understandable, but is itself stupid. And I will trust you to deal with this on your own.
Read it out loud
So as you review a post, read it out loud to yourself. And if you find yourself saying "what on earth are you talking about", add that as a comment as gently as you feel you should. It is not offensive to say this (depending on how you say it). It is surely the case that the author did not know they were making no sense. It is worse to not mention your confusion and allow the author to look like an idiot or a bore.
Once you can call out what does not make sense to you, then read the post again and consider what would not make sense to someone without the context you have. Someone outside your company. Of course you need to make assumptions about the audience to a degree. It is likely your customers or prospects you have in mind. Not your friends or family.
With the audience you have in mind, would what you're reading make any sense? Has the author given sufficient background or introduced relevant concepts before bringing up something new?
Again this is a second step though. The first step is to make sure that the post makes sense to you. In almost every draft I read, at my company or not, there is something that does not make sense to me.
Do two paragraphs need to be reordered because the first one accidentally depended on information mentioned in the second? Are you making ambiguous use of pronouns? And so on.
In closing
Clarity on its own will put you in the 99th percentile of writing. Beyond that it definitely still matters if you are compelling and original and whatnot. But too often it seems we focus on being exciting rather than being clear. And it doesn't matter if you've got something exciting if it makes no sense to your reader.
This sounds like mundane guidance, but I have reviewed many posts that were reviewed by other people and no one else called out nonsense. I feel compelled to mention how important it is.
Why Trees Without Branches Grow Faster: The Case for Reducing Branches in Code
In the same way that arborists remove branches from trees to ensure healthy and desirable tree growth, it can also be beneficial to remove branches in software. In some of our blog posts we claim that pruning branches is a good thing, but we never got around to explaining why. In this post, we will rectify that and explore why, although branches are essential to software, it is a good idea to reduce them where possible to increase CPU efficiency.
January 28, 2025
Monitor the health of Amazon Aurora PostgreSQL instances in large-scale deployments
Vector indexes, MariaDB & pgvector, large server, large dataset: part 1
This post has results from ann-benchmarks to compare MariaDB and Postgres with a larger dataset, gist-960-euclidean. Previous posts (here and here) used fashion-mnist-784-euclidean which is a small dataset. By larger I mean by the standards of what is in ann-benchmarks. This dataset has 1M rows and 960 dimensions. The fashion-mnist-784-euclidean dataset has 60,000 rows and 784 dimensions. Both use Euclidean distance. This work was done by Small Datum LLC and sponsored by the MariaDB Corporation.
tl;dr
- MariaDB gets between 2.5X and 3.9X more QPS than Postgres for recall >= 0.95
This post has much more detail about my approach in general. I ran the benchmark for 1 session. I use ann-benchmarks via my fork of a fork of a fork at this commit. The ann-benchmarks config files are here for MariaDB and for Postgres.
The command line to run the benchmark using my helper scripts is:
bash rall.batch.sh v1 gist-960-euclidean c32r128
This chart shows the best QPS for a given recall. MariaDB gets ~1.5X more QPS than pgvector at low recall and between 2X and 4X more QPS at high recall.
- index sizes are similar between MariaDB and pgvector with halfvec
- time to create the index varies a lot, and it is better to consider this in the context of recall, which is done in the next section
- M - value for M when creating the index
- cons - value for ef_construction when creating the index
- secs - time in seconds to create the index
- size(MB) - index size in MB
With ann-benchmarks the constraint is recall. Below I share the best QPS for a given recall target along with the configuration parameters (M, ef_construction, ef_search) at which that occurs for each of the algorithms (MariaDB, pgvector with float32, pgvector with float16/halfvec). A small sketch of this selection step follows the list below.
- Postgres does not get recall=1.0 for the values of M, ef_construction and ef_search I used
- Index create time was less for MariaDB in all cases except the result for recall >= 0.96. However, if you care more about index size than peak QPS then it might be better to look at more results per recall level, as in the best 3 results per DBMS rather than the best as I do here.
- For a given recall target, MariaDB gets between 2.5X and 3.9X more QPS than Postgres
- recall, QPS - best QPS at that recall
- isecs - time to create the index in seconds
- m= - value for M when creating the index
- ef_cons= - value for ef_construction when creating the index
- ef_search= - value for ef_search when running queries
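To make that selection step concrete, here is a minimal sketch of picking the best QPS per recall target from a set of runs. This is not the ann-benchmarks code; the result tuples and recall targets below are made up for illustration.

# Minimal sketch: pick the best QPS per recall target from benchmark runs.
# The data below is illustrative, not actual ann-benchmarks output.
runs = [
    # (recall, qps, params)
    (0.95, 3180.0, "m=8, ef_cons=32, ef_search=10"),
    (0.96, 2940.0, "m=8, ef_cons=64, ef_search=20"),
    (0.99, 1410.0, "m=16, ef_cons=96, ef_search=80"),
]
targets = [0.95, 0.96, 0.97, 0.98, 0.99, 1.00]

for target in targets:
    eligible = [r for r in runs if r[0] >= target]
    if not eligible:
        print(f"recall >= {target:.2f}: no configuration reached this recall")
        continue
    recall, qps, params = max(eligible, key=lambda r: r[1])
    print(f"recall >= {target:.2f}: best QPS {qps:.1f} at recall {recall:.2f} ({params})")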
January 26, 2025
Vector indexes, MariaDB & pgvector, large server, small dataset: part 2
This post has results for vector index support in MariaDB and Postgres. This work was done by Small Datum LLC and sponsored by the MariaDB Corporation. This is part 2 in a series that compares QPS and recall for the fashion-mnist-784-euclidean dataset using from 1 to 48 concurrent sessions on a large server. Part 1 is here.
The purpose of this post is to explain the results I shared in part 1, where MariaDB does much better than pgvector at low and high concurrency, but the performance gap isn't as large at medium concurrency. Here low means <= 4 concurrent sessions, medium means 8 to 20, and high means >= 24.
tl;dr
- QPS is ~1.4X larger for MariaDB than for pgvector at 2 and 48 concurrent sessions
- pgvector uses more CPU/query than MariaDB
- MariaDB does more context switches per query than pgvector
- MariaDB appears to use less CPU to compute Euclidean distance
This post has much more detail about my approach in general and part 1 has more detail on this particular setup. I repeated the benchmark for 2 to 48 concurrent sessions because my test server has 48 cores. I use ann-benchmarks via my fork of a fork of a fork at this commit.
For more on euclidean distance (L2) see here.
For MariaDB:
- queries use ORDER BY vec_distance_euclidean
- create index uses DISTANCE=euclidean
For Postgres (pgvector):
- queries use ORDER BY embedding::halfvec(...) <-> $name::halfvec(...)
- create index uses USING hnsw ((embedding::halfvec(...)) halfvec_l2_ops
Results: QPS by concurrency
The charts in this section show QPS by concurrency level as the benchmark was repeated for 1 to 48 concurrent sessions (X concurrent sessions means X concurrent queries).
The charts come from this spreadsheet. All of the data from the benchmark is here and the data I scraped to make these charts is here. I used configurations that provide a recall of ~0.96.
- MariaDB - ef_search=10, M=6
- Postgres - ef_search=10, M=16, ef_construction=32
This chart has absolute QPS for each of the systems tested.
This chart has relative QPS which is: (QPS for Postgres / QPS for MariaDB).
- The MariaDB advantage is larger at low and high concurrency.
- The MariaDB advantage isn't as large between 8 and 20 concurrent sessions
And this table also has relative QPS.
This table has (QPS / concurrency). For all systems tested, the QPS per session decreases as the concurrency increases. I suspect the benchmark client is part of the problem, but I am just speculating.
- multiprocessing.pool is used by both the MariaDB and pgvector clients, which is good because it reduces the impact of the GIL. See here for MariaDB and for pgvector.
- the benchmark client includes all of the time to process queries, including
- creating & starting the multiprocessing.pool - perhaps the pool can be cached & reused across runs (see the sketch after this list)
- creating a database connection
- gathering results from the concurrent sessions - some of this is done in the main thread
- AFAIK, the total number of queries per run is fixed, so the number of queries per session is less when there are more concurrent sessions and setup overhead (create database connection, create multiprocessing.pool, process results) becomes more significant as the concurrency level increases.
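As a rough illustration of the pool-reuse idea, here is a minimal sketch of caching the multiprocessing pool instead of recreating it for every run. This is not the ann-benchmarks client; the function and variable names are hypothetical.

# Minimal sketch: cache a multiprocessing pool across benchmark runs instead of
# recreating it each time. Illustrative only, not the ann-benchmarks client.
import multiprocessing

_pool = None
_pool_size = None

def get_pool(num_sessions):
    """Create the pool once and reuse it for later runs with the same size."""
    global _pool, _pool_size
    if _pool is None or _pool_size != num_sessions:
        if _pool is not None:
            _pool.close()
            _pool.join()
        _pool = multiprocessing.Pool(processes=num_sessions)
        _pool_size = num_sessions
    return _pool

def run_queries(batch):
    # Placeholder for the per-session query loop.
    return len(batch)

if __name__ == "__main__":
    pool = get_pool(4)
    print(pool.map(run_queries, [[1, 2, 3]] * 4))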
Performance debugging
The benchmark client does a lot of work (like checking results for recall) which means there is a brief burst of CPU overhead when queries run followed by longer periods where the benchmark client is processing the results. So I modified the benchmark client to only run queries in a loop and avoid other overheads like checking the results for recall. This makes it easier to collect performance data like CPU profiles (perf), PMP stacks and vmstat samples.
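For reference, a minimal sketch of that kind of query-only loop is below. The connection details, table, and SQL are hypothetical, not the actual modified client; the point is simply to run queries back to back and skip recall checking so perf, PMP, and vmstat see mostly query work.

# Minimal sketch of a query-only loop for profiling: run the same search query
# repeatedly and skip recall checking. Connection details and SQL are hypothetical.
import time
import psycopg2  # or the MariaDB connector for the MariaDB tests

conn = psycopg2.connect("dbname=ann user=ann")
cur = conn.cursor()
query = "SELECT id FROM items ORDER BY embedding <-> %s LIMIT 10"

n_queries = 100000
start = time.monotonic()
for _ in range(n_queries):
    cur.execute(query, ("[0.1, 0.2, 0.3]",))
    cur.fetchall()  # drain results, but do not check recall
elapsed = time.monotonic() - start
print(f"QPS: {n_queries / elapsed:.1f}")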
Performance debugging: MariaDB
From a test with 2 concurrent sessions the perf profile shows that much CPU is used to compute the dot product which is used to determine the distance between vectors:
16.89% one_connection mariadbd.orig [.] FVector::dot_product
4.71% one_connection mariadbd.orig [.] escape_string_for_mysql
3.42% one_connection mariadbd.orig [.] search_layer
2.99% one_connection mariadbd.orig [.] buf_page_get_gen
2.31% one_connection mariadbd.orig [.] my_charlen_utf8mb4
2.16% one_connection mariadbd.orig [.] MYSQLparse
2.03% one_connection libc.so.6 [.] __memmove_avx512_unaligned_erms
1.74% one_connection mariadbd.orig [.] PatternedSimdBloomFilter<FVectorNode>::Query
1.58% one_connection libm.so.6 [.] __roundf
1.49% one_connection mariadbd.orig [.] mtr_memo_slot_t::release
1.40% one_connection mariadbd.orig [.] mhnsw_read_first
1.32% one_connection mariadbd.orig [.] page_cur_search_with_match
1.09% one_connection libc.so.6 [.] __memcmp_evex_movbe
1.06% one_connection mariadbd.orig [.] FVectorNode::distance_to
1.03% one_connection mariadbd.orig [.] row_search_mvcc
0.98% one_connection mariadbd.orig [.] rec_get_offsets_func
0.93% one_connection mariadbd.orig [.] cmp_data
0.93% one_connection mariadbd.orig [.] alloc_root
0.75% one_connection mariadbd.orig [.] Visited::cmp
And then the results with 48 concurrent sessions:
- the percentage of time in dot_product was ~17% above, but only ~11.5% here
- more time is spent in InnoDB functions like buf_page_get_gen, mtr_memo_slot_t::release, page_cur_search_with_match, btr_cur_t::search_leaf, ssux_lock::psi_rd_lock, rec_get_offsets_func and buf_page_make_young_if_needed. Some of that might be expected but that can also be a sign of too much mutex contention in InnoDB.
- I don't see signs of mutex contention in PMP output
11.49% one_connection mariadbd.orig [.] FVector::dot_product
7.17% one_connection mariadbd.orig [.] buf_page_get_gen
4.44% one_connection mariadbd.orig [.] search_layer
4.00% one_connection mariadbd.orig [.] escape_string_for_mysql
2.49% one_connection mariadbd.orig [.] mtr_memo_slot_t::release
2.23% one_connection mariadbd.orig [.] page_cur_search_with_match
1.86% one_connection mariadbd.orig [.] MYSQLparse
1.85% one_connection mariadbd.orig [.] my_charlen_utf8mb4
1.75% one_connection libc.so.6 [.] __memmove_avx512_unaligned_erms
1.60% one_connection mariadbd.orig [.] btr_cur_t::search_leaf
1.38% one_connection mariadbd.orig [.] ssux_lock::psi_rd_lock
1.37% one_connection mariadbd.orig [.] mhnsw_read_first
1.32% one_connection mariadbd.orig [.] FVectorNode::distance_to
1.23% one_connection mariadbd.orig [.] rec_get_offsets_func
1.19% one_connection mariadbd.orig [.] PatternedSimdBloomFilter<FVectorNode>::Query
1.02% one_connection mariadbd.orig [.] FVectorNode::load
0.97% one_connection libm.so.6 [.] __roundf
0.88% one_connection mariadbd.orig [.] buf_page_make_young_if_needed
0.83% one_connection mariadbd.orig [.] cmp_dtuple_rec_with_match_low
From vmstat with 2 concurrent sessions
r b swpd free buff cache si so bi bo in cs us sy id wa st
From vmstat with 48 concurrent sessions
r b swpd free buff cache si so bi bo in cs us sy id wa st
Comparing the vmstat results for 2 vs 48 sessions (a small sketch of these calculations follows this list):
- CPU/query is (vmstat.us + vmstat.sy) / QPS
- For 2 sessions it is ((4+5+4+5+5) / 5) / 6752.7 = .000681
- For 48 sessions it is ((100+100+100+100+100) / 5) / 32645.5 = .003063
- CPU/query is ~4.5X larger at 48 sessions
- Context switches/query is vmstat.cs / QPS
- For 2 sessions it is 56743 / 6752.7 = 8.40
- For 48 sessions it is 637320 / 32645.5 = 19.52
- Context switches/query is ~2.3X larger at 48 sessions
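As a sanity check on the arithmetic above, here is a minimal sketch of deriving CPU/query and context switches/query from vmstat samples and QPS. The helper names are mine; the sample values are the MariaDB numbers quoted above.

# Minimal sketch: derive CPU/query and context switches/query from vmstat samples.
def cpu_per_query(us_plus_sy_samples, qps):
    """Average (us + sy) CPU per vmstat sample, divided by QPS."""
    return (sum(us_plus_sy_samples) / len(us_plus_sy_samples)) / qps

def cs_per_query(context_switches_per_sec, qps):
    """Context switches per second divided by queries per second."""
    return context_switches_per_sec / qps

mariadb_2 = cpu_per_query([4, 5, 4, 5, 5], 6752.7)               # ~0.000681
mariadb_48 = cpu_per_query([100, 100, 100, 100, 100], 32645.5)   # ~0.003063
print(mariadb_48 / mariadb_2)                                    # ~4.5X

print(cs_per_query(56743, 6752.7))     # ~8.40 at 2 sessions
print(cs_per_query(637320, 32645.5))   # ~19.52 at 48 sessions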
Performance debugging: pgvector with halfvec
From a test with 2 concurrent sessions the perf profile shows
- computing L2 distance accounts for the most time: here it is 25.98%, while above for MariaDB at 2 concurrent sessions it was 16.89%. Perhaps MariaDB is faster at computing L2 distance, or perhaps MariaDB has more overhead elsewhere, which reduces the fraction of time spent computing L2 distance. I suspect the former is true.
- Postgres here appears to have more CPU overhead than MariaDB in accessing the data (PinBuffer, LWLock, etc)
25.98% postgres vector.so [.] HalfvecL2SquaredDistanceF16c
9.68% postgres postgres [.] PinBuffer
6.46% postgres postgres [.] hash_search_with_hash_value
5.39% postgres postgres [.] LWLockRelease
4.25% postgres postgres [.] pg_detoast_datum
3.95% postgres vector.so [.] vector_to_halfvec
3.06% postgres postgres [.] LWLockAttemptLock
2.97% postgres postgres [.] LWLockAcquire
2.89% postgres vector.so [.] HnswLoadUnvisitedFromDisk
2.44% postgres postgres [.] StartReadBuffer
1.82% postgres postgres [.] GetPrivateRefCountEntry
1.70% postgres vector.so [.] tidhash_insert_hash_internal
1.65% postgres postgres [.] LockBuffer
1.59% postgres vector.so [.] HnswLoadElementImpl
0.88% postgres libc.so.6 [.] __memcmp_evex_movbe
0.80% postgres postgres [.] ItemPointerEquals
0.79% postgres postgres [.] ResourceOwnerForget
0.71% postgres vector.so [.] HnswSearchLayer
0.64% postgres postgres [.] pq_getmsgfloat4
And then from a test with 48 concurrent sessions
- The fraction of time computing L2 distance here is less than it is for 2 sessions above. This is similar to the results for MariaDB.
- From PMP I don't see signs of mutex contention, but I only took 3 samples
19.86% postgres vector.so [.] HalfvecL2SquaredDistanceF16c
10.89% postgres postgres [.] PinBuffer
6.78% postgres postgres [.] hash_search_with_hash_value
5.44% postgres postgres [.] LWLockRelease
5.25% postgres postgres [.] LWLockAttemptLock
4.97% postgres postgres [.] pg_detoast_datum
3.30% postgres vector.so [.] vector_to_halfvec
2.62% postgres vector.so [.] HnswLoadUnvisitedFromDisk
2.61% postgres postgres [.] StartReadBuffer
2.03% postgres postgres [.] GetPrivateRefCountEntry
1.70% postgres postgres [.] LockBuffer
1.69% postgres vector.so [.] HnswLoadElementImpl
1.49% postgres postgres [.] LWLockAcquire
1.28% postgres vector.so [.] tidhash_insert_hash_internal
0.79% postgres postgres [.] ReadBufferExtended
0.78% postgres postgres [.] AllocSetAlloc
0.76% postgres libc.so.6 [.] __memcmp_evex_movbe
0.65% postgres postgres [.] ResourceOwnerForget
0.63% postgres vector.so [.] HnswSearchLayer
From vmstat with 2 concurrent sessions
From vmstat with 48 concurrent sessions
r b swpd free buff cache si so bi bo in cs us sy id wa st
88 0 96784 90323440 2001256 36414792 0 0 0 0 13129 224133 92 8 0 0 0
86 0 96784 90323440 2001256 36414796 0 0 0 197 13011 223926 92 8 0 0 0
85 0 96784 90322936 2001256 36414796 0 0 0 48 13118 224239 92 8 0 0 0
83 0 96784 90322688 2001264 36414800 0 0 0 56 13084 223987 92 8 0 0 0
86 0 96784 90324952 2001264 36414800 0 0 0 17 13361 224189 92 8 0 0 0
- QPS is ~1.4X larger for MariaDB at 2 and 48 sessions
- CPU/query is (vmstat.us + vmstat.sy) / QPS
- For 2 sessions it is ((4+4+4+4+5)/5) / 4811.2 = .000872
- For 48 sessions it is ((100+100+100+100+100)/5) / 23802.6 = .004201
- CPU/query is ~4.8X larger at 48 sessions than at 2
- CPU/query for Postgres is ~1.3X larger than MariaDB at 2 sessions and ~1.4X larger at 48
- Context switches/query is vmstat.cs / QPS
- For 2 sessions it is 24830 / 4811.2 = 5.16
- For 48 sessions it is 224133 / 23802.6 = 9.41
- Context switches/query is ~1.8X larger at 48 sessions
- Context switches/query for MariaDB is ~1.6X larger than Postgres at 2 sessions and ~2.1X larger at 48
January 25, 2025
Vector indexes, MariaDB & pgvector, large server, small dataset: part 1
This post has results for vector index support in MariaDB and Postgres. This work was done by Small Datum LLC and sponsored by the MariaDB Corporation.
My previous posts (here and here) used a server with 8 cores and 32G of RAM. While that was OK for one of the smaller datasets from ann-benchmarks, it wasn't enough for larger datasets, and the problem was the amount of memory used by the benchmark client. I made some changes to the benchmark client to reduce the transient spikes in memory usage, but I wasn't able to fully solve the problem, so I moved to a larger server with 48 cores and 128G of RAM.
This post has results for the fashion-mnist-784-euclidean dataset using from 1 to 48 concurrent sessions. This is part 1. There will be parts 2, 3 and 4 to explain the results and then I move on to a larger dataset.
I compare MariaDB with pgvector because I respect the work that the Postgres community has done to support vector search workloads. And I am happy to report that MariaDB has also done a great job on this. While I don't know the full story of the development effort, this feature came from the MariaDB Foundation and the community and it is wonderful to see that collaboration.
tl;dr
- At low and high concurrency levels MariaDB gets much more QPS for a given recall target. Here low means <= 4 concurrent sessions and high means >= 24.
- At middle concurrency levels (8 through 20 concurrent sessions) MariaDB still does better but the gap isn't as large. I try to explain this in future posts.
This post has much more detail. However I switched to a larger server (Hetzner ax162-s) with 48 cores, 128G of RAM, Ubuntu 22.04 and HW RAID 10 using 2 NVMe devices.
I use ann-benchmarks via my fork of a fork of a fork at this commit. Note that parallel index create was disabled for Postgres by my configuration and isn't (yet) supported by MariaDB.
I ran tests for fashion-mnist-784-euclidean at 1, 2, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44 and 48 concurrent sessions. The command lines were the following, using pgvector, pgvector_halfvec and mariadb as the value of $alg. When --batch is used, the concurrency level (between 2 and 48 concurrent sessions) is set by an environment variable (POSTGRES_BATCH_CONCURRENCY or MARIADB_BATCH_CONCURRENCY).
python3 run.py --algorithm $alg --dataset fashion-mnist-784-euclidean --timeout -1 --local --force \
python3 run.py --algorithm $alg --dataset fashion-mnist-784-euclidean --timeout -1 --local --force \
I filed MDEV-35897 for MariaDB because it allocates and then deallocates too much memory when ef_search is large, and large for me was >= 300. The overhead from this hurts query response times. Fortunately the fix should be easy. For now I changed config.yml for MariaDB to not use ef_search values larger than 200 (see query_args here).
Files:
- The config.yml files are here for MariaDB and for pgvector.
- Files related to these tests are archived here.
- The database configuration files are here for MariaDB and for Postgres.
An explosion of transitive dependencies
A small standard library means an explosion in transitive dependencies. A more comprehensive standard library helps you minimize dependencies. Don't misunderstand me: in a real-world project, it is practically impossible to have zero dependencies.
Armin Ronacher called for a vibe shift among programmers and I think that this actually exists already. Everyone I speak to on this topic has agreed that minimizing dependencies is ideal.
Rust and JavaScript, with their incredibly minimal standard libraries, work against this ideal. Go, Python, Java, and C# in contrast have a decent standard library, which helps minimize the explosion of transitive dependencies.
Examples
I think the standard library should reasonably include:
- JSON, CSV, and Parquet support
- HTTP/2 support (which includes TLS, compression, random number generation, etc.)
- Support for asynchronous IO
- A logging abstraction
- A SQL client abstraction
- Key abstract data types (BTrees, hashmaps, sets, and growable arrays)
- Utilities for working with Unicode, time and timezones
But I don't think it needs to include:
- Excel support
- PostgreSQL or Oracle clients
- Flatbuffers support
- Niche data structures
Neither of these lists is intended to be complete; they are just examples.
Walled gardens
Minimal standard libraries force growing companies to build out their own internal collection of "standard libraries". As one example, Bloomberg did this with C++. And I've heard of companies doing this already with Rust. This allows larger companies to manage and minimize the explosion of transitive dependencies over time.
All growing companies likely do something like this eventually. But again, smaller standard libraries incentivize companies to build this internal standard library earlier on. And the community benefits relatively little from these internal standard libraries. The community would benefit more if large organizations contributed back to an actual standard library.
Smaller organizations do not have the capacity to build these internal standard libraries.
Maybe the situation will lead to libraries like Boost for JavaScript and Rust programmers. That could be fine.
Versioning
A comprehensive standard library does not prevent the language developers from releasing new versions of the standard library. It is trivial to do this with versioned naming, as Go has done with its v2 pattern; math/rand/v2 is an example.
Conclusion
I'm primarily thinking about maintainability, not security. You can read about the security risks of using a language with an ecosystem like Rust from someone who is an expert on the matter.
My concern about the standard library does not stop me from using Rust and JavaScript. They could choose to invest in the standard library at any time. We have already begun to see Bun and Deno do exactly this. But it is clearly an area for improvement in Rust and JavaScript. And a mistake for other languages to avoid repeating.
January 24, 2025
Outgrowing Postgres: Handling increased user concurrency
January 23, 2025
Optimize Transaction Throughput: 3 Patterns for Scaling with Convex and ACID Databases
Logging 150M+ link clicks: How Dub built its webhook Event Logs
What You Missed in 2024 in the World of Databases, with Andy Pavlo
GaussDB-Global: A Geographically Distributed Database System
This paper, presented in the industry track of ICDE 2024, introduces GaussDB-Global (GlobalDB), Huawei's geographically distributed database system. GlobalDB replaces the centralized transaction management (GTM) of GaussDB with a decentralized system based on synchronized global clocks (GClock). This mirrors Google Spanner's TrueTime and its commit-wait technique, which provides externally serializable transactions by waiting out the uncertainty interval. However, GlobalDB claims compatibility with commodity hardware, avoiding the need for specialized networking infrastructure for synchronized clock distribution.
The GClock system uses GPS receivers and atomic clocks as the global time source device at each regional cluster. Each node synchronizes its clock with the global time source over TCP every 1 millisecond. Clock deviation is kept low because a synchronization round trip over TCP completes within 60 microseconds and the CPU's clock drift is bounded within 200 parts per million. The GClock timestamp includes a clock time and an error bound ($T_{err}$), which accounts for network latency and clock drift. Timestamp generation in GlobalDB uses the formula $TS_{GClock} = T_{clock} + T_{err}$, where $T_{err} = T_{sync} + T_{drift}$.
Synchronizing over TCP every millisecond seems extreme, and the paper does not go into detail about how tight synchronization is achieved or how reliable $T_{err}$ is. They make a passing reference to the FaRM paper (SIGMOD 2019), and they seem to have adapted their synchronization mechanism from that work.
Transactions in GlobalDB use the following protocol to obtain timestamps (a small sketch of the commit-wait idea follows this list):
- Invocation: Wait until $T_{clock} > TS_{GClock}$ and begin the transaction. (Single-shard queries bypass this wait by using the node's last committed transaction timestamp.)
- Commit: Wait until $T_{clock} > TS_{GClock}$ and commit.
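As a rough illustration of the commit-wait idea, here is my own sketch, not the paper's code; the uncertainty bound and function names are assumptions.

# Minimal sketch of GClock-style commit-wait: take a timestamp that includes the
# error bound, then wait until the local clock has passed it before proceeding.
import time

T_ERR = 0.0002  # assumed uncertainty bound in seconds (T_sync + T_drift)

def gclock_timestamp():
    """TS_GClock = T_clock + T_err."""
    return time.time() + T_ERR

def wait_until_past(ts):
    """Wait until the local clock is strictly greater than ts."""
    while time.time() <= ts:
        pass  # in practice a short sleep or bounded spin

# Transaction invocation: pick a timestamp, then wait out the uncertainty.
ts = gclock_timestamp()
wait_until_past(ts)
# ... run the transaction ...
commit_ts = gclock_timestamp()
wait_until_past(commit_ts)
# ... commit with commit_ts ...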
GlobalDB introduces two major algorithmic components:
- Seamless Transition Between Centralized and Decentralized Modes: GlobalDB supports DUAL mode, enabling zero-downtime transitions between centralized (GTM) and decentralized (GClock) transaction management.
- Flexible Asynchronous Replication: GlobalDB supports asynchronous replication with strong consistency guarantees for reads on replicas. This is achieved through a Replica Consistency Point (RCP), which ensures that all replicas provide a consistent snapshot of the database, even if they are not fully up-to-date.
It seems like point one, transitioning between centralized and decentralized modes, is also heavily inspired by the FaRM paper's protocol for this. I will explain this below.
The second point, asynchronous replication scheme, raises questions about durability. If the primary crashes before logs are sent to replicas, data loss could occur. The paper does not fully address this issue, leaving it unclear how GlobalDB ensures durability in such scenarios.
So what do we get with GlobalDB? GlobalDB's use of synchronized clocks gives us decentralized transaction management, removing the need for a centralized service to order transactions. This improves throughput, especially in geo-distributed deployments. Another key benefit is that synchronized clocks enhance read performance on asynchronous local replicas by returning consistent responses from a global snapshot of the database at a given time.
OK, that was the overview. Now we can discuss the architecture and protocol details.
Architecture
GaussDB is a shared-nothing distributed database consisting of:
- Computing Nodes (CNs): Stateless nodes that handle query parsing, planning, and coordination.
- Data Nodes (DNs): Host portions of tables based on hash or range partitioning. Replica DNs are placed remotely for high availability.
- Global Transaction Manager (GTM): A lightweight centralized service that provides timestamps for transaction invocation and commit.
GlobalDB replaces the GTM with decentralized transaction management using GClock, improving scalability and performance, especially in geo-distributed deployments. Primary DNs continuously transmit updates to replica nodes in the form of Redo logs. The paper argues that asynchronous replication avoids the performance degradation of waiting for remote replicas, but it does not discuss how to deal with potential durability gaps, as the primary may crash before logs are replicated.
Clock Transition Between Centralized and Decentralized Modes
As we mentioned above, a key innovation in GlobalDB is the DUAL mode, which enables seamless transitions (in both directions) between centralized (GTM) and decentralized (GClock) transaction management without downtime. This is critical for maintaining system availability during upgrades or failures, such as when global clock synchronization fails. The DUAL mode uses a hybrid timestamp mechanism: $TS_{DUAL} = \max(TS_{GTM}, TS_{GClock}) + 1$
A DUAL mode timestamp $TS_{DUAL}$ is guaranteed to be larger than both the most recent GTM timestamp and the GClock upper bound, which ensures monotonicity. Transitions require waiting for $2 \times T_{err}$ to prevent anomalies like stale reads due to timestamp inversion. By acting as a bridge between the two modes, DUAL mode allows the system to remain fully operational during transitions, supporting live upgrades and fault recovery while meeting enterprise SLAs.
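A tiny sketch of that hybrid timestamp and the transition wait, written as my own illustration with placeholder values, could look like this:

# Minimal sketch of the DUAL mode hybrid timestamp and the transition wait.
import time

T_ERR = 0.0002  # assumed clock uncertainty bound in seconds

def dual_timestamp(ts_gtm, ts_gclock):
    """TS_DUAL = max(TS_GTM, TS_GClock) + 1, so it exceeds both sources."""
    return max(ts_gtm, ts_gclock) + 1

def transition_wait():
    """Wait 2 * T_err before completing a mode switch to avoid stale reads."""
    time.sleep(2 * T_ERR)

print(dual_timestamp(ts_gtm=1005, ts_gclock=1003))  # -> 1006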
Asynchronous Replication with Strong Consistency
GlobalDB employs asynchronous replication with strong consistency guarantees for reads on replicas, achieved through a Replica Consistency Point (RCP). In some ways, this reminds me of the Volume Complete LSN calculation in Amazon Aurora. The RCP is a globally consistent snapshot calculated as the minimum of the maximum commit timestamps across all replicas. This ensures that any transaction committed before the RCP is visible and consistent across the system, even if replicas are not fully up-to-date. To maintain progress, heartbeat transactions prevent RCP stagnation on idle replicas. This approach allows GlobalDB to offer strong consistency for read-only queries on replicas, even in an asynchronous replication setup. Users can query nearby replicas for faster response times, without worrying about getting inconsistent or incorrect results.
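To make the RCP calculation concrete, here is a minimal sketch of taking the minimum of the per-replica maximum commit timestamps; the replica names and timestamps are made up for illustration.

# Minimal sketch: the Replica Consistency Point is the minimum, across replicas,
# of each replica's maximum applied commit timestamp. The data here is made up.
replica_max_commit_ts = {
    "replica-1": 1072,
    "replica-2": 1068,  # a lagging replica holds the RCP back
    "replica-3": 1075,
}

rcp = min(replica_max_commit_ts.values())
print(f"RCP = {rcp}")  # reads at or below the RCP are consistent on any replica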
Data Definition Language (DDL) statements (such as CREATE TABLE or DROP INDEX) impose extra restrictions on using RCP for reads. GlobalDB ensures consistency for read-only queries (ROR) by requiring that the RCP is greater than either the largest DDL timestamp or the timestamp of each table involved in the query, ensuring compatibility with schema changes.
I think several challenges still remain: calculating RCP across regions introduces latency proportional to replica count, frequent updates can widen gaps between primary and replica timestamps, and asynchronous replication risks data loss if the primary fails before logs are replicated.
Evaluation
The paper evaluates GlobalDB on both single-region and geo-distributed clusters using TPC-C and Sysbench. The results show 14x higher read throughput and 50% higher TPC-C throughput compared to the baseline system. Geo-distributed setups achieve 91% of the throughput of a co-located cluster, despite added network latency.
However, the evaluation has severe limitations. Tests use Linux `tc` to simulate delays, not representing real-world variable latency and packet loss. Moreover, the baseline is Huawei’s prior centralized system, with no comparisons to other distributed databases like Spanner, CockroachDB, or Yugabyte.
January 22, 2025
Diving deep into the new Amazon Aurora Global Database writer endpoint
Embedding Python in Rust (for tests)
This is an external post of mine.