a curated list of database news from authoritative sources

February 05, 2026

OSTEP Chapters 6,7

How does your computer create the illusion of running dozens of applications simultaneously when it only has a few physical cores?

Wait, I forgot the question because I am now checking my email. Ok, back to it...

The answer is CPU Virtualization. Chapters 6, 7 of OSTEP explore the engine behind this illusion, and how to balance raw performance with absolute control.

The OSTEP textbook is freely available at Remzi's website if you'd like to follow along.


Chapter 6. The Mechanism: Limited Direct Execution

The crux of the challenge is: How do we run programs efficiently without letting them take over the machine?

The solution is Limited Direct Execution (LDE) -- the title spoils it. "Direct Execution" means the program runs natively on the CPU for maximum speed. "Limited" means the OS retains the authority to interrupt the process and block restricted operations. This requires some hardware support.

To prevent chaos, hardware provides two execution modes. Applications run in "User Mode", where they cannot perform privileged actions like I/O. The OS runs in "Kernel Mode" with full access to the machine. When a user program needs a privileged action, it initiates a "System Call". This triggers a 'trap' instruction that jumps into the kernel and raises the privilege level. To ensure security, the OS programs a "trap table" at boot time, telling the hardware exactly which code to run for each event.
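As a concrete (if tiny) illustration, every ordinary I/O call in a user program goes through this trap machinery. Here is a minimal Python sketch; the kernel transition itself is invisible to user code:

```python
import os

# os.pipe(), os.write(), and os.read() are thin wrappers around the
# pipe(2), write(2), and read(2) system calls. Each one executes a trap
# instruction, raises the privilege level, runs the kernel code registered
# in the trap table, and returns to user mode with the result.
r, w = os.pipe()         # ask the kernel for a kernel-managed buffer
os.write(w, b"hello")    # trap into the kernel; bytes are copied in
data = os.read(r, 5)     # trap again; bytes are copied back out
os.close(r)
os.close(w)
print(data)              # b'hello'
```

The user program never touches the device or kernel memory directly; it only asks, and the kernel decides.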

If a process enters an infinite loop, how does the OS get the CPU back?

  • The Cooperative Approach: Older systems (like early Mac OS) trusted processes to yield the CPU periodically. If a program locked up, you had to reboot.
  • The Non-Cooperative Approach: Modern systems use a "Timer Interrupt". The hardware raises an interrupt every few milliseconds, forcefully halting the process and handing control back to the OS.

Finally, when the OS regains control and decides to switch to a different process, it executes a "context switch". This low-level assembly routine saves the current process's registers to its kernel stack and restores the next process's registers. By switching the stack pointer, the OS tricks the hardware: the 'return-from-trap' instruction returns into the new process instead of the old one.
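The save/restore dance can be mimicked in a few lines of Python. This is a toy model only: real context switches are a short assembly routine that saves registers to the process's kernel stack, and the register names and values below are invented for illustration:

```python
# A toy model of context-switch bookkeeping.
def context_switch(current, nxt, cpu):
    current["saved_regs"] = dict(cpu)   # save outgoing process's registers
    cpu.clear()
    cpu.update(nxt["saved_regs"])       # restore incoming process's registers
    return nxt                          # "return-from-trap" now resumes nxt

cpu = {"pc": 0x400, "sp": 0x7FF0}       # hypothetical register values
proc_a = {"name": "A", "saved_regs": {}}
proc_b = {"name": "B", "saved_regs": {"pc": 0x800, "sp": 0x6FF0}}

running = context_switch(proc_a, proc_b, cpu)
print(running["name"], hex(cpu["pc"]))  # B 0x800
```

After the switch, the CPU's registers hold B's state and A's state is parked in its process structure, ready to be restored later.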



Chapter 7. The Policy: CPU Scheduling

With the switching mechanism from Chapter 6 in place, the next question is one of policy: which process should run next? Chapter 7 explores scheduling policies, initially assuming all jobs arrive at once and have known run-times.

First, let's look at batch scheduling using "Turnaround Time". This metric is simply the time a job completes minus the time it arrived (T_completion - T_arrival). Now let's consider some batch scheduling policies with this metric:

  • FIFO (First In, First Out): Simple, but suffers from the "convoy effect". If a short job gets stuck behind a long one, average turnaround time suffers.
  • SJF (Shortest Job First): To fix the convoy, SJF runs the shortest job first. This is optimal for turnaround time if all jobs arrive at once, but fails if a short job arrives after a long job has started.
  • STCF (Shortest Time-to-Completion First): By adding preemption, we get STCF. When a new job arrives, the scheduler runs the job with the least time remaining. This guarantees optimal turnaround time (see Fig 7.5).
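A back-of-the-envelope check of these policies. This sketch assumes all jobs arrive at t = 0 and uses OSTEP's classic workload of one long and two short jobs:

```python
def avg_turnaround(runtimes):
    """Average turnaround for jobs run in the given order, all arriving at t=0."""
    t, total = 0, 0
    for r in runtimes:
        t += r          # this job completes at time t
        total += t      # turnaround = completion - arrival, and arrival = 0
    return total / len(runtimes)

jobs = [100, 10, 10]                    # one long job, two short ones
print(avg_turnaround(jobs))             # FIFO order: (100+110+120)/3 = 110.0
print(avg_turnaround(sorted(jobs)))     # SJF order:  (10+20+120)/3  = 50.0
```

Putting the long job first (FIFO's convoy) more than doubles the average turnaround compared to SJF's shortest-first order.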

Now, we consider interactivity using "Response Time". This is the time from when a job arrives to the first time it is scheduled (T_response = T_firstrun - T_arrival).

STCF is great for turnaround but terrible for response time; a user might wait seconds for their interactive job (say a terminal session) to start. Round Robin solves this by time-slicing: it runs a job for a set quantum (e.g., 10 ms) and then switches. This makes the system feel responsive.

However, Round Robin creates a trade-off. While it optimizes fairness and response time, it destroys turnaround time by stretching out the completion of every job. You cannot have your cake and eat it too. See Fig 7.6 & 7.7.
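The response-time side of the trade-off can be checked the same way, assuming three jobs of length 5 that all arrive at t = 0, as in the book's figures:

```python
def run_to_completion_response_times(runtimes):
    """First-run times when jobs run back-to-back (SJF/FIFO style)."""
    starts, t = [], 0
    for r in runtimes:
        starts.append(t)    # job first runs when all earlier jobs finish
        t += r
    return starts

def rr_response_times(n_jobs, quantum):
    """First-run times when n_jobs are round-robined with the given quantum."""
    return [i * quantum for i in range(n_jobs)]

jobs = [5, 5, 5]
print(run_to_completion_response_times(jobs))  # [0, 5, 10] -> average 5
print(rr_response_times(len(jobs), 1))         # [0, 1, 2]  -> average 1
```

Round Robin gets every job started within a few time slices, at the cost of finishing all of them late.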

Finally, real programs perform I/O. When a process blocks waiting for a disk, the scheduler treats the time before the I/O as a "sub-job". By running another process during this wait, the OS maximizes "overlap" and system utilization.

There is one last assumption we did not relax: the OS does not actually know how long a job will run. This "No Oracle" problem sets the stage for the next chapter on the "Multi-Level Feedback Queue", which predicts the future by observing the past.

To conclude Chapter 7, it is worth remembering that there is no silver bullet. The best policy depends entirely on the workload. The more you know about what you are running, the better you can schedule it.

February 04, 2026

Trudging Through Nonsense

Last week Anthropic released a report on disempowerment patterns in real-world AI usage which finds that roughly one in 1,000 to one in 10,000 conversations with their LLM, Claude, fundamentally compromises the user’s beliefs, values, or actions. They note that the prevalence of moderate to severe “disempowerment” is increasing over time, and conclude that the problem of LLMs distorting a user’s sense of reality is likely unfixable so long as users keep holding them wrong:

However, model-side interventions are unlikely to fully address the problem. User education is an important complement to help people recognize when they’re ceding judgment to an AI, and to understand the patterns that make that more likely to occur.

In unrelated news, some folks have asked me about Prothean Systems’ new paper. You might remember Prothean from October, when they claimed to have passed all 400 tests on ARC-AGI-2—a benchmark that only had 120 tasks. Unsurprisingly, Prothean has not claimed their prize money, and seems to have abandoned claims about ARC-AGI-2. They now claim to have solved the Navier-Stokes existence and smoothness problem.

The Clay Mathematics Institute offers a $1,000,000 Millennium Prize for proving either global existence and smoothness of solutions, or demonstrating finite-time blow-up for specific initial conditions.

This system achieves both.

At the risk of reifying XKCD 2501, this is a deeply silly answer to an either-or question. You cannot claim that all conditions have a smooth solution, and also that there is a condition for which no smooth solution exists. This is like being asked to figure out whether all apples are green, or at least one red one exists, and declaring that you’ve done both. Prothean Systems hasn’t just failed to solve the problem—they’ve failed to understand the question.

Prothean goes on to claim that the “demonstration at BeProthean.org provides immediate, verifiable evidence” of their proof. This too is obviously false. As the Clay paper explains, the velocity field must have zero divergence, which is a fancy way of saying that the fluid is incompressible; it can’t be squeezed down or spread out. One of the demo’s “solutions” squeezes everything down to a single point, and another shoves particles away from the center. Both clearly violate Navier-Stokes.

My background is in physics and software engineering, and I’ve written several numeric solvers for various physical systems. Prothean’s demo (initFluidSimulator) is a simple Euler’s method solver with four flavors of externally-applied acceleration, plus a linear drag term to compensate for all the energy they’re dumping into the system. There’s nothing remotely Navier-Stokes-shaped there.

The paper talks about a novel “multi-tier adaptive compression architecture” which “operates on semantic structure rather than raw binary patterns”, enabling “compression ratios exceding 800:1”. How can we tell? Because “the interactive demonstration platform at BeProthean.org provides hands-on capability verification for technical evaluation”.

Prothean’s compression demo wasn’t real in October, and it’s not real today. This time it’s just bog-standard DEFLATE, the same algorithm used in .zip files. There are some fake log messages to make it look like it’s doing something fancy when it’s not.

document.getElementById('compress-status').textContent = `Identifying Global Knowledge Graph Patterns...`;

const stream = file.stream().pipeThrough(new CompressionStream('deflate-raw'));

There’s a fake “Predictive vehicle optimization” tool that has you enter a VIN, then makes up imaginary “expected power gain” and “efficiency improvement” numbers. These are based purely on a hash of the VIN characters, and have nothing to do with any kind of car. Prothean is full of false claims like this, and somehow they’re offering organizational licenses for it.

It’s not just Prothean. I feel like I’ve been trudging through a wave of LLM nonsense recently. In the last two weeks alone, I’ve watched software engineers use Claude to suggest fatuous changes to my software, like an “improvement” to an error message which deleted key guidance. Contractors proffering LLM-slop descriptions of appliances. Claude-generated documents which made bonkers claims, like saying a JVM program I wrote provided “faster iteration” thanks to “no JVM startup”. Cold emails asking me to analyze dreamlike, vaguely-described software systems—one of whom, in our introductory call, couldn’t even begin to explain what they’d built or what it was for. Someone who claimed to be an engineer wanting to help with fault-injection work on Jepsen, then turned out to be a scammer soliciting investment in their AI video chatbot project.

When people or companies intentionally make false claims about the work they’re doing or the products they’re selling, we call it fraud. What is it when one overlooks LLM mistakes? What do we call it when a person sincerely believes the lies an LLM has told them, and repeats those lies to others? Dedicates months of their life to a transformer model’s fever dream?

Anthropic’s paper argues reality distortion is rare in software domains, but I’m not so sure.

This stuff keeps me up at night. I wonder about my fellow engineers who work at Anthropic, at OpenAI, on Google’s Gemini. I wonder if they see as much slop as I do. How many of their friends or colleagues have been sucked into LLM rabbitholes. I wonder if they too lie awake at three AM, staring at the ceiling, wondering about the future and their role in making it.

Semantic Caching for LLM Apps: Reduce Costs by 40-80% and Speed up by 250x

This post covers the topic of the video in more detail and includes some code samples. The $9,000 Problem You launch a chatbot powered by one of the popular LLMs like Gemini, Claude or GPT-4. It’s amazing and your users love it. Then you check your API bill at the end of the month: $15,000. […]

The F word

Back in 2005, when I first joined the SUNY Buffalo CSE department, the department secretary was a wonderful lady named Joann, who was over 60. She explained that my travel reimbursement process was simple: I'd just hand her the receipts after my trip, she'd fill out the necessary forms, submit them to the university, and within a month, the reimbursement check would magically appear in my department mailbox.

She handled this for every single faculty member, all while managing her regular secretarial duties. Honestly, despite the 30-day turnaround, it was the most seamless reimbursement experience I've ever had.

But over time the department grew, and Joann moved on. The university partnered with Concur, as corporations do, forcing us to file our own travel reimbursements through this system. Fine, I thought, more work for me, but it can't be too bad. But the department also appointed a staff member to audit our Concur submissions.

This person's job wasn't to help us file reimbursements, but to audit the forms to find errors. Slowly but surely, it became routine for every single travel submission to be returned (sometimes multiple times) for minor format irregularities or rule violations. These were petty violations no human would care about if the goal were simply to get people reimbursed. The experience degraded from effortless to what could be perceived as adversarial.

This was a massive downgrade from the Joann era.


The Source of Friction

This story (probably all too familiar to many) illustrates the danger of not setting the right intention regarding friction. If the goal isn't actively set to help and streamline the process (if the intention isn't "how do we solve this?"), the energy of the system inevitably shifts toward finding problems. Friction becomes the product.

This dynamic is not just true for organizations, it is also true for each of us.

We have to manage the stories we tell ourselves. These stories, whether we tell them knowingly or unknowingly, determine how we manage/conduct ourselves, which in turn determines our success. Just as organizations can start to manufacture friction, individuals can do the same internally. You can install an internal auditor in your own mind.

When intention shifts away from growth, things degrade. You stop asking how to move forward and start looking for violations. You nitpick and reject your own efforts before they have a chance to mature. You begin to find ways to grate against your own progress.

I wrote about this concept previously in my post "Your attitude determines your success". That post tends to get two very different reactions. It gets nitpicked to pieces by cynics (the auditors), and it gets a silent knowing nod from people in the know (the builders). Brooker recently wrote career advice along the same lines, reinforcing that high-agency mindset. In a similar vein, I recently wrote about optimizing for momentum.

“When there is a will, there is a way,” as the saying goes. Get the intention right and friction dissolves. Get it wrong and you may end up weaponizing process, tooling, and auditing against your own goals.

My Take on Vibe Coding

I often enjoy vibe coding, but I think we’re still far away from AI writing all your code. Newer models improve your development speed, even for complex applications. However, writing a usable browser from scratch without the heavy involvement of an experienced engineer is certainly not something that’s currently possible.

For me, vibe coding works well most of the time if I have a very good understanding of what I need. For tasks that Claude Code solves well, it saves me a lot of time, but it’s still not “hands-off”. To converge to an acceptable result faster, I often need to give very specific instructions (e.g. “don’t manually create and delete temporary files in Python, just use the tempfile module”). Sometimes, I also just waste a lot of time and don’t get any working result at all.

I generally use Claude Code (currently with Opus 4.5) and regularly try it out for new tasks or older tasks that haven’t worked well in the past. This is a collection of my experience with specific tasks:

My Vibe Coding Tasks

Writing a Rust program to do ARP pings over a range of VLAN ids and IP subnets (to scan my local network):

Claude Code found suitable libraries to generate and send raw packets, understood how to generate ARP packets with VLAN tags and understood how to scan through IPs of a given subnet. It also wrote a raw packet receiver to process ARP responses and added sensible cmdline arguments. Because I already knew very specifically what I wanted, this worked super well. I didn’t need to write a single line of code myself.

Looking at my network configuration (a bunch of config files and screenshots of switch/AP management interfaces) and translating this into a human-readable Markdown file describing the network:

Claude Code asked for missing context (e.g., physical layout of the switch ports) which I think is crucial: In my experience, missing context often leads to bad results with LLMs, so Anthropic does a good job there. It even generated an overview in an SVG file that was correct!

Writing a custom clang-tidy matcher for our internal C++ code style:

It turned out that the matcher I wanted to write just isn’t possible with the current Clang AST API. Claude Code tried a lot of different things (only some solutions compiled); I had to write a lot of code manually to guide it towards a possible solution, and to look at the Clang source code with the help of Claude Code to verify what Claude Code was claiming. Eventually, I (not Claude!) understood that the matcher I wanted to write just wasn’t possible and abandoned the project after several hours.

Regularly asking very detailed questions about the core CedarDB database code base:

I know the code base very well, so I tend to ask specific questions such as “We push a data block to S3 once the number of buffered rows exceeds a threshold. What’s the threshold exactly, where do we set it, and where do we check if it’s exceeded?”. Claude Code manages to answer these questions precisely, giving me exact code locations, even if answering the question requires understanding 10+ different source files in detail. This also works well with other code bases I’m familiar with, such as PostgreSQL or LLVM.

Writing a float parser for hexfloats in C++:

I had a specific algorithm in my head that I wanted to try out and implement. Claude Code got the boilerplate and parsing basic numbers correct. It even wrote several helpful test cases. Where I really wasted a lot of time were the edge cases: Overflows around the edges of representable numbers, subnormal numbers, NaN payloads, etc. Even for the tests, Claude Code really wanted to use standard library functions to verify the correctness. But the standard library functions don’t handle these edge cases consistently (which is why I wrote a custom parser in the first place) and I couldn’t convince Claude otherwise. So, I ended up writing the edge cases manually, having wasted an hour talking to Claude.

Writing ansible modules in Python for different tasks that I hacked with ansible.builtin.shell before:

Claude Code processed my hacky shell scripts, understood what I wanted to do, and created equivalent Python modules. The modules also have support for check mode and display good error messages.

Writing a Python script that generates a static HTML file from a list of backups (using borg backup):

I wanted a quick read-only overview of my backups. A static HTML page was the easiest solution for me, no monitoring stack required.

How I Will Use LLMs for Coding

I have found that the more I know about a certain problem and programming language, the better the result I get from an LLM. I will definitely continue using LLMs to write boilerplate code and to solve (coding) problems that have been solved before. For really novel problems or algorithms, I think LLMs can assist very well, but I’m not yet satisfied with the quality of the code.

February 03, 2026

Percona at 20: Why Our Open Source, Services-Led Model Still Works

In 2026, Percona turns 20. That milestone offers a good opportunity to pause and reflect, not just on where we have been, but on why our business model has worked for two decades in an industry that has seen constant change. From the beginning, Percona has followed a model that is sometimes misunderstood, occasionally questioned, […]

BKND joins Supabase

Dennis Senn, creator of BKND, is joining Supabase to build a Lite offering for agentic workloads.

February 02, 2026

Importance of Tuning Checkpoint in PostgreSQL

The topic of checkpoint tuning is frequently discussed in many blogs. However, I keep coming across cases where it is kept untuned, resulting in huge wastage of server resources, struggling with poor performance and other issues. So it’s time to reiterate the importance again with more details, especially for new users. What is a checkpoint? […]

February 01, 2026

The Doctor's On-Call Shift solved with SQL Assertions

In a previous article, I explained that enforcing application-level rules, such as “each shift must always have at least one doctor on call”, typically requires either serializable isolation or explicit locking.

There's another possibility: enforcing the rules in the database with SQL assertions, so they fall under ACID’s C instead of I. SQL databases have implemented only part of the SQL standard, limiting constraints to single-row CHECK constraints, unique constraints between rows in the same table, and referential integrity constraints between table rows. Oracle recently implemented one of the missing pieces: SQL assertions, which can express cross-row and cross-table conditions, including some joins and subqueries.

Oracle 23.26.1

Here is an example. The version where SQL assertions are available is Oracle AI Database 26ai Enterprise Edition Release 23.26.1.0.0. Here is how I started a Docker container:

# get the image (12GB)
docker pull container-registry.oracle.com/database/enterprise:23.26.1.0

# start the container
docker run -d --name ora26 container-registry.oracle.com/database/enterprise:23.26.1.0

# wait a few minutes for the database to start
until docker logs ora26 | grep "DATABASE IS READY TO USE" ; do sleep 1 ; done

# use a weak password for this lab
docker exec -it ora26 ./setPassword.sh "franck"

# create a user with the right privileges to test assertions
docker exec -i ora26 sqlplus sys/franck@ORCLPDB1 as sysdba <<'SQL'
 grant connect, resource to franck identified by franck;
 grant create assertion to franck;
 grant alter session to franck;
 alter user franck quota unlimited on users;
SQL

# Start the command-line
docker exec -it ora26 sqlplus franck/franck@ORCLPDB1

Create the single-table schema

I created the same table as in the previous posts:

SQL> CREATE TABLE doctors (
      shift_id  INT NOT NULL,
      name      VARCHAR2(42) NOT NULL,
      on_call   BOOLEAN NOT NULL,
      CONSTRAINT pk_doctors PRIMARY KEY (shift_id, name)
);

Table DOCTORS created.

Two doctors are on-call for the shift '1':

SQL> INSERT INTO doctors VALUES 
      (1, 'Alice', true),
      (1, 'Bob',   true)
;

2 rows inserted.

SQL> COMMIT;

Commit complete.

According to the SQL-92 standard, the following assertion would guarantee that, for every shift, the number of doctors with on_call = true is ≥ 1:

CREATE ASSERTION at_least_one_doctor_on_call_per_shift
CHECK (
    NOT EXISTS (
        SELECT shift_id
          FROM doctors
         GROUP BY shift_id
        HAVING COUNT(CASE WHEN on_call THEN 1 END) < 1
    )
);

Unfortunately, this is not supported yet:

SQL> CREATE ASSERTION at_least_one_doctor_on_call_per_shift
CHECK (
    NOT EXISTS (
        SELECT shift_id
          FROM doctors
         GROUP BY shift_id
        HAVING COUNT(CASE WHEN on_call THEN 1 END) < 1
    )
);  
          FROM doctors
               *
ERROR at line 5:
ORA-08689: CREATE ASSERTION failed
ORA-08661: Aggregates are not supported.
Help: https://docs.oracle.com/error-help/db/ora-08689/

Oracle implements a pragmatic, performance‑oriented subset of SQL‑92 assertions. It uses internal change tracking, similar to session‑scope materialized view logs, to limit what must be validated. Aggregates are currently disallowed, and assertions are mainly implemented as anti‑joins.

Here is a more creative way to express “Every shift must have at least one on-call doctor” as a double negation: “There must not exist any doctor who belongs to a shift that has no on-call doctor”:

DROP ASSERTION IF EXISTS no_shift_without_on_call_doctor;
CREATE ASSERTION no_shift_without_on_call_doctor
CHECK (
    NOT EXISTS (
        SELECT 'any doctor'
          FROM doctors
         WHERE NOT EXISTS (
             SELECT 'potential on-call doctor in same shift'
               FROM doctors on_call_doctors
              WHERE on_call_doctors.shift_id = doctors.shift_id
                AND on_call_doctors.on_call = TRUE
         )
    )
);

If the inner NOT EXISTS is true, this shift has ZERO on-call doctors, so we found a doctor who belongs to a shift with no on-call doctor, and the outer NOT EXISTS becomes false, which raises the assertion violation.

I re-play the conflicting changes from the previous post. In one session, Bob removes his on-call status:

SQL> UPDATE doctors 
      SET on_call = false 
      WHERE shift_id = 1 AND name = 'Bob'
;

In another session, Alice also tries to remove her on-call status:

SQL> UPDATE doctors 
      SET on_call = false 
      WHERE shift_id = 1 AND name = 'Alice'
;

Bob’s update succeeds. Before Bob commits, Alice’s update hangs, waiting on enq: AN - SQL assertion DDL/DML. Once Bob commits, Alice's statement fails:

SQL> update doctors set on_call = false where shift_id = 1 and name = 'Alice';
update doctors set on_call = false where shift_id = 1 and name = 'Alice'
*
ERROR at line 1:
ORA-08601: SQL assertion (FRANCK.NO_SHIFT_WITHOUT_ON_CALL_DOCTOR) violated.
Help: https://docs.oracle.com/error-help/db/ora-08601/

This prevents race condition anomalies by reading not only the current state—which, without a serializable isolation level, is vulnerable to write-skew anomalies—but also the ongoing changes to other rows from other sessions.

Some internals: change tracking and enqueue

You can trace what happens (recursive statements and locks):

ALTER SESSION SET EVENTS 'sql_trace bind=true, wait=true';
ALTER SESSION SET EVENTS 'trace[ksq] disk medium';
ALTER SESSION SET tracefile_identifier=AliceRetry;

UPDATE doctors 
      SET on_call = false 
      WHERE shift_id = 1 AND name = 'Alice'
;

The update stores the change in an internal change tracking table prefixed by ORA$SA$TE_ (SQL Assertion Table Event):

INSERT INTO "FRANCK"."ORA$SA$TE_DOCTORS" 
(DMLTYPE$$, OLD_NEW$$, SEQUENCE$$, CHANGE_VECTOR$$, ROW$$, "SHIFT_ID", "ON_CALL")
 VALUES (
'U',                  -- update
'O',                  -- old value
  2,                  -- ordered session-level sequence ORA$SA$SEQ$$
...,                  -- change vector bits
'0000031D.0000.0000', -- ROWID
1, false              -- column values
);

INSERT INTO "FRANCK"."ORA$SA$TE_DOCTORS" 
(DMLTYPE$$, OLD_NEW$$, SEQUENCE$$, CHANGE_VECTOR$$, ROW$$, "SHIFT_ID", "ON_CALL")
 VALUES (
'U',                  -- update
'N',                  -- new value
  2,                  -- ordered per-session sequence ORA$SA$SEQ$$
'0a',                 -- change vector bits to find quickly which column has changed
'0000031D.0000.0000', -- ROWID
1, true               -- column values
);

You can spot the column names used in materialized view logs—introduced in Oracle7 as a “snapshot log”. It’s an internal feature from 1992 used in 2026 to implement a 1992 SQL feature 🤓 However, in this context the table isn’t a regular MV log, but an internal GTT (global temporary table) with special read restrictions:

SQL> select * from "FRANCK"."ORA$SA$TE_DOCTORS";
select * from "FRANCK"."ORA$SA$TE_DOCTORS"
                       *
ERROR at line 1:
ORA-08709: Reads from SQL assertion auxiliary tables are restricted.
Help: https://docs.oracle.com/error-help/db/ora-08709/

This change-tracking table is used only to decide whether revalidation is needed and what to lock, so that only what's necessary is revalidated, without scanning whole tables. For example, here it can detect that it only has to validate one SHIFT_ID and use it as the resource to lock, like a range lock in databases that provide a serializable isolation level. This is an alternative to a data-modeling solution where we have a "shifts" table and use SELECT FOR UPDATE explicit locking on it.

During the validation, the lock type is AN, different from the well-known TX and TM lock types. The ksq trace shows an exclusive lock acquired by Bob's session once it detects that the changes may necessitate a validation of the assertion:

ORCLCDB_ora_7785_BOB.trc:
2026-01-31 23:44:54.587860*:ksq.c@7635:ksqcmi():ksqcmi AN-00011B78-C8CC96CA-739F2901-00000000 mode=0 timeout=0 lockmode=6 lockreq=0

The locked resource is a hash bucket computed from the assertion definition combined with the SHIFT_ID value. Alice's session requested a similar lock and waited on enq: AN - SQL assertion DDL/DML:

ORCLCDB_ora_7592_ALICE.trc:
2026-01-31 22:55:53.607221*:ksq.c@7635:ksqcmi(): ksqcmi AN-00011B78-C8CC96CA-739F2901-00000000 mode=6 timeout=21474836 lockmode=0 lockreq=0

This is where Alice waited during her update, and before the validation of the assertion.

Conclusion

With Oracle 23.26.1, SQL assertions finally move a long‑standing SQL‑92 feature from theory into practical enforcement, combining declarative constraint semantics with change tracking and fine‑grained locking.

There are now three solutions to the doctor's on-call shift problem:

  • Serializable isolation level (not in Oracle Database)
  • Data model and explicit locking from the application (all databases)
  • SQL Assertion when the business logic is in the database (only Oracle Database)

January 31, 2026

OSTEP Chapters 4,5

I recently started reading "Operating Systems: Three Easy Pieces" (OSTEP) as part of Phil Eaton's offline reading group. We are tackling a very doable pace of 2 chapters a week.

The book is structured into three major parts: Virtualization, Concurrency, and Persistence. It is openly accessible to everyone for free, which is a tremendous contribution to computer science education by the Arpaci-Dusseau couple (Remzi and Andrea).

This is a very user-friendly book, sprinkled with a lot of jokes and asides that keep the mood light. The fourth wall is broken upfront: the authors talk directly to you, which is great. It makes it feel like we are learning together rather than being lectured at. Their approach is more than superficial; it inspires you, motivates the problems, and connects them to the big-picture context. The book builds scaffolding through "The Crux" of the problem and "Aside" panels. It actively teaches you the thought processes, not just the results. I’ve talked about the importance of this pedagogical style before: Tell me about your thought process, not just the results.


Chapter 4: The Process Abstraction

In the first part, Virtualization, we start with the most fundamental abstraction: The Process. Informally, a process is simply a running program.

To understand a process, we have to look at its machine state. This includes its address space (memory), registers (like the Program Counter and Stack Pointer), and I/O information (like open files).

One distinct memory component here is the Stack versus the Heap.

  • The Stack: C programs use the stack for local variables, function parameters, and return addresses. It's called a stack because it operates on a Last-In, First-Out basis, growing and shrinking automatically as functions are called and return.
  • The Heap: This is used for explicitly requested, dynamically-allocated data (via malloc in C). It’s needed for data structures like linked lists or hash tables where the size isn't known at compile time.
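Python hides the C-level stack, but the LIFO frame discipline is observable all the same. Here is a small sketch using the standard inspect module; the frame counts are relative, since the interpreter adds frames of its own:

```python
import inspect

# Each function call pushes a frame, each return pops it -- the same
# Last-In, First-Out discipline as the C stack, one level up.
def depth():
    return len(inspect.stack())

def inner():
    return depth()

def outer():
    here = depth()          # one frame deeper than the caller
    return here, inner()    # inner() is one frame deeper still

base = depth()
outer_d, inner_d = outer()
print(base, outer_d, inner_d)   # each nested call adds exactly one frame
```

Watching these numbers change is the modern equivalent of stepping through the call stack in an IDE.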

Reading this reminded me of 30 years ago, using Turbo Pascal to debug programs line-by-line. I remember watching the call stack in the IDE, seeing exactly how variables changed and stack frames were pushed and popped as the program executed. Turbo Pascal in the 1990s was truly a thing of beauty for learning these concepts interactively.

The book provides some excellent diagrams to visualize these concepts.

This figure shows how the OS takes a lifeless program on disk (code and static data) and hydrates it into a living process in memory. Like the Trisolarans dehydrating to survive dormancy, the program exists as inert structure until the OS loads its bytes into an address space, reconstructing state and enabling execution. Only after this hydration step can the program spring into action.

To track all this, the OS needs a data structure. In the xv6 operating system (a teaching OS based on Unix), this is the "struct proc".

Figure 4.5 shows the code for the process structure in xv6. It tracks the process state (RUNNING, RUNNABLE, etc.), the process ID (PID), pointers to the parent process, and the context (registers) saved when the process is stopped. In other words, this struct captures the "inventory" the OS keeps for every running program.


Chapter 5: The Process API via Fork and Exec

In Chapter 5, we get to the UNIX process API. This is where the beauty of UNIX design really shines. UNIX creates processes with a pair of system calls: fork() and exec(). The fork() call is weird: it creates an almost exact copy of the calling process. Then things get really weird with exec(), which takes an existing process and transforms it into a different running program by overwriting its code and static data.

This code snippet demonstrates the fork() call. It shows how the child process comes to life as if it had called fork() itself, but with a return code of 0, while the parent receives the child's PID.


This figure puts it all together. It shows a child process using execvp() to run the word count program ('wc') on a source file, effectively becoming a new program entirely.

You might ask: Why not just have a single CreateProcess() system call? Why this dance of cloning and then overwriting? The separation of fork() and exec() is essential for building the UNIX Shell. It allows the shell to run code after the fork but before the exec. This space is where the magic of redirection and pipes happens. For example, if you run 'wc p3.c > newfile.txt', the shell:

  1. Calls fork() to create a child.
  2. Inside the child (before exec): Closes standard output and opens 'newfile.txt'. Because UNIX starts looking for free file descriptors at zero, and descriptor 1 has just been freed, open() assigns the new file descriptor 1, making the file the standard output.
  3. Calls 'exec()' to run 'wc'.
  4. The 'wc' program writes to standard output as usual, unaware that its output is now going to a file instead of the screen.

Figure 5.4 shows the code for redirection. It shows the child closing STDOUT_FILENO and immediately calling 'open()', ensuring the output of the subsequent 'execvp' is routed to the file.

This design is unmatched. It allows us to compose programs using pipes and redirection without changing the programs themselves. It is a testament to the brilliance of Thompson and Ritchie. The book refers to this as Lampson's Law: "Get it right... Neither abstraction nor simplicity is a substitute for getting it right". The UNIX designers simply got it right, and that is why this design still holds up 55 years later as the gold standard. 

As a final thought to spark some discussion, it is worth remembering that science advances by challenging even its most sacred cows. While we laud fork() and exec() as brilliant design, a 2019 HotOS paper titled "A fork() in the road" offers a counterpoint, arguing that fork() was merely a clever hack for the constraints of the 1970s that has since become a liability. The authors contend that it is a terrible abstraction for modern programmers and that it compromises OS implementations, going so far as to suggest we should deprecate it and teach it only as a historical artifact. This kind of debate is exactly how science works, as the OSTEP authors remind us in the chapter's end note.

January 30, 2026

Durastar Heat Pump Hysteresis

In which I discover that lying to HVAC manufacturers is an important life skill, and share a closely guarded secret: Durastar heat pumps like the DRADH24F2A / DRA1H24S2A with the DR24VINT2 24-volt control interface will infer the set point based on a 24-volt thermostat’s discrete heating and cooling calls, smoothing out the motor speed.

Modern heat pumps often use continuously variable inverters, so their compressors and fans can run at a broad variety of speeds. To support this feature, they usually ship with a “communicating thermostat” which speaks some kind of proprietary wire protocol. This protocol lets the thermostat tell the heat pump detailed information about the temperature and humidity indoors, and together they figure out a nice, constant speed to run the heat pump at. This is important because cycling a heat pump between “off” and “high speed” is noisy, inefficient, and wears it out faster.

Unfortunately, the manufacturer’s communicating thermostats are often Bad, Actually.™ They might be well-known lemons, or they don’t talk to Home Assistant. You might want to use a third-party thermostat like an Ecobee or a Honeywell. The problem is that there is no standard for communicating thermostats. Instead, general-purpose thermostats have just a few binary 24V wires. They can ask for three levels (off, low, and high) of heat pump cooling, of heating, and of auxiliary heat. There’s no way to ask for 53% or 71% heat.

So! How does the heat pump map these three discrete levels to continuously variable motor speeds? Does it use a bang-bang controller which jumps between, say, 30% and 100% intensity on calls for low and high heat, respectively? Or does it perform some sort of temporal smoothing, or try to guess the desired set point based on recently observed behavior?

How the heat pump interprets 24V signals is often hinted at in the heat pump’s manual. Lennox’s manuals, for instance, describe a sort of induced hysteresis mechanism where the heat pump ramps up gradually over time, rather than jumping to maximum. However, Durastar omits this information from their manuals. My HVAC contractor was also confused about this. After weeks of frustration, I tried to reach out to the manufacturer directly, and remembered that heat pump manufacturers are like paranoid wizards who refuse to disclose information about their products to everyday people. Only licensed HVAC professionals can speak to them. I wasted so, so much time on this, and have two secrets to share.

First: “licensed HVAC contractor” is not a real requirement. Many states have no licensing program, so you are just as licensed as anyone else in, say, rural Indiana. The trick that folks in construction use is to simply lie and tell them you’re an HVAC installer. As a midwesterner I do not like this, but it is apparently the only way to get things done. Durastar’s contractor support number is 877-616-2885.

Second: I talked to an actual Durastar engineer who immediately understood the question and why it was important. He explained that they use the thermistor on the air handler’s inlet as a proxy for indoor temperature, and learn the set point by tracking the 24V thermostat’s calls for heating over time. As long as the thermostat maintains a stable set point, the heat pump can run at a nice intermediate rate, trying to keep the indoor temperature close to—but not reaching—the inferred set point. That way the thermostat never stops calling for stage 1 heating/cooling, and the heat pump avoids short-cycling.

Finally, if the industry could please get its act together and make a standard protocol for communicating thermostats, we could all be free of this nonsense. I believe in you.