It’s an experimental home for content which favors the audio medium: mostly non-technical stories & lessons from my life. I will have audio versions of some of the blog posts here.
I was thinking about the growing number of publishing channels I now have and what belongs where. Here’s what I have so far:
Twitter: More polished posts that I feel comfortable directly sharing with a larger audience.
Blog: Home base for everything.
YouTube: Technical topics where screencasting is most natural.
Podcast: Stories, life & career lessons, more intimate or personal topics.
Reminder: You should unconditionally be using the diff3 merge style config with git (the merge.conflictstyle setting, set to diff3). It’s strictly superior to the default config and provides critical context for resolving conflicts.
Instead of simply showing the state of the original code, and then the incoming conflicting change, it also shows the code before either change was made. This lets you see the changes both sides were attempting, and mechanically reason about how to merge them.
The mechanical process is (credit to Mark Zadel for showing me):
Begin with the common ancestor code.
Identify the difference between it and the original code.
Apply that difference to the incoming change. Keep only the incoming change.
The opposite direction (identify difference between middle and incoming change; apply difference to original code and keep it) also works. You can choose whichever is simpler.
Example:
Here’s a diff that happened on master.
  int main()
  {
      int x = 41;
-     return x + 1;
+     return x + 2;
  }
Here’s a diff that happened in parallel on a development branch.
  int main()
  {
      int x = 41;
-     return x + 1;
+     int ret = x + 1;
+     return ret;
  }
Here’s the merge conflict from e.g. rebasing the branch onto master.
int main()
{
    int x = 41;
<<<<<<< HEAD
    return x + 2;
||||||| parent of 4cfa6e2 (Add intermediate variable)
    return x + 1;
=======
    int ret = x + 1;
    return ret;
>>>>>>> 4cfa6e2 (Add intermediate variable)
}
On the first side, we change the core computation. On the second side, we extract a variable.
One way to resolve the conflict is to take the change between the middle and top (x + 1 -> x + 2), then apply it to the bottom.
That produces the correct conflict resolution:
int main()
{
    int x = 41;
    int ret = x + 2;
    return ret;
}
The other way of extracting it (refactor a variable out from the top x+2 code) produces the same end result.
(This is a high level summary of my current knowledge, primarily to help me crystallize the knowledge. It comes entirely from Jeff Preshing’s blog (see end of post) and YouTube talk. This is not intended to be a comprehensive overview; for that, please see the aforementioned materials. I am very much a non-expert on this topic; please treat everything with skepticism.)
When programming with atomics, how are you supposed to know which of the ~four memory orderings to use? For example, the main ones (C++ terminology) are:
memory_order_seq_cst
memory_order_acquire
memory_order_release
memory_order_relaxed
(and a few other niche ones: acq_rel, consume)
First, as Jeff Preshing states, there is a distinction between “sequentially consistent” atomics and “low level” atomics. He describes it as two libraries for atomics masquerading as a single one within the C++ standard library.
The first, “sequentially consistent”, can be considered a higher level way of using atomics. You can safely use seq_cst everywhere. You get simpler semantics and higher likelihood of correctness, just at the expense of performance. As an optimization, you can then port the code to the second form of “low level atomics”. This is where you must choose the explicit memory orderings.
But why do sequentially consistent atomics come with a performance hit?
The performance hit comes from cross core communication. The sequentially consistent memory model offers a very strong guarantee to the programmer; in addition to the ordering of atomic operations being consistent across cores (which is always the case), the ordering of non-atomic operations is also guaranteed to be consistent (i.e. no reordering) relative to the atomic ones.
This is relevant because programming with atomics often involves “guard” (atomic) variables that regulate access to “normal” (non-atomic) data that is transferred between threads. This guarantee requires extra effort from the memory subsystem of the CPU in the form of cross core communication, as the cores need to effectively synchronize their caches.
When one moves to “low level” atomics, the strict constraints required of the memory subsystem are relaxed. Not all orderings of non-atomic accesses relative to atomic accesses must be maintained. The consequence is that less cross-core coordination is required. This can be exploited for higher performance in specific scenarios where the strict ordering constraint is not required in both (or any) directions (i.e. non-atomic memory accesses are allowed to move before or after the atomic access).
Exercise: Would one expect to see a performance improvement from porting code from sequentially consistent atomics to low level atomics, if the code is run on a single core system?
The whole point of low level atomics is to optimize performance by relaxing constraints and reducing cross core communication, so no. There is no cross core communication in a single core system, so there is nothing substantial to optimize.
(I am not 100% sure of this answer. This is the current state of my knowledge and I would appreciate being corrected or affirmed either way!)
So how does one choose between all those memory orderings?
With my non-expert understanding, I believe there are some simple rules that make the decision much easier than it might seem.
First off: Decide whether you’re using sequentially consistent or low level atomics. If the former, you use seq_cst everywhere (this is even the default with C++ if you don’t specify anything).
If you want to optimize to use low level atomics, then for most cases, you then only have three choices: acquire, release, and relaxed. (seq_cst is no longer an option; acq_rel is more niche; consume is actively discouraged). Then:
If you’re deciding for a load operation, you then only choose between acquire and relaxed. Loads are never release.
And vice versa, if you’re deciding for a store operation, you then only choose between release and relaxed. Stores are never acquire.
This narrows it down to two choices. To determine whether it’s acquire/release or relaxed, determine whether the load/store has a synchronizes-with relation to a corresponding store/load. If there is one, you want acquire/release. Otherwise, choose relaxed.
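To make the acquire/release pairing concrete, here is a minimal sketch of the “guard variable” pattern described earlier. The names (payload, ready, run_handoff) are made up for illustration, and this reflects my non-expert understanding rather than a definitive recipe. The store to the guard is a release, the load of the guard is an acquire, and the two form the synchronizes-with relation that makes the non-atomic read safe.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for illustration only.
int payload = 0;                  // "normal" (non-atomic) data handed between threads
std::atomic<bool> ready{false};   // "guard" (atomic) variable

void producer() {
    payload = 42;                                  // plain write to the shared data
    ready.store(true, std::memory_order_release);  // store: release, never acquire
}

int consumer() {
    // load: acquire, never release. It pairs with the release store above.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    // Once the acquire load has seen `true`, the write to payload is visible too.
    return payload;
}

// Runs one hand-off and returns what the consumer observed.
int run_handoff() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);  // no other threads exist yet
    int seen = 0;
    std::thread reader([&] { seen = consumer(); });
    std::thread writer(producer);
    reader.join();
    writer.join();
    return seen;
}
```

If the flag were only a statistic that no other data depends on (no synchronizes-with relation), relaxed would be the choice under the rules above.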
Read these blog posts for a fuller answer to this:
(This is not news; just something I was surprised to learn recently.)
The classic virtual memory design for an operating system maps the kernel in the address space of every process. This improves context switch performance; switching into the kernel then requires no expensive page table reset. The kernel can run using the same page tables userspace was running with.
Typically, the kernel is mapped into the upper section of virtual memory. For example, on 32 bit Linux, the kernel is mapped into the top gigabyte. Concretely, to implement this, the page table entries mapping those kernel pages are set with the supervisor bit on. This means only privileged code (running in Ring 0 on x86) can access those pages. This is what enforces security and prevents userspace from accessing kernel memory. The MMU is therefore responsible for enforcing security.
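As a rough sketch of the check the MMU performs, here is a software model of it. The bit positions are the x86 ones (on x86 the bit is called U/S, and a cleared bit means supervisor-only), but the function name is hypothetical and real hardware does this during the page table walk, with more rules than shown here.

```cpp
#include <cassert>
#include <cstdint>

// Low flag bits of an x86 page table entry.
constexpr std::uint64_t kPresent  = 1ull << 0;  // page is mapped
constexpr std::uint64_t kWritable = 1ull << 1;  // writes are allowed
constexpr std::uint64_t kUser     = 1ull << 2;  // U/S: set = user accessible, clear = supervisor only

// Hypothetical model of the privilege check; only the user/supervisor
// rule from the text is modeled, nothing else.
bool access_allowed(std::uint64_t pte, bool cpu_in_user_mode) {
    if (!(pte & kPresent)) {
        return false;  // unmapped page: always faults
    }
    if (cpu_in_user_mode && !(pte & kUser)) {
        return false;  // userspace touching a supervisor-only (kernel) page
    }
    return true;
}
```

A kernel page would be something like kPresent | kWritable with kUser clear, so access_allowed returns false for it in user mode and true in kernel mode.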
In the world of CPU side-channel vulnerabilities, this MMU-enforced security boundary is no longer reliable. Specifically, the Meltdown vulnerability allows userspace to read arbitrary memory, anywhere in the virtual address space, regardless of whether the supervisor bit is set. It does this using cache-based timing side-channels that exist due to speculative execution of memory accesses.
This means that it’s no longer safe to map the kernel into the address space of userspace processes, and indeed that’s no longer done. The general name for this mitigation is “Kernel Page Table Isolation” (KPTI). As of “modern” kernels (since 5.15 for aarch64 Linux I believe), it’s on by default (see CONFIG_UNMAP_KERNEL_AT_EL0). Context switches now must reset the page tables to a set private to the kernel.
KAISER will affect performance for anything that does system calls or interrupts: everything. Just the new instructions (CR3 manipulation) add a few hundred cycles to a syscall or interrupt. Most workloads that we have run show single-digit regressions. 5% is a good round number for what is typical. The worst we have seen is a roughly 30% regression on a loopback networking test that did a ton of syscalls and context switches.
The lesson here? Even the most seemingly fundamental knowledge about how computers work is subject to change. Don’t assume things are still as you learned them, and exercise caution and humility when discussing details of things whose development you haven’t actively kept up with.
This will be a basic fact to some, but you don’t need to load code into RAM to execute it. You can execute code straight from ROM.
In fact, this is how most computer systems boot up. After the CPU finishes initializing, it starts executing at a specific physical address which is generally mapped to some kind of Boot ROM.
(On x86, this first instruction is located at 0xFFFFFFF0, which is interestingly almost completely at the top of memory. The code there then needs to contain a jump to the rest of the actual boot code. (Source: Intel 64 and IA-32 Architectures Software Developer’s Manual, Vol 3A Section 9.1.4)
I believe ARM systems are different and the start address can vary.)
The Boot ROM, like the name suggests, is not RAM. It’s ROM. It’s a totally separate device on the memory bus offering nonvolatile storage. It’s mapped into physical memory using the mesh of digital logic that implements the physical memory mapping. (More: https://offlinemark.com/2023/08/09/how-the-hardware-software-interface-works/)
The CPU is generally not aware of what specific device is on the other end of the memory bus, servicing reads and writes. During instruction fetch, it simply issues reads to the memory bus, receives instruction data, then executes it. The data can transparently come from RAM, ROM, or potentially even some other device, provided it is fast enough.
The reason this was unintuitive to me is that until recently I’ve only ever done “normal” programming, where programs are loaded from disk into memory before running them. This is the domain of probably 99% of programmers. And it’s not even just limited to userspace application programmers; even kernel developers have their code loaded into RAM before it’s run. It’s usually only the developers of very early stage bootloaders and microcontroller firmware that need to be aware of the CPU running code from locations other than RAM.
I love spelunking into unknown codebases with nothing but find and grep. It’s one of the most valuable skills one can develop as a programmer imo and in this video you can see how I approach it.
This video focuses on debugging GUI event handling. At first the bug seemed related to the app’s waveform selection, but I then realized it was a more general topic with the SerenityOS GUI UX: selecting a dropdown entry retains focus, and requires an explicit escape key.
Ultimately I made progress accidentally by hitting the keyboard while the selection was still active, which revealed that fact to me (I hadn’t noticed it before).
You can see my general debugging flow:
Get things building
How to run app from command line (to see stdout)?
How to print to stdout?
Using debug prints to understand the GUI event handling
Overall I’m quite impressed with SerenityOS. I only realized after looking into the code exactly how much code they had written and how fully featured the system is. Well done to the team.
Four exercises that touch basic multithreaded and lockfree programming concepts.
Implement a program that attempts to use two threads to increment a global counter to 10,000 with each thread incrementing 5000. But make it buggy so that there are interleaving problems and the end result of the counter is less than 10,000.
Fix the above with atomics.
Implement a variant of the program: instead of simply incrementing the counter, make the counter wrap every 16 increments (as if incrementing through indices of an array of length 16). Make two threads each attempt to increment the counter (16 * 5000) times. The end state should have the counter be back at index zero. Implement it in a buggy naive way that causes the counter to often be nonzero, even if atomics are used.
Fix the above using a CAS loop.
(Bonus question for the above: Why isn’t std::atomic::compare_exchange_strong a good fit here?)
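(Spoiler warning: try the exercises first.) For checking your work afterwards, here is one possible sketch of exercises 3 and 4. All names are my own, and it’s just one way to do it. The naive version uses an atomic load and an atomic store, but the read-modify-write as a whole is not atomic, so increments can be lost. The CAS loop retries until no other thread changed the counter in between.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

constexpr int kSlots = 16;                     // counter wraps every 16 increments
constexpr int kIncrementsPerThread = 16 * 5000;

std::atomic<int> index_counter{0};

// Exercise 3 (buggy): another thread can increment between the load and
// the store, and that increment is then overwritten.
void increment_naive() {
    int next = (index_counter.load() + 1) % kSlots;
    index_counter.store(next);
}

// Exercise 4: compare-and-swap loop. On failure, compare_exchange_weak
// writes the current value into `expected`, so we recompute and retry.
void increment_cas() {
    int expected = index_counter.load();
    int desired;
    do {
        desired = (expected + 1) % kSlots;
    } while (!index_counter.compare_exchange_weak(expected, desired));
}

// Two threads each perform kIncrementsPerThread CAS increments.
// Returns the final counter value.
int run_cas_counter() {
    index_counter.store(0);
    auto worker = [] {
        for (int i = 0; i < kIncrementsPerThread; ++i) {
            increment_cas();
        }
    };
    std::thread a(worker);
    std::thread b(worker);
    a.join();
    b.join();
    return index_counter.load();
}
```

Swapping increment_cas for increment_naive in the worker should reproduce the buggy behavior of exercise 3 on most multi-core machines, though a lost update is never guaranteed on any particular run.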
Some rough lab notes on these topics to record the current state of my knowledge. I’m not an expert, so there may be inaccuracies.
Mutexes
On Linux, libpthread mutexes are implemented using the underlying futex syscall.
They are basically a spinlock (in userspace), backed by the kernel for wait/signal operations only when absolutely necessary (i.e. when there’s contention). In the common case of an uncontended lock acquire, there is no context switch, which improves performance.
The userspace spinlock portion uses atomics as spinlocks usually do, specifically because the compare and set must be atomic.
Jeff Preshing (see below) writes that each OS/platform has an analogous concept to this kind of “lightweight” mutex; Windows and macOS have them too.
Before futex(2), other syscalls were used for blocking. One option might have been the semaphore API, but commit 56c910668cff9131a365b98e9a91e636aace337a in glibc is before futex, and it seems like they actually use signals. (pthread_mutex_lock -> __pthread_lock (still has spinlock elements, despite being before futex) -> suspend() -> __pthread_suspend -> __pthread_wait_for_restart_signal -> sigsuspend)
A primary advantage of futex over previous implementations is that futexes only require kernel resources when there’s contention.
Like atomics, mutex implementations include memory barriers (maybe even implicitly due to atomics) to prevent loads/stores from inappropriately crossing the lock/unlock boundary due to compiler and/or hardware instruction reordering optimizations.
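To illustrate the userspace half of this, here is a minimal spinlock sketch. It is not how glibc implements pthread mutexes and it never calls futex; where a real mutex would ask the kernel to put the thread to sleep under contention, this one just yields and retries. The class and function names are my own. It does show the two points above: the compare-and-set is atomic, and the acquire/release orderings keep loads/stores from crossing the lock/unlock boundary.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative spinlock (hypothetical, not glibc's implementation).
class SpinLock {
public:
    void lock() {
        bool expected = false;
        // Atomically: if locked_ == false, set it to true. Acquire ordering on
        // success keeps the critical section from moving above the lock.
        while (!locked_.compare_exchange_weak(expected, true,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
            expected = false;            // the failed CAS overwrote it with the current value
            std::this_thread::yield();   // a futex-based mutex would sleep in the kernel here
        }
    }

    void unlock() {
        // Release ordering keeps the critical section from moving below the unlock.
        locked_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> locked_{false};
};

// Two threads increment a plain int 5000 times each under the lock.
int run_locked_counter() {
    SpinLock lock;
    int counter = 0;
    auto worker = [&] {
        for (int i = 0; i < 5000; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(worker);
    std::thread b(worker);
    a.join();
    b.join();
    return counter;
}
```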
It’s really all about memory. But to start at the beginning, the rough stack looks like this:
Userspace application
Kernel driver
Hardware device
I find it easier to think about this from the middle out. On Linux, the kernel exposes hardware devices as files backed by the /dev virtual filesystem. Userspace can do normal syscalls like open, read, write, and mmap on them, as well as the less typical ioctl (for more arbitrary, device-specific functionality) [1].
The files are created by kernel drivers, which are modules of kernel code whose sole purpose is to interface with and abstract hardware so it can be used by other parts of the operating system, or userspace. They are implemented using internal driver “frameworks” in the kernel, e.g. the I2C or SPI frameworks. When you interface with a file in /dev, you are directly triggering callback handlers in a driver which execute in the process context.
That’s how userspace interfaces with the kernel. How do drivers interface with hardware? These days, mostly via memory mapped I/O (MMIO) [2]. This is when device hardware “appears” at certain physical addresses, and can be interfaced with via load and store instructions using an “API” that the device defines. For example, you can read data from a sensor by simply reading a physical address, or write data out to a device by writing to an address. The technical term for the hardware component these reads/writes interface with is “registers” (i.e. memory mapped registers).
(Aside: Other than MMIO, the other main interface the kernel has with hardware is interrupts, for interrupt driven I/O processing (as opposed to polling, which is what MMIO enables). I’m not very knowledgeable about this, so I won’t get into it other than to say drivers can register handlers for specific IRQ (interrupt requests) numbers, which will be invoked by the kernel’s generic interrupt handling infrastructure.)
Using MMIOs looks a lot like embedded bare metal programming you might do on a microcontroller like a PIC or Arduino (AVR). At the lowest level, a kernel driver is really just embedded bare metal programming.
Here’s an example of a device driver for UART (serial port) hardware for ARM platforms: linux/drivers/tty/serial/amba-pl011.c. If you’re debugging an ARM Linux system via a serial connection, this might be the driver being used to e.g. show the boot messages.
The lines like:
cr = readb(uap->port.membase + UART010_CR);
are where the real magic happens.
This is simply doing a read from a memory address derived from some base address for the device, plus some offset of the specific register in question. In this case it’s reading some control information from a Control Register.
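Stripped of the kernel helpers, a register access like this boils down to a volatile load or store at base + offset. Here is a rough sketch of the idea. The offset value and function names are hypothetical (not taken from the driver or a datasheet), and an ordinary array stands in for the device so the code can run in userspace.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical register offset, for illustration only.
constexpr std::size_t kControlRegOffset = 0x30;

// Roughly what an 8-bit register read amounts to: a single load from
// base + offset. `volatile` stops the compiler from caching or dropping it.
std::uint8_t read_reg8(const volatile std::uint8_t* base, std::size_t offset) {
    return *(base + offset);
}

// The matching 8-bit register write.
void write_reg8(volatile std::uint8_t* base, std::size_t offset, std::uint8_t value) {
    *(base + offset) = value;
}

// Stand-in for real hardware: plain memory playing the role of the
// device's register block. On a real system `base` would come from
// mapping the device's physical address range.
std::uint8_t demo_read_control() {
    static volatile std::uint8_t fake_device[0x100] = {};
    write_reg8(fake_device, kControlRegOffset, 0x01);
    return read_reg8(fake_device, kControlRegOffset);
}
```

With a real device the read could return something different from what was last written, since the hardware owns the register; the array here can’t model that.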
Device interfaces may range from having just a few to many registers.
To go one step deeper down the rabbit hole, how do devices “end up” at certain physical addresses? How is this physical memory map interface implemented? [3]
The device/physical address mapping is implemented in digital logic outside the CPU, either on the System on Chip (SoC) (for embedded systems), or on the motherboard (PCs) [4]. The CPU’s physical interface includes the address, data, and control buses. Digital logic converts bits of the address bus into signals that mutually exclusively enable devices that are physically connected to the bus. The implementations of load/store instructions in the CPU set a read/write bit appropriately in the control bus, which lets devices know whether a read or write is happening. The data bus is where data is either transferred out from or into the CPU.
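Here is a toy software model of that address decoding, to make the idea concrete. The memory map, the device set and all names are invented for illustration; real decode logic is hardware and far more involved. The top bits of the address pick exactly one device (like chip-select lines), and the read/write distinction plays the role of the control bus bit.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical memory map for this sketch:
//   0x00000000..  boot ROM (read-only)
//   0x10000000..  RAM
//   0x20000000    UART data register (write-only here)
struct Bus {
    std::array<std::uint8_t, 0x1000> rom{};
    std::array<std::uint8_t, 0x1000> ram{};
    std::uint8_t last_uart_byte = 0;

    // Top 4 address bits select the device; the low 12 bits are the
    // offset within it (so each device aliases across its region).
    static std::uint32_t device_select(std::uint32_t addr) { return addr >> 28; }
    static std::uint32_t offset(std::uint32_t addr) { return addr & 0xFFF; }

    // "Read" on the control bus: the selected device drives the data bus.
    std::uint8_t read(std::uint32_t addr) const {
        switch (device_select(addr)) {
            case 0: return rom[offset(addr)];
            case 1: return ram[offset(addr)];
            default: return 0xFF;  // no device answers in this model
        }
    }

    // "Write" on the control bus: the selected device latches the data bus.
    void write(std::uint32_t addr, std::uint8_t value) {
        switch (device_select(addr)) {
            case 0: break;                              // ROM ignores writes
            case 1: ram[offset(addr)] = value; break;
            case 2: last_uart_byte = value; break;      // UART "transmits" the byte
            default: break;
        }
    }
};
```

From the caller’s side, reading from ROM, RAM or a device register looks identical, which is also why a CPU can fetch instructions from ROM without knowing what is on the other end.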
In practice, documentation for real implementations of these systems can be hard to find, unless you’re a customer of the SoC manufacturer. But there are some out there for older chips, e.g.
Here’s a block diagram for the Tegra 2 SoC architecture, which shipped in products like the Motorola Atrix 4G, Motorola Droid X2, and Motorola Photon. Obviously it’s much more complex than my description above. Other than the two CPU cores in the top left, and the data bus towards the middle, I can’t make sense of it. (link)
While not strictly a “System on Chip”, a classic PIC microcontroller has many shared characteristics of a SoC (CPU, memory, peripherals, all in one chip package), but is much more approachable.
We can see the single MIPS core connected to a variety of peripheral devices on the peripheral bus. There’s even layers of peripheral bussing, with a “Peripheral Bridge” connected to a second peripheral bus for things like I2C and SPI.