I love spelunking into unknown codebases with nothing but find and grep. It’s one of the most valuable skills one can develop as a programmer, imo, and in this video you can see how I approach it.
This video focuses on debugging GUI event handling. At first the bug seemed related to the app’s waveform selection, but I then realized it was a more general issue with the SerenityOS GUI UX — selecting a dropdown entry retains focus, and requires an explicit Escape key press to dismiss.
Ultimately I made progress by accident: hitting the keyboard while the selection was still active revealed that behavior (which I hadn’t noticed before).
You can see my general debugging flow:

- Get things building
- How to run the app from the command line (to see stdout)?
- How to print to stdout?
- Using debug prints to understand the GUI event handling
Overall I’m quite impressed with SerenityOS. I only realized after looking into the code exactly how much code they had written and how fully featured the system is. Well done to the team.
Four exercises that touch basic multithreaded and lock-free programming concepts (a rough sketch of one possible approach follows the list).

1. Implement a program that attempts to use two threads to increment a global counter to 10,000, with each thread incrementing 5,000. But make it buggy so that there are interleaving problems and the end result of the counter is less than 10,000.
2. Fix the above with atomics.
3. Implement a variant of the program: instead of simply incrementing the counter, make the counter wrap every 16 increments (as if incrementing through indices of an array of length 16). Make two threads each attempt to increment the counter (16 * 5000) times. The end state should have the counter back at index zero. Implement it in a buggy, naive way that often leaves the counter nonzero at the end, even if atomics are used.
4. Fix the above using a CAS loop.

(Bonus question for the above: Why isn’t std::atomic::compare_exchange_strong a good fit here?)
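Here is a rough C++ sketch of how all four exercises might look. This is just my own illustration (names and structure are invented), and it’s one possible approach rather than the intended solution:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

int counter = 0;               // exercise 1: plain int, racy
std::atomic<int> acounter{0};  // exercise 2: atomic fix
std::atomic<int> index16{0};   // exercises 3 & 4: counter that wraps every 16 increments

void racy() {                  // exercise 1: lost updates, result is often < 10000
    for (int i = 0; i < 5000; i++)
        counter++;             // non-atomic read-modify-write
}

void atomic_inc() {            // exercise 2: always reaches 10000
    for (int i = 0; i < 5000; i++)
        acounter.fetch_add(1, std::memory_order_relaxed);
}

void wrap_naive() {            // exercise 3: buggy even though each individual access is atomic
    for (int i = 0; i < 16 * 5000; i++)
        index16.store((index16.load() + 1) % 16);  // another thread can run between load and store
}

void wrap_cas() {              // exercise 4: a CAS loop makes the whole read-modify-write atomic
    for (int i = 0; i < 16 * 5000; i++) {
        int cur = index16.load();
        // compare_exchange_weak may fail spuriously, which is fine in a retry loop;
        // on failure, cur is reloaded with the current value and we try again
        while (!index16.compare_exchange_weak(cur, (cur + 1) % 16)) {
        }
    }
}

int main() {
    std::thread a(racy), b(racy);
    a.join(); b.join();
    std::printf("racy counter:       %d\n", counter);          // often < 10000

    std::thread c(atomic_inc), d(atomic_inc);
    c.join(); d.join();
    std::printf("atomic counter:     %d\n", acounter.load());  // 10000

    std::thread e(wrap_naive), f(wrap_naive);
    e.join(); f.join();
    std::printf("naive wrapping idx: %d\n", index16.load());   // often nonzero

    index16 = 0;
    std::thread g(wrap_cas), h(wrap_cas);
    g.join(); h.join();
    std::printf("CAS wrapping idx:   %d\n", index16.load());   // back at 0
}
```

Compile with something like `g++ -O0 -pthread`; with optimizations enabled the compiler may transform the racy loop, which changes how (but not whether) the bug shows up.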
It’s really all about memory. But to start at the beginning, the rough stack looks like this:
I find it easier to think about this from the middle out. On Linux, the kernel exposes hardware devices as files backed by the /dev virtual filesystem. Userspace can do normal syscalls like open, read, write, and mmap on them, as well as the less typical ioctl (for more arbitrary, device-specific functionality)1.
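As a rough sketch of what that looks like from userspace (using a serial port node as the example; whether /dev/ttyS0 exists and what its driver supports depends on your system):

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/serial.h>
#include <cstdio>

int main() {
    // An ordinary open() on a device file created by the driver.
    int fd = open("/dev/ttyS0", O_RDWR | O_NOCTTY | O_NONBLOCK);
    if (fd < 0) { perror("open"); return 1; }

    // Ordinary read()/write() calls are routed to the driver's handlers.
    char buf[64];
    ssize_t n = read(fd, buf, sizeof(buf));  // may return -1/EAGAIN if no data is pending

    // ioctl() is the escape hatch for device-specific functionality.
    struct serial_struct info;
    if (ioctl(fd, TIOCGSERIAL, &info) == 0)
        std::printf("read %zd bytes, UART type %d\n", n, info.type);

    close(fd);
    return 0;
}
```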
The files are created by kernel drivers, which are modules of kernel code whose sole purpose is to interface with and abstract hardware so it can be used by other parts of the operating system, or userspace. They are implemented using internal driver “frameworks” in the kernel, e.g. the I2C or SPI frameworks. When you interface with a file in /dev, you are directly triggering callback handlers in a driver, which execute in the process context.
That’s how userspace interfaces with the kernel. How do drivers interface with hardware? These days, mostly via memory mapped I/O (MMIO)2. This is when device hardware “appears” at certain physical addresses, and can be interfaced with via load and store instructions using an “API” that the device defines. For example, you can read data from a sensor by simply reading a physical address, or write data out to a device by writing to an address. The technical term for the hardware component these reads/writes interface with is “registers” (i.e. memory mapped registers).
(Aside: Other than MMIO, the other main interface the kernel has with hardware is interrupts, for interrupt-driven I/O processing (as opposed to polling, which is what MMIO enables). I’m not very knowledgeable about this, so I won’t get into it other than to say drivers can register handlers for specific IRQ (interrupt request) numbers, which will be invoked by the kernel’s generic interrupt handling infrastructure.)
Using MMIOs looks a lot like embedded bare metal programming you might do on a microcontroller like a PIC. At the lowest level, a kernel driver is really just embedded bare metal programming.
Here’s an example of a device driver for UART (serial port) hardware for ARM platforms: linux/drivers/tty/serial/amba-pl011.c. If you’re debugging an ARM Linux system via a serial connection, this might be the driver being used to e.g. show the boot messages.
The lines like:
cr = readb(uap->port.membase + UART010_CR);
are where the real magic happens.
This is simply doing a read from a memory address derived from some base address for the device, plus some offset of the specific register in question. In this case it’s reading some control information from a Control Register.
Device interfaces may range from having just a few to many registers.
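To make the bare-metal flavor concrete, here is a minimal sketch of what an MMIO register access boils down to. The base address, register offset, and bit value are invented for illustration and don’t correspond to any real device:

```cpp
#include <cstdint>

constexpr std::uintptr_t UART_BASE = 0x10000000;  // hypothetical base address from the SoC memory map
constexpr std::uintptr_t UART_CR   = 0x30;        // hypothetical control register offset

// volatile so the compiler actually performs every load/store
static inline volatile std::uint32_t* reg(std::uintptr_t offset) {
    return reinterpret_cast<volatile std::uint32_t*>(UART_BASE + offset);
}

void uart_enable() {
    std::uint32_t cr = *reg(UART_CR);  // load instruction -> read transaction on the bus
    cr |= 0x1;                         // set a hypothetical "enable" bit
    *reg(UART_CR) = cr;                // store instruction -> write transaction on the bus
}
```

The kernel’s readb/writel helpers (and register offset macros like UART010_CR) wrap essentially this pattern, plus architecture-specific mapping and ordering details.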
To go one step deeper down the rabbit hole, how do devices “end up” at certain physical addresses? How is this physical memory map interface implemented?3
The device/physical address mapping is implemented in digital logic outside the CPU, either on the System on Chip (SoC) (for embedded systems), or on the motherboard (PCs)4. The CPU’s physical interface includes the address, data, and control buses. Digital logic converts bits of the address bus into signals that mutually exclusively enable the devices physically connected to the bus. The implementations of load/store instructions in the CPU set a read/write bit appropriately on the control bus, which lets devices know whether a read or write is happening. The data bus is where data is transferred out of or into the CPU.
In practice, documentation for real implementations of these systems can be hard to find, unless you’re a customer of the SoC manufacturer. But there are some out there for older chips, e.g.
Here’s a block diagram for the Tegra 2 SoC architecture, which shipped in products like the Motorola Atrix 4G, Motorola Droid X2, and Motorola Photon. Obviously it’s much more complex than my description above. Other than the two CPU cores in the top left, and the data bus towards the middle, I can’t make sense of it. (link)
While not strictly a “System on Chip”, a classic PIC microcontroller shares many characteristics with a SoC (CPU, memory, peripherals, all in one chip package), but is much more approachable.
We can see the single MIPS core connected to a variety of peripheral devices on the peripheral bus. There are even layers of peripheral bussing, with a “Peripheral Bridge” connected to a second peripheral bus for things like I2C and SPI.
I was debugging a microcontroller recently and was surprised to see that I was limited to 4 breakpoints. I’m used to seeing that kind of limitation only with watchpoints, not breakpoints. (See my youtube video for more on this).
But it makes sense after all — code running on this microcontroller is different from code running as a process in an OS’s userspace, because the code will probably be flashed into some kind of ROM (Read Only Memory) or EEPROM. Since the code is unwritable at a hardware level, the typical method of inserting software breakpoint instructions isn’t possible. So the alternative is to use the processor’s support for hardware debugging, which occupies hardware resources and is thus finite.
Business models are an interesting topic that lives directly at the boundary between business and technology.
The business model describes an abstract plan for how the business is to function sustainably, and developing one is a full-time job in and of itself. Then, it must be implemented in the product using technology.
For example, a typical SaaS business model involves subscription pricing at various intervals (monthly & yearly), with a discount given for the yearly plan in exchange for more money paid up front and longer commitment. There may or may not be a limited time free trial period, or alternatively a limited time guaranteed refund. Furthermore, there may be different pricing tiers that unlock more advanced features.
Ultimately, this is all going to end up as code that models the various pricing tiers and plans, handles timing deadlines, and enforces security (i.e. making sure the advanced features are only available to pro users). Stripe is a common service used to model arbitrary business models and handle payment processing. They provide client libraries offering data models for concepts like users and plans.
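For illustration, here is a hypothetical sketch of the kind of data model and feature gating this turns into. All names are invented; this isn’t Stripe’s actual object model:

```cpp
#include <chrono>

enum class Plan { Basic, Pro };
enum class BillingInterval { Monthly, Yearly };

struct Subscription {
    Plan plan;
    BillingInterval interval;
    std::chrono::system_clock::time_point current_period_end;  // when access lapses unless renewed
    bool in_trial;
};

// Feature gating: advanced features only for paying (non-lapsed) Pro users.
bool can_use_advanced_features(const Subscription& sub,
                               std::chrono::system_clock::time_point now) {
    return sub.plan == Plan::Pro && now < sub.current_period_end;
}

int main() {
    auto now = std::chrono::system_clock::now();
    Subscription sub{Plan::Pro, BillingInterval::Yearly, now + std::chrono::hours(24 * 365), false};
    return can_use_advanced_features(sub, now) ? 0 : 1;
}
```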
While subscription models have grown in popularity as software shifts to the web, it’s not the only model. The older model of buying discrete software packaged versions still exists, mostly used by vendors of desktop software. In this model, customers pay a larger sum for unlimited use of a specific major version of software. Then they can optionally pay again to upgrade to a newer major version when it’s released. A variant of this exists where there’s no upgrade fee (i.e. “lifetime free updates”). Another model exists called “Rent to own” which is like a subscription, except payments stop at a certain point after which there is free unlimited use.
Whether the software runs on the client or server is another dimension to consider which may influence the business model — server side software is most commonly sold using subscriptions these days.
- SaaS web apps
- One time payment per version: Dash, Ableton Live
- One time payment, lifetime free updates: FL Studio, Tailwind CSS (technically more assets than software)
- SaaS web apps, lifetime plans: e.g. Roam Research
- Rent to own: audio plugins via Splice
The business model impacts technical strategy in a few ways:

- Branching and release management
- Compatibility of document artifacts
Subscriptions, “lifetime free updates”, and “rent to own” are simplest in terms of branching and release management. All users are expected to run the latest software (because it’s either free or automatic, in the case of server side), so development can largely happen on a single main branch which releases are cut from.
“One time payment per version” is more complex because potentially multiple major versions of the software need to be maintained in parallel. Ideally bugfixes in Version 2 would also be applied to V3 and V4 when appropriate, but no new feature development should make its way back to V2 and risk being included in binaries sent to users that only paid for V2. For this situation, long term release branches make sense as they provide code isolation, though they require some mechanism to forward bugfixes, etc.
The second aspect is document artifact compatibility. Again in this scenario, subscriptions, “lifetime free updates”, and “rent to own” are simpler in that all users can be assumed to be running on the latest version — or at least can be told to upgrade at no cost.
“One time payment per version” is again more complex because multiple versions of the software exist in the wild, all producing artifacts of slightly different versions. If artifacts may be sent between users of different versions, it creates a mess of trying to deal with all the compatibility issues that may arise.
A major difference between x86 and ARM32 is that while x86 generally1 offers conditional execution only for branch instructions (e.g. JNE), ARM32 offers conditional execution for many more instructions (e.g. ALU instructions, like ADDEQ: add if equal). The idea was that these can be used to avoid emitting traditional branching instruction sequences, which may suffer from pipeline stalls when a branch is mispredicted — instead, straight-line code with equivalent semantics can be emitted.
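For a concrete flavor, here is a tiny if-conversion candidate. The ARM32 assembly in the comments is hand-written to illustrate the idea, not the exact output of any particular compiler:

```cpp
// Clamp negative values to zero without a conditional branch.
int clamp_negative_to_zero(int x) {
    if (x < 0)      //   cmp   r0, #0
        x = 0;      //   movlt r0, #0    ; executes only when the "less than" flags are set
    return x;       //   bx    lr        ; straight-line code, no branch to mispredict
}

int main() { return clamp_negative_to_zero(-5); }  // returns 0
```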
This was actually removed in ARM64. The official quote:
The A64 instruction set does not include the concept of predicated or conditional execution. Benchmarking shows that modern branch predictors work well enough that predicated execution of instructions does not offer sufficient benefit to justify its significant use of opcode space, and its implementation cost in advanced implementations.
It turns out that the whole pipeline stall problem is generally not a huge issue anymore, as branch prediction has gotten so good, while support for the feature still requires allocating valuable instruction bits to encode the conditions. Note that ARM64 still uses 32-bit instructions, so conserving bits is still useful.
What is very interesting is that Intel’s recent APX extensions to x86-64 (whose primary purpose is to add more general purpose registers) move further in this conditional-instructions direction.
The performance features introduced so far will have limited impact in workloads that suffer from a large number of conditional branch mispredictions. As out-of-order CPUs continue to become deeper and wider, the cost of mispredictions increasingly dominates performance of such workloads. Branch predictor improvements can mitigate this to a limited extent only as data-dependent branches are fundamentally hard to predict.
To address this growing performance issue, we significantly expand the conditional instruction set of x86, which was first introduced with the Intel® Pentium® Pro in the form of CMOV/SET instructions. These instructions are used quite extensively by today’s compilers, but they are too limited for broader use of if-conversion (a compiler optimization that replaces branches with conditional instructions).
This includes support for conditional loads and stores, which is apparently tricky with modern out-of-order and superscalar architectures.
This might be an interesting metric to compare programming languages by: the distance between application programming and standard library programming.
In general, standard library programming is more difficult than average application programming for any programming language. Or at the very least “different” in some way — language features used, programming style required, etc. That difference might vary depending on the language.
Mainly, I’m thinking about C++, where standard library programming requires expertise in C++ template meta-programming (TMP) (e.g. to implement std::variant), which is effectively an entirely different programming language, existing in the C++ type system. While many application developers may also be competent in TMP, there are many that aren’t (including myself). It’s possible to be a very productive C++ programmer, and STL user, without being an expert at implementing generic libraries using TMP.
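To give a flavor of that “different language”, here is a small, simplified sketch of the kind of compile-time machinery a std::variant implementation needs (my own illustration, not real libc++/libstdc++ code): computing the index of a type within a parameter pack.

```cpp
#include <cstddef>
#include <type_traits>

// index_of<T, Ts...>::value is the position of T within Ts...
template <typename T, typename... Ts>
struct index_of;

// Base case: T is at the front of the pack.
template <typename T, typename... Rest>
struct index_of<T, T, Rest...> : std::integral_constant<std::size_t, 0> {};

// Recursive case: skip the first type and add one.
template <typename T, typename U, typename... Rest>
struct index_of<T, U, Rest...>
    : std::integral_constant<std::size_t, 1 + index_of<T, Rest...>::value> {};

static_assert(index_of<int, char, double, int>::value == 2);

int main() {}
```

Recursion over type lists, partial specialization as pattern matching, and values computed entirely at compile time: this is the programming style in question.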
Given this, my impression is that C++ has a relatively high distance between application programming and stdlib programming.
Python of course also has some distance here. I’m not qualified to speak on it, but I can imagine it also involves more obscure language features that do not occur often in normal app development. I would guess that this distance is less than C++’s.
Lastly, C is an interesting language to consider because it offers such a spartan feature set that there aren’t many more language features available to reach for. (I’m probably wrong here and there are obscure things I’m not aware of, including symbol versioning for compatibility.) But in general, my assumption would be that C has a relatively small distance, given its restrained feature set.
Ultimately, this metric is probably impossible to quantify and may have limited value, but it’s something I find intriguing to think about anyway.
The Linux kernel has an interesting difference compared to the Windows and macOS kernels: it offers syscall ABI compatibility.
This means that applications that program directly against the raw syscall interface are more or less guaranteed to always keep working, even with arbitrarily newer kernel versions. “Programming against the raw syscall interface” means including assembly code in your app that triggers syscalls:
- setting the appropriate syscall number in the syscall register
- setting arguments in the defined argument registers
- executing a syscall instruction
- reading the syscall return value register
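As a concrete sketch, here is what that looks like on x86-64 Linux with GCC/Clang inline assembly, invoking write(2) directly without going through libc (a minimal illustration, not production code):

```cpp
#include <cstddef>

long raw_write(int fd, const void* buf, std::size_t len) {
    long ret;
    asm volatile(
        "syscall"                          // execute the syscall instruction
        : "=a"(ret)                        // return value comes back in RAX
        : "a"(1L),                         // syscall number 1 = write on x86-64
          "D"(fd), "S"(buf), "d"(len)      // arguments in RDI, RSI, RDX
        : "rcx", "r11", "memory");         // the syscall instruction clobbers RCX and R11
    return ret;
}

int main() {
    const char msg[] = "hello from a raw syscall\n";
    raw_write(1, msg, sizeof(msg) - 1);    // fd 1 = stdout
}
```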
Here are the ABIs for some common architectures.

| Architecture | Syscall Number Register | Argument Registers | Syscall Return Value |
| --- | --- | --- | --- |
| x86 (32-bit) | EAX | EBX, ECX, EDX, ESI, EDI, EBP | EAX |
| x86-64 | RAX | RDI, RSI, RDX, R10, R8, R9 | RAX |
Manticore is my go-to source to quickly look these up: https://github.com/trailofbits/manticore
Once you’ve done this, now you’re relying on the kernel to not change any part of this. If the kernel changes any of these registers, or changes the syscall number mapping, your app will no longer trigger the desired syscall correctly and will break.
Aside from writing raw assembly in your app, there’s a more innocuous way of accidentally “programming directly against the syscall interface”: statically linking to libc. When you statically link to a library, that library’s code is directly included in your binary. libc is generally the system component responsible for implementing the assembly to trigger syscalls, and by statically linking to it, you effectively inline those assembly instructions directly into your application.
So why does Linux offer this and Windows and macOS don’t?
In general, compatibility is cumbersome. As a developer, if you can avoid having to maintain compatibility, it’s better. You have more freedom to change, improve, and refactor in the future. So by default it’s preferable to not maintain compatibility — including for kernel development.
Windows and macOS are able to not offer compatibility because they control the libc for their platforms and the rules for using it. And one of their rules is “you are not allowed to statically link libc”. For the exact reason that this would encourage apps that depend directly on the syscall ABI, hindering the kernel developers’ ability to freely change the kernel’s implementation.
If all app developers are forced to dynamically link against libc, then as long as kernel developers also update libc with the corresponding changes to the syscall ABI, everything works. Old apps run on a new kernel will dynamically link against the new libc, which properly implements the new ABI. Compatibility is of course still maintained at the app/libc level — just not at the libc/kernel level.
Linux doesn’t control the libc in the same way Windows and macOS do because in the Linux world, there is a distinct separation between kernel and userspace that isn’t present in commercial operating systems. Strictly speaking Linux is just the kernel, and you’re free to run whatever userspace on top. Most people run GNU userspace components (glibc), but alternatives are not unheard of (musl libc, also bionic libc on Android).
So because Linux kernel developers can’t 100% control the libc that resides on the other end of the syscall interface, they bite the bullet and retain ABI compatibility. This technically allows you to statically link with more confidence than on other OSs. That said, there are other reasons why you shouldn’t statically link libc, even on Linux.
The Linux kernel documents this stability guarantee (Documentation/ABI/README):

This directory documents the interfaces that the developer has
defined to be stable. Userspace programs are free to use these
interfaces with no restrictions, and backward compatibility for
them will be guaranteed for at least 2 years. Most interfaces
(like syscalls) are expected to never change and always be available.
What: The kernel syscall interface
This interface matches much of the POSIX interface and is based
on it and other Unix based interfaces. It will only be added to
over time, and not have things removed from it.
Note that this interface is different for every architecture
that Linux supports. Please see the architecture-specific
documentation for details on the syscall numbers that are to be
mapped to each syscall.
Apple’s developer documentation is explicit that statically linked binaries are unsupported on macOS:

Q: I'm trying to link my binary statically, but it's failing to link because it can't find crt0.o. Why?
A: Before discussing this issue, it's important to be clear about terminology:
A static library is a library of code that can be linked into a binary that will, eventually, be dynamically linked to the system libraries and frameworks.
A statically linked binary is one that does not import system libraries and frameworks dynamically, but instead makes direct system calls into the kernel.
Apple fully supports static libraries; if you want to create one, just start with the appropriate Xcode project or target template.
Apple does not support statically linked binaries on Mac OS X. A statically linked binary assumes binary compatibility at the kernel system call interface, and we do not make any guarantees on that front. Rather, we strive to ensure binary compatibility in each dynamically linked system library and framework.
If your project absolutely must create a statically linked binary, you can get the Csu (C startup) module from Darwin and try building crt0.o for yourself. Obviously, we won't support such an endeavor.
Have you ever updated your operating system and had your computer reboot multiple times during the process? Why does this happen? Why isn’t one reboot enough?
In general, if you have some running computer system that wants to upgrade itself, it’s easiest to download the new version, exit, and simply restart from scratch.
This is why you usually have to reboot for OS updates to take effect. On a different level, this is also why you usually have to reboot applications for their software updates to be applied.
The alternative to rebooting is hot swapping: the running program dynamically swapping itself out with the new one — while staying running the entire time. This is harder than rebooting and requires the developer to actually write code to implement this. A reboot-based update generally does not require much code beyond downloading the artifact into the place it will be looked for the next time things start.
If you opt for dynamic swapping, you need to make sure that the update applies cleanly and absolutely in memory — such that there are no lingering pieces of state from the old software lurking. You don’t want to be running half old software, half new software (or half old data and half new data in memory).
In general, this is not a problem if you reboot. Reboots start from a clean slate, so you can be sure that after reboot, you’re running with 100% the new software, 0% the old.1
These fundamental principles apply to any update on any kind of system: an application, its dependencies, an operating system (including core userspace libraries), or firmware running on a device.
So why would a computer reboot multiple times during an update?
It’s probably because it has updates A, B, and C to make, and they each require a reboot to take effect.
If there are blocking dependencies between them (C requires B first, which requires A first), there’s no way around doing a reboot after applying each.
If there are no dependencies, then it might be technically feasible to try to apply them all at the same time, and have them all apply together on the next reboot. But even so, such mass changes with multiple things updating all at once can be riskier. So it might still be safest to sacrifice update speed and update in a slower, more controlled manner, one update & reboot at a time.
For example, imagine a chain of dependent updates like this:

- There is a migration to a newer, fancier update server infrastructure.
- Using this new infrastructure requires a newer version of a core library that is in use on the running old system.
- One last update is pushed to the old infrastructure which updates the core library. Reboot once.
- Now the system can fetch a new update ZZ that’s sitting in the new infrastructure, using its new library support. Reboot again.
- And then maybe version ZZ of the software has some init code that downloads a firmware update for a hardware component that ZZ needs to run. This firmware update is incompatible with the previous version and is only applied once ZZ is confirmed to be present and working. Download the firmware update and reboot a third time.
One concrete technique I’m aware of that results in multiple boots is a disk image space optimization.
When building a disk image for a device, it’s common to include a data partition for user data which is separate from the system partition. In real use, the data partition will likely be very large, but when initially building a fresh image, the data partition will be all zeros. This is wasteful and makes the disk image occupy much more space than necessary, which slows down file transfers.
An optimization would be to build the disk image with a smaller data partition — just as big as necessary to hold pre-existing files on the partition. Then on the first boot, resize the partition to occupy the full available space. An easy way to apply that change to the system is to then reboot.
1. The exception is persistent state. If your system leaves persistent state on disk or elsewhere, you need to be careful that those artifacts are either compatible with the new system, or migrated.
LLDB supports custom scripts (“variable formatters”) to pretty print C++ data structures. For example, std::vector is typically implemented as a struct with three pointers: begin, end, and capacity. But if you wanted to print out a std::vector variable during a debugging session, printing out these three pointers isn’t likely to be helpful. What you actually want is to print the contents of the vector. Pretty printer scripts allow for doing this for your own data structures.1
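For a sense of what the debugger sees by default, here is a simplified sketch of that three-pointer layout (my own illustration, not the actual library source):

```cpp
#include <cstdio>

// Rough shape of a typical std::vector implementation: three pointers. A naive
// debugger print shows these raw pointers rather than the elements you care about.
template <typename T>
struct vector_layout_sketch {
    T* begin_;    // start of the element storage
    T* end_;      // one past the last element (size = end_ - begin_)
    T* cap_end_;  // one past the allocated storage (capacity = cap_end_ - begin_)
};

int main() {
    int storage[4] = {1, 2, 3, 4};
    vector_layout_sketch<int> v{storage, storage + 3, storage + 4};
    std::printf("size=%td capacity=%td\n", v.end_ - v.begin_, v.cap_end_ - v.begin_);
}
```

A variable formatter’s job is to walk a structure like this and present the size and elements instead of the raw pointers.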