Category Archives: C

osdev journal: Gotchas with cc-runtime/libgcc

August 11, 2025: _Lab Notes 🧪, C, Operating Systems

libclang_rt/libgcc are compiler runtime support library implementations, which the compiler occasionally emits calls into instead of directly inlining the codegen. Usually this is for software implementations of math operations (division/mod). Generally you’ll need a runtime support library for all but the most trivial projects.

cc-runtime is a utility library for hobby OS developers. It is a standalone version of libclang_rt, which can be included vendored into an OS build.

The main advantage for me is that it lets me use a prebuilt clang from Homebrew. The problem with prebuilt clang from Homebrew, is it doesn’t come with a libclang_rt compiled for i386 (which makes sense, why would it — I’m on an ARM64 Mac).

(This is unlike the prebuilt i386-elf-gcc in Homebrew, which does come with a prebuilt libgcc for i386).

Since it doesn’t come with libclang_rt for i386, my options are:

Option
Keep using libgcc from i386-elf-gcc in Homebrew	Undesirable — the goal is to only depend on one toolchain, and here I’d depend on both clang and gcc.
Build clang and libclang_rt from source	Undesirable — it’s convenient to avoid building the toolchain from source if possible.
Vendor in a libgcc.a binary from https://codeberg.org/osdev/libgcc-binaries	Undesirable — vendoring binaries should be a last resort
Use cc-runtime	Best — No vendored binaries, no gcc dependency, no building toolchain from source

However, cc-runtime does have a gotcha. If you’re not careful, you’ll balloon your binary size.

This is because the packed branch of cc-runtime (which is the default and easiest to integrate) packs all the libclang_rt source files into a single C file, which produces a single .o file. So the final .a library has a single object file in it.

This is in contrast to libgcc.a (or a typical compilation of libclang_rt) where the .a library probably contains multiple .o files — one for each .c file.

By default, linkers will optimize and only use any .o files in the .a library that are needed. But since cc-runtime is a single .o file, the whole thing will get included! This means, the binary will potentially include many libclang_rt functions that are unused.

In my case, the size of one of my binaries went from 36k (libgcc) to 56k (cc-runtime, naive).

To work around this, you either need to use the trunk branch of cc-runtime (which doesn’t pack them all into one .c file). This is ~30 .c files and slightly more annoying to integrate into the build system.

Or, you can use some compiler/linker flags to make the linker optimization more granular and work at the function level, instead of the object file level.

Those are:

Compiler flag:

-ffunction-sections -fdata-sections

Linker flag

--gc-sections

With this, my binary size reduced to 47k. So there is still a nontrivial size increase, but the situation is slightly improved.

Ultimately, my preferred solution is the first: to use the trunk branch. The build integration is really not that bad, and the advantage is you don’t need to remember to use the special linker flag, which you’d otherwise need to ensure is in any link command for any binary that links against cc-runtime.

That said, those compiler/linker flags are probably a good idea to use anyway, so the best solution might be to do both.

Pure GNU Make is (unfortunately) not enough

Leave a reply

March 31, 2025: _Tech 💻, C, C++, CMake

I’d love to simply use GNU Make for my C/C++ projects because it’s so simple to get started with. Unfortunately it’s lacking a few essential qualify of life features:

Out of tree builds
Automatic header dependency detection
Recompile on CFLAGS change
First class build targets

Out of tree builds

If you want your build to happen in an isolated build directory (as opposed to creating object files in your source tree), you need to implement this yourself.

It involves a lot of juggling paths. Not fun!

Automatic header dependency detection

In C/C++, when a header file changes, you must recompile all translation units (i.e. object files, roughly) that depend on (i.e. include) that header. If you don’t, the object file will become stale and none of your changes to constants, defines, or struct definitions (for example) will be picked up.

In Make rules, you typically express dependencies between source files, and object files, e.g:

%.o: %.c
  # run compiler command here

This will recompile the object file when the source file changes, but won’t recompile when any headers included by that source file change. So it’s not good enough out of the box.

To fix this, you need to manually implement this by:

Passing the proper flags to the compiler to cause it to emit header dependency information. (Something like -MMD. I don’t know them exactly because that’s my whole point =)
Instructing the build to include that generate dependency info (Something like -include $(OBJECTS:.o=.d))

The generated dependency info looks like this:

pmap.c.obj: \
 kern/pmap.c \
 inc/x86.h \
 inc/types.h \
 inc/mmu.h \
 inc/error.h \
 inc/string.h \

Recompile on CFLAGS change

In addition to recompiling if headers change, you also want to recompile if any of your compiler, linker, or other dev tool flags change.

Make doesn’t provide this out of the box, you’ll also have to implement this yourself.

This is somewhat nontrivial. For an example, check out how the JOS OS (from MIT 6.828 (2018)) does it: https://github.com/offlinemark/jos/blob/1d95b3e576dd5f84b739fa3df773ae569fa2f018/kern/Makefrag#L48

First class build targets

In general, it’s nice to have build targets as a first class concept. They express source files compiled by the module, and include paths to reach headers. Targets can depend on each other and seamlessly access the headers of another target (the build system makes sure all -I flags are passed correctly).

This is also something you’d have to implement yourself, and there are probably limitations to how well your can do it in pure Make.

Make definitely has it’s place for certain tasks (easily execute commonly used command lines), but I find it hard to justify using it for anything non-trivially sized compared to more modern alternatives like CMake, Meson, Bazel, etc.

That said, large projects like Linux use it, so somehow they must make it work!

Mutexes, atomics, lockfree programming

Leave a reply

September 7, 2023: _Lab Notes 🧪, C, C++, Computer Architecture, Concurrency

Some rough lab notes on these topics to record the current state of my knowledge. I’m not an expert, so there may be inaccuracies.

Mutexes

On Linux, libpthread mutexes are implemented using the underlying futex syscall
They are basically a combination of a spinlock (in userspace), backed by the kernel for wait/signal operations only when absolutely necessary (i.e. when there’s contention). In the common case of an uncontended lock acquire, there is no context switch which improves performance
The userspace spinlock portion uses atomics as spinlocks usually do, specifically because the compare and set must be atomic
Jeff Preshing (see below) writes that each OS/platform has an analogous concept to this kind of “lightweight” mutex — Windows and macOS have them too
Before futex(2), other syscalls were used for blocking. One option might have been the semaphore API, but commit 56c910668cff9131a365b98e9a91e636aace337a in glibc is before futex, and it seems like they actually use signals. (pthread_mutex_lock -> __pthread_lock (still has spinlock elements, despite being before futex) -> suspend() -> __pthread_suspend -> __pthread_wait_for_restart_signal -> sigsuspend)
A primary advantage of futex over previous implementations is that futexes only require kernel resources when there’s contention
Like atomics, mutexes implementations include memory barriers (maybe even implicitly due to atomics) to prevent loads/stores from inappropriately crossing the lock/unlock boundary due to compiler and/or hardware instruction reordering optimizations

Continue reading →

How the hardware/software interface works

1 Reply

August 9, 2023: _Tech 💻, C, Computer Architecture, Linux Kernel

It’s really all about memory. But to start at the beginning, the rough stack looks like this:

Userspace application
Kernel driver
Hardware device

I find it easier to think about this from the middle out. On Linux, the kernel exposes hardware devices as files backed by the /dev virtual filesystem. Userspace can do normal syscalls like open, read, write, and mmap on them, as well as the less typical ioctl (for more arbitrary, device-specific functionality).¹.

The files are created by kernel drivers which are modules of kernel code whose sole purpose is to interface with and abstract hardware so it can be used by other parts of the operating system, or userspace. They are implemented implemented using internal driver “frameworks” in the kernel, e.g. the I2C or SPI frameworks. When you interface with a file in /dev, you are directly triggering callback handlers in a driver which execute in the process context.

That’s how userspace interfaces with the kernel. How do drivers interface with hardware? These days, mostly via memory mapped I/O (MMIO)². This is when device hardware “appears” at certain physical addresses, and can be interfaced with via load and store instructions using an “API” that the device defines. For example, you can read data from a sensor by simply reading a physical address, or write data out to a device by writing to an address. The technical term for the hardware component these reads/writes interface with is “registers” (i.e. memory mapped registers).

(Aside: Other than MMIO, the other main interface the kernel has with hardware is interrupts, for interrupt driven I/O processing (as opposed to polling, which is what MMIO enables). I’m not very knowledgeable about this, so I won’t get into it other than to say drivers can register handlers for specific IRQ (interrupt requests) numbers, which will be invoked by the kernel’s generic interrupt handling infrastructure.)

Using MMIOs looks a lot like embedded bare metal programming you might do on a microcontroller like a PIC or Arduino (AVR). At the lowest level, a kernel driver is really just embedded bare metal programming.

Here’s an example of a device driver for UART (serial port) hardware for ARM platforms: linux/drivers/tty/serial/amba-pl011.c. If you’re debugging an ARM Linux system via a serial connection, this is might be the driver being used to e.g. show the boot messages.

The lines like:

cr = readb(uap->port.membase + UART010_CR);

are where the real magic happens.

This is simply doing a read from a memory address derived from some base address for the device, plus some offset of the specific register in question. In this case it’s reading some control information from a Control Register.

#define UART010_CR		0x14	/* Control register. */

linux/include/linux/amba/serial.h#L28

Device interfaces may range from having just a few to many registers.

To go one step deeper down the rabbit hole, how do devices “end up” at certain physical addresses? How is this physical memory map interface implemented?³

The device/physical address mapping is implemented in digital logic outside the CPU, either on the System on Chip (SOC) (for embedded systems), or on the motherboard (PCs)⁴. The CPU’s physical interface include the address, data, and control buses. Digital logic converts bits of the address bus into signals that mutually exclusively enable devices that are physically connected to the bus. The implementations of load/store instructions in the CPU set a read/write bit appropriately in the Control bus, which lets devices know whether a read or write is happening. The data bus is where data is either transferred out from or into the CPU.

In practice, documentation for real implementations of these systems can be hard to find, unless you’re a customer of the SoC manufacturer. But there are some out there for older chips, e.g.

Here’s a block diagram for the Tegra 2 SoC architecture, which shipped in products like the Motorola Atrix 4G, Motorola Droid X2, and Motorola Photon. Obviously it’s much more complex than my description above. Other than the two CPU cores in the top left, and the data bus towards the middle, I can’t make sense of it. (link)

While not strictly a “System on Chip”, a classic PIC microcontroller has many shared characteristics of a SoC (CPU, memory, peripherals, all in one chip package), but is much more approachable.

We can see the single MIPS core connected to a variety of peripheral devices on the peripheral bus. There’s even layers of peripheral bussing, with a “Peripheral Bridge” connected to a second peripheral bus for things like I2C and SPI.

Linux Internals: How /proc/self/mem writes to unwritable memory

2 Replies

May 12, 2021: _Deep Dive 🔍, C, Favorite, Linux Kernel

Introduction

An obscure quirk of the /proc/*/mem pseudofile is its “punch through” semantics. Writes performed through this file will succeed even if the destination virtual memory is marked unwritable. In fact, this behavior is intentional and actively used by projects such as the Julia JIT compiler and rr debugger.

This behavior raises some questions: Is privileged code subject to virtual memory permissions? In general, to what degree can the hardware inhibit kernel memory access?

By exploring these questions⁵, this article will shed light on the nuanced relationship between an operating system and the hardware it runs on. We’ll examine the constraints the CPU can impose on the kernel, and how the kernel can bypass these constraints.

Continue reading →

The tradeoffs in using -Weverything

Leave a reply

April 13, 2021: _Tech 💻, C, C++

In addition to the well-known -Wall, clang offers another interesting warning flag: -Weverything. While it sounds like a “strictly better” option (the more warnings, the better?) it’s not.

This topic is already well covered by Arthur O’Dwyer’s blog post and the clang documentation.

Arthur makes a strong case against use of this flag. The clang docs generally agree, though offering a slightly more lenient position:

If you do use -Weverything then we advise that you address all new compiler diagnostics as they get added to Clang, either by fixing everything they find or explicitly disabling that diagnostic with its corresponding Wno- option.

But what I have not seen clearly expressed is how use of -Weverything is a tradeoff.

Continue reading →

Surgical formatting with git-clang-format

1 Reply

April 2, 2021: _Tech 💻, C, C++

If you’re already a 10x engineer, you probably won’t need this article. But for the rest of us, this is what I wish I knew about clang-format as an inexperienced C++ programmer: how to only format the changes in your pull request.

You may have already heard of clang-format. It auto-formats source files for languages including C and C++. You can aim it at a source file and format the entire thing using clang-format -i file.cpp.

If you’re contributing to a project that is already 100% clang-format clean, then this workflow works fine. But you’ll occasionally encounter a project that is not quite 100% formatted, such as LLVM, osquery, or Electron⁶.

For these projects, the “format entire files” workflow doesn’t work because you’ll incidentally format parts of the files that are unrelated to your contribution. This will add noise to your diff and make it harder for your reviewers.

In this case, you need a way to surgically format only the lines changed in your contribution. To do this, you can use the clang-format git extension. This article will cover the basics of git-clang-format, including a practical workflow that allows for messy development, and formatting at the end.

Continue reading →

How setjmp and longjmp work (2016)

2 Replies

February 9, 2016: _Deep Dive 🔍, Assembly, C, Favorite, Linux

Pretty recently I learned about setjmp() and longjmp(). They’re a neat pair of libc functions which allow you to save your program’s current execution context and resume it at an arbitrary point in the future (with some caveats⁷). If you’re wondering why this is particularly useful, to quote the manpage, one of their main use cases is “…for dealing with errors and interrupts encountered in a low-level subroutine of a program.” These functions can be used for more sophisticated error handling than simple error code return values.

I was curious how these functions worked, so I decided to take a look at musl libc’s implementation for x86. First, I’ll explain their interfaces and show an example usage program. Next, since this post isn’t aimed at the assembly wizard, I’ll cover some basics of x86 and Linux calling convention to provide some required background knowledge. Lastly, I’ll walk through the source, line by line.

Continue reading →