An obscure quirk of the /proc/*/mem pseudofile is its "punch through" semantics. Writes performed through this file will succeed even if the destination virtual memory is marked unwritable. In fact, this behavior is intentional and actively used by projects such as the Julia JIT compiler and rr debugger.
This behavior raises some questions: Is privileged code subject to virtual memory permissions? In general, to what degree can the hardware inhibit kernel memory access?
By exploring these questions1, this article will shed light on the nuanced relationship between an operating system and the hardware it runs on. We’ll examine the constraints the CPU can impose on the kernel, and how the kernel can bypass these constraints.
Most people thought I was crazy for doing this, but I spent the last few months of my gap year working as a short order cook at a family-owned fast-food restaurant. (More on this here.) I'm a programmer by trade, so I enjoyed thinking about the restaurant's systems from a programmer's point of view. Here are some thoughts about two such systems.
This post details my adventures with the Linux virtual memory subsystem, and my discovery of a creative way to taunt the OOM (out of memory) killer by accumulating memory in the kernel, rather than in userspace.
Keep reading and you’ll learn:
Internal details of the Linux kernel’s demand paging implementation
How to exploit virtual memory to implement highly efficient sparse data structures
What page tables are and how to calculate the memory overhead incurred by them
A cute way to get killed by the OOM killer while appearing to consume very little memory (great for parties)
Note: Victor Michel wrote a great follow up to this post here.
Pretty recently I learned about setjmp() and longjmp(). They're a neat pair of libc functions which allow you to save your program's current execution context and resume it at an arbitrary point in the future (with some caveats1). If you're wondering why this is particularly useful, to quote the manpage, one of their main use cases is "…for dealing with errors and interrupts encountered in a low-level subroutine of a program." These functions can be used for more sophisticated error handling than simple error code return values.
I was curious how these functions worked, so I decided to take a look at musl libc's implementation for x86. First, I'll explain their interfaces and show an example usage program. Next, since this post isn't aimed at the assembly wizard, I'll cover some basics of x86 and Linux calling convention to provide some required background knowledge. Lastly, I'll walk through the source, line by line.
Contributing to open source is a popular recommendation for junior developers, but what do you actually do?
Fixing bugs is a natural first step, and people might say to look at the bug tracker and find a simple bug to fix. However, my advice would be to find your own bugs.
In 2019, I had some free time and really wanted to contribute to the LLVM project in some way. Working on the actual compiler seemed scary, but LLDB, the debugger, seemed more approachable.
I went to the LLVM Dev Meeting, met some LLDB devs, and got super excited to contribute. I went home, found a random bug on the bug tracker, took a look for all of 30 minutes, then … gave up. Fixing someone else's random string formatting bug simply wasn't interesting enough to motivate me to contribute.
3 months later I was doing some C++ dev for fun. I was debugging my code and ran into a really, really strange crash in the debugger. It was so strange that I looked into it further and it turned out to be a bug in LLDB’s handling of the “return” command for returning back to the caller of the current function. The command didn’t correctly handle returning from assembly stubs that don’t follow the standard stack layout/ABI, and caused memory corruption in the debugged process which eventually led to a crash.
This was totally different. I had found a super juicy bug and dedicated a couple weeks to doing a root cause analysis and working with the LLDB devs to create a patch, which was accepted.
So if you want to contribute to open source, I would agree with the common advice to fix some bug, but would recommend finding your own – it will be way more rewarding, fulfilling, and a better story to tell.
What if I told you you didn’t have to use just one git client? I use 5, and here’s why:
Command line – Sometimes it's the simplest, fastest way to do something.
Lazygit – Ultra-fast workflow for many git tasks, especially rebasing, reordering, rewriting commits. Quickly doing fixup commits and amending into arbitrary commits feels magical. Custom patches are even more magical.
Fork (Mac app) – Great branch GUI view. Nice drag and drop staging workflow.
Sublime Merge – Good for code review, can easily switch between the diff and commit message just by scrolling, no clicks.
Gitk – Great blame navigator.
Once you try one of these GUIs, you'll never go back to git add -p.
If you’ve ever had a painful move due to having too much stuff, you might have had the urge to become a minimalist to avoid an unpleasant experience like that again.
There are a lot of good things about minimalism and the philosophy of needing less. In addition to being easier to move, it's better for the environment, and less costly to have and maintain fewer things.
But watch out – it's easy to go too far in the other direction and let the minimalism take on a toxic quality, where you don't even acquire things that you really would find helpful, and would improve the quality of your life.
If you’re in that position, I’d just remind you that it’s ok to acquire a bunch of stuff, learn what is really valuable to you, then trim things down later. Sometimes to go narrow, you first need to go wide.
When I started self-studying kernel development via MIT 6.828 (2018)’s open source materials (JOS OS), I thought I was making my life easier by not starting from scratch. Doing this allowed me to get going very quickly with a base skeleton for an OS, as well as a fully functioning build system and helper Makefile command for debugging with qemu.
That was great, but I’ve realized that there are also many ways I’m doing this on hard mode:
Doing it in only 2-3 hours a week
This is not really enough time to develop an OS, and is particularly hard for debugging, where it can be helpful to have significant context built up for longer sessions.
Live-streaming almost all of it
This can be very distracting and make me go at a slower pace than usual, since I try to engage with viewers and answer questions. On the other hand, explaining things helps solidify my understanding.
Working with a 6 year old code-base, but using a newer toolchain β which means fighting bitrot
There have been multiple cases where the codebase actually got in my way and produced very hard to debug bugs. Also, when I transition labs, it introduces a bunch of foreign code that I don’t understand. It can be difficult to tell if I truly have something broken, or if the new code is in an intermediate state that is meant to yield issues like crashes or assertion failures.
In the past year I’ve effectively reinvented my public identity as a live-streamer. That wasn’t the goal initially, but it’s been one of the most fun journeys I’ve been on in a long time, and I’m glad I did it.
For many of the people discovering me now, that’s what they know me as, but what they don’t know is the 10+ years of public presence I had pre-streaming. Since 2012 or so, I’ve been on Twitter and blogging (to a lesser extent) as part of the tech & infosec scenes, sharing random projects I was working on, or things I learned about.
In 2019, I revamped my blog and wrote a few viral blog posts about Linux kernel internals. This was the start of reinventing myself as a blogger. Around that time I started posting a lot more on Twitter also.
And now in 2024, I've started streaming and funnily enough, that has had more traction for me than any other project I've had before. So I guess that makes me a streamer now – until the next self reinvention!
So, be careful of getting stuck in self identities that you've historically created, but don't have intentional reasons to maintain. Don't be afraid to try new things – even if they potentially reshape your entire identity.
Originally I used a YYYY/MM/DD/<slug> url scheme for my blog, which felt nice since it creates namespacing and one can also get some date context about a blog post simply from the URL.
However, I eventually removed all date context from the URLs entirely. Namespacing isn't a real benefit in practice (name collisions are rare) and neither is date context. I also found it annoying that I couldn't type post URLs from memory, which is occasionally useful. And shorter URLs are often a plus.
To migrate to this new URL scheme without breaking links, I used the “Redirection” WordPress plugin. Yet another reason why I like WordPress.
A common artist pitfall is getting too stuck on a particular piece, which builds high expectations for it when it’s eventually released. It can be painful if that piece isn’t appreciated like you hoped it would be.
A remedy is to zoom out and maintain a global perspective over all the art you’ll release in your life. In the grand scheme, this one piece is hopefully a drop in the bucket of all the many other pieces you’ll make, some of which (hopefully) will be many times better than the one you’re stuck on. Staying stuck on one prevents you from moving forward to creating those amazing future works.
This advice doesn’t apply to every artist, but I think it does for many: Release it and move on to the next.
Here’s what I know about x86 kernel development. The usual caveat applies for my lab notes: this is not considered a high quality document and there may be inaccuracies.
Main Processor Modes
Real Mode (16 bit)
CPU boots into this mode for backward compatibility
The IDT is instead the IVT here (Interrupt Vector Table)
Legacy BIOS booting begins here – the BIOS loads the first sector of disk into memory at a fixed address and begins executing it in Real Mode.
Protected Mode (32 bit)
Segmentation is mandatory – a minimal GDT is necessary.
Paging is not mandatory
Long Mode (64 bit)
Paging is mandatory
Segmentation
Originally solved the problem of CPUs having more physical memory than could be addressed with 16 bit registers. (Note, this is the opposite situation of what we have today where virtual address spaces are vastly greater than physical ones)
Introduces the concept of “segments”, which are variable length “windows” into a larger address space.
Data structures
GDT (Global Descriptor Table)
Contains "descriptors" to describe the memory segments ("windows") available. Segment registers effectively contain an index into this table.
There are “normal” descriptors which describe memory segments and “system” descriptors which point to more exotic things, like Task State Segments (TSS) or Local Descriptor Tables (LDT)
These days OSs use the GDT as little as strictly necessary. On 32 bit, this looks like 4 entries that start at base 0x0 and cover the entire 32 bit space: 2 for kernel, 2 for user – 1 for code and 1 for data each. (?)
On 64 bit GDT is totally unused (I believe?), as are nearly all segment registers(?), except FS and GS. (Why are they special? There is even a special MSR for GS?)
LDT (Local Descriptor Table)
My understanding is LDTs are really no longer used by nearly any OS. Some parts of segmentation are still required by OSs, like the GDT, but LDT is not required and almost completely unused in modern OSs.
These would contain segments only accessible to a single task, unlike the regions in the GDT (?)
IDT (Interrupt Descriptor Table)
Interrupts: Generally externally triggered, e.g. from hardware devices
Exceptions: Internally generated, e.g. a division by zero exception, or a software breakpoint
When the processor receives an interrupt or exception, it handles that by executing code – interrupt handler routines.
These routines are registered via the IDT – an array of descriptors that describes how to handle a particular interrupt.
Interrupts/exceptions have numbers which directly map to entries in the IDT.
IDT descriptors are a polymorphic structure – there are several kinds of gates: interrupt, trap, and task gates (maybe others – call gates?).
Interrupt/trap gates are nearly identical and differ only in their handling of the interrupt flag. They contain a pointer to code to execute. This is expressed via a segment/offset.
Task gates make use of HW task switching and offer a more "turnkey" solution for running code in a separate context when an interrupt happens – but generally aren't used for other reasons (?). Context switch is automatic?
Task gates in the IDT point to a TSS descriptor in the GDT, which points to a TSS (?)
Some interrupt/exceptions have an associated error code, some don’t.
Interrupt gates describe a minimal privilege level required to trigger them with an int instruction – this prevents userspace from triggering arbitrary interrupts with int. If userspace tries to trigger an int without permission, that is converted into a General Protection fault (#GP)
Hardware task switching
Although long considered obsolete in favor of software task/context switching, x86 provides significant facilities for modeling OS "tasks" at the hardware level, including functionality for automatic context switching between them.
Hardware task switching may require copying much more machine state than is necessary. Software context switches can be optimized to copy less and be faster, which is one reason why they’re preferred.
Hardware task switching puts a fixed limit on the number of tasks in the system (?)
TSS (Task State Segment)
This is a large structure modeling a “task” (thread of execution)
Contains space for registers & execution context
Even if HW task switching is not used, one TSS is still needed as the single HW Task running on the system, which internally implements all software context switching
TSS is minimally used for stack switches when handling interrupts across privilege levels – when switching from userspace to kernel during an interrupt, the kernel stack is taken from the TSS
Linux task_struct is probably named with "task" due to being originally created for i386
The Task Register (TR) contains a descriptor pointing to the current active HW task (?)
The JOS boot process
JOS is the OS used in MIT 6.828 (2018).
Bootloader
JOS includes a small BIOS bootloader in addition to the kernel
The bootloader begins with 16 bit Real Mode assembly to do the typical steps to initialize the CPU (set the A20 line, etc.)
Transition to protected mode
Set the stack immediately at the start of the code, and transition to C
The kernel is loaded from disk using Port IO
Loaded into physical memory around the 1MB mark, which is generally considered a “safe” area to load into. (Below the 1MB mark has various regions where devices, BIOS data, or other “things” reside and it’s best to not clobber them.)
Call into the kernel entrypoint
Early kernel boot
Receive control from the bootloader in protected mode
Transition to paging
The kernel is linked to run in high memory, starting at 0xf0000000 (KERNBASE)
The transition from segmentation to paging virtual memory happens in a few steps. First there's an initial basic transition using a set of minimal page tables.
After that transition is made, a basic memory allocator is set up, which is then used to allocate memory for the production page tables which implement the production virtual memory layout used for the rest of runtime.
The minimal page tables contain two mappings:
1 – Identity map the first 4MB to itself
2 – Map the 4MB region starting at KERNBASE also to the first 4MB
One page directory entry maps a 4MB region, so only two page directory entries are needed
The first identity mapping is critical because without it the kernel would crash immediately after loading CR3, since the next instruction would be unmapped. The identity mapping allows the low mem addresses the kernel resides in to remain valid until the kernel can jump to high mem
The assembly there looks a bit strange because the jump appears redundant. But all the asm labels are linked using highmem addresses, so jumping to them transitions from executing in low mem, to executing in high mem, where the kernel will remain executing for the rest of its lifetime.
Set the stack pointer to a global data/BSS section of internal storage within the kernel and enter C code
Memory allocators
The goal is to transition to a production virtual memory setup
This requires allocating memory for page tables
To build the dynamic page/frame allocator, we start with a basic bump allocator
It starts allocating simply from the end of the kernel in memory. We have access to a symbol for the end of the kernel via a linker script.
The kernel queries the physical memory size of the system and dynamically allocates data structures for the dynamic page/frame allocator. This is an array of structures that correspond to each available frame of physical memory. These structures have an embedded linked list pointer (intrusive linked list) and a refcount. They are linked together into a linked list to implement a stack data structure where frames can be popped (when allocating) and pushed (when freeing).
Using this frame allocator, pages for the production page tables are allocated.