
x86 kernel development lab notes

Here’s what I know about x86 kernel development. The usual caveat applies to my lab notes: this is not a polished document, and there may be inaccuracies.


Main Processor Modes

  • Real Mode (16 bit)
    • CPU boots into this mode for backward compatibility
    • Instead of an IDT, interrupts here are dispatched through the IVT (Interrupt Vector Table)
    • Legacy BIOS booting begins here — the BIOS loads the first sector of the disk into memory at a fixed address (0x7C00) and begins executing it in Real Mode.
  • Protected Mode (32 bit)
    • Segmentation is mandatory — a minimal GDT is necessary.
    • Paging is not mandatory
  • Long Mode (64 bit)
    • Paging is mandatory

Segmentation

  • Originally solved the problem of CPUs having more physical memory than could be addressed with 16 bit registers. (Note: this is the opposite of today’s situation, where virtual address spaces are vastly larger than physical ones.)
  • Introduces the concept of “segments”: variable-length “windows” into a larger address space. (The real-mode arithmetic is sketched below.)
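
To make that concrete, here’s the real-mode translation rule as a few lines of C: the 16-bit segment is shifted left 4 bits and added to the 16-bit offset, producing a 20-bit physical address. This is just an illustrative sketch of the well-known formula, not code from any particular OS.

    #include <stdint.h>
    #include <stdio.h>

    /* Real-mode address translation: physical = (segment << 4) + offset.
     * Two 16-bit values combine into a 20-bit address, reaching 1MB. */
    static uint32_t real_mode_phys(uint16_t seg, uint16_t off)
    {
        return ((uint32_t)seg << 4) + off;
    }

    int main(void)
    {
        /* The classic BIOS load address 0x7C00 can be named many ways: */
        printf("0x07C0:0x0000 -> 0x%05x\n", (unsigned)real_mode_phys(0x07C0, 0x0000));
        printf("0x0000:0x7C00 -> 0x%05x\n", (unsigned)real_mode_phys(0x0000, 0x7C00));
        return 0;
    }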

Data structures

GDT (Global Descriptor Table)

  • Contains “descriptors” describing the memory segments (“windows”) available. Segment registers effectively contain an index into this table.
  • There are “normal” descriptors which describe memory segments and “system” descriptors which point to more exotic things, like Task State Segments (TSS) or Local Descriptor Tables (LDT)
  • These days OSs use the GDT only as much as strictly necessary. On 32 bit, this looks like 4 entries (plus the mandatory null descriptor) that all start at base 0x0 and cover the entire 32 bit space. 2 for kernel, 2 for user — 1 for code, 1 for data for each. (See the sketch after this list.)
  • On 64 bit, segment bases and limits are mostly ignored (a GDT is still required, e.g. for the code segment and TSS descriptors), as are nearly all segment registers, except FS and GS. They’re special because they keep usable base addresses in long mode, set via MSRs (IA32_FS_BASE, IA32_GS_BASE, and IA32_KERNEL_GS_BASE for swapgs), which OSs use for thread-local storage and per-CPU data.
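
Here’s a minimal sketch in C of such a flat GDT (the field layout follows the Intel SDM; the names and the GDT_ENTRY macro are my own, not from any particular OS):

    #include <stdint.h>

    /* One 8-byte GDT entry, per the Intel SDM layout. */
    struct gdt_entry {
        uint16_t limit_low;   /* limit bits 0..15 */
        uint16_t base_low;    /* base bits 0..15 */
        uint8_t  base_mid;    /* base bits 16..23 */
        uint8_t  access;      /* present, DPL, type */
        uint8_t  gran_limit;  /* flags (G, D/B) + limit bits 16..19 */
        uint8_t  base_high;   /* base bits 24..31 */
    } __attribute__((packed));

    /* Flat segment: base 0, limit 0xFFFFF in 4KB units (i.e. 4GB), 32-bit. */
    #define GDT_ENTRY(acc) {                            \
        .limit_low  = 0xFFFF,                           \
        .base_low   = 0, .base_mid = 0, .base_high = 0, \
        .access     = (acc),                            \
        .gran_limit = 0xCF,                             \
    }

    struct gdt_entry gdt[] = {
        { 0 },            /* index 0: mandatory null descriptor        */
        GDT_ENTRY(0x9A),  /* index 1: kernel code (ring 0, exec/read)  */
        GDT_ENTRY(0x92),  /* index 2: kernel data (ring 0, read/write) */
        GDT_ENTRY(0xFA),  /* index 3: user code   (ring 3, exec/read)  */
        GDT_ENTRY(0xF2),  /* index 4: user data   (ring 3, read/write) */
    };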

LDT (Local Descriptor Table)

  • My understanding is LDTs are no longer used by nearly any OS. Some parts of segmentation are still required, like the GDT, but the LDT is optional and almost completely unused in modern OSs.
  • These would contain segments accessible only to a single task, unlike the system-wide entries in the GDT.

IDT (Interrupt Descriptor Table)

  • Interrupts: generally externally triggered, e.g. from hardware devices
  • Exceptions: internally generated, e.g. a division-by-zero exception or a software breakpoint
  • When the processor receives an interrupt or exception, it handles that by executing code — interrupt handler routines.
  • These routines are registered via the IDT — an array of descriptors that describes how to handle a particular interrupt.
  • Interrupts/exceptions have numbers which directly map to entries in the IDT.
  • IDT descriptors are a polymorphic structure — there are several kinds of gates: interrupt, trap, and task gates. (Call gates also exist, but they live in the GDT/LDT, not the IDT.)
  • Interrupt and trap gates are nearly identical and differ only in their handling of the interrupt flag (interrupt gates clear IF on entry; trap gates don’t). They contain a pointer to the code to execute, expressed as a segment selector plus offset. (See the sketch after this list.)
  • Task gates make use of HW task switching and offer a more “turnkey” solution for running code in a separate context when an interrupt happens; the context switch is automatic. But they generally aren’t used (see the hardware task switching section below).
  • Task gates in the IDT hold a selector for a TSS descriptor in the GDT, which in turn points to a TSS.
  • Some interrupt/exceptions have an associated error code, some don’t.
  • Gate descriptors specify a minimum privilege level (DPL) required to trigger them with an int instruction — this prevents userspace from triggering arbitrary interrupts with int. If userspace tries to trigger an int without permission, that is converted into a General Protection fault (#GP).
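
Here’s a rough C sketch of a 32-bit IDT gate plus a registration helper (the struct layout follows the Intel SDM; idt_set_gate is a hypothetical helper of my own, not any particular OS’s API):

    #include <stdint.h>

    /* One 8-byte IDT gate descriptor (32-bit protected mode). */
    struct idt_gate {
        uint16_t offset_low;   /* handler address bits 0..15         */
        uint16_t selector;     /* kernel code segment selector (GDT) */
        uint8_t  zero;         /* unused, must be zero               */
        uint8_t  type_attr;    /* P | DPL | gate type                */
        uint16_t offset_high;  /* handler address bits 16..31        */
    } __attribute__((packed));

    static struct idt_gate idt[256];

    /* Register `handler` for vector `n`. type_attr 0x8E = present,
     * DPL 0, 32-bit interrupt gate; 0xEE (DPL 3) would allow userspace
     * to trigger the vector with `int` without faulting. */
    static void idt_set_gate(int n, uint32_t handler, uint16_t sel,
                             uint8_t type_attr)
    {
        idt[n].offset_low  = handler & 0xFFFF;
        idt[n].selector    = sel;
        idt[n].zero        = 0;
        idt[n].type_attr   = type_attr;
        idt[n].offset_high = handler >> 16;
    }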

Hardware task switching

  • Although long considered obsolete in favor of software task/context switching, x86 provides significant facilities for modeling OS “tasks” at the hardware level, including functionality for automatic context switching between them.
  • Hardware task switching may require copying much more machine state than is necessary. Software context switches can be optimized to copy less and be faster, which is one reason why they’re preferred.
  • Hardware task switching puts a fixed limit on the number of tasks in the system, since each task needs a TSS descriptor in the GDT, which holds at most 8192 entries.

TSS (Task State Segment)

  • This is a large structure modeling a “task” (thread of execution)
  • Contains space for registers & execution context
  • Even if HW task switching is not used, one TSS is still needed: the system runs as a single HW task, within which the OS implements all software context switching
  • The TSS is minimally used for stack switching when handling interrupts across privilege levels — when an interrupt moves the CPU from userspace to the kernel, the kernel stack pointer is taken from the TSS (see the sketch below)
  • Linux’s task_struct is probably named with “task” due to being originally created for i386
  • The Task Register (TR) holds the selector of the TSS descriptor for the currently active HW task
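
A minimal sketch of the TSS fields involved in that stack switch (32-bit layout per the Intel SDM; the real structure has many more register-save fields, omitted here):

    #include <stdint.h>

    /* The start of the 32-bit TSS. When an interrupt arrives from
     * ring 3, the CPU loads SS:ESP from ss0/esp0 before pushing the
     * interrupt frame: this is how the kernel stack is found. */
    struct tss {
        uint32_t prev_task_link;  /* used by HW task switching (nesting) */
        uint32_t esp0;            /* kernel stack pointer for ring 0     */
        uint32_t ss0;             /* kernel stack segment selector       */
        /* ... esp1/ss1, esp2/ss2, CR3, EIP, general registers, etc. ... */
    } __attribute__((packed));

A kernel that does software context switching typically updates esp0 on every switch, so the next interrupt from userspace lands on the right thread’s kernel stack.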

The JOS boot process

JOS is the OS used in MIT 6.828 (2018).

Bootloader

  • JOS includes a small BIOS bootloader in addition to the kernel
  • The bootloader begins with typical 16 bit Real Mode assembly performing the usual CPU initialization steps (enable the A20 line, etc)
  • Transition to protected mode
  • Set the stack to grow down from the start of the code (0x7C00), and transition to C
  • The kernel is loaded from disk using port I/O (sketched after this list)
  • It is loaded into physical memory at the 1MB mark, which is generally considered a “safe” area to load into. (Below the 1MB mark are various regions where devices, BIOS data, or other “things” reside, and it’s best not to clobber them.)
  • Call into the kernel entrypoint
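
Here’s a C sketch of that port-I/O sector read, modeled on JOS’s boot/main.c (written from memory, so treat it as illustrative; the ports are the standard legacy ATA ones):

    #include <stdint.h>

    #define SECTSIZE 512

    static inline uint8_t inb(uint16_t port)
    {
        uint8_t data;
        __asm__ volatile ("inb %1, %0" : "=a"(data) : "d"(port));
        return data;
    }

    static inline void outb(uint16_t port, uint8_t data)
    {
        __asm__ volatile ("outb %0, %1" : : "a"(data), "d"(port));
    }

    static inline void insl(uint16_t port, void *addr, int cnt)
    {
        __asm__ volatile ("cld; rep insl"
                          : "+D"(addr), "+c"(cnt) : "d"(port) : "memory");
    }

    /* Spin until the drive is ready (status port 0x1F7: BSY clear, RDY set). */
    static void waitdisk(void)
    {
        while ((inb(0x1F7) & 0xC0) != 0x40)
            ;
    }

    /* Read one 512-byte sector at LBA `offset` into `dst`. */
    static void readsect(void *dst, uint32_t offset)
    {
        waitdisk();
        outb(0x1F2, 1);                      /* sector count = 1       */
        outb(0x1F3, offset);                 /* LBA bits 0..7          */
        outb(0x1F4, offset >> 8);            /* LBA bits 8..15         */
        outb(0x1F5, offset >> 16);           /* LBA bits 16..23        */
        outb(0x1F6, (offset >> 24) | 0xE0);  /* LBA bits 24..27 + mode */
        outb(0x1F7, 0x20);                   /* command: READ SECTORS  */
        waitdisk();
        insl(0x1F0, dst, SECTSIZE / 4);      /* pull in the 512 bytes  */
    }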

Early kernel boot

  • Receive control from the bootloader in protected mode
  • Transition to paging
  • The kernel is linked to run in high memory, starting at 0xF0000000 (KERNBASE)
  • The transition from segmentation to paging virtual memory happens in a few steps. First there’s an initial basic transition using a set of minimal page tables.
  • After that transition is made, a basic memory allocator is set up, which is then used to allocate memory for the production page tables implementing the virtual memory layout used for the rest of runtime.
  • The minimal page tables contain two mappings:
    • 1 – Identity map the first 4MB to itself
    • 2 – Map the 4MB region starting at KERNBASE also to the first 4MB
  • One page directory entry maps a 4MB region, so only two page directory entries are needed
  • These page tables are constructed statically at compile time (sketched after this list)
  • The first (identity) mapping is critical: without it, the kernel would crash immediately after loading CR3, because the next instruction’s address would be unmapped. The identity mapping keeps the low mem addresses the kernel resides in valid until the kernel can jump to high mem
  • The assembly there looks a bit strange because the jump appears redundant. But all the asm labels are linked at highmem addresses, so jumping to one transitions from executing in low mem to executing in high mem, where the kernel will remain for the rest of its lifetime.
  • Set the stack pointer to internal storage in the kernel’s global data/BSS section and enter C code
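
A sketch of those statically-built tables, modeled on JOS’s kern/entrypgdir.c (i386 code, reconstructed from memory; names and constants are approximations):

    #include <stdint.h>

    typedef uint32_t pde_t;
    typedef uint32_t pte_t;

    #define KERNBASE  0xF0000000u
    #define PTE_P     0x001u   /* present  */
    #define PTE_W     0x002u   /* writable */
    #define PDXSHIFT  22       /* one page-directory entry covers 4MB */

    /* One page table mapping physical 0..4MB: 1024 entries, entry i
     * mapping page i, i.e. entry i = (i << 12) | PTE_P | PTE_W.
     * (JOS writes all 1024 entries out statically; declared here.) */
    extern pte_t entry_pgtable[1024];

    /* The two-entry directory: virtual 0..4MB (the identity map that
     * keeps the next instruction valid right after CR3 is loaded) and
     * virtual KERNBASE..KERNBASE+4MB both point at the same page table,
     * i.e. at physical 0..4MB. The "- KERNBASE" converts the table's
     * link-time virtual address into the physical address the MMU needs. */
    __attribute__((aligned(4096)))
    pde_t entry_pgdir[1024] = {
        [0]                    = (uintptr_t)entry_pgtable - KERNBASE + PTE_P,
        [KERNBASE >> PDXSHIFT] = (uintptr_t)entry_pgtable - KERNBASE + PTE_P + PTE_W,
    };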

Memory allocators

  • The goal is to transition to a production virtual memory setup
  • This requires allocating memory for page tables
  • To build the dynamic page/frame allocator, we start with a basic bump allocator
  • It starts allocating simply from the end of the kernel image in memory. We have access to a symbol for the end of the kernel via the linker script.
  • The kernel queries the physical memory size of the system and dynamically allocates the data structures for the dynamic page/frame allocator. This is an array of structures, one per available frame of physical memory. Each structure has an embedded linked-list pointer (an intrusive linked list) and a refcount. They are chained together into a linked list implementing a stack, where frames can be popped (when allocating) and pushed (when freeing). (See the sketch after this list.)
  • Using this frame allocator, pages for the production page tables are allocated.
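
Here’s a compact C sketch of both allocators: the bump allocator and the free-list frame allocator. The names are my own approximations (JOS uses boot_alloc, struct PageInfo, pp_link, pp_ref, etc.):

    #include <stddef.h>
    #include <stdint.h>

    #define PGSIZE 4096

    /* --- bump allocator: hands out memory right after the kernel image --- */

    extern char end[];        /* end-of-kernel symbol from the linker script */
    static char *nextfree;

    static void *boot_alloc(size_t n)
    {
        if (!nextfree)        /* first call: start just past the kernel */
            nextfree = (char *)(((uintptr_t)end + PGSIZE - 1) & ~(uintptr_t)(PGSIZE - 1));
        char *result = nextfree;
        nextfree += (n + PGSIZE - 1) & ~(size_t)(PGSIZE - 1);
        return result;
    }

    /* --- frame allocator: one struct per physical frame, intrusive list --- */

    struct page_info {
        struct page_info *free_link;  /* intrusive free-list pointer */
        uint16_t refcount;            /* references to this frame    */
    };

    static struct page_info *pages;      /* array: entry i describes frame i */
    static struct page_info *free_list;  /* stack of free frames             */

    /* Entry i corresponds to physical address i * PGSIZE. */
    static uintptr_t page2pa(struct page_info *p) { return (p - pages) * PGSIZE; }

    /* Pop a free frame (returns NULL when out of memory). */
    static struct page_info *page_alloc(void)
    {
        struct page_info *p = free_list;
        if (p) {
            free_list = p->free_link;
            p->free_link = NULL;
        }
        return p;
    }

    /* Push a frame back once its refcount has dropped to zero. */
    static void page_free(struct page_info *p)
    {
        p->free_link = free_list;
        free_list = p;
    }

The pages array itself is allocated with boot_alloc once the kernel knows how many frames exist; after that, all page-table memory comes from page_alloc.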

Memory is an abstraction

I was learning about the different types of RAM recently (e.g. SRAM and DRAM) and it occurred to me that “computer memory” is just an abstraction. This is obvious once you think about it but I think as a programmer, it’s very easy to not realize this. The idea of memory as a linear array of bits is an abstraction created and implemented by an electrical device.

Most programmers when they think of memory are thinking of virtual memory, which is a completely different abstraction. While it’s also a linear array of bits, the abstraction is created by the operating system and lives at a higher level.

One level below, the abstraction the operating system itself uses — “physical memory” — is the one I’m talking about, created by a set of electrical devices connected to the CPU with wires (the memory bus).

I’m projecting without any basis, but I presume the reason so few programmers think of memory as an abstraction is because the abstraction is so strong. Nearly all of the time, it “just works” — you write bits and read them out later. The abstraction can leak slightly in programming disciplines that require awareness of low level details like the memory hierarchy and cache coherence (e.g. lockfree programming), but that is a leak of the memory hierarchy abstraction. The core abstraction of physical memory stays intact — for example, programmers never need to be aware of the internal refresh mechanism of DRAM.

Of course, you can go infinitely far with this — it’s turtles all the way down.

Syscall ABI compatibility: Linux vs Windows/macOS

The Linux kernel has an interesting difference compared to the Windows and macOS kernels: it offers syscall ABI compatibility.

This means that applications that program directly against the raw syscall interface are more or less guaranteed to keep working, even with arbitrarily newer kernel versions. “Programming against the raw syscall interface” means including assembly code in your app that triggers syscalls (a minimal example follows this list):

  • setting the appropriate syscall number in the syscall register
  • setting arguments in the defined argument registers
  • executing a syscall instruction
  • reading the syscall return value register
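
For example, on x86_64 Linux the whole dance fits in a few lines of GCC inline assembly. This is a sketch of my own (syscall numbers 1 and 60 are write and exit on x86_64; registers per the table below):

    /* Raw 3-argument syscall: number in RAX, args in RDI/RSI/RDX,
     * result back in RAX. The syscall instruction clobbers RCX/R11. */
    static long raw_syscall3(long nr, long a1, long a2, long a3)
    {
        long ret;
        __asm__ volatile ("syscall"
                          : "=a"(ret)
                          : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                          : "rcx", "r11", "memory");
        return ret;
    }

    int main(void)
    {
        const char msg[] = "hello via raw syscall\n";
        raw_syscall3(1, 1, (long)msg, sizeof msg - 1);  /* write(1, msg, len) */
        raw_syscall3(60, 0, 0, 0);                      /* exit(0)            */
    }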

Here are the ABIs for some common architectures.

Arch      Syscall Number Register   Syscall Arguments               Syscall Return Value
x86       EAX                       EBX, ECX, EDX, ESI, EDI, EBP    EAX
x86_64    RAX                       RDI, RSI, RDX, R10, R8, R9      RAX
Armv7     R7                        R0-R6                           R0
AArch64   X8                        X0-X5                           X0

Manticore is my go-to source to quickly look these up: https://github.com/trailofbits/manticore

Once you’ve done this, you’re relying on the kernel to not change any part of it. If the kernel changes any of these registers, or changes the syscall number mapping, your app will no longer trigger the desired syscall correctly and will break.

Aside from writing raw assembly in your app, there’s a more innocuous way of accidentally “programming directly against the syscall interface”: statically linking to libc. When you statically link to a library, that library’s code is directly included in your binary. libc is generally the system component responsible for implementing the assembly to trigger syscalls, and by statically linking to it, you effectively inline those assembly instructions directly into your application.

So why does Linux offer this and Windows and macOS don’t?

In general, compatibility is cumbersome. As a developer, if you can avoid having to maintain compatibility, it’s better. You have more freedom to change, improve, and refactor in the future. So by default it’s preferable to not maintain compatibility — including for kernel development.

Windows and macOS are able to not offer compatibility because they control the libc for their platforms and the rules for using it. One of those rules is “you are not allowed to statically link libc”, for the exact reason that this would encourage apps that depend directly on the syscall ABI, hindering the kernel developers’ ability to freely change the kernel’s implementation.

If all app developers are forced to dynamically link against libc, then as long as kernel developers also update libc with the corresponding changes to the syscall ABI, everything keeps working. Old apps running on a new kernel dynamically link against the new libc, which properly implements the new ABI. Compatibility is of course still maintained at the app/libc level — just not at the libc/kernel level.

Linux doesn’t control the libc in the same way Windows and macOS do because in the Linux world, there is a distinct separation between kernel and userspace that isn’t present in commercial operating systems. This is rooted in the history of Linux, which was originally designed to target a userspace developed by a separate organization (GNU).

So strictly speaking, Linux is just the kernel, and you’re free to run whatever userspace you like on top. Most people run GNU userspace components (glibc), but alternatives are not unheard of (musl libc, or Bionic libc on Android).

So because Linux kernel developers can’t 100% control the libc that resides on the other end of the syscall interface, they bite the bullet and retain ABI compatibility. This technically allows you to statically link with more confidence than on other OSs. That said, there are other reasons why you shouldn’t statically link libc, even on Linux.


Links:

https://news.ycombinator.com/item?id=21908824
https://www.kernel.org/doc/Documentation/ABI/README

This directory documents the interfaces that the developer has
defined to be stable.  Userspace programs are free to use these
interfaces with no restrictions, and backward compatibility for
them will be guaranteed for at least 2 years.  Most interfaces
(like syscalls) are expected to never change and always be
available.

kernel docs

What:		The kernel syscall interface
Description:
	This interface matches much of the POSIX interface and is based
	on it and other Unix based interfaces.  It will only be added to
	over time, and not have things removed from it.

	Note that this interface is different for every architecture
	that Linux supports.  Please see the architecture-specific
	documentation for details on the syscall numbers that are to be
	mapped to each syscall.

apple developer docs

Q:  I'm trying to link my binary statically, but it's failing to link because it can't find crt0.o. Why?
A: Before discussing this issue, it's important to be clear about terminology:

A static library is a library of code that can be linked into a binary that will, eventually, be dynamically linked to the system libraries and frameworks.
A statically linked binary is one that does not import system libraries and frameworks dynamically, but instead makes direct system calls into the kernel.
Apple fully supports static libraries; if you want to create one, just start with the appropriate Xcode project or target template.

Apple does not support statically linked binaries on Mac OS X. A statically linked binary assumes binary compatibility at the kernel system call interface, and we do not make any guarantees on that front. Rather, we strive to ensure binary compatibility in each dynamically linked system library and framework.

If your project absolutely must create a statically linked binary, you can get the Csu (C startup) module from Darwin and try building crt0.o for yourself. Obviously, we won't support such an endeavor.

stackoverflow

  • Solaris also stopped supporting static linking against libc.