Category Archives: Operating Systems

osdev journal: bootloaders and booting (grub, multiboot, limine, BIOS, EFI)

Here are my rough lab notes from weeks 69-73 of streaming, where I did my “boot tour” and ported JOS to a variety of boot methods.


JOS originally used a custom i386 BIOS bootloader. This is a classic approach for osdev: up to 510 bytes of hand-written 16 bit real mode assembly packed into the first sector of a disk (the last two bytes of the 512 byte sector are the 0x55AA boot signature).

I wanted to move away from this though; I had the sense that using a real third party bootloader was the more professional way to go about this.

Grub

First I ported to Grub, which is a widely used, standard bootloader on Linux systems.

This requires integrating the OS with the Multiboot standard. Grub is actually designed as the reference implementation of a generic boot protocol called Multiboot. The goal is to let any bootloader and any OS that implement the protocol transparently interoperate, as opposed to the OS-specific bootloaders that were common at the time of its development.

(Turns out Multiboot never really took off. Linux and the BSDs already had bootloaders and boot protocols and never moved to Multiboot; Grub supports them by implementing their specific boot protocols in addition to Multiboot. I’m not sure any mainstream OS natively uses Multiboot. Probably mostly hobby OS projects.)

This integration looks like:

  • Adding a Multiboot header
  • Optionally making use of an info struct pointer in EBX

The Multiboot header is interesting. Multiboot was designed to be binary format agnostic. While there is native ELF support, OSes need to advertise that they are Multiboot compatible by including magic bytes near the start of their binary (within the first 8 KiB of the image for Multiboot v1), along with some other metadata (e.g. about architecture). The Multiboot conforming bootloader will scan for this header. Exotic binary formats can add basic metadata about what load address they need and have a basic form of loading done for them (probably just memcpying the entire OS binary into place; the OS might need to load itself further from there if it has non-contiguous segments).
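
Here’s a minimal sketch of a Multiboot v1 header emitted from C (assuming GCC/Clang; the .multiboot section name is my own convention, and the linker script must place that section within the first 8 KiB of the image):

    #include <stdint.h>

    #define MB1_MAGIC 0x1BADB002u
    #define MB1_FLAGS 0x00000003u  /* bit 0: page-align modules, bit 1: want memory info */

    struct multiboot_header {
        uint32_t magic;
        uint32_t flags;
        uint32_t checksum;  /* magic + flags + checksum must wrap to zero */
    };

    /* The bootloader scans the first 8 KiB of the image for the magic
       value, so the linker script must keep this section near the start. */
    __attribute__((section(".multiboot"), used, aligned(4)))
    static const struct multiboot_header mb_header = {
        .magic    = MB1_MAGIC,
        .flags    = MB1_FLAGS,
        .checksum = -(MB1_MAGIC + MB1_FLAGS),
    };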

Then, for Multiboot v1, the OS receives a pointer to an info struct in EBX (plus a magic value in EAX). The struct contains useful information provided by the bootloader (command line args, memory maps, etc.), which is the second major reason to use a third party bootloader.

There are two versions of the Multiboot standard. V1 is largely considered obsolete and deprecated because passing a fixed struct wasn’t extensible in a backward compatible way. An OS coded against a newer (larger) version of the struct could crash if loaded by an older bootloader that only provided the smaller struct, because it might dereference offsets beyond the end of what it was given.

So the Multiboot V2 standard was developed to fix this. Instead of passing a struct, it uses a TLV format: the OS receives a list of tagged values and interprets only those whose tags it recognizes.
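
A sketch of how an OS might walk the Multiboot2 tag list handed to it in EBX (the tag type numbers are from the Multiboot2 spec; walk_tags is my own name):

    #include <stdint.h>

    struct mb2_tag {
        uint32_t type;
        uint32_t size;  /* includes this header, excludes alignment padding */
    };

    void walk_tags(void *mb2_info) {
        uint32_t total_size = *(uint32_t *)mb2_info;
        uint8_t *p   = (uint8_t *)mb2_info + 8;  /* skip total_size + reserved */
        uint8_t *end = (uint8_t *)mb2_info + total_size;

        while (p < end) {
            struct mb2_tag *tag = (struct mb2_tag *)p;
            if (tag->type == 0)
                break;              /* type 0 is the end tag */
            switch (tag->type) {
            case 1: /* boot command line */ break;
            case 6: /* memory map */        break;
            default:                        break;  /* unknown tag: just skip it */
            }
            p += (tag->size + 7) & ~7u;  /* tags are padded to 8 byte alignment */
        }
    }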

The build process with Grub is also a bit nicer than with a custom bootloader. Instead of creating a “disk image” by concatenating a 512 byte assembly blob and my kernel, with Grub you can use an actual filesystem.

You simply create a directory with a specific structure (your kernel binary plus a boot/grub/grub.cfg inside it), then use grub-mkrescue to convert that into an .iso file with a CD-ROM filesystem format (ISO 9660). Internally it uses xorriso. You can then pass the .iso to QEMU with -cdrom instead of -drive as I was doing previously.

Limine

Limine is a newer, modern bootloader aimed at hobby OS developers. I tried it out because it’s very popular, and I now think that popularity is well deserved. In addition to implementing essentially every boot protocol, it includes its own native boot protocol with advanced features like automatic SMP bringup, which is otherwise fairly involved.

It uses a build process similar to grub-mkrescue: create a special directory structure and run xorriso to produce an iso.

I integrated against Limine, but kept my OS on Multiboot2, since Limine’s native protocol only supports 64 bit.

BIOS vs UEFI

Everything I’ve mentioned so far has been in the context of legacy BIOS booting.

Even though I ported away from a custom bootloader to these fancy third party ones, I’m still using them in BIOS mode. I don’t know exactly what’s in these .iso files, but that means they must populate the first sector of the media with their own version of the 16 bit real mode assembly, and bootstrap from there.

But BIOS is basically obsolete; the modern way to boot a PC is UEFI.

The nice thing about integrating against a mature third party bootloader is that it abstracts the low level boot interface for you. All you need to do is target Grub or Limine, and then you can (nearly) seamlessly boot from either BIOS or UEFI.

It was fairly easy to get this working with Limine, because Limine provides prebuilt UEFI binaries (BOOTIA32.EFI) and has good documentation.

The one tricky thing is that QEMU doesn’t come with UEFI firmware by default, unlike with BIOS (where SeaBIOS is included). So you need to get a copy of OVMF to pass to QEMU to do UEFI boot. (Conveniently, there are pre-built OVMF binaries available from the Limine author.)

I failed to get UEFI booting with Grub working on my macOS based dev setup, because I couldn’t easily find a prebuilt Grub BOOTIA32.EFI. There is a package on apt, but I didn’t have a Linux machine handy to check whether I could extract the file from it.

Even though UEFI is the more modern standard, I’m proceeding with just using BIOS simply to avoid dealing with the whole OVMF thing.

Comparison table

Custom

Pros:
  • No external dependency

Cons:
  • More finicky custom code to support; more surface area for bugs
  • Doable to get the basics working, but nontrivial effort required to reimplement the more advanced features of Grub/Limine (a boot protocol, cli args, memory map, etc.)
  • No UEFI support

Grub

Pros:
  • Well tested, industrial strength
  • Available prebuilt from Homebrew
  • Simple build process: a single i386-grub-mkrescue invocation creates the iso

Cons:
  • Difficult to get working in UEFI mode on Mac (difficult to find a prebuilt BOOTIA32.EFI)

Limine

Pros:
  • Good documentation
  • Easy to get working for both BIOS and UEFI
  • Supports Multiboot/Multiboot2; a near drop-in replacement for Grub
  • Can opt into its native boot protocol with advanced features (SMP bringup)

Cons:
  • Not used industrially; mostly for hobby osdev
  • Not packaged in Homebrew; requires building its host tool from source (but this is trivial)

osdev journal: Gotchas with cc-runtime/libgcc

libclang_rt and libgcc are implementations of the compiler runtime support library, which the compiler occasionally emits calls into instead of inlining the codegen directly. Usually this is for software implementations of math operations the hardware can’t do natively (e.g. 64 bit division/modulo on a 32 bit target). Generally you’ll need a runtime support library for all but the most trivial projects.
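
For example, compiling this for i386 emits a call into the runtime library instead of inline division:

    #include <stdint.h>

    /* i386 has no instruction for 64 bit division, so the compiler calls
       the runtime helper __udivdi3 here. Without libgcc/libclang_rt (or
       cc-runtime), the link fails with an undefined symbol. */
    uint64_t div64(uint64_t a, uint64_t b) {
        return a / b;
    }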

cc-runtime is a utility library for hobby OS developers: a standalone version of libclang_rt that can be vendored into an OS build.

The main advantage for me is that it lets me use a prebuilt clang from Homebrew. The problem with prebuilt clang from Homebrew is that it doesn’t come with a libclang_rt compiled for i386 (which makes sense; why would it, when I’m on an ARM64 Mac).

(This is unlike the prebuilt i386-elf-gcc in Homebrew, which does come with a prebuilt libgcc for i386).

Since it doesn’t come with libclang_rt for i386, my options are:

  • Keep using libgcc from i386-elf-gcc in Homebrew: undesirable. The goal is to depend on only one toolchain, and here I’d depend on both clang and gcc.
  • Build clang and libclang_rt from source: undesirable. It’s convenient to avoid building the toolchain from source if possible.
  • Vendor in a libgcc.a binary from https://codeberg.org/osdev/libgcc-binaries: undesirable. Vendoring binaries should be a last resort.
  • Use cc-runtime: best. No vendored binaries, no gcc dependency, no building the toolchain from source.

However, cc-runtime does have a gotcha. If you’re not careful, you’ll balloon your binary size.

This is because the packed branch of cc-runtime (which is the default and easiest to integrate) packs all the libclang_rt source files into a single C file, which produces a single .o file. So the final .a library has a single object file in it.

This is in contrast to libgcc.a (or a typical compilation of libclang_rt), where the .a library probably contains multiple .o files, one for each .c file.

By default, linkers optimize and only pull in the .o files from a .a library that are actually needed. But since cc-runtime is a single .o file, the whole thing gets included! This means the binary will potentially include many libclang_rt functions that are unused.

In my case, the size of one of my binaries went from 36k (libgcc) to 56k (cc-runtime, naive).

To work around this, you can either use the trunk branch of cc-runtime (which doesn’t pack them all into one .c file); this is ~30 .c files and slightly more annoying to integrate into the build system.

Or, you can use some compiler/linker flags to make the linker optimization more granular and work at the function level, instead of the object file level.

Those are:

  • Compiler flags: -ffunction-sections -fdata-sections
  • Linker flag: --gc-sections

With this, my binary size was reduced to 47k. So there is still a nontrivial size increase, but the situation is slightly improved.

Ultimately, my preferred solution is the first: to use the trunk branch. The build integration is really not that bad, and the advantage is you don’t need to remember to use the special linker flag, which you’d otherwise need to ensure is in any link command for any binary that links against cc-runtime.

That said, those compiler/linker flags are probably a good idea to use anyway, so the best solution might be to do both.

x86 kernel development lab notes

Here’s what I know about x86 kernel development. The usual caveat applies for my lab notes: this is not considered a high quality document and there may be inaccuracies.


Main Processor Modes

  • Real Mode (16 bit)
    • CPU boots into this mode for backward compatibility
    • Interrupts are dispatched via the IVT (Interrupt Vector Table) here instead of the IDT
    • Legacy BIOS booting begins here: the BIOS loads the first sector of the disk into memory at a fixed address (0x7C00) and begins executing it in Real Mode.
  • Protected Mode (32 bit)
    • Segmentation is mandatory; a minimal GDT is necessary.
    • Paging is not mandatory
  • Long Mode (64 bit)
    • Paging is mandatory

Segmentation

  • Originally solved the problem of CPUs having more physical memory than could be addressed with 16 bit registers (the 8086 had 20 address lines but 16 bit registers). Note this is the opposite of today’s situation, where virtual address spaces are vastly larger than physical ones.
  • Introduces the concept of “segments”, which are variable length “windows” into a larger address space. A physical address is computed from a segment register plus an offset, as in the sketch below.
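
For example, the original real mode scheme computes a 20 bit physical address from two 16 bit values (the function name is mine):

    #include <stdint.h>

    /* Real mode: the segment is shifted left 4 bits and added to the offset. */
    uint32_t real_mode_phys(uint16_t seg, uint16_t off) {
        return ((uint32_t)seg << 4) + (uint32_t)off;
    }

    /* real_mode_phys(0xFFFF, 0xFFFF) == 0x10FFEF, slightly past 1MB, which
       is exactly the wraparound the A20 gate preserves for 8086 compatibility. */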

Data structures

GDT (Global Descriptor Table)

  • Contains “descriptors” that describe the memory segments (“windows”) available. A segment register effectively contains an index into this table.
  • There are “normal” descriptors which describe memory segments and “system” descriptors which point to more exotic things, like Task State Segments (TSS) or Local Descriptor Tables (LDT)
  • These days OSes use the GDT as little as possible, only as much as strictly necessary. On 32 bit, this looks like a mandatory null descriptor plus 4 entries that start at base 0x0 and cover the entire 32 bit space: 2 for kernel, 2 for user (1 code and 1 data each), as sketched below. (Plus a TSS descriptor; see the TSS section.)
  • On 64 bit, segmentation is essentially vestigial: a minimal GDT still exists, but bases and limits are ignored for nearly all segment registers, except FS and GS. They are special because their bases still work and are typically used for thread local storage and per-CPU data, which is why there are dedicated MSRs for them (IA32_FS_BASE, IA32_GS_BASE, and IA32_KERNEL_GS_BASE, the latter paired with the swapgs instruction).
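
A sketch of that minimal flat 32 bit GDT (the descriptor field layout and the standard flat-model access/granularity byte values are per the Intel manuals; the TSS descriptor is omitted):

    #include <stdint.h>

    /* One 8 byte GDT descriptor (32 bit layout). */
    struct gdt_entry {
        uint16_t limit_low;
        uint16_t base_low;
        uint8_t  base_mid;
        uint8_t  access;      /* present, DPL, type */
        uint8_t  gran_limit;  /* flags (G, D/B) + limit bits 19:16 */
        uint8_t  base_high;
    } __attribute__((packed));

    /* Null descriptor, then four flat 4GB segments: kernel/user code+data. */
    static struct gdt_entry gdt[] = {
        {0,      0, 0, 0x00, 0x00, 0},  /* mandatory null descriptor */
        {0xFFFF, 0, 0, 0x9A, 0xCF, 0},  /* kernel code: base 0, limit 4GB */
        {0xFFFF, 0, 0, 0x92, 0xCF, 0},  /* kernel data */
        {0xFFFF, 0, 0, 0xFA, 0xCF, 0},  /* user code (DPL 3) */
        {0xFFFF, 0, 0, 0xF2, 0xCF, 0},  /* user data (DPL 3) */
    };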

LDT (Local Descriptor Table)

  • My understanding is that LDTs are no longer used by almost any OS. Some parts of segmentation, like the GDT, are still required, but the LDT is optional and almost completely unused in modern OSes.
  • These would contain segments only accessible to a single task, unlike the regions in the GDT (?)

IDT (Interrupt Descriptor Table)

  • Interrupts: generally externally triggered, i.e. from hardware devices
  • Exceptions: internally generated, e.g. a division by zero exception or a software breakpoint
  • When the processor receives an interrupt or exception, it handles it by executing code: interrupt handler routines.
  • These routines are registered via the IDT, an array of descriptors that describe how to handle a particular interrupt.
  • Interrupts/exceptions have numbers which directly map to entries in the IDT.
  • IDT descriptors are a polymorphic structure; there are several kinds of entries: interrupt, trap, and task gates (maybe others, like call gates?).
  • Interrupt/trap gates are nearly identical and differ only in their handling of the interrupt flag (interrupt gates clear IF on entry; trap gates don’t). They contain a pointer to the code to execute, expressed as a segment selector plus offset (see the sketch after this list).
  • Task gates make use of HW task switching and offer a more “turnkey” solution for running code in a separate context when an interrupt happens, but generally aren’t used for other reasons (?). Context switch is automatic?
  • Task gates in the IDT point to a TSS descriptor in the GDT, which points to a TSS (?)
  • Some interrupt/exceptions have an associated error code, some don’t.
  • Interrupt gates specify a minimum privilege level required to trigger them with an int instruction; this prevents userspace from triggering arbitrary interrupts with int. If userspace tries to trigger an int without permission, it is converted into a General Protection fault (#GP).
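
A sketch of a 32 bit IDT entry and a helper to fill one in (the struct and function names are mine; the type_attr encodings are from the Intel manuals):

    #include <stdint.h>

    /* One 8 byte IDT gate descriptor (32 bit). */
    struct idt_entry {
        uint16_t offset_low;   /* handler address, bits 15:0 */
        uint16_t selector;     /* code segment selector in the GDT */
        uint8_t  zero;
        uint8_t  type_attr;    /* present, DPL, gate type */
        uint16_t offset_high;  /* handler address, bits 31:16 */
    } __attribute__((packed));

    /* 0x8E = present, DPL 0, 32 bit interrupt gate.
       0xEE = same, but DPL 3, so userspace may trigger it with `int`. */
    void set_gate(struct idt_entry *e, void (*handler)(void),
                  uint16_t selector, uint8_t type_attr) {
        uint32_t addr  = (uint32_t)handler;  /* assumes a 32 bit target */
        e->offset_low  = addr & 0xFFFF;
        e->selector    = selector;
        e->zero        = 0;
        e->type_attr   = type_attr;
        e->offset_high = addr >> 16;
    }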

Hardware task switching

  • Although long considered obsolete in favor of software task/context switching, x86 provides significant facilities for modeling OS “tasks” at the hardware level, including functionality for automatic context switching between them.
  • Hardware task switching may require copying much more machine state than is necessary. Software context switches can be optimized to copy less and be faster, which is one reason why they’re preferred.
  • Hardware task switching puts a fixed limit on the number of tasks in the system (?)

TSS (Task State Segment)

  • This is a large structure modeling a “task” (thread of execution)
  • Contains space for registers & execution context
  • Even if HW task switching is not used, one TSS is still needed; it serves as the single HW task running on the system, within which all software context switching is implemented
  • The TSS is minimally used for stack switches when handling interrupts across privilege levels: when an interrupt switches from userspace to the kernel, the kernel stack is taken from the TSS (see the sketch after this list)
  • Linux’s task_struct is probably named with “task” due to Linux originally being created for the i386
  • The Task Register (TR) contains a descriptor pointing to the current active HW task (?)
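
A sketch of the first few TSS fields, just to show what the interrupt stack switch reads (field names are mine; layout per the Intel manuals):

    #include <stdint.h>

    /* Even with software context switching, the CPU reads ss0:esp0 from
       the current TSS whenever an interrupt raises the privilege level
       from ring 3 to ring 0. The kernel updates esp0 on every context
       switch so each thread traps onto its own kernel stack. */
    struct tss {
        uint32_t prev_task_link;
        uint32_t esp0;  /* kernel stack pointer for ring 3 -> ring 0 */
        uint32_t ss0;   /* kernel stack segment */
        /* ... the remaining register-save fields are omitted here ... */
    };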

The JOS boot process

JOS is the OS used in MIT 6.828 (2018).

Bootloader

  • JOS includes a small BIOS bootloader in addition to the kernel
  • The bootloader begins with the typical 16 bit Real Mode assembly to initialize the CPU (enable the A20 line, etc.)
  • Transition to protected mode
  • Set the stack top at the start of the code (the stack grows down, away from it), and transition to C
  • The kernel is loaded from disk using port IO
  • Loaded into physical memory around the 1MB mark, which is generally considered a “safe” area to load into. (Below the 1MB mark there are various regions where devices, BIOS data, or other “things” reside, and it’s best not to clobber them.)
  • Call into the kernel entrypoint

Early kernel boot

  • Receive control from the bootloader in protected mode
  • Transition to paging
  • The kernel is linked to run in high memory, starting at 0xF0000000 (KERNBASE)
  • The transition from segmentation to paging virtual memory happens in a few steps. First there’s an initial basic transition using a set of minimal page tables.
  • After that transition is made, a basic memory allocator is set up, which is then used to allocate memory for the production page tables implementing the virtual memory layout used for the rest of runtime.
  • The minimal page tables contain two mappings:
    • 1 – Identity map the first 4MB to itself
    • 2 – Map the 4MB region starting at KERNBASE also to the first 4MB
  • One page directory entry maps a 4MB region, so only two page directory entries are needed
  • These page tables are constructed statically at compile time (see the sketch after this list)
  • The identity mapping is critical because without it the kernel would crash immediately after loading CR3: the next instruction fetch would hit an unmapped address. The identity mapping keeps the low mem addresses the kernel is currently executing at valid until it can jump to high mem
  • The assembly there looks a bit strange because the jump appears redundant. But all the asm labels are linked at high mem addresses, so jumping to one transitions from executing in low mem to executing in high mem, where the kernel remains for the rest of its lifetime.
  • Set the stack pointer to internal storage in the kernel’s data/BSS section and enter C code
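
Condensed from what JOS’s kern/entrypgdir.c does (entry_pgtable, a statically initialized table mapping physical [0, 4MB), is elided here):

    #include <stdint.h>

    #define KERNBASE  0xF0000000u
    #define PTE_P     0x001u  /* present */
    #define PTE_W     0x002u  /* writable */
    #define PDXSHIFT  22      /* one page directory entry covers 4MB */

    extern uint32_t entry_pgtable[1024];

    /* Both entries point at the same page table, so VA [0, 4MB) and
       VA [KERNBASE, KERNBASE + 4MB) both translate to PA [0, 4MB).
       The symbol's link address is in high mem, so KERNBASE is
       subtracted to get its physical (load) address. */
    __attribute__((aligned(4096)))
    uint32_t entry_pgdir[1024] = {
        [0]                    = ((uintptr_t)entry_pgtable - KERNBASE) + PTE_P,
        [KERNBASE >> PDXSHIFT] = ((uintptr_t)entry_pgtable - KERNBASE) + PTE_P + PTE_W,
    };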

Memory allocators

  • The goal is transition to a production virtual memory setup
  • This requires allocating memory for page tables
  • To build the dynamic page/frame allocator, we start with a basic bump allocator
  • It starts allocating simply from the end of the kernel in memory. We have access to a symbol for the end of the kernel via the linker script.
  • The kernel queries the physical memory size of the system and dynamically allocates the data structures for the page/frame allocator: an array of structs, one per available frame of physical memory. Each struct has an embedded (intrusive) linked list pointer and a refcount. Free frames are linked together into a list, implementing a stack where frames can be popped (when allocating) and pushed (when freeing); see the sketch after this list.
  • Using this frame allocator, pages for the production page tables are allocated.
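
A minimal sketch of that intrusive free-list (names modeled loosely on JOS’s kern/pmap.c; refcount bookkeeping omitted):

    #include <stddef.h>
    #include <stdint.h>

    struct page_info {
        struct page_info *next_free;  /* intrusive linked list pointer */
        uint16_t refcount;
    };

    static struct page_info *pages;      /* one entry per physical frame */
    static struct page_info *free_list;  /* head of the stack of free frames */

    /* Pop a free frame; returns NULL when physical memory is exhausted. */
    struct page_info *page_alloc(void) {
        struct page_info *p = free_list;
        if (p) {
            free_list    = p->next_free;
            p->next_free = NULL;
        }
        return p;
    }

    /* Push a frame back once its refcount has dropped to zero. */
    void page_free(struct page_info *p) {
        p->next_free = free_list;
        free_list    = p;
    }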

Resources for learning systems programming

A note of caution: be careful of spending too much time doing the easy work of looking for “resources”, rather than the hard work of consistent practice. Have fun 🙂

Top Recommendations


Textbooks

Websites

University Course Materials

YouTube

Linux/FreeBSD

Linux/POSIX Userspace

  • The Linux Programming Interface: A Linux and UNIX System Programming Handbook

Blogs

Concurrency & Parallelism