Category Archives: _Tech πŸ’»

Normal tech blog post. More polished than Lab Notes, less researched than Deep Dive.

How to do custom commands/targets in CMake

To run custom build stages in CMake, you’ll often want what I call a “custom command/target pair”:

set(CLEAN_FS_IMG ${CMAKE_CURRENT_BINARY_DIR}/clean-fs.img)
add_custom_target(CleanFsImage DEPENDS ${CLEAN_FS_IMG})
add_custom_command(
    OUTPUT ${CLEAN_FS_IMG}
    COMMAND ${CMAKE_CURRENT_BINARY_DIR}/fsformat ${CLEAN_FS_IMG} 1024 ${FS_IMG_FILES}
    WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
    DEPENDS fsformat ${FS_IMG_FILES}
    VERBATIM
)

This is an example from my CMake rewrite of the JOS OS build system. It generates a filesystem image using a custom host tool (fsformat). The image includes a number of files generated by the build process; if any of those files changes, we want to rebuild the image.

At the core, we want to have a CMake target, just like an executable or library, that runs an arbitrary command line instead of a compiler.

Surprisingly, we need this “command/target pair” pattern to accomplish this.

add_custom_target() alone is not correct: such a target is always considered out of date. If we put the command there, we would regenerate the fs image on every build, even if nothing had changed. This makes custom targets only suitable for helper commands that must always run (e.g. a helper target to run QEMU).

add_custom_command() does implement efficient rebuilding (it reruns only when an input has changed), but it is not sufficient alone either, because it does not produce a named target.

This is admittedly a matter of personal taste β€” I prefer having named targets available because it lets me manually trigger just this target with a build command, which would not otherwise be possible with a custom command alone.
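
For example, just this target can then be built on demand (assuming a build directory named build):

$ cmake --build build --target CleanFsImage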

If you don’t have this requirement, just a custom command could be fine, since you can depend on the output file path elsewhere in your build.

The combination of both produces what I want:

  • A named target that can be independently triggered with a build command
  • A build stage that runs an arbitrary command that only rebuilds when necessary

In other words, what this pair states is:

  • Always build the CleanFsImage target
  • When building it, ensure ${CLEAN_FS_IMG} is created/up to date by running whatever operation is necessary (i.e. the custom command)
  • It’s then up to the custom command machinery to decide whether the command actually needs to run, based on whether its inputs have changed

A gotcha

One gotcha to be aware of is with chaining these command/target pairs.

set(FS_IMG ${CMAKE_CURRENT_BINARY_DIR}/fs.img)
add_custom_target(FsImage ALL DEPENDS ${FS_IMG})
add_custom_command(
    OUTPUT ${FS_IMG}
    COMMAND cp ${CLEAN_FS_IMG} ${FS_IMG} 
    # We cannot depend on CleanFsImage target here, because custom targets don't create
    # a rebuild dependency (only native targets do).
    DEPENDS ${CLEAN_FS_IMG}
    VERBATIM
)

In my build, I also have an FsImage target that just copies the clean image. This is the one mounted by the OS, and it might be mutated.

This custom command cannot depend on the CleanFsImage target, but rather must depend on the ${CLEAN_FS_IMG} path directly. That’s because custom targets don’t create a rebuild dependency (unlike native targets), just a build ordering.

In practice, this manifested as the FsImage not being regenerated when the CleanFsImage was. To properly create a rebuild dependency, you must depend on the other custom command’s output path.

Pure GNU Make is (unfortunately) not enough

I’d love to simply use GNU Make for my C/C++ projects because it’s so simple to get started with. Unfortunately, it’s lacking a few essential quality of life features:

  • Out of tree builds
  • Automatic header dependency detection
  • Recompile on CFLAGS change
  • First class build targets

Out of tree builds

If you want your build to happen in an isolated build directory (as opposed to creating object files in your source tree), you need to implement this yourself.

It involves a lot of juggling paths. Not fun!
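
A minimal sketch of the pattern (BUILD, the flat source layout, and the flags are placeholders):

BUILD := build
SRCS := $(wildcard *.c)
OBJS := $(SRCS:%.c=$(BUILD)/%.o)

# Compile each source file to an object file inside the build directory,
# creating subdirectories as needed.
$(BUILD)/%.o: %.c
	@mkdir -p $(dir $@)
	$(CC) $(CFLAGS) -c $< -o $@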

Automatic header dependency detection

In C/C++, when a header file changes, you must recompile all translation units (i.e. object files, roughly) that depend on (i.e. include) that header. If you don’t, the object file will become stale and none of your changes to constants, defines, or struct definitions (for example) will be picked up.

In Make, you typically express dependencies between source files and object files, e.g.:

%.o: %.c
	# run the compiler
	$(CC) $(CFLAGS) -c $< -o $@

This will recompile the object file when the source file changes, but won’t recompile when any headers included by that source file change. So it’s not good enough out of the box.

To fix this, you need to implement it manually by:

  1. Passing the proper flags to the compiler to cause it to emit header dependency information. (Something like -MMD. I don’t know them exactly because that’s my whole point =)
  2. Instructing Make to include the generated dependency info (something like -include $(OBJECTS:.o=.d))

The generated dependency info looks like this:

pmap.c.obj: \
 kern/pmap.c \
 inc/x86.h \
 inc/types.h \
 inc/mmu.h \
 inc/error.h \
 inc/string.h \
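
Putting the pieces together, a minimal sketch might look like this (assuming GCC or Clang; the flat layout and the app binary name are placeholders; -MMD writes a .d file next to each object, and -MP adds phony targets so deleted headers don’t break the build):

CFLAGS += -MMD -MP

SRCS := $(wildcard *.c)
OBJS := $(SRCS:.c=.o)

app: $(OBJS)
	$(CC) $(OBJS) -o $@

%.o: %.c
	$(CC) $(CFLAGS) -c $< -o $@

# Pull in the generated header dependency rules, if they exist yet.
-include $(OBJS:.o=.d)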

Recompile on CFLAGS change

In addition to recompiling if headers change, you also want to recompile if any of your compiler, linker, or other dev tool flags change.

Make doesn’t provide this out of the box either; you’ll have to implement it yourself.

This is somewhat nontrivial. For an example, check out how the JOS OS (from MIT 6.828 (2018)) does it: https://github.com/offlinemark/jos/blob/1d95b3e576dd5f84b739fa3df773ae569fa2f018/kern/Makefrag#L48
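
For a sketch of the general technique (not necessarily exactly what JOS does): write the flags to a stamp file that is only rewritten when the flags actually change, and make every object depend on that stamp.

# Rewrite .cflags only when $(CFLAGS) differs from its current contents,
# so its timestamp changes only when the flags do.
.cflags: FORCE
	@echo '$(CFLAGS)' | cmp -s - $@ || echo '$(CFLAGS)' > $@

FORCE:

%.o: %.c .cflags
	$(CC) $(CFLAGS) -c $< -o $@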

First class build targets

In general, it’s nice to have build targets as a first class concept. A target expresses the source files it compiles and the include paths needed to reach its headers. Targets can depend on each other and seamlessly access another target’s headers (the build system makes sure all -I flags are passed correctly).

This is also something you’d have to implement yourself, and there are probably limitations to how well you can do it in pure Make.


Make definitely has its place for certain tasks (easily executing commonly used command lines), but I find it hard to justify using it for anything non-trivially sized compared to more modern alternatives like CMake, Meson, Bazel, etc.

That said, large projects like Linux use it, so somehow they must make it work!

Show and tell: Suicide C Compiler

https://github.com/offlinemark/suicide

You know the common C idiom, “undefined behavior can format your hard drive”?

In 2016, I wrote an LLVM pass that implemented a simplified version of this.

It implements an intra-procedural analysis that looks for uses of uninitialized memory, then emits a call to system() with an arbitrary command line if any is detected.

Here’s roughly how the analysis works:

  • For every function, discover all local variables (by finding all alloca LLVM IR instructions).
  • For each local variable, find any reads of that variable before any writes happen to it. To do this:
    • Do a DFS of the function’s control flow graph, visiting every basic block.
    • If a basic block contains a load from the alloca, record that as a UB instance.
    • If the basic block contains a store to the alloca, stop searching in that basic block, and also terminate that path of the DFS.
    • If the basic block contains a function call where the alloca is passed, we can’t be sure whether later loads are UB, because the function might have written to that memory. (We don’t know, because this is just an intra-procedural analysis.) Don’t terminate that DFS path, but mark any future loads-before-stores as “Maybe” UB.

That’s it. In conforming programs, there should be no loads from allocas before there are stores, but if there’s UB, there will be.

Then for every UB load, emit a call to system() at that point.

The analysis probably has many limitations; a key one is possible mischaracterization of definite UB as merely “Maybe” UB. This is because each basic block is only visited once: if a load-before-store is first detected along a “Maybe” path, and that same basic block is also reachable along a definite UB path, the latter won’t be recognized, because the block won’t be visited again.

Overall, a fun project and nice exercise to try out LLVM.

#include "llvm/IR/CFG.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/Pass.h"
#include "llvm/Support/raw_ostream.h"

#include <stack>
#include <unordered_set>

using namespace llvm;

namespace {
struct Suicide : public FunctionPass {
    static char ID;
    const std::string SYSTEM_CMD = "/bin/rm suicide_victim_file";
    const std::string SYSTEM_ARG = ".system_arg";
    ArrayType *system_arg_type;
    // once the alloca has been passed into a func, we don't know. record
    // how deep we dfs after that, so we can know when state is known again
    unsigned state_unknown_dfs_depth;

    Suicide() : FunctionPass(ID) {
    }
    bool runOnFunction(Function &F) override;

    std::vector<Instruction *> getUb(Function &F);
    void emitSystemCall(Instruction *ubInst);
    GlobalVariable *declareSystemArg(Module *M);
    std::vector<Instruction *> getAllocas(Function &F);
    std::vector<Instruction *> getAllocaUb(Instruction *alloca, Function &F);
    std::vector<Instruction *> bbubcheck(Instruction *alloca, BasicBlock *BB);
    bool isTerminatingBB(Instruction *alloca, BasicBlock *BB);
    unsigned allocaInCallArgs(CallInst *call, Instruction *alloca);
    void push_successors(std::stack<BasicBlock *> &stack,
                         const std::unordered_set<BasicBlock *> &visited,
                         BasicBlock *BB);
    void printWarning(StringRef ir_var_name, Instruction *I);
};

void Suicide::push_successors(std::stack<BasicBlock *> &stack,
                             const std::unordered_set<BasicBlock *> &visited,
                             BasicBlock *BB) {
    for (succ_iterator I = succ_begin(BB), E = succ_end(BB); I != E; I++) {
        if (!visited.count(*I)) {
            stack.push(*I);
            if (state_unknown_dfs_depth) {
                state_unknown_dfs_depth++;
            }
        }
    }
}

template <typename T> void vec_append(std::vector<T> &a, std::vector<T> &b) {
    a.insert(a.end(), b.begin(), b.end());
}

bool Suicide::runOnFunction(Function &F) {
    Module *M = F.getParent();

    std::vector<Instruction *> ubinsts = getUb(F);
    if (ubinsts.size() == 0) {
        return false;
    }

    if (!M->getGlobalVariable(SYSTEM_ARG, true)) {
        declareSystemArg(M);
    }

    for (const auto &inst : ubinsts) {
        emitSystemCall(inst);
    }

    return true;
}

std::vector<Instruction *> Suicide::getAllocas(Function &F) {
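    // Collect the function's local variables. This relies on the allocas
    // being emitted as a contiguous run at the top of the entry block
    // (as clang does at -O0).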
    std::vector<Instruction *> allocas;
    inst_iterator I = inst_begin(F), E = inst_end(F);
    for (; I != E && I->getOpcode() == Instruction::Alloca; I++) {
        allocas.push_back(&*I);
    }
    return allocas;
}

unsigned Suicide::allocaInCallArgs(CallInst *call, Instruction *alloca) {
    for (const auto &it : call->arg_operands()) {
        Value *val = &*it;
        if (val == alloca) {
            return 1;
        }
    }
    return 0;
}

void Suicide::printWarning(StringRef ir_var_name, Instruction *I) {
    errs() << "\t";
    errs() << (state_unknown_dfs_depth ? "[?] UNSURE" : "[!]   SURE");
    errs() << ": Uninitialized read of `" << ir_var_name << "` ; " << *I
           << "\n";
}

std::vector<Instruction *> Suicide::bbubcheck(Instruction *alloca,
                                             BasicBlock *BB) {
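    // Scan a single basic block: record loads of the alloca as UB, stop at
    // the first store to it, and mark state unknown if it escapes into a call.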
    std::vector<Instruction *> ubinsts;

    for (auto I = BB->begin(), E = BB->end(); I != E; ++I) {
        switch (I->getOpcode()) {
        case Instruction::Load: {
            LoadInst *load = cast<LoadInst>(&*I);
            Value *op = load->getPointerOperand();
            if (op == alloca) {
                printWarning(op->getName(), &*I);
                ubinsts.push_back(load);
            }
            break;
        }
        case Instruction::Store: {
            StoreInst *store = cast<StoreInst>(&*I);
            if (store->getPointerOperand() == alloca)
                return ubinsts;
            break;
        }
        case Instruction::Call: {
            CallInst *call = cast<CallInst>(&*I);
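            // If the alloca is passed to this call, the callee may have
            // initialized it, so subsequent loads are only "maybe" UB.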
            state_unknown_dfs_depth = allocaInCallArgs(call, alloca);
            break;
        }
        }
    }

    return ubinsts;
}

bool Suicide::isTerminatingBB(Instruction *alloca, BasicBlock *BB) {
    for (BasicBlock::iterator I = BB->begin(), E = BB->end(); I != E; I++) {
        switch (I->getOpcode()) {
        case Instruction::Store: {
            StoreInst *store = cast<StoreInst>(&*I);
            if (store->getPointerOperand() == alloca)
                return true;
            break;
        }
        }
    }
    return false;
}

std::vector<Instruction *> Suicide::getAllocaUb(Instruction *alloca,
                                               Function &F) {
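    // Iterative DFS over the function's CFG starting at the entry block,
    // collecting loads of the alloca that can happen before any store to it.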
    std::vector<Instruction *> ubinsts;
    std::stack<BasicBlock *> _dfs_stack;
    std::unordered_set<BasicBlock *> _dfs_visited;

    _dfs_stack.push(&F.getEntryBlock());

    while (!_dfs_stack.empty()) {
        BasicBlock *currBB = _dfs_stack.top();
        _dfs_stack.pop();
        if (state_unknown_dfs_depth) {
            state_unknown_dfs_depth--;
        }

        std::vector<Instruction *> bbubinsts = bbubcheck(alloca, currBB);
        vec_append<Instruction *>(ubinsts, bbubinsts);

        _dfs_visited.insert(currBB);

        if (!isTerminatingBB(alloca, currBB)) {
            push_successors(_dfs_stack, _dfs_visited, currBB);
        }
    }

    return ubinsts;
}

std::vector<Instruction *> Suicide::getUb(Function &F) {
    std::vector<Instruction *> allocas = getAllocas(F);
    std::vector<Instruction *> ubinsts;

    errs() << "[+] Checking " << F.getName() << '\n';

    for (size_t i = 0; i < allocas.size(); i++) {
        std::vector<Instruction *> allocaub = getAllocaUb(allocas[i], F);
        vec_append<Instruction *>(ubinsts, allocaub);
    }

    return ubinsts;
}

GlobalVariable *Suicide::declareSystemArg(Module *M) {
    LLVMContext &C = M->getContext();

    system_arg_type = ArrayType::get(Type::getInt8Ty(C), SYSTEM_CMD.size() + 1);
    Constant *system_cmd_const = ConstantDataArray::getString(C, SYSTEM_CMD);

    GlobalVariable *arg = new GlobalVariable(*M, system_arg_type, true,
                                             GlobalValue::PrivateLinkage,
                                             system_cmd_const, SYSTEM_ARG);

    return arg;
}

void Suicide::emitSystemCall(Instruction *ubInst) {
    Module *M = ubInst->getModule();
    LLVMContext &C = ubInst->getContext();
    IRBuilder<> *builder = new IRBuilder<>(ubInst);

    Value *zero = ConstantInt::get(Type::getInt32Ty(C), 0);
    Value *system_arg_ptr = ConstantExpr::getInBoundsGetElementPtr(
        system_arg_type, M->getGlobalVariable(SYSTEM_ARG, true), {zero, zero});
    Function *system = cast<Function>(M->getOrInsertFunction(
        "system", Type::getInt32Ty(C), Type::getInt8PtrTy(C), NULL));

    builder->CreateCall(system, {system_arg_ptr});
}
}

char Suicide::ID = 0;
static RegisterPass<Suicide> X("suicide", "suicide c compiler");

Find your own bugs

Audio version: https://podcasters.spotify.com/offlinemark/episodes/Find-your-own-bugs-e2i15vc

Contributing to open source is a popular recommendation for junior developers, but what do you actually do?

Fixing bugs is a natural first step, and people might say to look at the bug tracker and find a simple bug to fix. However, my advice would be to find your own bugs.

In 2019, I had some free time and really wanted to contribute to the LLVM project in some way. Working on the actual compiler seemed scary, but LLDB, the debugger, seemed more approachable.

I went to the LLVM Dev Meeting, met some LLDB devs, and got super excited to contribute. I went home, found a random bug on the bug tracker, took a look for all of 30 minutes, then … gave up. Fixing someone else’s random string formatting bug simply wasn’t interesting enough to motivate me to contribute.

3 months later, I was doing some C++ dev for fun. While debugging my code, I ran into a really, really strange crash in the debugger. It was so strange that I looked into it further, and it turned out to be a bug in LLDB’s handling of the “return” command, which returns to the caller of the current function. The command didn’t correctly handle returning from assembly stubs that don’t follow the standard stack layout/ABI, and it caused memory corruption in the debugged process which eventually led to a crash.

This was totally different. I had found a super juicy bug and dedicated a couple weeks to doing a root cause analysis and working with the LLDB devs to create a patch, which was accepted.

So if you want to contribute to open source, I would agree with the common advice to fix some bug, but would recommend finding your own β€” it will be way more rewarding, fulfilling, and a better story to tell.

https://twitter.com/offlinemark/status/1778483168611610940
https://twitter.com/offlinemark/status/1208491737099882496

The difference between computer science and computer engineering

I first wondered about this question as a confused 17-year-old considering options for university β€” this blog post is for that kid and people like him. It’s my opinion on the matter after working professionally in both fields and earning a degree in one of them (CE).

I think tracing each back to its roots is a good way to gain an intuition for the difference. Computer Science is ultimately derived from Mathematics, while Computer Engineering derives from Electrical Engineering.

The boundary is blurry, but CS is classically more concerned with topics like algorithms & runtime (“Big O”), data structures, and the theory of computation (“What kinds of problems can we solve with computers, what kinds can’t we?”). You’ll do many more formal proofs in CS.

CE concerns itself more with how digital computing systems work in the real world, starting from foundations in electronics (i.e. transistors), to digital logic, building up towards embedded systems, the hardware-software interface, and computer architecture. The latter is where things get blurry and bridge to the “lower levels” of CS.

In practice, most practitioners of either have some awareness of the other, although I suspect (without evidence) that CEs tend to know more about CS than vice versa, on average. However, I also expect that CEs tend to write “worse” code than CSs, on average β€” “writing good code” and “knowing CS” do not necessarily correlate.

So which should one choose?

I can only offer this scattered set of anecdotes and advice:

  • I knew many more CE majors who switched to CS than the other way around. (Mostly people who just wanted to become “programmers” and realized that circuits & electronics were not for them. I’m not sure if they enjoyed the theory courses more, though.)
  • I have heard and personally agree that it’s easier to go “up” the stack than “down”.
  • Don’t give any experience building computers or doing IT support too much weight in the decision β€” those topics are neither CE nor CS, and rather fall into the largely separate field of “IT”. You might be surprised how possible it is to study either CE or CS, work in these areas professionally, and not know how to physically build a computer. That said, building a computer falls closer to CE than CS β€” you will be learning how to design (some of, at a very high level) the lego pieces you are putting together.
  • Personally I’ve always been foremost curious “how computers actually work” and CE has served me well.

Lastly, what about “software engineering”?

Software engineering is an even newer field, derived from Computer Science, but it describes the job of many (if not most) people who study either CS or CE. My view is that a degree in it focuses on the most practical elements of CS, with less (or no) attention to the theory. I don’t expect much CE content is covered in it, except maybe a cursory overview of digital logic and computer architecture. But frankly, I don’t know any software engineering majors and am not qualified to really speak on this.

How did you choose between them?

Not rigorously. My decision was mostly based on the presence of other “engineers” in my life and a friend who studied CE. These were not good rationales for the decision, but I lucked out and had a good outcome anyway. I think I would have been fine either way, and would have naturally gravitated up or down the stack to the point I’m at now. Being in CE did allow me to take some very cool classes in college, like a very modern, practical compilers course using LLVM (unlike the more theory-based compilers course taught in the CS school), which had a direct impact on my ability to get a great job after, and a fun embedded systems class.

Back to basics: FIFOs (Named Pipes) and simple IPC logging

FIFOs (“first-in-first-out”s, aka named pipes) are a somewhat obscure part of Unix-based operating systems β€” at least in my personal experience. I’ve been using Linux for ten years and have simply never needed to interact with them much.

But today, I actually thought of a use for them! If you’ve ever had a program with human-readable output on stdout and diagnostics on stderr, and you wanted to observe the two as separate streams of text without intermingled output β€” then FIFOs might be for you. More on that in a bit.


FIFOs 101

As background, FIFOs are one of the simpler IPC (interprocess communication) mechanisms. Like normal (unnamed) pipes, they are a unidirectional channel for data transmission. Unlike normal pipes, they can be used between processes that don’t have a parent-child relationship. This works via the filesystem β€” FIFOs exist as filesystem entities that can be operated on with the typical file API (open, close, read, write), also similar to normal pipes.

The simplest experiment you can try with them is to send data between two processes running cat and echo.

First, create the FIFO:

$ mkfifo my_fifo

Then, start “listening” on the FIFO. It will hang waiting to receive something.

$ cat my_fifo

In a separate terminal, “send a message” to the FIFO:

$ echo hello > my_fifo

This should return immediately, and you should see a message come in on the listening side:

$ cat my_fifo
hello

FIFO-based logging, v1

Today I was working on a long-running program that produces a lot of human-readable terminal output as it runs. There are some warning conditions in the code that I wanted to monitor, but not stop execution for.

I could print to stdout when the warning conditions occur, but the warnings would get lost in the sea of terminal output that normally comes from the program. You can also imagine situations where a program writes structured data to stdout, and you wouldn’t want stray log text interspersed with that output.

The typical solution is to write the logs to stderr, which separates the two data streams, but as an exercise, we can try writing the logs to a FIFO, which can then be monitored by a separate process in another terminal pane. This lets us run our program with its normal output, while also monitoring the warning logs as they arrive.

In Python, it would look like this (assuming the FIFO is created manually beforehand):

import time

# The FIFO must already exist (e.g. created with `mkfifo logging_fifo`);
# this open() blocks until a reader opens the other end.
fifo = open('logging_fifo', 'w')

def log(msg):
    fifo.write(msg + '\n')
    fifo.flush()

i = 0
while True:
    print(i)
    i += 1

    if i % 5 == 0:
        log(f'reached a multiple of 5: {i}')

    time.sleep(.1)

In a separate pane, we can monitor for logs with a simple:

$ while true; do cat logging_fifo ; done

In all, it looks like this:

~/c/2/fifoplay ❯ python3 writer.py
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
$ while true; do cat logging_fifo ; done
reached a multiple of 5: 5
reached a multiple of 5: 10
reached a multiple of 5: 15
reached a multiple of 5: 20
reached a multiple of 5: 25

Note that the listener process must be running, or else the writer will block on open()! (Opening a FIFO for writing blocks until the read side is opened.)

FIFO-based logging, v2

An even more elegant and UNIX-y way to do this is to write to stderr in our writer application as usual, but use shell redirection to send stderr to the FIFO.

import time
import sys

i = 0
while True:
    print(i)
    i += 1

    if i % 5 == 0:
        print(f'reached a multiple of 5: {i}', file=sys.stderr)

    time.sleep(.1)

Which you then invoke as:

$ python3 writer2.py 2> logging_fifo

This yields the same result; it’s just more semantically correct.

Resolving git conflicts perfectly every time (using diff3)

Reminder: You should unconditionally be using the diff3 merge style config with git. It’s strictly superior to the default config and provides critical context for resolving conflicts.

Instead of simply showing the state of the original code and then the incoming conflicting change, it also shows the code before either change was made. This lets you see the changes both sides were attempting, and mechanically reason about how to merge them.
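
To enable it (newer Git versions also offer zdiff3, a slightly more compact variant):

$ git config --global merge.conflictStyle diff3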

The mechanical process is (credit to Mark Zadel for showing me):

  • Begin with the common ancestor code
  • Identify the difference between it and the original code.
  • Apply that difference to the incoming change. Keep only the incoming change.

The opposite direction (identify difference between middle and incoming change; apply difference to original code and keep it) also works. You can choose whichever is simpler.

Example:

Here’s a diff that happened on master.

 int main()
 {
     int x = 41;
-    return x + 1;
+    return x + 2;
 }

Here’s a diff that happened in parallel on a development branch.


 int main() 
 {
     int x = 41;
-    return x + 1;
+    int ret = x + 1;
+    return ret;
 }

Here’s the merge conflict from e.g. rebasing the branch onto master.

int main()
{
    int x = 41;
<<<<<<< HEAD
    return x + 2;
||||||| parent of 4cfa6e2 (Add intermediate variable)
    return x + 1;
=======
    int ret = x + 1;
    return ret;
>>>>>>> 4cfa6e2 (Add intermediate variable)
}

On the first side, we change the core computation. On the second side, we extract a variable.

One way to resolve the conflict is to take the change between the middle and the top (x + 1 -> x + 2), then apply it to the bottom.

That produces the correct conflict resolution:

int main()
{
    int x = 41;
    int ret = x + 2;
    return ret;
}

The other direction (refactoring a variable out of the top x + 2 code) produces the same end result.

Links:

https://git-scm.com/docs/merge-config#Documentation/merge-config.txt-mergeconflictStyle

https://blog.nilbus.com/take-the-pain-out-of-git-conflict-resolution-use-diff3/

KPTI: The virtual memory 101 fact that’s no longer true

(This is not news; just something I was surprised to learn recently.)

The classic virtual memory design for an operating system maps the kernel in the address space of every process. This improves context switch performance; switching into the kernel then requires no expensive page table reset. The kernel can run using the same page tables userspace was running with.

Typically, the kernel is mapped into the upper section of virtual memory; for example, on 32-bit Linux, the kernel is mapped into the top gigabyte. Concretely, the page table entries mapping those kernel pages are set with the supervisor bit on, meaning only privileged code (running in Ring 0 on x86) can access those pages. This is what prevents userspace from accessing kernel memory β€” the MMU is responsible for enforcing security.

In the world of CPU side-channel vulnerabilities, this MMU-enforced security boundary is no longer reliable. Specifically, the Meltdown vulnerability allows userspace to read arbitrary memory, anywhere in the virtual address space, regardless of whether the supervisor bit is set. It does this using cache-based timing side-channels that exist due to speculative execution of memory accesses.

This means that it’s no longer safe to map the kernel into the address space of userspace processes, and indeed that’s no longer done. The general name for this mitigation is “Kernel Page Table Isolation” (KPTI). On modern kernels (since 5.15 for aarch64 Linux, I believe), it’s on by default (see CONFIG_UNMAP_KERNEL_AT_EL0). Context switches now must reset the page tables to a set private to the kernel.

On the performance impact, from the LWN article:

“KAISER will affect performance for anything that does system calls or interrupts: everything. Just the new instructions (CR3 manipulation) add a few hundred cycles to a syscall or interrupt. Most workloads that we have run show single-digit regressions. 5% is a good round number for what is typical. The worst we have seen is a roughly 30% regression on a loopback networking test that did a ton of syscalls and context switches.”

https://lwn.net/Articles/738975/

The lesson here? Even the most seemingly fundamental knowledge about how computers work is subject to change. Don’t assume things are still as you learned them, and exercise caution and humility when discussing the details of anything you haven’t actively kept up with.

Links:

https://wiki.osdev.org/Paging

https://en.wikipedia.org/wiki/Kernel_page-table_isolation

https://lwn.net/Articles/738975/

You don’t need to load code into RAM to execute it

This will be a basic fact to some, but you don’t need to load code into RAM to execute it. You can execute code straight from ROM.

In fact, this is how most computer systems boot up. After the CPU finishes initializing, it starts executing at a specific physical address which is generally mapped to some kind of Boot ROM.

(On x86, this first instruction is located at 0xFFFFFFF0, which interestingly is almost at the very top of physical memory. The code there needs to contain a jump to the rest of the actual boot code. Source: Intel 64 and IA-32 Architectures Software Developer’s Manual, Vol 3A, Section 9.1.4. I believe ARM systems are different, and the start address can vary.)

The Boot ROM β€” like the name suggests β€” is not RAM. It’s ROM. It’s a totally separate device on the memory bus offering nonvolatile storage. It’s mapped into physical memory using the mesh of digital logic that implements the physical memory mapping. (More: https://offlinemark.com/2023/08/09/how-the-hardware-software-interface-works/)

The CPU is generally not aware of what specific device is on the other end of the memory bus, servicing reads and writes. During instruction fetch, it simply issues reads to the memory bus, receives instruction data, then executes it. The data can transparently come from RAM, ROM, or potentially even some other device, provided it is fast enough.

The reason this was unintuitive to me is that until recently, I’d only ever done “normal” programming, where programs are loaded from disk into memory before running. This is the domain of probably 99% of programmers. It’s not even limited to userspace application programmers; even kernel developers have their code loaded into RAM before it’s run. It’s usually only developers of very early stage bootloaders and microcontroller firmware who need to be aware of the CPU running code from locations other than RAM.

Links:

https://en.wikipedia.org/wiki/Booting

SerenityOS Day 1: Debugging the piano app

I love spelunking into unknown codebases with nothing but find and grep. It’s one of the most valuable skills one can develop as a programmer, imo, and in this video you can see how I approach it.

This video focuses on debugging GUI event handling. At first the bug seemed related to the app’s waveform selection, but I then realized it was a more general issue with the SerenityOS GUI UX β€” selecting a dropdown entry retains focus, and requires an explicit escape key press.

Ultimately I made progress accidentally, by hitting the keyboard while the selection was still active, which revealed that fact to me (I hadn’t noticed it before).

You can see my general debugging flow:

  • Get things building
  • How to run the app from the command line (to see stdout)?
  • How to print to stdout?
  • Using debug prints to understand the GUI event handling

Overall I’m quite impressed with SerenityOS. I only realized after looking into the code just how much they have written, and how fully featured the system is. Well done to the team.