Contributing to open source is a popular recommendation for junior developers, but what do you actually do?
Fixing bugs is a natural first step, and people might say to look at the bug tracker and find a simple bug to fix. However, my advice would be to find your own bugs.
In 2019, I had some free time and really wanted to contribute to the LLVM project in some way. Working on the actual compiler seemed scary, but LLDB, the debugger, seemed more approachable.
I went to the LLVM Dev Meeting, met some LLDB devs, and got super excited to contribute. I went home, found a random bug on the bug tracker, took a look for all of 30 minutes, then … gave up. Fixing someone else’s random string formatting bug simply wasn’t interesting enough to motivate me to contribute.
Three months later, I was doing some C++ dev for fun. I was debugging my code and ran into a really, really strange crash in the debugger. It was so strange that I looked into it further, and it turned out to be a bug in LLDB’s handling of the “return” command for returning to the caller of the current function. The command didn’t correctly handle returning from assembly stubs that don’t follow the standard stack layout/ABI, and it caused memory corruption in the debugged process which eventually led to a crash.
This was totally different. I had found a super juicy bug and dedicated a couple weeks to doing a root cause analysis and working with the LLDB devs to create a patch, which was accepted.
So if you want to contribute to open source, I would agree with the common advice to fix some bug, but would recommend finding your own: it will be way more rewarding, fulfilling, and a better story to tell.
I first wondered about this question as a confused 17-year-old considering options for university; this blog post is for that kid and people like him. It’s my opinion on the matter after working professionally in both fields and earning a degree in one of them (CE).
I think tracing each back to its roots is a good way to gain an intuition for the difference. Computer Science is ultimately derived from Mathematics, while Computer Engineering derives from Electrical Engineering.
The boundary is blurry, but CS is classically more concerned with topics like algorithms & runtime (“Big O”), data structures, and the theory of computation (“What kinds of problems can we solve with computers, what kinds can’t we?”). You’ll do many more formal proofs in CS.
CE concerns itself more with how digital computing systems work in the real world, starting from foundations in electronics (i.e. transistors), to digital logic, building up towards embedded systems, the hardware-software interface, and computer architecture. The latter is where things get blurry and bridge to the “lower levels” of CS.
In practice, most practitioners of either have some awareness of the other, although I suspect (without evidence) that CEs tend to know more about CS than vice versa, on average. However, I also expect that CEs tend to write “worse” code than CSs, on average: “writing good code” and “knowing CS” do not necessarily correlate.
So which should one choose?
I can only offer this scattered set of anecdotes and advice:
I knew many more CE majors who switched to CS than the other way around. (Mostly people who just wanted to become “programmers” and realized that circuits & electronics are not for them. I’m not sure if they enjoyed the theory courses more, though.)
I have heard and personally agree that it’s easier to go “up” the stack than “down”.
Don’t give any experience building computers or doing IT support too much weight in the decision; those topics are neither CE nor CS and instead fall into the largely separate field of “IT”. You might be surprised how possible it is to study either CE or CS, work in these areas professionally, and still not know how to physically build a computer. In general, though, building a computer falls closer to CE than CS: in CE you will learn how to design (some of, at a very high level) the Lego pieces you are putting together.
Personally, I’ve always been foremost curious about “how computers actually work”, and CE has served me well.
Lastly, what about “software engineering?”
Software engineering is an even newer field, derived from Computer Science, and it describes the day-to-day job of many (if not most) people who study either CS or CE. My view is that a degree in it focuses on the most practical elements of CS, with less (or no) attention to the theory. I don’t expect much of CE is covered in it, except maybe a cursory overview of digital logic and computer architecture. But frankly, I don’t know any software engineering majors and am not qualified to really speak on this.
How did you choose between them?
Not rigorously. My decision was mostly based on the presence of other “engineers” in my life and a friend who studied CE. Those were not good rationales for the decision, but I lucked out and had a good outcome anyway. I think I would have been fine either way, and would have naturally gravitated up or down the stack to the point I’m at now. Being in CE did allow me to take some very cool classes in college, like a very modern, practical compilers course using LLVM (unlike the more theory-based compilers course taught in the CS school), which had a direct impact on my ability to get a great job afterward, and a fun embedded systems class.
FIFOs (“first-in-first-out”s, aka named pipes) are a somewhat obscure part of Unix-based operating systems, at least in my personal experience. I’ve been using Linux for ten years and simply have never needed to interact with them much.
But today, I actually thought of a use for them! If you’ve ever had a program with human-readable output on stdout and diagnostics on stderr, and you wanted to observe the two as separate streams of text without intermingled output, then FIFOs might be for you. More on that in a bit.
Youtube version of this blog post:
FIFOs 101
As background, FIFOs are one of the simpler IPC (Interprocess Communication) mechanisms. Like normal (un-named) pipes, they are a unidirectional channel for data transmission. Unlike normal pipes, they can be used between processes that don’t have a parent-child relationship. This happens via the filesystem: FIFOs exist as filesystem entities that can be operated on with the typical file API (open, close, read, write), also similar to normal pipes.
The simplest experiment you can try with them is to send data between two processes running cat and echo.
First, create the FIFO:
$ mkfifo my_fifo
Then, start “listening” on the FIFO. It will hang waiting to receive something.
$ cat my_fifo
In a separate terminal, “send a message” to the FIFO:
$ echo hello > my_fifo
This should return immediately, and you should see a message come in on the listening side:
$ cat my_fifo
hello
FIFO-based logging, v1
Today I was working on a long-running program that produces a lot of human-readable terminal output as it runs. There are some warning conditions in the code that I wanted to monitor, but not stop execution for.
I could print to stdout when the warning conditions occurred, but they’d get lost in the sea of terminal output that normally comes from the program. You can also imagine situations where a program is writing structured data to stdout, and you wouldn’t want stray log text interspersed with that output.
The typical solution is to write the logs to stderr, which separates the two data streams, but as an exercise, we can try writing the logs to a FIFO, which can then be monitored by a separate process in another terminal pane. This lets us run our program with its normal output, while also monitoring the warning logs when they come.
In Python, it would look like this (assuming the FIFO has already been created manually):
import time

fifo = open('logging_fifo', 'w')

def log(msg):
    fifo.write(msg + '\n')
    fifo.flush()

i = 0
while True:
    print(i)
    i += 1
    if i % 5 == 0:
        log(f'reached a multiple of 5: {i}')
    time.sleep(.1)
In a separate pane, we can monitor for logs with a simple:
$ while true; do cat logging_fifo ; done
reached a multiple of 5: 5
reached a multiple of 5: 10
reached a multiple of 5: 15
reached a multiple of 5: 20
reached a multiple of 5: 25
Note that the listener process must be running or else the writer will block!
FIFO-based logging, v2
An even more elegant and UNIX-y way to do this would be to still write to stderr in our writer application, but use shell redirection to redirect stderr to the FIFO.
import time
import sys

i = 0
while True:
    print(i)
    i += 1
    if i % 5 == 0:
        print(f'reached a multiple of 5: {i}', file=sys.stderr)
    time.sleep(.1)
Which you then invoke as:
$ python3 writer2.py 2> logging_fifo
This yields the same result; it’s just more semantically correct.
Reminder: you should unconditionally be using the diff3 merge conflict style with git (git config --global merge.conflictstyle diff3). It’s strictly superior to the default config and provides critical context for resolving conflicts.
Instead of simply showing the state of the original code and then the incoming conflicting change, it also shows the code before either change was made. This lets you see the changes both sides were attempting and mechanically reason about how to merge them.
The mechanical process is (credit to Mark Zadel for showing me):
Begin with the common ancestor code
Identify the difference between it and the original code.
Apply that difference to the incoming change, and keep only the result.
The opposite direction (identify the difference between the middle and the incoming change; apply that difference to the original code and keep only that) also works. You can choose whichever is simpler.
Example:
Here’s a diff that happened on master.
int main()
{
    int x = 41;
-    return x + 1;
+    return x + 2;
}
Here’s a diff that happened in parallel on a development branch.
int main()
{
    int x = 41;
-    return x + 1;
+    int ret = x + 1;
+    return ret;
}
Here’s the merge conflict from e.g. rebasing the branch onto master.
int main()
{
    int x = 41;
<<<<<<< HEAD
    return x + 2;
||||||| parent of 4cfa6e2 (Add intermediate variable)
    return x + 1;
=======
    int ret = x + 1;
    return ret;
>>>>>>> 4cfa6e2 (Add intermediate variable)
}
On the first side, we change the core computation. On the second side, we extract a variable.
One way to resolve the conflict is to take the change between the middle and the top (x + 1 -> x + 2) and apply it to the bottom.
That produces the correct conflict resolution:
int main()
{
    int x = 41;
    int ret = x + 2;
    return ret;
}
The other direction (extracting a variable out of the top’s x + 2 code) produces the same end result.
(This is not news; just something I was surprised to learn recently.)
The classic virtual memory design for an operating system maps the kernel in the address space of every process. This improves context switch performance; switching into the kernel then requires no expensive page table reset. The kernel can run using the same page tables userspace was running with.
Typically, the kernel is mapped into the upper section of virtual memory. For example, on 32-bit Linux, the kernel is mapped into the top gigabyte. Concretely, to implement this, the page table entries mapping those kernel pages are set with the supervisor bit on. This means only privileged code (running in Ring 0 on x86) can access those pages. This is what enforces security and prevents userspace from accessing kernel memory. The MMU is therefore responsible for enforcing security.
In the world of CPU side-channel vulnerabilities this MMU enforced security boundary is no longer reliable. Specifically, the Meltdown vulnerability allows userspace to read arbitrary memory, anywhere in the virtual address space, regardless of whether the supervisor bit is set. It does this using cache-based timing side-channels that exist due to speculative execution of memory accesses.
This means that it’s no longer safe to map the kernel into the address space of userspace processes, and indeed that’s no longer done. The general name for this mitigation is “Kernel Page Table Isolation” (KPTI). On “modern” kernels (since 5.15 for aarch64 Linux, I believe), it’s on by default (see CONFIG_UNMAP_KERNEL_AT_EL0). Context switches now must reset the page tables to a set private to the kernel.
The mitigation isn’t free. Quoting the performance notes from the original KAISER patch series (KAISER being the pre-merge name for what became KPTI):
“KAISER will affect performance for anything that does system calls or interrupts: everything. Just the new instructions (CR3 manipulation) add a few hundred cycles to a syscall or interrupt. Most workloads that we have run show single-digit regressions. 5% is a good round number for what is typical. The worst we have seen is a roughly 30% regression on a loopback networking test that did a ton of syscalls and context switches.”
The lesson here? Even the most seemingly fundamental knowledge about how computers work is subject to change. Don’t assume things are still as you learned them, and exercise caution and humility when discussing details of things whose development you haven’t actively kept up with.
This will be a basic fact to some, but you don’t need to load code into RAM to execute it. You can execute code straight from ROM.
In fact, this is how most computer systems boot up. After the CPU finishes initializing, it starts executing at a specific physical address which is generally mapped to some kind of Boot ROM.
(On x86, this first instruction is located at 0xFFFFFFF0, which is interestingly almost at the very top of memory. The code there then needs to contain a jump to the rest of the actual boot code. (Source: Intel 64 and IA-32 Architectures Software Developer’s Manual, Vol 3A, Section 9.1.4)
I believe ARM systems are different and the start address can vary.)
The Boot ROM, like the name suggests, is not RAM. It’s ROM. It’s a totally separate device on the memory bus offering nonvolatile storage. It’s mapped into physical memory using the mesh of digital logic that implements the physical memory mapping. (More: https://offlinemark.com/2023/08/09/how-the-hardware-software-interface-works/)
The CPU is generally not aware of what specific device is on the other end of the memory bus, servicing reads and writes. During instruction fetch, it simply issues reads to the memory bus, receives instruction data, then executes it. The data can transparently come from RAM, ROM, or potentially even some other device, provided it is fast enough.
The reason this was unintuitive to me is that, until recently, I’ve only ever done “normal” programming, where programs are loaded from disk into memory before running. This is the domain of probably 99% of programmers. And it’s not even just limited to userspace application programmers; even kernel developers have their code loaded into RAM before it’s run. It’s usually only developers of very early-stage bootloaders and microcontroller firmware who need to be aware of the CPU running code from locations other than RAM.
I love spelunking into unknown codebases with nothing but find and grep. It’s one of the most valuable skills one can develop as a programmer, in my opinion, and in this video you can see how I approach it.
This video focuses on debugging GUI event handling. At first the bug seemed related to the app’s waveform selection, but I then realized it was a more general issue with the SerenityOS GUI UX: selecting a dropdown entry retains focus and requires an explicit Escape key press to dismiss.
Ultimately I made progress by accident: hitting the keyboard while the selection was still active revealed that fact to me (I hadn’t noticed it before).
You can see my general debugging flow:
Get things building
How to run app from command line (to see stdout)?
How to print to stdout?
Using debug prints to understand the GUI event handling
Overall I’m quite impressed with SerenityOS. I only realized after looking into the code exactly how much code they had written and how fully featured the system is. Well done to the team.
Four exercises that touch on basic multithreaded and lock-free programming concepts.
Implement a program that attempts to use two threads to increment a global counter to 10,000, with each thread incrementing it 5,000 times. But make it buggy so that there are interleaving problems and the end result of the counter is less than 10,000. (See the sketch after this list for a starting point.)
Fix the above with atomics.
Implement a variant of the program: instead of simply incrementing the counter, make the counter wrap every 16 increments (as if incrementing through indices of an array of length 16). Make two threads each attempt to increment the counter (16 * 5000) times. The end state should have the counter be back at index zero. Implement it in a buggy naive way that causes the counter to often be nonzero, even if atomics are used.
Fix the above using a CAS loop.
(Bonus question for the above: Why isn’t std::atomic::compare_exchange_strong a good fit here?)
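To make exercise 1 concrete, here is a minimal sketch of what the buggy version might look like in C++. This is my own illustration; the thread and iteration counts come from the exercise, everything else is assumed.

// Two threads race on an unsynchronized global counter, so increments
// can be lost and the final value is usually less than 10,000.
#include <cstdio>
#include <thread>

int counter = 0;  // deliberately NOT std::atomic<int>

void increment_many() {
    for (int i = 0; i < 5000; i++) {
        counter++;  // non-atomic read-modify-write: a data race
    }
}

int main() {
    std::thread t1(increment_many);
    std::thread t2(increment_many);
    t1.join();
    t2.join();
    // If the race doesn't show up, larger counts make lost updates more likely.
    std::printf("counter = %d (expected 10000)\n", counter);
}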
It’s really all about memory. But to start at the beginning, the rough stack looks like this:
Userspace application
Kernel driver
Hardware device
I find it easier to think about this from the middle out. On Linux, the kernel exposes hardware devices as files backed by the /dev virtual filesystem. Userspace can do normal syscalls like open, read, write, and mmap on them, as well as the less typical ioctl (for more arbitrary, device-specific functionality).
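As a concrete, hypothetical sketch of that userspace side (the device path is made up and error handling is minimal), it can look like this:

#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

int main() {
    // Open a device node just like a regular file.
    // /dev/ttyUSB0 is only an example; other character devices work similarly.
    int fd = open("/dev/ttyUSB0", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    // This read() ends up invoking the driver's read handler in the kernel.
    char buf[64];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n >= 0)
        std::printf("read %zd bytes from the device\n", n);

    close(fd);
    return 0;
}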
The files are created by kernel drivers, which are modules of kernel code whose sole purpose is to interface with and abstract hardware so it can be used by other parts of the operating system, or by userspace. They are implemented using internal driver “frameworks” in the kernel, e.g. the I2C or SPI frameworks. When you interface with a file in /dev, you are directly triggering callback handlers in a driver, which execute in the process context.
That’s how userspace interfaces with the kernel. How do drivers interface with hardware? These days, mostly via memory mapped I/O (MMIO). This is when device hardware “appears” at certain physical addresses, and can be interfaced with via load and store instructions using an “API” that the device defines. For example, you can read data from a sensor by simply reading a physical address, or write data out to a device by writing to an address. The technical term for the hardware component these reads/writes interface with is “registers” (i.e. memory mapped registers).
(Aside: Other than MMIO, the other main interface the kernel has with hardware is interrupts, for interrupt-driven I/O processing (as opposed to polling, which is what MMIO enables). I’m not very knowledgeable about this, so I won’t get into it other than to say drivers can register handlers for specific IRQ (interrupt request) numbers, which will be invoked by the kernel’s generic interrupt handling infrastructure.)
Using MMIO looks a lot like the embedded bare-metal programming you might do on a microcontroller like a PIC or Arduino (AVR). At the lowest level, a kernel driver is really just embedded bare-metal programming.
Here’s an example of a device driver for UART (serial port) hardware for ARM platforms: linux/drivers/tty/serial/amba-pl011.c. If you’re debugging an ARM Linux system via a serial connection, this might be the driver being used to e.g. show the boot messages.
The lines like:
cr = readb(uap->port.membase + UART010_CR);
are where the real magic happens.
This is simply doing a read from a memory address derived from some base address for the device, plus some offset of the specific register in question. In this case it’s reading some control information from a Control Register.
Device interfaces may range from having just a few to many registers.
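To make that concrete, here’s a rough bare-metal-style sketch of what such a register access boils down to. The base address, offset, and bit are invented for illustration; on real hardware (or in a driver, with the region mapped via ioremap) they would come from the device’s datasheet or device tree.

#include <cstdint>

// Hypothetical device: a UART-like peripheral with a control register.
// These constants are made up for illustration.
constexpr uintptr_t UART_BASE      = 0x10000000;
constexpr uintptr_t UART_CR_OFFSET = 0x30;

// volatile forces every access to actually hit the bus; the compiler
// may not cache or optimize away these reads/writes.
static inline uint32_t mmio_read32(uintptr_t addr) {
    return *reinterpret_cast<volatile uint32_t *>(addr);
}

static inline void mmio_write32(uintptr_t addr, uint32_t value) {
    *reinterpret_cast<volatile uint32_t *>(addr) = value;
}

void enable_uart() {
    // Read-modify-write of the control register, e.g. setting an "enable" bit.
    // Only meaningful where this address range is actually mapped to the device;
    // in a normal userspace process this would simply fault.
    uint32_t cr = mmio_read32(UART_BASE + UART_CR_OFFSET);
    mmio_write32(UART_BASE + UART_CR_OFFSET, cr | 0x1);
}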
To go one step deeper down the rabbit hole, how do devices “end up” at certain physical addresses? How is this physical memory map interface implemented?
The device/physical address mapping is implemented in digital logic outside the CPU, either on the System on Chip (SoC) (for embedded systems) or on the motherboard (PCs). The CPU’s physical interfaces include the address, data, and control buses. Digital logic converts bits of the address bus into signals that mutually exclusively enable the devices physically connected to the bus. The implementations of load/store instructions in the CPU set a read/write bit appropriately on the control bus, which lets devices know whether a read or a write is happening. The data bus is where data is transferred either out of or into the CPU.
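As a toy software model of that address-decoding idea (the memory map below is entirely invented, and real decoding is combinational logic, not code):

#include <cstdint>
#include <cstdio>

// Which device's "chip select" gets asserted for a given physical address.
enum class Device { RAM, BootROM, Uart, None };

Device decode(uint32_t phys_addr) {
    if (phys_addr < 0x40000000) return Device::RAM;      // 1 GiB of RAM
    if (phys_addr < 0x40010000) return Device::BootROM;  // 64 KiB boot ROM
    if (phys_addr < 0x40020000) return Device::Uart;     // UART register window
    return Device::None;                                 // unmapped: bus error
}

int main() {
    // A load from 0x40010030 would be serviced by the UART in this made-up map.
    std::printf("%d\n", static_cast<int>(decode(0x40010030)));  // prints 2 (Uart)
}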
In practice, documentation for real implementations of these systems can be hard to find, unless you’re a customer of the SoC manufacturer. But there are some out there for older chips, e.g.
Here’s a block diagram for the Tegra 2 SoC architecture, which shipped in products like the Motorola Atrix 4G, Motorola Droid X2, and Motorola Photon. Obviously it’s much more complex than my description above. Other than the two CPU cores in the top left, and the data bus towards the middle, I can’t make sense of it. (link)
While not strictly a “System on Chip”, a classic PIC microcontroller shares many characteristics with an SoC (CPU, memory, peripherals, all in one chip package), but is much more approachable.
We can see the single MIPS core connected to a variety of peripheral devices on the peripheral bus. There are even layers of peripheral bussing, with a “Peripheral Bridge” connected to a second peripheral bus for things like I2C and SPI.
I was debugging a microcontroller recently and was surprised to see that I was limited to 4 breakpoints. I’m usually only used to seeing that kind of limitation with watchpoints, not breakpoints. (See my youtube video for more on this.)
But it makes sense after all: code running on this microcontroller is different from code running in a process in an OS’s userspace, because it will probably be flashed into some kind of ROM (Read Only Memory) or EEPROM. Since the code is unwritable at a hardware level, the typical method of inserting software breakpoint instructions isn’t possible. The alternative is to use the microcontroller’s support for hardware debugging, which occupies hardware resources and is thus finite.