Featured post

Linux Internals: How /proc/self/mem writes to unwritable memory

Introduction

An obscure quirk of the /proc/*/mem pseudofile is its “punch through” semantics. Writes performed through this file will succeed even if the destination virtual memory is marked unwritable. In fact, this behavior is intentional and actively used by projects such as the Julia JIT compiler and rr debugger.

This behavior raises some questions: Is privileged code subject to virtual memory permissions? In general, to what degree can the hardware inhibit kernel memory access?

By exploring these questions1, this article will shed light on the nuanced relationship between an operating system and the hardware it runs on. We’ll examine the constraints the CPU can impose on the kernel, and how the kernel can bypass these constraints.

Continue reading
Featured post

Open source licensing for supervillains

This post covers my research into open source software licensing and my analysis of real-world open source projects that profit off of open source code via proprietary licenses.

Keep reading and you’ll learn:

  • What the difference between a restrictive and permissive license is
  • What dual licensing is and how you can use it make money off of open source code
  • What CLAs are and the specific clause your CLA needs for use with dual licensing
  • Examples of companies that implement dual licensing and how they do it

And of course: I am not a lawyer and none of this is legal advice.


Let’s talk evil. And by evil, I mean money.

Continue reading
Featured post

What they don’t tell you about demand paging in school

This post details my adventures with the Linux virtual memory subsystem, and my discovery of a creative way to taunt the OOM (out of memory) killer by accumulating memory in the kernel, rather than in userspace.

Keep reading and you’ll learn:

  • Internal details of the Linux kernel’s demand paging implementation
  • How to exploit virtual memory to implement highly efficient sparse data structures
  • What page tables are and how to calculate the memory overhead incurred by them
  • A cute way to get killed by the OOM killer while appearing to consume very little memory (great for parties)

Note: Victor Michel wrote a great follow up to this post here.

Continue reading
Featured post

How setjmp and longjmp work (2016)

Note, 2023-07-29 : I wrote this in 2016 when I was a much less experienced writer.


Pretty recently I learned about setjmp() and longjmp(). They’re a neat pair of libc functions which allow you to save your program’s current execution context and resume it at an arbitrary point in the future (with some caveats1). If you’re wondering why this is particularly useful, to quote the manpage, one of their main use cases is “…for dealing with errors and interrupts encountered in a low-level subroutine of a program.” These functions can be used for more sophisticated error handling than simple error code return values.

I was curious how these functions worked, so I decided to take a look at musl libc’s implementation for x86. First, I’ll explain their interfaces and show an example usage program. Next, since this post isn’t aimed at the assembly wizard, I’ll cover some basics of x86 and Linux calling convention to provide some required background knowledge. Lastly, I’ll walk through the source, line by line.

Continue reading

SerenityOS Day 1: Debugging the piano app

I love spelunking into unknown codebases with nothing but find and grep. It’s one of the most valuable skills one can develop as a programmer imo and in this video you can see how I approach it.

This video focuses on debugging GUI event handling. At first the bug seemed related to the app’s waveform selection, but I then realized it was a more general topic with the SerenityOS GUI UX — selecting a dropdown entry retains focus, and requires an explicit escape key.

Ultimately I made progress accidentally by hitting the keyboard while the selection was still active, revealing to me that fact (which I hadn’t noticed before).

You can see my general debugging flow:

  • Get things building
  • How to run app from command line (to see stdout)?
  • How to print to stdout?
  • Using debug prints to understand the GUI event handling

Overall I’m quite impressed with SerenityOS. I only realized after looking into the code exactly how much code they had written and how fully featured the system is. Well done to the team.

Do you need to learn how to implement a red-black tree?

4 basic multithreading & lockfree programming exercises

Four exercises that touch basic multithreaded and lockfree programming concepts.

  1. Implement a program that attempts to use two threads to increment a global counter to 10,000 with each thread incrementing 5000. But make it buggy so that there are interleaving problems and the end result of the counter is less than 10,000.
  2. Fix the above with atomics.
  3. Implement a variant of the program: instead of simply incrementing the counter, make the counter wrap every 16 increments (as if incrementing through indices of an array of length 16). Make two threads each attempt to increment the counter (16 * 5000) times. The end state should have the counter be back at index zero. Implement it in a buggy naive way that causes the counter to often be nonzero, even if atomics are used.
  4. Fix the above using a CAS loop.
  5. (Bonus question for the above: Why isn’t std::atomic::compare_exchange_strong a good fit here?)
Continue reading

Mutexes, atomics, lockfree programming

Some rough lab notes on these topics to record the current state of my knowledge. I’m not an expert, so there may be inaccuracies.

Mutexes

  • On Linux, libpthread mutexes are implemented using the underlying futex syscall
  • They are basically a combination of a spinlock (in userspace), backed by the kernel for wait/signal operations only when absolutely necessary (i.e. when there’s contention). In the common case of an uncontended lock acquire, there is no context switch which improves performance
  • The userspace spinlock portion uses atomics as spinlocks usually do, specifically because the compare and set must be atomic
  • Jeff Preshing (see below) writes that each OS/platform has an analogous concept to this kind of “lightweight” mutex — Windows and macOS have them too
  • Before futex(2), other syscalls were used for blocking. One option might have been the semaphore API, but commit 56c910668cff9131a365b98e9a91e636aace337a in glibc is before futex, and it seems like they actually use signals. (pthread_mutex_lock -> __pthread_lock (still has spinlock elements, despite being before futex) -> suspend() -> __pthread_suspend -> __pthread_wait_for_restart_signal -> sigsuspend)
  • A primary advantage of futex over previous implementations is that futexes only require kernel resources when there’s contention
  • Like atomics, mutexes implementations include memory barriers (maybe even implicitly due to atomics) to prevent loads/stores from inappropriately crossing the lock/unlock boundary due to compiler and/or hardware instruction reordering optimizations
Continue reading

How the hardware/software interface works

It’s really all about memory. But to start at the beginning, the rough stack looks like this:

  • Userspace application
  • Kernel driver
  • Hardware device

I find it easier to think about this from the middle out. On Linux, the kernel exposes hardware devices as files backed by the /dev virtual filesystem. Userspace can do normal syscalls like open, read, write, and mmap on them, as well as the less typical ioctl (for more arbitrary, device-specific functionality).1.

The files are created by kernel drivers which are modules of kernel code whose sole purpose is to interface with and abstract hardware so it can be used by other parts of the operating system, or userspace. They are implemented implemented using internal driver “frameworks” in the kernel, e.g. the I2C or SPI frameworks. When you interface with a file in /dev, you are directly triggering callback handlers in a driver which execute in the process context.

That’s how userspace interfaces with the kernel. How do drivers interface with hardware? These days, mostly via memory mapped I/O (MMIO)2. This is when device hardware “appears” at certain physical addresses, and can be interfaced with via load and store instructions using an “API” that the device defines. For example, you can read data from a sensor by simply reading a physical address, or write data out to a device by writing to an address. The technical term for the hardware component these reads/writes interface with is “registers” (i.e. memory mapped registers).

(Aside: Other than MMIO, the other main interface the kernel has with hardware is interrupts, for interrupt driven I/O processing (as opposed to polling, which is what MMIO enables). I’m not very knowledgeable about this, so I won’t get into it other than to say drivers can register handlers for specific IRQ (interrupt requests) numbers, which will be invoked by the kernel’s generic interrupt handling infrastructure.)

Using MMIOs looks a lot like embedded bare metal programming you might do on a microcontroller like a PIC. At the lowest level, a kernel driver is really just embedded bare metal programming.

Here’s an example of a device driver for UART (serial port) hardware for ARM platforms: linux/drivers/tty/serial/amba-pl011.c. If you’re debugging an ARM Linux system via a serial connection, this is might be the driver being used to e.g. show the boot messages.

The lines like:

cr = readb(uap->port.membase + UART010_CR);

are where the real magic happens.

This is simply doing a read from a memory address derived from some base address for the device, plus some offset of the specific register in question. In this case it’s reading some control information from a Control Register.

#define UART010_CR		0x14	/* Control register. */

linux/include/linux/amba/serial.h#L28

Device interfaces may range from having just a few to many registers.

To go one step deeper down the rabbit hole, how do devices “end up” at certain physical addresses? How is this physical memory map interface implemented?3

The device/physical address mapping is implemented in digital logic outside the CPU, either on the System on Chip (SOC) (for embedded systems), or on the motherboard (PCs)4. The CPU’s physical interface include the address, data, and control buses. Digital logic converts bits of the address bus into signals that mutually exclusively enable devices that are physically connected to the bus. The implementations of load/store instructions in the CPU set a read/write bit appropriately in the Control bus, which lets devices know whether a read or write is happening. The data bus is where data is either transferred out from or into the CPU.

In practice, documentation for real implementations of these systems can be hard to find, unless you’re a customer of the SoC manufacturer. But there are some out there for older chips, e.g.

Here’s a block diagram for the Tegra 2 SoC architecture, which shipped in products like the Motorola Atrix 4G, Motorola Droid X2, and Motorola Photon. Obviously it’s much more complex than my description above. Other than the two CPU cores in the top left, and the data bus towards the middle, I can’t make sense of it. (link)

While not strictly a “System on Chip”, a class PIC microcontroller has many shared characteristics of a SoC (CPU, memory, peripherals, all in one chip package), but is much more approachable.

We can see the single MIPS core connected to a variety of peripheral devices on the peripheral bus. There’s even layers of peripheral bussing, with a “Peripheral Bridge” connected to a second peripheral bus for things like I2C and SPI.

How to be happy

Note to self:

  1. Remember: You are already enough just as you are, right here, right now. You don’t need to achieve or do anything. 1
  2. Remember: The only competition in life — if you must think of it that way — is to know yourself as fully as possible, and act with maximum authenticity towards that truth.
  3. Remember: All things considered, you have it good — many around the world would kill to switch places and inherit every single one of your problems.

It will never be easier than right now

This is a mindset I use to help with procrastination. It first came to me in my senior year on university when I needed to do lab reports. At that point, I had been doing lab reports for 8 years — ever since the start of high school. And throughout that whole time, they were always excruciating.

But I realized that they were excruciating partly because I always waited until the days before the deadline to do them, which was about a week after the actual lab. By that point, the details of the lab were much fuzzier, making the lab report way harder.

It occurred to me that even though all I wanted to do after the lab is forget everything about it and push it off to the side, that exact moment — right after the lab — would be the easiest moment to ever do the report. As more time passes, it will strictly get harder as I begin to lose the context of the lab.

So I sucked it up and started to immediately go to the library right after the lab and simply do the report right then. It worked very well and I only wished I had started the habit years earlier.

I try to remember this lesson and apply it to my life now. If there are situations where I need to do something, and no additional information will arrive that will influence how the job gets done, I try to do it as quickly as possible to take advantage of the context fresh in my brain.

TIL: Debugging microcontrollers may require hardware breakpoints

I was debugging a microcontroller recently and was surprised to see that I was limited to 4 breakpoints. That surprised me because I’m usually only used to seeing that kind of limitation when using watchpoints, not breakpoints. (See my youtube video for more on this).

But it makes sense after all — code running on this microcontroller is different than in a process in an OS’s userspace because the code will probably be flashed into some kind of ROM (Read Only Memory) (or EEPROM). Since the code is unwritable at a hardware level, the typical method of inserting software breakpoint instructions isn’t possible. So the alternative is to use the microprocessor’s support for hardware debugging, which occupies hardware resources and is thus finite.

WIP: Business models can have significant technical impact

Business models are an interesting topic that lives directly at the boundary between business and technology.

The business model describes an abstract plan for how the business is to function sustainably, and is a full time job to develop in and of itself. Then, it must be implemented in the product using technology.

For example, a typical SaaS business model involves subscription pricing at various intervals (monthly & yearly), with a discount given for the yearly plan in exchange for more money paid up front and longer commitment. There may or may not be a limited time free trial period, or alternatively a limited time guaranteed refund. Furthermore, there may be different pricing tiers that unlock more advanced features.

Ultimately, this is all going to end up as code that models the various pricing tiers, plans, timing deadlines, and enforces security (i.e. making sure the advanced features are only available to pro users). Stripe is a common service used to model arbitrary business models and execute payments processing. They provide client libraries offering data models for concepts like users and plans.

While subscription models have grown in popularity as software shifts to the web, it’s not the only model. The older model of buying discrete software packaged versions still exists, mostly used by vendors of desktop software. In this model, customers pay a larger sum for unlimited use of a specific major version of software. Then they can optionally pay again to upgrade to a newer major version when it’s released. A variant of this exists where there’s no upgrade fee (i.e. “lifetime free updates”). Another model exists called “Rent to own” which is like a subscription, except payments stop at a certain point after which there is free unlimited use.

Whether the software runs on the client or server is another dimension to consider which may influence the business model — server side software is most commonly sold using subscriptions these days.

Client Server
SubscriptionAdobeSaaS Web Apps
One time payment per versionDash, Ableton Live
One time payment, lifetime free updatesFL Studio, Tailwind CSS (technically more assets than software)SaaS Web Apps Lifetime Plans (e.g Roam Research)
Rent to ownAudio plugins via Splice

The business model impacts technical strategy in a few ways.

  1. Branching and release management
  2. Compatibility of document artifacts

Subscriptions, “lifetime free updates”, and “rent to own” are simplest in terms of branching and release management. All users are expected to run the latest software (because it’s either free or automatic, in the case of server side), so development can largely happen on a single main branch which releases are cut from.

“One time payment per version” is more complex because potentially multiple major versions of the software need to be maintained in parallel. Ideally bugfixes in Version 2 would also be applied to V3 and V4 when appropriate, but no new feature development should make its way back to V2 and risk being included in binaries send to users that only paid for V2. For this situation, long term release branches make sense as they provide code isolation, though they require some mechanism to forward bugfixes, etc.

The second aspect is document artifact compatibility. Again in this scenario, subscriptions, “lifetime free updates”, and “rent to own” are simpler in that all users can be assumed to be running on the latest version — or at least can be told to upgrade at no cost.

“One time payment per version” is again more complex because multiple versions of the software exist in the wild, all producing artifacts of slightly different versions. If artifacts may be sent between users of different versions, it creates a mess of trying dealing with all the compatibility issues that may arise.

TIL: ARM64 doesn’t include conditional instructions

A major difference between x86 and ARM32 is that while x86 generally1 only offers conditional execution of branch instructions (e.g. BNE) ARM32 offers conditional execution for many more instructions (e.g. ALU instructions, like ADD.EQ- add if not equal). The idea was that these can be used to avoid emitting traditional branching instruction sequences which may suffer from pipeline stalls when a branch is mispredicted — instead, straight line code with equivalent semantics can be emitted.

This was actually removed in ARM64. The official quote:

The A64 instruction set does not include the concept of predicated or conditional execution. Benchmarking shows that modern branch predictors work well enough that predicated execution of instructions does not offer sufficient benefit to justify its significant use of opcode space, and its implementation cost in advanced implementations.

https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf

Turns out that the whole pipeline stall problem is generally not a huge issue anymore as branch prediction has gotten so good, while support for the feature still requires allocating valuable instructions bits to encode the conditions. Note that ARM64 still uses 32 bit instructions, so conserving bits is still useful.

What is very interesting is that Intel’s recent APX extensions to x86-64 (whose purpose is to primarily add more general purpose registers) moves closer to this conditional instructions direction.

The performance features introduced so far will have limited impact in workloads that suffer from a large number of conditional branch mispredictions. As out-of-order CPUs continue to become deeper and wider, the cost of mispredictions increasingly dominates performance of such workloads. Branch predictor improvements can mitigate this to a limited extent only as data-dependent branches are fundamentally hard to predict.

To address this growing performance issue, we significantly expand the conditional instruction set of x86, which was first introduced with the Intel® Pentium® Pro in the form of CMOV/SET instructions. These instructions are used quite extensively by today’s compilers, but they are too limited for broader use of if-conversion (a compiler optimization that replaces branches with conditional instructions).”

Including support for conditional loads and stores which is apparently tricky with modern out of order and superscalar architectures.


https://stackoverflow.com/questions/22168992/why-are-conditionally-executed-instructions-not-present-in-later-arm-instruction

https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html