An obscure quirk of the /proc/*/mem pseudofile is its “punch through” semantics. Writes performed through this file will succeed even if the destination virtual memory is marked unwritable. In fact, this behavior is intentional and actively used by projects such as the Julia JIT compiler and rr debugger.
This behavior raises some questions: Is privileged code subject to virtual memory permissions? In general, to what degree can the hardware inhibit kernel memory access?
By exploring these questions, this article will shed light on the nuanced relationship between an operating system and the hardware it runs on. We’ll examine the constraints the CPU can impose on the kernel, and how the kernel can bypass these constraints.
This post details my adventures with the Linux virtual memory subsystem, and my discovery of a creative way to taunt the OOM (out of memory) killer by accumulating memory in the kernel, rather than in userspace.
Keep reading and you’ll learn:
Internal details of the Linux kernel’s demand paging implementation
How to exploit virtual memory to implement highly efficient sparse data structures
What page tables are and how to calculate the memory overhead incurred by them
A cute way to get killed by the OOM killer while appearing to consume very little memory (great for parties)
Note: Victor Michel wrote a great follow-up to this post here.
In the future, we will see an industry of AI coaches. Some of these will compete with and take business away from existing human coaches (e.g. a writing coach), but much of it will fill gaps that are simply unfilled right now. Think coaching for areas that could be helpful for people, but are too niche to go out and find a coach for. Human coaches will not totally go away, because the flood of AI into society will reinforce in people the desire to speak to “real humans”. We see this today with customer support and how people simply want to get on the phone with a “real person” who can help them.
Just like how smartphones eventually became ubiquitous, personal AI assistants living on your phone or $BIG_TECH account will become the norm. Signing up for a Google account will initialize an AI assistant that will get to know you as you start to use GMail, GCal, etc. When you buy an Android phone, it will be there as soon as you log in.
Apps will come with AI assistants built into them similar to how we have chatbots in the lower right hand corner of websites.
Things will really start to get interesting once AIs can spend money for you. Think a monthly budget you give to your AI for buying groceries, household supplies, and more. Maybe it asks you before it triggers a buy, maybe it doesn’t.
The nice part about working for big companies is that you can be a part of massive product launches that make real waves and get major press coverage… and you didn’t even have to be there for the decades of work that led to it. You can join at the end for the “fun parts”.
This week we launched Push 3 at Ableton. I was only barely involved, but I was still able to participate in the excitement of releasing our new product to an audience that has been hungrily waiting for it for years.
This is not something you get when working for yourself, or for most startups. Or at least not without 100x the effort.
It’s also not a guarantee at large companies, but if you choose your company and team right, it’s a significant positive that makes many of the negatives of a large company worth it.
After calling fork(), a parent process gets its entire address space write-protected to facilitate copy-on-write (COW). Subsequent writes then trigger page faults.
This makes fork() unsafe to call from anywhere in a process with realtime deadlines — including non-realtime threads! Usually non-RT threads can do what they want, but this is an interesting exception.
On modern glibc, system() doesn’t use fork(); it uses posix_spawn(). But is posix_spawn() safe from a non-RT thread?
posix_spawn() doesn’t COW — the parent and child literally share memory — so the page fault issue doesn’t apply. However, the parent is suspended to prevent races between the child and parent. This seems RT-unsafe…
However, only the calling thread of the parent is suspended, meaning the RT threads are not suspended and continue running with no page faults.
So it is safe to use system() or posix_spawn() from a non-RT thread.
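To make this concrete, here’s a minimal sketch using Python’s os.posix_spawn wrapper (the /bin/sh command is just an example; a realtime audio app would call posix_spawn() from C, but Python’s wrapper goes through the same glibc implementation):

```python
import os

# Launch a subshell from a non-RT thread without paying fork()'s
# copy-on-write cost: glibc's posix_spawn() shares the parent's address
# space with the child instead of write-protecting it.
pid = os.posix_spawn("/bin/sh", ["sh", "-c", "echo hello from the child"],
                     os.environ)
os.waitpid(pid, 0)
```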
linux virtual memory fun fact of the day:
after a fork+exec to run a subshell, a parent process will page fault on every first write to every single writable page in its address space
There are a few main reasons why we get busier as we get older:
Adulting
As you age, you increasingly lose free time to “adulting” tasks: filing taxes, paying bills, taking your car to the shop, researching insurance alternatives.
Relationships
As we age, we accumulate relationships. And while they have numerous benefits and make life worth living, they don’t come for free. They require time and energy to maintain — and at the end of the day, they become tasks on our todo lists. Even something as innocuous as an old friend reaching out to send a text or schedule a call can, at times, feel like a burdensome task to accomplish.
When you’re a child or teenager, the only people you know are your family and your friends (your first generation of friends). Since you barely know anyone, you don’t really have to keep in touch with anyone. Thus, more free time.
Hobbies
As we age, we accumulate interests, hobbies, and pursuits. These also don’t come for free.
As an adult, you begin to explore the world — reading books, picking up rock climbing, learning to paint, planning & taking trips once or twice a year. Your old interests don’t exactly go away, and there are always worlds of new interests to discover. Part of you feels like you should maintain or get back to some of those old interests you cherished so much. Another part is excited to get into scuba diving.
When you’re a child or teenager, you might have just one or two pursuits that occupy your time outside school. That lack of accumulated hobbies from your past = more free time.
Aging
Aging means your body will start to perform worse and more slowly, likely even breaking down in places. You’ll spend more and more time going to doctor’s appointments, having surgeries, and tending to medical conditions. It will take more effort to maintain your body through fitness. This all takes time.
Like many problems in life, the frustration at your seemingly shrinking free time as you age can be helped by setting expectations properly. Instead of feeling cheated as your allowance of time shrinks year by year, expect it. Expect that, by all logic, given the adulting to do, relationships to maintain, pursuits to keep up with, and the natural course of aging, you should have no free time at all — which gives you all the more reason to celebrate and appreciate the rare free moment when it comes along.
I’m trying something new. I added WordPress tags for higher level categories of content that cut across the typical tags:
Deep Dive: Longer, more detailed posts that require significant research. Very expensive to produce.
Essay: Nonfiction writing, usually nontechnical.
Lab Notes: Rough notes, typically technical, usually bullet points about some topic.
Micropost: Tiny, short thought.
I hope that by categorizing things this way and acknowledging that this blog is a collection of different art forms (that appear similar because they’re all writing), I’ll be more comfortable publishing publicly. For a while I was scared to publish because I don’t have as much time for deep dives as I used to, which is what people seemed to really like. But acknowledging these different categories, and specifically calling out a shorter, rougher, less polished piece as such, takes the pressure off of publishing it.
There are many instances where your application has some expensive work to do that would be bad to execute in the hot path (e.g. while responding to an HTTP request).
The typical solution is to enqueue a task in a queue and have a worker process it “offline”
In Python, Celery is a popular library for this. It uses backends for the actual queue. Popular backends are RabbitMQ and Redis.
RQ is another Python library for this; it supports only Redis as a backend.
RabbitMQ is designed to be a queue — it’s in the name.
Redis is an in-memory key-value store/database (or “data structure store”). It includes a number of primitives that might be used to implement a queue. It’s basically an in-memory hashmap. Keys are strings in a flat namespace. Values are one of a set of supported fundamental data types. Everything is serialized as it crosses the IPC boundary.
List — (Implemented in RQ.) You can use commands like LPUSH and RPOP. (See the sketch after this list.)
Pub/Sub — Unsure about this. (Possibly implemented in Celery?)
Stream — Advanced but apparently is not implemented in either Celery or RQ.
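As a reference point, here’s a minimal sketch of the list-as-queue pattern using the redis-py client (the queue name and job payload are made up for illustration, and a local Redis server is assumed):

```python
import json
import redis

r = redis.Redis()  # assumes Redis running locally on the default port

# Producer (e.g. the web app): push a job onto the left end of the list.
r.lpush("jobs", json.dumps({"task": "send_email", "to": "user@example.com"}))

# Worker: block until a job arrives, then pop it from the right end (FIFO).
_key, raw = r.brpop("jobs")
job = json.loads(raw)
print("processing", job)
```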
Celery has sleek Pythonic syntactic sugar for specifying a “Task” and then calling it from the client (web app). It completely abstracts the queue. It returns a future to the client (AsyncResult) — the interface is conceptually similar to std::async in C++.
RQ is less opinionated: any callable can be passed into its .enqueue() function, along with the arguments to call it with. This has the advantage that the expensive code does not need Celery as a dependency (to decorate it). That said, the Celery dependency isn’t a real downside, since you can always keep things separate by writing Celery wrappers around otherwise dependency-free Python code. It’s just an additional layer. (Both styles are sketched below.)
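A rough sketch of the two styles, assuming a local Redis broker (the send_report function and its argument are invented for illustration):

```python
# Celery style: wrap the expensive code in a decorated task, call .delay().
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def send_report(user_id):
    ...  # expensive work; runs in a Celery worker process

# In the web app: returns an AsyncResult (a future) immediately.
# result = send_report.delay(42)
```

And roughly the equivalent with RQ, where the expensive function stays a plain callable (though it does need to live in a module the worker can import):

```python
# RQ style: enqueue any plain callable; no decorator required.
from redis import Redis
from rq import Queue

from reports import send_report  # ordinary, dependency-free function

q = Queue(connection=Redis())
job = q.enqueue(send_report, 42)  # an `rq worker` process picks this up
```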
Heroku supports this — you just need to add a new process to your Procfile for the Celery/RQ worker process. Both Celery and RQ provide the worker entry point (“main”) for you.
Redis has other uses beyond being a queue. It can be a simple cache that your application checks on the same server before hitting the real database server. It can also be used to implement a distributed lock (sounds fancy, but it’s basically just a single entry in Redis that clients can check to see if a “lock” is taken; of course it’s more complicated than just that). Redis also supports transactions, in a way that’s similar to transactional memory on CPUs. In general, there are direct parallels from local parallel programming to nearly everything in this distributed-systems world. That said, there are unique elements: the Redis distributed lock, for example, includes a timeout and a nonce in case the client that acquired the lock crashes or disappears forever, which is generally not something you’d see in a mutex implementation. Another difference: even though accessing Redis is shared mutable state, clients probably don’t need some other out-of-band mutex, because Redis itself likely provides the necessary atomicity. That’s different from local systems, where even if the shared, mutable data you’re reading and writing is atomic (like an int/word), you should still use a mutex/atomic due to instruction reordering (mutexes and atomics insert barriers).
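For illustration, here’s a bare-bones sketch of that lock pattern with redis-py, using SET NX with an expiry as the timeout and a random token as the nonce (the key name is made up; real implementations like redis-py’s built-in Lock or Redlock do the release check-and-delete atomically with a Lua script):

```python
import uuid
import redis

r = redis.Redis()

token = str(uuid.uuid4())  # nonce identifying this client
# Acquire: set the key only if it doesn't already exist (NX), with a 30 s
# expiry so the lock frees itself if this client crashes or disappears.
if r.set("report-lock", token, nx=True, ex=30):
    try:
        ...  # critical section
    finally:
        # Release only if we still own the lock. Note this check-then-delete
        # is racy; production code does it atomically in a Lua script.
        if r.get("report-lock") == token.encode():
            r.delete("report-lock")
```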
It’s been three years since I launched https://timestamps.me, and a little less than three years since I stopped working on it.
Since March 25th, 2020, here are the stats:
4465 uploads
$437 revenue earned
This works out to about 4 uploads per day, which, for me, is a great success.
Rough recap:
Feb 2020: I was taking my gap year and wanted to code again. I had the idea to work on a little tool for exporting locators from Ableton Live sets. I thought it would be fun to just quickly make it into a web app and ship it. I ended up hyperfocusing on the problem space and making it extremely accurate (handling tempo automation). I also made it work for FL Studio, which was difficult but a fun challenge.
March: I launched the web app (timestamps.fm at the time) and started trying to get users. It was extremely difficult. I posted on subreddits for DJing, Ableton, FL Studio, and Rekordbox.
Learnings and mistakes:
I wasted a ton of time porting to Google Cloud in an attempt to make the site run for free. It was an utter failure and I ended up porting back to Heroku.
I spent a ton of time DM’ing music producers that were performing at “e-fests” which were popular at the time (due to Covid-19, which was in full swing at the time). This was doomed to failure — none of them would find value in this niche product.
Ironically, the customer and user that got the most use out of it reached out to me, not the other way around. The CEO of a company that makes a high volume of DJ mixes for hotels and restaurants DM’d me on LinkedIn asking to set up a call. He found me via SEO/Google search and was able to find me on LinkedIn because I had been bold enough to put myself as “CEO, Timestamps.fm” on my LinkedIn profile. Lesson: Be bold!
If I really wanted to do this in a time and capital efficient way, I should have put in way more customer research before building this whole product (including super advanced features like hyper accurate tempo automation support). I didn’t care about this though because I was on my gap year, and was first and foremost doing it because it was a fun programming challenge.
I wasted a ton of time hand-coding HTML and modifying a free HTML theme I found online. I eventually rewrote the whole site in Bootstrap, which took a bunch of time. The breaking point came when I was trying to make a pricing page with different subscription options. It was going terribly with my hand-hacked HTML, and Bootstrap already included great-looking UI components for this. Learning Bootstrap was probably a good investment.
If I were to do it again: I would use no-code WAY more. I would try to avoid hand-coding any HTML if at all possible, and just do the minimal amount of code to have an API server running for doing the actual processing.
I put up a donation button. In 3 years, I’ve had 3 donors, for a total of $25 in donations.
I eventually learned that the majority of DJs don’t care about time-accurate records of their DJ mixes. A small subset of them do — those that operate in the radio DJ world, where they have licensing or reporting requirements. One DJ said they were required to submit timestamps so a radio show could show the track title on their web UI or something like that. But I was later surprised to see the site continuing to get traffic. Clearly there are some people out there that care. I haven’t bothered to figure out who they are or why. The site continues to be free, with no accounts necessary.
Smart things that I did right:
I had a lot of requests to support other DJ software like Traktor. I ignored them, which was a great move — it would have taken a lot more time and wouldn’t have moved the needle on the project.
I negotiated a good rate initially for the commercial customer’s subscription — $40 a month! I then made a questionable move and lowered it significantly, to $15 or so per month, when I made the site free. The deal was that I would make the site free, but the customer would pay to help me break even on it. I later lowered it even more for them when I switched to Timestamps.me (~$20), which is a much cheaper domain than timestamps.fm ($80). It was a good move to switch domains — the old domain was a risky liability: if the customer ever left, I would have been stuck with an expensive vanity domain for no real reason. I wasn’t going to become the next “last.fm” anytime soon.
SEO is the main driver, and continues to be to this day. I dominate the results for “ableton timestamps” etc. Posting on Reddit and the Ableton forum were good calls.
I experimented with different monetization strategies. Pay per use was an interesting experiment and I made a small amount of money.
If I were to actually try to start a business again, I would do a lot of things differently:
Be much more deliberate about picking the market and the kind of customer to serve
If you want to make money, make something that helps people who already make (and spend) money make even more money (the fact that they already make and spend it is powerful and important)
Pick a product idea that isn’t totally novel so that it’s not so hard to sell it or introduce. It’s great to be able to say “I’m like X, but different because of A and B and specifically designed for C”
The Linux kernel has an interesting difference compared to the Windows and macOS kernels: it offers syscall ABI compatibility.
This means that applications that program directly against the raw syscall interface are more or less guaranteed to always keep working, even with arbitrarily newer kernel versions. “Programming against the raw syscall interface” means including assembly code in your app that triggers syscalls:
setting the appropriate syscall number in the syscall register
setting arguments in the defined argument registers
executing a syscall instruction
reading the syscall return value register
Here are the ABIs for some common architectures.
| Architecture | Syscall Number Register | Syscall Arguments | Syscall Return Value |
| --- | --- | --- | --- |
| x86 | EAX | EBX, ECX, EDX, ESI, EDI, EBP | EAX |
| x86_64 | RAX | RDI, RSI, RDX, R10, R8, R9 | RAX |
| Armv7 | R7 | R0-R6 | R0 |
| AArch64 | X8 | X0-X5 | X0 |
Manticore is my go-to source to quickly look these up: https://github.com/trailofbits/manticore
Once you’ve done this, you’re relying on the kernel to not change any part of this. If the kernel changes any of these registers, or changes the syscall number mapping, your app will no longer trigger the desired syscall correctly and will break.
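To make the fragility concrete, here’s a tiny sketch that hardcodes a syscall number. It cheats by going through libc’s syscall(2) wrapper via ctypes instead of hand-written assembly, but the dependency on the kernel’s number mapping is exactly the same (39 is getpid on x86_64; the number differs on other architectures):

```python
import ctypes

libc = ctypes.CDLL(None, use_errno=True)  # the process's own libc

SYS_getpid = 39  # x86_64 syscall number for getpid(2); architecture-specific!

# If the kernel ever renumbered its syscalls, this would silently invoke
# the wrong one. Linux's ABI stability guarantee is what makes it safe.
pid = libc.syscall(SYS_getpid)
print("pid via raw syscall number:", pid)
```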
Aside from writing raw assembly in your app, there’s a more innocuous way of accidentally “programming directly against the syscall interface”: statically linking to libc. When you statically link to a library, that library’s code is directly included in your binary. libc is generally the system component responsible for implementing the assembly to trigger syscalls, and by statically linking to it, you effectively inline those assembly instructions directly into your application.
So why does Linux offer this and Windows and macOS don’t?
In general, compatibility is cumbersome. As a developer, if you can avoid having to maintain compatibility, it’s better. You have more freedom to change, improve, and refactor in the future. So by default it’s preferable to not maintain compatibility — including for kernel development.
Windows and macOS are able to not offer this compatibility because they control the libc for their platforms and the rules for using it. And one of their rules is “you are not allowed to statically link libc”, for the exact reason that it would encourage apps that depend directly on the syscall ABI, hindering the kernel developers’ ability to freely change the kernel’s implementation.
If all app developers are forced to dynamically link against libc, then as long as kernel developers also update libc with the corresponding changes to the syscall ABI, everything works. Old apps run on a new kernel will dynamically link against the new libc, which properly implements the new ABI. Compatibility is of course still maintained at the app/libc level — just not at the libc/kernel level.
Linux doesn’t control the libc in the same way Windows and macOS do because in the Linux world, there is a distinct separation between kernel and userspace that isn’t present in commercial operating systems. Strictly speaking Linux is just the kernel, and you’re free to run whatever userspace on top. Most people run GNU userspace components (glibc), but alternatives are not unheard of (musl libc, also bionic libc on Android).
So because Linux kernel developers can’t 100% control the libc that resides on the other end of the syscall interface, they bite the bullet and retain ABI compatibility. This technically allows you to statically link with more confidence than on other OSs. That said, there are other reasons why you shouldn’t statically link libc, even on Linux. The kernel’s own ABI documentation spells out the guarantee:
This directory documents the interfaces that the developer has
defined to be stable. Userspace programs are free to use these
interfaces with no restrictions, and backward compatibility for
them will be guaranteed for at least 2 years. Most interfaces
(like syscalls) are expected to never change and always be
available.
What: The kernel syscall interface
Description:
This interface matches much of the POSIX interface and is based
on it and other Unix based interfaces. It will only be added to
over time, and not have things removed from it.
Note that this interface is different for every architecture
that Linux supports. Please see the architecture-specific
documentation for details on the syscall numbers that are to be
mapped to each syscall.
Apple, by contrast, is explicit that it does not support this. From Apple’s developer Q&A on statically linked binaries:
Q: I'm trying to link my binary statically, but it's failing to link because it can't find crt0.o. Why?
A: Before discussing this issue, it's important to be clear about terminology:
A static library is a library of code that can be linked into a binary that will, eventually, be dynamically linked to the system libraries and frameworks.
A statically linked binary is one that does not import system libraries and frameworks dynamically, but instead makes direct system calls into the kernel.
Apple fully supports static libraries; if you want to create one, just start with the appropriate Xcode project or target template.
Apple does not support statically linked binaries on Mac OS X. A statically linked binary assumes binary compatibility at the kernel system call interface, and we do not make any guarantees on that front. Rather, we strive to ensure binary compatibility in each dynamically linked system library and framework.
If your project absolutely must create a statically linked binary, you can get the Csu (C startup) module from Darwin and try building crt0.o for yourself. Obviously, we won't support such an endeavor.
It can be handy to have two checkouts of a repo. One is the primary one (A) for working in. And the other is the “spare” (B).
This can be useful in a number of situations:
You’re in the middle of a rebase in A and want to quickly reference something in another branch. Instead of having to mess up your rebase state or go to GitHub, you just go to B.
You’re code reviewing an intense refactor of an API. It can be handy to quickly flip back and forth between the versions of the codebase before and after the API change to get a better sense of what changed. Sometimes the diff isn’t quite enough.
You’re code reviewing one branch and want to quickly code review another in a way that’s “immutable” to your work environment.
You want to quickly flip back and forth between builds of two different branches.
I posit that in many software projects there is a small core of the “most interesting” work, surrounded by a larger body of “support engineering”. The “support engineering” is in service of the “most interesting” core, making it usable and a good product.
For example:
Compilers: Core optimizations vs CLI arg parsing
Kernels: Core context switching vs some module that prints out the config the kernel was built with
Audio software: Core engine, data model, or file format work vs UI work on the settings menu
ChatGPT: Core machine learning vs front end web dev to implement the chat web UI
But the funny thing is that “interesting” is in the eye of the beholder. For every person that thinks X is the “most interesting”, perhaps most technical part of the project, there will be a different person that is totally uninterested in X and is delighted to let someone else handle this for them. This very well may be because X is too technical, too in the weeds.
This generalizes to work and society as a whole — people gravitate towards work that suits their interests. The areas they don’t find interesting are hopefully filled by others who are naturally wired differently. Of course, this plays out less cleanly in real life.