Category Archives: Assembly

TIL: Debugging microcontrollers may require hardware breakpoints

I was debugging a microcontroller recently and was surprised to see that I was limited to 4 breakpoints. That surprised me because I’m usually only used to seeing that kind of limitation when using watchpoints, not breakpoints. (See my youtube video for more on this).

But it makes sense after all — code running on this microcontroller is different than in a process in an OS’s userspace because the code will probably be flashed into some kind of ROM (Read Only Memory) (or EEPROM). Since the code is unwritable at a hardware level, the typical method of inserting software breakpoint instructions isn’t possible. So the alternative is to use the microprocessor’s support for hardware debugging, which occupies hardware resources and is thus finite.

TIL: ARM64 doesn’t include conditional instructions

A major difference between x86 and ARM32 is that while x86 generally1 only offers conditional execution of branch instructions (e.g. BNE) ARM32 offers conditional execution for many more instructions (e.g. ALU instructions, like ADD.EQ- add if not equal). The idea was that these can be used to avoid emitting traditional branching instruction sequences which may suffer from pipeline stalls when a branch is mispredicted — instead, straight line code with equivalent semantics can be emitted.

This was actually removed in ARM64. The official quote:

The A64 instruction set does not include the concept of predicated or conditional execution. Benchmarking shows that modern branch predictors work well enough that predicated execution of instructions does not offer sufficient benefit to justify its significant use of opcode space, and its implementation cost in advanced implementations.

Turns out that the whole pipeline stall problem is generally not a huge issue anymore as branch prediction has gotten so good, while support for the feature still requires allocating valuable instructions bits to encode the conditions. Note that ARM64 still uses 32 bit instructions, so conserving bits is still useful.

What is very interesting is that Intel’s recent APX extensions to x86-64 (whose purpose is to primarily add more general purpose registers) moves closer to this conditional instructions direction.

The performance features introduced so far will have limited impact in workloads that suffer from a large number of conditional branch mispredictions. As out-of-order CPUs continue to become deeper and wider, the cost of mispredictions increasingly dominates performance of such workloads. Branch predictor improvements can mitigate this to a limited extent only as data-dependent branches are fundamentally hard to predict.

To address this growing performance issue, we significantly expand the conditional instruction set of x86, which was first introduced with the Intel® Pentium® Pro in the form of CMOV/SET instructions. These instructions are used quite extensively by today’s compilers, but they are too limited for broader use of if-conversion (a compiler optimization that replaces branches with conditional instructions).”

…including support for conditional loads and stores which is apparently tricky with modern out of order and superscalar architectures.

Syscall ABI compatibility: Linux vs Windows/macOS

The Linux kernel has an interesting difference compared to the Windows and macOS kernels: it offers syscall ABI compatibility.

This means that applications that program directly against the raw syscall interface are more or less guaranteed to always keep working, even with arbitrarily newer kernel versions. “Programming against the raw syscall interface” means including assembly code in your app that triggers syscalls:

  • setting the appropriate syscall number in the syscall register
  • setting arguments in the defined argument registers
  • executing a syscall instruction
  • reading the syscall return value register

Here are the ABIs for some common architectures.

Syscall Number RegisterSyscall ArgumentsSyscall Return Value
x86_64RAXRDI, RSI, RDX, R10, R8, R9RAX
Manticore is my go-to source to quickly look these up:

Once you’ve done this, now you’re relying on the kernel to not change any part of this. If the kernel changes any of these registers, or changes the syscall number mapping, your app will not longer trigger the desired syscall correctly and will break.

Aside from writing raw assembly in your app, there’s a more innocuous way of accidentally “programming directly against the syscall interface”: statically linking to libc. When you statically link to a library, that library’s code is directly included in your binary. libc is generally the system component responsible for implementing the assembly to trigger syscalls, and by statically linking to it, you effectively inline those assembly instructions directly into your application.

So why does Linux offer this and Windows and macOS don’t?

In general, compatibility is cumbersome. As a developer, if you can avoid having to maintain compatibility, it’s better. You have more freedom to change, improve, and refactor in the future. So by default it’s preferable to not maintain compatibility — including for kernel development.

Windows and macOS are able to not offer compatibility because they control the libc for their platforms and the rules for using it. And one of their rules is “you are not allowed to statically link libc”. For the exact reason that this would encourage apps that depend directly on the syscall ABI, hindering the kernel developers’ ability to freely change the kernel’s implementation.

If all app developers are forced to dynamically link against libc, then as long as kernel developers also update libc with the corresponding changes to the syscall ABI, everything works. Old apps run on a new kernel will dynamically link against the new libc, which properly implements the new ABI. Compatibility is of course still maintained at the app/libc level — just not at the libc/kernel level.

Linux doesn’t control the libc in the same way Windows and macOS do because in the Linux world, there is a distinct separation between kernel and userspace that isn’t present in commercial operating systems. This is rooted in the history of Linux, which was originally designed to target a userspace developed by a separate organization (GNU).

So strictly speaking Linux is just the kernel, and you’re free to run whatever userspace on top. Most people run GNU userspace components (glibc), but alternatives are not unheard of (musl libc, also bionic libc on Android).

So because Linux kernel developers can’t 100% control the libc that resides on the other end of the syscall interface, they bite the bullet and retain ABI compatibility. This technically allows you to statically link with more confidence than on other OSs. That said, there are other reasons why you shouldn’t statically link libc, even on Linux.


This directory documents the interfaces that the developer has
defined to be stable.  Userspace programs are free to use these
interfaces with no restrictions, and backward compatibility for
them will be guaranteed for at least 2 years.  Most interfaces
(like syscalls) are expected to never change and always be

kernel docs

What:		The kernel syscall interface
	This interface matches much of the POSIX interface and is based
	on it and other Unix based interfaces.  It will only be added to
	over time, and not have things removed from it.

	Note that this interface is different for every architecture
	that Linux supports.  Please see the architecture-specific
	documentation for details on the syscall numbers that are to be
	mapped to each syscall.

apple developer docs

Q:  I'm trying to link my binary statically, but it's failing to link because it can't find crt0.o. Why?
A: Before discussing this issue, it's important to be clear about terminology:

A static library is a library of code that can be linked into a binary that will, eventually, be dynamically linked to the system libraries and frameworks.
A statically linked binary is one that does not import system libraries and frameworks dynamically, but instead makes direct system calls into the kernel.
Apple fully supports static libraries; if you want to create one, just start with the appropriate Xcode project or target template.

Apple does not support statically linked binaries on Mac OS X. A statically linked binary assumes binary compatibility at the kernel system call interface, and we do not make any guarantees on that front. Rather, we strive to ensure binary compatibility in each dynamically linked system library and framework.

If your project absolutely must create a statically linked binary, you can get the Csu (C startup) module from Darwin and try building crt0.o for yourself. Obviously, we won't support such an endeavor.


  • Solaris also stopped supporting static linking against libc.

Reproducing a GCC 8.1 ABI compatibility bug

I was reading about GCC and noticed this very suspicious warning line about an accidental compatibility break:

I thought it would be interesting to reproduce this. I reproduced this specific scenario they outline and compiled two translations units, one with GCC 8.1, one with an earlier version (GCC 7) and observed the segfault that happens when two incompatible calling conventions interact with each other.

How setjmp and longjmp work (2016)

Pretty recently I learned about setjmp() and longjmp(). They’re a neat pair of libc functions which allow you to save your program’s current execution context and resume it at an arbitrary point in the future (with some caveats1). If you’re wondering why this is particularly useful, to quote the manpage, one of their main use cases is “…for dealing with errors and interrupts encountered in a low-level subroutine of a program.” These functions can be used for more sophisticated error handling than simple error code return values.

I was curious how these functions worked, so I decided to take a look at musl libc’s implementation for x86. First, I’ll explain their interfaces and show an example usage program. Next, since this post isn’t aimed at the assembly wizard, I’ll cover some basics of x86 and Linux calling convention to provide some required background knowledge. Lastly, I’ll walk through the source, line by line.

Continue reading

Beginner Crackme

As part of an Intro to Security course I’m taking, my professor gave us a crackme style exercise to practice reading x86 assembly and basic reverse engineering.

The program is pretty simple. It accepts a password as an argument and we’re told that if the password is correct, "ok" is printed.

$ ./crackme
usage: ./crackme <secret>
$ ./crackme test

As usual, I start by running file on the binary, which shows that it’s a standard x64 ELF binary. file also says that the binary is "not stripped", which means that it includes symbols. All I really know about symbols are that they can include debugging information about a binary like function and variable names and some symbols aren’t really necessary; they can be stripped out to reduce the binary’s size and make reverse engineering more challenging. Maybe I’ll do a more in depth post on this in the future.

$ file crackme
crackme: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=0x3fcf895b7865cb6be6b934640d1519a1e6bd6d39, not stripped

Next, I run strings, hoping to get lucky and find the password amongst the strings in the binary. Strings looks for series of printable characters followed by a NULL, but unfortunately nothing here works as the password.

$ strings crackme
usage: %s <secret>

Since that didn’t work, we’re forced to disassemble the binary and actually try to reverse engineer it. We’ll start with main.

$ gdb -batch -ex 'file crackme' -ex 'disas main'
Dump of assembler code for function main:
   0x00000000004004a0 <+0>:     sub    rsp,0x8
   0x00000000004004a4 <+4>:     cmp    edi,0x1
   0x00000000004004a7 <+7>:     jle    0x4004c7 <main+39>
   0x00000000004004a9 <+9>:     mov    rdi,QWORD PTR [rsi+0x8]
   0x00000000004004ad <+13>:    call   0x4005e0 <verify_secret>
   0x00000000004004b2 <+18>:    test   eax,eax
   0x00000000004004b4 <+20>:    je     0x4004c2 <main+34>
   0x00000000004004b6 <+22>:    mov    edi,0x4006e8
   0x00000000004004bb <+27>:    call   0x400450 <puts@plt>
   0x00000000004004c0 <+32>:    xor    eax,eax
   0x00000000004004c2 <+34>:    add    rsp,0x8
   0x00000000004004c6 <+38>:    ret
   0x00000000004004c7 <+39>:    mov    rsi,QWORD PTR [rsi]
   0x00000000004004ca <+42>:    mov    edi,0x4006d4
   0x00000000004004cf <+47>:    xor    eax,eax
   0x00000000004004d1 <+49>:    call   0x400460 <printf@plt>
   0x00000000004004d6 <+54>:    mov    eax,0x1
   0x00000000004004db <+59>:    jmp    0x4004c2 <main+34>
End of assembler dump.

Let’s break this down a little.

    0x00000000004004a0 <+0>:     sub    rsp,0x8
    0x00000000004004a4 <+4>:     cmp    edi,0x1
    0x00000000004004a7 <+7>:     jle    0x4004c7 <main+39>

Starting at the beginning, we see the stack pointer decremented as part of the function prologue. The prologue is a set of setup steps involving saving the old frame’s base pointer on the stack, reassigning the base pointer to the current stack pointer, then subtracting the stack pointer a certain amount to make room on the stack for local variables, etc. We don’t see the former two steps because this is the main function so it doesn’t really have a function calling it, so saving/setting the base pointer isn’t necessary.

Then the edi register is compared to 1 and if it is less than or equal, we jump to offset 39.

   0x00000000004004c2 <+34>:    add    rsp,0x8
   0x00000000004004c6 <+38>:    ret
   0x00000000004004c7 <+39>:    mov    rsi,QWORD PTR [rsi]
   0x00000000004004ca <+42>:    mov    edi,0x4006d4
   0x00000000004004cf <+47>:    xor    eax,eax
   0x00000000004004d1 <+49>:    call   0x400460 <printf@plt>
   0x00000000004004d6 <+54>:    mov    eax,0x1
   0x00000000004004db <+59>:    jmp    0x4004c2 <main+34>

Here at offset 39, we print something then jump to offset 34 where we repair the stack (undo the sub instruction from the prologue) and return (ending execution).

This is likely how the program checks the arguments and prints the usage message if no arguments are supplied (which would cause argc/edi to be 1).

However if we supply an argument, edi is 0x2 and we move past the jle instruction.

   0x00000000004004a9 <+9>:     mov    rdi,QWORD PTR [rsi+0x8]
   0x00000000004004ad <+13>:    call   0x4005e0 <verify_secret>

Here we can see the verify_secret function being called with a parameter in rdi. This is most likely the argument we passed into the program. We can confirm this with gdb (I’m using it with peda here).

gdb-peda$ tele $rsi
0000| 0x7fffffffeb48 --> 0x7fffffffed6e ("/home/vagrant/crackme/crackme")
0008| 0x7fffffffeb50 --> 0x7fffffffed8c --> 0x4548530074736574 ('test')
0016| 0x7fffffffeb58 --> 0x0

Indeed rsi points to the first element of argv, so incrementing that by 8 bytes (because 64 bit) points to argv[1], which is our input.

If we look after the verify_secret call we can see the program checks if eax is 0 and if it is, jumps to offset 34, ending the program. However, if eax is not zero, we’ll hit a puts call before exiting, which will presumably print out the "ok" message we want.

   0x00000000004004b2 <+18>:    test   eax,eax
   0x00000000004004b4 <+20>:    je     0x4004c2 <main+34>
   0x00000000004004b6 <+22>:    mov    edi,0x4006e8
   0x00000000004004bb <+27>:    call   0x400450 <puts@plt>
   0x00000000004004c0 <+32>:    xor    eax,eax
   0x00000000004004c2 <+34>:    add    rsp,0x8
   0x00000000004004c6 <+38>:    ret

Now lets disassemble verify_secret to see how the input validation is performed, and to see how we can make it return non-zero.

Dump of assembler code for function verify_secret:
   0x00000000004005e0 <+0>:     sub    rsp,0x408
   0x00000000004005e7 <+7>:     movzx  eax,BYTE PTR [rdi]
   0x00000000004005ea <+10>:    mov    rcx,rsp
   0x00000000004005ed <+13>:    test   al,al
   0x00000000004005ef <+15>:    je     0x400622 <verify_secret+66>
   0x00000000004005f1 <+17>:    mov    rdx,rsp
   0x00000000004005f4 <+20>:    jmp    0x400604 <verify_secret+36>
   0x00000000004005f6 <+22>:    nop    WORD PTR cs:[rax+rax*1+0x0]
   0x0000000000400600 <+32>:    test   al,al
   0x0000000000400602 <+34>:    je     0x400622 <verify_secret+66>
   0x0000000000400604 <+36>:    xor    eax,0xfffffff7
   0x0000000000400607 <+39>:    lea    rsi,[rsp+0x400]
   0x000000000040060f <+47>:    add    rdx,0x1
   0x0000000000400613 <+51>:    mov    BYTE PTR [rdx-0x1],al
   0x0000000000400616 <+54>:    add    rdi,0x1
   0x000000000040061a <+58>:    movzx  eax,BYTE PTR [rdi]
   0x000000000040061d <+61>:    cmp    rdx,rsi
   0x0000000000400620 <+64>:    jb     0x400600 <verify_secret+32>
   0x0000000000400622 <+66>:    mov    edx,0x18
   0x0000000000400627 <+71>:    mov    esi,0x600a80
   0x000000000040062c <+76>:    mov    rdi,rcx
   0x000000000040062f <+79>:    call   0x400480 <memcmp@plt>
   0x0000000000400634 <+84>:    test   eax,eax
   0x0000000000400636 <+86>:    sete   al
   0x0000000000400639 <+89>:    add    rsp,0x408
   0x0000000000400640 <+96>:    movzx  eax,al
   0x0000000000400643 <+99>:    ret
End of assembler dump.

I won’t walk through this one in detail because understanding each line isn’t necessary to crack this. Let’s skip to the memcmp call. If memcmp returns 0, eax is set to 1 and the function returns. This is exactly what we want. From the man page, memcmp takes three parameters, two buffers to compare and their lengths, and returns 0 if the buffers are identical.

   0x0000000000400622 <+66>:    mov    edx,0x18
   0x0000000000400627 <+71>:    mov    esi,0x600a80
   0x000000000040062c <+76>:    mov    rdi,rcx
   0x000000000040062f <+79>:    call   0x400480 <memcmp@plt>

Here’s the setup to the memcmp call. We can see the third parameter for length is the immediate 0x18 meaning the buffers will be 24 bytes in length. If we examine address 0x600a80, we find this 24 byte string:

gdb-peda$ hexd 0x600a80 /2
0x00600a80 : 91 bf a4 85 85 c3 ba b9 9f a6 b6 b1 93 b9 83 8f   ................
0x00600a90 : ae b1 ae c1 bc 80 ca ca 00 00 00 00 00 00 00 00   ................

Since this is a direct address to some memory, we can be fairly certain that we’ve found some sort of secret value! Based on the movzx eax,BYTE PTR [rdi] instruction (offset 7) which moves a byte from the input string into eax, the xor eax, 0xfffffff7 instruction (offset 36), and the add rdi, 0x1 instruction (offset 54) which increments the char* pointer to our input string, we can reasonably guess that this function is xor’ing each character of our input with 0xf7 and writing the result into a buffer which begins at rsp (also pointed to by rcx). Since we now know the secret (\x91\xbf\xa4\x85...) and the xor key (0xf7) it’s pretty easy to extract the password we need by xor’ing each byte of the secret with the xor key.

Here’s a way to do this with python.

{% highlight python %} str = ‘\x91\xbf\xa4\x85\x85\xc3\xba\xb9\x9f\xa6\xb6\xb1\x93\xb9\x83\x8f\xae\xb1\xae\xc1\xbc\x80\xca\xca’ ba = bytearray(str) for i, byte in enumerate(ba): ba[i] ^= 0xf7 print ba {% endhighlight %}

Which results in this:

$ python
$ ./crackme fHSrr4MNhQAFdNtxYFY6Kw==