Category Archives: C++

WIP: Integers, safe arithmetic, and handling overflow (lab notes)

Some rough lab notes on how to do integer math with computers. Surprisingly deep topic! Usual caveat applies, this is just what I’ve learned so far and there may be inaccuracies. This is mainly from a C/C++ perspective, though there is a small discussion of Rust included.

It all starts with a simple prompt:

return_type computeBufferSize(integer_type headerLen, integer_type numElements, integer_type elementSize)
{
  // We conceptually want to do: headerLen + (numElements * elementSize)
}

How do we safely do something as simple as this arithmetic expression in C/C++? Furthermore, what type do we choose for integer_type and return_type?

Overflow

  • You can’t just do headerLen + (numElements * elementSize) literally, because that can overflow for valid input arguments
  • Why is overflow bad?
  • Reason 1: Logic errors. The result of the expression will wrap around, producing incorrect values. At best they will make the program behave incorrectly, at medium they will crash the program, and at worst, they will be security vulnerabilities.
  • Reason 2: Undefined behavior. This only applies to overflow on signed types. Signed overflow is UB, which means the fundamental soundness of the program is compromised, and it can behave unpredictably at runtime. Apparently this allows the compiler to perform useful optimizations, but can also cause critical miscompiles. (e.g. if guards later being compiled out of the program, causing security issues)
  • (Note: Apparently there are compiler flags to force signed overflow to have defined behavior. -fno-strict-overflow and -fwrapv)

How to handle overflow

  • Ok, overflow is bad. What does one do about it?
  • Even for the simplest operations of multiplication and addition, you need to be careful of how large the input operands can be, and whether overflow is possible.
  • For application code, it might often be the case that numbers are so small that this is not a concern, but if you’re dealing with untrusted input, buffer sizes, or memory offsets, then you need to be more careful, as these numbers can be larger.
  • In situations where you need to defend against overflow, you cannot just inline the whole arithmetic expression the natural way.
  • You need to evaluate the expression operation by operation, at each step checking if overflow would occur, and failing the function if so.
  • For example, here you’d first evaluate the multiplication in a checked fashion. And then if that succeeds, do the arithmetic.
  • For addition a + b, the typical check for size_t looks like: bool willOverflow = (a > SIZE_MAX - b);
  • For multiplication a * b, it looks like bool willOverflow = (b != 0 && a > SIZE_MAX / b);
  • For signed ints, the checks are more involved. This is a reason to prefer unsigned types in situations where overflow checking is needed.

Helper libraries

Manually adding inline overflow checking code everywhere is tedious, error prone, and harms readability. It would be nice if one could just write the arithmetic and spend less time worrying about the subtleties of overflow.

GCC and Clang offer builtins for doing checked arithmetic. They also have the benefit of checking signed integer overflow without UB.

In practice, it seems like some industrial C/C++ projects use helper utilities to make overflow-safe programming more ergonomic.

When to think about overflow

So when coding, when do you actually need to care about this?

You do not literally have to use checked-arithmetic for every single addition or multiplication in the program. Checked arithmetic clutters the code and can impact runtime performance, so you should only use it when it’s relevant.

It’s relevant when coding on the “boundary”. Arithmetic on untrusted or weakly trusted input data, e.g. from the network, filesystem, hardware, etc. Or where you’re implementing an API to be called by external users.

But for many normal, everyday math operations in your code, you probably don’t need it. (Unless you are particularly dealing with numbers so large that overflow is a possibility. So a certain amount of vigilance is needed.)

Rust

  • Rust has significant differences from C/C++ for its overflow behavior, which make it much less error prone.
  • Difference 1 – All overflow is defined, unlike signed overflow being UB in C/C++. This prevents critical miscompiles if signed overflow happens.
  • What about logic bugs where numbers wrap around and produce dangerously large values that are mishandled down the line?
  • Rust sometimes protects against these.
  • Rust protects against these in debug builds by injecting overflow checks, which panic if this happens. This impacts performance, but debug builds are slow anyway, and this helps find bugs faster in development.
  • In release builds, wrap-around logic bugs are still possible. Rust prevents undefined behavior and memory unsafety, but incorrect arithmetic can still cause panics, allocation failures, or incorrect program behavior.
  • Difference 2 – Rust has first-class support for checked arithmetic operations in the form of “methods” one can call on an integer. These are a clean and ergonomic way to safely handle overflow if that is relevant for code.

Choosing an integer type

There are a lot of int types. Which do you choose?

  • int
  • unsigned int
  • size_t
  • ssize_t
  • uint32_t, uint64_t
  • off_t
  • ptrdiff_t
  • uintptr_t

Here’s my current understanding.

First, split off the specific-purpose types:

  • off_t: (Signed) For file offsets
  • ptrdiff_t: (Signed) for pointer differences
  • uintptr_t: (Unsigned) For representing a pointer as an integer, in order to do math on it

Next, split off the explicit sized types: uint32_t, uint64_t, etc

These are mainly used when interfacing with external data: the network, filesystem, hardware, etc. (From C/C++! In Rust, it’s idiomatic to use explicit sized types in more situations).

That mostly leaves int vs size_t.

This is hotly debated. Here’s my rule of thumb:

  • For application code, dealing with “small” integers, counts: Use int. This lets you assert for non-negative. The downside is signed int overflow is UB, and thus dangerous, but the idea is that you should not be anywhere near overflowing since the integers are small. This is approximately the position of the Google Style C++ guide.
  • For “systems” code dealing with buffer sizes, offsets, raw memory: Use size_t. int is often too small for cases like this. size_t is “safer” in that overflow is defined. This is approximately the position of the Chromium C++ style guide.

In the above code that’s computing the buffer size, there’s a few reasons to avoid int.

First of all, int might be too small, depending on the usage case (e.g. generic libraries).

Secondly, it’s risky to use int because it’s easier to invoke dangerous UB. Since int overflow is UB, you risk triggering miscompiles (e.g. if guards being compiled out) if you accidentally overflow. This has caused many subtle security bugs.

In general, reasoning about overflow is more complicated with int than with size_t, since size_t has fully defined behavior.

However there are downsides of size_t. Notably, you can’t assert it as non-negative.

What about unsigned int?

Rule of thumb: Don’t use this. It’s a worse size_t because it’s smaller/undefined width, and a worse int because it can’t be asserted for non-negative.

What about ssize_t?

This is a size_t that can also contain negative values for error codes.

It seems somewhat useful, but is also error-prone to use, as it must always be checked for negativity before use as a size_t, otherwise you’ll have the classic issue where it produces a dangerously large unsigned variable.

In general, it seems like one should prefer explicit distinction between error state and integer state (e.g. bool return, and size_t out parameter, or std::optional<size_t>) over this.

Which types to choose for the example

The input arguments should be size_t, since we’re dealing with buffers, offsets, and raw memory.

The return value is a bit of a trick question. Any function like this must be able to return either a value, or an error if overflow happened.

The core return value should be size_t. But then how to express error in case of overflow?

std::optional<size_t> is a good choice if using C++.

ssize_t is an option, if using C.

But changing the signature to return a bool, and then having an out size_t parameter might be even better, and is the pattern used by the compiler builtins.

Related:

Pure GNU Make is (unfortunately) not enough

I’d love to simply use GNU Make for my C/C++ projects because it’s so simple to get started with. Unfortunately it’s lacking a few essential qualify of life features:

  • Out of tree builds
  • Automatic header dependency detection
  • Recompile on CFLAGS change
  • First class build targets

Out of tree builds

If you want your build to happen in an isolated build directory (as opposed to creating object files in your source tree), you need to implement this yourself.

It involves a lot of juggling paths. Not fun!

Automatic header dependency detection

In C/C++, when a header file changes, you must recompile all translation units (i.e. object files, roughly) that depend on (i.e. include) that header. If you don’t, the object file will become stale and none of your changes to constants, defines, or struct definitions (for example) will be picked up.

In Make rules, you typically express dependencies between source files, and object files, e.g:

%.o: %.c
  # run compiler command here

This will recompile the object file when the source file changes, but won’t recompile when any headers included by that source file change. So it’s not good enough out of the box.

To fix this, you need to manually implement this by:

  1. Passing the proper flags to the compiler to cause it to emit header dependency information. (Something like -MMD. I don’t know them exactly because that’s my whole point =)
  2. Instructing the build to include that generate dependency info (Something like
    -include $(OBJECTS:.o=.d)
    )

The generated dependency info looks like this:

pmap.c.obj: \
 kern/pmap.c \
 inc/x86.h \
 inc/types.h \
 inc/mmu.h \
 inc/error.h \
 inc/string.h \

Recompile on CFLAGS change

In addition to recompiling if headers change, you also want to recompile if any of your compiler, linker, or other dev tool flags change.

Make doesn’t provide this out of the box, you’ll also have to implement this yourself.

This is somewhat nontrivial. For an example, check out how the JOS OS (from MIT 6.828 (2018)) does it: https://github.com/offlinemark/jos/blob/1d95b3e576dd5f84b739fa3df773ae569fa2f018/kern/Makefrag#L48

First class build targets

In general, it’s nice to have build targets as a first class concept. They express source files compiled by the module, and include paths to reach headers. Targets can depend on each other and seamlessly access the headers of another target (the build system makes sure all -I flags are passed correctly).

This is also something you’d have to implement yourself, and there are probably limitations to how well your can do it in pure Make.


Make definitely has it’s place for certain tasks (easily execute commonly used command lines), but I find it hard to justify using it for anything non-trivially sized compared to more modern alternatives like CMake, Meson, Bazel, etc.

That said, large projects like Linux use it, so somehow they must make it work!

WIP: What’s the deal with memory ordering? (seq_cst, acquire, release, etc)

(This is a high level summary of my current knowledge, primarily to help me crystallize the knowledge. It comes entirely from from Jeff Preshing’s blog (see end of post) and youtube talk. This is not intended to be a comprehensive overview; for that, please see the aforementioned materials. I am very much a non-expert on this topic; please treat everything with skepticism.)

When programming with atomics, how are you suppose to know which of the ~four memory orderings to use? For example, the main ones (C++ terminology) are:

  • memory_order_seq_cst
  • memory_order_acquire
  • memory_order_release
  • memory_order_relaxed
  • (and a few other niche ones: acq_rel, consume)

First, as Jeff Preshing states, there is a distinction between “sequentially consistent” atomics and “low level” atomics. He describes it as two libraries for atomics masquerading as a single one within the C++ standard library.

The first, “sequentially consistent”, can be considered a higher level way of using atomics. You can safely use seq_cst everywhere. You get simpler semantics and higher likelihood of correctness, just at the expensive of performance. As an optimization, you can then port the code to the second form of “low level atomics”. This is where you must choose the explicit memory orderings.

But why do sequentially consistent atomics come with a performance hit?

The performance hit comes from cross core communication. The sequentially consistent memory model offers a very strong guarantee to the programmer; in addition to the ordering of atomic operations being consistent across cores (which is always the case), the ordering of non-atomic operations is also guaranteed to be consistent (i.e. no reordering) relative to the atomic ones.

This is relevant because programming with atomics often involves “guard” (atomic) variables who regulate access to “normal” (non-atomic) data that is transferred between threads. This guarantee requires extra effort from the memory subsystem of the CPU in the form of cross core communication as the cores need to effectively synchronize their caches.

When one moves to “low level” atomics, the strict constraints required of the memory subsystem are relaxed. Not all orderings of non-atomic accesses relative to atomic accesses must be maintained. The consequence is less cross-core coordination is required. This can be exploited for higher performance in specific scenarios where the strict ordering constraint is not required in both (or any) directions (i.e. non-atomic memory accesses are allowed to move before or after the atomic access).

Exercise: Would one expect to see a performance improvement from porting code from sequentially consistent atomics to low level atomics, if the code is run on a single core system?

The whole point of low level atomics is to optimize performance by relaxing constraints and reducing cross core communication, so no. There is no cross core communication in a single core system, so there is nothing substantial to optimize.

(I am not 100% sure of this answer. This is the current state of my knowledge and I would appreciate being corrected or affirmed either way!)

So how does one choose between all those memory orderings?

With my non-expert understanding, I believe there are some simple rules that make the decision much easier than it might seem.

First off: Decide whether you’re using sequentially consistent or low level atomics. If the former, you use seq_cst everywhere (this is even the default with C++ if you don’t specify anything).

If you want to optimize to use low level atomics, then for most cases, you then only have three choices: acquire, release, and relaxed. (seq_cst is no longer an option; acq_rel is more niche; consume is actively discouraged). Then:

  • If you’re deciding for a load operation, you then only choose between acquire and relaxed. Loads are never release.
  • And vice verse, If you’re deciding for a store operation, you then only choose between release and relaxed. Stores are never acquire.

This narrows it down to two choices. To determine whether it’s acquire/release or relaxed, determine whether the load/store has a synchronizes-with relation to a corresponding store/load. If there is one, you want acquire/release. Otherwise, choose relaxed.

Read these blog posts for a fuller answer to this:

Links:

https://www.youtube.com/watch?v=X1T3IQ4N-3g

4 basic multithreading & lockfree programming exercises

Four exercises that touch basic multithreaded and lockfree programming concepts.

  1. Implement a program that attempts to use two threads to increment a global counter to 10,000 with each thread incrementing 5000. But make it buggy so that there are interleaving problems and the end result of the counter is less than 10,000.
  2. Fix the above with atomics.
  3. Implement a variant of the program: instead of simply incrementing the counter, make the counter wrap every 16 increments (as if incrementing through indices of an array of length 16). Make two threads each attempt to increment the counter (16 * 5000) times. The end state should have the counter be back at index zero. Implement it in a buggy naive way that causes the counter to often be nonzero, even if atomics are used.
  4. Fix the above using a CAS loop.
  5. (Bonus question for the above: Why isn’t std::atomic::compare_exchange_strong a good fit here?)
Continue reading

Mutexes, atomics, lockfree programming

Some rough lab notes on these topics to record the current state of my knowledge. I’m not an expert, so there may be inaccuracies.

Mutexes

  • On Linux, libpthread mutexes are implemented using the underlying futex syscall
  • They are basically a combination of a spinlock (in userspace), backed by the kernel for wait/signal operations only when absolutely necessary (i.e. when there’s contention). In the common case of an uncontended lock acquire, there is no context switch which improves performance
  • The userspace spinlock portion uses atomics as spinlocks usually do, specifically because the compare and set must be atomic
  • Jeff Preshing (see below) writes that each OS/platform has an analogous concept to this kind of “lightweight” mutex โ€” Windows and macOS have them too
  • Before futex(2), other syscalls were used for blocking. One option might have been the semaphore API, but commit 56c910668cff9131a365b98e9a91e636aace337a in glibc is before futex, and it seems like they actually use signals. (pthread_mutex_lock -> __pthread_lock (still has spinlock elements, despite being before futex) -> suspend() -> __pthread_suspend -> __pthread_wait_for_restart_signal -> sigsuspend)
  • A primary advantage of futex over previous implementations is that futexes only require kernel resources when there’s contention
  • Like atomics, mutexes implementations include memory barriers (maybe even implicitly due to atomics) to prevent loads/stores from inappropriately crossing the lock/unlock boundary due to compiler and/or hardware instruction reordering optimizations
Continue reading

Runtime polymorphism cheat sheet

While C++ is used here as an example, the concepts apply to any statically typed programming language that supports polymorphism.

For example, while Rust doesn’t have virtual functions and inheritance, it’s traits/dyn/Boxes are conceptually equivalent and. Rust enums are conceptually equivalent to std::variant as a closed set runtime polymorphism feature.

Virtual Functions/Inheritancestd::variant
Runtime PolymorphismYes – dynamic dispatch via vtableYes – dynamic dispatch via internal union tag (discriminant) and compile-time generated function pointer table
SemanticsReference – clients must operate using pointer or referenceValue – clients use value type
Open/Closed?Open – Can add new types without recompiling (even via DLL). Clients do not need to be adjusted.Closed – Must explicitly specify the types in the variant. Generally clients/dispatchers may need to be adjusted.
CodegenClient virtual call + virtual methodsClient function table dispatch based on union tag + copy of callable for each type in the dispatch. If doing generic dispatch (virtual function style), then also need the functions in each struct. Inlining possible.
Class definition boilerplateClass/pure virtual methods boilerplate.Almost none.
Client callsite boilerplateAlmost nonestd::visit() boilerplate can be onerous.
Must handle all cases in dispatch?No support โ€” the best you can do is an error-prone chain of dynamic_cast<>. If you need this, virtual functions are not the best tool.Yes, can support this.

Overall, virtual functions and std::variant are similar, though not completely interchangeable features. Both allow runtime polymorphism, however each has different strengths.

Virtual functions excels when the interface/concept for the objects is highly uniform and the focus is around code/methods; this allows callsites to be generic and avoid manual type checking of objects. Adding a const data member to the public virtual interface is awkward and must go through a virtual call.

std::variant excels when the alternative types are highly heterogenous, containing different data members, and the focus is on data. The dispatch/matching allows one to safely and maintainably handle the different cases, and be forced to update when a new alternative type is added. Accessing data members is much more ergonomic than for virtual functions, but the opposite is true for generic function dispatch across all alternative types, because the std::visit() is not ergonomic.

Building on these low level primitives, one can build:

  • Component pattern (using virtual functions typically) (value semantics technique; moves away from static typing and towards runtime typing)
  • Type erase pattern (also virtual functions internally) (value semantics wrapper over virtual functions)

Fun facts:

  • Rust also has exactly these, but just with different names and syntax. The ergonomics and implementation are different, but the concepts are the same. Rust uses fat pointers instead of normal pointer pointing to a vtable. Rust’s match syntax is more ergonomic for the variant-equivalent. Rust uses fat pointers apparently because it allows “attaching a vtable to an object whose memory layout you cannot control” which is apparently required due to Rust Traits. (source)
  • Go uses type erasure internally, but offers this as a first class language feature.

Case study: Component pattern

The component pattern is a typical API layer alternative to classical virtual functions. With classical runtime polymorphism via virtual functions, the virtual functions and inheritance are directly exposed to the client โ€” the client must use reference semantics and does direct invocation of virtual calls.

With the component pattern, virtual functions are removed from the API layer. Clients use value semantics and then “look up” a component for behavior that would have previously been inherited.

API classes, instead of inheriting, contain a container of Components, who are themselves runtime polymorphic objects of heterogenous types. The components can classically use virtual functions for this, inheriting from some parent class. Then the API class contains a container of pointers to the parent class. API clients look up the component they are interested in via its type, and the API class implements a lookup method that iterates the components and identifies the right one using dynamic_cast or similar.

However, variants offer another way to implement this. Rather than having all components inherit from the superclass, they can be separate classes that are included in a variant. The API class then has a container of this variant type. In the lookup method, instead of using dynamic_cast, it uses std::holds_alternative which is conceptually equal.

This is a somewhat unusual application of runtime polymorphism and neither implementation method stands out as strictly better. Since components do not share a common interface really (they would just inherit so they can be stored heterogenously in a container), virtual functions does not offer a strong benefit. But also since the component objects are never dispatched on (they are always explicitly looked up by type), the variant method also does not offer a strong benefit.

The main difference in this scenario is the core difference between virtual functions and variants: whether the set of “child” types is open or closed. With virtual functions being open, it offers the advantage that new components can be added by simply inheriting from the parent class and no existing code needs to be touched. Potentially new components could even be loaded dynamically and this would work.

With variants, when new components are added, the core definition of the component variant needs to be adjusted to include the new type. No dynamic loading is supported.

So it appears that virtual functions have slight advantage here.

See: https://gameprogrammingpatterns.com/component.html

Q: What about std::any?

std::any is loosely similar to virtual functions or std::variant in that it implements type erasure, allowing a set of heterogenous objects of different types, to be referenced using a single type. Virtual functions and std::variant aren’t typically called “type erasure” as far as I’ve heard, but this is effectively what they do.

However that’s where the similarities end. std::any represents type erasure, but not any kind of object polymorphism. With std::any, there is no notion of a common interface that can be exercised across a variety of types. In fact, there is basically nothing you can do with a std::any but store it and copy it. In order to extract the internally stored object, it must be queried using its type (via std::any_cast()) which tends to defeat the purpose of polymorphism.

std::any is exclusively designed to replace instances where you might have previously used a void * in C code, offering improved type safety and possibly efficiency. 1 The classic use case is implementing a library that allows clients to pass in some context object that will later be passed to callbacks supplied by the client.

For this use case, the library must be able to store and retrieve the user’s context object. It’s it. It literally never will interpret the object or access it in any other way. This is why std::any fits here.

Another use case for std::any might be the component pattern in C++, where objects store a list of components, which are then explicitly queried for by client code. In this case, the framework also never deals directly with the components, but simply stores and exposes the to clients on request.

More: https://devblogs.microsoft.com/cppblog/stdany-how-when-and-why

How to enable colored compiler output with CMake + VSCode

Assuming you’re using the “CMake Tools” VSCode extension, here’s what works for me.

1 – Set CMAKE_COLOR_DIAGNOSTICS to ON in your environment

This makes CMake pass -fcolor-diagnostics to clang. If you build on the command line, you’ll now have color. But the VSCode “output” pane will still be non-colored.

2 – Install the “Output Colorizer” extension from IBM.

This adds color to the Output pane.

It looks like this:

Links:

https://github.com/ninja-build/ninja/issues/174

https://github.com/microsoft/vscode-cmake-tools/issues/478

Can we have non-atomic std::shared_ptr in C++?

Turns out, we do have them. libstdc++ contains a heuristic kludge where it checks if libpthread is linked into the process and conditionally executes an atomic add or non-atomic add depending on the result. Post 2020, it doesn’t directly use the libpthread hack, but uses glibc’s native support for this.

These kinds of things always make me sad about C++ (especially compared to how nice and clean things seem to be in Rust world), but at least it’s nice knowing that the committee did consider it and there are good reasons why to not have this in the C++ stdlib. The most compelling reason is that in the absence of a borrow checker, it is all too easy to silently use a non-atomic shared_ptr in multithreaded code.

Full thread:

Tips for writing LLDB pretty printers

LLDB supports custom scripts (“variable formatters”) to pretty print C++ data structures. For example, std::vector is typically implemented as a struct with three pointers: begin, end, and capacity. But if you wanted to print out a std::vector variable during a debugging session, printing out these three pointers isn’t likely to be helpful. What you actually want is to print the contents of the vector. Pretty printer scripts allow for doing this for your own data structures.2

Here are a few tips to supplement the official documentation. Sudara also has a great post on the topic.

Continue reading