Category Archives: _Lab Notes ๐Ÿงช

Rough notes, typically technical, usually bullet points about some topic.

x86 kernel development lab notes

Here’s what I know about x86 kernel development. The usual caveat applies for my lab notes: this is not considered a high quality document and there may be inaccuracies.


Main Processor Modes

  • Real Mode (16 bit)
    • CPU boots into this mode for backward compatibility
    • The IDT is instead the IVT here (Interrupt Vector Table)
    • Legacy BIOS booting begins here โ€” the BIOS loads the first sector of disk into memory at a fixed address and begins executing it in Real Mode.
  • Protected Mode (32 bit)
    • Segmentation is mandatory โ€” a minimal GDT is necessary.
    • Paging is not mandatory
  • Long Mode (64 bit)
    • Paging is mandatory

Segmentation

  • Originally solved the problem of CPUs having more physical memory than could be addressed with 16 bit registers. (Note, this is the opposite situation of what we have today where virtual address spaces are vastly greater than physical ones)
  • Introduces the concept of “segments”, which are variable length “windows” into a larger address space.

Data structures

GDT (Global Descriptor Table)

  • Contains “descriptors” to describe the memory segments (“windows”) available. Segment register contain effectively an index into this table.
  • There are “normal” descriptors which describe memory segments and “system” descriptors which point to more exotic things, like Task State Segments (TSS) or Local Descriptor Tables (LDT)
  • These days OSs use the GDT as little as possible, only as much as strictly necessary. On 32 bit, this looks like 4 entries that start at base 0x0, and cover the entire 32 bit space. 2 for kernel, 2 for user โ€” 1 for code, 1 for data for each. (?)
  • On 64 bit GDT is totally unused (I believe?), as are nearly all segment registers(?), except FS and GS. (Why are they special? There is even a special MSR for GS?)

LDT (Local Descriptor Table)

  • My understanding is LDTs are really no longer used by nearly any OS. Some parts of segmentation are still required by OSs, like the GDT, but LDT is not required and almost completely unused in modern OSs.
  • These would contain segments only accessible to a single task, unlike the regions in the GDT (?)

IDT (Interrupt Descriptor Table)

  • Interrupts: Generally externally triggered, i.e. from hardware devices
  • Exceptions: Internally generated, I.e. division by zero exception, or software breakpoint
  • When the processor receives an interrupt or exception, it handles that by executing code โ€” interrupt handler routines.
  • These routines are registered via the IDT โ€” an array of descriptors that describes how to handle a particular interrupt.
  • Interrupts/exceptions have numbers which directly map to entries in the IDT.
  • IDT descriptors are a polymorphic structure โ€” there are several kinds of entities: interrupt, trap, and task gates (maybe others – call gates?).
  • Interrupt/trap gates are nearly identical and differ only in their handling of the interrupt flag. They contain a pointer to code to execute. This is expressed via a segment/offset.
  • Task gates make use of HW task switching and offer a more “turnkey” solution for running code in a separate context when an interrupt happens โ€” but generally aren’t used for other reasons (?). Context switch is automatic?
  • Task gates in the IDT point to a TSS descriptor in the GDT, which points to a TSS (?)
  • Some interrupt/exceptions have an associated error code, some don’t.
  • Interrupt gates describe a minimal privilege level required to trigger then with an int instruction โ€” this prevents userspace from triggering arbitrary interrupts with int. If userspace tries to trigger an int without permission, that is converted in to a General Protection fault (#GP)

Hardware task switching

  • Although long considered obsolete in favor of software task/context switching, x86 provides significant facilities for modeling OS “tasks” at the hardware level, and including functionality for automatic context switching between them.
  • Hardware task switching may require copying much more machine state than is necessary. Software context switches can be optimized to copy less and be faster, which is one reason why they’re preferred.
  • Hardware task switching puts a fixed limit on the number of tasks in the system (?)

TSS (Task State Segment)

  • This is a large structure modeling a “task” (thread of execution)
  • Contains space for registers & execution context
  • Even if HW task switching is not used, one TSS is still needed as the single HW Task running on the system, which internally implements all software context switching
  • TSS is minimally used for stack switches when handling interrupts across privilege levels โ€” when switching from userspace to kernel during interrupt, kernel stack is taken from TSS
  • Linux task_struct is probably named with “task” due to being original created for i386
  • The Task Register (TR) contains a descriptor pointing to the current active HW task (?)

The JOS boot process

JOS is the OS used in MIT 6.828 (2018).

Bootloader

  • JOS includes a small BIOS bootloader in addition to the kernel
  • The bootloader begins with typical 16 bit Real Mode assembly to do the typical steps to initialize the CPU (Set A20 line, etc)
  • Transition to protected mode
  • Set the stack immediately at the start of the code, and transition to C
  • The kernel is loaded from disk using Port IO
  • Loaded into physical memory around the 1MB mark, which is generally considered a “safe” area to load into. (Below the 1MB mark has various regions where devices, BIOS data, or other “things” reside and it’s best to not clobber them.)
  • Call into the kernel entrypoint

Early kernel boot

  • Receive control from the bootloader in protected mode
  • Transition to paging
  • The kernel is linked to run in high memory, starting at 0xf000000 (KERNBASE)
  • The transition from segmentation to paging virtual memory happens in a few steps. There’s first an initial basic transition using set of minimal page tables.
  • After that transition is made, a basic memory allocator is set up, which is then used to allocate memory for the production page tables which implement the production virtual memory layout used for the rest of runtime.
  • The minimal page tables contain two mappings:
    • 1 – Identity map the first 4MB to itself
    • 2 – Map the 4MB region starting at KERNBASE also to the first 4MB
  • One page directory entry maps a 4MB region, so only two page directory entries are needed
  • These page tables are constructed statically at compile time
  • The first identity mapping is critical because without it the kernel would crash immediate after loading CR3, because the next instruction would be unmapped. The identity mapping allows the low mem addresses the kernel resides in to remain valid until the kernel can jump to high mem
  • The assembly there looks a bit strange because the jump appears redundant. But all the asm labels are linked using highmem addresses, so jumping to them transitions from executing in low mem, to executing in high mem, where the kernel will remain executing for the rest of its lifetime.
  • Set the stack pointer to a global data/BSS section of internal storage within the kernel and enter C code

Memory allocators

  • The goal is transition to a production virtual memory setup
  • This requires allocating memory for page tables
  • To build the dynamic page/frame allocator, we start with a basic bump allocator
  • It starts allocating simply from the end of the kernel in memory. We have access to a symbol for the end of the kernel via a linker script.
  • Kernel queries the physical memory size of the system and dynamically allocates data structures for the dynamic page/frame allocator. This is an array of structure that correspond to each available frame of physical memory. These structures have an embedded linked list pointer (intrusive linked list) and a refcount. They are linked together into a linked list, to implement a stack data structure where frames can be popped (when allocating) and pushed (when freeing).
  • Using this frame allocator, pages for the production page tables are allocated.

gardenOS Update 1

(This is a random collection of thoughts around my new operating systems project, “gardenOS”.)


What’s up with the “gardening” terminology?

This is a phrase that I feel perfectly describes the ethos of this project. In the same way people have gardens as calm places to express themselves, learn things, and have fun doing work, I want to create an analogous place for exploring my interest in operating systems.

Some key tenants of the approach:

  • No stress
  • Have fun
  • Pursue whatever interests you

I don’t have particularly strong OSdev skills at the moment, so I’m especially focusing on doing small, easy tasks, such as cleaning up the build system. I do this in the same way you might spend an afternoon picking weeds in your little garden. It’s easy, well understood, not too complicated, no large decisions to be made โ€” and it concretely improves the project. It’s a concrete win you can lock in, in a fixed amount of time, with fairly little work.


Disclaimer: I’m almost apprehensive to even give this project a name

I’m worried that even naming this project (“gardenOS”) will put too much pressure on me. I am deeply aware that OS’s take monumental amounts of time and energy to even get to basic states. And at my current rate (~2 hours a week), it’s unlikely we will get to even basic levels soon.

I wasn’t expecting things to get this philosophical this quickly. My focus above all, is to have fun, learn things, and make some small progress each week in the stream. Each week where I do a stream or do some work, any any of that happens, is a win.

I explicitly hold very few expectations around a future “goal” of the project or where I want it to end up. I just want to have fun and learn things about operating systems.

The loose vision I have is to create a minimal OS for play and experimentation. It should be a high quality codebase, and I should work on it as if I wanted to present my best self as an aspiring pro systems programmer.


The ethos of the project

Even though the project is in a maximally nascent stage, I already feel a certain ethos evolving. In the project, we emphasize:

  • Relaxed, casual, kind attitude
  • Learning mindset
  • Ambition to use programming best practices, and aspiring to become pro systems programmers
  • Portray an accurate “slice of life” of systems programming. Meaning: Show the day to day, which is not always thrilling tasks (like adding a syscall), but simply just working on the build system or refactoring some code.

My experience after doing four streams

I’ve done four streams so far (2 live, 2 recorded). My big worries are that I’ll run out of stuff to do and have to scramble to find something to do, live.

This has never happened โ€” I’m always surprised at the adventures “we” find ourselves going on


The use of “we”

The project is just me at the moment, but there’s something about using “we” to describe it that makes me feel good.

It’s not just to make myself seem more impressive, or feel less alone. When I use it, I think of the handful of enthusiastic people that have joined me in the live streams, or left comments with questions or encouragement.

Even in these earliest of days, there is some tiny, microscopic community feel forming. So when I say “we”, I speak for this community.


WIP: What’s the deal with memory ordering? (seq_cst, acquire, release, etc)

(This is a high level summary of my current knowledge, primarily to help me crystallize the knowledge. It comes entirely from from Jeff Preshing’s blog (see end of post) and youtube talk. This is not intended to be a comprehensive overview; for that, please see the aforementioned materials. I am very much a non-expert on this topic; please treat everything with skepticism.)

When programming with atomics, how are you suppose to know which of the ~four memory orderings to use? For example, the main ones (C++ terminology) are:

  • memory_order_seq_cst
  • memory_order_acquire
  • memory_order_release
  • memory_order_relaxed
  • (and a few other niche ones: acq_rel, consume)

First, as Jeff Preshing states, there is a distinction between “sequentially consistent” atomics and “low level” atomics. He describes it as two libraries for atomics masquerading as a single one within the C++ standard library.

The first, “sequentially consistent”, can be considered a higher level way of using atomics. You can safely use seq_cst everywhere. You get simpler semantics and higher likelihood of correctness, just at the expensive of performance. As an optimization, you can then port the code to the second form of “low level atomics”. This is where you must choose the explicit memory orderings.

But why do sequentially consistent atomics come with a performance hit?

The performance hit comes from cross core communication. The sequentially consistent memory model offers a very strong guarantee to the programmer; in addition to the ordering of atomic operations being consistent across cores (which is always the case), the ordering of non-atomic operations is also guaranteed to be consistent (i.e. no reordering) relative to the atomic ones.

This is relevant because programming with atomics often involves “guard” (atomic) variables who regulate access to “normal” (non-atomic) data that is transferred between threads. This guarantee requires extra effort from the memory subsystem of the CPU in the form of cross core communication as the cores need to effectively synchronize their caches.

When one moves to “low level” atomics, the strict constraints required of the memory subsystem are relaxed. Not all orderings of non-atomic accesses relative to atomic accesses must be maintained. The consequence is less cross-core coordination is required. This can be exploited for higher performance in specific scenarios where the strict ordering constraint is not required in both (or any) directions (i.e. non-atomic memory accesses are allowed to move before or after the atomic access).

Exercise: Would one expect to see a performance improvement from porting code from sequentially consistent atomics to low level atomics, if the code is run on a single core system?

The whole point of low level atomics is to optimize performance by relaxing constraints and reducing cross core communication, so no. There is no cross core communication in a single core system, so there is nothing substantial to optimize.

(I am not 100% sure of this answer. This is the current state of my knowledge and I would appreciate being corrected or affirmed either way!)

So how does one choose between all those memory orderings?

With my non-expert understanding, I believe there are some simple rules that make the decision much easier than it might seem.

First off: Decide whether you’re using sequentially consistent or low level atomics. If the former, you use seq_cst everywhere (this is even the default with C++ if you don’t specify anything).

If you want to optimize to use low level atomics, then for most cases, you then only have three choices: acquire, release, and relaxed. (seq_cst is no longer an option; acq_rel is more niche; consume is actively discouraged). Then:

  • If you’re deciding for a load operation, you then only choose between acquire and relaxed. Loads are never release.
  • And vice verse, If you’re deciding for a store operation, you then only choose between release and relaxed. Stores are never acquire.

This narrows it down to two choices. To determine whether it’s acquire/release or relaxed, determine whether the load/store has a synchronizes-with relation to a corresponding store/load. If there is one, you want acquire/release. Otherwise, choose relaxed.

Read these blog posts for a fuller answer to this:

Links:

https://www.youtube.com/watch?v=X1T3IQ4N-3g

Mutexes, atomics, lockfree programming

Some rough lab notes on these topics to record the current state of my knowledge. I’m not an expert, so there may be inaccuracies.

Mutexes

  • On Linux, libpthread mutexes are implemented using the underlying futex syscall
  • They are basically a combination of a spinlock (in userspace), backed by the kernel for wait/signal operations only when absolutely necessary (i.e. when there’s contention). In the common case of an uncontended lock acquire, there is no context switch which improves performance
  • The userspace spinlock portion uses atomics as spinlocks usually do, specifically because the compare and set must be atomic
  • Jeff Preshing (see below) writes that each OS/platform has an analogous concept to this kind of “lightweight” mutex โ€” Windows and macOS have them too
  • Before futex(2), other syscalls were used for blocking. One option might have been the semaphore API, but commit 56c910668cff9131a365b98e9a91e636aace337a in glibc is before futex, and it seems like they actually use signals. (pthread_mutex_lock -> __pthread_lock (still has spinlock elements, despite being before futex) -> suspend() -> __pthread_suspend -> __pthread_wait_for_restart_signal -> sigsuspend)
  • A primary advantage of futex over previous implementations is that futexes only require kernel resources when there’s contention
  • Like atomics, mutexes implementations include memory barriers (maybe even implicitly due to atomics) to prevent loads/stores from inappropriately crossing the lock/unlock boundary due to compiler and/or hardware instruction reordering optimizations
Continue reading

Runtime polymorphism cheat sheet

While C++ is used here as an example, the concepts apply to any statically typed programming language that supports polymorphism.

For example, while Rust doesn’t have virtual functions and inheritance, it’s traits/dyn/Boxes are conceptually equivalent and. Rust enums are conceptually equivalent to std::variant as a closed set runtime polymorphism feature.

Virtual Functions/Inheritancestd::variant
Runtime PolymorphismYes – dynamic dispatch via vtableYes – dynamic dispatch via internal union tag (discriminant) and compile-time generated function pointer table
SemanticsReference – clients must operate using pointer or referenceValue – clients use value type
Open/Closed?Open – Can add new types without recompiling (even via DLL). Clients do not need to be adjusted.Closed – Must explicitly specify the types in the variant. Generally clients/dispatchers may need to be adjusted.
CodegenClient virtual call + virtual methodsClient function table dispatch based on union tag + copy of callable for each type in the dispatch. If doing generic dispatch (virtual function style), then also need the functions in each struct. Inlining possible.
Class definition boilerplateClass/pure virtual methods boilerplate.Almost none.
Client callsite boilerplateAlmost nonestd::visit() boilerplate can be onerous.
Must handle all cases in dispatch?No support โ€” the best you can do is an error-prone chain of dynamic_cast<>. If you need this, virtual functions are not the best tool.Yes, can support this.

Overall, virtual functions and std::variant are similar, though not completely interchangeable features. Both allow runtime polymorphism, however each has different strengths.

Virtual functions excels when the interface/concept for the objects is highly uniform and the focus is around code/methods; this allows callsites to be generic and avoid manual type checking of objects. Adding a const data member to the public virtual interface is awkward and must go through a virtual call.

std::variant excels when the alternative types are highly heterogenous, containing different data members, and the focus is on data. The dispatch/matching allows one to safely and maintainably handle the different cases, and be forced to update when a new alternative type is added. Accessing data members is much more ergonomic than for virtual functions, but the opposite is true for generic function dispatch across all alternative types, because the std::visit() is not ergonomic.

Building on these low level primitives, one can build:

  • Component pattern (using virtual functions typically) (value semantics technique; moves away from static typing and towards runtime typing)
  • Type erase pattern (also virtual functions internally) (value semantics wrapper over virtual functions)

Fun facts:

  • Rust also has exactly these, but just with different names and syntax. The ergonomics and implementation are different, but the concepts are the same. Rust uses fat pointers instead of normal pointer pointing to a vtable. Rust’s match syntax is more ergonomic for the variant-equivalent. Rust uses fat pointers apparently because it allows “attaching a vtable to an object whose memory layout you cannot control” which is apparently required due to Rust Traits. (source)
  • Go uses type erasure internally, but offers this as a first class language feature.

Case study: Component pattern

The component pattern is a typical API layer alternative to classical virtual functions. With classical runtime polymorphism via virtual functions, the virtual functions and inheritance are directly exposed to the client โ€” the client must use reference semantics and does direct invocation of virtual calls.

With the component pattern, virtual functions are removed from the API layer. Clients use value semantics and then “look up” a component for behavior that would have previously been inherited.

API classes, instead of inheriting, contain a container of Components, who are themselves runtime polymorphic objects of heterogenous types. The components can classically use virtual functions for this, inheriting from some parent class. Then the API class contains a container of pointers to the parent class. API clients look up the component they are interested in via its type, and the API class implements a lookup method that iterates the components and identifies the right one using dynamic_cast or similar.

However, variants offer another way to implement this. Rather than having all components inherit from the superclass, they can be separate classes that are included in a variant. The API class then has a container of this variant type. In the lookup method, instead of using dynamic_cast, it uses std::holds_alternative which is conceptually equal.

This is a somewhat unusual application of runtime polymorphism and neither implementation method stands out as strictly better. Since components do not share a common interface really (they would just inherit so they can be stored heterogenously in a container), virtual functions does not offer a strong benefit. But also since the component objects are never dispatched on (they are always explicitly looked up by type), the variant method also does not offer a strong benefit.

The main difference in this scenario is the core difference between virtual functions and variants: whether the set of “child” types is open or closed. With virtual functions being open, it offers the advantage that new components can be added by simply inheriting from the parent class and no existing code needs to be touched. Potentially new components could even be loaded dynamically and this would work.

With variants, when new components are added, the core definition of the component variant needs to be adjusted to include the new type. No dynamic loading is supported.

So it appears that virtual functions have slight advantage here.

See: https://gameprogrammingpatterns.com/component.html

Q: What about std::any?

std::any is loosely similar to virtual functions or std::variant in that it implements type erasure, allowing a set of heterogenous objects of different types, to be referenced using a single type. Virtual functions and std::variant aren’t typically called “type erasure” as far as I’ve heard, but this is effectively what they do.

However that’s where the similarities end. std::any represents type erasure, but not any kind of object polymorphism. With std::any, there is no notion of a common interface that can be exercised across a variety of types. In fact, there is basically nothing you can do with a std::any but store it and copy it. In order to extract the internally stored object, it must be queried using its type (via std::any_cast()) which tends to defeat the purpose of polymorphism.

std::any is exclusively designed to replace instances where you might have previously used a void * in C code, offering improved type safety and possibly efficiency. 1 The classic use case is implementing a library that allows clients to pass in some context object that will later be passed to callbacks supplied by the client.

For this use case, the library must be able to store and retrieve the user’s context object. It’s it. It literally never will interpret the object or access it in any other way. This is why std::any fits here.

Another use case for std::any might be the component pattern in C++, where objects store a list of components, which are then explicitly queried for by client code. In this case, the framework also never deals directly with the components, but simply stores and exposes the to clients on request.

More: https://devblogs.microsoft.com/cppblog/stdany-how-when-and-why

Task queues, Redis, Python, Celery, RQ

  • There are many instances where your application has some expensive work to do, that would be not good to execute in the hot path. (e.g. responding to an HTTP request)
  • The typical solution is to enqueue a task in a queue and have a worker process it “offline”
  • In Python, Celery is a popular library for this. It uses backends for the actual queue. Popular backends are RabbitMQ and Redis.
  • RQ is another Python that supports Redis only.
  • RabbitMQ is designed to be a queue โ€” it’s in the name.
  • Redis is a in memory key value store/database. (or “data structure” store). It includes a number of primitives that might be used to implement a queue. It’s basically a in memory hashmap. Keys are strings in a flat namespace. Value are a set of supported fundamental data types. Everything is serialized as it’s interacted with IPC.
    • List โ€” (Implemented in RQ.) You can use opcodes like LPUSH and RPOP.
    • Pub/Sub โ€” Unsure about this. (Possibly implemented in Celery?)
    • Stream โ€” Advanced but apparently is not implemented in either Celery or RQ.
  • Celery has sleek Pythonic syntactic sugar for specifying a “Task” and then calling it from the client (web app). It completely abstracts the queue. It returns a future to the client (AsyncResult) โ€” the interface is conceptually similar to std::async in C++.
  • RQ is less opinionated and any callable can be passed into this .enqueue() function, with arguments to call it with. This has the advantage that the expensive code does not need to have Celery as a dependency (to decorate it). However that is not a real downside, as you can always keep things separate by making Celery wrappers around otherwise dependency free Python code. But it is an addition level of layers.
  • Heroku offers support for this โ€” you just need to add a new process to your Procfile for the celery/RQ worker process. Both celery/RQ generate a main.
  • Redis has other uses beyond being a queue: it can be a simple cache that you application accesses on the same server before accessing the real database server. It can also be used to implement a distributed lock (sounds fancy but is basically just a single entry in redis that clients can check to see if a “lock” is taken. Of course it’s more complicated than just that). Redis also supports transactions, in a similar way to transactional memory on CPUs. In general there are direct parallels from from local parallel programming to nearly everything in this distributed system world. That said there are unique elements โ€” like the Redis distributed lock includes concepts like a timeout and a nonce in case the client that acquired the lock crashes or disappears forever. That is generally not something you’d see in a mutex implementation. Another difference is that even though accessing Redis is shared mutable state, clients probably don’t need some other out of band mutex because Redis implements atomicity likely. That’s different than local systems because even if the shared, mutable data type you’re writing to/reading from locally is atomic (like a int/word), you should still use a mutex/atomic locally due to instruction reordering (mutexes and atomics insert barriers).

How to enable colored compiler output with CMake + VSCode

Assuming you’re using the “CMake Tools” VSCode extension, here’s what works for me.

1 – Set CMAKE_COLOR_DIAGNOSTICS to ON in your environment

This makes CMake pass -fcolor-diagnostics to clang. If you build on the command line, you’ll now have color. But the VSCode “output” pane will still be non-colored.

2 – Install the “Output Colorizer” extension from IBM.

This adds color to the Output pane.

It looks like this:

Links:

https://github.com/ninja-build/ninja/issues/174

https://github.com/microsoft/vscode-cmake-tools/issues/478

struct stat notes

struct stat on Linux is pretty interesting

  • the struct definition in the man page is not exactly accurate
  • glibc explicitly pads the struct with unused members which is intersting. I guess to reserve space for expansion of fields
    • if you want to see the real definition, a trick you can use is writing a test program that uses a struct stat, and compiling with -E to stop after preprocessing then look in that output for the definition
  • you can look in the glibc sources and the linux sources and see that they actually have to make their struct definitions match! (i think). since kernel space is populating the struct memory and usespace is using it, they need to exactly agree on where what members are
    • you can find some snarky comments in linux about the padding, which is pretty funny. for example (arch/arm/include/uapi/asm/stat.h)
  • because the structs are explicitly padded, if you do a struct designator initialization, you CANNOT omit the designators. if you do, the padded members will be initialized instead of the fields you wanted!

Netcat Refresher

Introduction

Netcat is a great tool for all things networking and is commonly nicknamed "the TCP/IP Swiss-army knife" due to its versatility and utility. An absolute must-know for sysadmins and hackers. In this article, I’ll go over a few common uses I have for it that I frequently forget after not using it for a while, primarily for my own personal reference.

Before I begin, I should point out that there are a few variants on netcat that have slightly different options and behaviors but are all essentially the same in "spirit and functionality", as the ncat man page describes it.

The original netcat comes from the OpenBSD package and was written by "Hobbit". This is the default version that comes with OS X and Ubuntu. The version that I use and will cover is the standard GNU Netcat, by Giovanni Giacobbi, which is a rewrite of the original. This available using brew on OS X. On Ubuntu it’s called "netcat-traditional" which you can apt-get and then run with nc.traditional. Lastly, there is ncat, which is a netcat implementation by our friends from the nmap team. It is designed to modernize netcat and adds features like SSL, IPv6, and proxying which aren’t available in the original(s).

Usage

At its core, netcat is a tool for creating arbitrary TCP connections, which looks like

$ netcat [host] [port]

where host is either an IP Address or a domain name, and port is the TCP port to connect to.

You can also use netcat to do the reverse: listen for arbitrary TCP connections. This looks like

$ netcat -l -p [port] [host]

Here, host is an optional parameter which lets you limit what host can create connections.

Example: Chat

Using these two behaviors, we can create a crude chat system. One one host, listen for connections on a port.

$ netcat -l -p 1337

On the same one, in another terminal, connect to it on that port.

$ nc localhost 1337

There won’t be a prompt, but when you enter text and press enter, it will appear in the other terminal. You can just as easily do this between different hosts and have a super basic chat setup.

Example: Curl-like behavior

You can also use netcat to emulate curl and interact with HTTP servers. Connect to the server on port 80 (or whatever port it’s running on) and you can then type out the HTTP request to send to it. When you’re finished, hit enter twice and it will send.

[mark:~]{ nc example.org 80
GET / HTTP/1.1

HTTP/1.1 400 Bad Request
Content-Type: text/html
Content-Length: 349
Connection: close
Date: Wed, 05 Mar 2014 20:15:42 GMT
Server: ECSF (mdw/1383)

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>400 - Bad Request</title>
</head>
<body>
<h1>400 - Bad Request</h1>
</body>
</html>

As you can see here, we sent a bare-bones HTTP request (GET / HTTP/1.1) which was successfully sent to the server. The server responded with a 400, because our request didn’t contain enough information, but that’s not important; if we had filled in the right headers, it would have responded with the home page for example.org.

For Hackers

There are two applications for netcat that I find particularly useful in pen-testing situations.

Recon

The first is helpful for the recon stage, which is essentially getting information on your target. Sometimes network services may give away version information when an arbitrary network connection is made. For example, OpenSSH by default gives away it’s version information as well as information on the host, when you connect. For example,

$ netcat 1.2.3.4 22
SSH-2.0-OpenSSH_5.9p1 Debian-5ubuntu1.1

is typically what you might see. For an attacker, this is pretty valuable stuff! MySQL behaves similarly.

$ netcat 1.2.3.4 3306
J
5.5.33-.?2|>\8๏ฟฝ๏ฟฝ@x\E$"zeic2lmysql_native_password

The output isn’t as clear as OpenSSH, but we can confirm that MySQL is indeed running, and we can infer that the version is "5.5.33". For information on removing these banners, check out my blog post on it.

Persistence/Access

The other application is when you have achieved command execution, but not exactly shell access. You can use netcat to create a nifty backdoor which you can externally connect to. To create the backdoor, we’ll use the -e flag to tell netcat to execute a binary on receiving a connection. We want a shell, so we’ll say -e /bin/sh. The whole command will look like:

$ netcat -l -p 1337 -e /bin/sh

which will give you a backdoor on port 1337, which will then let you run commands upon connecting to that port. For a good example, check out my other blog post where I actually used this.

Conclusion

That was a quick overview of netcat including its basic functionality and some example use cases. Thanks for reading!