Doing multitasking the hard way

A few days ago, I was fiddling around with C's setjmp and getjmp functions and quickly found out that one could implement coroutines with them pretty easily. Unsurprisingly, I wasn't the first person to realise that, so you can find dozens of C libraries that do exactly that on GitHub, probably more on SourceForge, GitLab & Co. However, if you want to add a little spice to your implementation, why not make it completely independent from a C library or even an ABI (at least on x86_64)?

Creating a simple scheduler

Let's start by creating one of these "normal" schedulers I talked about above. We can do that using the default setjmp and longjmp, but don't worry, we'll replace these later with our own code. Let's start by creating a task struct that saves all the information we have about each individual task. Since we need some sort of dynamic list, let's put that into the struct as well.

Alright, let me go through each member and explain why we need it. First off, the function pointer is actually used only once. We need to save it anyways, because we don't want to start the task right after creating it. 'was_called' is here so we don't call the function twice. 'stack_top' is pretty self-explanatory if you know what a stack is. We need a stack for each task, because variables are stored on it, and obviously each task has different variables. The only other thing that I should probably note, is that the stack grows down in memory in the x86(_64) platform. Now, the following two jmp_buf's named 'func_env' and 'ctx_env' are a core part of our scheduler. The 'jmp_buf' type stores data about the current thread state. How exactly it does that isn't defined by the C specification, mainly because restoring the previous state of a thread works vastly different on different platforms. I am going to cover what exactly needs to go in there for the x86_64 platform later in this post. 'next' and 'prev' simply store a pointer to the next and previous task in our list.

Creating a new task is rather simple. Because we don't want to rely on a C standard library, we are going to pass the task of retrieving memory for both the task struct and the tasks's stack onto the user. I'm also going to typedef a pointer to a task instance as ctx, just because it's easier to type, and makes more sense for anyone reading the code.

The most complicated part of that function is where we connect our newly created task to the previous ones. Here, we use the 'get_last' function which, is just iterating over the list until it hits its tail. Since the list is circular, meaning that the last element's 'next' is the first element, the tail is the first element which is passed as an argument to the function.

Now to the interesting part: the actual scheduling function. Actually, it's two functions: one to iterate over the tasks and the other to actually execute them. Let's take a look first, then I'll explain what each part does.

The iteration part is rather "trivial". We start by saving a pointer to the first task into 'current_task', a global variable. Then we call the function to execute our thread and check its return code; I chose anything above one to exit the task. Skipping the if block for now, we just go to the next task using, the 'next' member of our task struct. Inside the if block we first check whether the current task is the only one in the task list. Since our list is circular we can do so by checking if the next task would be the current task. If we are indeed in the last running task we can safely exit the loop and return from the function. Otherwise, we need to take the task out of our list before continuing to the next one.

The execution part is a bit more complicated. We start of by saving our current task state into the 'ctx_env' of the current task using 'setjmp'. The functions return value has a very important property: much like POSIX's 'fork' it returns whether we just restored the task state or not. Zero indicates that we returned from the normal function call as you see it in code, anything else tells us that a call to 'longjmp' has jumped to this line of code. We use this knowledge to execute the task. The first part of the if block just calls the function. Additionally, it sets the stack pointer to the value specified. We can access the stack_pointer using some inline assembly and C's 'register' keyword. Keep in mind that this particular version of inline assembly is only available in compilers supporting the GNU C dialect. The other part is not so obvious at first glance. Here we longjmp to a state saved by the task itself. To actually enable the task to do so, with little code, we are going to need some macros.

This code already provides us with a working coroutine implementation, running everywhere (GNU) C does. It does not provide any fancy asynchronous I/O functions, thread sleeping and so on, but at least it works. Additionally, it is also neither independent from the C standard library nor the used ABI, but for now let's just set up some code to actually test our scheduler.

Declaration of independence

What we've made until now works, but it is far from being independent. There are a few main issues we have to fix. First of all, we have to remove our dependency to 'uintptr_t' and 'size_t'. C doesn't provide exact-size integers natively, but integers have to have a certain minimum size. Lucky for us, because we only target x86_64 and our integers can easily be larger than 64-bit without breaking things, we can just use one of these, namely 'unsigned long long int', which has to be at least 64-bits wide. Using it is easy; for convenience one could typedef it after which you just have to replace 'uintptr_t' and 'size_t'. The other dependencies are 'setjmp' and 'longjmp' as well as their accompanying type 'jmp_buf'. To implement these functions we need assembly. I am going to put their implementation into a separate file, mainly because I use NASM for assembling, but if you have the nerves to do so you could use GCC inline assembly.

Let's start off with some function calling theory. Calling in x86 and its compatibles is, apart from arguments and return values, ABI independent. The caller (the place where the 'call' instruction is) pushes its instruction pointer to the stack. The instruction pointer conveniently always points to the next instruction so when using this value we don't have to worry about offsetting it. After that the caller jumps to the desired location and execution continues. Once the CPU reaches a 'ret' instruction (we call this place the 'callee') it's going to pop a value off the stack and jmp to it in memory. Now you might ask yourself: how is this important if every task has its own stack anyways? Well, each task does have its own stack, but once on the stack, a value doesn't stay there forever; it can get overwritten by variables or other return addresses. Since we want to return to the place 'setjmp' was called from, in 'longjmp', we're going to have to save this return pointer. And that is already the most important part. We do have to save registers, but that is a pretty straight-forward process without any pitfalls. To store everything, we will use a plain old array. Again, I'm going to typedef this to a nicer name. Keep in mind that array types are defined like this:

Let's look at the implementation of setjmp, for the System V ABI (most to all UNIX systems, not Windows yet).

I put a lot in the comments already, but you might not know why we save rsp + 8 instead of just rsp. It will become clearer in a second, but we need to push to return address we saved back to the stack, which, on x86_64, happens to require eight bytes, and 'lea' makes the +8 slightly faster. 'longjmp' is pretty much the same code just the other way around. We first restore rsp, then we push the return value to the stack and restore all other registers we saved in 'setjmp'.

That wasn't too hard, was it? We can plug this into our code by just replacing any 'setjmp'/'longjmp'/'jmp_buf' with 'my_setjmp'/'my_longjmp'/'my_jmp_buf' respectively and adding the following two lines so that C knows the functions exist.

So we've removed the C library as a dependency, but what if you are a Windows user? Or even use some other weird compiler that doesn't support either of those ABI's? Fear not, dear friend, help is on its way. The following part should be easily portable to any language supporting inline assembly, so feel free to do so. In a nutshell we rely on your language's knowledge of it's own ABI. Since we can create functions that load arguments in some way and we can put these into a register using inline assembly, allowing us to define our own ABI which provides maximum compatibility with other ABI's. I've chosen rax and rdx as the two only caller-saved registers since these are used by both the 'div' and 'idiv' instructions which preform unsigned and signed integer division respectively, making them common in ABI's to caller-saved. Alternatively, one could use r8 and r9, since these are useable in Microsofts calling convention and usable enough in the System V ABI. Changing these registers in the assembly code, is just a matter of replacing their names, but from our C code we have to build a bridge to make calling work properly.

As you might have noticed we need to make a few more changes to our assembly code to make it work again. First of all we now can save all registers apart from rax and rdx in our my_jmp_buf. Since rax is used for the first argument, any temporary values from setjmp have to be stored in rdx instead. We also don't return using rax anymore, but rather rdx. Additionally, we have to rename the assembly functions, to fit the calls in our bridge functions. Lastly, and most complex...ly, we have to save two more values from the stack, the one pointed to by rbp and the one 8 bytes above that. This is because we need return to the bridge functions before they return to the original caller. The rbp register should point to the stack base of the original caller and the value above that, again this is 8 bytes offset on x86_64, is the return address of the original caller. Now that we have three values that need pushing to the stack we also need to reserve 16 more bytes for that. Saving all of these values properly should look like this

To properly return from assembly these values need to be pushed back onto the stack like so:

Conclusion

This probably won't find any real-world use. If you have a C library to work with, you can just use its 'setjmp'/'longjmp' implementation, and if you don't you likely know specifically what ABI you are using. This merely serves as a fun exercise to learn a thing or two about the inner workings of C and other languages that have coroutines. This is also not a definitive coroutine implementation. There are some great, completely independent pure-C implementations, using only the C preprocessor, which truly deserves an article of its own. Anyhow, I hope this helped someone to better understand cooperative multitasking, and until next time: