Doing multitasking the hard way

Hi there!

A few days ago, I was fiddling around with C's setjmp and getjmp functions and quickly found out that one could implement coroutines with them pretty easily. Unsurprisingly, I wasn't the first person to realise that, so you can find dozens of C libraries that do exactly that on GitHub, probably more on SourceForge, GitLab & Co. However, if you want to add a little spice to your implementation, why not make it completely independent from a C library or even an ABI (at least on x86_64)?

Multitasking was invented by John Multitasking in 1994 when he tried to use a computer twice at the same time.

Creating a simple scheduler

Let's start by creating one of these "normal" schedulers I talked about above. We can do that using the default setjmp and longjmp, but don't worry, we'll replace these later with our own code. Let's start by creating a task struct that saves all the information we have about each individual task. Since we need some sort of dynamic list, let's put that into the struct as well.


#if __bool_true_false_are_defined != 1
// We don't need stdbool.h
typedef enum { false, true } bool;
#endif

typedef struct task {
    void (*call)(); // The function that starts the task
    bool was_called; // Whether we already called the above function
    uintptr_t stack_top; // Pointer to the top of the stack memory
    jmp_buf func_env; // Data to jump back into the task
    jmp_buf ctx_env; // Data to jump back to the scheduler
    struct task *next; // Next in linked list
    struct task *prev; // Previous in linked list
} task_t;

Alright, let me go through each member and explain why we need it. First off, the function pointer is actually used only once. We need to save it anyways, because we don't want to start the task right after creating it. 'was_called' is here so we don't call the function twice. 'stack_top' is pretty self-explanatory if you know what a stack is. We need a stack for each task, because variables are stored on it, and obviously each task has different variables. The only other thing that I should probably note, is that the stack grows down in memory in the x86(_64) platform. Now, the following two jmp_buf's named 'func_env' and 'ctx_env' are a core part of our scheduler. The 'jmp_buf' type stores data about the current thread state. How exactly it does that isn't defined by the C specification, mainly because restoring the previous state of a thread works vastly different on different platforms. I am going to cover what exactly needs to go in there for the x86_64 platform later in this post. 'next' and 'prev' simply store a pointer to the next and previous task in our list.

Creating a new task is rather simple. Because we don't want to rely on a C standard library, we are going to pass the task of retrieving memory for both the task struct and the tasks's stack onto the user. I'm also going to typedef a pointer to a task instance as ctx, just because it's easier to type, and makes more sense for anyone reading the code.


typedef task_t *ctx_t;

void add_task(ctx_t *ctx, // This is actually a pointer to a pointer
              task_t *buf,
              void *stack_bottom,
              size_t stack_size,
              void (*call)()) {
    // Save the function pointer for when we start the task
    buf->call = call;
    // Calculate the stack top
    buf->stack_top = (uintptr_t) (stack_bottom + stack_size);
    // ctx acts as a pointer to the first task in our list
    if (!*ctx) {
        *ctx = buf;
    }
    // Make the list circular
    buf->next = *ctx;
    (*ctx)->prev = buf;
    // Get the last task in the list
    task_t *last = get_last(*ctx);
    // Link the task to the last task
    buf->prev = last;
    last->next = buf;
}

The most complicated part of that function is where we connect our newly created task to the previous ones. Here, we use the 'get_last' function which, is just iterating over the list until it hits its tail. Since the list is circular, meaning that the last element's 'next' is the first element, the tail is the first element which is passed as an argument to the function.

Now to the interesting part: the actual scheduling function. Actually, it's two functions: one to iterate over the tasks and the other to actually execute them. Let's take a look first, then I'll explain what each part does.


int pursue_task();
task_t *current_task = 0;

void start(ctx_t *ctx) {
    current_task = *ctx;
    for (;;) {
        if (pursue_task() > 1) {
            // Are we in the last task?
            if (current_task->next == current_task)
                break;

            task_t *prev = current_task->prev;
            task_t *next = current_task->next;

            // If we are in the first task, update that
            if (*ctx == current_task)
                *ctx = next;

            // Unlink current task
            prev->next = next;
            next->prev = prev;
        }
        // Move on to next task
        current_task = current_task->next;
    }
}

// Take control of the stack pointer
register uintptr_t stack_pointer asm ("rsp");

int pursue_task() {
    // Save scheduler state
    int i = setjmp(current_task->ctx_env);
    if (!i) {
        if (!current_task->was_called) {
            current_task->was_called = true;
            // Set stack pointer to task's stack
            stack_pointer = current_task->stack_top;
            // Call task
            current_task->call();
        } else {
            // Restore previous task state
            longjmp(current_task->func_env, 1);
        }
    }
    return i;
}

The iteration part is rather "trivial". We start by saving a pointer to the first task into 'current_task', a global variable. Then we call the function to execute our thread and check its return code; I chose anything above one to exit the task. Skipping the if block for now, we just go to the next task using, the 'next' member of our task struct. Inside the if block we first check whether the current task is the only one in the task list. Since our list is circular we can do so by checking if the next task would be the current task. If we are indeed in the last running task we can safely exit the loop and return from the function. Otherwise, we need to take the task out of our list before continuing to the next one.

The execution part is a bit more complicated. We start of by saving our current task state into the 'ctx_env' of the current task using 'setjmp'. The functions return value has a very important property: much like POSIX's 'fork' it returns whether we just restored the task state or not. Zero indicates that we returned from the normal function call as you see it in code, anything else tells us that a call to 'longjmp' has jumped to this line of code. We use this knowledge to execute the task. The first part of the if block just calls the function. Additionally, it sets the stack pointer to the value specified. We can access the stack_pointer using some inline assembly and C's 'register' keyword. Keep in mind that this particular version of inline assembly is only available in compilers supporting the GNU C dialect. The other part is not so obvious at first glance. Here we longjmp to a state saved by the task itself. To actually enable the task to do so, with little code, we are going to need some macros.


#define yield if (!setjmp(current_task->func_env)) \
    longjmp(current_task->ctx_env, 1)
#define dump if (!setjmp(current_task->func_env)) \
    longjmp(current_task->ctx_env, 2)

This code already provides us with a working coroutine implementation, running everywhere (GNU) C does. It does not provide any fancy asynchronous I/O functions, thread sleeping and so on, but at least it works. Additionally, it is also neither independent from the C standard library nor the used ABI, but for now let's just set up some code to actually test our scheduler.


#include <stdio.h>
#include <stdlib.h>

void my_async_func() {
    puts("Func1: 1");
    yield;
    puts("Func1: 2");
    dump;
}

void my_async_func2() {
    puts("Func2: 1");
    yield;
    puts("Func2: 2");
    dump;
}

int main() {
    ctx_t ctx = {0};
    task_t t1 = {0}, t2 = {0};
    add_task(&ctx, &t1, malloc(4096), 4096, my_async_func);
    add_task(&ctx, &t2, malloc(4096), 4096, my_async_func2);
    start(&ctx);
    free(t1.stack_top - 4096);
    free(t2.stack_top - 4096);
    return 0;
}

Declaration of independence

What we've made until now works, but it is far from being independent. There are a few main issues we have to fix. First of all, we have to remove our dependency to 'uintptr_t' and 'size_t'. C doesn't provide exact-size integers natively, but integers have to have a certain minimum size. Lucky for us, because we only target x86_64 and our integers can easily be larger than 64-bit without breaking things, we can just use one of these, namely 'unsigned long long int', which has to be at least 64-bits wide. Using it is easy; for convenience one could typedef it after which you just have to replace 'uintptr_t' and 'size_t'. The other dependencies are 'setjmp' and 'longjmp' as well as their accompanying type 'jmp_buf'. To implement these functions we need assembly. I am going to put their implementation into a separate file, mainly because I use NASM for assembling, but if you have the nerves to do so you could use GCC inline assembly.

Let's start off with some function calling theory. Calling in x86 and its compatibles is, apart from arguments and return values, ABI independent. The caller (the place where the 'call' instruction is) pushes its instruction pointer to the stack. The instruction pointer conveniently always points to the next instruction so when using this value we don't have to worry about offsetting it. After that the caller jumps to the desired location and execution continues. Once the CPU reaches a 'ret' instruction (we call this place the 'callee') it's going to pop a value off the stack and jmp to it in memory. Now you might ask yourself: how is this important if every task has its own stack anyways? Well, each task does have its own stack, but once on the stack, a value doesn't stay there forever; it can get overwritten by variables or other return addresses. Since we want to return to the place 'setjmp' was called from, in 'longjmp', we're going to have to save this return pointer. And that is already the most important part. We do have to save registers, but that is a pretty straight-forward process without any pitfalls. To store everything, we will use a plain old array. Again, I'm going to typedef this to a nicer name. Keep in mind that array types are defined like this:


// I know we need 8, trust me
typedef myuint64_t my_jmp_buf[8];

Let's look at the implementation of setjmp, for the System V ABI (most to all UNIX systems, not Windows yet).


; Make the following executable
section .text

; Expose our function to the outside world
; The _ is a prefix C adds to functions automatically
global _my_setjmp

_my_setjmp:
    ; Arguments are passed using registers.
    ; First argument (my_jmp_buf) is in rdi.
    ; Dereferencing a pointer in assembly is quite straight forward.
    ; To move a register to a pointer that is in rdi use
    ; mov qword [rdi + 0], reg
    ; Qword tells the cpu it needs to copy a 64-bit value and 0 can 
    ; be any offset to the pointer.
    ; For System V we only need to store rbx, rsp, rbp, and r12 to r15
    mov qword [rdi +  0], rbx
    mov qword [rdi +  8], rbp
    mov qword [rdi + 16], r12
    mov qword [rdi + 24], r13
    mov qword [rdi + 32], r14
    mov qword [rdi + 40], r15

    ; rsp gets a special bit
    ; This basically puts rsp into rax and adds 8 to it, just a bit faster
    lea rax, qword [rsp + 8]
    mov qword [rdi + 48], rax

    ; The return address is at the stack bottom
    mov rax, qword [rsp]
    mov qword [rdi + 56], rax

    ; setjmp has to return 0 if it doesn't come from longjmp
    ; rax is used for returning in sys v
    xor rax, rax

    ret

I put a lot in the comments already, but you might not know why we save rsp + 8 instead of just rsp. It will become clearer in a second, but we need to push to return address we saved back to the stack, which, on x86_64, happens to require eight bytes, and 'lea' makes the +8 slightly faster. 'longjmp' is pretty much the same code just the other way around. We first restore rsp, then we push the return value to the stack and restore all other registers we saved in 'setjmp'.


; Expose our function to the outside world again
global _my_longjmp

_my_longjmp:
    ; rdi holds our my_jmp_buf again
    ; rsi holds the 

    ; Restore the stack pointer
    mov rsp, qword [rdi + 48]

    ; Push the return address
    push qword [rdi + 56]

    ; Restore the registers
    mov rbx, qword [rdi +  0]
    mov rbp, qword [rdi +  8]
    mov r12, qword [rdi + 16]
    mov r13, qword [rdi + 24]
    mov r14, qword [rdi + 32]
    mov r15, qword [rdi + 40]

    ; Set the return value
    mov rax, rsi

    ret

That wasn't too hard, was it? We can plug this into our code by just replacing any 'setjmp'/'longjmp'/'jmp_buf' with 'my_setjmp'/'my_longjmp'/'my_jmp_buf' respectively and adding the following two lines so that C knows the functions exist.


// Sizes aren't too important here
extern int my_setjmp(my_jmp_buf);
extern void my_longjmp(my_jmp_buf, int);

So we've removed the C library as a dependency, but what if you are a Windows user? Or even use some other weird compiler that doesn't support either of those ABI's? Fear not, dear friend, help is on its way. The following part should be easily portable to any language supporting inline assembly, so feel free to do so. In a nutshell we rely on your language's knowledge of it's own ABI. Since we can create functions that load arguments in some way and we can put these into a register using inline assembly, allowing us to define our own ABI which provides maximum compatibility with other ABI's. I've chosen rax and rdx as the two only caller-saved registers since these are used by both the 'div' and 'idiv' instructions which preform unsigned and signed integer division respectively, making them common in ABI's to caller-saved. Alternatively, one could use r8 and r9, since these are useable in Microsofts calling convention and usable enough in the System V ABI. Changing these registers in the assembly code, is just a matter of replacing their names, but from our C code we have to build a bridge to make calling work properly.


// C takes arrays as pointers in arguments
int my_setjmp(myuint64_t *buf) {
    myuint64_t ret = 0;
    // This just moves the first argument into rax
    asm volatile("movq %0, %%rax" : : "r" (buf));
    // Calling the function in assembly to prevent the
    // compiler from changing the stack pointer
    asm volatile("call _my_setjmp_asm");
    // Our tiny ABI uses rdx to return values
    asm volatile("movq %%rdx, %0" : : "m" (ret));
    return ret;
}

void my_longjmp(myuint64_t *buf, int ret) {
    // We are loading the second argument first
    // because inline assembly puts every
    // variable it loads into rax which would
    // overwrite buf
    asm volatile("mov %0, %%edx" : : "r" (ret));
    asm volatile("movq %0, %%rax" : : "r" (buf));
    // Here we don't care about the stack since
    // we're restoring it anyway before we use it,
    // therefore a normal call would be alright, but
    // you never know what crazy things can be
    // Part of an ABI
    asm volatile("call _my_longjmp_asm");
}

As you might have noticed we need to make a few more changes to our assembly code to make it work again. First of all we now can save all registers apart from rax and rdx in our my_jmp_buf. Since rax is used for the first argument, any temporary values from setjmp have to be stored in rdx instead. We also don't return using rax anymore, but rather rdx. Additionally, we have to rename the assembly functions, to fit the calls in our bridge functions. Lastly, and most complex...ly, we have to save two more values from the stack, the one pointed to by rbp and the one 8 bytes above that. This is because we need return to the bridge functions before they return to the original caller. The rbp register should point to the stack base of the original caller and the value above that, again this is 8 bytes offset on x86_64, is the return address of the original caller. Now that we have three values that need pushing to the stack we also need to reserve 16 more bytes for that. Saving all of these values properly should look like this


; rsp gets a special bit
; This basically puts rsp into rax and adds 8 to it, just a bit faster
lea rdx, qword [rsp + 24]
mov qword [rax + x], rdx

; Save the stack base of the original caller
mov rdx, qword [rbp]
mov qword [rax + x + 8], rdx

; Save the return address to the original caller
mov rdx, qword [rbp + 8]
mov qword [rax + x + 16], rdx

; The return value is at the stack bottom
mov rdx, qword [rsp]
mov qword [rax + x + 24], rdx

To properly return from assembly these values need to be pushed back onto the stack like so:


; Restore the stack pointer
mov rsp, qword [rax + x]

push qword [rax + x + 8]
push qword [rax + x + 16]

; Push the return address
push qword [rax + x + 24]

Conclusion

This probably won't find any real-world use. If you have a C library to work with, you can just use its 'setjmp'/'longjmp' implementation, and if you don't you likely know specifically what ABI you are using. This merely serves as a fun exercise to learn a thing or two about the inner workings of C and other languages that have coroutines. This is also not a definitive coroutine implementation. There are some great, completely independent pure-C implementations, using only the C preprocessor, which truly deserves an article of its own. Anyhow, I hope this helped someone to better understand cooperative multitasking, and until next time:

Farewell!

You can find the actual implementation of this post on GitHub.