An overview of virtual memory management on Maestro

2024-04-08T13:00:00+00:00

And how to retain consistency when memory allocation failures occur.

First, some news about the project: I have not been very active on this blog for some time because I am currently putting as much effort as possible into cleaning up the codebase. There is a large pull request on GitHub that summarises the changes I made to the project, and there will be blog articles detailing the interesting parts.

While cleaning up the kernel’s codebase, I stumbled across the old code that handles processes’ virtual memory and decided to rewrite it, as it was rather ugly and had issues, in particular with retaining consistency when memory allocation failures occur.

The problem with memory allocation failures

When running a userspace program, the main way of allocating memory is through malloc (or, depending on the needs, mmap directly).

According to the POSIX specification, if “Insufficient storage space is available”, the malloc function fails, returning a NULL pointer and setting errno to ENOMEM. In practice, on Linux systems, this rarely happens: when the kernel runs out of memory to allocate to a process, the OOM-killer runs in order to free up memory, killing processes according to their respective OOM score (a metric defining their priority to stay alive).

As such, handling failures of malloc is often (but not always, depending on the application) not a great concern as the process will be killed instead.

The Rust language even decided not to expose this possibility at all for practicality reasons, as stated in this RFC:

Many collection methods may decide to allocate (push, insert, extend, entry, reserve, with_capacity, …) and those allocations may fail. Early on in Rust’s history we made a policy decision not to expose this fact at the API level, preferring to abort. This is because most developers aren’t prepared to handle it, or interested. Handling allocation failure haphazardly is likely to lead to many never-tested code paths and therefore bugs. We call this approach infallible collection allocation, because the developer model is that allocations just don’t fail.

However, the main issue with the OOM-killer approach is that the processes being killed are not necessarily the ones causing trouble on the system, which can lead to even more issues.

In Maestro

Under Maestro, I preferred to reduce the need for an OOM-killer as much as possible, using it only in very specific situations. Thus, memory allocations on Maestro can fail.

The main consequence is that every function that allocates memory in the kernel needs to return a Result, in order to return an error in case the allocation fails. This includes functions on collections such as Vec, HashMap, BTreeMap, etc… which had to be reimplemented to support this.
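
To give an idea, here is a minimal sketch of such a fallible API, using a wrapper around the standard Vec purely for illustration (the kernel’s Vec is a from-scratch reimplementation, not a wrapper; the AllocError/AllocResult names match the snippets shown later in this article):

pub struct AllocError;
pub type AllocResult<T> = Result<T, AllocError>;

pub struct FallibleVec<T>(Vec<T>);

impl<T> FallibleVec<T> {
    pub fn new() -> Self {
        Self(Vec::new())
    }

    /// A `push` that reports allocation failure instead of aborting
    pub fn push(&mut self, value: T) -> AllocResult<()> {
        // Reserve space first, so that the following push cannot allocate (and thus cannot abort)
        self.0.try_reserve(1).map_err(|_| AllocError)?;
        self.0.push(value);
        Ok(())
    }
}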

This is also the case for certain cloning operations. As such, the kernel has a TryClone trait that returns a Result. A blanket implementation is also present so that every type that implements Clone also implements TryClone, which will never fail for those particular types.
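
A minimal sketch of what this trait might look like (the names match the article, but the exact definitions in the kernel may differ):

// Same error/result aliases as in the previous sketch
pub struct AllocError;
pub type AllocResult<T> = Result<T, AllocError>;

/// Fallible counterpart of `Clone`
pub trait TryClone: Sized {
    fn try_clone(&self) -> AllocResult<Self>;
}

// Blanket implementation: every type implementing `Clone` gets a `TryClone` that never fails
impl<T: Clone> TryClone for T {
    fn try_clone(&self) -> AllocResult<Self> {
        Ok(self.clone())
    }
}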

The collect operation also becomes fallible, which is handled by the CollectResult wrapper.

Consistency problems arise when we need to make several allocations in sequence to modify the state of an object. Let’s say that a given procedure has to perform three operations that can allocate memory (and thus fail). If the first two allocations succeed but the third fails, the first two have to be undone in order to return to a consistent state.

Now, if the number of fallible operations is large, or depends on the input (as is the case for processes’ memory mapping/unmapping), the code to undo those operations can quickly become quite cumbersome.
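
To make the three-operation example above concrete, here is a purely illustrative sketch (the FallibleSet stand-in and its fallible insert are made up, modelled after the reimplemented collections mentioned earlier):

use std::collections::BTreeSet;

struct AllocError;
type AllocResult<T> = Result<T, AllocError>;

/// Illustrative stand-in for one of the kernel's fallible collections
struct FallibleSet(BTreeSet<u32>);

impl FallibleSet {
    /// Pretend this can fail on allocation, like the kernel's collections
    fn insert(&mut self, value: u32) -> AllocResult<()> {
        self.0.insert(value);
        Ok(())
    }

    fn remove(&mut self, value: u32) {
        self.0.remove(&value);
    }
}

/// Either all three insertions succeed, or the set is left untouched
fn insert_three(set: &mut FallibleSet, a: u32, b: u32, c: u32) -> AllocResult<()> {
    set.insert(a)?;
    if let Err(e) = set.insert(b) {
        // Undo the first insertion to restore consistency
        set.remove(a);
        return Err(e);
    }
    if let Err(e) = set.insert(c) {
        // Undo the first two insertions
        set.remove(a);
        set.remove(b);
        return Err(e);
    }
    Ok(())
}

Every additional fallible operation adds yet another undo branch, which is exactly what the transactions described later in this article avoid.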

How is processes’ memory handled in Maestro?

In Maestro, a process’s memory is handled by a structure called MemSpace, which is architecture-independent.

Each MemSpace contains:

  • A set of gaps, which represent chunks of virtual memory that are unused
  • A set of mappings, which represent chunks of virtual memory that are currently mapped for the process
  • Its architecture-dependent counterpart, which contains the paging structures for the CPU: the VMem structure
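
A rough sketch of the shape of this structure (the field names and stub types below are my own guesses for illustration, not the kernel’s actual definitions):

use std::collections::BTreeMap;

// Illustrative stubs; the real kernel types are much more involved
type VirtAddr = usize;
struct MemGap { begin: VirtAddr, size: usize }
struct MemMapping { begin: VirtAddr, size: usize }
struct VMem { /* architecture-dependent paging structures */ }

/// Sketch of a per-process memory space
struct MemSpace {
    /// Unused ranges of virtual memory
    gaps: BTreeMap<VirtAddr, MemGap>,
    /// Ranges of virtual memory currently mapped for the process
    mappings: BTreeMap<VirtAddr, MemMapping>,
    /// Architecture-dependent counterpart: the paging structures used by the CPU
    vmem: VMem,
}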

The two main operations on a MemSpace are the following:

  • Mapping memory (used to implement the mmap system call): searching for a range of memory matching given constraints and making it available for use by the process
  • Unmapping memory (used to implement the munmap system call): deleting the given range of memory, making it unavailable to the process

A map or unmap operation can span several mappings and gaps, as is required for the mmap and munmap system calls.

As an example, let’s say the process has memory mapped in the following ranges:

  • 0x100000-0x200000
  • 0x250000-0x300000

Now, the user wants to unmap memory in the range 0x150000-0x2a0000. The remaining mapped ranges must then be:

  • 0x100000-0x150000
  • 0x2a0000-0x300000

The same problem arises for mapping. If we replace the unmap operation with a map operation on the range 0x150000-0x2a0000, then the previous mappings must be partially unmapped to make room for the newly created mapping, resulting in the following:

  • 0x100000-0x150000
  • 0x150000-0x2a0000
  • 0x2a0000-0x300000
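
The splitting itself boils down to simple range arithmetic. Here is a small standalone helper (not the kernel’s code, just an illustration) that computes what remains of a mapping after an unmap:

/// Given a mapping covering `[start, end)` and a range to unmap `[u_start, u_end)`,
/// returns the pieces of the mapping that remain (zero, one or two pieces)
fn split_mapping(start: usize, end: usize, u_start: usize, u_end: usize) -> Vec<(usize, usize)> {
    let mut remaining = Vec::new();
    // Piece left of the unmapped range
    if start < u_start {
        remaining.push((start, end.min(u_start)));
    }
    // Piece right of the unmapped range
    if end > u_end {
        remaining.push((start.max(u_end), end));
    }
    remaining
}

fn main() {
    // The unmap example above: 0x150000-0x2a0000 is removed from both mappings
    println!("{:x?}", split_mapping(0x100000, 0x200000, 0x150000, 0x2a0000)); // [(100000, 150000)]
    println!("{:x?}", split_mapping(0x250000, 0x300000, 0x150000, 0x2a0000)); // [(2a0000, 300000)]
}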

oom::wrap bad

Previously, to avoid the complexity of handling the undo operations, I used the oom::wrap function, which I really dislike and would like to get rid of entirely. This function takes a closure performing the fallible operation and tries to execute it. On failure, it kills a process to reclaim memory (or at least, it is supposed to: there is a TODO instead), then retries, with a fixed number of tries before giving up, i.e. triggering a kernel panic.

oom::wrap(|| {
    /* Here, `insert` returns a `Result` retrieved by `oom::wrap`
     * If the Result is an error, this closure will be run again
     *
     * Cloning is necessary. Since the closure is `Fn`, it can only capture a reference to
     * `element`, so it has to be cloned at each try because we want to insert the element itself
     * and not a reference to it.
     */
    collection.insert(element.clone())
});
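
For reference, the shape of the function itself is roughly the following (a sketch: the retry count is an illustrative value, and the process-killing part is, as mentioned, still a TODO in the real kernel):

// Same aliases as elsewhere in this article
pub struct AllocError;
pub type AllocResult<T> = Result<T, AllocError>;

/// Maximum number of attempts before giving up (illustrative value)
const MAX_TRIES: usize = 5;

pub fn wrap<T, F: Fn() -> AllocResult<T>>(f: F) -> T {
    for _ in 0..MAX_TRIES {
        if let Ok(val) = f() {
            return val;
        }
        // TODO: kill a process to reclaim memory before retrying
    }
    panic!("out of memory");
}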

As the closure can be executed multiple times, the function has to take a Fn parameter instead of FnOnce, hence the need to clone anything that must be consumed inside it, resulting in:

  • Having to implement Clone on types you might really not want to be cloneable
  • Performance losses

On top of that, this is a terrible solution, as it is basically the same as having allocations that cannot fail, but more tedious to use.

Copy-On-Write

Virtual memory works by mapping ranges of addresses to pages of physical memory. It allows the kernel to define exactly where and how each process can access memory.

In practice, under both Maestro and Linux, when a range of virtual memory is allocated for a process, the underlying physical memory is lazily allocated.

For example, when a page of memory is mapped with write permission and zero-initialised, in reality, the kernel maps the virtual memory to a “default” physical page that contains all zeros and is shared between all pages that have not yet been written to.

Then, when the first write from the process occurs on the page, it triggers an exception (as the page is mapped read-only), which is caught by the kernel. The kernel then allocates a physical page, remaps the virtual memory with that page so that the process can write to it, then resumes execution at the instruction that caused the exception.

From the point of view of the process, nothing happened. This avoids allocating physical pages for virtual pages that are allocated but might never actually be used, saving both CPU time and memory.

This feature also works when forking a process: during the fork operation, the virtual memory of both processes is remapped so that it points to the same physical pages, but read-only. Then, when either process tries to write to a page, the kernel allocates a new physical page, copies the data, then remaps the memory to allow writing before resuming.

Copy-On-Write is one of the few places where we have to use an OOM-killer because we basically lied to the process: from its point of view, the physical memory should already have been allocated since the creation of the mapping.
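
As a rough model of the decision made in the page-fault handler (purely illustrative: the real code manipulates physical frames and paging entries, not reference-counted buffers):

use std::rc::Rc;

/// Illustrative model: a "physical page" is a reference-counted 4 KiB buffer
type Frame = Rc<[u8; 4096]>;

struct VirtPage {
    frame: Frame,
    writable: bool,
}

/// Called when the process writes to a page that is currently mapped read-only
fn handle_write_fault(page: &mut VirtPage) {
    // If the frame is shared (with another process, or because it is the default
    // zeroed page), allocate a private copy first. In the kernel, this allocation
    // can fail, which is where the OOM-killer has to step in.
    if Rc::strong_count(&page.frame) > 1 {
        page.frame = Rc::new(*page.frame);
    }
    // Remap the page with write permission, then resume the process
    page.writable = true;
}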

The Model-View-Controller pattern

It turns out I accidentally used some kind of MVC pattern while implementing virtual memory management.

MVC diagram

  • Model (MemSpace): contains mappings that define how the view (VMem) must be arranged
  • View (VMem): defines the layout of the virtual memory the controller (Process) runs on
  • Controller (Process): calls mmap/munmap/etc… from userspace to modify the model (MemSpace)

Atomic operations on MemSpace

So we need a way to perform several fallible operations at once, then roll back if one of them fails.

Let’s call a sequence of fallible operations a “transaction” (as in databases). A transaction in Maestro is represented by the MemSpaceTransaction structure, implemented here.

MemSpace stores MemMapping and MemGap instances in BTreeMap collections. Map and unmap operations require several insert and remove operations on those collections that must be performed atomically.

The insert function can fail as it may allocate memory for the new node in the tree. However, remove cannot fail as it only frees memory.

Since a rollback operation is not allowed to fail (otherwise, how could the kernel recover?), remove must be the only operation used during rollback (with one exception we will talk about later).

A transaction is used like this:

// Create the transaction for the MemSpace
let mut transaction = MemSpaceTransaction::new(&mut /* the inner state of MemSpace */, &mut /* the VMem instance of MemSpace */);
// Here, fallible operations happen, such as:
transaction.insert_gap(/* the MemGap to insert */)?;
/*
 * Commit the transaction to validate the modifications.
 * If the transaction is dropped before calling this function, the implementation of `Drop` performs the rollback.
 *
 * On error, the `?` operator stops the execution of the current scope, so this line is never reached and the transaction is dropped.
 */
transaction.commit();

Committing the transaction performs the necessary remove operations, as we are now sure that every fallible insert has succeeded.
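
Conceptually, committing looks like the following sketch, built from the description above (not the kernel’s actual code): the elements recorded in discard are removed for real, and the complement, which holds the undo information, is simply dropped.

use std::collections::{BTreeMap, HashMap};
use std::hash::Hash;

fn commit<K: Ord + Hash, V>(
    on: &mut BTreeMap<K, V>,
    complement: HashMap<K, Option<V>>,
    discard: HashMap<K, ()>,
) {
    // Remove the elements marked for removal; `remove` only frees memory, so it cannot fail
    for (key, ()) in discard {
        on.remove(&key);
    }
    // The undo information is no longer needed
    drop(complement);
}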

Inside the insert_gap function (and also insert_mapping), the following function is used to insert into the collections:

fn insert<K: Clone + Ord + Hash, V>(
	key: K,
	value: V,
	on: &mut BTreeMap<K, V>,
	complement: &mut HashMap<K, Option<V>>,
	discard: &mut HashMap<K, ()>,
) -> AllocResult<()> {
	// Insert new value and get previous
	let old = on.insert(key.clone(), value)?;
	// Insert `None` to reserve memory without dropping `old` on failure
	let Ok(val) = complement.entry(key.clone()).or_insert(None) else {
		// Memory allocation failure: rollback `on` for this element
		rollback_impl(on, key, old);
		return Err(AllocError);
	};
	// Write the actual complement value
	*val = old;
	// Do not discard an element that is to be restored by the complement
	discard.remove(&key);
	Ok(())
}

The function performs the insertion, but also performs two important operations:

  • It adds the previous value (if any) to a complement collection, whose role is to revert the modifications during rollback
  • It removes the key from discard, which is the set of elements to be removed on commit, so that the newly inserted element is not removed on commit

The rollback operation is done using these functions:

fn rollback<K: Ord + Hash, V>(on: &mut BTreeMap<K, V>, complement: HashMap<K, Option<V>>) {
	for (key, value) in complement {
		rollback_impl(on, key, value);
	}
}

#[cold]
fn rollback_impl<K: Ord + Hash, V>(on: &mut BTreeMap<K, V>, key: K, value: Option<V>) {
	let _ = match value {
		// Insertion cannot fail since `on` is guaranteed to already contain the key
		Some(value) => on.insert(key, value).unwrap(),
		None => on.remove(&key),
	};
}

The rollback function iterates over the complement collection and rolls back the changes by putting back the elements that previously had the same key, or simply removing the element if nothing was there before. The exception we talked about earlier is here: the insertion cannot fail because the key is already present in the tree, so the operation only swaps the value without allocating.

Atomic operations on VMem

VMem also has a transaction system, although a much simpler one.

Inside VMem, there is a pointer to the paging structures (see Wikipedia).

A transaction performs map/unmap operations, replacing previous entries in those structures, but it also saves the previous entries in a Vec used as an undo log.

This way, if the transaction is dropped without being committed, the implementation of Drop rolls back every operation by putting the saved entries back in place, similar to pressing Ctrl + Z repeatedly in a text editor.
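
A simplified model of the idea (the real implementation manipulates hardware paging entries through raw pointers; the types and names here are illustrative only):

use std::collections::HashMap;

type VirtPage = usize;
type Entry = u64; // stand-in for a paging-table entry

struct VMemTransaction<'a> {
    table: &'a mut HashMap<VirtPage, Entry>,
    /// Previous entries, saved so they can be restored on rollback
    saved: Vec<(VirtPage, Option<Entry>)>,
    committed: bool,
}

impl<'a> VMemTransaction<'a> {
    fn new(table: &'a mut HashMap<VirtPage, Entry>) -> Self {
        Self { table, saved: Vec::new(), committed: false }
    }

    /// Map a page, remembering what was there before
    fn map(&mut self, page: VirtPage, entry: Entry) {
        let old = self.table.insert(page, entry);
        self.saved.push((page, old));
    }

    /// Unmap a page, likewise remembering the previous entry
    fn unmap(&mut self, page: VirtPage) {
        let old = self.table.remove(&page);
        self.saved.push((page, old));
    }

    /// Validate the modifications: the undo log is simply discarded
    fn commit(mut self) {
        self.committed = true;
    }
}

impl Drop for VMemTransaction<'_> {
    fn drop(&mut self) {
        if self.committed {
            return;
        }
        // Not committed: undo every operation, like pressing Ctrl + Z repeatedly
        while let Some((page, old)) = self.saved.pop() {
            match old {
                Some(entry) => {
                    self.table.insert(page, entry);
                }
                None => {
                    self.table.remove(&page);
                }
            }
        }
    }
}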

Bonus: A funny Tale

While implementing all of this, I stumbled across a bug whose cause I could not figure out. After a call to fork, the execution of the child process just seemed to randomly jump to the address 0x1 for no apparent reason (obviously causing a segmentation fault).

After more than 3 weeks of debugging, I finally found out…

On x86 CPUs, there is a cache called the Translation Lookaside Buffer (or TLB for short), which stores virtual memory mapping information so that the CPU does not need to retrieve it from main memory each time it needs it, speeding up address translation.

The consequence is that when you modify virtual memory mappings, you need to invalidate the cache for the locations you modified, to force the CPU to refresh them. This can be done in several ways, one of which is the invlpg instruction, which invalidates the page containing the virtual address given as an operand.
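
For reference, an invlpg wrapper in Rust generally looks something like this (x86/x86_64 inline assembly; not necessarily the kernel’s exact code):

/// Invalidates the TLB entry for the page containing `addr`
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
unsafe fn invlpg(addr: *const u8) {
    core::arch::asm!(
        "invlpg [{}]",
        in(reg) addr,
        options(nostack, preserves_flags),
    );
}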

I vividly remembered reading somewhere a long time ago that QEMU (which I use most of the time to test the kernel) does not emulate the behaviour of the TLB, so I thought leaving a TODO instead would do no harm for now, and I moved on (as I did not want to spend time at that moment thinking about the design of the API for that particular feature).

Then, after wasting so much time, I came across this: Stackoverflow: Does QEMU emulate TLB?

The consequence of not invalidating the cache was that the CPU kept randomly using old memory mappings, so when the Copy-On-Write operation was carried out after the call to fork, the data was copied from the wrong page.

Moral of the Tale: leaving TODOs = bad