Page Cache: avoiding unnecessary disk I/O

2025-05-12T08:30:00+00:00

Along with some other news!


It has been a long time since I last wrote a blog article!

When I started this blog, I told myself I would not be writing blog articles just for the sake of making content. I want to have actual meaningful advances before I post something. So what’s new?

Maestro now supports the x86_64 architecture!

I have implemented support for the x86_64 architecture. Maestro, when compiled for this architecture, is now able to run both 64-bit and 32-bit programs at the same time.

Support for the original x86 (32 bit) architecture remains.

The next architecture on the list is ARM, although it is not a priority at the moment and will come much later.

I gave a talk at p99conf!

This is a bit old: back on the 23rd of October, I spoke at p99conf, organised by ScyllaDB. They also kindly reposted my blog article about FlameGraphs on their website.

In the talk, I explain some of the optimisations I made when refactoring the codebase and the performance improvements I obtained. Parts of it are now outdated, as there have been a lot of changes in the code since (in particular around file management), but the rest is still relevant.

I rewrote most of the kernel

As I stated in my first blog article, the codebase of the kernel was not very clean at the beginning. So I rewrote most of it, although there are still parts that I am not proud of.

Even though the performance improvement was huge, the kernel was still too slow to be usable. Due to the absence of caching, the operating system was spending a lot of time doing unnecessary I/O operations on the disk.

So let’s see how this issue was fixed!

Caching directory entries

The first step in caching the content of the disk is to cache directory entries, since they are often accessed, in particular during path resolution.

By the way, if you are curious, the path resolution process is described in the manual: path_resolution(7).

The VFS (Virtual FileSystem) is the abstraction layer that links all mounted filesystems together. In Maestro, its fundamental building block is the vfs::Entry structure, which is a representation of a cached directory entry. It is wrapped in an Arc.

Here is its definition:

pub struct Entry {
    /// Filename.
    pub name: String,
    /// The parent of the entry.
    ///
    /// If `None`, the current entry is the root of the VFS.
    parent: Option<Arc<Entry>>,
    /// The list of cached file entries.
    ///
    /// This is not an exhaustive list of the file's entries. Only those that are loaded.
    children: Mutex<HashSet<EntryChild>>,
    /// The node associated with the entry.
    ///
    /// If `None`, the entry is negative.
    pub node: Option<Arc<Node>>,
}

EntryChild is just a wrapper around Arc<Entry>. It implements the Hash trait over the name field, which allows us to use a HashSet.

Otherwise, we would need a HashMap, which would require cloning name to use it as the key.
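
To illustrate, here is a minimal standalone sketch of how such a wrapper can be made to work with the standard Hash and Borrow machinery (the Entry below is trimmed down for the example, and the details may differ from Maestro's actual implementation):

use std::borrow::Borrow;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

// Trimmed-down Entry, for this standalone example only.
struct Entry {
    name: String,
}

/// Wrapper hashing an entry by its name only.
struct EntryChild(Arc<Entry>);

impl Hash for EntryChild {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.name.hash(state);
    }
}

impl PartialEq for EntryChild {
    fn eq(&self, other: &Self) -> bool {
        self.0.name == other.0.name
    }
}

impl Eq for EntryChild {}

/// Lets a `HashSet<EntryChild>` be queried with a plain `&str`.
impl Borrow<str> for EntryChild {
    fn borrow(&self) -> &str {
        &self.0.name
    }
}

fn main() {
    let mut children = HashSet::new();
    children.insert(EntryChild(Arc::new(Entry { name: "etc".into() })));
    // Lookup by name, without building an EntryChild or cloning a String.
    assert!(children.get("etc").is_some());
}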

The Node structure represents an inode (that is, all the information about a file, except its name, since there can be several hard links to the same inode).

All the instances of vfs::Entry form the tree of all the currently cached files on the system, which is only a subset of all the files present on all disks.

Here is a representation of a small example VFS tree:

VFS tree

Legend:

  • Solid arrow: Link to a child entry (the children field)
  • Dotted arrow: Link to the parent entry (the parent field)

Negative entries

If you look carefully, you can see the node field in vfs::Entry is an Option<...>. When None, it represents a negative entry, which caches the fact that a file does not exist. This concept is also present in the Linux kernel.

“Why is this useful?”, you may ask. Well…

Imagine we are performing path resolution for /etc/foo. The etc directory exists, but foo does not.

Let’s decompose path resolution (assuming negative entries are not implemented):

  • etc might be cached; if not (for example, on first access), we need to load it from disk (slow).
  • foo cannot be in the cache since it does not exist, so we have to attempt to load it from disk. Only then do we realise it does not exist, and path resolution fails.

Now let’s imagine we repeat this process with the same file over and over.

This happens, for example, when using a shell and executing the same command several times: the shell repeatedly looks for the executable file at the locations specified in the PATH environment variable.

Without negative entries, the kernel has to read from disk on every execution of the command to check whether the file exists.

Negative entries avoid this: when we read from disk and find out the file does not exist, we create a negative entry so that further accesses do not require disk I/O.
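
Here is a simplified, standalone model of the idea (this is not Maestro's actual code; the cache is reduced to a HashMap and the entry to its node field):

use std::collections::HashMap;
use std::sync::Arc;

struct Node; // stand-in for the real inode structure

/// Simplified cached entry: `node == None` makes it a negative entry.
struct Entry {
    node: Option<Arc<Node>>,
}

/// Hypothetical slow path, reading the directory from disk.
/// Returns `None` when the file does not exist.
fn load_from_disk(name: &str) -> Option<Arc<Node>> {
    println!("disk I/O for {name}"); // happens only once per name
    None
}

/// Looks up `name`, caching hits *and* misses.
fn lookup(cache: &mut HashMap<String, Entry>, name: &str) -> Option<Arc<Node>> {
    if let Some(entry) = cache.get(name) {
        // Cache hit: a negative entry answers "does not exist" without I/O.
        return entry.node.clone();
    }
    // Cache miss: one disk access, whose result is cached either way.
    let node = load_from_disk(name);
    cache.insert(name.to_string(), Entry { node: node.clone() });
    node
}

fn main() {
    let mut cache = HashMap::new();
    // Only the first lookup touches the disk; the second one is answered
    // by the negative entry.
    assert!(lookup(&mut cache, "foo").is_none());
    assert!(lookup(&mut cache, "foo").is_none());
}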

Caching file content: The Page Cache

We have now covered caching of directory entries. But what about the content of the files?

We often need to access the content of the same files over and over again, so we have to do caching there too. In general, operating systems implement a page cache for this purpose.

The page cache makes use of all the available memory on the system (the saying “Unused RAM is wasted RAM” applies pretty well here).

If you look at the content of /proc/meminfo on Linux, you can see many lines, each representing a different state for memory pages. Some pages are free and can be used for new allocations (MemFree), some are allocated (Active), while others are used to cache the content of the disk (Cached).

This distinction is important because cached memory can be reclaimed whenever it is needed for an allocation. If the system runs out of memory for programs, it can reclaim some from the cache, freeing the least recently used pages first.

Page Cache LRU

Maestro uses the Least Recently Used (LRU) policy for the page cache. It is represented by a doubly linked list acting as a queue.

A frame (which is an allocated group of 2^n contiguous pages) is represented by a structure called RcFrame. It includes a reference counter (hence the Rc prefix) and the pointers for the cache’s linked list.

Since the pointers are inside the RcFrame structure, the linked list is said to be intrusive. We want to use an intrusive list to avoid memory allocations, since memory allocations can fail, and handling memory failures is tedious.

As a rule of thumb: You rarely want to use linked lists. There are almost always better solutions.

When a frame is accessed, it is promoted in the LRU (moved to the front), making it the most recently used frame.
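
To give an idea of the layout, here is a rough sketch of a frame with intrusive LRU links (the field names are illustrative, not Maestro's actual ones):

use std::sync::atomic::{AtomicPtr, AtomicUsize};

struct Frame {
    /// Physical address of the first page of the frame.
    phys_addr: usize,
    /// The frame covers 2^order pages.
    order: u8,
    /// Reference counter (the `Rc` in `RcFrame`).
    ref_count: AtomicUsize,
    /// Intrusive links for the page cache's LRU list. Since the links live
    /// inside the frame itself, inserting or promoting a frame is just a
    /// matter of swapping pointers: no allocation is needed, so it cannot fail.
    lru_prev: AtomicPtr<Frame>,
    lru_next: AtomicPtr<Frame>,
}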

Relationship with mmap(2)

Once the page cache is in place, implementing file mapping in memory with mmap(2) becomes very convenient: we can simply reuse the physical pages of the page cache and map them in virtual memory.

Conveniently, any one physical page can be mapped at several virtual locations if necessary, allowing several processes to map the same file concurrently, for example.

The inconvenient part is that any process that has the file mapped may modify its content concurrently, at any time. Such aliased mutation is not allowed by Rust’s type system: describing the memory with a mutable reference would be undefined behaviour inside the kernel.

As such, we cannot represent the content of a page with a slice (&mut [u8]).

Instead, we have to use a *mut u8 (mutable pointers allow aliasing, unlike mutable references) and pass it to a special version of memcpy that tolerates concurrent accesses. This function is actually the same one we use for copies between userspace and kernelspace.
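
As an illustration, here is what a byte-wise version of such a copy could look like (a sketch of the idea, not Maestro's actual function; a real implementation would copy word-sized chunks for performance):

/// Copies `len` bytes from `src` to `dst`, where either side may be a page
/// that another process can modify concurrently through its own mapping.
///
/// # Safety
/// Both pointers must be valid for `len` bytes.
unsafe fn copy_concurrent(src: *const u8, dst: *mut u8, len: usize) {
    for i in 0..len {
        // Volatile accesses force an actual load/store for every byte and
        // keep the compiler from assuming the memory is unchanging.
        dst.add(i).write_volatile(src.add(i).read_volatile());
    }
}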

read(2) and write(2) also become trivial to implement: we just copy the data between userspace and the cache’s pages.

Writing data back to disk

Cached (and process-mapped) file contents have to be written back to disk at some point.

This has to be done regularly to avoid losing data in case of a system failure (such as a loss of power or a crash). To do this, Maestro has a kernel thread, called the flush task, that periodically flushes all the dirty pages to their corresponding location on disk.

Each page of physical memory has an associated structure, called Page, which stores metadata about the page (equivalent to page in Linux). These structures live at a fixed location in memory and are initialised at boot, one for each page.

Page stores a dirty flag, which indicates whether the page has been modified since the last flush. It also stores the timestamp of the last flush.
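
In a simplified form, the relevant part of the structure could be pictured like this (the field names are illustrative):

use std::sync::atomic::{AtomicBool, AtomicU64};

/// Per-physical-page metadata (simplified sketch).
struct Page {
    /// Set when the page has been modified since the last flush.
    dirty: AtomicBool,
    /// Timestamp of the last flush to disk.
    last_flush: AtomicU64,
}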

So now we need a way to detect dirty pages.

Before we go further, you may want to know how paging works under the x86 architecture. See “Paging” on the OSDev wiki.

Under the x86 architecture, each page table entry has a dirty flag, which is set by the CPU when the corresponding page is written to.

This flag could be used to detect whether the associated page needs to be flushed to disk. However, there is a caveat: the flag indicates whether the virtual page has been written to, while we want to know whether the physical page is dirty. Since a physical page can be mapped at several virtual locations, we would either need to scan the virtual contexts of all processes to determine whether a page is dirty (very expensive), or store every location where a physical page is mapped (annoying to implement).

I am guessing the Linux people stumbled upon the same problem, since the msync(2) man page specifies:

msync() flushes changes made to the in-core copy of a file that was mapped into memory using mmap(2) back to the filesystem. Without use of this call, there is no guarantee that changes are written back before munmap(2) is called.

This statement shifts the responsibility of deciding when to synchronise data to disk onto userland applications (how convenient!)

The only thing that remains is, when munmap(2) or msync(2) is called, to check the dirty flag of the associated virtual pages and update the dirty flag of the corresponding physical pages (in the Page structure).

The flush task can then simply iterate over physical pages and flush those with the dirty flag set.
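
Putting the two pieces together, here is a rough standalone sketch of the mechanism. The types and the pairing of page table entries with physical pages are simplified stand-ins, not Maestro's actual code:

use std::sync::atomic::{AtomicBool, Ordering};

/// Mock x86 page table entry; bit 6 is the dirty flag, set by the CPU.
struct Pte(u64);

impl Pte {
    const DIRTY: u64 = 1 << 6;

    /// Reads and clears the dirty bit.
    fn test_and_clear_dirty(&mut self) -> bool {
        let was_dirty = (self.0 & Self::DIRTY) != 0;
        self.0 &= !Self::DIRTY;
        was_dirty
    }
}

/// Per-physical-page metadata (simplified).
struct Page {
    dirty: AtomicBool,
}

/// Called on msync(2)/munmap(2): fold each virtual page's dirty bit into
/// the metadata of the backing physical page. Here PTEs and pages are
/// paired by index for simplicity; the real kernel resolves the physical
/// page from the address stored in the PTE.
fn sync_dirty_bits(ptes: &mut [Pte], pages: &[Page]) {
    for (pte, page) in ptes.iter_mut().zip(pages) {
        if pte.test_and_clear_dirty() {
            page.dirty.store(true, Ordering::Relaxed);
        }
    }
}

/// One pass of the flush task: write back every dirty page.
fn flush_pass(pages: &[Page]) {
    for page in pages {
        // `swap` clears the flag and reports its previous value, so a write
        // racing with the flush re-sets the flag for the next pass.
        if page.dirty.swap(false, Ordering::Relaxed) {
            write_back(page);
        }
    }
}

/// Stand-in for the real write-back routine.
fn write_back(_page: &Page) {}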

Filesystem structures mapping

The page cache also allows caching the inner structures of filesystems, saving a lot of I/O. This also makes filesystem implementations simpler, since we don’t have to manually write those structures back to disk all the time.

However, since we are mapping pages in kernelspace and not userspace, the responsibility of determining which pages are dirty falls on us.

We cannot scan the page directory here, so the easiest solution is simply to mark pages as dirty manually when we modify filesystem structures.

What happens if the filesystem is mounted and someone else tries to write to the block device file containing it? Could this corrupt internal kernel structures?

We just return EBUSY on attempts to access the block device while the filesystem is mounted. Problem solved.

Shrinking the cache

As said earlier, the kernel will use all the available memory to store cached pages.

But, what if we are running out of memory and want to allocate some?

The buddy allocator (which is used to allocate frames of physical memory) has two options to handle this case:

  • Fail upon memory exhaustion
  • Shrink the cache and attempt again

There are currently two caches that can be shrunk:

  • The directory entries cache
  • The page cache

This was not mentioned so far, but like the page cache, the directory entry cache also maintains an LRU in the form of a linked list.

Upon shrinking caches, the kernel simply attempts to release the last element in either cache.

If an element is in use (a page mapped by a process, for example), or if it could not be written back to disk (I/O failure), the kernel moves on to the next least recently used element (up to the end of the list), to avoid losing data.
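
To make the eviction walk concrete, here is a minimal sketch, with a Vec standing in for the intrusive linked list and all names hypothetical:

/// Simplified cache element.
struct Frame {
    ref_count: usize,
    dirty: bool,
}

impl Frame {
    /// The cache itself holds one reference; more means the frame is in use.
    fn in_use(&self) -> bool {
        self.ref_count > 1
    }
}

/// Stand-in for the real write-back routine.
fn write_back(_frame: &Frame) -> Result<(), ()> {
    Ok(())
}

/// Tries to release one element, starting from the least recently used end.
/// Returns `false` if nothing could be reclaimed (the allocation then fails).
fn shrink(lru: &mut Vec<Frame>) -> bool {
    for i in (0..lru.len()).rev() {
        // Skip frames that are still in use (mapped by a process, for example).
        if lru[i].in_use() {
            continue;
        }
        // A dirty frame must reach the disk first; on I/O failure, keep it
        // (to avoid losing data) and try the next candidate.
        if lru[i].dirty && write_back(&lru[i]).is_err() {
            continue;
        }
        lru.remove(i);
        return true; // one element freed; the buddy allocator can retry
    }
    false
}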

Overall VFS architecture

Currently, the file management architecture roughly looks like this:

VFS architecture

Legend:

  • Dotted arrow: optional

Note: FilesystemOps, FileOps and NodeOps are filesystem-specific. The implementation is different for each kind of filesystem (ext2, tmpfs, etc.)

NodeOps represents the set of operations that can be performed on a filesystem node. On the other hand, FileOps represents the set of operations that can be performed on an open file.

The distinction is important because some open files have semantics that are independent of the underlying filesystem. This is the case for pipes, sockets, character devices and block devices.

When opening such a file, the kernel will replace FileOps with an appropriate version to give the file its semantics.
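
As a rough sketch, the split could look like this (the trait and method names follow this article, but the signatures are guesses rather than Maestro's real API):

use std::sync::Arc;

// Stand-in types for this sketch.
struct Node;
struct File;
type Errno = i32;

/// Operations on a filesystem node.
trait NodeOps {
    /// Looks up a directory entry by name (used during path resolution).
    fn lookup_entry(&self, parent: &Node, name: &str) -> Result<Option<Arc<Node>>, Errno>;
}

/// Operations on an open file.
trait FileOps {
    fn read(&self, file: &File, off: u64, buf: &mut [u8]) -> Result<usize, Errno>;
    fn write(&self, file: &File, off: u64, buf: &[u8]) -> Result<usize, Errno>;
}

/// A pipe provides its own `FileOps`, independent of any filesystem.
struct PipeOps;

impl FileOps for PipeOps {
    fn read(&self, _file: &File, _off: u64, _buf: &mut [u8]) -> Result<usize, Errno> {
        todo!("read from the pipe's ring buffer")
    }
    fn write(&self, _file: &File, _off: u64, _buf: &[u8]) -> Result<usize, Errno> {
        todo!("write to the pipe's ring buffer")
    }
}

When opening a pipe, the kernel would then substitute something like PipeOps for the filesystem’s default FileOps.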

In summary

Let’s recapitulate how file accesses are done in Maestro.

To open a file:

  • The userspace program calls open(2) (or similar)
  • The kernel performs path resolution, finding the corresponding vfs::Entry
    • For each directory to traverse, the kernel checks the directory entry cache; on a miss, it calls lookup_entry on NodeOps
  • The kernel now has the vfs::Entry with the corresponding Node in it, which contains NodeOps and FileOps
  • The kernel creates a new instance of File, pointing to the vfs::Entry
    • If the file has a semantic that is independent of the underlying filesystem (pipes, sockets, character devices and block devices), FileOps is replaced with the appropriate version
  • The kernel creates a new FileDescriptor pointing to the newly created File
  • open(2) returns the ID of the newly created file descriptor

To read a file:

  • The userspace program calls read(2) (or similar) on the file descriptor
  • The kernel calls read on the FileOps stored in the corresponding File
  • The operations in FileOps are filesystem-specific. In the case of ext2, the function generic_file_read is called
  • generic_file_read calls read_page (on the NodeOps stored in the Node retrieved from File), in a loop to cover the required range
  • read_page checks for the page at the required offset in the page cache
    • If not found, the page is read from disk and inserted into the cache
  • generic_file_read then copies the content of the cached page to userspace
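
To make the shape of that loop concrete, here is a hypothetical sketch of generic_file_read. EOF handling is omitted, and read_page returning a page by value is a simplification: the real code copies from the cached frame to userspace using the special memcpy mentioned earlier.

const PAGE_SIZE: usize = 4096;

type Errno = i32;

struct Node;

/// Stand-in for the real page cache lookup: returns the page at the given
/// index, reading it from disk (and inserting it into the cache) on a miss.
fn read_page(_node: &Node, _page_index: u64) -> Result<[u8; PAGE_SIZE], Errno> {
    Ok([0; PAGE_SIZE])
}

/// Sketch of `generic_file_read`: loop over the pages covering the range.
fn generic_file_read(node: &Node, buf: &mut [u8], off: u64) -> Result<usize, Errno> {
    let mut copied = 0;
    while copied < buf.len() {
        let pos = off + copied as u64;
        // Page covering the current offset, from the cache or from disk.
        let page = read_page(node, pos / PAGE_SIZE as u64)?;
        let page_off = (pos % PAGE_SIZE as u64) as usize;
        // Copy as much as fits in this page and in the remaining buffer.
        let len = (buf.len() - copied).min(PAGE_SIZE - page_off);
        buf[copied..copied + len].copy_from_slice(&page[page_off..page_off + len]);
        copied += len;
    }
    Ok(copied)
}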

Writing to a file works almost the same way as reading, except:

  • The userspace program calls write(2) (or similar)
  • The kernel calls write on FileOps instead of read
  • generic_file_write is called instead of generic_file_read
  • After the page is retrieved with read_page, data is copied from userspace to the page
  • The page is then marked as dirty, to be written back to disk
  • The flush task asynchronously (later) writes dirty pages to disk

What’s next?

The last goals I talked about on this blog were in the first article, where I stated that I wanted network support and shared library support.

As of today, network support is on hold and will come later. Shared library support should theoretically work, since mapping a file with mmap now works, although it has not been tested so far.

For this year (2025), I have established the following goal: being able to do software development on Maestro itself.

The bare minimum required to do this is to port at least the following software:

  • A working linker and compilation tools: I will attempt to port binutils
  • A working compiler: gcc or clang shall do it
  • A text editor: I will attempt to port my favourite one, vim

Since I would like to iterate faster on the kernel, I will stop testing blimp and maestro-install against it for now. These two pieces of software are hard to build, which prevents me from using them in the kernel’s Continuous Integration pipeline and forces me to test them manually, which is very time-consuming.

If you are willing to test the OS, you should build it manually. There is a guide on how to do it on the kernel’s repository (located in doc/src/build.md at the moment).

If more time is left before the end of the year, I will attempt to:

  • Build a compiler (either gcc or clang) using itself, on Maestro. This will also require porting all the tools needed to compile it (such as autoconf and make, for example)
    • Since building a compiler takes very long, I will absolutely need to implement SMP support first (Symmetric MultiProcessing, i.e. support for multiple CPU cores)
  • Port rustc and cargo (this would require network support for a meaningful use)
  • Port git (this would also require network support)

These goals might pull in a lot of dependencies, which will require reworking blimp first to avoid manual work.

That’s all for today. I hope to have something to write a new article about soon!