Non-Volatile Memory Express (NVMe)

2026-04-23T10:30:00+00:00

An overview of NVMe and its support on Maestro


As stated in my previous blog article, I started an implementation of an NVMe driver. It is now functional, and this article is an overview of it!

NVMe (Non-Volatile Memory Express) is the modern standard interface for SSDs. Its specification is defined by the NVM Express Consortium.

Now you may ask: Maestro supports PATA and NVMe, so why was SATA not implemented first?

The answer is pretty straightforward: NVMe support is easy to implement!

Now you may have noticed that it took a few months before this article came out. While the implementation of the driver was easy, my kernel needed a bit of redesign in some places to fix design flaws that prevented the driver from working correctly. We will talk about this later.

The NVMe interface

For the implementation of this driver, I based my work on version 1.3d of the specification (mainly because it is easy to read).

From the point of view of a kernel, the NVMe controller is exposed as a device on the PCIe bus. A PCIe device provides BARs (Base Address Registers), which hold the addresses in memory where the device's registers are located (here, the NVMe controller's registers).

Reading or writing to memory at the address of a BAR performs I/O with the NVMe controller.

Note that, depending on the case, it may be necessary to disable the cache on BAR memory, or to enable Write-Through or Write-Combining.

The Write-Through flag on a virtual memory mapping tells the CPU that writes to memory should be directly forwarded to the underlying device (RAM, or other).

Write-Combining works similarly, except the CPU buffers writes and waits for the whole cache line to be written before forwarding it (which usually improves performance relative to Write-Through).
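As a rough illustration (this is not Maestro's code), here is how these policies could be selected through x86 page table entry bits. The sketch assumes the default PAT configuration, in which the PWT bit selects Write-Through and PCD combined with PWT selects uncached memory. Write-Combining requires reprogramming the PAT MSR and is left out. The `CachePolicy` and `cache_flags` names are made up for the example.

```rust
/// Page-level Write-Through bit of an x86 page table entry.
pub const PTE_PWT: u64 = 1 << 3;
/// Page-level Cache Disable bit of an x86 page table entry.
pub const PTE_PCD: u64 = 1 << 4;

/// Hypothetical cache policies a kernel could offer for a mapping.
#[derive(Clone, Copy, Debug)]
pub enum CachePolicy {
    WriteBack,
    WriteThrough,
    Uncached,
}

/// Returns the page table entry bits selecting `policy`,
/// assuming the PAT MSR still holds its default value.
pub fn cache_flags(policy: CachePolicy) -> u64 {
    match policy {
        CachePolicy::WriteBack => 0,
        CachePolicy::WriteThrough => PTE_PWT,
        CachePolicy::Uncached => PTE_PCD | PTE_PWT,
    }
}
```

A BAR mapping would then OR `cache_flags(CachePolicy::Uncached)` into the other flags of each page table entry.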

The NVMe controller exposes a set of registers on BAR memory (non-exhaustive):

| Offset | Name | Description |
|---|---|---|
| 0x00-0x07 | CAP | Controller capabilities |
| 0x08-0x0B | VS | Version |
| 0x0C-0x0F | INTMS | Interrupt mask set |
| 0x10-0x13 | INTMC | Interrupt mask clear |
| 0x14-0x17 | CC | Controller configuration |
| 0x1C-0x1F | CSTS | Controller status |
| 0x24-0x27 | AQA | Admin queue attributes |
| 0x28-0x2F | ASQ | Admin submission queue |
| 0x30-0x37 | ACQ | Admin completion queue |
| 0x1000+(2X)*Y | SQxTDBL | Submission queue X tail doorbell |
| 0x1000+(2X+1)*Y | CQxHDBL | Completion queue X head doorbell |

This table has been shamelessly stolen from osdev.org

Y is the doorbell stride in bytes. It is determined by reading the CAP register: Y = 4 << DSTRD, where DSTRD is the field at bits 35:32 of CAP.
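To make the doorbell arithmetic concrete, here is a minimal sketch (not Maestro's code) that derives the stride from a raw CAP value and computes the doorbell offsets from the table. The function names are made up for the example.

```rust
/// Doorbell stride in bytes (the `Y` of the table),
/// derived from the DSTRD field at bits 35:32 of CAP.
pub fn doorbell_stride(cap: u64) -> usize {
    let dstrd = ((cap >> 32) & 0xf) as usize;
    4 << dstrd
}

/// Offset, from the start of the BAR, of the tail doorbell of submission queue `qid`.
pub fn sq_tail_doorbell(qid: usize, stride: usize) -> usize {
    0x1000 + (2 * qid) * stride
}

/// Offset, from the start of the BAR, of the head doorbell of completion queue `qid`.
pub fn cq_head_doorbell(qid: usize, stride: usize) -> usize {
    0x1000 + (2 * qid + 1) * stride
}
```

With DSTRD set to zero, the doorbells are packed every 4 bytes: queue 0 (the admin queue pair) uses offsets 0x1000 and 0x1004, queue 1 uses 0x1008 and 0x100C, and so on.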

In the NVMe nomenclature, a disk is called a namespace. Namespaces are attached to a controller, with which the OS communicates.

Under Linux, in the /dev directory, you may see devices named nvmeX, nvmeXnY or nvmeXnYpZ.

X represents the ID of the controller, Y is the ID of the namespace on the controller, and Z is the ID of the partition on the namespace.

The specific details of the initialisation procedure of the NVMe controller are not very interesting and won’t be covered here. This is an overview rather than a tutorial.

Submission and completion queues

NVMe works with asynchronous I/O, through submission and completion queues. The OS writes commands in the submission queue, and the NVMe controller writes command results back into the completion queue.

If you are familiar with Linux’s low-level interfaces, this should remind you of io_uring.

One of the main advantages of this approach is that the NVMe controller can choose the order in which it processes commands, allowing optimisations.

A submission queue contains structures with the following layout:

| Bytes | Field | Description |
|---|---|---|
| 3:0 | CDW0 | Opcode [7:0], fused operation [9:8], PSDT [15:14], command ID [31:16] |
| 7:4 | NSID | Namespace identifier |
| 15:8 | - | Reserved |
| 23:16 | MPTR | Metadata pointer |
| 31:24 | PRP1 | Physical Region Page entry 1 (source/destination in physical memory) |
| 39:32 | PRP2 | Physical Region Page entry 2, or pointer to a PRP list |
| 43:40 | CDW10 | Command-specific |
| 47:44 | CDW11 | Command-specific |
| 51:48 | CDW12 | Command-specific |
| 55:52 | CDW13 | Command-specific |
| 59:56 | CDW14 | Command-specific |
| 63:60 | CDW15 | Command-specific |
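The layout above maps naturally to a `#[repr(C)]` structure. The following is a sketch rather than Maestro's actual definition: the `SubmissionEntry` name, its field names and the `new` constructor are assumptions made for the example.

```rust
/// One 64-byte submission queue entry, following the layout above.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default)]
pub struct SubmissionEntry {
    /// Opcode [7:0], fused operation [9:8], PSDT [15:14], command ID [31:16]
    pub cdw0: u32,
    pub nsid: u32,
    pub reserved: u64,
    pub mptr: u64,
    pub prp1: u64,
    pub prp2: u64,
    pub cdw10: u32,
    pub cdw11: u32,
    pub cdw12: u32,
    pub cdw13: u32,
    pub cdw14: u32,
    pub cdw15: u32,
}

impl SubmissionEntry {
    /// Builds an entry with the given opcode, command ID and namespace ID.
    /// Every other field is zeroed and left for the caller to fill.
    pub fn new(opcode: u8, cid: u16, nsid: u32) -> Self {
        Self {
            cdw0: (opcode as u32) | ((cid as u32) << 16),
            nsid,
            ..Default::default()
        }
    }
}
```

Since every field is naturally aligned, `#[repr(C)]` introduces no padding and the structure is exactly 64 bytes, so it can be copied as-is into the queue's memory.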

When a command is completed, the NVMe controller writes a structure in the completion queue with the following layout:

| Bytes | Field | Description |
|---|---|---|
| 3:0 | DW0 | Command-specific result |
| 7:4 | DW1 | Reserved |
| 9:8 | SQHD | Submission queue head pointer (updated by controller after processing) |
| 11:10 | SQID | Submission queue identifier |
| 13:12 | CID | Command identifier (matches the submission entry’s CDW0 [31:16]) |
| 15:14 | - | Phase tag [0], status field [15:1] |
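Here is a sketch of the matching structure (again, not Maestro's actual definition; the names are made up). It assumes, following the NVMe specification, that in the last 16-bit word bit 0 is the phase tag and the bits above it hold the status field, whose status code and status code type are both zero on success.

```rust
/// One 16-byte completion queue entry, following the layout above.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default)]
pub struct CompletionEntry {
    pub dw0: u32,
    pub dw1: u32,
    pub sq_head: u16,
    pub sq_id: u16,
    pub cid: u16,
    /// Phase tag in bit 0, status field in the bits above it
    pub status: u16,
}

impl CompletionEntry {
    /// Phase tag, used to tell fresh entries from stale ones.
    pub fn phase(&self) -> bool {
        self.status & 1 != 0
    }

    /// Status field, with the phase tag shifted out.
    pub fn status_field(&self) -> u16 {
        self.status >> 1
    }

    /// Tells whether the status code [7:0] and status code type [10:8] are both zero.
    pub fn is_success(&self) -> bool {
        self.status_field() & 0x7ff == 0
    }
}
```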

After writing this structure, the controller issues an interrupt to signal the OS there is something to read.

Command submission and completion

When submitting a command, the OS:

  • Takes a permit from the queue’s semaphore (see the note below)
  • Writes the submission entry to the submission queue
  • Writes a pointer to the current process in a side table to link the submission to its corresponding process
  • Updates the submission queue’s doorbell register to signal the NVMe controller a new entry is available to read
  • Makes the current process sleep

Note: Maestro uses semaphores with a number of permits that corresponds to the number of entries in the submission queue. Before submitting a command, we take a permit and only release it when the completion arrives. This ensures we don’t overflow the queue.
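The permit logic can be sketched in userspace with a mutex and a condition variable. This is only an illustration of the idea: the kernel uses its own primitives, and the `Semaphore` type below is a hypothetical stand-in.

```rust
use std::sync::{Condvar, Mutex};

/// Counting semaphore: one permit per free slot in the submission queue.
pub struct Semaphore {
    permits: Mutex<usize>,
    cond: Condvar,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Self {
            permits: Mutex::new(permits),
            cond: Condvar::new(),
        }
    }

    /// Takes a permit, blocking until one is available.
    pub fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.cond.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Takes a permit if one is available, without blocking.
    pub fn try_acquire(&self) -> bool {
        let mut permits = self.permits.lock().unwrap();
        if *permits == 0 {
            return false;
        }
        *permits -= 1;
        true
    }

    /// Gives a permit back, to be called when the completion arrives.
    pub fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cond.notify_one();
    }
}
```

A submitter calls `acquire` before writing its entry, and the completion path calls `release`, so there can never be more commands in flight than permits.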

Once the NVMe controller has completed the command, the following happens:

  • The NVMe controller writes an entry in the completion queue
  • The NVMe controller sends an interrupt (see the Message Signaled Interrupts chapter)
  • The interrupt is handled by the OS, which reads the new completion queue entry
  • The OS uses the ID in the completion entry to find the matching submission entry, and then uses the side table to find the matching process
  • The OS copies the completion entry to the side table so that it may be retrieved by the process that submitted the command
  • The OS wakes the process up
  • The OS updates the completion queue’s doorbell register to signal the NVMe controller that the completion queue entry has been processed
  • Upon resuming, the process retrieves the completion queue entry from the side table to know if the command succeeded
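The completion path above can be sketched as follows. This is a simplified userspace model, not Maestro's code: a `HashMap` stands in for the side table, the types are made up, and waking the process is left as a comment. It relies on the phase tag mechanism from the NVMe specification: the controller inverts the tag it writes on every pass through the queue, which lets the OS tell fresh entries from stale ones.

```rust
use std::collections::HashMap;

/// Simplified completion entry: only the fields used by this sketch.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct Completion {
    pub cid: u16,
    /// Bit 0 is the phase tag, the other bits hold the status field
    pub status: u16,
}

pub struct CompletionQueue {
    /// Memory shared with the controller, zeroed at creation
    pub entries: Vec<Completion>,
    pub head: usize,
    /// Phase tag expected on fresh entries, flipped on every wrap-around
    pub phase: bool,
}

impl CompletionQueue {
    pub fn new(len: usize) -> Self {
        Self {
            entries: vec![Completion::default(); len],
            head: 0,
            phase: true,
        }
    }

    /// Returns the next entry if the controller has written one.
    pub fn pop(&mut self) -> Option<Completion> {
        let entry = self.entries[self.head];
        if (entry.status & 1 != 0) != self.phase {
            return None;
        }
        self.head = (self.head + 1) % self.entries.len();
        if self.head == 0 {
            self.phase = !self.phase;
        }
        Some(entry)
    }
}

/// Body of the interrupt handler: drains the queue, stores each result in the
/// side table, and returns the value to write to the head doorbell.
pub fn handle_interrupt(
    cq: &mut CompletionQueue,
    side_table: &mut HashMap<u16, Option<Completion>>,
) -> usize {
    while let Some(entry) = cq.pop() {
        if let Some(slot) = side_table.get_mut(&entry.cid) {
            *slot = Some(entry);
            // A real kernel would wake the sleeping process here
        }
    }
    cq.head
}
```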

Admin queues & Identification

At the start, the NVMe controller only has one submission/completion queue pair. Those queues are the admin queues, used to identify the controller and namespaces, and to set up I/O queues.

Depending on its parameters, the Identify command returns a large structure that describes the controller, the list of namespaces, or a single namespace.

We first use it to retrieve the information about the controller, then retrieve the list of attached namespaces, then retrieve information about each namespace.

With this information, we are able to use the Create_IO_Completion_Queue and Create_IO_Submission_Queue commands to create a pair of I/O queues.
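For reference, here is a sketch of how the command-specific dwords of these admin commands can be built. The opcodes and bit positions are the ones I read in the NVMe specification, while the function and type names are made up for the example.

```rust
pub const OPCODE_CREATE_IO_SQ: u8 = 0x01;
pub const OPCODE_CREATE_IO_CQ: u8 = 0x05;
pub const OPCODE_IDENTIFY: u8 = 0x06;

/// What an Identify command should describe (CNS field of CDW10).
#[derive(Clone, Copy, Debug)]
pub enum IdentifyTarget {
    Namespace = 0x00,
    Controller = 0x01,
    ActiveNamespaceList = 0x02,
}

/// CDW10 of an Identify command.
pub fn identify_cdw10(target: IdentifyTarget) -> u32 {
    target as u32
}

/// CDW10 shared by both queue creation commands: queue ID [15:0] and
/// zero-based queue size [31:16]. `entries` must be at least 1.
pub fn create_queue_cdw10(qid: u16, entries: u16) -> u32 {
    ((entries as u32 - 1) << 16) | qid as u32
}

/// CDW11 of Create I/O Completion Queue: physically contiguous [0],
/// interrupts enabled [1], interrupt vector [31:16].
pub fn create_cq_cdw11(vector: u16) -> u32 {
    ((vector as u32) << 16) | 0b11
}

/// CDW11 of Create I/O Submission Queue: physically contiguous [0],
/// ID of the paired completion queue [31:16].
pub fn create_sq_cdw11(cqid: u16) -> u32 {
    ((cqid as u32) << 16) | 0b1
}
```

The completion queue has to be created first, since the submission queue refers to it through its ID.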

Read and write

Read and write operations on disk each have their associated command.

To read or write data on disk, the OS simply specifies the size (in blocks), the LBA (Logical Block Address) and a PRP (Physical Region Page) or a PRP list.

A PRP is an address in physical memory, where the data is read from or written to (depending on the I/O direction). Passing a PRP list allows implementing scatter-gather I/O (similar to the readv(2)/writev(2) system calls).
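As an illustration, the sketch below computes the command-specific dwords of a read or write command and the PRP entries covering a buffer. It makes several assumptions: memory pages are 4096 bytes (the real value is negotiated through the controller configuration), the buffer is physically contiguous to keep the example short, and the function names are made up.

```rust
/// Assumed memory page size.
pub const PAGE_SIZE: u64 = 4096;

/// CDW10, CDW11 and CDW12 of a read or write command: the starting LBA split
/// into two dwords, then the zero-based number of blocks. `blocks` must be in 1..=65536.
pub fn rw_dwords(lba: u64, blocks: u32) -> (u32, u32, u32) {
    (lba as u32, (lba >> 32) as u32, (blocks - 1) & 0xffff)
}

/// PRP entries covering `len` bytes starting at physical address `addr`.
/// Only the first entry may carry an offset inside its page.
pub fn prp_entries(addr: u64, len: u64) -> Vec<u64> {
    let mut entries = Vec::new();
    if len == 0 {
        return entries;
    }
    let end = addr + len;
    entries.push(addr);
    // Every following entry points to the start of the next page
    let mut page = (addr & !(PAGE_SIZE - 1)) + PAGE_SIZE;
    while page < end {
        entries.push(page);
        page += PAGE_SIZE;
    }
    entries
}
```

With one entry, only PRP1 is used. With two, the second goes in PRP2. With more, PRP2 points to a list in memory that holds every entry after the first. Since each entry is independent, the pages do not have to be contiguous in a real driver, which is what makes scatter-gather possible.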

Message Signaled Interrupts

Message Signaled Interrupts are supported by PCI/PCIe. They allow passing interrupts to the CPU without using dedicated pins. Instead, a message is sent on the bus from the device (in our case, the NVMe controller) to the CPU.

In the case of PCIe, this message takes the form of a write of a DWORD (4 bytes) at a specific memory address.

This feature is called MSI-X (or simply MSI for the legacy version, which we will not describe here, as the NVMe specification recommends using MSI-X whenever possible).

It turns out that x86 CPUs reserve a dedicated memory address range (starting at 0xFEE00000, handled by the local APIC): writing a DWORD to it triggers an interrupt.

When scanning the PCIe bus for device discovery, the OS can retrieve a BAR pointing to a table used to map the device’s interrupts to the CPU’s interrupt table. It has the following layout:

| Bits 127-96 | Bits 95-64 | Bits 63-32 | Bits 31-0 |
|---|---|---|---|
| Vector Control (0) | Message Data (0) | Message Address High (0) | Message Address Low (0) |
| Vector Control (1) | Message Data (1) | Message Address High (1) | Message Address Low (1) |
| Vector Control (N - 1) | Message Data (N - 1) | Message Address High (N - 1) | Message Address Low (N - 1) |

This table has been shamelessly stolen from osdev.org

The Message Address is the address where the message is written in memory. The Message Data is the value of the DWORD written there. Vector Control contains flags.

Each entry (line in the table) corresponds to an interrupt ID on the device’s side. The format of Message Data is specific to the CPU architecture, and it contains the interrupt ID on the CPU side.
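On x86, filling one entry can look like the sketch below. It assumes the usual MSI message format (destination APIC ID at bits 19:12 of the address, interrupt vector in the low byte of the data, fixed delivery mode and edge trigger). The `MsixEntry` type and the `msix_entry` function are made up for the example.

```rust
/// One entry of the MSI-X table, as laid out in memory (lowest address first).
#[repr(C)]
#[derive(Clone, Copy, Debug)]
pub struct MsixEntry {
    pub addr_low: u32,
    pub addr_high: u32,
    pub data: u32,
    /// Bit 0 masks the interrupt when set
    pub vector_control: u32,
}

/// Builds an unmasked entry that delivers interrupt `vector`
/// to the CPU core whose local APIC ID is `apic_id`.
pub fn msix_entry(apic_id: u8, vector: u8) -> MsixEntry {
    MsixEntry {
        addr_low: 0xfee0_0000 | ((apic_id as u32) << 12),
        addr_high: 0,
        data: vector as u32,
        vector_control: 0,
    }
}
```

In a real driver, the entry is then written to the table through volatile accesses, since the table lives in device memory.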

Maestro’s design flaws

A few design issues in the kernel made implementing the NVMe driver a bit harder than it should have been. This chapter is an overview of those.

Sleeping in a page fault handler triggered in execve

When using execve(2) to execute a programme, the kernel creates a new virtual memory space to build the new programme image.

To populate this new virtual memory space, the kernel temporarily switches to it. While doing so, it also used to enter a critical section, because we could not allow a context switch to another process while the bound memory context differed from the one belonging to the current process. Otherwise, the process would resume with its original memory context instead of the temporary one, causing memory corruption.

For more information about critical sections, see the blog article about SMP.

To populate the memory space, the kernel memory-maps (mmap(2)) the ELF file. Memory pages are lazy-populated, so they are not loaded from disk immediately.

Then, the kernel sometimes needs to zero the end of some ELF segments. Doing so triggers a page fault (since the page is not present yet), which in turn reads the file from the disk to populate memory.

The NVMe driver sends the read command and then puts the current process to sleep, which, in turn, triggers a context switch to immediately continue executing any other process waiting for CPU time.

However, we said earlier that we entered a critical section, right? The precise goal of a critical section is to disallow context switching.

This design was valid before NVMe because my Parallel ATA implementation is only polling (looping until data is ready, instead of sleeping until an interrupt comes).

To fix this issue, I modified the Process structure to contain one more field (active_mem_space) that points to the currently bound memory space (that may differ from the process’s own memory space). The temporary switch changes the value in active_mem_space. Then, when the process resumes, it uses that memory space instead of the process’s memory space. So the critical section is not needed any more.
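The idea can be sketched like this. It is a simplified model, not Maestro's actual `Process` structure: only the `active_mem_space` field name comes from the kernel, and modelling it as an `Option` is an assumption made for the example.

```rust
use std::sync::Arc;

/// Placeholder for a virtual memory space.
pub struct MemSpace {
    pub id: u32,
}

pub struct Process {
    /// The process's own memory space
    pub mem_space: Arc<MemSpace>,
    /// Temporarily bound memory space, if any (for instance during execve)
    pub active_mem_space: Option<Arc<MemSpace>>,
}

impl Process {
    pub fn new(mem_space: Arc<MemSpace>) -> Self {
        Self {
            mem_space,
            active_mem_space: None,
        }
    }

    /// Memory space the scheduler must bind when resuming this process.
    pub fn space_to_bind(&self) -> &Arc<MemSpace> {
        self.active_mem_space.as_ref().unwrap_or(&self.mem_space)
    }
}
```

Since the scheduler always asks the process which memory space to bind, sleeping in the middle of the temporary switch becomes harmless.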

CPU tasks rebalance during an NVMe sleep

To start sleeping, the NVMe driver has the following code:

// Put the process into sleeping state
process::set_state(State::Sleeping);
// Reschedule
schedule();

There is a small gap between the two function calls, during which another CPU core might run the task that rebalances tasks across CPU cores.

Between the moment we set the process into Sleeping state and the moment we reschedule (effectively saving the process’s state), another CPU core could attempt to resume the task if it received the completion interrupt before schedule was called.

This made the process run on two CPU cores at once, with an invalid register state on one core.

To fix this issue, the process state now has a flag that locks it until the context has effectively been switched, preventing other CPUs from picking the task until schedule is called.
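Here is a rough model of that flag using atomics. It is not the kernel's actual code: the `Task` type, its methods and the sequentially consistent ordering are simplifications chosen for the example.

```rust
use std::sync::atomic::{AtomicBool, AtomicU8, Ordering};

pub const RUNNING: u8 = 0;
pub const SLEEPING: u8 = 1;

pub struct Task {
    state: AtomicU8,
    /// Set while the task is going to sleep but its context is not saved yet
    switching: AtomicBool,
}

impl Task {
    pub fn new() -> Self {
        Self {
            state: AtomicU8::new(RUNNING),
            switching: AtomicBool::new(false),
        }
    }

    /// Called by the task itself, right before rescheduling.
    pub fn begin_sleep(&self) {
        self.switching.store(true, Ordering::SeqCst);
        self.state.store(SLEEPING, Ordering::SeqCst);
    }

    /// Called by the scheduler once the task's registers have been saved.
    pub fn context_saved(&self) {
        self.switching.store(false, Ordering::SeqCst);
    }

    /// Called by the interrupt handler when the completion arrives.
    pub fn wake(&self) {
        self.state.store(RUNNING, Ordering::SeqCst);
    }

    /// Tells whether another core may pick the task up.
    pub fn can_be_picked(&self) -> bool {
        self.state.load(Ordering::SeqCst) == RUNNING
            && !self.switching.load(Ordering::SeqCst)
    }
}
```

Even if the completion interrupt wakes the task before its context is saved, `can_be_picked` keeps returning `false` until `context_saved` is called.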

Disabling Write-Protect outside a critical section

Under x86, the CR0 register has a Write-Protect flag that, when disabled, allows the kernel to write to read-only pages. The state of this register is NOT saved by Maestro on context switch.

In execve(2) (again), the kernel was temporarily disabling this flag to write to the ELF’s read-only segments.

We cannot wrap this in a critical section either since, again, writing to memory might read from disk, which will put the process to sleep, which in turn will trigger a context switch.

When putting the process to sleep, the scheduler sometimes migrates it to another CPU core (to rebalance the load between cores). The process then resumes on a core where the Write-Protect flag is still enabled (since CR0 is not saved on context switch) and tries to write to a read-only page, which triggers a kernel panic.

As a fix, I stopped disabling Write-Protect here. Instead, I implemented the mprotect(2) system call and reused its logic inside of execve(2) to make the segments read-only once initialisation is over.

Future optimisations

The current implementation only has one I/O queue pair. On systems that have a lot of CPU cores, this might cause contention.

One way to reduce this is to create as many I/O queue pairs as available (but no more than the number of CPU cores) and assign them evenly across cores. As such, if the system has N CPUs and M I/O queue pairs available, each I/O queue pair gets N / M cores assigned.
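One possible mapping, given as a sketch (this is not implemented in Maestro, and the function name is made up):

```rust
/// Returns the index (starting at zero) of the I/O queue pair assigned to `core`,
/// given the number of CPU cores and of available I/O queue pairs.
/// `cores` must not be zero and `core` must be lower than `cores`.
pub fn queue_for_core(core: usize, cores: usize, queues: usize) -> usize {
    // Never use more queue pairs than cores, and always use at least one
    let queues = queues.min(cores).max(1);
    core * queues / cores
}
```

With 8 cores and 4 queue pairs, cores 0 and 1 share the first pair, cores 2 and 3 the second, and so on. Since queue ID 0 is reserved for the admin queues, the actual I/O queue ID would be this index plus one.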

Some NVMe commands are available to make some operations faster (non-exhaustive):

  • Write Zeroes: marks blocks as full of zeros (also making the drive last longer since it avoids write cycles)
  • Dataset Management: gives hints to the controller about the frequency of access to certain blocks
  • Copy: copies data from/to the same namespace or different namespaces without involving the CPU

What’s next?

Now that the NVMe side-quest is over, I will get back to implementing support for a desktop environment. This is ongoing and progressing well!