Symmetric MultiProcessing, Hyper-Threading and scheduling on Maestro
2025-09-15T08:00:00+00:00
A dive into the PIC, PIT, ACPI, APIC, HPET, SMP, TLB and other confusing acronyms!
The article’s cover image above is a CPU die shot of an Intel Mobile Pentium II (Dixon), coming from Wikipedia, distributed under this license
Honestly, I had no idea of what cover image I could use for this article :’)
In my previous blog article, I presented Maestro running gcc and g++.
This was a crucial step before being able to port many pieces of software to the OS.
However, compiling software can often take time, especially when you have only one CPU core available, which was the case on Maestro for a long time.
To remedy this issue, I have implemented SMP (Symmetric MultiProcessing) support, which I am going to detail in this article.
To begin with SMP, we need a way to enumerate the CPU cores present on the system, which can be done with ACPI.
About ACPI
ACPI (Advanced Configuration and Power Interface) was initially introduced in 1996 by Intel, Microsoft and Toshiba. It is an interface for discovering system components and managing power. Its specification is available here.
ACPI provides a rather easy-to-use interface, made of “tables” exposing information about the system.
However, a part of the ACPI interface is rather complex: one of those tables, called the DSDT (Differentiated System Description Table) contains AML (ACPI Machine Language) bytecode, a Turing-complete language which has to be run in kernel-space.
Unlike usual bytecode (which should be easy to parse), AML’s syntax in Backus–Naur Form spans 14 pages of the specification PDF (see Chapter 20).
ACPI has been criticised by many, including Linus Torvalds:
Modern PCs are horrible. ACPI is a complete design disaster in every way. But we’re kind of stuck with it. If any Intel people are listening to this and you had anything to do with ACPI, sh**t yourself now, before you reproduce.
And indeed, we have reasons to believe ACPI was purposely made to make it difficult to support on other operating systems than Windows.
Bill Gates, in an internal note from 1999 (revealed as part of the Comes v. Microsoft court case), stated:
One thing I find myself wondering about is whether we shouldn’t try and make the “ACPI” extensions somehow Windows specific.
It seems unfortunate if we do this work and get our partners to do the work and the result is that Linux works great without having to do the work.
Maybe there is no way to avoid this problem but it does bother me.
Maybe we could define the APIs so that they work well with NT and not the others even if they are open.
Or maybe we could patent something related to this.
While this note raises concerns about potential anti-competitive intent, it does not, on its own, constitute definitive proof that ACPI was deliberately designed to obstruct other operating systems. However, I would not be surprised if it was actually the case, since this is common Microsoft behaviour.
Moreover, Mark Shuttleworth notes:
If you read the catalogue of spy tools and digital weaponry provided to us by Edward Snowden, you’ll see that firmware on your device is the NSA’s best friend. Your biggest mistake might be to assume that the NSA is the only institution abusing this position of trust – in fact, it’s reasonable to assume that all firmware is a cesspool of insecurity courtesy of incompetence of the worst degree from manufacturers, and competence of the highest degree from a very wide range of such agencies.
Source: LWN
It should be well known that the world runs on duct tape and incompetence is widespread (even beyond software).
While incompetence is a well-known characteristic of the average JavaScript developer, it is unjustly far less recognised when it comes to firmware developers, who are equally capable of producing bug-ridden software.
Kernel developers often have to find workarounds for buggy firmware. And it is not uncommon, when booting Linux, to get error logs due to invalid AML code.
Unfortunately, ACPI is now the de facto standard for power management on PC, and we have to use it :(
Enumerating CPU cores using ACPI
This section talks about both ACPI and the APIC. This is not a typo: they are two different things, and I will give details about the APIC in the next section.
ACPI has a table called the MADT (Multiple APIC Description Table), which contains a list of entries.
Two entry types are of interest right now:
- Processor Local APIC: declares a CPU, with its Local APIC ID (each CPU core has a Local APIC) on one byte
- Processor Local x2APIC: same as the above, except the Local APIC ID is on four bytes. This is necessary for systems with more than 255 cores
Each entry corresponds to a logical core (also known as thread), as opposed to a physical core. I will give more details in the Hyper-Threading section.
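To make this a bit more concrete, here is a minimal sketch of what walking the MADT entries could look like. The field layout follows the ACPI specification, but the type names and the helper itself are mine, for illustration only:

```rust
/// Header shared by every MADT entry (sketch; names are mine).
#[repr(C, packed)]
struct MadtEntryHeader {
    entry_type: u8, // 0 = Processor Local APIC, 9 = Processor Local x2APIC, ...
    length: u8,
}

/// Type 0 entry: Processor Local APIC (one per logical core).
#[repr(C, packed)]
struct LocalApicEntry {
    hdr: MadtEntryHeader,
    acpi_processor_uid: u8,
    apic_id: u8, // Local APIC ID, on one byte
    flags: u32,  // bit 0: enabled
}

/// Walk the MADT's entry area and report the APIC ID of every enabled core.
unsafe fn enumerate_cores(entries: &[u8], mut on_core: impl FnMut(u32)) {
    let mut off = 0;
    while off + 2 <= entries.len() {
        let hdr = &*(entries.as_ptr().add(off) as *const MadtEntryHeader);
        let len = hdr.length as usize;
        if len < 2 || off + len > entries.len() {
            break; // malformed entry, stop here
        }
        if hdr.entry_type == 0 && len >= core::mem::size_of::<LocalApicEntry>() {
            let e = &*(entries.as_ptr().add(off) as *const LocalApicEntry);
            if e.flags & 1 != 0 {
                on_core(e.apic_id as u32);
            }
        }
        // A Processor Local x2APIC entry (type 9) would be handled the same
        // way, except its APIC ID field is four bytes wide
        off += len;
    }
}
```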
PIC vs APIC
The PIC (Programmable Interrupt Controller) is a legacy component that was used to handle interrupts from other components on the system. It was originally implemented by the Intel 8259 chip.
The PIC receives external interrupts from other components in the system on its pins, which are forwarded to the CPU. Since the PIC only supported 8 interrupt lines, it was not uncommon to have two of them (master and slave) to support more.
Once the CPU finishes handling an interrupt, it sends an EOI (End Of Interrupt) to the PIC so that it can send the next (if any).
The PIC is simple to use, but it cannot support CPUs with more than one core. This is why Intel introduced the APIC (Advanced Programmable Interrupt Controller), which is embedded directly into the CPU.
The APIC embeds an emulation of the PIC for backwards compatibility. It also embeds a timer (the Local APIC timer) which replaces (but does not emulate, this is a different interface) the legacy PIT (Programmable Interval Timer).
Make sure not to mix up the PIC and PIT. Again, this is not a typo and they are two different things ;)
The PIT was based on the Intel 8253 chip.
Nowadays, the APIC has been retroactively renamed to xAPIC.
This is due to the introduction of the x2APIC, which uses larger registers to allow using larger APIC IDs in order to accommodate more than 255 cores on a single system.
The APIC is actually split into two parts:
- The Local APIC, which allows, among other things, setting up the timer, mapping interrupt IDs to interrupt vectors, and sending IPIs (Inter-Processor Interrupts). Each CPU core has one
- The I/O APIC, which is configured to redirect external interrupts to the cores chosen by the OS
IPIs are necessary to boot other CPU cores, but we will see this in a later section.
APIC timer calibration
Do not mix up the APIC timer and the ACPI timer! The latter is not very precise and pretty much useless nowadays
To start other CPU cores, we need a way to sleep for a given amount of time (we will see later why). We need a timer for that.
The APIC timer has two main advantages:
- There is one per core
- It is precise
However, its frequency is unknown: there is no standard way of retrieving it. To work around this issue, we have to measure it, in a process called calibration.
To calibrate the APIC timer, we are going to compare it to another timer whose frequency is known. A good candidate is the HPET (High Precision Event Timer).
Both the APIC timer and the HPET have counters. At each tick, the APIC decrements its counter, while the HPET increments its counter.
To calibrate the APIC:
- Set up the APIC timer’s counter to a hardcoded value
- Set up the HPET to run at a known frequency
- Read the HPET’s counter
- Start the APIC timer and loop until its counter reaches 0
- Read the HPET’s counter and compute the difference with the previous reading
- With this difference and the HPET’s known frequency, we can compute the time it took the APIC timer to count down the number of ticks we gave it, and thus derive its frequency
Note: there is no guarantee all Local APIC timers have the same frequency, so this process has to be repeated for all cores once they are booted.
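Here is a rough sketch of what this could look like in Rust. The lapic_* and hpet_* helpers below are hypothetical MMIO accessors standing in for the real register accesses, not Maestro’s actual API:

```rust
/// Measure the Local APIC timer frequency (in Hz) against the HPET.
/// `lapic_*` and `hpet_*` are hypothetical helpers (sketch only).
fn calibrate_apic_timer() -> u64 {
    // Arbitrary number of APIC timer ticks to measure
    const APIC_TICKS: u32 = 1_000_000;

    // Program the APIC timer (one-shot) with a known initial count; the
    // timer decrements this counter at every tick
    lapic_set_initial_count(APIC_TICKS);

    // Read the HPET's (incrementing) counter before...
    let hpet_start = hpet_read_counter();

    // ...busy-wait until the APIC timer's counter reaches 0...
    while lapic_read_current_count() > 0 {
        core::hint::spin_loop();
    }

    // ...and read the HPET's counter again
    let hpet_end = hpet_read_counter();

    // The HPET's period (in femtoseconds per tick) is advertised in its
    // general capabilities register, so the elapsed time is known
    let elapsed_fs = (hpet_end - hpet_start) as u128 * hpet_period_fs() as u128;

    // APIC timer frequency = ticks elapsed / time elapsed
    (APIC_TICKS as u128 * 1_000_000_000_000_000 / elapsed_fs) as u64
}
```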
Starting all cores
On x86, only a single CPU core is up at startup. This is different from ARM, where all cores boot simultaneously.
The Intel Software Developer Manual (SDM) describes the algorithm to boot other CPU cores.
We can send Inter-Processor Interrupts (an interrupt sent to another CPU core) by writing to some of the APIC’s registers. Two different kinds of IPIs must be sent to the other core in order to boot it: INIT and STARTUP.
A delay must be added in between each IPI (this is why we needed a way to wait for a given amount of time).
As CPUs become faster, the required delay is shorter. As such, Linux uses a shorter delay for more recent CPUs (see Linux’s source code).
Once the CPU core starts, it is in real mode. We must then perform all the usual initialisation to switch it to protected mode or long mode, and then we can use the core!
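As a sketch of that sequence, with the classic conservative delays (the Ipi type, lapic_send_ipi, sleep_us and the trampoline address below are illustrative placeholders, not Maestro’s actual code):

```rust
/// Kinds of IPIs used for booting an application processor (sketch).
enum Ipi {
    Init,
    Startup { vector: u8 },
}

/// Boot another core following the INIT-SIPI-SIPI sequence from the Intel SDM.
fn start_ap(apic_id: u32) {
    // Physical address of the real-mode boot trampoline; it must be
    // page-aligned and below 1 MiB so the core can reach it in real mode
    const TRAMPOLINE_ADDR: u32 = 0x8000;
    // The STARTUP IPI's vector encodes the page number of the trampoline
    let vector = (TRAMPOLINE_ADDR >> 12) as u8;

    // 1. INIT IPI: puts the target core in the wait-for-SIPI state
    lapic_send_ipi(apic_id, Ipi::Init);
    sleep_us(10_000); // 10 ms

    // 2. First STARTUP IPI: the core starts executing the trampoline in real mode
    lapic_send_ipi(apic_id, Ipi::Startup { vector });
    sleep_us(200);

    // 3. Second STARTUP IPI, in case the first one was not delivered
    lapic_send_ipi(apic_id, Ipi::Startup { vector });
    sleep_us(200);
}
```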
Halting the system
When running on a single CPU core, halting the system (stopping all operations until reboot) is rather easy. It can be summarised with the following assembly code:
halt:
cli
hlt
jmp halt
cli disables interrupts and hlt makes the processor idle until an interrupt occurs (which will never happen since interrupts are disabled), making the CPU core idle forever.
However, this code only makes the current CPU core halt. Being able to stop all cores is very important, especially in case of a kernel panic.
In order to stop all cores, we need a global flag indicating whether the system is in the process of halting. This flag is checked by all cores at each interrupt.
When a core halts the system, it sets this flag, sends an IPI to all cores, and halts itself. Then, all cores will read the flag upon reception of the IPI and halt themselves too.
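In Rust, the logic could look roughly like this (a sketch: HALT_VECTOR, lapic_broadcast_ipi and halt_current_core are hypothetical helpers):

```rust
use core::sync::atomic::{AtomicBool, Ordering};

/// Global flag: set once the system is in the process of halting.
static HALTING: AtomicBool = AtomicBool::new(false);

/// Halt the whole system (e.g. on kernel panic).
pub fn halt_system() -> ! {
    // Set the flag first so other cores see it when the IPI arrives
    HALTING.store(true, Ordering::SeqCst);
    // Wake every other core up with an IPI so they check the flag
    lapic_broadcast_ipi(HALT_VECTOR);
    // Halt the current core (the cli; hlt loop shown above)
    halt_current_core()
}

/// Called by every core at each interrupt (including the halt IPI).
pub fn check_halt() {
    if HALTING.load(Ordering::SeqCst) {
        halt_current_core();
    }
}
```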
Per-CPU structure
Now that we have several CPU cores running, we need a per-CPU structure, which contains information for the core we are running on.
Under x86_64, we store a pointer to this structure in the gs segment register.
This is convenient since we can swap between a kernel-space and a user-space gs using the swapgs instruction.
Each CPU now has its own scheduler, with its own run queue, stored in the per-CPU structure.
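Something along these lines, where the layout is purely illustrative (not Maestro’s actual structure) and storing a self pointer at offset 0 of gs is an assumption of the sketch:

```rust
use core::arch::asm;

/// Per-CPU data, one instance per core (illustrative layout).
#[repr(C)]
struct PerCpu {
    /// Pointer to this structure itself, so that reading gs:0 yields its address
    this: *mut PerCpu,
    /// APIC ID of the core owning this structure
    apic_id: u32,
    /// Nested critical sections counter (see the critical sections part below)
    preempt_count: u32,
}

/// Get the current core's PerCpu structure.
/// Assumes the kernel loaded the structure's address into the GS base
/// (IA32_GS_BASE MSR) during the core's initialisation.
#[inline]
fn per_cpu() -> *mut PerCpu {
    let ptr: *mut PerCpu;
    unsafe {
        asm!("mov {}, gs:[0]", out(reg) ptr, options(nostack, preserves_flags));
    }
    ptr
}
```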
TLB shootdown
x86 CPUs cache translations from virtual to physical memory addresses in order to avoid redoing the resolution at each memory access. This cache is called the TLB (Translation Lookaside Buffer).
There are several ways to invalidate this cache (which is required after modifying a memory mapping to avoid inconsistencies). Maestro uses the following:
- Re-set the cr3 register to flush the whole TLB (except some mappings that are marked with the GLOBAL flag)
- Use the invlpg instruction to invalidate a single page of memory
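For reference, both methods boil down to a couple of instructions; here is a sketch using Rust inline assembly:

```rust
use core::arch::asm;

/// Invalidate the TLB entry for the page containing `virt_addr`.
unsafe fn invalidate_page(virt_addr: usize) {
    asm!("invlpg [{}]", in(reg) virt_addr, options(nostack, preserves_flags));
}

/// Flush the whole TLB (except GLOBAL mappings) by re-writing cr3.
unsafe fn flush_tlb() {
    asm!(
        "mov {tmp}, cr3",
        "mov cr3, {tmp}",
        tmp = out(reg) _,
        options(nostack, preserves_flags),
    );
}
```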
Here comes the issue: those methods invalidate the cache only on the current CPU core, not other cores.
As such, when several processes run in the same memory space (multithreaded), the TLB might be inconsistent after unmapping or remapping memory (mmap/munmap system calls).
To prevent this issue, we must perform a TLB shootdown, which is the process of telling other CPU cores to invalidate their TLB too.
For each memory space, Maestro maintains a bitmap of the cores bound to it. When a core binds the memory space it sets its bit, and when it unbinds it clears it. Hence, we can just iterate over this bitmap to get the list of all the cores we have to notify.
The notification is transmitted through deferred calls.
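Putting it together, the shootdown could look roughly like this. This is a sketch assuming a synchronous deferred-call primitive such as the one described in the next section; MemSpace, cores_bitmap, iter_set_bits, defer_call_sync and current_core_id are placeholders:

```rust
/// Invalidate a page of a memory space on every core currently bound to it.
fn tlb_shootdown(mem_space: &MemSpace, virt_addr: usize) {
    let current = current_core_id();
    // Invalidate locally first
    unsafe { invalidate_page(virt_addr) };
    // Then ask every other core bound to this memory space to do the same
    for core in mem_space.cores_bitmap.iter_set_bits() {
        if core != current {
            // Synchronous deferred call: returns once the remote core is done,
            // so the mapping change is visible everywhere when we return
            defer_call_sync(core, move || unsafe { invalidate_page(virt_addr) });
        }
    }
}
```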
Deferred calls
Maestro now has an interface to send a function to be executed on another CPU core.
Under Linux, this can be done using the functions smp_call_function, smp_call_function_single, smp_call_function_many, etc…
Internally Maestro uses a per-CPU MPSC (Multiple-Producer Single-Consumer) queue for this.
Let core A be the source (deferring the call) and core B the destination (executing the call)
On core A:
- First, if A and B are the same core, we just execute the function directly and return
- Else, we get core B’s queue, and we insert the call onto it atomically
- We send an IPI to core B to notify it there is an enqueued call
- Then, if:
  - The call is synchronous, we make core A wait until the function returns on core B (this is signalled by an atomic boolean that is set once the function returns)
  - The call is asynchronous, we just return and core A continues its execution normally
On core B:
- The core receives the IPI and jumps to code that pops elements from the queue until none is left (with a hard limit to prevent starvation)
- For each element, the core executes the associated function
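Here is a sketch of both sides. The queue type, CoreId, current_core_id, per_cpu_queue, lapic_send_ipi, DEFERRED_CALL_VECTOR and MAX_CALLS_PER_IPI are placeholders for illustration; Maestro’s real interface differs:

```rust
use alloc::{boxed::Box, sync::Arc};
use core::sync::atomic::{AtomicBool, Ordering};

/// Run `f` on core `dest` and wait for it to complete (the synchronous case).
fn defer_call_sync(dest: CoreId, f: impl FnOnce() + Send + 'static) {
    // Same core: just run the function directly
    if dest == current_core_id() {
        f();
        return;
    }
    // Completion flag shared between both cores
    let done = Arc::new(AtomicBool::new(false));
    let done2 = Arc::clone(&done);
    let call: Box<dyn FnOnce() + Send> = Box::new(move || {
        f();
        done2.store(true, Ordering::Release);
    });
    // Push onto core B's MPSC queue atomically, then notify it with an IPI
    per_cpu_queue(dest).push(call);
    lapic_send_ipi(dest, DEFERRED_CALL_VECTOR);
    // Wait until core B has executed the function
    while !done.load(Ordering::Acquire) {
        core::hint::spin_loop();
    }
}

/// IPI handler on the destination core.
fn deferred_call_handler() {
    let queue = per_cpu_queue(current_core_id());
    // Pop with a hard limit, to avoid starving the interrupted task
    for _ in 0..MAX_CALLS_PER_IPI {
        match queue.pop() {
            Some(call) => call(),
            None => break,
        }
    }
}
```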
Rewriting the scheduler
Now that we have several cores, we need to have several schedulers (one per-core).
Each core has a run queue, which is a circular doubly-linked list containing processes ready to be run.
When a process switches to the Running state, it is inserted in a scheduler’s run queue. If it switches to another state (Sleeping, Stopped or Zombie), it is removed from the queue.
Each time we need to reschedule, the scheduler selects the next process in the queue to run and rotates the queue by one element.
Rescheduling is triggered in various places, including (non-exhaustive):
- When the currently running process has to wait for a resource, in which case it also switches to the Sleeping state (example: a blocking read(2))
- When the APIC timer fires an interrupt (it is set up to fire periodically to preempt execution)
- When the process calls sched_yield(2)
- When the process exits, in which case it switches to the Zombie state
Once the new process is selected, the CPU switches context to this process to resume its execution.
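As a simple illustration of the selection logic (using a VecDeque in place of Maestro’s circular doubly-linked list; Process is a placeholder type):

```rust
use alloc::collections::VecDeque;

/// A per-CPU run queue (sketch).
struct RunQueue {
    /// Processes in the Running state, ready to be scheduled
    queue: VecDeque<Process>,
}

impl RunQueue {
    /// Rotate the queue by one element and return the next process to run,
    /// or None if the queue is empty.
    fn pick_next(&mut self) -> Option<&Process> {
        // The old head goes to the back of the queue...
        if let Some(head) = self.queue.pop_front() {
            self.queue.push_back(head);
        }
        // ...and the new head is the process to run next
        self.queue.front()
    }
}
```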
If the queue is empty, the scheduler runs the idle task, which is a simple kernel thread running this loop:
idle_task:
sti
hlt
jmp idle_task
sti enables interrupts and hlt makes the processor idle until an interrupt occurs. In short: idle until something happens.
Process load balancing
This part of the implementation has been inspired by FreeBSD’s ULE scheduler.
To be efficient, an SMP system needs its processes to be balanced across cores to use as much CPU capacity as possible.
When a process switches to the Running state, we have to insert it in the run queue of a CPU.
To select the CPU, we run through the following steps (a sketch follows the list):
- Look at the currently running process of the last CPU that ran the process (if any): if its priority is lower than that of the process to insert, we select that CPU
Note: the idle task has a lower priority than any other process, making any other process replace it
- Else, look up the CPU topology tree to find the closest available core (see the CPU topology tree section)
- Else, look up a global bitmap, which has a bit for each CPU, indicating whether it is currently idle. If there is one, we select it
- Else, select the CPU with the least assigned processes, among those whose currently running process has a lower priority than the process to insert
- Else, select the CPU with the least assigned processes
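Here is what those steps could look like in code. Every helper below (last_cpu, current_priority_of, topology_closest_available, idle_cpus_bitmap, the least-loaded lookups) is a placeholder for illustration, not Maestro’s actual API:

```rust
/// Pick the CPU whose run queue the process should be inserted into.
fn select_cpu(proc: &Process) -> CpuId {
    // 1. Prefer the CPU that last ran the process, if the process it is
    //    currently running has a lower priority (the idle task has the
    //    lowest priority of all, so an idle CPU always matches here)
    if let Some(last) = proc.last_cpu() {
        if current_priority_of(last) < proc.priority() {
            return last;
        }
    }
    // 2. Walk up the CPU topology tree to find the closest available core
    if let Some(cpu) = topology_closest_available(proc.last_cpu()) {
        return cpu;
    }
    // 3. Any idle CPU, found through the global idle bitmap
    if let Some(cpu) = idle_cpus_bitmap().first_set_bit() {
        return cpu;
    }
    // 4. The least loaded CPU among those running a lower-priority process
    if let Some(cpu) = least_loaded_cpu_below(proc.priority()) {
        return cpu;
    }
    // 5. Fall back to the least loaded CPU overall
    least_loaded_cpu()
}
```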
The kernel also has a kernel thread which periodically rebalances processes across cores.
It looks for the cores with the fewest and the most processes, and migrates processes from the latter to the former.
After enough iterations, the system should eventually tend towards equilibrium.
Critical sections
Sometimes, it is desirable to disable preemption (make sure the scheduler does not switch to another process) while still being able to handle interrupts.
Before SMP was introduced, Maestro simply disabled interrupts temporarily to disable preemption, but it quickly became clear this was not suitable for every case.
Under Linux (and now Maestro too), critical sections allow temporarily disabling preemption while keeping interrupts enabled.
A critical section is entered by calling preempt_disable and exited by calling preempt_enable.
Those functions respectively increment and decrement a counter in the per-CPU structure. This counter allows nesting critical sections.
Maestro’s counter has the following layout:
| Bit 31 | Bits 30-0 |
|---|---|
| Flag saying whether the timer has requested a reschedule (clear = reschedule) | The current number of nested critical sections |
Note that this layout is different in the Linux kernel.
When the APIC timer fires an interrupt, the interrupt handler just clears the highest bit. The scheduler sets it back once the rescheduling function is entered.
preempt_enable does not only decrement the counter: it also checks the value afterwards.
If it is zero, it means rescheduling was requested and that we exited the last critical section, so we have to reschedule now.
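A sketch of this logic, using a bare AtomicU32 instead of the field in the per-CPU structure (schedule() is a placeholder, and the constant name is mine):

```rust
use core::sync::atomic::{AtomicU32, Ordering};

/// Bit 31 set = no reschedule requested; bit 31 clear = the timer asked for one.
/// Bits 30-0 hold the number of nested critical sections.
const NO_RESCHED_BIT: u32 = 1 << 31;

/// Counters start with the flag set (no reschedule pending) and a count of 0.
fn new_counter() -> AtomicU32 {
    AtomicU32::new(NO_RESCHED_BIT)
}

/// Enter a critical section.
fn preempt_disable(counter: &AtomicU32) {
    counter.fetch_add(1, Ordering::Relaxed);
}

/// Leave a critical section; reschedule if one was requested and this was the
/// outermost section (i.e. the whole counter is now 0).
fn preempt_enable(counter: &AtomicU32) {
    if counter.fetch_sub(1, Ordering::Relaxed) - 1 == 0 {
        schedule();
    }
}

/// APIC timer interrupt handler: request a reschedule by clearing the flag.
fn timer_tick(counter: &AtomicU32) {
    counter.fetch_and(!NO_RESCHED_BIT, Ordering::Relaxed);
}

/// Entering the rescheduling function: set the flag back.
fn enter_schedule(counter: &AtomicU32) {
    counter.fetch_or(NO_RESCHED_BIT, Ordering::Relaxed);
}
```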
Hyper-Threading
Intel introduced Hyper-Threading in February 2002, as a way to improve parallelisation.
Hyper-Threading divides physical CPU cores into logical cores. Each logical core shows up as an independent core in the MADT, with its own Local APIC ID.
Two logical cores on the same physical core share some caches, such as L1 and L2.
To make full use of Hyper-Threading, the operating system wants to preserve cache contents as much as possible, which means not moving processes to other physical cores unless necessary.
CPU topology tree
At boot, the kernel builds a CPU topology tree, which is a representation of each core and its relationship to others.
The tree usually has the following levels:
- Root
- Package (a chip/die/socket)
- Core (a physical core)
- Thread (a logical core)
If two elements share the same parent node, that means they are “close” and thus share more caches.
This tree is built using the cpuid instruction.
It is different for Intel and AMD and will not be detailed here because it is not very interesting.
This tree is used by the scheduler to determine the closest available core when enqueuing a process. We start by looking up the core that ran the process last (if any). If that core is not available, we go up the tree until we find a core that is suitable to run the process.
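A sketch of such a lookup (the node representation and the helpers are illustrative only, not Maestro’s actual code):

```rust
use alloc::vec::Vec;

type CpuId = u32;

/// A node of the CPU topology tree: root, package, core or thread.
struct TopoNode {
    /// Index of the parent node in the tree's node array (None for the root)
    parent: Option<usize>,
    /// Logical cores contained in this node's subtree
    cpus: Vec<CpuId>,
}

/// Starting from the leaf of the core that last ran the process, walk up the
/// tree until a subtree contains an available core. Cores found lower in the
/// tree share more caches with the last core.
fn closest_available(
    topo: &[TopoNode],
    last_cpu_leaf: usize,
    is_available: impl Fn(CpuId) -> bool,
) -> Option<CpuId> {
    let mut node = Some(last_cpu_leaf);
    while let Some(idx) = node {
        if let Some(&cpu) = topo[idx].cpus.iter().find(|&&c| is_available(c)) {
            return Some(cpu);
        }
        node = topo[idx].parent;
    }
    None
}
```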
What’s missing or could be improved?
- The PIT is not supported anymore (this is temporary), meaning the OS now relies on the HPET. If not present or disabled, the kernel will panic
- The x2APIC is not implemented yet (only the PIC and xAPIC are supported). This is especially important since the xAPIC only supports up to 255 CPU cores, and it is being deprecated
- P cores and E cores (Performance and Efficiency) are treated the same, which is inefficient
- There is memory contention in some places
- I am thinking of the internal kernel memory allocator in particular, which is guarded by a single global spinlock
- The mutex implementation is actually just a spinlock, meaning that a task waiting on it is not put to sleep and wastes CPU time
- Memory contention could be greatly reduced by implementing the RCU mechanism
- Interrupt handlers are shared between cores, preventing the assignment of specific interrupts to specific cores
Special thanks
I would like to thank the following people for helping me!
- Mathewnd (Website/GitHub), for his explanation of the ULE scheduler’s load balancing
- dbstream, for his explanation of critical sections
You may find us on the OSDev Discord server.
What’s next?
Now we may start porting more software to the OS. The next step is to attempt to port vim! (with all its dependencies, of course)
If you like this project, don’t forget to leave a star ⭐ on the GitHub repository! ❤️
You can also join Maestro’s Discord server to make sure you stay up to date!