|  | .. SPDX-License-Identifier: GPL-2.0 | 
|  |  | 
|  | =========== | 
|  | Page Tables | 
|  | =========== | 
|  |  | 
Paged virtual memory was invented in 1962 on the Ferranti Atlas Computer,
the first computer with paged virtual memory. The feature migrated to newer
computers and became a de facto feature of all Unix-like systems as time went
by. In 1985 it was included in the Intel 80386, the CPU that Linux 1.0 was
developed on.
|  |  | 
|  | Page tables map virtual addresses as seen by the CPU into physical addresses | 
|  | as seen on the external memory bus. | 
|  |  | 
|  | Linux defines page tables as a hierarchy which is currently five levels in | 
|  | height. The architecture code for each supported architecture will then | 
|  | map this to the restrictions of the hardware. | 
|  |  | 
The physical address corresponding to a virtual address is often referred to
by its underlying physical page frame. The **page frame number**, or **pfn**,
is the physical address of the page (as seen on the external memory bus)
divided by `PAGE_SIZE`.
|  |  | 
|  | Physical memory address 0 will be *pfn 0* and the highest pfn will be | 
|  | the last page of physical memory the external address bus of the CPU can | 
|  | address. | 
|  |  | 
With a page granularity of 4KB and an address range of 32 bits, pfn 0 is at
address 0x00000000, pfn 1 is at address 0x00001000, pfn 2 is at 0x00002000
and so on until we reach pfn 0xfffff at 0xfffff000. With 16KB pages, pfn 0 is
at 0x00000000, pfn 1 at 0x00004000, pfn 2 at 0x00008000 and so on up to
pfn 0x3ffff at 0xffffc000.
|  |  | 
As you can see, with 4KB pages the page base address uses bits 12-31 of the
address, and this is why `PAGE_SHIFT` in this case is defined as 12 and
`PAGE_SIZE` is usually defined in terms of the page shift as
`(1 << PAGE_SHIFT)`.
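
For illustration, here is a minimal userspace sketch of this arithmetic. The
constants are assumed values for a 4KB page size; in the kernel they come from
the architecture's headers::

    #include <stdio.h>

    /* Assumed values for a 4KB page size; in the kernel these are
     * provided by the architecture's page.h.
     */
    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)

    int main(void)
    {
        unsigned long paddr = 0x00002a80;

        /* The pfn is the physical address divided by PAGE_SIZE,
         * i.e. shifted right by PAGE_SHIFT.
         */
        unsigned long pfn = paddr >> PAGE_SHIFT;

        /* The page base address masks off the low PAGE_SHIFT bits. */
        unsigned long base = paddr & ~(PAGE_SIZE - 1);

        printf("pfn=%lu base=0x%08lx\n", pfn, base); /* pfn=2 base=0x00002000 */
        return 0;
    }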
|  |  | 
Over time a deeper hierarchy has been developed in response to increasing memory
sizes. When Linux was created, 4KB pages and a single page table called
`swapper_pg_dir` with 1024 entries were used, covering 4MB, which coincided with
the fact that Torvalds' first computer had 4MB of physical memory. Entries in
this single table were referred to as *PTE*:s - page table entries.
|  |  | 
|  | The software page table hierarchy reflects the fact that page table hardware has | 
|  | become hierarchical and that in turn is done to save page table memory and | 
|  | speed up mapping. | 
|  |  | 
One could of course imagine a single, linear page table with enormous amounts
of entries, breaking down the whole memory into single pages. Such a page table
would be very sparse, because large portions of the virtual memory usually
remain unused. By using hierarchical page tables, large holes in the virtual
address space do not waste valuable page table memory, because it suffices
to mark large areas as unmapped at a higher level in the page table hierarchy.
|  |  | 
|  | Additionally, on modern CPUs, a higher level page table entry can point directly | 
|  | to a physical memory range, which allows mapping a contiguous range of several | 
|  | megabytes or even gigabytes in a single high-level page table entry, taking | 
|  | shortcuts in mapping virtual memory to physical memory: there is no need to | 
|  | traverse deeper in the hierarchy when you find a large mapped range like this. | 
|  |  | 
|  | The page table hierarchy has now developed into this:: | 
|  |  | 
|  | +-----+ | 
|  | | PGD | | 
|  | +-----+ | 
|  | | | 
|  | |   +-----+ | 
|  | +-->| P4D | | 
|  | +-----+ | 
|  | | | 
|  | |   +-----+ | 
|  | +-->| PUD | | 
|  | +-----+ | 
|  | | | 
|  | |   +-----+ | 
|  | +-->| PMD | | 
|  | +-----+ | 
|  | | | 
|  | |   +-----+ | 
|  | +-->| PTE | | 
|  | +-----+ | 
|  |  | 
|  |  | 
|  | Symbols on the different levels of the page table hierarchy have the following | 
|  | meaning beginning from the bottom: | 
|  |  | 
- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
  The *pte* is an array of `PTRS_PER_PTE` elements of the `pteval_t` type, each
  mapping a single page of virtual memory to a single page of physical memory.
  The architecture defines the size and contents of `pteval_t`.

  A typical example is that the `pteval_t` is a 32- or 64-bit value with the
  upper bits being a **pfn** (page frame number), and the lower bits being some
  architecture-specific bits such as memory protection.

  The **entry** part of the name is a bit confusing because while in Linux 1.0
  this did refer to a single page table entry in the single top level page
  table, it was retrofitted to be an array of mapping elements when two-level
  page tables were first introduced, so the *pte* is the lowermost page
  *table*, not a page table *entry*.
|  |  | 
- **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**, the hierarchy right
  above the *pte*, with `PTRS_PER_PMD` references to the *pte*:s.
|  |  | 
- **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was introduced after
  the other levels to handle 4-level page tables. It is potentially unused,
  or *folded* as we will discuss later.
|  |  | 
- **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was introduced to
  handle 5-level page tables after the *pud* was introduced. By then it was
  clear that we needed to replace *pgd*, *pmd*, *pud* etc. with a figure
  indicating the directory level and that we cannot go on with ad hoc names
  any more. This is only used on systems which actually have 5 levels of
  page tables, otherwise it is folded.
|  |  | 
- **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the top of the
  hierarchy. The PGD for kernel memory is still found in `swapper_pg_dir`, but
  each userspace process in the system also has its own memory context and
  thus its own *pgd*, found in `struct mm_struct`, which in turn is referenced
  in each `struct task_struct`. So tasks have a memory context in the form of
  a `struct mm_struct` and this in turn has a `pgd_t *pgd` pointer to the
  corresponding page global directory.
|  |  | 
To repeat: each level in the page table hierarchy is an *array of pointers*, so
the **pgd** contains `PTRS_PER_PGD` pointers to the next level below, the
**p4d** contains `PTRS_PER_P4D` pointers to **pud** items and so on. The number
of pointers on each level is architecture-defined::
|  |  | 
          PMD
   --> +-----+           PTE
       | ptr |-------> +-----+
       | ptr |-        | ptr |-------> PAGE
       | ptr | \       | ptr |
       | ptr |  \        ...
       | ... |   \
       | ptr |    \         PTE
       +-----+     +----> +-----+
                          | ptr |-------> PAGE
                          | ptr |
                            ...
|  |  | 
|  |  | 
|  | Page Table Folding | 
|  | ================== | 
|  |  | 
If the architecture does not use all the page table levels, they can be
*folded*, which means skipped, and all operations performed on page tables
will be compile-time augmented to just skip a level when accessing the next
lower level.
|  |  | 
|  | Page table handling code that wishes to be architecture-neutral, such as the | 
|  | virtual memory manager, will need to be written so that it traverses all of the | 
|  | currently five levels. This style should also be preferred for | 
|  | architecture-specific code, so as to be robust to future changes. | 
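
As a sketch of what such a traversal looks like, here is a simplified,
read-only lookup in the style of the kernel's own walkers. `pgd_offset()`,
`p4d_offset()` and friends are real kernel helpers, but locking and the huge
page cases are deliberately omitted::

    /* Simplified lookup of the pte mapping a user address. On
     * configurations with folded levels the p4d/pud helpers compile
     * down to pass-throughs, so the same five-level code works
     * everywhere. The caller must pair pte_offset_map() with
     * pte_unmap().
     */
    static pte_t *walk_to_pte(struct mm_struct *mm, unsigned long addr)
    {
        pgd_t *pgd = pgd_offset(mm, addr);
        p4d_t *p4d;
        pud_t *pud;
        pmd_t *pmd;

        if (pgd_none(*pgd) || pgd_bad(*pgd))
            return NULL;
        p4d = p4d_offset(pgd, addr);
        if (p4d_none(*p4d) || p4d_bad(*p4d))
            return NULL;
        pud = pud_offset(p4d, addr);
        if (pud_none(*pud) || pud_bad(*pud))
            return NULL;
        pmd = pmd_offset(pud, addr);
        if (pmd_none(*pmd) || pmd_bad(*pmd))
            return NULL;
        return pte_offset_map(pmd, addr);
    }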
|  |  | 
|  |  | 
|  | MMU, TLB, and Page Faults | 
|  | ========================= | 
|  |  | 
|  | The `Memory Management Unit (MMU)` is a hardware component that handles virtual | 
|  | to physical address translations. It may use relatively small caches in hardware | 
|  | called `Translation Lookaside Buffers (TLBs)` and `Page Walk Caches` to speed up | 
|  | these translations. | 
|  |  | 
When the CPU accesses a memory location, it provides a virtual address to the
MMU, which checks whether an existing translation is cached in the TLB or in
the Page Walk Caches (on architectures that support them). If no translation
is found, the MMU performs a page table walk to determine the physical address
and create the mapping.
|  |  | 
Each page of memory has associated permission and dirty bits. The dirty bit
is set (i.e., turned on) when the page is written to, and indicates that the
page has been modified since it was loaded into memory.
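
In kernel code these bits are read and modified through small
per-architecture helpers on the `pte_t` type. The wrapper below is
hypothetical, but `pte_dirty()`, `pte_mkclean()` and `set_pte_at()` are real
kernel helpers; TLB maintenance is omitted from this sketch::

    /* Hypothetical helper: report whether a page was written to
     * since it was loaded, and clear its dirty bit if so.
     */
    static bool test_and_clean_pte(struct mm_struct *mm, unsigned long addr,
                                   pte_t *ptep)
    {
        pte_t pte = *ptep;

        if (!pte_dirty(pte))    /* not written to since load */
            return false;

        set_pte_at(mm, addr, ptep, pte_mkclean(pte));
        return true;
    }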
|  |  | 
|  | If nothing prevents it, eventually the physical memory can be accessed and the | 
|  | requested operation on the physical frame is performed. | 
|  |  | 
There are several reasons why the MMU can't find certain translations. It
could happen because the CPU is trying to access memory that the current task
is not permitted to access, or because the data is not present in physical
memory.
|  |  | 
When these conditions happen, the MMU triggers page faults, which are types
of exceptions that signal the CPU to pause the current execution and run a
special function to handle them.
|  |  | 
|  | There are common and expected causes of page faults. These are triggered by | 
|  | process management optimization techniques called "Lazy Allocation" and | 
|  | "Copy-on-Write". Page faults may also happen when frames have been swapped out | 
|  | to persistent storage (swap partition or file) and evicted from their physical | 
|  | locations. | 
|  |  | 
|  | These techniques improve memory efficiency, reduce latency, and minimize space | 
|  | occupation. This document won't go deeper into the details of "Lazy Allocation" | 
|  | and "Copy-on-Write" because these subjects are out of scope as they belong to | 
|  | Process Address Management. | 
|  |  | 
Swapping differs from the other techniques mentioned because, unlike them, it
is undesirable: it is performed only as a means to free memory under heavy
memory pressure.
|  |  | 
Swapping can't work for memory mapped by kernel logical addresses. These are a
subset of the kernel virtual space that directly maps a contiguous range of
physical memory. Given any logical address, its physical address is determined
with simple arithmetic on an offset. Accesses to logical addresses are fast
because they avoid the need for complex page table lookups, at the expense of
those frames not being evictable or pageable out.
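
The offset arithmetic can be sketched as below. The direct-map base address
is an assumed example value in the spirit of x86-64's `PAGE_OFFSET`; real
code uses the kernel's `__pa()` and `__va()` helpers instead::

    /* Assumed direct-map base; the real PAGE_OFFSET is architecture-
     * specific and may be randomized at boot.
     */
    #define PAGE_OFFSET 0xffff888000000000UL

    /* Logical-to-physical is plain subtraction: no page table walk
     * is needed, which is what makes these accesses fast.
     */
    static inline unsigned long logical_to_phys(unsigned long vaddr)
    {
        return vaddr - PAGE_OFFSET;
    }

    static inline unsigned long phys_to_logical(unsigned long paddr)
    {
        return paddr + PAGE_OFFSET;
    }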
|  |  | 
|  | If the kernel fails to make room for the data that must be present in the | 
|  | physical frames, the kernel invokes the out-of-memory (OOM) killer to make room | 
by terminating lower-priority processes until pressure falls below a safe
threshold.
|  |  | 
Additionally, page faults may also be caused by code bugs or by maliciously
crafted addresses that the CPU is instructed to access. A thread of a process
could use instructions to address (non-shared) memory which does not belong to
its own address space, or could try to execute an instruction that wants to
write to a read-only location.
|  |  | 
|  | If the above-mentioned conditions happen in user-space, the kernel sends a | 
|  | `Segmentation Fault` (SIGSEGV) signal to the current thread. That signal usually | 
|  | causes the termination of the thread and of the process it belongs to. | 
|  |  | 
This document is going to simplify and show a high-altitude view of how the
Linux kernel handles these page faults: how it creates tables and table
entries, checks whether memory is present and, if not, requests data to be
loaded from persistent storage or other devices, and updates the MMU and its
caches.
|  |  | 
|  | The first steps are architecture dependent. Most architectures jump to | 
|  | `do_page_fault()`, whereas the x86 interrupt handler is defined by the | 
|  | `DEFINE_IDTENTRY_RAW_ERRORCODE()` macro which calls `handle_page_fault()`. | 
|  |  | 
Whatever the route, all architectures end up invoking `handle_mm_fault()`
which, in turn, (likely) ends up calling `__handle_mm_fault()` to carry out
the actual work of allocating the page tables.
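
On x86, for example, the route looks roughly like this (simplified from
`arch/x86/mm/fault.c`)::

    exc_page_fault()              /* installed via DEFINE_IDTENTRY_RAW_ERRORCODE() */
        handle_page_fault()
            do_user_addr_fault()  /* or do_kern_addr_fault() for kernel addresses */
                handle_mm_fault()
                    __handle_mm_fault()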
|  |  | 
The unfortunate case of not being able to call `__handle_mm_fault()` means
that the virtual address points to areas of physical memory which are not
permitted to be accessed (at least from the current context). This condition
results in the kernel sending the above-mentioned SIGSEGV signal to the
process and leads to the consequences already explained.
|  |  | 
`__handle_mm_fault()` carries out its work by calling several functions to
find the offsets of the entries in the upper layers of the page tables and
to allocate the tables that it may need.
|  |  | 
The functions that look up the offsets have names like `*_offset()`, where
the "*" stands for pgd, p4d, pud, pmd or pte; the functions that allocate the
corresponding tables, layer by layer, are called `*_alloc()`, using the same
convention to name them after the corresponding types of tables in the
hierarchy.
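
A sketch of the allocation pattern, condensed from `__handle_mm_fault()` in
`mm/memory.c`; the `*_alloc()` helpers return the existing lower-level table
when one is already present, or allocate a new one::

    /* Walk down from the pgd, allocating any missing intermediate
     * tables on the way. Condensed; the real function also handles
     * huge pages and racing faults.
     */
    static int alloc_walk(struct mm_struct *mm, unsigned long address)
    {
        pgd_t *pgd = pgd_offset(mm, address);
        p4d_t *p4d = p4d_alloc(mm, pgd, address);
        pud_t *pud;
        pmd_t *pmd;

        if (!p4d)
            return -ENOMEM;
        pud = pud_alloc(mm, p4d, address);
        if (!pud)
            return -ENOMEM;
        pmd = pmd_alloc(mm, pud, address);
        if (!pmd)
            return -ENOMEM;

        /* The pte level itself is handled by handle_pte_fault(). */
        return 0;
    }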
|  |  | 
|  | The page table walk may end at one of the middle or upper layers (PMD, PUD). | 
|  |  | 
Linux supports larger page sizes than the usual 4KB (i.e., the so-called
`huge pages`). When using these larger pages, higher-level page table entries
can map them directly, with no need to use lower-level page entries (PTE).
Huge pages contain large contiguous physical regions that usually span from
2MB to 1GB; they are mapped by PMD and PUD page entries, respectively. For
example, with 4KB base pages and 512 entries per table, one PMD entry covers
512 * 4KB = 2MB and one PUD entry covers 512 * 2MB = 1GB.
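
During a walk, such a mapping shows up as a *leaf* entry at the PMD or PUD
level. Here is a hedged sketch of the check: `pud_leaf()` and `pmd_leaf()`
exist in the kernel, while the `handle_huge_*()` and `walk_ptes()` callees
are hypothetical placeholders::

    /* Hypothetical walk fragment: stop descending when a higher-level
     * entry maps memory directly instead of pointing at a lower table.
     */
    static int walk_for_huge(p4d_t *p4d, unsigned long addr)
    {
        pud_t *pud = pud_offset(p4d, addr);
        pmd_t *pmd;

        if (pud_leaf(*pud))                     /* e.g. a 1GB mapping */
            return handle_huge_pud(pud, addr);  /* hypothetical */

        pmd = pmd_offset(pud, addr);
        if (pmd_leaf(*pmd))                     /* e.g. a 2MB mapping */
            return handle_huge_pmd(pmd, addr);  /* hypothetical */

        return walk_ptes(pmd, addr);            /* hypothetical */
    }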
|  |  | 
Huge pages bring several benefits, such as reduced TLB pressure, reduced page
table overhead, more efficient memory allocation, and performance improvements
for certain workloads. However, these benefits come with trade-offs, like
wasted memory and allocation challenges.
|  |  | 
At the very end of the walk with allocations, if no errors were returned,
`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via
`do_fault()` performs one of `do_read_fault()`, `do_cow_fault()` or
`do_shared_fault()`. The names "read", "cow" and "shared" hint at the reasons
and the kinds of fault being handled, as the sketch below shows.
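
Condensed from `do_fault()` in `mm/memory.c`, where `vmf` is the
`struct vm_fault` describing the fault, the dispatch boils down to::

    /* Pick the handler from the access direction and whether the
     * mapping is shared between processes.
     */
    if (!(vmf->flags & FAULT_FLAG_WRITE))
        ret = do_read_fault(vmf);       /* read of a missing page */
    else if (!(vmf->vma->vm_flags & VM_SHARED))
        ret = do_cow_fault(vmf);        /* private mapping: copy-on-write */
    else
        ret = do_shared_fault(vmf);     /* shared, writable mapping */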
|  |  | 
|  | The actual implementation of the workflow is very complex. Its design allows | 
|  | Linux to handle page faults in a way that is tailored to the specific | 
|  | characteristics of each architecture, while still sharing a common overall | 
|  | structure. | 
|  |  | 
To conclude this high-altitude view of how Linux handles page faults, let's
add that the page fault handler can be disabled and enabled respectively with
`pagefault_disable()` and `pagefault_enable()`.
|  |  | 
Several code paths make use of the latter two functions because they need to
disable traps into the page fault handler, mostly to prevent deadlocks.
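
A common pattern looks like this sketch; `__copy_from_user_inatomic()` is a
real kernel helper which, with page faults disabled, returns the number of
bytes left uncopied rather than faulting the page in::

    /* Sketch: try a non-faulting copy first, then fall back to a
     * faulting one. Both copy helpers return the number of bytes
     * left uncopied.
     */
    static int copy_in_sketch(void *dst, const void __user *src, size_t size)
    {
        unsigned long left;

        pagefault_disable();
        left = __copy_from_user_inatomic(dst, src, size);
        pagefault_enable();

        if (left)
            /* Retry where faulting (and thus sleeping) is allowed. */
            left = copy_from_user(dst, src, size);

        return left ? -EFAULT : 0;
    }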