|  | .. SPDX-License-Identifier: GPL-2.0 | 
|  | .. _xfs_online_fsck_design: | 
|  |  | 
|  | .. | 
|  | Mapping of heading styles within this document: | 
|  | Heading 1 uses "====" above and below | 
|  | Heading 2 uses "====" | 
|  | Heading 3 uses "----" | 
|  | Heading 4 uses "````" | 
|  | Heading 5 uses "^^^^" | 
|  | Heading 6 uses "~~~~" | 
|  | Heading 7 uses "...." | 
|  |  | 
|  | Sections are manually numbered because apparently that's what everyone | 
|  | does in the kernel. | 
|  |  | 
|  | ====================== | 
|  | XFS Online Fsck Design | 
|  | ====================== | 
|  |  | 
|  | This document captures the design of the online filesystem check feature for | 
|  | XFS. | 
|  | The purpose of this document is threefold: | 
|  |  | 
|  | - To help kernel distributors understand exactly what the XFS online fsck | 
|  | feature is, and issues about which they should be aware. | 
|  |  | 
|  | - To help people reading the code to familiarize themselves with the relevant | 
|  | concepts and design points before they start digging into the code. | 
|  |  | 
|  | - To help developers maintaining the system by capturing the reasons | 
|  | supporting higher level decision making. | 
|  |  | 
|  | As the online fsck code is merged, the links in this document to topic branches | 
|  | will be replaced with links to code. | 
|  |  | 
|  | This document is licensed under the terms of the GNU Public License, v2. | 
|  | The primary author is Darrick J. Wong. | 
|  |  | 
|  | This design document is split into seven parts. | 
|  | Part 1 defines what fsck tools are and the motivations for writing a new one. | 
|  | Parts 2 and 3 present a high level overview of how online fsck process works | 
|  | and how it is tested to ensure correct functionality. | 
|  | Part 4 discusses the user interface and the intended usage modes of the new | 
|  | program. | 
|  | Parts 5 and 6 show off the high level components and how they fit together, and | 
|  | then present case studies of how each repair function actually works. | 
|  | Part 7 sums up what has been discussed so far and speculates about what else | 
|  | might be built atop online fsck. | 
|  |  | 
|  | .. contents:: Table of Contents | 
|  | :local: | 
|  |  | 
|  | 1. What is a Filesystem Check? | 
|  | ============================== | 
|  |  | 
|  | A Unix filesystem has four main responsibilities: | 
|  |  | 
|  | - Provide a hierarchy of names through which application programs can associate | 
|  | arbitrary blobs of data for any length of time, | 
|  |  | 
|  | - Virtualize physical storage media across those names, and | 
|  |  | 
|  | - Retrieve the named data blobs at any time. | 
|  |  | 
|  | - Examine resource usage. | 
|  |  | 
|  | Metadata directly supporting these functions (e.g. files, directories, space | 
|  | mappings) are sometimes called primary metadata. | 
|  | Secondary metadata (e.g. reverse mapping and directory parent pointers) support | 
|  | operations internal to the filesystem, such as internal consistency checking | 
|  | and reorganization. | 
|  | Summary metadata, as the name implies, condense information contained in | 
|  | primary metadata for performance reasons. | 
|  |  | 
|  | The filesystem check (fsck) tool examines all the metadata in a filesystem | 
|  | to look for errors. | 
|  | In addition to looking for obvious metadata corruptions, fsck also | 
|  | cross-references different types of metadata records with each other to look | 
|  | for inconsistencies. | 
|  | People do not like losing data, so most fsck tools also contains some ability | 
|  | to correct any problems found. | 
|  | As a word of caution -- the primary goal of most Linux fsck tools is to restore | 
|  | the filesystem metadata to a consistent state, not to maximize the data | 
|  | recovered. | 
|  | That precedent will not be challenged here. | 
|  |  | 
|  | Filesystems of the 20th century generally lacked any redundancy in the ondisk | 
|  | format, which means that fsck can only respond to errors by erasing files until | 
|  | errors are no longer detected. | 
|  | More recent filesystem designs contain enough redundancy in their metadata that | 
|  | it is now possible to regenerate data structures when non-catastrophic errors | 
|  | occur; this capability aids both strategies. | 
|  |  | 
|  | +--------------------------------------------------------------------------+ | 
|  | | **Note**:                                                                | | 
|  | +--------------------------------------------------------------------------+ | 
|  | | System administrators avoid data loss by increasing the number of        | | 
|  | | separate storage systems through the creation of backups; and they avoid | | 
|  | | downtime by increasing the redundancy of each storage system through the | | 
|  | | creation of RAID arrays.                                                 | | 
|  | | fsck tools address only the first problem.                               | | 
|  | +--------------------------------------------------------------------------+ | 
|  |  | 
|  | TLDR; Show Me the Code! | 
|  | ----------------------- | 
|  |  | 
|  | Code is posted to the kernel.org git trees as follows: | 
|  | `kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_, | 
|  | `userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and | 
|  | `QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_. | 
|  | Each kernel patchset adding an online repair function will use the same branch | 
|  | name across the kernel, xfsprogs, and fstests git repos. | 
|  |  | 
|  | Existing Tools | 
|  | -------------- | 
|  |  | 
|  | The online fsck tool described here will be the third tool in the history of | 
|  | XFS (on Linux) to check and repair filesystems. | 
|  | Two programs precede it: | 
|  |  | 
|  | The first program, ``xfs_check``, was created as part of the XFS debugger | 
|  | (``xfs_db``) and can only be used with unmounted filesystems. | 
|  | It walks all metadata in the filesystem looking for inconsistencies in the | 
|  | metadata, though it lacks any ability to repair what it finds. | 
|  | Due to its high memory requirements and inability to repair things, this | 
|  | program is now deprecated and will not be discussed further. | 
|  |  | 
|  | The second program, ``xfs_repair``, was created to be faster and more robust | 
|  | than the first program. | 
|  | Like its predecessor, it can only be used with unmounted filesystems. | 
|  | It uses extent-based in-memory data structures to reduce memory consumption, | 
|  | and tries to schedule readahead IO appropriately to reduce I/O waiting time | 
|  | while it scans the metadata of the entire filesystem. | 
|  | The most important feature of this tool is its ability to respond to | 
|  | inconsistencies in file metadata and directory tree by erasing things as needed | 
|  | to eliminate problems. | 
|  | Space usage metadata are rebuilt from the observed file metadata. | 
|  |  | 
|  | Problem Statement | 
|  | ----------------- | 
|  |  | 
|  | The current XFS tools leave several problems unsolved: | 
|  |  | 
|  | 1. **User programs** suddenly **lose access** to the filesystem when unexpected | 
|  | shutdowns occur as a result of silent corruptions in the metadata. | 
|  | These occur **unpredictably** and often without warning. | 
|  |  | 
|  | 2. **Users** experience a **total loss of service** during the recovery period | 
|  | after an **unexpected shutdown** occurs. | 
|  |  | 
|  | 3. **Users** experience a **total loss of service** if the filesystem is taken | 
|  | offline to **look for problems** proactively. | 
|  |  | 
|  | 4. **Data owners** cannot **check the integrity** of their stored data without | 
|  | reading all of it. | 
|  | This may expose them to substantial billing costs when a linear media scan | 
|  | performed by the storage system administrator might suffice. | 
|  |  | 
|  | 5. **System administrators** cannot **schedule** a maintenance window to deal | 
|  | with corruptions if they **lack the means** to assess filesystem health | 
|  | while the filesystem is online. | 
|  |  | 
|  | 6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem | 
|  | health when doing so requires **manual intervention** and downtime. | 
|  |  | 
|  | 7. **Users** can be tricked into **doing things they do not desire** when | 
|  | malicious actors **exploit quirks of Unicode** to place misleading names | 
|  | in directories. | 
|  |  | 
|  | Given this definition of the problems to be solved and the actors who would | 
|  | benefit, the proposed solution is a third fsck tool that acts on a running | 
|  | filesystem. | 
|  |  | 
|  | This new third program has three components: an in-kernel facility to check | 
|  | metadata, an in-kernel facility to repair metadata, and a userspace driver | 
|  | program to drive fsck activity on a live filesystem. | 
|  | ``xfs_scrub`` is the name of the driver program. | 
|  | The rest of this document presents the goals and use cases of the new fsck | 
|  | tool, describes its major design points in connection to those goals, and | 
|  | discusses the similarities and differences with existing tools. | 
|  |  | 
|  | +--------------------------------------------------------------------------+ | 
|  | | **Note**:                                                                | | 
|  | +--------------------------------------------------------------------------+ | 
|  | | Throughout this document, the existing offline fsck tool can also be     | | 
|  | | referred to by its current name "``xfs_repair``".                        | | 
|  | | The userspace driver program for the new online fsck tool can be         | | 
|  | | referred to as "``xfs_scrub``".                                          | | 
|  | | The kernel portion of online fsck that validates metadata is called      | | 
|  | | "online scrub", and portion of the kernel that fixes metadata is called  | | 
|  | | "online repair".                                                         | | 
|  | +--------------------------------------------------------------------------+ | 
|  |  | 
|  | The naming hierarchy is broken up into objects known as directories and files | 
|  | and the physical space is split into pieces known as allocation groups. | 
|  | Sharding enables better performance on highly parallel systems and helps to | 
|  | contain the damage when corruptions occur. | 
|  | The division of the filesystem into principal objects (allocation groups and | 
|  | inodes) means that there are ample opportunities to perform targeted checks and | 
|  | repairs on a subset of the filesystem. | 
|  |  | 
|  | While this is going on, other parts continue processing IO requests. | 
|  | Even if a piece of filesystem metadata can only be regenerated by scanning the | 
|  | entire system, the scan can still be done in the background while other file | 
|  | operations continue. | 
|  |  | 
|  | In summary, online fsck takes advantage of resource sharding and redundant | 
|  | metadata to enable targeted checking and repair operations while the system | 
|  | is running. | 
|  | This capability will be coupled to automatic system management so that | 
|  | autonomous self-healing of XFS maximizes service availability. | 
|  |  | 
|  | 2. Theory of Operation | 
|  | ====================== | 
|  |  | 
|  | Because it is necessary for online fsck to lock and scan live metadata objects, | 
|  | online fsck consists of three separate code components. | 
|  | The first is the userspace driver program ``xfs_scrub``, which is responsible | 
|  | for identifying individual metadata items, scheduling work items for them, | 
|  | reacting to the outcomes appropriately, and reporting results to the system | 
|  | administrator. | 
|  | The second and third are in the kernel, which implements functions to check | 
|  | and repair each type of online fsck work item. | 
|  |  | 
|  | +------------------------------------------------------------------+ | 
|  | | **Note**:                                                        | | 
|  | +------------------------------------------------------------------+ | 
|  | | For brevity, this document shortens the phrase "online fsck work | | 
|  | | item" to "scrub item".                                           | | 
|  | +------------------------------------------------------------------+ | 
|  |  | 
|  | Scrub item types are delineated in a manner consistent with the Unix design | 
|  | philosophy, which is to say that each item should handle one aspect of a | 
|  | metadata structure, and handle it well. | 
|  |  | 
|  | Scope | 
|  | ----- | 
|  |  | 
|  | In principle, online fsck should be able to check and to repair everything that | 
|  | the offline fsck program can handle. | 
|  | However, online fsck cannot be running 100% of the time, which means that | 
|  | latent errors may creep in after a scrub completes. | 
|  | If these errors cause the next mount to fail, offline fsck is the only | 
|  | solution. | 
|  | This limitation means that maintenance of the offline fsck tool will continue. | 
|  | A second limitation of online fsck is that it must follow the same resource | 
|  | sharing and lock acquisition rules as the regular filesystem. | 
|  | This means that scrub cannot take *any* shortcuts to save time, because doing | 
|  | so could lead to concurrency problems. | 
|  | In other words, online fsck is not a complete replacement for offline fsck, and | 
|  | a complete run of online fsck may take longer than online fsck. | 
|  | However, both of these limitations are acceptable tradeoffs to satisfy the | 
|  | different motivations of online fsck, which are to **minimize system downtime** | 
|  | and to **increase predictability of operation**. | 
|  |  | 
|  | .. _scrubphases: | 
|  |  | 
|  | Phases of Work | 
|  | -------------- | 
|  |  | 
|  | The userspace driver program ``xfs_scrub`` splits the work of checking and | 
|  | repairing an entire filesystem into seven phases. | 
|  | Each phase concentrates on checking specific types of scrub items and depends | 
|  | on the success of all previous phases. | 
|  | The seven phases are as follows: | 
|  |  | 
|  | 1. Collect geometry information about the mounted filesystem and computer, | 
|  | discover the online fsck capabilities of the kernel, and open the | 
|  | underlying storage devices. | 
|  |  | 
|  | 2. Check allocation group metadata, all realtime volume metadata, and all quota | 
|  | files. | 
|  | Each metadata structure is scheduled as a separate scrub item. | 
|  | If corruption is found in the inode header or inode btree and ``xfs_scrub`` | 
|  | is permitted to perform repairs, then those scrub items are repaired to | 
|  | prepare for phase 3. | 
|  | Repairs are implemented by using the information in the scrub item to | 
|  | resubmit the kernel scrub call with the repair flag enabled; this is | 
|  | discussed in the next section. | 
|  | Optimizations and all other repairs are deferred to phase 4. | 
|  |  | 
|  | 3. Check all metadata of every file in the filesystem. | 
|  | Each metadata structure is also scheduled as a separate scrub item. | 
|  | If repairs are needed and ``xfs_scrub`` is permitted to perform repairs, | 
|  | and there were no problems detected during phase 2, then those scrub items | 
|  | are repaired immediately. | 
|  | Optimizations, deferred repairs, and unsuccessful repairs are deferred to | 
|  | phase 4. | 
|  |  | 
|  | 4. All remaining repairs and scheduled optimizations are performed during this | 
|  | phase, if the caller permits them. | 
|  | Before starting repairs, the summary counters are checked and any necessary | 
|  | repairs are performed so that subsequent repairs will not fail the resource | 
|  | reservation step due to wildly incorrect summary counters. | 
|  | Unsuccessful repairs are requeued as long as forward progress on repairs is | 
|  | made somewhere in the filesystem. | 
|  | Free space in the filesystem is trimmed at the end of phase 4 if the | 
|  | filesystem is clean. | 
|  |  | 
|  | 5. By the start of this phase, all primary and secondary filesystem metadata | 
|  | must be correct. | 
|  | Summary counters such as the free space counts and quota resource counts | 
|  | are checked and corrected. | 
|  | Directory entry names and extended attribute names are checked for | 
|  | suspicious entries such as control characters or confusing Unicode sequences | 
|  | appearing in names. | 
|  |  | 
|  | 6. If the caller asks for a media scan, read all allocated and written data | 
|  | file extents in the filesystem. | 
|  | The ability to use hardware-assisted data file integrity checking is new | 
|  | to online fsck; neither of the previous tools have this capability. | 
|  | If media errors occur, they will be mapped to the owning files and reported. | 
|  |  | 
|  | 7. Re-check the summary counters and presents the caller with a summary of | 
|  | space usage and file counts. | 
|  |  | 
|  | This allocation of responsibilities will be :ref:`revisited <scrubcheck>` | 
|  | later in this document. | 
|  |  | 
|  | Steps for Each Scrub Item | 
|  | ------------------------- | 
|  |  | 
|  | The kernel scrub code uses a three-step strategy for checking and repairing | 
|  | the one aspect of a metadata object represented by a scrub item: | 
|  |  | 
|  | 1. The scrub item of interest is checked for corruptions; opportunities for | 
|  | optimization; and for values that are directly controlled by the system | 
|  | administrator but look suspicious. | 
|  | If the item is not corrupt or does not need optimization, resource are | 
|  | released and the positive scan results are returned to userspace. | 
|  | If the item is corrupt or could be optimized but the caller does not permit | 
|  | this, resources are released and the negative scan results are returned to | 
|  | userspace. | 
|  | Otherwise, the kernel moves on to the second step. | 
|  |  | 
|  | 2. The repair function is called to rebuild the data structure. | 
|  | Repair functions generally choose rebuild a structure from other metadata | 
|  | rather than try to salvage the existing structure. | 
|  | If the repair fails, the scan results from the first step are returned to | 
|  | userspace. | 
|  | Otherwise, the kernel moves on to the third step. | 
|  |  | 
|  | 3. In the third step, the kernel runs the same checks over the new metadata | 
|  | item to assess the efficacy of the repairs. | 
|  | The results of the reassessment are returned to userspace. | 
|  |  | 
|  | Classification of Metadata | 
|  | -------------------------- | 
|  |  | 
|  | Each type of metadata object (and therefore each type of scrub item) is | 
|  | classified as follows: | 
|  |  | 
|  | Primary Metadata | 
|  | ```````````````` | 
|  |  | 
|  | Metadata structures in this category should be most familiar to filesystem | 
|  | users either because they are directly created by the user or they index | 
|  | objects created by the user | 
|  | Most filesystem objects fall into this class: | 
|  |  | 
|  | - Free space and reference count information | 
|  |  | 
|  | - Inode records and indexes | 
|  |  | 
|  | - Storage mapping information for file data | 
|  |  | 
|  | - Directories | 
|  |  | 
|  | - Extended attributes | 
|  |  | 
|  | - Symbolic links | 
|  |  | 
|  | - Quota limits | 
|  |  | 
|  | Scrub obeys the same rules as regular filesystem accesses for resource and lock | 
|  | acquisition. | 
|  |  | 
|  | Primary metadata objects are the simplest for scrub to process. | 
|  | The principal filesystem object (either an allocation group or an inode) that | 
|  | owns the item being scrubbed is locked to guard against concurrent updates. | 
|  | The check function examines every record associated with the type for obvious | 
|  | errors and cross-references healthy records against other metadata to look for | 
|  | inconsistencies. | 
|  | Repairs for this class of scrub item are simple, since the repair function | 
|  | starts by holding all the resources acquired in the previous step. | 
|  | The repair function scans available metadata as needed to record all the | 
|  | observations needed to complete the structure. | 
|  | Next, it stages the observations in a new ondisk structure and commits it | 
|  | atomically to complete the repair. | 
|  | Finally, the storage from the old data structure are carefully reaped. | 
|  |  | 
|  | Because ``xfs_scrub`` locks a primary object for the duration of the repair, | 
|  | this is effectively an offline repair operation performed on a subset of the | 
|  | filesystem. | 
|  | This minimizes the complexity of the repair code because it is not necessary to | 
|  | handle concurrent updates from other threads, nor is it necessary to access | 
|  | any other part of the filesystem. | 
|  | As a result, indexed structures can be rebuilt very quickly, and programs | 
|  | trying to access the damaged structure will be blocked until repairs complete. | 
|  | The only infrastructure needed by the repair code are the staging area for | 
|  | observations and a means to write new structures to disk. | 
|  | Despite these limitations, the advantage that online repair holds is clear: | 
|  | targeted work on individual shards of the filesystem avoids total loss of | 
|  | service. | 
|  |  | 
|  | This mechanism is described in section 2.1 ("Off-Line Algorithm") of | 
|  | V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction | 
|  | Algorithms" <https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_, | 
|  | *Extending Database Technology*, pp. 293-309, 1992. | 
|  |  | 
|  | Most primary metadata repair functions stage their intermediate results in an | 
|  | in-memory array prior to formatting the new ondisk structure, which is very | 
|  | similar to the list-based algorithm discussed in section 2.3 ("List-Based | 
|  | Algorithms") of Srinivasan. | 
|  | However, any data structure builder that maintains a resource lock for the | 
|  | duration of the repair is *always* an offline algorithm. | 
|  |  | 
|  | .. _secondary_metadata: | 
|  |  | 
|  | Secondary Metadata | 
|  | `````````````````` | 
|  |  | 
|  | Metadata structures in this category reflect records found in primary metadata, | 
|  | but are only needed for online fsck or for reorganization of the filesystem. | 
|  |  | 
|  | Secondary metadata include: | 
|  |  | 
|  | - Reverse mapping information | 
|  |  | 
|  | - Directory parent pointers | 
|  |  | 
|  | This class of metadata is difficult for scrub to process because scrub attaches | 
|  | to the secondary object but needs to check primary metadata, which runs counter | 
|  | to the usual order of resource acquisition. | 
|  | Frequently, this means that full filesystems scans are necessary to rebuild the | 
|  | metadata. | 
|  | Check functions can be limited in scope to reduce runtime. | 
|  | Repairs, however, require a full scan of primary metadata, which can take a | 
|  | long time to complete. | 
|  | Under these conditions, ``xfs_scrub`` cannot lock resources for the entire | 
|  | duration of the repair. | 
|  |  | 
|  | Instead, repair functions set up an in-memory staging structure to store | 
|  | observations. | 
|  | Depending on the requirements of the specific repair function, the staging | 
|  | index will either have the same format as the ondisk structure or a design | 
|  | specific to that repair function. | 
|  | The next step is to release all locks and start the filesystem scan. | 
|  | When the repair scanner needs to record an observation, the staging data are | 
|  | locked long enough to apply the update. | 
|  | While the filesystem scan is in progress, the repair function hooks the | 
|  | filesystem so that it can apply pending filesystem updates to the staging | 
|  | information. | 
|  | Once the scan is done, the owning object is re-locked, the live data is used to | 
|  | write a new ondisk structure, and the repairs are committed atomically. | 
|  | The hooks are disabled and the staging staging area is freed. | 
|  | Finally, the storage from the old data structure are carefully reaped. | 
|  |  | 
|  | Introducing concurrency helps online repair avoid various locking problems, but | 
|  | comes at a high cost to code complexity. | 
|  | Live filesystem code has to be hooked so that the repair function can observe | 
|  | updates in progress. | 
|  | The staging area has to become a fully functional parallel structure so that | 
|  | updates can be merged from the hooks. | 
|  | Finally, the hook, the filesystem scan, and the inode locking model must be | 
|  | sufficiently well integrated that a hook event can decide if a given update | 
|  | should be applied to the staging structure. | 
|  |  | 
|  | In theory, the scrub implementation could apply these same techniques for | 
|  | primary metadata, but doing so would make it massively more complex and less | 
|  | performant. | 
|  | Programs attempting to access the damaged structures are not blocked from | 
|  | operation, which may cause application failure or an unplanned filesystem | 
|  | shutdown. | 
|  |  | 
|  | Inspiration for the secondary metadata repair strategy was drawn from section | 
|  | 2.4 of Srinivasan above, and sections 2 ("NSF: Inded Build Without Side-File") | 
|  | and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for | 
|  | Creating Indexes for Very Large Tables Without Quiescing Updates" | 
|  | <https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992. | 
|  |  | 
|  | The sidecar index mentioned above bears some resemblance to the side file | 
|  | method mentioned in Srinivasan and Mohan. | 
|  | Their method consists of an index builder that extracts relevant record data to | 
|  | build the new structure as quickly as possible; and an auxiliary structure that | 
|  | captures all updates that would be committed to the index by other threads were | 
|  | the new index already online. | 
|  | After the index building scan finishes, the updates recorded in the side file | 
|  | are applied to the new index. | 
|  | To avoid conflicts between the index builder and other writer threads, the | 
|  | builder maintains a publicly visible cursor that tracks the progress of the | 
|  | scan through the record space. | 
|  | To avoid duplication of work between the side file and the index builder, side | 
|  | file updates are elided when the record ID for the update is greater than the | 
|  | cursor position within the record ID space. | 
|  |  | 
|  | To minimize changes to the rest of the codebase, XFS online repair keeps the | 
|  | replacement index hidden until it's completely ready to go. | 
|  | In other words, there is no attempt to expose the keyspace of the new index | 
|  | while repair is running. | 
|  | The complexity of such an approach would be very high and perhaps more | 
|  | appropriate to building *new* indices. | 
|  |  | 
|  | **Future Work Question**: Can the full scan and live update code used to | 
|  | facilitate a repair also be used to implement a comprehensive check? | 
|  |  | 
|  | *Answer*: In theory, yes.  Check would be much stronger if each scrub function | 
|  | employed these live scans to build a shadow copy of the metadata and then | 
|  | compared the shadow records to the ondisk records. | 
|  | However, doing that is a fair amount more work than what the checking functions | 
|  | do now. | 
|  | The live scans and hooks were developed much later. | 
|  | That in turn increases the runtime of those scrub functions. | 
|  |  | 
|  | Summary Information | 
|  | ``````````````````` | 
|  |  | 
|  | Metadata structures in this last category summarize the contents of primary | 
|  | metadata records. | 
|  | These are often used to speed up resource usage queries, and are many times | 
|  | smaller than the primary metadata which they represent. | 
|  |  | 
|  | Examples of summary information include: | 
|  |  | 
|  | - Summary counts of free space and inodes | 
|  |  | 
|  | - File link counts from directories | 
|  |  | 
|  | - Quota resource usage counts | 
|  |  | 
|  | Check and repair require full filesystem scans, but resource and lock | 
|  | acquisition follow the same paths as regular filesystem accesses. | 
|  |  | 
|  | The superblock summary counters have special requirements due to the underlying | 
|  | implementation of the incore counters, and will be treated separately. | 
|  | Check and repair of the other types of summary counters (quota resource counts | 
|  | and file link counts) employ the same filesystem scanning and hooking | 
|  | techniques as outlined above, but because the underlying data are sets of | 
|  | integer counters, the staging data need not be a fully functional mirror of the | 
|  | ondisk structure. | 
|  |  | 
|  | Inspiration for quota and file link count repair strategies were drawn from | 
|  | sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View | 
|  | Maintenance") of G.  Graefe, `"Concurrent Queries and Updates in Summary Views | 
|  | and Their Indexes" | 
|  | <http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011. | 
|  |  | 
|  | Since quotas are non-negative integer counts of resource usage, online | 
|  | quotacheck can use the incremental view deltas described in section 2.14 to | 
|  | track pending changes to the block and inode usage counts in each transaction, | 
|  | and commit those changes to a dquot side file when the transaction commits. | 
|  | Delta tracking is necessary for dquots because the index builder scans inodes, | 
|  | whereas the data structure being rebuilt is an index of dquots. | 
|  | Link count checking combines the view deltas and commit step into one because | 
|  | it sets attributes of the objects being scanned instead of writing them to a | 
|  | separate data structure. | 
|  | Each online fsck function will be discussed as case studies later in this | 
|  | document. | 
|  |  | 
|  | Risk Management | 
|  | --------------- | 
|  |  | 
|  | During the development of online fsck, several risk factors were identified | 
|  | that may make the feature unsuitable for certain distributors and users. | 
|  | Steps can be taken to mitigate or eliminate those risks, though at a cost to | 
|  | functionality. | 
|  |  | 
|  | - **Decreased performance**: Adding metadata indices to the filesystem | 
|  | increases the time cost of persisting changes to disk, and the reverse space | 
|  | mapping and directory parent pointers are no exception. | 
|  | System administrators who require the maximum performance can disable the | 
|  | reverse mapping features at format time, though this choice dramatically | 
|  | reduces the ability of online fsck to find inconsistencies and repair them. | 
|  |  | 
|  | - **Incorrect repairs**: As with all software, there might be defects in the | 
|  | software that result in incorrect repairs being written to the filesystem. | 
|  | Systematic fuzz testing (detailed in the next section) is employed by the | 
|  | authors to find bugs early, but it might not catch everything. | 
|  | The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB`` | 
|  | and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to | 
|  | accept this risk. | 
|  | The xfsprogs build system has a configure option (``--enable-scrub=no``) that | 
|  | disables building of the ``xfs_scrub`` binary, though this is not a risk | 
|  | mitigation if the kernel functionality remains enabled. | 
|  |  | 
|  | - **Inability to repair**: Sometimes, a filesystem is too badly damaged to be | 
|  | repairable. | 
|  | If the keyspaces of several metadata indices overlap in some manner but a | 
|  | coherent narrative cannot be formed from records collected, then the repair | 
|  | fails. | 
|  | To reduce the chance that a repair will fail with a dirty transaction and | 
|  | render the filesystem unusable, the online repair functions have been | 
|  | designed to stage and validate all new records before committing the new | 
|  | structure. | 
|  |  | 
|  | - **Misbehavior**: Online fsck requires many privileges -- raw IO to block | 
|  | devices, opening files by handle, ignoring Unix discretionary access control, | 
|  | and the ability to perform administrative changes. | 
|  | Running this automatically in the background scares people, so the systemd | 
|  | background service is configured to run with only the privileges required. | 
|  | Obviously, this cannot address certain problems like the kernel crashing or | 
|  | deadlocking, but it should be sufficient to prevent the scrub process from | 
|  | escaping and reconfiguring the system. | 
|  | The cron job does not have this protection. | 
|  |  | 
|  | - **Fuzz Kiddiez**: There are many people now who seem to think that running | 
|  | automated fuzz testing of ondisk artifacts to find mischievous behavior and | 
|  | spraying exploit code onto the public mailing list for instant zero-day | 
|  | disclosure is somehow of some social benefit. | 
|  | In the view of this author, the benefit is realized only when the fuzz | 
|  | operators help to **fix** the flaws, but this opinion apparently is not | 
|  | widely shared among security "researchers". | 
|  | The XFS maintainers' continuing ability to manage these events presents an | 
|  | ongoing risk to the stability of the development process. | 
|  | Automated testing should front-load some of the risk while the feature is | 
|  | considered EXPERIMENTAL. | 
|  |  | 
|  | Many of these risks are inherent to software programming. | 
|  | Despite this, it is hoped that this new functionality will prove useful in | 
|  | reducing unexpected downtime. | 
|  |  | 
|  | 3. Testing Plan | 
|  | =============== | 
|  |  | 
|  | As stated before, fsck tools have three main goals: | 
|  |  | 
|  | 1. Detect inconsistencies in the metadata; | 
|  |  | 
|  | 2. Eliminate those inconsistencies; and | 
|  |  | 
|  | 3. Minimize further loss of data. | 
|  |  | 
|  | Demonstrations of correct operation are necessary to build users' confidence | 
|  | that the software behaves within expectations. | 
|  | Unfortunately, it was not really feasible to perform regular exhaustive testing | 
|  | of every aspect of a fsck tool until the introduction of low-cost virtual | 
|  | machines with high-IOPS storage. | 
|  | With ample hardware availability in mind, the testing strategy for the online | 
|  | fsck project involves differential analysis against the existing fsck tools and | 
|  | systematic testing of every attribute of every type of metadata object. | 
|  | Testing can be split into four major categories, as discussed below. | 
|  |  | 
|  | Integrated Testing with fstests | 
|  | ------------------------------- | 
|  |  | 
|  | The primary goal of any free software QA effort is to make testing as | 
|  | inexpensive and widespread as possible to maximize the scaling advantages of | 
|  | community. | 
|  | In other words, testing should maximize the breadth of filesystem configuration | 
|  | scenarios and hardware setups. | 
|  | This improves code quality by enabling the authors of online fsck to find and | 
|  | fix bugs early, and helps developers of new features to find integration | 
|  | issues earlier in their development effort. | 
|  |  | 
|  | The Linux filesystem community shares a common QA testing suite, | 
|  | `fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for | 
|  | functional and regression testing. | 
|  | Even before development work began on online fsck, fstests (when run on XFS) | 
|  | would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and | 
|  | scratch filesystems between each test. | 
|  | This provides a level of assurance that the kernel and the fsck tools stay in | 
|  | alignment about what constitutes consistent metadata. | 
|  | During development of the online checking code, fstests was modified to run | 
|  | ``xfs_scrub -n`` between each test to ensure that the new checking code | 
|  | produces the same results as the two existing fsck tools. | 
|  |  | 
|  | To start development of online repair, fstests was modified to run | 
|  | ``xfs_repair`` to rebuild the filesystem's metadata indices between tests. | 
|  | This ensures that offline repair does not crash, leave a corrupt filesystem | 
|  | after it exists, or trigger complaints from the online check. | 
|  | This also established a baseline for what can and cannot be repaired offline. | 
|  | To complete the first phase of development of online repair, fstests was | 
|  | modified to be able to run ``xfs_scrub`` in a "force rebuild" mode. | 
|  | This enables a comparison of the effectiveness of online repair as compared to | 
|  | the existing offline repair tools. | 
|  |  | 
|  | General Fuzz Testing of Metadata Blocks | 
|  | --------------------------------------- | 
|  |  | 
|  | XFS benefits greatly from having a very robust debugging tool, ``xfs_db``. | 
|  |  | 
|  | Before development of online fsck even began, a set of fstests were created | 
|  | to test the rather common fault that entire metadata blocks get corrupted. | 
|  | This required the creation of fstests library code that can create a filesystem | 
|  | containing every possible type of metadata object. | 
|  | Next, individual test cases were created to create a test filesystem, identify | 
|  | a single block of a specific type of metadata object, trash it with the | 
|  | existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a | 
|  | particular metadata validation strategy. | 
|  |  | 
|  | This earlier test suite enabled XFS developers to test the ability of the | 
|  | in-kernel validation functions and the ability of the offline fsck tool to | 
|  | detect and eliminate the inconsistent metadata. | 
|  | This part of the test suite was extended to cover online fsck in exactly the | 
|  | same manner. | 
|  |  | 
|  | In other words, for a given fstests filesystem configuration: | 
|  |  | 
|  | * For each metadata object existing on the filesystem: | 
|  |  | 
|  | * Write garbage to it | 
|  |  | 
|  | * Test the reactions of: | 
|  |  | 
|  | 1. The kernel verifiers to stop obviously bad metadata | 
|  | 2. Offline repair (``xfs_repair``) to detect and fix | 
|  | 3. Online repair (``xfs_scrub``) to detect and fix | 
|  |  | 
|  | Targeted Fuzz Testing of Metadata Records | 
|  | ----------------------------------------- | 
|  |  | 
|  | The testing plan for online fsck includes extending the existing fs testing | 
|  | infrastructure to provide a much more powerful facility: targeted fuzz testing | 
|  | of every metadata field of every metadata object in the filesystem. | 
|  | ``xfs_db`` can modify every field of every metadata structure in every | 
|  | block in the filesystem to simulate the effects of memory corruption and | 
|  | software bugs. | 
|  | Given that fstests already contains the ability to create a filesystem | 
|  | containing every metadata format known to the filesystem, ``xfs_db`` can be | 
|  | used to perform exhaustive fuzz testing! | 
|  |  | 
|  | For a given fstests filesystem configuration: | 
|  |  | 
|  | * For each metadata object existing on the filesystem... | 
|  |  | 
|  | * For each record inside that metadata object... | 
|  |  | 
|  | * For each field inside that record... | 
|  |  | 
|  | * For each conceivable type of transformation that can be applied to a bit field... | 
|  |  | 
|  | 1. Clear all bits | 
|  | 2. Set all bits | 
|  | 3. Toggle the most significant bit | 
|  | 4. Toggle the middle bit | 
|  | 5. Toggle the least significant bit | 
|  | 6. Add a small quantity | 
|  | 7. Subtract a small quantity | 
|  | 8. Randomize the contents | 
|  |  | 
|  | * ...test the reactions of: | 
|  |  | 
|  | 1. The kernel verifiers to stop obviously bad metadata | 
|  | 2. Offline checking (``xfs_repair -n``) | 
|  | 3. Offline repair (``xfs_repair``) | 
|  | 4. Online checking (``xfs_scrub -n``) | 
|  | 5. Online repair (``xfs_scrub``) | 
|  | 6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed) | 
|  |  | 
|  | This is quite the combinatoric explosion! | 
|  |  | 
|  | Fortunately, having this much test coverage makes it easy for XFS developers to | 
|  | check the responses of XFS' fsck tools. | 
|  | Since the introduction of the fuzz testing framework, these tests have been | 
|  | used to discover incorrect repair code and missing functionality for entire | 
|  | classes of metadata objects in ``xfs_repair``. | 
|  | The enhanced testing was used to finalize the deprecation of ``xfs_check`` by | 
|  | confirming that ``xfs_repair`` could detect at least as many corruptions as | 
|  | the older tool. | 
|  |  | 
|  | These tests have been very valuable for ``xfs_scrub`` in the same ways -- they | 
|  | allow the online fsck developers to compare online fsck against offline fsck, | 
|  | and they enable XFS developers to find deficiencies in the code base. | 
|  |  | 
|  | Proposed patchsets include | 
|  | `general fuzzer improvements | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_, | 
|  | `fuzzing baselines | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_, | 
|  | and `improvements in fuzz testing comprehensiveness | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_. | 
|  |  | 
|  | Stress Testing | 
|  | -------------- | 
|  |  | 
|  | A unique requirement to online fsck is the ability to operate on a filesystem | 
|  | concurrently with regular workloads. | 
|  | Although it is of course impossible to run ``xfs_scrub`` with *zero* observable | 
|  | impact on the running system, the online repair code should never introduce | 
|  | inconsistencies into the filesystem metadata, and regular workloads should | 
|  | never notice resource starvation. | 
|  | To verify that these conditions are being met, fstests has been enhanced in | 
|  | the following ways: | 
|  |  | 
|  | * For each scrub item type, create a test to exercise checking that item type | 
|  | while running ``fsstress``. | 
|  | * For each scrub item type, create a test to exercise repairing that item type | 
|  | while running ``fsstress``. | 
|  | * Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole | 
|  | filesystem doesn't cause problems. | 
|  | * Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that | 
|  | force-repairing the whole filesystem doesn't cause problems. | 
|  | * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while | 
|  | freezing and thawing the filesystem. | 
|  | * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while | 
|  | remounting the filesystem read-only and read-write. | 
|  | * The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?) | 
|  |  | 
|  | Success is defined by the ability to run all of these tests without observing | 
|  | any unexpected filesystem shutdowns due to corrupted metadata, kernel hang | 
|  | check warnings, or any other sort of mischief. | 
|  |  | 
|  | Proposed patchsets include `general stress testing | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_ | 
|  | and the `evolution of existing per-function stress testing | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_. | 
|  |  | 
|  | 4. User Interface | 
|  | ================= | 
|  |  | 
|  | The primary user of online fsck is the system administrator, just like offline | 
|  | repair. | 
|  | Online fsck presents two modes of operation to administrators: | 
|  | A foreground CLI process for online fsck on demand, and a background service | 
|  | that performs autonomous checking and repair. | 
|  |  | 
|  | Checking on Demand | 
|  | ------------------ | 
|  |  | 
|  | For administrators who want the absolute freshest information about the | 
|  | metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on | 
|  | a command line. | 
|  | The program checks every piece of metadata in the filesystem while the | 
|  | administrator waits for the results to be reported, just like the existing | 
|  | ``xfs_repair`` tool. | 
|  | Both tools share a ``-n`` option to perform a read-only scan, and a ``-v`` | 
|  | option to increase the verbosity of the information reported. | 
|  |  | 
|  | A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error | 
|  | correction capabilities of the hardware to check data file contents. | 
|  | The media scan is not enabled by default because it may dramatically increase | 
|  | program runtime and consume a lot of bandwidth on older storage hardware. | 
|  |  | 
|  | The output of a foreground invocation is captured in the system log. | 
|  |  | 
|  | The ``xfs_scrub_all`` program walks the list of mounted filesystems and | 
|  | initiates ``xfs_scrub`` for each of them in parallel. | 
|  | It serializes scans for any filesystems that resolve to the same top level | 
|  | kernel block device to prevent resource overconsumption. | 
|  |  | 
|  | Background Service | 
|  | ------------------ | 
|  |  | 
|  | To reduce the workload of system administrators, the ``xfs_scrub`` package | 
|  | provides a suite of `systemd <https://systemd.io/>`_ timers and services that | 
|  | run online fsck automatically on weekends by default. | 
|  | The background service configures scrub to run with as little privilege as | 
|  | possible, the lowest CPU and IO priority, and in a CPU-constrained single | 
|  | threaded mode. | 
|  | This can be tuned by the systemd administrator at any time to suit the latency | 
|  | and throughput requirements of customer workloads. | 
|  |  | 
|  | The output of the background service is also captured in the system log. | 
|  | If desired, reports of failures (either due to inconsistencies or mere runtime | 
|  | errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment | 
|  | variable in the following service files: | 
|  |  | 
|  | * ``xfs_scrub_fail@.service`` | 
|  | * ``xfs_scrub_media_fail@.service`` | 
|  | * ``xfs_scrub_all_fail.service`` | 
|  |  | 
|  | The decision to enable the background scan is left to the system administrator. | 
|  | This can be done by enabling either of the following services: | 
|  |  | 
|  | * ``xfs_scrub_all.timer`` on systemd systems | 
|  | * ``xfs_scrub_all.cron`` on non-systemd systems | 
|  |  | 
|  | This automatic weekly scan is configured out of the box to perform an | 
|  | additional media scan of all file data once per month. | 
|  | This is less foolproof than, say, storing file data block checksums, but much | 
|  | more performant if application software provides its own integrity checking, | 
|  | redundancy can be provided elsewhere above the filesystem, or the storage | 
|  | device's integrity guarantees are deemed sufficient. | 
|  |  | 
|  | The systemd unit file definitions have been subjected to a security audit | 
|  | (as of systemd 249) to ensure that the xfs_scrub processes have as little | 
|  | access to the rest of the system as possible. | 
|  | This was performed via ``systemd-analyze security``, after which privileges | 
|  | were restricted to the minimum required, sandboxing was set up to the maximal | 
|  | extent possible with sandboxing and system call filtering; and access to the | 
|  | filesystem tree was restricted to the minimum needed to start the program and | 
|  | access the filesystem being scanned. | 
|  | The service definition files restrict CPU usage to 80% of one CPU core, and | 
|  | apply as nice of a priority to IO and CPU scheduling as possible. | 
|  | This measure was taken to minimize delays in the rest of the filesystem. | 
|  | No such hardening has been performed for the cron job. | 
|  |  | 
|  | Proposed patchset: | 
|  | `Enabling the xfs_scrub background service | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_. | 
|  |  | 
|  | Health Reporting | 
|  | ---------------- | 
|  |  | 
|  | XFS caches a summary of each filesystem's health status in memory. | 
|  | The information is updated whenever ``xfs_scrub`` is run, or whenever | 
|  | inconsistencies are detected in the filesystem metadata during regular | 
|  | operations. | 
|  | System administrators should use the ``health`` command of ``xfs_spaceman`` to | 
|  | download this information into a human-readable format. | 
|  | If problems have been observed, the administrator can schedule a reduced | 
|  | service window to run the online repair tool to correct the problem. | 
|  | Failing that, the administrator can decide to schedule a maintenance window to | 
|  | run the traditional offline repair tool to correct the problem. | 
|  |  | 
|  | **Future Work Question**: Should the health reporting integrate with the new | 
|  | inotify fs error notification system? | 
|  | Would it be helpful for sysadmins to have a daemon to listen for corruption | 
|  | notifications and initiate a repair? | 
|  |  | 
|  | *Answer*: These questions remain unanswered, but should be a part of the | 
|  | conversation with early adopters and potential downstream users of XFS. | 
|  |  | 
|  | Proposed patchsets include | 
|  | `wiring up health reports to correction returns | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_ | 
|  | and | 
|  | `preservation of sickness info during memory reclaim | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_. | 
|  |  | 
|  | 5. Kernel Algorithms and Data Structures | 
|  | ======================================== | 
|  |  | 
|  | This section discusses the key algorithms and data structures of the kernel | 
|  | code that provide the ability to check and repair metadata while the system | 
|  | is running. | 
|  | The first chapters in this section reveal the pieces that provide the | 
|  | foundation for checking metadata. | 
|  | The remainder of this section presents the mechanisms through which XFS | 
|  | regenerates itself. | 
|  |  | 
|  | Self Describing Metadata | 
|  | ------------------------ | 
|  |  | 
|  | Starting with XFS version 5 in 2012, XFS updated the format of nearly every | 
|  | ondisk block header to record a magic number, a checksum, a universally | 
|  | "unique" identifier (UUID), an owner code, the ondisk address of the block, | 
|  | and a log sequence number. | 
|  | When loading a block buffer from disk, the magic number, UUID, owner, and | 
|  | ondisk address confirm that the retrieved block matches the specific owner of | 
|  | the current filesystem, and that the information contained in the block is | 
|  | supposed to be found at the ondisk address. | 
|  | The first three components enable checking tools to disregard alleged metadata | 
|  | that doesn't belong to the filesystem, and the fourth component enables the | 
|  | filesystem to detect lost writes. | 
|  |  | 
|  | Whenever a file system operation modifies a block, the change is submitted | 
|  | to the log as part of a transaction. | 
|  | The log then processes these transactions marking them done once they are | 
|  | safely persisted to storage. | 
|  | The logging code maintains the checksum and the log sequence number of the last | 
|  | transactional update. | 
|  | Checksums are useful for detecting torn writes and other discrepancies that can | 
|  | be introduced between the computer and its storage devices. | 
|  | Sequence number tracking enables log recovery to avoid applying out of date | 
|  | log updates to the filesystem. | 
|  |  | 
|  | These two features improve overall runtime resiliency by providing a means for | 
|  | the filesystem to detect obvious corruption when reading metadata blocks from | 
|  | disk, but these buffer verifiers cannot provide any consistency checking | 
|  | between metadata structures. | 
|  |  | 
|  | For more information, please see the documentation for | 
|  | Documentation/filesystems/xfs/xfs-self-describing-metadata.rst | 
|  |  | 
|  | Reverse Mapping | 
|  | --------------- | 
|  |  | 
|  | The original design of XFS (circa 1993) is an improvement upon 1980s Unix | 
|  | filesystem design. | 
|  | In those days, storage density was expensive, CPU time was scarce, and | 
|  | excessive seek time could kill performance. | 
|  | For performance reasons, filesystem authors were reluctant to add redundancy to | 
|  | the filesystem, even at the cost of data integrity. | 
|  | Filesystems designers in the early 21st century choose different strategies to | 
|  | increase internal redundancy -- either storing nearly identical copies of | 
|  | metadata, or more space-efficient encoding techniques. | 
|  |  | 
|  | For XFS, a different redundancy strategy was chosen to modernize the design: | 
|  | a secondary space usage index that maps allocated disk extents back to their | 
|  | owners. | 
|  | By adding a new index, the filesystem retains most of its ability to scale | 
|  | well to heavily threaded workloads involving large datasets, since the primary | 
|  | file metadata (the directory tree, the file block map, and the allocation | 
|  | groups) remain unchanged. | 
|  | Like any system that improves redundancy, the reverse-mapping feature increases | 
|  | overhead costs for space mapping activities. | 
|  | However, it has two critical advantages: first, the reverse index is key to | 
|  | enabling online fsck and other requested functionality such as free space | 
|  | defragmentation, better media failure reporting, and filesystem shrinking. | 
|  | Second, the different ondisk storage format of the reverse mapping btree | 
|  | defeats device-level deduplication because the filesystem requires real | 
|  | redundancy. | 
|  |  | 
|  | +--------------------------------------------------------------------------+ | 
|  | | **Sidebar**:                                                             | | 
|  | +--------------------------------------------------------------------------+ | 
|  | | A criticism of adding the secondary index is that it does nothing to     | | 
|  | | improve the robustness of user data storage itself.                      | | 
|  | | This is a valid point, but adding a new index for file data block        | | 
|  | | checksums increases write amplification by turning data overwrites into  | | 
|  | | copy-writes, which age the filesystem prematurely.                       | | 
|  | | In keeping with thirty years of precedent, users who want file data      | | 
|  | | integrity can supply as powerful a solution as they require.             | | 
|  | | As for metadata, the complexity of adding a new secondary index of space | | 
|  | | usage is much less than adding volume management and storage device      | | 
|  | | mirroring to XFS itself.                                                 | | 
|  | | Perfection of RAID and volume management are best left to existing       | | 
|  | | layers in the kernel.                                                    | | 
|  | +--------------------------------------------------------------------------+ | 
|  |  | 
|  | The information captured in a reverse space mapping record is as follows: | 
|  |  | 
|  | .. code-block:: c | 
|  |  | 
|  | struct xfs_rmap_irec { | 
|  | xfs_agblock_t    rm_startblock;   /* extent start block */ | 
|  | xfs_extlen_t     rm_blockcount;   /* extent length */ | 
|  | uint64_t         rm_owner;        /* extent owner */ | 
|  | uint64_t         rm_offset;       /* offset within the owner */ | 
|  | unsigned int     rm_flags;        /* state flags */ | 
|  | }; | 
|  |  | 
|  | The first two fields capture the location and size of the physical space, | 
|  | in units of filesystem blocks. | 
|  | The owner field tells scrub which metadata structure or file inode have been | 
|  | assigned this space. | 
|  | For space allocated to files, the offset field tells scrub where the space was | 
|  | mapped within the file fork. | 
|  | Finally, the flags field provides extra information about the space usage -- | 
|  | is this an attribute fork extent?  A file mapping btree extent?  Or an | 
|  | unwritten data extent? | 
|  |  | 
|  | Online filesystem checking judges the consistency of each primary metadata | 
|  | record by comparing its information against all other space indices. | 
|  | The reverse mapping index plays a key role in the consistency checking process | 
|  | because it contains a centralized alternate copy of all space allocation | 
|  | information. | 
|  | Program runtime and ease of resource acquisition are the only real limits to | 
|  | what online checking can consult. | 
|  | For example, a file data extent mapping can be checked against: | 
|  |  | 
|  | * The absence of an entry in the free space information. | 
|  | * The absence of an entry in the inode index. | 
|  | * The absence of an entry in the reference count data if the file is not | 
|  | marked as having shared extents. | 
|  | * The correspondence of an entry in the reverse mapping information. | 
|  |  | 
|  | There are several observations to make about reverse mapping indices: | 
|  |  | 
|  | 1. Reverse mappings can provide a positive affirmation of correctness if any of | 
|  | the above primary metadata are in doubt. | 
|  | The checking code for most primary metadata follows a path similar to the | 
|  | one outlined above. | 
|  |  | 
|  | 2. Proving the consistency of secondary metadata with the primary metadata is | 
|  | difficult because that requires a full scan of all primary space metadata, | 
|  | which is very time intensive. | 
|  | For example, checking a reverse mapping record for a file extent mapping | 
|  | btree block requires locking the file and searching the entire btree to | 
|  | confirm the block. | 
|  | Instead, scrub relies on rigorous cross-referencing during the primary space | 
|  | mapping structure checks. | 
|  |  | 
|  | 3. Consistency scans must use non-blocking lock acquisition primitives if the | 
|  | required locking order is not the same order used by regular filesystem | 
|  | operations. | 
|  | For example, if the filesystem normally takes a file ILOCK before taking | 
|  | the AGF buffer lock but scrub wants to take a file ILOCK while holding | 
|  | an AGF buffer lock, scrub cannot block on that second acquisition. | 
|  | This means that forward progress during this part of a scan of the reverse | 
|  | mapping data cannot be guaranteed if system load is heavy. | 
|  |  | 
|  | In summary, reverse mappings play a key role in reconstruction of primary | 
|  | metadata. | 
|  | The details of how these records are staged, written to disk, and committed | 
|  | into the filesystem are covered in subsequent sections. | 
|  |  | 
|  | Checking and Cross-Referencing | 
|  | ------------------------------ | 
|  |  | 
|  | The first step of checking a metadata structure is to examine every record | 
|  | contained within the structure and its relationship with the rest of the | 
|  | system. | 
|  | XFS contains multiple layers of checking to try to prevent inconsistent | 
|  | metadata from wreaking havoc on the system. | 
|  | Each of these layers contributes information that helps the kernel to make | 
|  | three decisions about the health of a metadata structure: | 
|  |  | 
|  | - Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ? | 
|  | - Is this structure inconsistent with the rest of the system | 
|  | (``XFS_SCRUB_OFLAG_XCORRUPT``) ? | 
|  | - Is there so much damage around the filesystem that cross-referencing is not | 
|  | possible (``XFS_SCRUB_OFLAG_XFAIL``) ? | 
|  | - Can the structure be optimized to improve performance or reduce the size of | 
|  | metadata (``XFS_SCRUB_OFLAG_PREEN``) ? | 
|  | - Does the structure contain data that is not inconsistent but deserves review | 
|  | by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ? | 
|  |  | 
|  | The following sections describe how the metadata scrubbing process works. | 
|  |  | 
|  | Metadata Buffer Verification | 
|  | ```````````````````````````` | 
|  |  | 
|  | The lowest layer of metadata protection in XFS are the metadata verifiers built | 
|  | into the buffer cache. | 
|  | These functions perform inexpensive internal consistency checking of the block | 
|  | itself, and answer these questions: | 
|  |  | 
|  | - Does the block belong to this filesystem? | 
|  |  | 
|  | - Does the block belong to the structure that asked for the read? | 
|  | This assumes that metadata blocks only have one owner, which is always true | 
|  | in XFS. | 
|  |  | 
|  | - Is the type of data stored in the block within a reasonable range of what | 
|  | scrub is expecting? | 
|  |  | 
|  | - Does the physical location of the block match the location it was read from? | 
|  |  | 
|  | - Does the block checksum match the data? | 
|  |  | 
|  | The scope of the protections here are very limited -- verifiers can only | 
|  | establish that the filesystem code is reasonably free of gross corruption bugs | 
|  | and that the storage system is reasonably competent at retrieval. | 
|  | Corruption problems observed at runtime cause the generation of health reports, | 
|  | failed system calls, and in the extreme case, filesystem shutdowns if the | 
|  | corrupt metadata force the cancellation of a dirty transaction. | 
|  |  | 
|  | Every online fsck scrubbing function is expected to read every ondisk metadata | 
|  | block of a structure in the course of checking the structure. | 
|  | Corruption problems observed during a check are immediately reported to | 
|  | userspace as corruption; during a cross-reference, they are reported as a | 
|  | failure to cross-reference once the full examination is complete. | 
|  | Reads satisfied by a buffer already in cache (and hence already verified) | 
|  | bypass these checks. | 
|  |  | 
|  | Internal Consistency Checks | 
|  | ``````````````````````````` | 
|  |  | 
|  | After the buffer cache, the next level of metadata protection is the internal | 
|  | record verification code built into the filesystem. | 
|  | These checks are split between the buffer verifiers, the in-filesystem users of | 
|  | the buffer cache, and the scrub code itself, depending on the amount of higher | 
|  | level context required. | 
|  | The scope of checking is still internal to the block. | 
|  | These higher level checking functions answer these questions: | 
|  |  | 
|  | - Does the type of data stored in the block match what scrub is expecting? | 
|  |  | 
|  | - Does the block belong to the owning structure that asked for the read? | 
|  |  | 
|  | - If the block contains records, do the records fit within the block? | 
|  |  | 
|  | - If the block tracks internal free space information, is it consistent with | 
|  | the record areas? | 
|  |  | 
|  | - Are the records contained inside the block free of obvious corruptions? | 
|  |  | 
|  | Record checks in this category are more rigorous and more time-intensive. | 
|  | For example, block pointers and inumbers are checked to ensure that they point | 
|  | within the dynamically allocated parts of an allocation group and within | 
|  | the filesystem. | 
|  | Names are checked for invalid characters, and flags are checked for invalid | 
|  | combinations. | 
|  | Other record attributes are checked for sensible values. | 
|  | Btree records spanning an interval of the btree keyspace are checked for | 
|  | correct order and lack of mergeability (except for file fork mappings). | 
|  | For performance reasons, regular code may skip some of these checks unless | 
|  | debugging is enabled or a write is about to occur. | 
|  | Scrub functions, of course, must check all possible problems. | 
|  |  | 
|  | Validation of Userspace-Controlled Record Attributes | 
|  | ```````````````````````````````````````````````````` | 
|  |  | 
|  | Various pieces of filesystem metadata are directly controlled by userspace. | 
|  | Because of this nature, validation work cannot be more precise than checking | 
|  | that a value is within the possible range. | 
|  | These fields include: | 
|  |  | 
|  | - Superblock fields controlled by mount options | 
|  | - Filesystem labels | 
|  | - File timestamps | 
|  | - File permissions | 
|  | - File size | 
|  | - File flags | 
|  | - Names present in directory entries, extended attribute keys, and filesystem | 
|  | labels | 
|  | - Extended attribute key namespaces | 
|  | - Extended attribute values | 
|  | - File data block contents | 
|  | - Quota limits | 
|  | - Quota timer expiration (if resource usage exceeds the soft limit) | 
|  |  | 
|  | Cross-Referencing Space Metadata | 
|  | ```````````````````````````````` | 
|  |  | 
|  | After internal block checks, the next higher level of checking is | 
|  | cross-referencing records between metadata structures. | 
|  | For regular runtime code, the cost of these checks is considered to be | 
|  | prohibitively expensive, but as scrub is dedicated to rooting out | 
|  | inconsistencies, it must pursue all avenues of inquiry. | 
|  | The exact set of cross-referencing is highly dependent on the context of the | 
|  | data structure being checked. | 
|  |  | 
|  | The XFS btree code has keyspace scanning functions that online fsck uses to | 
|  | cross reference one structure with another. | 
|  | Specifically, scrub can scan the key space of an index to determine if that | 
|  | keyspace is fully, sparsely, or not at all mapped to records. | 
|  | For the reverse mapping btree, it is possible to mask parts of the key for the | 
|  | purposes of performing a keyspace scan so that scrub can decide if the rmap | 
|  | btree contains records mapping a certain extent of physical space without the | 
|  | sparsenses of the rest of the rmap keyspace getting in the way. | 
|  |  | 
|  | Btree blocks undergo the following checks before cross-referencing: | 
|  |  | 
|  | - Does the type of data stored in the block match what scrub is expecting? | 
|  |  | 
|  | - Does the block belong to the owning structure that asked for the read? | 
|  |  | 
|  | - Do the records fit within the block? | 
|  |  | 
|  | - Are the records contained inside the block free of obvious corruptions? | 
|  |  | 
|  | - Are the name hashes in the correct order? | 
|  |  | 
|  | - Do node pointers within the btree point to valid block addresses for the type | 
|  | of btree? | 
|  |  | 
|  | - Do child pointers point towards the leaves? | 
|  |  | 
|  | - Do sibling pointers point across the same level? | 
|  |  | 
|  | - For each node block record, does the record key accurate reflect the contents | 
|  | of the child block? | 
|  |  | 
|  | Space allocation records are cross-referenced as follows: | 
|  |  | 
|  | 1. Any space mentioned by any metadata structure are cross-referenced as | 
|  | follows: | 
|  |  | 
|  | - Does the reverse mapping index list only the appropriate owner as the | 
|  | owner of each block? | 
|  |  | 
|  | - Are none of the blocks claimed as free space? | 
|  |  | 
|  | - If these aren't file data blocks, are none of the blocks claimed as space | 
|  | shared by different owners? | 
|  |  | 
|  | 2. Btree blocks are cross-referenced as follows: | 
|  |  | 
|  | - Everything in class 1 above. | 
|  |  | 
|  | - If there's a parent node block, do the keys listed for this block match the | 
|  | keyspace of this block? | 
|  |  | 
|  | - Do the sibling pointers point to valid blocks?  Of the same level? | 
|  |  | 
|  | - Do the child pointers point to valid blocks?  Of the next level down? | 
|  |  | 
|  | 3. Free space btree records are cross-referenced as follows: | 
|  |  | 
|  | - Everything in class 1 and 2 above. | 
|  |  | 
|  | - Does the reverse mapping index list no owners of this space? | 
|  |  | 
|  | - Is this space not claimed by the inode index for inodes? | 
|  |  | 
|  | - Is it not mentioned by the reference count index? | 
|  |  | 
|  | - Is there a matching record in the other free space btree? | 
|  |  | 
|  | 4. Inode btree records are cross-referenced as follows: | 
|  |  | 
|  | - Everything in class 1 and 2 above. | 
|  |  | 
|  | - Is there a matching record in free inode btree? | 
|  |  | 
|  | - Do cleared bits in the holemask correspond with inode clusters? | 
|  |  | 
|  | - Do set bits in the freemask correspond with inode records with zero link | 
|  | count? | 
|  |  | 
|  | 5. Inode records are cross-referenced as follows: | 
|  |  | 
|  | - Everything in class 1. | 
|  |  | 
|  | - Do all the fields that summarize information about the file forks actually | 
|  | match those forks? | 
|  |  | 
|  | - Does each inode with zero link count correspond to a record in the free | 
|  | inode btree? | 
|  |  | 
|  | 6. File fork space mapping records are cross-referenced as follows: | 
|  |  | 
|  | - Everything in class 1 and 2 above. | 
|  |  | 
|  | - Is this space not mentioned by the inode btrees? | 
|  |  | 
|  | - If this is a CoW fork mapping, does it correspond to a CoW entry in the | 
|  | reference count btree? | 
|  |  | 
|  | 7. Reference count records are cross-referenced as follows: | 
|  |  | 
|  | - Everything in class 1 and 2 above. | 
|  |  | 
|  | - Within the space subkeyspace of the rmap btree (that is to say, all | 
|  | records mapped to a particular space extent and ignoring the owner info), | 
|  | are there the same number of reverse mapping records for each block as the | 
|  | reference count record claims? | 
|  |  | 
|  | Proposed patchsets are the series to find gaps in | 
|  | `refcount btree | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_, | 
|  | `inode btree | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and | 
|  | `rmap btree | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records; | 
|  | to find | 
|  | `mergeable records | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_; | 
|  | and to | 
|  | `improve cross referencing with rmap | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_ | 
|  | before starting a repair. | 
|  |  | 
|  | Checking Extended Attributes | 
|  | ```````````````````````````` | 
|  |  | 
|  | Extended attributes implement a key-value store that enable fragments of data | 
|  | to be attached to any file. | 
|  | Both the kernel and userspace can access the keys and values, subject to | 
|  | namespace and privilege restrictions. | 
|  | Most typically these fragments are metadata about the file -- origins, security | 
|  | contexts, user-supplied labels, indexing information, etc. | 
|  |  | 
|  | Names can be as long as 255 bytes and can exist in several different | 
|  | namespaces. | 
|  | Values can be as large as 64KB. | 
|  | A file's extended attributes are stored in blocks mapped by the attr fork. | 
|  | The mappings point to leaf blocks, remote value blocks, or dabtree blocks. | 
|  | Block 0 in the attribute fork is always the top of the structure, but otherwise | 
|  | each of the three types of blocks can be found at any offset in the attr fork. | 
|  | Leaf blocks contain attribute key records that point to the name and the value. | 
|  | Names are always stored elsewhere in the same leaf block. | 
|  | Values that are less than 3/4 the size of a filesystem block are also stored | 
|  | elsewhere in the same leaf block. | 
|  | Remote value blocks contain values that are too large to fit inside a leaf. | 
|  | If the leaf information exceeds a single filesystem block, a dabtree (also | 
|  | rooted at block 0) is created to map hashes of the attribute names to leaf | 
|  | blocks in the attr fork. | 
|  |  | 
|  | Checking an extended attribute structure is not so straightforward due to the | 
|  | lack of separation between attr blocks and index blocks. | 
|  | Scrub must read each block mapped by the attr fork and ignore the non-leaf | 
|  | blocks: | 
|  |  | 
|  | 1. Walk the dabtree in the attr fork (if present) to ensure that there are no | 
|  | irregularities in the blocks or dabtree mappings that do not point to | 
|  | attr leaf blocks. | 
|  |  | 
|  | 2. Walk the blocks of the attr fork looking for leaf blocks. | 
|  | For each entry inside a leaf: | 
|  |  | 
|  | a. Validate that the name does not contain invalid characters. | 
|  |  | 
|  | b. Read the attr value. | 
|  | This performs a named lookup of the attr name to ensure the correctness | 
|  | of the dabtree. | 
|  | If the value is stored in a remote block, this also validates the | 
|  | integrity of the remote value block. | 
|  |  | 
|  | Checking and Cross-Referencing Directories | 
|  | `````````````````````````````````````````` | 
|  |  | 
|  | The filesystem directory tree is a directed acylic graph structure, with files | 
|  | constituting the nodes, and directory entries (dirents) constituting the edges. | 
|  | Directories are a special type of file containing a set of mappings from a | 
|  | 255-byte sequence (name) to an inumber. | 
|  | These are called directory entries, or dirents for short. | 
|  | Each directory file must have exactly one directory pointing to the file. | 
|  | A root directory points to itself. | 
|  | Directory entries point to files of any type. | 
|  | Each non-directory file may have multiple directories point to it. | 
|  |  | 
|  | In XFS, directories are implemented as a file containing up to three 32GB | 
|  | partitions. | 
|  | The first partition contains directory entry data blocks. | 
|  | Each data block contains variable-sized records associating a user-provided | 
|  | name with an inumber and, optionally, a file type. | 
|  | If the directory entry data grows beyond one block, the second partition (which | 
|  | exists as post-EOF extents) is populated with a block containing free space | 
|  | information and an index that maps hashes of the dirent names to directory data | 
|  | blocks in the first partition. | 
|  | This makes directory name lookups very fast. | 
|  | If this second partition grows beyond one block, the third partition is | 
|  | populated with a linear array of free space information for faster | 
|  | expansions. | 
|  | If the free space has been separated and the second partition grows again | 
|  | beyond one block, then a dabtree is used to map hashes of dirent names to | 
|  | directory data blocks. | 
|  |  | 
|  | Checking a directory is pretty straightforward: | 
|  |  | 
|  | 1. Walk the dabtree in the second partition (if present) to ensure that there | 
|  | are no irregularities in the blocks or dabtree mappings that do not point to | 
|  | dirent blocks. | 
|  |  | 
|  | 2. Walk the blocks of the first partition looking for directory entries. | 
|  | Each dirent is checked as follows: | 
|  |  | 
|  | a. Does the name contain no invalid characters? | 
|  |  | 
|  | b. Does the inumber correspond to an actual, allocated inode? | 
|  |  | 
|  | c. Does the child inode have a nonzero link count? | 
|  |  | 
|  | d. If a file type is included in the dirent, does it match the type of the | 
|  | inode? | 
|  |  | 
|  | e. If the child is a subdirectory, does the child's dotdot pointer point | 
|  | back to the parent? | 
|  |  | 
|  | f. If the directory has a second partition, perform a named lookup of the | 
|  | dirent name to ensure the correctness of the dabtree. | 
|  |  | 
|  | 3. Walk the free space list in the third partition (if present) to ensure that | 
|  | the free spaces it describes are really unused. | 
|  |  | 
|  | Checking operations involving :ref:`parents <dirparent>` and | 
|  | :ref:`file link counts <nlinks>` are discussed in more detail in later | 
|  | sections. | 
|  |  | 
|  | Checking Directory/Attribute Btrees | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | As stated in previous sections, the directory/attribute btree (dabtree) index | 
|  | maps user-provided names to improve lookup times by avoiding linear scans. | 
|  | Internally, it maps a 32-bit hash of the name to a block offset within the | 
|  | appropriate file fork. | 
|  |  | 
|  | The internal structure of a dabtree closely resembles the btrees that record | 
|  | fixed-size metadata records -- each dabtree block contains a magic number, a | 
|  | checksum, sibling pointers, a UUID, a tree level, and a log sequence number. | 
|  | The format of leaf and node records are the same -- each entry points to the | 
|  | next level down in the hierarchy, with dabtree node records pointing to dabtree | 
|  | leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere | 
|  | in the fork. | 
|  |  | 
|  | Checking and cross-referencing the dabtree is very similar to what is done for | 
|  | space btrees: | 
|  |  | 
|  | - Does the type of data stored in the block match what scrub is expecting? | 
|  |  | 
|  | - Does the block belong to the owning structure that asked for the read? | 
|  |  | 
|  | - Do the records fit within the block? | 
|  |  | 
|  | - Are the records contained inside the block free of obvious corruptions? | 
|  |  | 
|  | - Are the name hashes in the correct order? | 
|  |  | 
|  | - Do node pointers within the dabtree point to valid fork offsets for dabtree | 
|  | blocks? | 
|  |  | 
|  | - Do leaf pointers within the dabtree point to valid fork offsets for directory | 
|  | or attr leaf blocks? | 
|  |  | 
|  | - Do child pointers point towards the leaves? | 
|  |  | 
|  | - Do sibling pointers point across the same level? | 
|  |  | 
|  | - For each dabtree node record, does the record key accurate reflect the | 
|  | contents of the child dabtree block? | 
|  |  | 
|  | - For each dabtree leaf record, does the record key accurate reflect the | 
|  | contents of the directory or attr block? | 
|  |  | 
|  | Cross-Referencing Summary Counters | 
|  | `````````````````````````````````` | 
|  |  | 
|  | XFS maintains three classes of summary counters: available resources, quota | 
|  | resource usage, and file link counts. | 
|  |  | 
|  | In theory, the amount of available resources (data blocks, inodes, realtime | 
|  | extents) can be found by walking the entire filesystem. | 
|  | This would make for very slow reporting, so a transactional filesystem can | 
|  | maintain summaries of this information in the superblock. | 
|  | Cross-referencing these values against the filesystem metadata should be a | 
|  | simple matter of walking the free space and inode metadata in each AG and the | 
|  | realtime bitmap, but there are complications that will be discussed in | 
|  | :ref:`more detail <fscounters>` later. | 
|  |  | 
|  | :ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>` | 
|  | checking are sufficiently complicated to warrant separate sections. | 
|  |  | 
|  | Post-Repair Reverification | 
|  | `````````````````````````` | 
|  |  | 
|  | After performing a repair, the checking code is run a second time to validate | 
|  | the new structure, and the results of the health assessment are recorded | 
|  | internally and returned to the calling process. | 
|  | This step is critical for enabling system administrator to monitor the status | 
|  | of the filesystem and the progress of any repairs. | 
|  | For developers, it is a useful means to judge the efficacy of error detection | 
|  | and correction in the online and offline checking tools. | 
|  |  | 
|  | Eventual Consistency vs. Online Fsck | 
|  | ------------------------------------ | 
|  |  | 
|  | Complex operations can make modifications to multiple per-AG data structures | 
|  | with a chain of transactions. | 
|  | These chains, once committed to the log, are restarted during log recovery if | 
|  | the system crashes while processing the chain. | 
|  | Because the AG header buffers are unlocked between transactions within a chain, | 
|  | online checking must coordinate with chained operations that are in progress to | 
|  | avoid incorrectly detecting inconsistencies due to pending chains. | 
|  | Furthermore, online repair must not run when operations are pending because | 
|  | the metadata are temporarily inconsistent with each other, and rebuilding is | 
|  | not possible. | 
|  |  | 
|  | Only online fsck has this requirement of total consistency of AG metadata, and | 
|  | should be relatively rare as compared to filesystem change operations. | 
|  | Online fsck coordinates with transaction chains as follows: | 
|  |  | 
|  | * For each AG, maintain a count of intent items targeting that AG. | 
|  | The count should be bumped whenever a new item is added to the chain. | 
|  | The count should be dropped when the filesystem has locked the AG header | 
|  | buffers and finished the work. | 
|  |  | 
|  | * When online fsck wants to examine an AG, it should lock the AG header | 
|  | buffers to quiesce all transaction chains that want to modify that AG. | 
|  | If the count is zero, proceed with the checking operation. | 
|  | If it is nonzero, cycle the buffer locks to allow the chain to make forward | 
|  | progress. | 
|  |  | 
|  | This may lead to online fsck taking a long time to complete, but regular | 
|  | filesystem updates take precedence over background checking activity. | 
|  | Details about the discovery of this situation are presented in the | 
|  | :ref:`next section <chain_coordination>`, and details about the solution | 
|  | are presented :ref:`after that<intent_drains>`. | 
|  |  | 
|  | .. _chain_coordination: | 
|  |  | 
|  | Discovery of the Problem | 
|  | ```````````````````````` | 
|  |  | 
|  | Midway through the development of online scrubbing, the fsstress tests | 
|  | uncovered a misinteraction between online fsck and compound transaction chains | 
|  | created by other writer threads that resulted in false reports of metadata | 
|  | inconsistency. | 
|  | The root cause of these reports is the eventual consistency model introduced by | 
|  | the expansion of deferred work items and compound transaction chains when | 
|  | reverse mapping and reflink were introduced. | 
|  |  | 
|  | Originally, transaction chains were added to XFS to avoid deadlocks when | 
|  | unmapping space from files. | 
|  | Deadlock avoidance rules require that AGs only be locked in increasing order, | 
|  | which makes it impossible (say) to use a single transaction to free a space | 
|  | extent in AG 7 and then try to free a now superfluous block mapping btree block | 
|  | in AG 3. | 
|  | To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log | 
|  | items to commit to freeing some space in one transaction while deferring the | 
|  | actual metadata updates to a fresh transaction. | 
|  | The transaction sequence looks like this: | 
|  |  | 
|  | 1. The first transaction contains a physical update to the file's block mapping | 
|  | structures to remove the mapping from the btree blocks. | 
|  | It then attaches to the in-memory transaction an action item to schedule | 
|  | deferred freeing of space. | 
|  | Concretely, each transaction maintains a list of ``struct | 
|  | xfs_defer_pending`` objects, each of which maintains a list of ``struct | 
|  | xfs_extent_free_item`` objects. | 
|  | Returning to the example above, the action item tracks the freeing of both | 
|  | the unmapped space from AG 7 and the block mapping btree (BMBT) block from | 
|  | AG 3. | 
|  | Deferred frees recorded in this manner are committed in the log by creating | 
|  | an EFI log item from the ``struct xfs_extent_free_item`` object and | 
|  | attaching the log item to the transaction. | 
|  | When the log is persisted to disk, the EFI item is written into the ondisk | 
|  | transaction record. | 
|  | EFIs can list up to 16 extents to free, all sorted in AG order. | 
|  |  | 
|  | 2. The second transaction contains a physical update to the free space btrees | 
|  | of AG 3 to release the former BMBT block and a second physical update to the | 
|  | free space btrees of AG 7 to release the unmapped file space. | 
|  | Observe that the physical updates are resequenced in the correct order | 
|  | when possible. | 
|  | Attached to the transaction is a an extent free done (EFD) log item. | 
|  | The EFD contains a pointer to the EFI logged in transaction #1 so that log | 
|  | recovery can tell if the EFI needs to be replayed. | 
|  |  | 
|  | If the system goes down after transaction #1 is written back to the filesystem | 
|  | but before #2 is committed, a scan of the filesystem metadata would show | 
|  | inconsistent filesystem metadata because there would not appear to be any owner | 
|  | of the unmapped space. | 
|  | Happily, log recovery corrects this inconsistency for us -- when recovery finds | 
|  | an intent log item but does not find a corresponding intent done item, it will | 
|  | reconstruct the incore state of the intent item and finish it. | 
|  | In the example above, the log must replay both frees described in the recovered | 
|  | EFI to complete the recovery phase. | 
|  |  | 
|  | There are subtleties to XFS' transaction chaining strategy to consider: | 
|  |  | 
|  | * Log items must be added to a transaction in the correct order to prevent | 
|  | conflicts with principal objects that are not held by the transaction. | 
|  | In other words, all per-AG metadata updates for an unmapped block must be | 
|  | completed before the last update to free the extent, and extents should not | 
|  | be reallocated until that last update commits to the log. | 
|  |  | 
|  | * AG header buffers are released between each transaction in a chain. | 
|  | This means that other threads can observe an AG in an intermediate state, | 
|  | but as long as the first subtlety is handled, this should not affect the | 
|  | correctness of filesystem operations. | 
|  |  | 
|  | * Unmounting the filesystem flushes all pending work to disk, which means that | 
|  | offline fsck never sees the temporary inconsistencies caused by deferred | 
|  | work item processing. | 
|  |  | 
|  | In this manner, XFS employs a form of eventual consistency to avoid deadlocks | 
|  | and increase parallelism. | 
|  |  | 
|  | During the design phase of the reverse mapping and reflink features, it was | 
|  | decided that it was impractical to cram all the reverse mapping updates for a | 
|  | single filesystem change into a single transaction because a single file | 
|  | mapping operation can explode into many small updates: | 
|  |  | 
|  | * The block mapping update itself | 
|  | * A reverse mapping update for the block mapping update | 
|  | * Fixing the freelist | 
|  | * A reverse mapping update for the freelist fix | 
|  |  | 
|  | * A shape change to the block mapping btree | 
|  | * A reverse mapping update for the btree update | 
|  | * Fixing the freelist (again) | 
|  | * A reverse mapping update for the freelist fix | 
|  |  | 
|  | * An update to the reference counting information | 
|  | * A reverse mapping update for the refcount update | 
|  | * Fixing the freelist (a third time) | 
|  | * A reverse mapping update for the freelist fix | 
|  |  | 
|  | * Freeing any space that was unmapped and not owned by any other file | 
|  | * Fixing the freelist (a fourth time) | 
|  | * A reverse mapping update for the freelist fix | 
|  |  | 
|  | * Freeing the space used by the block mapping btree | 
|  | * Fixing the freelist (a fifth time) | 
|  | * A reverse mapping update for the freelist fix | 
|  |  | 
|  | Free list fixups are not usually needed more than once per AG per transaction | 
|  | chain, but it is theoretically possible if space is very tight. | 
|  | For copy-on-write updates this is even worse, because this must be done once to | 
|  | remove the space from a staging area and again to map it into the file! | 
|  |  | 
|  | To deal with this explosion in a calm manner, XFS expands its use of deferred | 
|  | work items to cover most reverse mapping updates and all refcount updates. | 
|  | This reduces the worst case size of transaction reservations by breaking the | 
|  | work into a long chain of small updates, which increases the degree of eventual | 
|  | consistency in the system. | 
|  | Again, this generally isn't a problem because XFS orders its deferred work | 
|  | items carefully to avoid resource reuse conflicts between unsuspecting threads. | 
|  |  | 
|  | However, online fsck changes the rules -- remember that although physical | 
|  | updates to per-AG structures are coordinated by locking the buffers for AG | 
|  | headers, buffer locks are dropped between transactions. | 
|  | Once scrub acquires resources and takes locks for a data structure, it must do | 
|  | all the validation work without releasing the lock. | 
|  | If the main lock for a space btree is an AG header buffer lock, scrub may have | 
|  | interrupted another thread that is midway through finishing a chain. | 
|  | For example, if a thread performing a copy-on-write has completed a reverse | 
|  | mapping update but not the corresponding refcount update, the two AG btrees | 
|  | will appear inconsistent to scrub and an observation of corruption will be | 
|  | recorded.  This observation will not be correct. | 
|  | If a repair is attempted in this state, the results will be catastrophic! | 
|  |  | 
|  | Several other solutions to this problem were evaluated upon discovery of this | 
|  | flaw and rejected: | 
|  |  | 
|  | 1. Add a higher level lock to allocation groups and require writer threads to | 
|  | acquire the higher level lock in AG order before making any changes. | 
|  | This would be very difficult to implement in practice because it is | 
|  | difficult to determine which locks need to be obtained, and in what order, | 
|  | without simulating the entire operation. | 
|  | Performing a dry run of a file operation to discover necessary locks would | 
|  | make the filesystem very slow. | 
|  |  | 
|  | 2. Make the deferred work coordinator code aware of consecutive intent items | 
|  | targeting the same AG and have it hold the AG header buffers locked across | 
|  | the transaction roll between updates. | 
|  | This would introduce a lot of complexity into the coordinator since it is | 
|  | only loosely coupled with the actual deferred work items. | 
|  | It would also fail to solve the problem because deferred work items can | 
|  | generate new deferred subtasks, but all subtasks must be complete before | 
|  | work can start on a new sibling task. | 
|  |  | 
|  | 3. Teach online fsck to walk all transactions waiting for whichever lock(s) | 
|  | protect the data structure being scrubbed to look for pending operations. | 
|  | The checking and repair operations must factor these pending operations into | 
|  | the evaluations being performed. | 
|  | This solution is a nonstarter because it is *extremely* invasive to the main | 
|  | filesystem. | 
|  |  | 
|  | .. _intent_drains: | 
|  |  | 
|  | Intent Drains | 
|  | ````````````` | 
|  |  | 
|  | Online fsck uses an atomic intent item counter and lock cycling to coordinate | 
|  | with transaction chains. | 
|  | There are two key properties to the drain mechanism. | 
|  | First, the counter is incremented when a deferred work item is *queued* to a | 
|  | transaction, and it is decremented after the associated intent done log item is | 
|  | *committed* to another transaction. | 
|  | The second property is that deferred work can be added to a transaction without | 
|  | holding an AG header lock, but per-AG work items cannot be marked done without | 
|  | locking that AG header buffer to log the physical updates and the intent done | 
|  | log item. | 
|  | The first property enables scrub to yield to running transaction chains, which | 
|  | is an explicit deprioritization of online fsck to benefit file operations. | 
|  | The second property of the drain is key to the correct coordination of scrub, | 
|  | since scrub will always be able to decide if a conflict is possible. | 
|  |  | 
|  | For regular filesystem code, the drain works as follows: | 
|  |  | 
|  | 1. Call the appropriate subsystem function to add a deferred work item to a | 
|  | transaction. | 
|  |  | 
|  | 2. The function calls ``xfs_defer_drain_bump`` to increase the counter. | 
|  |  | 
|  | 3. When the deferred item manager wants to finish the deferred work item, it | 
|  | calls ``->finish_item`` to complete it. | 
|  |  | 
|  | 4. The ``->finish_item`` implementation logs some changes and calls | 
|  | ``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any threads | 
|  | waiting on the drain. | 
|  |  | 
|  | 5. The subtransaction commits, which unlocks the resource associated with the | 
|  | intent item. | 
|  |  | 
|  | For scrub, the drain works as follows: | 
|  |  | 
|  | 1. Lock the resource(s) associated with the metadata being scrubbed. | 
|  | For example, a scan of the refcount btree would lock the AGI and AGF header | 
|  | buffers. | 
|  |  | 
|  | 2. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are no | 
|  | chains in progress and the operation may proceed. | 
|  |  | 
|  | 3. Otherwise, release the resources grabbed in step 1. | 
|  |  | 
|  | 4. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``), then go | 
|  | back to step 1 unless a signal has been caught. | 
|  |  | 
|  | To avoid polling in step 4, the drain provides a waitqueue for scrub threads to | 
|  | be woken up whenever the intent count drops to zero. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `scrub intent drain series | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_. | 
|  |  | 
|  | .. _jump_labels: | 
|  |  | 
|  | Static Keys (aka Jump Label Patching) | 
|  | ````````````````````````````````````` | 
|  |  | 
|  | Online fsck for XFS separates the regular filesystem from the checking and | 
|  | repair code as much as possible. | 
|  | However, there are a few parts of online fsck (such as the intent drains, and | 
|  | later, live update hooks) where it is useful for the online fsck code to know | 
|  | what's going on in the rest of the filesystem. | 
|  | Since it is not expected that online fsck will be constantly running in the | 
|  | background, it is very important to minimize the runtime overhead imposed by | 
|  | these hooks when online fsck is compiled into the kernel but not actively | 
|  | running on behalf of userspace. | 
|  | Taking locks in the hot path of a writer thread to access a data structure only | 
|  | to find that no further action is necessary is expensive -- on the author's | 
|  | computer, this have an overhead of 40-50ns per access. | 
|  | Fortunately, the kernel supports dynamic code patching, which enables XFS to | 
|  | replace a static branch to hook code with ``nop`` sleds when online fsck isn't | 
|  | running. | 
|  | This sled has an overhead of however long it takes the instruction decoder to | 
|  | skip past the sled, which seems to be on the order of less than 1ns and | 
|  | does not access memory outside of instruction fetching. | 
|  |  | 
|  | When online fsck enables the static key, the sled is replaced with an | 
|  | unconditional branch to call the hook code. | 
|  | The switchover is quite expensive (~22000ns) but is paid entirely by the | 
|  | program that invoked online fsck, and can be amortized if multiple threads | 
|  | enter online fsck at the same time, or if multiple filesystems are being | 
|  | checked at the same time. | 
|  | Changing the branch direction requires taking the CPU hotplug lock, and since | 
|  | CPU initialization requires memory allocation, online fsck must be careful not | 
|  | to change a static key while holding any locks or resources that could be | 
|  | accessed in the memory reclaim paths. | 
|  | To minimize contention on the CPU hotplug lock, care should be taken not to | 
|  | enable or disable static keys unnecessarily. | 
|  |  | 
|  | Because static keys are intended to minimize hook overhead for regular | 
|  | filesystem operations when xfs_scrub is not running, the intended usage | 
|  | patterns are as follows: | 
|  |  | 
|  | - The hooked part of XFS should declare a static-scoped static key that | 
|  | defaults to false. | 
|  | The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this. | 
|  | The static key itself should be declared as a ``static`` variable. | 
|  |  | 
|  | - When deciding to invoke code that's only used by scrub, the regular | 
|  | filesystem should call the ``static_branch_unlikely`` predicate to avoid the | 
|  | scrub-only hook code if the static key is not enabled. | 
|  |  | 
|  | - The regular filesystem should export helper functions that call | 
|  | ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the | 
|  | static key. | 
|  | Wrapper functions make it easy to compile out the relevant code if the kernel | 
|  | distributor turns off online fsck at build time. | 
|  |  | 
|  | - Scrub functions wanting to turn on scrub-only XFS functionality should call | 
|  | the ``xchk_fsgates_enable`` from the setup function to enable a specific | 
|  | hook. | 
|  | This must be done before obtaining any resources that are used by memory | 
|  | reclaim. | 
|  | Callers had better be sure they really need the functionality gated by the | 
|  | static key; the ``TRY_HARDER`` flag is useful here. | 
|  |  | 
|  | Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to | 
|  | handle locking AGI and AGF buffers for all scrubber functions. | 
|  | If it detects a conflict between scrub and the running transactions, it will | 
|  | try to wait for intents to complete. | 
|  | If the caller of the helper has not enabled the static key, the helper will | 
|  | return -EDEADLOCK, which should result in the scrub being restarted with the | 
|  | ``TRY_HARDER`` flag set. | 
|  | The scrub setup function should detect that flag, enable the static key, and | 
|  | try the scrub again. | 
|  | Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``. | 
|  |  | 
|  | For more information, please see the kernel documentation of | 
|  | Documentation/staging/static-keys.rst. | 
|  |  | 
|  | .. _xfile: | 
|  |  | 
|  | Pageable Kernel Memory | 
|  | ---------------------- | 
|  |  | 
|  | Some online checking functions work by scanning the filesystem to build a | 
|  | shadow copy of an ondisk metadata structure in memory and comparing the two | 
|  | copies. | 
|  | For online repair to rebuild a metadata structure, it must compute the record | 
|  | set that will be stored in the new structure before it can persist that new | 
|  | structure to disk. | 
|  | Ideally, repairs complete with a single atomic commit that introduces | 
|  | a new data structure. | 
|  | To meet these goals, the kernel needs to collect a large amount of information | 
|  | in a place that doesn't require the correct operation of the filesystem. | 
|  |  | 
|  | Kernel memory isn't suitable because: | 
|  |  | 
|  | * Allocating a contiguous region of memory to create a C array is very | 
|  | difficult, especially on 32-bit systems. | 
|  |  | 
|  | * Linked lists of records introduce double pointer overhead which is very high | 
|  | and eliminate the possibility of indexed lookups. | 
|  |  | 
|  | * Kernel memory is pinned, which can drive the system into OOM conditions. | 
|  |  | 
|  | * The system might not have sufficient memory to stage all the information. | 
|  |  | 
|  | At any given time, online fsck does not need to keep the entire record set in | 
|  | memory, which means that individual records can be paged out if necessary. | 
|  | Continued development of online fsck demonstrated that the ability to perform | 
|  | indexed data storage would also be very useful. | 
|  | Fortunately, the Linux kernel already has a facility for byte-addressable and | 
|  | pageable storage: tmpfs. | 
|  | In-kernel graphics drivers (most notably i915) take advantage of tmpfs files | 
|  | to store intermediate data that doesn't need to be in memory at all times, so | 
|  | that usage precedent is already established. | 
|  | Hence, the ``xfile`` was born! | 
|  |  | 
|  | +--------------------------------------------------------------------------+ | 
|  | | **Historical Sidebar**:                                                  | | 
|  | +--------------------------------------------------------------------------+ | 
|  | | The first edition of online repair inserted records into a new btree as  | | 
|  | | it found them, which failed because filesystem could shut down with a    | | 
|  | | built data structure, which would be live after recovery finished.       | | 
|  | |                                                                          | | 
|  | | The second edition solved the half-rebuilt structure problem by storing  | | 
|  | | everything in memory, but frequently ran the system out of memory.       | | 
|  | |                                                                          | | 
|  | | The third edition solved the OOM problem by using linked lists, but the  | | 
|  | | memory overhead of the list pointers was extreme.                        | | 
|  | +--------------------------------------------------------------------------+ | 
|  |  | 
|  | xfile Access Models | 
|  | ``````````````````` | 
|  |  | 
|  | A survey of the intended uses of xfiles suggested these use cases: | 
|  |  | 
|  | 1. Arrays of fixed-sized records (space management btrees, directory and | 
|  | extended attribute entries) | 
|  |  | 
|  | 2. Sparse arrays of fixed-sized records (quotas and link counts) | 
|  |  | 
|  | 3. Large binary objects (BLOBs) of variable sizes (directory and extended | 
|  | attribute names and values) | 
|  |  | 
|  | 4. Staging btrees in memory (reverse mapping btrees) | 
|  |  | 
|  | 5. Arbitrary contents (realtime space management) | 
|  |  | 
|  | To support the first four use cases, high level data structures wrap the xfile | 
|  | to share functionality between online fsck functions. | 
|  | The rest of this section discusses the interfaces that the xfile presents to | 
|  | four of those five higher level data structures. | 
|  | The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case | 
|  | study. | 
|  |  | 
|  | XFS is very record-based, which suggests that the ability to load and store | 
|  | complete records is important. | 
|  | To support these cases, a pair of ``xfile_load`` and ``xfile_store`` | 
|  | functions are provided to read and persist objects into an xfile that treat any | 
|  | error as an out of memory error.  For online repair, squashing error conditions | 
|  | in this manner is an acceptable behavior because the only reaction is to abort | 
|  | the operation back to userspace. | 
|  |  | 
|  | However, no discussion of file access idioms is complete without answering the | 
|  | question, "But what about mmap?" | 
|  | It is convenient to access storage directly with pointers, just like userspace | 
|  | code does with regular memory. | 
|  | Online fsck must not drive the system into OOM conditions, which means that | 
|  | xfiles must be responsive to memory reclamation. | 
|  | tmpfs can only push a pagecache folio to the swap cache if the folio is neither | 
|  | pinned nor locked, which means the xfile must not pin too many folios. | 
|  |  | 
|  | Short term direct access to xfile contents is done by locking the pagecache | 
|  | folio and mapping it into kernel address space.  Object load and store uses this | 
|  | mechanism.  Folio locks are not supposed to be held for long periods of time, so | 
|  | long term direct access to xfile contents is done by bumping the folio refcount, | 
|  | mapping it into kernel address space, and dropping the folio lock. | 
|  | These long term users *must* be responsive to memory reclaim by hooking into | 
|  | the shrinker infrastructure to know when to release folios. | 
|  |  | 
|  | The ``xfile_get_folio`` and ``xfile_put_folio`` functions are provided to | 
|  | retrieve the (locked) folio that backs part of an xfile and to release it. | 
|  | The only code to use these folio lease functions are the xfarray | 
|  | :ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory | 
|  | btrees<xfbtree>`. | 
|  |  | 
|  | xfile Access Coordination | 
|  | ````````````````````````` | 
|  |  | 
|  | For security reasons, xfiles must be owned privately by the kernel. | 
|  | They are marked ``S_PRIVATE`` to prevent interference from the security system, | 
|  | must never be mapped into process file descriptor tables, and their pages must | 
|  | never be mapped into userspace processes. | 
|  |  | 
|  | To avoid locking recursion issues with the VFS, all accesses to the shmfs file | 
|  | are performed by manipulating the page cache directly. | 
|  | xfile writers call the ``->write_begin`` and ``->write_end`` functions of the | 
|  | xfile's address space to grab writable pages, copy the caller's buffer into the | 
|  | page, and release the pages. | 
|  | xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly | 
|  | before copying the contents into the caller's buffer. | 
|  | In other words, xfiles ignore the VFS read and write code paths to avoid | 
|  | having to create a dummy ``struct kiocb`` and to avoid taking inode and | 
|  | freeze locks. | 
|  | tmpfs cannot be frozen, and xfiles must not be exposed to userspace. | 
|  |  | 
|  | If an xfile is shared between threads to stage repairs, the caller must provide | 
|  | its own locks to coordinate access. | 
|  | For example, if a scrub function stores scan results in an xfile and needs | 
|  | other threads to provide updates to the scanned data, the scrub function must | 
|  | provide a lock for all threads to share. | 
|  |  | 
|  | .. _xfarray: | 
|  |  | 
|  | Arrays of Fixed-Sized Records | 
|  | ````````````````````````````` | 
|  |  | 
|  | In XFS, each type of indexed space metadata (free space, inodes, reference | 
|  | counts, file fork space, and reverse mappings) consists of a set of fixed-size | 
|  | records indexed with a classic B+ tree. | 
|  | Directories have a set of fixed-size dirent records that point to the names, | 
|  | and extended attributes have a set of fixed-size attribute keys that point to | 
|  | names and values. | 
|  | Quota counters and file link counters index records with numbers. | 
|  | During a repair, scrub needs to stage new records during the gathering step and | 
|  | retrieve them during the btree building step. | 
|  |  | 
|  | Although this requirement can be satisfied by calling the read and write | 
|  | methods of the xfile directly, it is simpler for callers for there to be a | 
|  | higher level abstraction to take care of computing array offsets, to provide | 
|  | iterator functions, and to deal with sparse records and sorting. | 
|  | The ``xfarray`` abstraction presents a linear array for fixed-size records atop | 
|  | the byte-accessible xfile. | 
|  |  | 
|  | .. _xfarray_access_patterns: | 
|  |  | 
|  | Array Access Patterns | 
|  | ^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | Array access patterns in online fsck tend to fall into three categories. | 
|  | Iteration of records is assumed to be necessary for all cases and will be | 
|  | covered in the next section. | 
|  |  | 
|  | The first type of caller handles records that are indexed by position. | 
|  | Gaps may exist between records, and a record may be updated multiple times | 
|  | during the collection step. | 
|  | In other words, these callers want a sparse linearly addressed table file. | 
|  | The typical use case are quota records or file link count records. | 
|  | Access to array elements is performed programmatically via ``xfarray_load`` and | 
|  | ``xfarray_store`` functions, which wrap the similarly-named xfile functions to | 
|  | provide loading and storing of array elements at arbitrary array indices. | 
|  | Gaps are defined to be null records, and null records are defined to be a | 
|  | sequence of all zero bytes. | 
|  | Null records are detected by calling ``xfarray_element_is_null``. | 
|  | They are created either by calling ``xfarray_unset`` to null out an existing | 
|  | record or by never storing anything to an array index. | 
|  |  | 
|  | The second type of caller handles records that are not indexed by position | 
|  | and do not require multiple updates to a record. | 
|  | The typical use case here is rebuilding space btrees and key/value btrees. | 
|  | These callers can add records to the array without caring about array indices | 
|  | via the ``xfarray_append`` function, which stores a record at the end of the | 
|  | array. | 
|  | For callers that require records to be presentable in a specific order (e.g. | 
|  | rebuilding btree data), the ``xfarray_sort`` function can arrange the sorted | 
|  | records; this function will be covered later. | 
|  |  | 
|  | The third type of caller is a bag, which is useful for counting records. | 
|  | The typical use case here is constructing space extent reference counts from | 
|  | reverse mapping information. | 
|  | Records can be put in the bag in any order, they can be removed from the bag | 
|  | at any time, and uniqueness of records is left to callers. | 
|  | The ``xfarray_store_anywhere`` function is used to insert a record in any | 
|  | null record slot in the bag; and the ``xfarray_unset`` function removes a | 
|  | record from the bag. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `big in-memory array | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_. | 
|  |  | 
|  | Iterating Array Elements | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | Most users of the xfarray require the ability to iterate the records stored in | 
|  | the array. | 
|  | Callers can probe every possible array index with the following: | 
|  |  | 
|  | .. code-block:: c | 
|  |  | 
|  | xfarray_idx_t i; | 
|  | foreach_xfarray_idx(array, i) { | 
|  | xfarray_load(array, i, &rec); | 
|  |  | 
|  | /* do something with rec */ | 
|  | } | 
|  |  | 
|  | All users of this idiom must be prepared to handle null records or must already | 
|  | know that there aren't any. | 
|  |  | 
|  | For xfarray users that want to iterate a sparse array, the ``xfarray_iter`` | 
|  | function ignores indices in the xfarray that have never been written to by | 
|  | calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas | 
|  | of the array that are not populated with memory pages. | 
|  | Once it finds a page, it will skip the zeroed areas of the page. | 
|  |  | 
|  | .. code-block:: c | 
|  |  | 
|  | xfarray_idx_t i = XFARRAY_CURSOR_INIT; | 
|  | while ((ret = xfarray_iter(array, &i, &rec)) == 1) { | 
|  | /* do something with rec */ | 
|  | } | 
|  |  | 
|  | .. _xfarray_sort: | 
|  |  | 
|  | Sorting Array Elements | 
|  | ^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | During the fourth demonstration of online repair, a community reviewer remarked | 
|  | that for performance reasons, online repair ought to load batches of records | 
|  | into btree record blocks instead of inserting records into a new btree one at a | 
|  | time. | 
|  | The btree insertion code in XFS is responsible for maintaining correct ordering | 
|  | of the records, so naturally the xfarray must also support sorting the record | 
|  | set prior to bulk loading. | 
|  |  | 
|  | Case Study: Sorting xfarrays | 
|  | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | The sorting algorithm used in the xfarray is actually a combination of adaptive | 
|  | quicksort and a heapsort subalgorithm in the spirit of | 
|  | `Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and | 
|  | `pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux | 
|  | kernel. | 
|  | To sort records in a reasonably short amount of time, ``xfarray`` takes | 
|  | advantage of the binary subpartitioning offered by quicksort, but it also uses | 
|  | heapsort to hedge against performance collapse if the chosen quicksort pivots | 
|  | are poor. | 
|  | Both algorithms are (in general) O(n * lg(n)), but there is a wide performance | 
|  | gulf between the two implementations. | 
|  |  | 
|  | The Linux kernel already contains a reasonably fast implementation of heapsort. | 
|  | It only operates on regular C arrays, which limits the scope of its usefulness. | 
|  | There are two key places where the xfarray uses it: | 
|  |  | 
|  | * Sorting any record subset backed by a single xfile page. | 
|  |  | 
|  | * Loading a small number of xfarray records from potentially disparate parts | 
|  | of the xfarray into a memory buffer, and sorting the buffer. | 
|  |  | 
|  | In other words, ``xfarray`` uses heapsort to constrain the nested recursion of | 
|  | quicksort, thereby mitigating quicksort's worst runtime behavior. | 
|  |  | 
|  | Choosing a quicksort pivot is a tricky business. | 
|  | A good pivot splits the set to sort in half, leading to the divide and conquer | 
|  | behavior that is crucial to  O(n * lg(n)) performance. | 
|  | A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`) | 
|  | runtime. | 
|  | The xfarray sort routine tries to avoid picking a bad pivot by sampling nine | 
|  | records into a memory buffer and using the kernel heapsort to identify the | 
|  | median of the nine. | 
|  |  | 
|  | Most modern quicksort implementations employ Tukey's "ninther" to select a | 
|  | pivot from a classic C array. | 
|  | Typical ninther implementations pick three unique triads of records, sort each | 
|  | of the triads, and then sort the middle value of each triad to determine the | 
|  | ninther value. | 
|  | As stated previously, however, xfile accesses are not entirely cheap. | 
|  | It turned out to be much more performant to read the nine elements into a | 
|  | memory buffer, run the kernel's in-memory heapsort on the buffer, and choose | 
|  | the 4th element of that buffer as the pivot. | 
|  | Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for | 
|  | low-effort robust (resistant) location in large samples`, in *Contributions to | 
|  | Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press, | 
|  | 1978), pp. 251–257. | 
|  |  | 
|  | The partitioning of quicksort is fairly textbook -- rearrange the record | 
|  | subset around the pivot, then set up the current and next stack frames to | 
|  | sort with the larger and the smaller halves of the pivot, respectively. | 
|  | This keeps the stack space requirements to log2(record count). | 
|  |  | 
|  | As a final performance optimization, the hi and lo scanning phase of quicksort | 
|  | keeps examined xfile pages mapped in the kernel for as long as possible to | 
|  | reduce map/unmap cycles. | 
|  | Surprisingly, this reduces overall sort runtime by nearly half again after | 
|  | accounting for the application of heapsort directly onto xfile pages. | 
|  |  | 
|  | .. _xfblob: | 
|  |  | 
|  | Blob Storage | 
|  | ```````````` | 
|  |  | 
|  | Extended attributes and directories add an additional requirement for staging | 
|  | records: arbitrary byte sequences of finite length. | 
|  | Each directory entry record needs to store entry name, | 
|  | and each extended attribute needs to store both the attribute name and value. | 
|  | The names, keys, and values can consume a large amount of memory, so the | 
|  | ``xfblob`` abstraction was created to simplify management of these blobs | 
|  | atop an xfile. | 
|  |  | 
|  | Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve | 
|  | and persist objects. | 
|  | The store function returns a magic cookie for every object that it persists. | 
|  | Later, callers provide this cookie to the ``xblob_load`` to recall the object. | 
|  | The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate`` | 
|  | function frees them all because compaction is not needed. | 
|  |  | 
|  | The details of repairing directories and extended attributes will be discussed | 
|  | in a subsequent section about atomic file content exchanges. | 
|  | However, it should be noted that these repair functions only use blob storage | 
|  | to cache a small number of entries before adding them to a temporary ondisk | 
|  | file, which is why compaction is not required. | 
|  |  | 
|  | The proposed patchset is at the start of the | 
|  | `extended attribute repair | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series. | 
|  |  | 
|  | .. _xfbtree: | 
|  |  | 
|  | In-Memory B+Trees | 
|  | ````````````````` | 
|  |  | 
|  | The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that | 
|  | checking and repairing of secondary metadata commonly requires coordination | 
|  | between a live metadata scan of the filesystem and writer threads that are | 
|  | updating that metadata. | 
|  | Keeping the scan data up to date requires requires the ability to propagate | 
|  | metadata updates from the filesystem into the data being collected by the scan. | 
|  | This *can* be done by appending concurrent updates into a separate log file and | 
|  | applying them before writing the new metadata to disk, but this leads to | 
|  | unbounded memory consumption if the rest of the system is very busy. | 
|  | Another option is to skip the side-log and commit live updates from the | 
|  | filesystem directly into the scan data, which trades more overhead for a lower | 
|  | maximum memory requirement. | 
|  | In both cases, the data structure holding the scan results must support indexed | 
|  | access to perform well. | 
|  |  | 
|  | Given that indexed lookups of scan data is required for both strategies, online | 
|  | fsck employs the second strategy of committing live updates directly into | 
|  | scan data. | 
|  | Because xfarrays are not indexed and do not enforce record ordering, they | 
|  | are not suitable for this task. | 
|  | Conveniently, however, XFS has a library to create and maintain ordered reverse | 
|  | mapping records: the existing rmap btree code! | 
|  | If only there was a means to create one in memory. | 
|  |  | 
|  | Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a | 
|  | regular file, which means that the kernel can create byte or block addressable | 
|  | virtual address spaces at will. | 
|  | The XFS buffer cache specializes in abstracting IO to block-oriented  address | 
|  | spaces, which means that adaptation of the buffer cache to interface with | 
|  | xfiles enables reuse of the entire btree library. | 
|  | Btrees built atop an xfile are collectively known as ``xfbtrees``. | 
|  | The next few sections describe how they actually work. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `in-memory btree | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_ | 
|  | series. | 
|  |  | 
|  | Using xfiles as a Buffer Cache Target | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | Two modifications are necessary to support xfiles as a buffer cache target. | 
|  | The first is to make it possible for the ``struct xfs_buftarg`` structure to | 
|  | host the ``struct xfs_buf`` rhashtable, because normally those are held by a | 
|  | per-AG structure. | 
|  | The second change is to modify the buffer ``ioapply`` function to "read" cached | 
|  | pages from the xfile and "write" cached pages back to the xfile. | 
|  | Multiple access to individual buffers is controlled by the ``xfs_buf`` lock, | 
|  | since the xfile does not provide any locking on its own. | 
|  | With this adaptation in place, users of the xfile-backed buffer cache use | 
|  | exactly the same APIs as users of the disk-backed buffer cache. | 
|  | The separation between xfile and buffer cache implies higher memory usage since | 
|  | they do not share pages, but this property could some day enable transactional | 
|  | updates to an in-memory btree. | 
|  | Today, however, it simply eliminates the need for new code. | 
|  |  | 
|  | Space Management with an xfbtree | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | Space management for an xfile is very simple -- each btree block is one memory | 
|  | page in size. | 
|  | These blocks use the same header format as an on-disk btree, but the in-memory | 
|  | block verifiers ignore the checksums, assuming that xfile memory is no more | 
|  | corruption-prone than regular DRAM. | 
|  | Reusing existing code here is more important than absolute memory efficiency. | 
|  |  | 
|  | The very first block of an xfile backing an xfbtree contains a header block. | 
|  | The header describes the owner, height, and the block number of the root | 
|  | xfbtree block. | 
|  |  | 
|  | To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file. | 
|  | If there are no gaps, create one by extending the length of the xfile. | 
|  | Preallocate space for the block with ``xfile_prealloc``, and hand back the | 
|  | location. | 
|  | To free an xfbtree block, use ``xfile_discard`` (which internally uses | 
|  | ``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile. | 
|  |  | 
|  | Populating an xfbtree | 
|  | ^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | An online fsck function that wants to create an xfbtree should proceed as | 
|  | follows: | 
|  |  | 
|  | 1. Call ``xfile_create`` to create an xfile. | 
|  |  | 
|  | 2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure | 
|  | pointing to the xfile. | 
|  |  | 
|  | 3. Pass the buffer cache target, buffer ops, and other information to | 
|  | ``xfbtree_init`` to initialize the passed in ``struct xfbtree`` and write an | 
|  | initial root block to the xfile. | 
|  | Each btree type should define a wrapper that passes necessary arguments to | 
|  | the creation function. | 
|  | For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of | 
|  | all the necessary details for callers. | 
|  |  | 
|  | 4. Pass the xfbtree object to the btree cursor creation function for the | 
|  | btree type. | 
|  | Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this | 
|  | for callers. | 
|  |  | 
|  | 5. Pass the btree cursor to the regular btree functions to make queries against | 
|  | and to update the in-memory btree. | 
|  | For example, a btree cursor for an rmap xfbtree can be passed to the | 
|  | ``xfs_rmap_*`` functions just like any other btree cursor. | 
|  | See the :ref:`next section<xfbtree_commit>` for information on dealing with | 
|  | xfbtree updates that are logged to a transaction. | 
|  |  | 
|  | 6. When finished, delete the btree cursor, destroy the xfbtree object, free the | 
|  | buffer target, and the destroy the xfile to release all resources. | 
|  |  | 
|  | .. _xfbtree_commit: | 
|  |  | 
|  | Committing Logged xfbtree Buffers | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | Although it is a clever hack to reuse the rmap btree code to handle the staging | 
|  | structure, the ephemeral nature of the in-memory btree block storage presents | 
|  | some challenges of its own. | 
|  | The XFS transaction manager must not commit buffer log items for buffers backed | 
|  | by an xfile because the log format does not understand updates for devices | 
|  | other than the data device. | 
|  | An ephemeral xfbtree probably will not exist by the time the AIL checkpoints | 
|  | log transactions back into the filesystem, and certainly won't exist during | 
|  | log recovery. | 
|  | For these reasons, any code updating an xfbtree in transaction context must | 
|  | remove the buffer log items from the transaction and write the updates into the | 
|  | backing xfile before committing or cancelling the transaction. | 
|  |  | 
|  | The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement | 
|  | this functionality as follows: | 
|  |  | 
|  | 1. Find each buffer log item whose buffer targets the xfile. | 
|  |  | 
|  | 2. Record the dirty/ordered status of the log item. | 
|  |  | 
|  | 3. Detach the log item from the buffer. | 
|  |  | 
|  | 4. Queue the buffer to a special delwri list. | 
|  |  | 
|  | 5. Clear the transaction dirty flag if the only dirty log items were the ones | 
|  | that were detached in step 3. | 
|  |  | 
|  | 6. Submit the delwri list to commit the changes to the xfile, if the updates | 
|  | are being committed. | 
|  |  | 
|  | After removing xfile logged buffers from the transaction in this manner, the | 
|  | transaction can be committed or cancelled. | 
|  |  | 
|  | Bulk Loading of Ondisk B+Trees | 
|  | ------------------------------ | 
|  |  | 
|  | As mentioned previously, early iterations of online repair built new btree | 
|  | structures by creating a new btree and adding observations individually. | 
|  | Loading a btree one record at a time had a slight advantage of not requiring | 
|  | the incore records to be sorted prior to commit, but was very slow and leaked | 
|  | blocks if the system went down during a repair. | 
|  | Loading records one at a time also meant that repair could not control the | 
|  | loading factor of the blocks in the new btree. | 
|  |  | 
|  | Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for | 
|  | rebuilding a btree index from a collection of records -- bulk btree loading. | 
|  | This was implemented rather inefficiently code-wise, since ``xfs_repair`` | 
|  | had separate copy-pasted implementations for each btree type. | 
|  |  | 
|  | To prepare for online fsck, each of the four bulk loaders were studied, notes | 
|  | were taken, and the four were refactored into a single generic btree bulk | 
|  | loading mechanism. | 
|  | Those notes in turn have been refreshed and are presented below. | 
|  |  | 
|  | Geometry Computation | 
|  | ```````````````````` | 
|  |  | 
|  | The zeroth step of bulk loading is to assemble the entire record set that will | 
|  | be stored in the new btree, and sort the records. | 
|  | Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the | 
|  | btree from the record set, the type of btree, and any load factor preferences. | 
|  | This information is required for resource reservation. | 
|  |  | 
|  | First, the geometry computation computes the minimum and maximum records that | 
|  | will fit in a leaf block from the size of a btree block and the size of the | 
|  | block header. | 
|  | Roughly speaking, the maximum number of records is:: | 
|  |  | 
|  | maxrecs = (block_size - header_size) / record_size | 
|  |  | 
|  | The XFS design specifies that btree blocks should be merged when possible, | 
|  | which means the minimum number of records is half of maxrecs:: | 
|  |  | 
|  | minrecs = maxrecs / 2 | 
|  |  | 
|  | The next variable to determine is the desired loading factor. | 
|  | This must be at least minrecs and no more than maxrecs. | 
|  | Choosing minrecs is undesirable because it wastes half the block. | 
|  | Choosing maxrecs is also undesirable because adding a single record to each | 
|  | newly rebuilt leaf block will cause a tree split, which causes a noticeable | 
|  | drop in performance immediately afterwards. | 
|  | The default loading factor was chosen to be 75% of maxrecs, which provides a | 
|  | reasonably compact structure without any immediate split penalties:: | 
|  |  | 
|  | default_load_factor = (maxrecs + minrecs) / 2 | 
|  |  | 
|  | If space is tight, the loading factor will be set to maxrecs to try to avoid | 
|  | running out of space:: | 
|  |  | 
|  | leaf_load_factor = enough space ? default_load_factor : maxrecs | 
|  |  | 
|  | Load factor is computed for btree node blocks using the combined size of the | 
|  | btree key and pointer as the record size:: | 
|  |  | 
|  | maxrecs = (block_size - header_size) / (key_size + ptr_size) | 
|  | minrecs = maxrecs / 2 | 
|  | node_load_factor = enough space ? default_load_factor : maxrecs | 
|  |  | 
|  | Once that's done, the number of leaf blocks required to store the record set | 
|  | can be computed as:: | 
|  |  | 
|  | leaf_blocks = ceil(record_count / leaf_load_factor) | 
|  |  | 
|  | The number of node blocks needed to point to the next level down in the tree | 
|  | is computed as:: | 
|  |  | 
|  | n_blocks = (n == 0 ? leaf_blocks : node_blocks[n]) | 
|  | node_blocks[n + 1] = ceil(n_blocks / node_load_factor) | 
|  |  | 
|  | The entire computation is performed recursively until the current level only | 
|  | needs one block. | 
|  | The resulting geometry is as follows: | 
|  |  | 
|  | - For AG-rooted btrees, this level is the root level, so the height of the new | 
|  | tree is ``level + 1`` and the space needed is the summation of the number of | 
|  | blocks on each level. | 
|  |  | 
|  | - For inode-rooted btrees where the records in the top level do not fit in the | 
|  | inode fork area, the height is ``level + 2``, the space needed is the | 
|  | summation of the number of blocks on each level, and the inode fork points to | 
|  | the root block. | 
|  |  | 
|  | - For inode-rooted btrees where the records in the top level can be stored in | 
|  | the inode fork area, then the root block can be stored in the inode, the | 
|  | height is ``level + 1``, and the space needed is one less than the summation | 
|  | of the number of blocks on each level. | 
|  | This only becomes relevant when non-bmap btrees gain the ability to root in | 
|  | an inode, which is a future patchset and only included here for completeness. | 
|  |  | 
|  | .. _newbt: | 
|  |  | 
|  | Reserving New B+Tree Blocks | 
|  | ``````````````````````````` | 
|  |  | 
|  | Once repair knows the number of blocks needed for the new btree, it allocates | 
|  | those blocks using the free space information. | 
|  | Each reserved extent is tracked separately by the btree builder state data. | 
|  | To improve crash resilience, the reservation code also logs an Extent Freeing | 
|  | Intent (EFI) item in the same transaction as each space allocation and attaches | 
|  | its in-memory ``struct xfs_extent_free_item`` object to the space reservation. | 
|  | If the system goes down, log recovery will use the unfinished EFIs to free the | 
|  | unused space, the free space, leaving the filesystem unchanged. | 
|  |  | 
|  | Each time the btree builder claims a block for the btree from a reserved | 
|  | extent, it updates the in-memory reservation to reflect the claimed space. | 
|  | Block reservation tries to allocate as much contiguous space as possible to | 
|  | reduce the number of EFIs in play. | 
|  |  | 
|  | While repair is writing these new btree blocks, the EFIs created for the space | 
|  | reservations pin the tail of the ondisk log. | 
|  | It's possible that other parts of the system will remain busy and push the head | 
|  | of the log towards the pinned tail. | 
|  | To avoid livelocking the filesystem, the EFIs must not pin the tail of the log | 
|  | for too long. | 
|  | To alleviate this problem, the dynamic relogging capability of the deferred ops | 
|  | mechanism is reused here to commit a transaction at the log head containing an | 
|  | EFD for the old EFI and new EFI at the head. | 
|  | This enables the log to release the old EFI to keep the log moving forwards. | 
|  |  | 
|  | EFIs have a role to play during the commit and reaping phases; please see the | 
|  | next section and the section about :ref:`reaping<reaping>` for more details. | 
|  |  | 
|  | Proposed patchsets are the | 
|  | `bitmap rework | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_ | 
|  | and the | 
|  | `preparation for bulk loading btrees | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_. | 
|  |  | 
|  |  | 
|  | Writing the New Tree | 
|  | ```````````````````` | 
|  |  | 
|  | This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims | 
|  | a block from the reserved list, writes the new btree block header, fills the | 
|  | rest of the block with records, and adds the new leaf block to a list of | 
|  | written blocks:: | 
|  |  | 
|  | ┌────┐ | 
|  | │leaf│ | 
|  | │RRR │ | 
|  | └────┘ | 
|  |  | 
|  | Sibling pointers are set every time a new block is added to the level:: | 
|  |  | 
|  | ┌────┐ ┌────┐ ┌────┐ ┌────┐ | 
|  | │leaf│→│leaf│→│leaf│→│leaf│ | 
|  | │RRR │←│RRR │←│RRR │←│RRR │ | 
|  | └────┘ └────┘ └────┘ └────┘ | 
|  |  | 
|  | When it finishes writing the record leaf blocks, it moves on to the node | 
|  | blocks | 
|  | To fill a node block, it walks each block in the next level down in the tree | 
|  | to compute the relevant keys and write them into the parent node:: | 
|  |  | 
|  | ┌────┐       ┌────┐ | 
|  | │node│──────→│node│ | 
|  | │PP  │←──────│PP  │ | 
|  | └────┘       └────┘ | 
|  | ↙   ↘         ↙   ↘ | 
|  | ┌────┐ ┌────┐ ┌────┐ ┌────┐ | 
|  | │leaf│→│leaf│→│leaf│→│leaf│ | 
|  | │RRR │←│RRR │←│RRR │←│RRR │ | 
|  | └────┘ └────┘ └────┘ └────┘ | 
|  |  | 
|  | When it reaches the root level, it is ready to commit the new btree!:: | 
|  |  | 
|  | ┌─────────┐ | 
|  | │  root   │ | 
|  | │   PP    │ | 
|  | └─────────┘ | 
|  | ↙         ↘ | 
|  | ┌────┐       ┌────┐ | 
|  | │node│──────→│node│ | 
|  | │PP  │←──────│PP  │ | 
|  | └────┘       └────┘ | 
|  | ↙   ↘         ↙   ↘ | 
|  | ┌────┐ ┌────┐ ┌────┐ ┌────┐ | 
|  | │leaf│→│leaf│→│leaf│→│leaf│ | 
|  | │RRR │←│RRR │←│RRR │←│RRR │ | 
|  | └────┘ └────┘ └────┘ └────┘ | 
|  |  | 
|  | The first step to commit the new btree is to persist the btree blocks to disk | 
|  | synchronously. | 
|  | This is a little complicated because a new btree block could have been freed | 
|  | in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to | 
|  | remove the (stale) buffer from the AIL list before it can write the new blocks | 
|  | to disk. | 
|  | Blocks are queued for IO using a delwri list and written in one large batch | 
|  | with ``xfs_buf_delwri_submit``. | 
|  |  | 
|  | Once the new blocks have been persisted to disk, control returns to the | 
|  | individual repair function that called the bulk loader. | 
|  | The repair function must log the location of the new root in a transaction, | 
|  | clean up the space reservations that were made for the new btree, and reap the | 
|  | old metadata blocks: | 
|  |  | 
|  | 1. Commit the location of the new btree root. | 
|  |  | 
|  | 2. For each incore reservation: | 
|  |  | 
|  | a. Log Extent Freeing Done (EFD) items for all the space that was consumed | 
|  | by the btree builder.  The new EFDs must point to the EFIs attached to | 
|  | the reservation to prevent log recovery from freeing the new blocks. | 
|  |  | 
|  | b. For unclaimed portions of incore reservations, create a regular deferred | 
|  | extent free work item to be free the unused space later in the | 
|  | transaction chain. | 
|  |  | 
|  | c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the | 
|  | reservation of the committing transaction. | 
|  | If the btree loading code suspects this might be about to happen, it must | 
|  | call ``xrep_defer_finish`` to clear out the deferred work and obtain a | 
|  | fresh transaction. | 
|  |  | 
|  | 3. Clear out the deferred work a second time to finish the commit and clean | 
|  | the repair transaction. | 
|  |  | 
|  | The transaction rolling in steps 2c and 3 represent a weakness in the repair | 
|  | algorithm, because a log flush and a crash before the end of the reap step can | 
|  | result in space leaking. | 
|  | Online repair functions minimize the chances of this occurring by using very | 
|  | large transactions, which each can accommodate many thousands of block freeing | 
|  | instructions. | 
|  | Repair moves on to reaping the old blocks, which will be presented in a | 
|  | subsequent :ref:`section<reaping>` after a few case studies of bulk loading. | 
|  |  | 
|  | Case Study: Rebuilding the Inode Index | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | The high level process to rebuild the inode index btree is: | 
|  |  | 
|  | 1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec`` | 
|  | records from the inode chunk information and a bitmap of the old inode btree | 
|  | blocks. | 
|  |  | 
|  | 2. Append the records to an xfarray in inode order. | 
|  |  | 
|  | 3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number | 
|  | of blocks needed for the inode btree. | 
|  | If the free space inode btree is enabled, call it again to estimate the | 
|  | geometry of the finobt. | 
|  |  | 
|  | 4. Allocate the number of blocks computed in the previous step. | 
|  |  | 
|  | 5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and | 
|  | generate the internal node blocks. | 
|  | If the free space inode btree is enabled, call it again to load the finobt. | 
|  |  | 
|  | 6. Commit the location of the new btree root block(s) to the AGI. | 
|  |  | 
|  | 7. Reap the old btree blocks using the bitmap created in step 1. | 
|  |  | 
|  | Details are as follows. | 
|  |  | 
|  | The inode btree maps inumbers to the ondisk location of the associated | 
|  | inode records, which means that the inode btrees can be rebuilt from the | 
|  | reverse mapping information. | 
|  | Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` marks the | 
|  | location of the old inode btree blocks. | 
|  | Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the | 
|  | location of at least one inode cluster buffer. | 
|  | A cluster is the smallest number of ondisk inodes that can be allocated or | 
|  | freed in a single transaction; it is never smaller than 1 fs block or 4 inodes. | 
|  |  | 
|  | For the space represented by each inode cluster, ensure that there are no | 
|  | records in the free space btrees nor any records in the reference count btree. | 
|  | If there are, the space metadata inconsistencies are reason enough to abort the | 
|  | operation. | 
|  | Otherwise, read each cluster buffer to check that its contents appear to be | 
|  | ondisk inodes and to decide if the file is allocated | 
|  | (``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``). | 
|  | Accumulate the results of successive inode cluster buffer reads until there is | 
|  | enough information to fill a single inode chunk record, which is 64 consecutive | 
|  | numbers in the inumber keyspace. | 
|  | If the chunk is sparse, the chunk record may include holes. | 
|  |  | 
|  | Once the repair function accumulates one chunk's worth of data, it calls | 
|  | ``xfarray_append`` to add the inode btree record to the xfarray. | 
|  | This xfarray is walked twice during the btree creation step -- once to populate | 
|  | the inode btree with all inode chunk records, and a second time to populate the | 
|  | free inode btree with records for chunks that have free non-sparse inodes. | 
|  | The number of records for the inode btree is the number of xfarray records, | 
|  | but the record count for the free inode btree has to be computed as inode chunk | 
|  | records are stored in the xfarray. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `AG btree repair | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_ | 
|  | series. | 
|  |  | 
|  | Case Study: Rebuilding the Space Reference Counts | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | Reverse mapping records are used to rebuild the reference count information. | 
|  | Reference counts are required for correct operation of copy on write for shared | 
|  | file data. | 
|  | Imagine the reverse mapping entries as rectangles representing extents of | 
|  | physical blocks, and that the rectangles can be laid down to allow them to | 
|  | overlap each other. | 
|  | From the diagram below, it is apparent that a reference count record must start | 
|  | or end wherever the height of the stack changes. | 
|  | In other words, the record emission stimulus is level-triggered:: | 
|  |  | 
|  | █    ███ | 
|  | ██      █████ ████   ███        ██████ | 
|  | ██   ████     ███████████ ████     █████████ | 
|  | ████████████████████████████████ ███████████ | 
|  | ^ ^  ^^ ^^    ^ ^^ ^^^  ^^^^  ^ ^^ ^  ^     ^ | 
|  | 2 1  23 21    3 43 234  2123  1 01 2  3     0 | 
|  |  | 
|  | The ondisk reference count btree does not store the refcount == 0 cases because | 
|  | the free space btree already records which blocks are free. | 
|  | Extents being used to stage copy-on-write operations should be the only records | 
|  | with refcount == 1. | 
|  | Single-owner file blocks aren't recorded in either the free space or the | 
|  | reference count btrees. | 
|  |  | 
|  | The high level process to rebuild the reference count btree is: | 
|  |  | 
|  | 1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec`` | 
|  | records for any space having more than one reverse mapping and add them to | 
|  | the xfarray. | 
|  | Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray | 
|  | because these are extents allocated to stage a copy on write operation and | 
|  | are tracked in the refcount btree. | 
|  |  | 
|  | Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old | 
|  | refcount btree blocks. | 
|  |  | 
|  | 2. Sort the records in physical extent order, putting the CoW staging extents | 
|  | at the end of the xfarray. | 
|  | This matches the sorting order of records in the refcount btree. | 
|  |  | 
|  | 3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number | 
|  | of blocks needed for the new tree. | 
|  |  | 
|  | 4. Allocate the number of blocks computed in the previous step. | 
|  |  | 
|  | 5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and | 
|  | generate the internal node blocks. | 
|  |  | 
|  | 6. Commit the location of new btree root block to the AGF. | 
|  |  | 
|  | 7. Reap the old btree blocks using the bitmap created in step 1. | 
|  |  | 
|  | Details are as follows; the same algorithm is used by ``xfs_repair`` to | 
|  | generate refcount information from reverse mapping records. | 
|  |  | 
|  | - Until the reverse mapping btree runs out of records: | 
|  |  | 
|  | - Retrieve the next record from the btree and put it in a bag. | 
|  |  | 
|  | - Collect all records with the same starting block from the btree and put | 
|  | them in the bag. | 
|  |  | 
|  | - While the bag isn't empty: | 
|  |  | 
|  | - Among the mappings in the bag, compute the lowest block number where the | 
|  | reference count changes. | 
|  | This position will be either the starting block number of the next | 
|  | unprocessed reverse mapping or the next block after the shortest mapping | 
|  | in the bag. | 
|  |  | 
|  | - Remove all mappings from the bag that end at this position. | 
|  |  | 
|  | - Collect all reverse mappings that start at this position from the btree | 
|  | and put them in the bag. | 
|  |  | 
|  | - If the size of the bag changed and is greater than one, create a new | 
|  | refcount record associating the block number range that we just walked to | 
|  | the size of the bag. | 
|  |  | 
|  | The bag-like structure in this case is a type 2 xfarray as discussed in the | 
|  | :ref:`xfarray access patterns<xfarray_access_patterns>` section. | 
|  | Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and | 
|  | removed via ``xfarray_unset``. | 
|  | Bag members are examined through ``xfarray_iter`` loops. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `AG btree repair | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_ | 
|  | series. | 
|  |  | 
|  | Case Study: Rebuilding File Fork Mapping Indices | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | The high level process to rebuild a data/attr fork mapping btree is: | 
|  |  | 
|  | 1. Walk the reverse mapping records to generate ``struct xfs_bmbt_rec`` | 
|  | records from the reverse mapping records for that inode and fork. | 
|  | Append these records to an xfarray. | 
|  | Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK`` | 
|  | records. | 
|  |  | 
|  | 2. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number | 
|  | of blocks needed for the new tree. | 
|  |  | 
|  | 3. Sort the records in file offset order. | 
|  |  | 
|  | 4. If the extent records would fit in the inode fork immediate area, commit the | 
|  | records to that immediate area and skip to step 8. | 
|  |  | 
|  | 5. Allocate the number of blocks computed in the previous step. | 
|  |  | 
|  | 6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and | 
|  | generate the internal node blocks. | 
|  |  | 
|  | 7. Commit the new btree root block to the inode fork immediate area. | 
|  |  | 
|  | 8. Reap the old btree blocks using the bitmap created in step 1. | 
|  |  | 
|  | There are some complications here: | 
|  | First, it's possible to move the fork offset to adjust the sizes of the | 
|  | immediate areas if the data and attr forks are not both in BMBT format. | 
|  | Second, if there are sufficiently few fork mappings, it may be possible to use | 
|  | EXTENTS format instead of BMBT, which may require a conversion. | 
|  | Third, the incore extent map must be reloaded carefully to avoid disturbing | 
|  | any delayed allocation extents. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `file mapping repair | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_ | 
|  | series. | 
|  |  | 
|  | .. _reaping: | 
|  |  | 
|  | Reaping Old Metadata Blocks | 
|  | --------------------------- | 
|  |  | 
|  | Whenever online fsck builds a new data structure to replace one that is | 
|  | suspect, there is a question of how to find and dispose of the blocks that | 
|  | belonged to the old structure. | 
|  | The laziest method of course is not to deal with them at all, but this slowly | 
|  | leads to service degradations as space leaks out of the filesystem. | 
|  | Hopefully, someone will schedule a rebuild of the free space information to | 
|  | plug all those leaks. | 
|  | Offline repair rebuilds all space metadata after recording the usage of | 
|  | the files and directories that it decides not to clear, hence it can build new | 
|  | structures in the discovered free space and avoid the question of reaping. | 
|  |  | 
|  | As part of a repair, online fsck relies heavily on the reverse mapping records | 
|  | to find space that is owned by the corresponding rmap owner yet truly free. | 
|  | Cross referencing rmap records with other rmap records is necessary because | 
|  | there may be other data structures that also think they own some of those | 
|  | blocks (e.g. crosslinked trees). | 
|  | Permitting the block allocator to hand them out again will not push the system | 
|  | towards consistency. | 
|  |  | 
|  | For space metadata, the process of finding extents to dispose of generally | 
|  | follows this format: | 
|  |  | 
|  | 1. Create a bitmap of space used by data structures that must be preserved. | 
|  | The space reservations used to create the new metadata can be used here if | 
|  | the same rmap owner code is used to denote all of the objects being rebuilt. | 
|  |  | 
|  | 2. Survey the reverse mapping data to create a bitmap of space owned by the | 
|  | same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved. | 
|  |  | 
|  | 3. Use the bitmap disunion operator to subtract (1) from (2). | 
|  | The remaining set bits represent candidate extents that could be freed. | 
|  | The process moves on to step 4 below. | 
|  |  | 
|  | Repairs for file-based metadata such as extended attributes, directories, | 
|  | symbolic links, quota files and realtime bitmaps are performed by building a | 
|  | new structure attached to a temporary file and exchanging all mappings in the | 
|  | file forks. | 
|  | Afterward, the mappings in the old file fork are the candidate blocks for | 
|  | disposal. | 
|  |  | 
|  | The process for disposing of old extents is as follows: | 
|  |  | 
|  | 4. For each candidate extent, count the number of reverse mapping records for | 
|  | the first block in that extent that do not have the same rmap owner for the | 
|  | data structure being repaired. | 
|  |  | 
|  | - If zero, the block has a single owner and can be freed. | 
|  |  | 
|  | - If not, the block is part of a crosslinked structure and must not be | 
|  | freed. | 
|  |  | 
|  | 5. Starting with the next block in the extent, figure out how many more blocks | 
|  | have the same zero/nonzero other owner status as that first block. | 
|  |  | 
|  | 6. If the region is crosslinked, delete the reverse mapping entry for the | 
|  | structure being repaired and move on to the next region. | 
|  |  | 
|  | 7. If the region is to be freed, mark any corresponding buffers in the buffer | 
|  | cache as stale to prevent log writeback. | 
|  |  | 
|  | 8. Free the region and move on. | 
|  |  | 
|  | However, there is one complication to this procedure. | 
|  | Transactions are of finite size, so the reaping process must be careful to roll | 
|  | the transactions to avoid overruns. | 
|  | Overruns come from two sources: | 
|  |  | 
|  | a. EFIs logged on behalf of space that is no longer occupied | 
|  |  | 
|  | b. Log items for buffer invalidations | 
|  |  | 
|  | This is also a window in which a crash during the reaping process can leak | 
|  | blocks. | 
|  | As stated earlier, online repair functions use very large transactions to | 
|  | minimize the chances of this occurring. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `preparation for bulk loading btrees | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_ | 
|  | series. | 
|  |  | 
|  | Case Study: Reaping After a Regular Btree Repair | 
|  | ```````````````````````````````````````````````` | 
|  |  | 
|  | Old reference count and inode btrees are the easiest to reap because they have | 
|  | rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount | 
|  | btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees. | 
|  | Creating a list of extents to reap the old btree blocks is quite simple, | 
|  | conceptually: | 
|  |  | 
|  | 1. Lock the relevant AGI/AGF header buffers to prevent allocation and frees. | 
|  |  | 
|  | 2. For each reverse mapping record with an rmap owner corresponding to the | 
|  | metadata structure being rebuilt, set the corresponding range in a bitmap. | 
|  |  | 
|  | 3. Walk the current data structures that have the same rmap owner. | 
|  | For each block visited, clear that range in the above bitmap. | 
|  |  | 
|  | 4. Each set bit in the bitmap represents a block that could be a block from the | 
|  | old data structures and hence is a candidate for reaping. | 
|  | In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)`` | 
|  | are the blocks that might be freeable. | 
|  |  | 
|  | If it is possible to maintain the AGF lock throughout the repair (which is the | 
|  | common case), then step 2 can be performed at the same time as the reverse | 
|  | mapping record walk that creates the records for the new btree. | 
|  |  | 
|  | Case Study: Rebuilding the Free Space Indices | 
|  | ````````````````````````````````````````````` | 
|  |  | 
|  | The high level process to rebuild the free space indices is: | 
|  |  | 
|  | 1. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore`` | 
|  | records from the gaps in the reverse mapping btree. | 
|  |  | 
|  | 2. Append the records to an xfarray. | 
|  |  | 
|  | 3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number | 
|  | of blocks needed for each new tree. | 
|  |  | 
|  | 4. Allocate the number of blocks computed in the previous step from the free | 
|  | space information collected. | 
|  |  | 
|  | 5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and | 
|  | generate the internal node blocks for the free space by length index. | 
|  | Call it again for the free space by block number index. | 
|  |  | 
|  | 6. Commit the locations of the new btree root blocks to the AGF. | 
|  |  | 
|  | 7. Reap the old btree blocks by looking for space that is not recorded by the | 
|  | reverse mapping btree, the new free space btrees, or the AGFL. | 
|  |  | 
|  | Repairing the free space btrees has three key complications over a regular | 
|  | btree repair: | 
|  |  | 
|  | First, free space is not explicitly tracked in the reverse mapping records. | 
|  | Hence, the new free space records must be inferred from gaps in the physical | 
|  | space component of the keyspace of the reverse mapping btree. | 
|  |  | 
|  | Second, free space repairs cannot use the common btree reservation code because | 
|  | new blocks are reserved out of the free space btrees. | 
|  | This is impossible when repairing the free space btrees themselves. | 
|  | However, repair holds the AGF buffer lock for the duration of the free space | 
|  | index reconstruction, so it can use the collected free space information to | 
|  | supply the blocks for the new free space btrees. | 
|  | It is not necessary to back each reserved extent with an EFI because the new | 
|  | free space btrees are constructed in what the ondisk filesystem thinks is | 
|  | unowned space. | 
|  | However, if reserving blocks for the new btrees from the collected free space | 
|  | information changes the number of free space records, repair must re-estimate | 
|  | the new free space btree geometry with the new record count until the | 
|  | reservation is sufficient. | 
|  | As part of committing the new btrees, repair must ensure that reverse mappings | 
|  | are created for the reserved blocks and that unused reserved blocks are | 
|  | inserted into the free space btrees. | 
|  | Deferrred rmap and freeing operations are used to ensure that this transition | 
|  | is atomic, similar to the other btree repair functions. | 
|  |  | 
|  | Third, finding the blocks to reap after the repair is not overly | 
|  | straightforward. | 
|  | Blocks for the free space btrees and the reverse mapping btrees are supplied by | 
|  | the AGFL. | 
|  | Blocks put onto the AGFL have reverse mapping records with the owner | 
|  | ``XFS_RMAP_OWN_AG``. | 
|  | This ownership is retained when blocks move from the AGFL into the free space | 
|  | btrees or the reverse mapping btrees. | 
|  | When repair walks reverse mapping records to synthesize free space records, it | 
|  | creates a bitmap (``ag_owner_bitmap``) of all the space claimed by | 
|  | ``XFS_RMAP_OWN_AG`` records. | 
|  | The repair context maintains a second bitmap corresponding to the rmap btree | 
|  | blocks and the AGFL blocks (``rmap_agfl_bitmap``). | 
|  | When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap & | 
|  | ~rmap_agfl_bitmap)`` computes the extents that are used by the old free space | 
|  | btrees. | 
|  | These blocks can then be reaped using the methods outlined above. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `AG btree repair | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_ | 
|  | series. | 
|  |  | 
|  | .. _rmap_reap: | 
|  |  | 
|  | Case Study: Reaping After Repairing Reverse Mapping Btrees | 
|  | `````````````````````````````````````````````````````````` | 
|  |  | 
|  | Old reverse mapping btrees are less difficult to reap after a repair. | 
|  | As mentioned in the previous section, blocks on the AGFL, the two free space | 
|  | btree blocks, and the reverse mapping btree blocks all have reverse mapping | 
|  | records with ``XFS_RMAP_OWN_AG`` as the owner. | 
|  | The full process of gathering reverse mapping records and building a new btree | 
|  | are described in the case study of | 
|  | :ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from that | 
|  | discussion is that the new rmap btree will not contain any records for the old | 
|  | rmap btree, nor will the old btree blocks be tracked in the free space btrees. | 
|  | The list of candidate reaping blocks is computed by setting the bits | 
|  | corresponding to the gaps in the new rmap btree records, and then clearing the | 
|  | bits corresponding to extents in the free space btrees and the current AGFL | 
|  | blocks. | 
|  | The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` are reaped using the | 
|  | methods outlined above. | 
|  |  | 
|  | The rest of the process of rebuildng the reverse mapping btree is discussed | 
|  | in a separate :ref:`case study<rmap_repair>`. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `AG btree repair | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_ | 
|  | series. | 
|  |  | 
|  | Case Study: Rebuilding the AGFL | 
|  | ``````````````````````````````` | 
|  |  | 
|  | The allocation group free block list (AGFL) is repaired as follows: | 
|  |  | 
|  | 1. Create a bitmap for all the space that the reverse mapping data claims is | 
|  | owned by ``XFS_RMAP_OWN_AG``. | 
|  |  | 
|  | 2. Subtract the space used by the two free space btrees and the rmap btree. | 
|  |  | 
|  | 3. Subtract any space that the reverse mapping data claims is owned by any | 
|  | other owner, to avoid re-adding crosslinked blocks to the AGFL. | 
|  |  | 
|  | 4. Once the AGFL is full, reap any blocks leftover. | 
|  |  | 
|  | 5. The next operation to fix the freelist will right-size the list. | 
|  |  | 
|  | See `fs/xfs/scrub/agheader_repair.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_ for more details. | 
|  |  | 
|  | Inode Record Repairs | 
|  | -------------------- | 
|  |  | 
|  | Inode records must be handled carefully, because they have both ondisk records | 
|  | ("dinodes") and an in-memory ("cached") representation. | 
|  | There is a very high potential for cache coherency issues if online fsck is not | 
|  | careful to access the ondisk metadata *only* when the ondisk metadata is so | 
|  | badly damaged that the filesystem cannot load the in-memory representation. | 
|  | When online fsck wants to open a damaged file for scrubbing, it must use | 
|  | specialized resource acquisition functions that return either the in-memory | 
|  | representation *or* a lock on whichever object is necessary to prevent any | 
|  | update to the ondisk location. | 
|  |  | 
|  | The only repairs that should be made to the ondisk inode buffers are whatever | 
|  | is necessary to get the in-core structure loaded. | 
|  | This means fixing whatever is caught by the inode cluster buffer and inode fork | 
|  | verifiers, and retrying the ``iget`` operation. | 
|  | If the second ``iget`` fails, the repair has failed. | 
|  |  | 
|  | Once the in-memory representation is loaded, repair can lock the inode and can | 
|  | subject it to comprehensive checks, repairs, and optimizations. | 
|  | Most inode attributes are easy to check and constrain, or are user-controlled | 
|  | arbitrary bit patterns; these are both easy to fix. | 
|  | Dealing with the data and attr fork extent counts and the file block counts is | 
|  | more complicated, because computing the correct value requires traversing the | 
|  | forks, or if that fails, leaving the fields invalid and waiting for the fork | 
|  | fsck functions to run. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `inode | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_ | 
|  | repair series. | 
|  |  | 
|  | Quota Record Repairs | 
|  | -------------------- | 
|  |  | 
|  | Similar to inodes, quota records ("dquots") also have both ondisk records and | 
|  | an in-memory representation, and hence are subject to the same cache coherency | 
|  | issues. | 
|  | Somewhat confusingly, both are known as dquots in the XFS codebase. | 
|  |  | 
|  | The only repairs that should be made to the ondisk quota record buffers are | 
|  | whatever is necessary to get the in-core structure loaded. | 
|  | Once the in-memory representation is loaded, the only attributes needing | 
|  | checking are obviously bad limits and timer values. | 
|  |  | 
|  | Quota usage counters are checked, repaired, and discussed separately in the | 
|  | section about :ref:`live quotacheck <quotacheck>`. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `quota | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_ | 
|  | repair series. | 
|  |  | 
|  | .. _fscounters: | 
|  |  | 
|  | Freezing to Fix Summary Counters | 
|  | -------------------------------- | 
|  |  | 
|  | Filesystem summary counters track availability of filesystem resources such | 
|  | as free blocks, free inodes, and allocated inodes. | 
|  | This information could be compiled by walking the free space and inode indexes, | 
|  | but this is a slow process, so XFS maintains a copy in the ondisk superblock | 
|  | that should reflect the ondisk metadata, at least when the filesystem has been | 
|  | unmounted cleanly. | 
|  | For performance reasons, XFS also maintains incore copies of those counters, | 
|  | which are key to enabling resource reservations for active transactions. | 
|  | Writer threads reserve the worst-case quantities of resources from the | 
|  | incore counter and give back whatever they don't use at commit time. | 
|  | It is therefore only necessary to serialize on the superblock when the | 
|  | superblock is being committed to disk. | 
|  |  | 
|  | The lazy superblock counter feature introduced in XFS v5 took this even further | 
|  | by training log recovery to recompute the summary counters from the AG headers, | 
|  | which eliminated the need for most transactions even to touch the superblock. | 
|  | The only time XFS commits the summary counters is at filesystem unmount. | 
|  | To reduce contention even further, the incore counter is implemented as a | 
|  | percpu counter, which means that each CPU is allocated a batch of blocks from a | 
|  | global incore counter and can satisfy small allocations from the local batch. | 
|  |  | 
|  | The high-performance nature of the summary counters makes it difficult for | 
|  | online fsck to check them, since there is no way to quiesce a percpu counter | 
|  | while the system is running. | 
|  | Although online fsck can read the filesystem metadata to compute the correct | 
|  | values of the summary counters, there's no way to hold the value of a percpu | 
|  | counter stable, so it's quite possible that the counter will be out of date by | 
|  | the time the walk is complete. | 
|  | Earlier versions of online scrub would return to userspace with an incomplete | 
|  | scan flag, but this is not a satisfying outcome for a system administrator. | 
|  | For repairs, the in-memory counters must be stabilized while walking the | 
|  | filesystem metadata to get an accurate reading and install it in the percpu | 
|  | counter. | 
|  |  | 
|  | To satisfy this requirement, online fsck must prevent other programs in the | 
|  | system from initiating new writes to the filesystem, it must disable background | 
|  | garbage collection threads, and it must wait for existing writer programs to | 
|  | exit the kernel. | 
|  | Once that has been established, scrub can walk the AG free space indexes, the | 
|  | inode btrees, and the realtime bitmap to compute the correct value of all | 
|  | four summary counters. | 
|  | This is very similar to a filesystem freeze, though not all of the pieces are | 
|  | necessary: | 
|  |  | 
|  | - The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to | 
|  | prevent other threads from thawing the filesystem, or other scrub threads | 
|  | from initiating another fscounters freeze. | 
|  |  | 
|  | - It does not quiesce the log. | 
|  |  | 
|  | With this code in place, it is now possible to pause the filesystem for just | 
|  | long enough to check and correct the summary counters. | 
|  |  | 
|  | +--------------------------------------------------------------------------+ | 
|  | | **Historical Sidebar**:                                                  | | 
|  | +--------------------------------------------------------------------------+ | 
|  | | The initial implementation used the actual VFS filesystem freeze         | | 
|  | | mechanism to quiesce filesystem activity.                                | | 
|  | | With the filesystem frozen, it is possible to resolve the counter values | | 
|  | | with exact precision, but there are many problems with calling the VFS   | | 
|  | | methods directly:                                                        | | 
|  | |                                                                          | | 
|  | | - Other programs can unfreeze the filesystem without our knowledge.      | | 
|  | |   This leads to incorrect scan results and incorrect repairs.            | | 
|  | |                                                                          | | 
|  | | - Adding an extra lock to prevent others from thawing the filesystem     | | 
|  | |   required the addition of a ``->freeze_super`` function to wrap         | | 
|  | |   ``freeze_fs()``.                                                       | | 
|  | |   This in turn caused other subtle problems because it turns out that    | | 
|  | |   the VFS ``freeze_super`` and ``thaw_super`` functions can drop the     | | 
|  | |   last reference to the VFS superblock, and any subsequent access        | | 
|  | |   becomes a UAF bug!                                                     | | 
|  | |   This can happen if the filesystem is unmounted while the underlying    | | 
|  | |   block device has frozen the filesystem.                                | | 
|  | |   This problem could be solved by grabbing extra references to the       | | 
|  | |   superblock, but it felt suboptimal given the other inadequacies of     | | 
|  | |   this approach.                                                         | | 
|  | |                                                                          | | 
|  | | - The log need not be quiesced to check the summary counters, but a VFS  | | 
|  | |   freeze initiates one anyway.                                           | | 
|  | |   This adds unnecessary runtime to live fscounter fsck operations.       | | 
|  | |                                                                          | | 
|  | | - Quiescing the log means that XFS flushes the (possibly incorrect)      | | 
|  | |   counters to disk as part of cleaning the log.                          | | 
|  | |                                                                          | | 
|  | | - A bug in the VFS meant that freeze could complete even when            | | 
|  | |   sync_filesystem fails to flush the filesystem and returns an error.    | | 
|  | |   This bug was fixed in Linux 5.17.                                      | | 
|  | +--------------------------------------------------------------------------+ | 
|  |  | 
|  | The proposed patchset is the | 
|  | `summary counter cleanup | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_ | 
|  | series. | 
|  |  | 
|  | Full Filesystem Scans | 
|  | --------------------- | 
|  |  | 
|  | Certain types of metadata can only be checked by walking every file in the | 
|  | entire filesystem to record observations and comparing the observations against | 
|  | what's recorded on disk. | 
|  | Like every other type of online repair, repairs are made by writing those | 
|  | observations to disk in a replacement structure and committing it atomically. | 
|  | However, it is not practical to shut down the entire filesystem to examine | 
|  | hundreds of billions of files because the downtime would be excessive. | 
|  | Therefore, online fsck must build the infrastructure to manage a live scan of | 
|  | all the files in the filesystem. | 
|  | There are two questions that need to be solved to perform a live walk: | 
|  |  | 
|  | - How does scrub manage the scan while it is collecting data? | 
|  |  | 
|  | - How does the scan keep abreast of changes being made to the system by other | 
|  | threads? | 
|  |  | 
|  | .. _iscan: | 
|  |  | 
|  | Coordinated Inode Scans | 
|  | ``````````````````````` | 
|  |  | 
|  | In the original Unix filesystems of the 1970s, each directory entry contained | 
|  | an index number (*inumber*) which was used as an index into on ondisk array | 
|  | (*itable*) of fixed-size records (*inodes*) describing a file's attributes and | 
|  | its data block mapping. | 
|  | This system is described by J. Lions, `"inode (5659)" | 
|  | <http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions' Commentary on | 
|  | UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South | 
|  | Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson, | 
|  | `"Implementation of the File System" | 
|  | <https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX | 
|  | Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp. | 
|  | 1913-4. | 
|  |  | 
|  | XFS retains most of this design, except now inumbers are search keys over all | 
|  | the space in the data section filesystem. | 
|  | They form a continuous keyspace that can be expressed as a 64-bit integer, | 
|  | though the inodes themselves are sparsely distributed within the keyspace. | 
|  | Scans proceed in a linear fashion across the inumber keyspace, starting from | 
|  | ``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``. | 
|  | Naturally, a scan through a keyspace requires a scan cursor object to track the | 
|  | scan progress. | 
|  | Because this keyspace is sparse, this cursor contains two parts. | 
|  | The first part of this scan cursor object tracks the inode that will be | 
|  | examined next; call this the examination cursor. | 
|  | Somewhat less obviously, the scan cursor object must also track which parts of | 
|  | the keyspace have already been visited, which is critical for deciding if a | 
|  | concurrent filesystem update needs to be incorporated into the scan data. | 
|  | Call this the visited inode cursor. | 
|  |  | 
|  | Advancing the scan cursor is a multi-step process encapsulated in | 
|  | ``xchk_iscan_iter``: | 
|  |  | 
|  | 1. Lock the AGI buffer of the AG containing the inode pointed to by the visited | 
|  | inode cursor. | 
|  | This guarantee that inodes in this AG cannot be allocated or freed while | 
|  | advancing the cursor. | 
|  |  | 
|  | 2. Use the per-AG inode btree to look up the next inumber after the one that | 
|  | was just visited, since it may not be keyspace adjacent. | 
|  |  | 
|  | 3. If there are no more inodes left in this AG: | 
|  |  | 
|  | a. Move the examination cursor to the point of the inumber keyspace that | 
|  | corresponds to the start of the next AG. | 
|  |  | 
|  | b. Adjust the visited inode cursor to indicate that it has "visited" the | 
|  | last possible inode in the current AG's inode keyspace. | 
|  | XFS inumbers are segmented, so the cursor needs to be marked as having | 
|  | visited the entire keyspace up to just before the start of the next AG's | 
|  | inode keyspace. | 
|  |  | 
|  | c. Unlock the AGI and return to step 1 if there are unexamined AGs in the | 
|  | filesystem. | 
|  |  | 
|  | d. If there are no more AGs to examine, set both cursors to the end of the | 
|  | inumber keyspace. | 
|  | The scan is now complete. | 
|  |  | 
|  | 4. Otherwise, there is at least one more inode to scan in this AG: | 
|  |  | 
|  | a. Move the examination cursor ahead to the next inode marked as allocated | 
|  | by the inode btree. | 
|  |  | 
|  | b. Adjust the visited inode cursor to point to the inode just prior to where | 
|  | the examination cursor is now. | 
|  | Because the scanner holds the AGI buffer lock, no inodes could have been | 
|  | created in the part of the inode keyspace that the visited inode cursor | 
|  | just advanced. | 
|  |  | 
|  | 5. Get the incore inode for the inumber of the examination cursor. | 
|  | By maintaining the AGI buffer lock until this point, the scanner knows that | 
|  | it was safe to advance the examination cursor across the entire keyspace, | 
|  | and that it has stabilized this next inode so that it cannot disappear from | 
|  | the filesystem until the scan releases the incore inode. | 
|  |  | 
|  | 6. Drop the AGI lock and return the incore inode to the caller. | 
|  |  | 
|  | Online fsck functions scan all files in the filesystem as follows: | 
|  |  | 
|  | 1. Start a scan by calling ``xchk_iscan_start``. | 
|  |  | 
|  | 2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode. | 
|  | If one is provided: | 
|  |  | 
|  | a. Lock the inode to prevent updates during the scan. | 
|  |  | 
|  | b. Scan the inode. | 
|  |  | 
|  | c. While still holding the inode lock, adjust the visited inode cursor | 
|  | (``xchk_iscan_mark_visited``) to point to this inode. | 
|  |  | 
|  | d. Unlock and release the inode. | 
|  |  | 
|  | 8. Call ``xchk_iscan_teardown`` to complete the scan. | 
|  |  | 
|  | There are subtleties with the inode cache that complicate grabbing the incore | 
|  | inode for the caller. | 
|  | Obviously, it is an absolute requirement that the inode metadata be consistent | 
|  | enough to load it into the inode cache. | 
|  | Second, if the incore inode is stuck in some intermediate state, the scan | 
|  | coordinator must release the AGI and push the main filesystem to get the inode | 
|  | back into a loadable state. | 
|  |  | 
|  | The proposed patches are the | 
|  | `inode scanner | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_ | 
|  | series. | 
|  | The first user of the new functionality is the | 
|  | `online quotacheck | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_ | 
|  | series. | 
|  |  | 
|  | Inode Management | 
|  | ```````````````` | 
|  |  | 
|  | In regular filesystem code, references to allocated XFS incore inodes are | 
|  | always obtained (``xfs_iget``) outside of transaction context because the | 
|  | creation of the incore context for an existing file does not require metadata | 
|  | updates. | 
|  | However, it is important to note that references to incore inodes obtained as | 
|  | part of file creation must be performed in transaction context because the | 
|  | filesystem must ensure the atomicity of the ondisk inode btree index updates | 
|  | and the initialization of the actual ondisk inode. | 
|  |  | 
|  | References to incore inodes are always released (``xfs_irele``) outside of | 
|  | transaction context because there are a handful of activities that might | 
|  | require ondisk updates: | 
|  |  | 
|  | - The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode | 
|  | release. | 
|  |  | 
|  | - Speculative preallocations need to be unreserved. | 
|  |  | 
|  | - An unlinked file may have lost its last reference, in which case the entire | 
|  | file must be inactivated, which involves releasing all of its resources in | 
|  | the ondisk metadata and freeing the inode. | 
|  |  | 
|  | These activities are collectively called inode inactivation. | 
|  | Inactivation has two parts -- the VFS part, which initiates writeback on all | 
|  | dirty file pages, and the XFS part, which cleans up XFS-specific information | 
|  | and frees the inode if it was unlinked. | 
|  | If the inode is unlinked (or unconnected after a file handle operation), the | 
|  | kernel drops the inode into the inactivation machinery immediately. | 
|  |  | 
|  | During normal operation, resource acquisition for an update follows this order | 
|  | to avoid deadlocks: | 
|  |  | 
|  | 1. Inode reference (``iget``). | 
|  |  | 
|  | 2. Filesystem freeze protection, if repairing (``mnt_want_write_file``). | 
|  |  | 
|  | 3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO. | 
|  |  | 
|  | 4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations that | 
|  | can update page cache mappings. | 
|  |  | 
|  | 5. Log feature enablement. | 
|  |  | 
|  | 6. Transaction log space grant. | 
|  |  | 
|  | 7. Space on the data and realtime devices for the transaction. | 
|  |  | 
|  | 8. Incore dquot references, if a file is being repaired. | 
|  | Note that they are not locked, merely acquired. | 
|  |  | 
|  | 9. Inode ``ILOCK`` for file metadata updates. | 
|  |  | 
|  | 10. AG header buffer locks / Realtime metadata inode ILOCK. | 
|  |  | 
|  | 11. Realtime metadata buffer locks, if applicable. | 
|  |  | 
|  | 12. Extent mapping btree blocks, if applicable. | 
|  |  | 
|  | Resources are often released in the reverse order, though this is not required. | 
|  | However, online fsck differs from regular XFS operations because it may examine | 
|  | an object that normally is acquired in a later stage of the locking order, and | 
|  | then decide to cross-reference the object with an object that is acquired | 
|  | earlier in the order. | 
|  | The next few sections detail the specific ways in which online fsck takes care | 
|  | to avoid deadlocks. | 
|  |  | 
|  | iget and irele During a Scrub | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | An inode scan performed on behalf of a scrub operation runs in transaction | 
|  | context, and possibly with resources already locked and bound to it. | 
|  | This isn't much of a problem for ``iget`` since it can operate in the context | 
|  | of an existing transaction, as long as all of the bound resources are acquired | 
|  | before the inode reference in the regular filesystem. | 
|  |  | 
|  | When the VFS ``iput`` function is given a linked inode with no other | 
|  | references, it normally puts the inode on an LRU list in the hope that it can | 
|  | save time if another process re-opens the file before the system runs out | 
|  | of memory and frees it. | 
|  | Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE`` | 
|  | flag on the inode to cause the kernel to try to drop the inode into the | 
|  | inactivation machinery immediately. | 
|  |  | 
|  | In the past, inactivation was always done from the process that dropped the | 
|  | inode, which was a problem for scrub because scrub may already hold a | 
|  | transaction, and XFS does not support nesting transactions. | 
|  | On the other hand, if there is no scrub transaction, it is desirable to drop | 
|  | otherwise unused inodes immediately to avoid polluting caches. | 
|  | To capture these nuances, the online fsck code has a separate ``xchk_irele`` | 
|  | function to set or clear the ``DONTCACHE`` flag to get the required release | 
|  | behavior. | 
|  |  | 
|  | Proposed patchsets include fixing | 
|  | `scrub iget usage | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and | 
|  | `dir iget usage | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_. | 
|  |  | 
|  | .. _ilocking: | 
|  |  | 
|  | Locking Inodes | 
|  | ^^^^^^^^^^^^^^ | 
|  |  | 
|  | In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks | 
|  | in a well-known order: parent → child when updating the directory tree, and | 
|  | in numerical order of the addresses of their ``struct inode`` object otherwise. | 
|  | For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page | 
|  | faults. | 
|  | If two MMAPLOCKs must be acquired, they are acquired in numerical order of | 
|  | the addresses of their ``struct address_space`` objects. | 
|  | Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be | 
|  | acquired before transactions are allocated. | 
|  | If two ILOCKs must be acquired, they are acquired in inumber order. | 
|  |  | 
|  | Inode lock acquisition must be done carefully during a coordinated inode scan. | 
|  | Online fsck cannot abide these conventions, because for a directory tree | 
|  | scanner, the scrub process holds the IOLOCK of the file being scanned and it | 
|  | needs to take the IOLOCK of the file at the other end of the directory link. | 
|  | If the directory tree is corrupt because it contains a cycle, ``xfs_scrub`` | 
|  | cannot use the regular inode locking functions and avoid becoming trapped in an | 
|  | ABBA deadlock. | 
|  |  | 
|  | Solving both of these problems is straightforward -- any time online fsck | 
|  | needs to take a second lock of the same class, it uses trylock to avoid an ABBA | 
|  | deadlock. | 
|  | If the trylock fails, scrub drops all inode locks and use trylock loops to | 
|  | (re)acquire all necessary resources. | 
|  | Trylock loops enable scrub to check for pending fatal signals, which is how | 
|  | scrub avoids deadlocking the filesystem or becoming an unresponsive process. | 
|  | However, trylock loops means that online fsck must be prepared to measure the | 
|  | resource being scrubbed before and after the lock cycle to detect changes and | 
|  | react accordingly. | 
|  |  | 
|  | .. _dirparent: | 
|  |  | 
|  | Case Study: Finding a Directory Parent | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | Consider the directory parent pointer repair code as an example. | 
|  | Online fsck must verify that the dotdot dirent of a directory points up to a | 
|  | parent directory, and that the parent directory contains exactly one dirent | 
|  | pointing down to the child directory. | 
|  | Fully validating this relationship (and repairing it if possible) requires a | 
|  | walk of every directory on the filesystem while holding the child locked, and | 
|  | while updates to the directory tree are being made. | 
|  | The coordinated inode scan provides a way to walk the filesystem without the | 
|  | possibility of missing an inode. | 
|  | The child directory is kept locked to prevent updates to the dotdot dirent, but | 
|  | if the scanner fails to lock a parent, it can drop and relock both the child | 
|  | and the prospective parent. | 
|  | If the dotdot entry changes while the directory is unlocked, then a move or | 
|  | rename operation must have changed the child's parentage, and the scan can | 
|  | exit early. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `directory repair | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_ | 
|  | series. | 
|  |  | 
|  | .. _fshooks: | 
|  |  | 
|  | Filesystem Hooks | 
|  | ````````````````` | 
|  |  | 
|  | The second piece of support that online fsck functions need during a full | 
|  | filesystem scan is the ability to stay informed about updates being made by | 
|  | other threads in the filesystem, since comparisons against the past are useless | 
|  | in a dynamic environment. | 
|  | Two pieces of Linux kernel infrastructure enable online fsck to monitor regular | 
|  | filesystem operations: filesystem hooks and :ref:`static keys<jump_labels>`. | 
|  |  | 
|  | Filesystem hooks convey information about an ongoing filesystem operation to | 
|  | a downstream consumer. | 
|  | In this case, the downstream consumer is always an online fsck function. | 
|  | Because multiple fsck functions can run in parallel, online fsck uses the Linux | 
|  | notifier call chain facility to dispatch updates to any number of interested | 
|  | fsck processes. | 
|  | Call chains are a dynamic list, which means that they can be configured at | 
|  | run time. | 
|  | Because these hooks are private to the XFS module, the information passed along | 
|  | contains exactly what the checking function needs to update its observations. | 
|  |  | 
|  | The current implementation of XFS hooks uses SRCU notifier chains to reduce the | 
|  | impact to highly threaded workloads. | 
|  | Regular blocking notifier chains use a rwsem and seem to have a much lower | 
|  | overhead for single-threaded applications. | 
|  | However, it may turn out that the combination of blocking chains and static | 
|  | keys are a more performant combination; more study is needed here. | 
|  |  | 
|  | The following pieces are necessary to hook a certain point in the filesystem: | 
|  |  | 
|  | - A ``struct xfs_hooks`` object must be embedded in a convenient place such as | 
|  | a well-known incore filesystem object. | 
|  |  | 
|  | - Each hook must define an action code and a structure containing more context | 
|  | about the action. | 
|  |  | 
|  | - Hook providers should provide appropriate wrapper functions and structs | 
|  | around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type | 
|  | checking to ensure correct usage. | 
|  |  | 
|  | - A callsite in the regular filesystem code must be chosen to call | 
|  | ``xfs_hooks_call`` with the action code and data structure. | 
|  | This place should be adjacent to (and not earlier than) the place where | 
|  | the filesystem update is committed to the transaction. | 
|  | In general, when the filesystem calls a hook chain, it should be able to | 
|  | handle sleeping and should not be vulnerable to memory reclaim or locking | 
|  | recursion. | 
|  | However, the exact requirements are very dependent on the context of the hook | 
|  | caller and the callee. | 
|  |  | 
|  | - The online fsck function should define a structure to hold scan data, a lock | 
|  | to coordinate access to the scan data, and a ``struct xfs_hook`` object. | 
|  | The scanner function and the regular filesystem code must acquire resources | 
|  | in the same order; see the next section for details. | 
|  |  | 
|  | - The online fsck code must contain a C function to catch the hook action code | 
|  | and data structure. | 
|  | If the object being updated has already been visited by the scan, then the | 
|  | hook information must be applied to the scan data. | 
|  |  | 
|  | - Prior to unlocking inodes to start the scan, online fsck must call | 
|  | ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and | 
|  | ``xfs_hooks_add`` to enable the hook. | 
|  |  | 
|  | - Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is | 
|  | complete. | 
|  |  | 
|  | The number of hooks should be kept to a minimum to reduce complexity. | 
|  | Static keys are used to reduce the overhead of filesystem hooks to nearly | 
|  | zero when online fsck is not running. | 
|  |  | 
|  | .. _liveupdate: | 
|  |  | 
|  | Live Updates During a Scan | 
|  | `````````````````````````` | 
|  |  | 
|  | The code paths of the online fsck scanning code and the :ref:`hooked<fshooks>` | 
|  | filesystem code look like this:: | 
|  |  | 
|  | other program | 
|  | ↓ | 
|  | inode lock ←────────────────────┐ | 
|  | ↓                         │ | 
|  | AG header lock                  │ | 
|  | ↓                         │ | 
|  | filesystem function             │ | 
|  | ↓                         │ | 
|  | notifier call chain             │    same | 
|  | ↓                         ├─── inode | 
|  | scrub hook function             │    lock | 
|  | ↓                         │ | 
|  | scan data mutex ←──┐    same    │ | 
|  | ↓            ├─── scan    │ | 
|  | update scan data   │    lock    │ | 
|  | ↑            │            │ | 
|  | scan data mutex ←──┘            │ | 
|  | ↑                         │ | 
|  | inode lock ←────────────────────┘ | 
|  | ↑ | 
|  | scrub function | 
|  | ↑ | 
|  | inode scanner | 
|  | ↑ | 
|  | xfs_scrub | 
|  |  | 
|  | These rules must be followed to ensure correct interactions between the | 
|  | checking code and the code making an update to the filesystem: | 
|  |  | 
|  | - Prior to invoking the notifier call chain, the filesystem function being | 
|  | hooked must acquire the same lock that the scrub scanning function acquires | 
|  | to scan the inode. | 
|  |  | 
|  | - The scanning function and the scrub hook function must coordinate access to | 
|  | the scan data by acquiring a lock on the scan data. | 
|  |  | 
|  | - Scrub hook function must not add the live update information to the scan | 
|  | observations unless the inode being updated has already been scanned. | 
|  | The scan coordinator has a helper predicate (``xchk_iscan_want_live_update``) | 
|  | for this. | 
|  |  | 
|  | - Scrub hook functions must not change the caller's state, including the | 
|  | transaction that it is running. | 
|  | They must not acquire any resources that might conflict with the filesystem | 
|  | function being hooked. | 
|  |  | 
|  | - The hook function can abort the inode scan to avoid breaking the other rules. | 
|  |  | 
|  | The inode scan APIs are pretty simple: | 
|  |  | 
|  | - ``xchk_iscan_start`` starts a scan | 
|  |  | 
|  | - ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or | 
|  | returns zero if there is nothing left to scan | 
|  |  | 
|  | - ``xchk_iscan_want_live_update`` to decide if an inode has already been | 
|  | visited in the scan. | 
|  | This is critical for hook functions to decide if they need to update the | 
|  | in-memory scan information. | 
|  |  | 
|  | - ``xchk_iscan_mark_visited`` to mark an inode as having been visited in the | 
|  | scan | 
|  |  | 
|  | - ``xchk_iscan_teardown`` to finish the scan | 
|  |  | 
|  | This functionality is also a part of the | 
|  | `inode scanner | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_ | 
|  | series. | 
|  |  | 
|  | .. _quotacheck: | 
|  |  | 
|  | Case Study: Quota Counter Checking | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | It is useful to compare the mount time quotacheck code to the online repair | 
|  | quotacheck code. | 
|  | Mount time quotacheck does not have to contend with concurrent operations, so | 
|  | it does the following: | 
|  |  | 
|  | 1. Make sure the ondisk dquots are in good enough shape that all the incore | 
|  | dquots will actually load, and zero the resource usage counters in the | 
|  | ondisk buffer. | 
|  |  | 
|  | 2. Walk every inode in the filesystem. | 
|  | Add each file's resource usage to the incore dquot. | 
|  |  | 
|  | 3. Walk each incore dquot. | 
|  | If the incore dquot is not being flushed, add the ondisk buffer backing the | 
|  | incore dquot to a delayed write (delwri) list. | 
|  |  | 
|  | 4. Write the buffer list to disk. | 
|  |  | 
|  | Like most online fsck functions, online quotacheck can't write to regular | 
|  | filesystem objects until the newly collected metadata reflect all filesystem | 
|  | state. | 
|  | Therefore, online quotacheck records file resource usage to a shadow dquot | 
|  | index implemented with a sparse ``xfarray``, and only writes to the real dquots | 
|  | once the scan is complete. | 
|  | Handling transactional updates is tricky because quota resource usage updates | 
|  | are handled in phases to minimize contention on dquots: | 
|  |  | 
|  | 1. The inodes involved are joined and locked to a transaction. | 
|  |  | 
|  | 2. For each dquot attached to the file: | 
|  |  | 
|  | a. The dquot is locked. | 
|  |  | 
|  | b. A quota reservation is added to the dquot's resource usage. | 
|  | The reservation is recorded in the transaction. | 
|  |  | 
|  | c. The dquot is unlocked. | 
|  |  | 
|  | 3. Changes in actual quota usage are tracked in the transaction. | 
|  |  | 
|  | 4. At transaction commit time, each dquot is examined again: | 
|  |  | 
|  | a. The dquot is locked again. | 
|  |  | 
|  | b. Quota usage changes are logged and unused reservation is given back to | 
|  | the dquot. | 
|  |  | 
|  | c. The dquot is unlocked. | 
|  |  | 
|  | For online quotacheck, hooks are placed in steps 2 and 4. | 
|  | The step 2 hook creates a shadow version of the transaction dquot context | 
|  | (``dqtrx``) that operates in a similar manner to the regular code. | 
|  | The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots. | 
|  | Notice that both hooks are called with the inode locked, which is how the | 
|  | live update coordinates with the inode scanner. | 
|  |  | 
|  | The quotacheck scan looks like this: | 
|  |  | 
|  | 1. Set up a coordinated inode scan. | 
|  |  | 
|  | 2. For each inode returned by the inode scan iterator: | 
|  |  | 
|  | a. Grab and lock the inode. | 
|  |  | 
|  | b. Determine that inode's resource usage (data blocks, inode counts, | 
|  | realtime blocks) and add that to the shadow dquots for the user, group, | 
|  | and project ids associated with the inode. | 
|  |  | 
|  | c. Unlock and release the inode. | 
|  |  | 
|  | 3. For each dquot in the system: | 
|  |  | 
|  | a. Grab and lock the dquot. | 
|  |  | 
|  | b. Check the dquot against the shadow dquots created by the scan and updated | 
|  | by the live hooks. | 
|  |  | 
|  | Live updates are key to being able to walk every quota record without | 
|  | needing to hold any locks for a long duration. | 
|  | If repairs are desired, the real and shadow dquots are locked and their | 
|  | resource counts are set to the values in the shadow dquot. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `online quotacheck | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_ | 
|  | series. | 
|  |  | 
|  | .. _nlinks: | 
|  |  | 
|  | Case Study: File Link Count Checking | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | File link count checking also uses live update hooks. | 
|  | The coordinated inode scanner is used to visit all directories on the | 
|  | filesystem, and per-file link count records are stored in a sparse ``xfarray`` | 
|  | indexed by inumber. | 
|  | During the scanning phase, each entry in a directory generates observation | 
|  | data as follows: | 
|  |  | 
|  | 1. If the entry is a dotdot (``'..'``) entry of the root directory, the | 
|  | directory's parent link count is bumped because the root directory's dotdot | 
|  | entry is self referential. | 
|  |  | 
|  | 2. If the entry is a dotdot entry of a subdirectory, the parent's backref | 
|  | count is bumped. | 
|  |  | 
|  | 3. If the entry is neither a dot nor a dotdot entry, the target file's parent | 
|  | count is bumped. | 
|  |  | 
|  | 4. If the target is a subdirectory, the parent's child link count is bumped. | 
|  |  | 
|  | A crucial point to understand about how the link count inode scanner interacts | 
|  | with the live update hooks is that the scan cursor tracks which *parent* | 
|  | directories have been scanned. | 
|  | In other words, the live updates ignore any update about ``A → B`` when A has | 
|  | not been scanned, even if B has been scanned. | 
|  | Furthermore, a subdirectory A with a dotdot entry pointing back to B is | 
|  | accounted as a backref counter in the shadow data for A, since child dotdot | 
|  | entries affect the parent's link count. | 
|  | Live update hooks are carefully placed in all parts of the filesystem that | 
|  | create, change, or remove directory entries, since those operations involve | 
|  | bumplink and droplink. | 
|  |  | 
|  | For any file, the correct link count is the number of parents plus the number | 
|  | of child subdirectories. | 
|  | Non-directories never have children of any kind. | 
|  | The backref information is used to detect inconsistencies in the number of | 
|  | links pointing to child subdirectories and the number of dotdot entries | 
|  | pointing back. | 
|  |  | 
|  | After the scan completes, the link count of each file can be checked by locking | 
|  | both the inode and the shadow data, and comparing the link counts. | 
|  | A second coordinated inode scan cursor is used for comparisons. | 
|  | Live updates are key to being able to walk every inode without needing to hold | 
|  | any locks between inodes. | 
|  | If repairs are desired, the inode's link count is set to the value in the | 
|  | shadow information. | 
|  | If no parents are found, the file must be :ref:`reparented <orphanage>` to the | 
|  | orphanage to prevent the file from being lost forever. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `file link count repair | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_ | 
|  | series. | 
|  |  | 
|  | .. _rmap_repair: | 
|  |  | 
|  | Case Study: Rebuilding Reverse Mapping Records | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | Most repair functions follow the same pattern: lock filesystem resources, | 
|  | walk the surviving ondisk metadata looking for replacement metadata records, | 
|  | and use an :ref:`in-memory array <xfarray>` to store the gathered observations. | 
|  | The primary advantage of this approach is the simplicity and modularity of the | 
|  | repair code -- code and data are entirely contained within the scrub module, | 
|  | do not require hooks in the main filesystem, and are usually the most efficient | 
|  | in memory use. | 
|  | A secondary advantage of this repair approach is atomicity -- once the kernel | 
|  | decides a structure is corrupt, no other threads can access the metadata until | 
|  | the kernel finishes repairing and revalidating the metadata. | 
|  |  | 
|  | For repairs going on within a shard of the filesystem, these advantages | 
|  | outweigh the delays inherent in locking the shard while repairing parts of the | 
|  | shard. | 
|  | Unfortunately, repairs to the reverse mapping btree cannot use the "standard" | 
|  | btree repair strategy because it must scan every space mapping of every fork of | 
|  | every file in the filesystem, and the filesystem cannot stop. | 
|  | Therefore, rmap repair foregoes atomicity between scrub and repair. | 
|  | It combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live update hooks | 
|  | <liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the | 
|  | scan for reverse mapping records. | 
|  |  | 
|  | 1. Set up an xfbtree to stage rmap records. | 
|  |  | 
|  | 2. While holding the locks on the AGI and AGF buffers acquired during the | 
|  | scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW | 
|  | staging extents, and the internal log. | 
|  |  | 
|  | 3. Set up an inode scanner. | 
|  |  | 
|  | 4. Hook into rmap updates for the AG being repaired so that the live scan data | 
|  | can receive updates to the rmap btree from the rest of the filesystem during | 
|  | the file scan. | 
|  |  | 
|  | 5. For each space mapping found in either fork of each file scanned, | 
|  | decide if the mapping matches the AG of interest. | 
|  | If so: | 
|  |  | 
|  | a. Create a btree cursor for the in-memory btree. | 
|  |  | 
|  | b. Use the rmap code to add the record to the in-memory btree. | 
|  |  | 
|  | c. Use the :ref:`special commit function <xfbtree_commit>` to write the | 
|  | xfbtree changes to the xfile. | 
|  |  | 
|  | 6. For each live update received via the hook, decide if the owner has already | 
|  | been scanned. | 
|  | If so, apply the live update into the scan data: | 
|  |  | 
|  | a. Create a btree cursor for the in-memory btree. | 
|  |  | 
|  | b. Replay the operation into the in-memory btree. | 
|  |  | 
|  | c. Use the :ref:`special commit function <xfbtree_commit>` to write the | 
|  | xfbtree changes to the xfile. | 
|  | This is performed with an empty transaction to avoid changing the | 
|  | caller's state. | 
|  |  | 
|  | 7. When the inode scan finishes, create a new scrub transaction and relock the | 
|  | two AG headers. | 
|  |  | 
|  | 8. Compute the new btree geometry using the number of rmap records in the | 
|  | shadow btree, like all other btree rebuilding functions. | 
|  |  | 
|  | 9. Allocate the number of blocks computed in the previous step. | 
|  |  | 
|  | 10. Perform the usual btree bulk loading and commit to install the new rmap | 
|  | btree. | 
|  |  | 
|  | 11. Reap the old rmap btree blocks as discussed in the case study about how | 
|  | to :ref:`reap after rmap btree repair <rmap_reap>`. | 
|  |  | 
|  | 12. Free the xfbtree now that it not needed. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `rmap repair | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_ | 
|  | series. | 
|  |  | 
|  | Staging Repairs with Temporary Files on Disk | 
|  | -------------------------------------------- | 
|  |  | 
|  | XFS stores a substantial amount of metadata in file forks: directories, | 
|  | extended attributes, symbolic link targets, free space bitmaps and summary | 
|  | information for the realtime volume, and quota records. | 
|  | File forks map 64-bit logical file fork space extents to physical storage space | 
|  | extents, similar to how a memory management unit maps 64-bit virtual addresses | 
|  | to physical memory addresses. | 
|  | Therefore, file-based tree structures (such as directories and extended | 
|  | attributes) use blocks mapped in the file fork offset address space that point | 
|  | to other blocks mapped within that same address space, and file-based linear | 
|  | structures (such as bitmaps and quota records) compute array element offsets in | 
|  | the file fork offset address space. | 
|  |  | 
|  | Because file forks can consume as much space as the entire filesystem, repairs | 
|  | cannot be staged in memory, even when a paging scheme is available. | 
|  | Therefore, online repair of file-based metadata createas a temporary file in | 
|  | the XFS filesystem, writes a new structure at the correct offsets into the | 
|  | temporary file, and atomically exchanges all file fork mappings (and hence the | 
|  | fork contents) to commit the repair. | 
|  | Once the repair is complete, the old fork can be reaped as necessary; if the | 
|  | system goes down during the reap, the iunlink code will delete the blocks | 
|  | during log recovery. | 
|  |  | 
|  | **Note**: All space usage and inode indices in the filesystem *must* be | 
|  | consistent to use a temporary file safely! | 
|  | This dependency is the reason why online repair can only use pageable kernel | 
|  | memory to stage ondisk space usage information. | 
|  |  | 
|  | Exchanging metadata file mappings with a temporary file requires the owner | 
|  | field of the block headers to match the file being repaired and not the | 
|  | temporary file. | 
|  | The directory, extended attribute, and symbolic link functions were all | 
|  | modified to allow callers to specify owner numbers explicitly. | 
|  |  | 
|  | There is a downside to the reaping process -- if the system crashes during the | 
|  | reap phase and the fork extents are crosslinked, the iunlink processing will | 
|  | fail because freeing space will find the extra reverse mappings and abort. | 
|  |  | 
|  | Temporary files created for repair are similar to ``O_TMPFILE`` files created | 
|  | by userspace. | 
|  | They are not linked into a directory and the entire file will be reaped when | 
|  | the last reference to the file is lost. | 
|  | The key differences are that these files must have no access permission outside | 
|  | the kernel at all, they must be specially marked to prevent them from being | 
|  | opened by handle, and they must never be linked into the directory tree. | 
|  |  | 
|  | +--------------------------------------------------------------------------+ | 
|  | | **Historical Sidebar**:                                                  | | 
|  | +--------------------------------------------------------------------------+ | 
|  | | In the initial iteration of file metadata repair, the damaged metadata   | | 
|  | | blocks would be scanned for salvageable data; the extents in the file    | | 
|  | | fork would be reaped; and then a new structure would be built in its     | | 
|  | | place.                                                                   | | 
|  | | This strategy did not survive the introduction of the atomic repair      | | 
|  | | requirement expressed earlier in this document.                          | | 
|  | |                                                                          | | 
|  | | The second iteration explored building a second structure at a high      | | 
|  | | offset in the fork from the salvage data, reaping the old extents, and   | | 
|  | | using a ``COLLAPSE_RANGE`` operation to slide the new extents into       | | 
|  | | place.                                                                   | | 
|  | |                                                                          | | 
|  | | This had many drawbacks:                                                 | | 
|  | |                                                                          | | 
|  | | - Array structures are linearly addressed, and the regular filesystem    | | 
|  | |   codebase does not have the concept of a linear offset that could be    | | 
|  | |   applied to the record offset computation to build an alternate copy.   | | 
|  | |                                                                          | | 
|  | | - Extended attributes are allowed to use the entire attr fork offset     | | 
|  | |   address space.                                                         | | 
|  | |                                                                          | | 
|  | | - Even if repair could build an alternate copy of a data structure in a  | | 
|  | |   different part of the fork address space, the atomic repair commit     | | 
|  | |   requirement means that online repair would have to be able to perform  | | 
|  | |   a log assisted ``COLLAPSE_RANGE`` operation to ensure that the old     | | 
|  | |   structure was completely replaced.                                     | | 
|  | |                                                                          | | 
|  | | - A crash after construction of the secondary tree but before the range  | | 
|  | |   collapse would leave unreachable blocks in the file fork.              | | 
|  | |   This would likely confuse things further.                              | | 
|  | |                                                                          | | 
|  | | - Reaping blocks after a repair is not a simple operation, and           | | 
|  | |   initiating a reap operation from a restarted range collapse operation  | | 
|  | |   during log recovery is daunting.                                       | | 
|  | |                                                                          | | 
|  | | - Directory entry blocks and quota records record the file fork offset   | | 
|  | |   in the header area of each block.                                      | | 
|  | |   An atomic range collapse operation would have to rewrite this part of  | | 
|  | |   each block header.                                                     | | 
|  | |   Rewriting a single field in block headers is not a huge problem, but   | | 
|  | |   it's something to be aware of.                                         | | 
|  | |                                                                          | | 
|  | | - Each block in a directory or extended attributes btree index contains  | | 
|  | |   sibling and child block pointers.                                      | | 
|  | |   Were the atomic commit to use a range collapse operation, each block   | | 
|  | |   would have to be rewritten very carefully to preserve the graph        | | 
|  | |   structure.                                                             | | 
|  | |   Doing this as part of a range collapse means rewriting a large number  | | 
|  | |   of blocks repeatedly, which is not conducive to quick repairs.         | | 
|  | |                                                                          | | 
|  | | This lead to the introduction of temporary file staging.                 | | 
|  | +--------------------------------------------------------------------------+ | 
|  |  | 
|  | Using a Temporary File | 
|  | `````````````````````` | 
|  |  | 
|  | Online repair code should use the ``xrep_tempfile_create`` function to create a | 
|  | temporary file inside the filesystem. | 
|  | This allocates an inode, marks the in-core inode private, and attaches it to | 
|  | the scrub context. | 
|  | These files are hidden from userspace, may not be added to the directory tree, | 
|  | and must be kept private. | 
|  |  | 
|  | Temporary files only use two inode locks: the IOLOCK and the ILOCK. | 
|  | The MMAPLOCK is not needed here, because there must not be page faults from | 
|  | userspace for data fork blocks. | 
|  | The usage patterns of these two locks are the same as for any other XFS file -- | 
|  | access to file data are controlled via the IOLOCK, and access to file metadata | 
|  | are controlled via the ILOCK. | 
|  | Locking helpers are provided so that the temporary file and its lock state can | 
|  | be cleaned up by the scrub context. | 
|  | To comply with the nested locking strategy laid out in the :ref:`inode | 
|  | locking<ilocking>` section, it is recommended that scrub functions use the | 
|  | xrep_tempfile_ilock*_nowait lock helpers. | 
|  |  | 
|  | Data can be written to a temporary file by two means: | 
|  |  | 
|  | 1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular | 
|  | temporary file from an xfile. | 
|  |  | 
|  | 2. The regular directory, symbolic link, and extended attribute functions can | 
|  | be used to write to the temporary file. | 
|  |  | 
|  | Once a good copy of a data file has been constructed in a temporary file, it | 
|  | must be conveyed to the file being repaired, which is the topic of the next | 
|  | section. | 
|  |  | 
|  | The proposed patches are in the | 
|  | `repair temporary files | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_ | 
|  | series. | 
|  |  | 
|  | Logged File Content Exchanges | 
|  | ----------------------------- | 
|  |  | 
|  | Once repair builds a temporary file with a new data structure written into | 
|  | it, it must commit the new changes into the existing file. | 
|  | It is not possible to swap the inumbers of two files, so instead the new | 
|  | metadata must replace the old. | 
|  | This suggests the need for the ability to swap extents, but the existing extent | 
|  | swapping code used by the file defragmenting tool ``xfs_fsr`` is not sufficient | 
|  | for online repair because: | 
|  |  | 
|  | a. When the reverse-mapping btree is enabled, the swap code must keep the | 
|  | reverse mapping information up to date with every exchange of mappings. | 
|  | Therefore, it can only exchange one mapping per transaction, and each | 
|  | transaction is independent. | 
|  |  | 
|  | b. Reverse-mapping is critical for the operation of online fsck, so the old | 
|  | defragmentation code (which swapped entire extent forks in a single | 
|  | operation) is not useful here. | 
|  |  | 
|  | c. Defragmentation is assumed to occur between two files with identical | 
|  | contents. | 
|  | For this use case, an incomplete exchange will not result in a user-visible | 
|  | change in file contents, even if the operation is interrupted. | 
|  |  | 
|  | d. Online repair needs to swap the contents of two files that are by definition | 
|  | *not* identical. | 
|  | For directory and xattr repairs, the user-visible contents might be the | 
|  | same, but the contents of individual blocks may be very different. | 
|  |  | 
|  | e. Old blocks in the file may be cross-linked with another structure and must | 
|  | not reappear if the system goes down mid-repair. | 
|  |  | 
|  | These problems are overcome by creating a new deferred operation and a new type | 
|  | of log intent item to track the progress of an operation to exchange two file | 
|  | ranges. | 
|  | The new exchange operation type chains together the same transactions used by | 
|  | the reverse-mapping extent swap code, but records intermedia progress in the | 
|  | log so that operations can be restarted after a crash. | 
|  | This new functionality is called the file contents exchange (xfs_exchrange) | 
|  | code. | 
|  | The underlying implementation exchanges file fork mappings (xfs_exchmaps). | 
|  | The new log item records the progress of the exchange to ensure that once an | 
|  | exchange begins, it will always run to completion, even there are | 
|  | interruptions. | 
|  | The new ``XFS_SB_FEAT_INCOMPAT_EXCHRANGE`` incompatible feature flag | 
|  | in the superblock protects these new log item records from being replayed on | 
|  | old kernels. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `file contents exchange | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_ | 
|  | series. | 
|  |  | 
|  | +--------------------------------------------------------------------------+ | 
|  | | **Sidebar: Using Log-Incompatible Feature Flags**                        | | 
|  | +--------------------------------------------------------------------------+ | 
|  | | Starting with XFS v5, the superblock contains a                          | | 
|  | | ``sb_features_log_incompat`` field to indicate that the log contains     | | 
|  | | records that might not readable by all kernels that could mount this     | | 
|  | | filesystem.                                                              | | 
|  | | In short, log incompat features protect the log contents against kernels | | 
|  | | that will not understand the contents.                                   | | 
|  | | Unlike the other superblock feature bits, log incompat bits are          | | 
|  | | ephemeral because an empty (clean) log does not need protection.         | | 
|  | | The log cleans itself after its contents have been committed into the    | | 
|  | | filesystem, either as part of an unmount or because the system is        | | 
|  | | otherwise idle.                                                          | | 
|  | | Because upper level code can be working on a transaction at the same     | | 
|  | | time that the log cleans itself, it is necessary for upper level code to | | 
|  | | communicate to the log when it is going to use a log incompatible        | | 
|  | | feature.                                                                 | | 
|  | |                                                                          | | 
|  | | The log coordinates access to incompatible features through the use of   | | 
|  | | one ``struct rw_semaphore`` for each feature.                            | | 
|  | | The log cleaning code tries to take this rwsem in exclusive mode to      | | 
|  | | clear the bit; if the lock attempt fails, the feature bit remains set.   | | 
|  | | The code supporting a log incompat feature should create wrapper         | | 
|  | | functions to obtain the log feature and call                             | | 
|  | | ``xfs_add_incompat_log_feature`` to set the feature bits in the primary  | | 
|  | | superblock.                                                              | | 
|  | | The superblock update is performed transactionally, so the wrapper to    | | 
|  | | obtain log assistance must be called just prior to the creation of the   | | 
|  | | transaction that uses the functionality.                                 | | 
|  | | For a file operation, this step must happen after taking the IOLOCK      | | 
|  | | and the MMAPLOCK, but before allocating the transaction.                 | | 
|  | | When the transaction is complete, the ``xlog_drop_incompat_feat``        | | 
|  | | function is called to release the feature.                               | | 
|  | | The feature bit will not be cleared from the superblock until the log    | | 
|  | | becomes clean.                                                           | | 
|  | |                                                                          | | 
|  | | Log-assisted extended attribute updates and file content exchanges bothe | | 
|  | | use log incompat features and provide convenience wrappers around the    | | 
|  | | functionality.                                                           | | 
|  | +--------------------------------------------------------------------------+ | 
|  |  | 
|  | Mechanics of a Logged File Content Exchange | 
|  | ``````````````````````````````````````````` | 
|  |  | 
|  | Exchanging contents between file forks is a complex task. | 
|  | The goal is to exchange all file fork mappings between two file fork offset | 
|  | ranges. | 
|  | There are likely to be many extent mappings in each fork, and the edges of | 
|  | the mappings aren't necessarily aligned. | 
|  | Furthermore, there may be other updates that need to happen after the exchange, | 
|  | such as exchanging file sizes, inode flags, or conversion of fork data to local | 
|  | format. | 
|  | This is roughly the format of the new deferred exchange-mapping work item: | 
|  |  | 
|  | .. code-block:: c | 
|  |  | 
|  | struct xfs_exchmaps_intent { | 
|  | /* Inodes participating in the operation. */ | 
|  | struct xfs_inode    *xmi_ip1; | 
|  | struct xfs_inode    *xmi_ip2; | 
|  |  | 
|  | /* File offset range information. */ | 
|  | xfs_fileoff_t       xmi_startoff1; | 
|  | xfs_fileoff_t       xmi_startoff2; | 
|  | xfs_filblks_t       xmi_blockcount; | 
|  |  | 
|  | /* Set these file sizes after the operation, unless negative. */ | 
|  | xfs_fsize_t         xmi_isize1; | 
|  | xfs_fsize_t         xmi_isize2; | 
|  |  | 
|  | /* XFS_EXCHMAPS_* log operation flags */ | 
|  | uint64_t            xmi_flags; | 
|  | }; | 
|  |  | 
|  | The new log intent item contains enough information to track two logical fork | 
|  | offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2, | 
|  | blockcount)``. | 
|  | Each step of an exchange operation exchanges the largest file range mapping | 
|  | possible from one file to the other. | 
|  | After each step in the exchange operation, the two startoff fields are | 
|  | incremented and the blockcount field is decremented to reflect the progress | 
|  | made. | 
|  | The flags field captures behavioral parameters such as exchanging attr fork | 
|  | mappings instead of the data fork and other work to be done after the exchange. | 
|  | The two isize fields are used to exchange the file sizes at the end of the | 
|  | operation if the file data fork is the target of the operation. | 
|  |  | 
|  | When the exchange is initiated, the sequence of operations is as follows: | 
|  |  | 
|  | 1. Create a deferred work item for the file mapping exchange. | 
|  | At the start, it should contain the entirety of the file block ranges to be | 
|  | exchanged. | 
|  |  | 
|  | 2. Call ``xfs_defer_finish`` to process the exchange. | 
|  | This is encapsulated in ``xrep_tempexch_contents`` for scrub operations. | 
|  | This will log an extent swap intent item to the transaction for the deferred | 
|  | mapping exchange work item. | 
|  |  | 
|  | 3. Until ``xmi_blockcount`` of the deferred mapping exchange work item is zero, | 
|  |  | 
|  | a. Read the block maps of both file ranges starting at ``xmi_startoff1`` and | 
|  | ``xmi_startoff2``, respectively, and compute the longest extent that can | 
|  | be exchanged in a single step. | 
|  | This is the minimum of the two ``br_blockcount`` s in the mappings. | 
|  | Keep advancing through the file forks until at least one of the mappings | 
|  | contains written blocks. | 
|  | Mutual holes, unwritten extents, and extent mappings to the same physical | 
|  | space are not exchanged. | 
|  |  | 
|  | For the next few steps, this document will refer to the mapping that came | 
|  | from file 1 as "map1", and the mapping that came from file 2 as "map2". | 
|  |  | 
|  | b. Create a deferred block mapping update to unmap map1 from file 1. | 
|  |  | 
|  | c. Create a deferred block mapping update to unmap map2 from file 2. | 
|  |  | 
|  | d. Create a deferred block mapping update to map map1 into file 2. | 
|  |  | 
|  | e. Create a deferred block mapping update to map map2 into file 1. | 
|  |  | 
|  | f. Log the block, quota, and extent count updates for both files. | 
|  |  | 
|  | g. Extend the ondisk size of either file if necessary. | 
|  |  | 
|  | h. Log a mapping exchange done log item for th mapping exchange intent log | 
|  | item that was read at the start of step 3. | 
|  |  | 
|  | i. Compute the amount of file range that has just been covered. | 
|  | This quantity is ``(map1.br_startoff + map1.br_blockcount - | 
|  | xmi_startoff1)``, because step 3a could have skipped holes. | 
|  |  | 
|  | j. Increase the starting offsets of ``xmi_startoff1`` and ``xmi_startoff2`` | 
|  | by the number of blocks computed in the previous step, and decrease | 
|  | ``xmi_blockcount`` by the same quantity. | 
|  | This advances the cursor. | 
|  |  | 
|  | k. Log a new mapping exchange intent log item reflecting the advanced state | 
|  | of the work item. | 
|  |  | 
|  | l. Return the proper error code (EAGAIN) to the deferred operation manager | 
|  | to inform it that there is more work to be done. | 
|  | The operation manager completes the deferred work in steps 3b-3e before | 
|  | moving back to the start of step 3. | 
|  |  | 
|  | 4. Perform any post-processing. | 
|  | This will be discussed in more detail in subsequent sections. | 
|  |  | 
|  | If the filesystem goes down in the middle of an operation, log recovery will | 
|  | find the most recent unfinished maping exchange log intent item and restart | 
|  | from there. | 
|  | This is how atomic file mapping exchanges guarantees that an outside observer | 
|  | will either see the old broken structure or the new one, and never a mismash of | 
|  | both. | 
|  |  | 
|  | Preparation for File Content Exchanges | 
|  | `````````````````````````````````````` | 
|  |  | 
|  | There are a few things that need to be taken care of before initiating an | 
|  | atomic file mapping exchange operation. | 
|  | First, regular files require the page cache to be flushed to disk before the | 
|  | operation begins, and directio writes to be quiesced. | 
|  | Like any filesystem operation, file mapping exchanges must determine the | 
|  | maximum amount of disk space and quota that can be consumed on behalf of both | 
|  | files in the operation, and reserve that quantity of resources to avoid an | 
|  | unrecoverable out of space failure once it starts dirtying metadata. | 
|  | The preparation step scans the ranges of both files to estimate: | 
|  |  | 
|  | - Data device blocks needed to handle the repeated updates to the fork | 
|  | mappings. | 
|  | - Change in data and realtime block counts for both files. | 
|  | - Increase in quota usage for both files, if the two files do not share the | 
|  | same set of quota ids. | 
|  | - The number of extent mappings that will be added to each file. | 
|  | - Whether or not there are partially written realtime extents. | 
|  | User programs must never be able to access a realtime file extent that maps | 
|  | to different extents on the realtime volume, which could happen if the | 
|  | operation fails to run to completion. | 
|  |  | 
|  | The need for precise estimation increases the run time of the exchange | 
|  | operation, but it is very important to maintain correct accounting. | 
|  | The filesystem must not run completely out of free space, nor can the mapping | 
|  | exchange ever add more extent mappings to a fork than it can support. | 
|  | Regular users are required to abide the quota limits, though metadata repairs | 
|  | may exceed quota to resolve inconsistent metadata elsewhere. | 
|  |  | 
|  | Special Features for Exchanging Metadata File Contents | 
|  | `````````````````````````````````````````````````````` | 
|  |  | 
|  | Extended attributes, symbolic links, and directories can set the fork format to | 
|  | "local" and treat the fork as a literal area for data storage. | 
|  | Metadata repairs must take extra steps to support these cases: | 
|  |  | 
|  | - If both forks are in local format and the fork areas are large enough, the | 
|  | exchange is performed by copying the incore fork contents, logging both | 
|  | forks, and committing. | 
|  | The atomic file mapping exchange mechanism is not necessary, since this can | 
|  | be done with a single transaction. | 
|  |  | 
|  | - If both forks map blocks, then the regular atomic file mapping exchange is | 
|  | used. | 
|  |  | 
|  | - Otherwise, only one fork is in local format. | 
|  | The contents of the local format fork are converted to a block to perform the | 
|  | exchange. | 
|  | The conversion to block format must be done in the same transaction that | 
|  | logs the initial mapping exchange intent log item. | 
|  | The regular atomic mapping exchange is used to exchange the metadata file | 
|  | mappings. | 
|  | Special flags are set on the exchange operation so that the transaction can | 
|  | be rolled one more time to convert the second file's fork back to local | 
|  | format so that the second file will be ready to go as soon as the ILOCK is | 
|  | dropped. | 
|  |  | 
|  | Extended attributes and directories stamp the owning inode into every block, | 
|  | but the buffer verifiers do not actually check the inode number! | 
|  | Although there is no verification, it is still important to maintain | 
|  | referential integrity, so prior to performing the mapping exchange, online | 
|  | repair builds every block in the new data structure with the owner field of the | 
|  | file being repaired. | 
|  |  | 
|  | After a successful exchange operation, the repair operation must reap the old | 
|  | fork blocks by processing each fork mapping through the standard :ref:`file | 
|  | extent reaping <reaping>` mechanism that is done post-repair. | 
|  | If the filesystem should go down during the reap part of the repair, the | 
|  | iunlink processing at the end of recovery will free both the temporary file and | 
|  | whatever blocks were not reaped. | 
|  | However, this iunlink processing omits the cross-link detection of online | 
|  | repair, and is not completely foolproof. | 
|  |  | 
|  | Exchanging Temporary File Contents | 
|  | `````````````````````````````````` | 
|  |  | 
|  | To repair a metadata file, online repair proceeds as follows: | 
|  |  | 
|  | 1. Create a temporary repair file. | 
|  |  | 
|  | 2. Use the staging data to write out new contents into the temporary repair | 
|  | file. | 
|  | The same fork must be written to as is being repaired. | 
|  |  | 
|  | 3. Commit the scrub transaction, since the exchange resource estimation step | 
|  | must be completed before transaction reservations are made. | 
|  |  | 
|  | 4. Call ``xrep_tempexch_trans_alloc`` to allocate a new scrub transaction with | 
|  | the appropriate resource reservations, locks, and fill out a ``struct | 
|  | xfs_exchmaps_req`` with the details of the exchange operation. | 
|  |  | 
|  | 5. Call ``xrep_tempexch_contents`` to exchange the contents. | 
|  |  | 
|  | 6. Commit the transaction to complete the repair. | 
|  |  | 
|  | .. _rtsummary: | 
|  |  | 
|  | Case Study: Repairing the Realtime Summary File | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | In the "realtime" section of an XFS filesystem, free space is tracked via a | 
|  | bitmap, similar to Unix FFS. | 
|  | Each bit in the bitmap represents one realtime extent, which is a multiple of | 
|  | the filesystem block size between 4KiB and 1GiB in size. | 
|  | The realtime summary file indexes the number of free extents of a given size to | 
|  | the offset of the block within the realtime free space bitmap where those free | 
|  | extents begin. | 
|  | In other words, the summary file helps the allocator find free extents by | 
|  | length, similar to what the free space by count (cntbt) btree does for the data | 
|  | section. | 
|  |  | 
|  | The summary file itself is a flat file (with no block headers or checksums!) | 
|  | partitioned into ``log2(total rt extents)`` sections containing enough 32-bit | 
|  | counters to match the number of blocks in the rt bitmap. | 
|  | Each counter records the number of free extents that start in that bitmap block | 
|  | and can satisfy a power-of-two allocation request. | 
|  |  | 
|  | To check the summary file against the bitmap: | 
|  |  | 
|  | 1. Take the ILOCK of both the realtime bitmap and summary files. | 
|  |  | 
|  | 2. For each free space extent recorded in the bitmap: | 
|  |  | 
|  | a. Compute the position in the summary file that contains a counter that | 
|  | represents this free extent. | 
|  |  | 
|  | b. Read the counter from the xfile. | 
|  |  | 
|  | c. Increment it, and write it back to the xfile. | 
|  |  | 
|  | 3. Compare the contents of the xfile against the ondisk file. | 
|  |  | 
|  | To repair the summary file, write the xfile contents into the temporary file | 
|  | and use atomic mapping exchange to commit the new contents. | 
|  | The temporary file is then reaped. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `realtime summary repair | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_ | 
|  | series. | 
|  |  | 
|  | Case Study: Salvaging Extended Attributes | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | In XFS, extended attributes are implemented as a namespaced name-value store. | 
|  | Values are limited in size to 64KiB, but there is no limit in the number of | 
|  | names. | 
|  | The attribute fork is unpartitioned, which means that the root of the attribute | 
|  | structure is always in logical block zero, but attribute leaf blocks, dabtree | 
|  | index blocks, and remote value blocks are intermixed. | 
|  | Attribute leaf blocks contain variable-sized records that associate | 
|  | user-provided names with the user-provided values. | 
|  | Values larger than a block are allocated separate extents and written there. | 
|  | If the leaf information expands beyond a single block, a directory/attribute | 
|  | btree (``dabtree``) is created to map hashes of attribute names to entries | 
|  | for fast lookup. | 
|  |  | 
|  | Salvaging extended attributes is done as follows: | 
|  |  | 
|  | 1. Walk the attr fork mappings of the file being repaired to find the attribute | 
|  | leaf blocks. | 
|  | When one is found, | 
|  |  | 
|  | a. Walk the attr leaf block to find candidate keys. | 
|  | When one is found, | 
|  |  | 
|  | 1. Check the name for problems, and ignore the name if there are. | 
|  |  | 
|  | 2. Retrieve the value. | 
|  | If that succeeds, add the name and value to the staging xfarray and | 
|  | xfblob. | 
|  |  | 
|  | 2. If the memory usage of the xfarray and xfblob exceed a certain amount of | 
|  | memory or there are no more attr fork blocks to examine, unlock the file and | 
|  | add the staged extended attributes to the temporary file. | 
|  |  | 
|  | 3. Use atomic file mapping exchange to exchange the new and old extended | 
|  | attribute structures. | 
|  | The old attribute blocks are now attached to the temporary file. | 
|  |  | 
|  | 4. Reap the temporary file. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `extended attribute repair | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ | 
|  | series. | 
|  |  | 
|  | Fixing Directories | 
|  | ------------------ | 
|  |  | 
|  | Fixing directories is difficult with currently available filesystem features, | 
|  | since directory entries are not redundant. | 
|  | The offline repair tool scans all inodes to find files with nonzero link count, | 
|  | and then it scans all directories to establish parentage of those linked files. | 
|  | Damaged files and directories are zapped, and files with no parent are | 
|  | moved to the ``/lost+found`` directory. | 
|  | It does not try to salvage anything. | 
|  |  | 
|  | The best that online repair can do at this time is to read directory data | 
|  | blocks and salvage any dirents that look plausible, correct link counts, and | 
|  | move orphans back into the directory tree. | 
|  | The salvage process is discussed in the case study at the end of this section. | 
|  | The :ref:`file link count fsck <nlinks>` code takes care of fixing link counts | 
|  | and moving orphans to the ``/lost+found`` directory. | 
|  |  | 
|  | Case Study: Salvaging Directories | 
|  | ````````````````````````````````` | 
|  |  | 
|  | Unlike extended attributes, directory blocks are all the same size, so | 
|  | salvaging directories is straightforward: | 
|  |  | 
|  | 1. Find the parent of the directory. | 
|  | If the dotdot entry is not unreadable, try to confirm that the alleged | 
|  | parent has a child entry pointing back to the directory being repaired. | 
|  | Otherwise, walk the filesystem to find it. | 
|  |  | 
|  | 2. Walk the first partition of data fork of the directory to find the directory | 
|  | entry data blocks. | 
|  | When one is found, | 
|  |  | 
|  | a. Walk the directory data block to find candidate entries. | 
|  | When an entry is found: | 
|  |  | 
|  | i. Check the name for problems, and ignore the name if there are. | 
|  |  | 
|  | ii. Retrieve the inumber and grab the inode. | 
|  | If that succeeds, add the name, inode number, and file type to the | 
|  | staging xfarray and xblob. | 
|  |  | 
|  | 3. If the memory usage of the xfarray and xfblob exceed a certain amount of | 
|  | memory or there are no more directory data blocks to examine, unlock the | 
|  | directory and add the staged dirents into the temporary directory. | 
|  | Truncate the staging files. | 
|  |  | 
|  | 4. Use atomic file mapping exchange to exchange the new and old directory | 
|  | structures. | 
|  | The old directory blocks are now attached to the temporary file. | 
|  |  | 
|  | 5. Reap the temporary file. | 
|  |  | 
|  | **Future Work Question**: Should repair revalidate the dentry cache when | 
|  | rebuilding a directory? | 
|  |  | 
|  | *Answer*: Yes, it should. | 
|  |  | 
|  | In theory it is necessary to scan all dentry cache entries for a directory to | 
|  | ensure that one of the following apply: | 
|  |  | 
|  | 1. The cached dentry reflects an ondisk dirent in the new directory. | 
|  |  | 
|  | 2. The cached dentry no longer has a corresponding ondisk dirent in the new | 
|  | directory and the dentry can be purged from the cache. | 
|  |  | 
|  | 3. The cached dentry no longer has an ondisk dirent but the dentry cannot be | 
|  | purged. | 
|  | This is the problem case. | 
|  |  | 
|  | Unfortunately, the current dentry cache design doesn't provide a means to walk | 
|  | every child dentry of a specific directory, which makes this a hard problem. | 
|  | There is no known solution. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `directory repair | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_ | 
|  | series. | 
|  |  | 
|  | Parent Pointers | 
|  | ``````````````` | 
|  |  | 
|  | A parent pointer is a piece of file metadata that enables a user to locate the | 
|  | file's parent directory without having to traverse the directory tree from the | 
|  | root. | 
|  | Without them, reconstruction of directory trees is hindered in much the same | 
|  | way that the historic lack of reverse space mapping information once hindered | 
|  | reconstruction of filesystem space metadata. | 
|  | The parent pointer feature, however, makes total directory reconstruction | 
|  | possible. | 
|  |  | 
|  | XFS parent pointers contain the information needed to identify the | 
|  | corresponding directory entry in the parent directory. | 
|  | In other words, child files use extended attributes to store pointers to | 
|  | parents in the form ``(dirent_name) → (parent_inum, parent_gen)``. | 
|  | The directory checking process can be strengthened to ensure that the target of | 
|  | each dirent also contains a parent pointer pointing back to the dirent. | 
|  | Likewise, each parent pointer can be checked by ensuring that the target of | 
|  | each parent pointer is a directory and that it contains a dirent matching | 
|  | the parent pointer. | 
|  | Both online and offline repair can use this strategy. | 
|  |  | 
|  | +--------------------------------------------------------------------------+ | 
|  | | **Historical Sidebar**:                                                  | | 
|  | +--------------------------------------------------------------------------+ | 
|  | | Directory parent pointers were first proposed as an XFS feature more     | | 
|  | | than a decade ago by SGI.                                                | | 
|  | | Each link from a parent directory to a child file is mirrored with an    | | 
|  | | extended attribute in the child that could be used to identify the       | | 
|  | | parent directory.                                                        | | 
|  | | Unfortunately, this early implementation had major shortcomings and was  | | 
|  | | never merged into Linux XFS:                                             | | 
|  | |                                                                          | | 
|  | | 1. The XFS codebase of the late 2000s did not have the infrastructure to | | 
|  | |    enforce strong referential integrity in the directory tree.           | | 
|  | |    It did not guarantee that a change in a forward link would always be  | | 
|  | |    followed up with the corresponding change to the reverse links.       | | 
|  | |                                                                          | | 
|  | | 2. Referential integrity was not integrated into offline repair.         | | 
|  | |    Checking and repairs were performed on mounted filesystems without    | | 
|  | |    taking any kernel or inode locks to coordinate access.                | | 
|  | |    It is not clear how this actually worked properly.                    | | 
|  | |                                                                          | | 
|  | | 3. The extended attribute did not record the name of the directory entry | | 
|  | |    in the parent, so the SGI parent pointer implementation cannot be     | | 
|  | |    used to reconnect the directory tree.                                 | | 
|  | |                                                                          | | 
|  | | 4. Extended attribute forks only support 65,536 extents, which means     | | 
|  | |    that parent pointer attribute creation is likely to fail at some      | | 
|  | |    point before the maximum file link count is achieved.                 | | 
|  | |                                                                          | | 
|  | | The original parent pointer design was too unstable for something like   | | 
|  | | a file system repair to depend on.                                       | | 
|  | | Allison Henderson, Chandan Babu, and Catherine Hoang are working on a    | | 
|  | | second implementation that solves all shortcomings of the first.         | | 
|  | | During 2022, Allison introduced log intent items to track physical       | | 
|  | | manipulations of the extended attribute structures.                      | | 
|  | | This solves the referential integrity problem by making it possible to   | | 
|  | | commit a dirent update and a parent pointer update in the same           | | 
|  | | transaction.                                                             | | 
|  | | Chandan increased the maximum extent counts of both data and attribute   | | 
|  | | forks, thereby ensuring that the extended attribute structure can grow   | | 
|  | | to handle the maximum hardlink count of any file.                        | | 
|  | |                                                                          | | 
|  | | For this second effort, the ondisk parent pointer format as originally   | | 
|  | | proposed was ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``.  | | 
|  | | The format was changed during development to eliminate the requirement   | | 
|  | | of repair tools needing to ensure that the ``dirent_pos`` field always   | | 
|  | | matched when reconstructing a directory.                                 | | 
|  | |                                                                          | | 
|  | | There were a few other ways to have solved that problem:                 | | 
|  | |                                                                          | | 
|  | | 1. The field could be designated advisory, since the other three values  | | 
|  | |    are sufficient to find the entry in the parent.                       | | 
|  | |    However, this makes indexed key lookup impossible while repairs are   | | 
|  | |    ongoing.                                                              | | 
|  | |                                                                          | | 
|  | | 2. We could allow creating directory entries at specified offsets, which | | 
|  | |    solves the referential integrity problem but runs the risk that       | | 
|  | |    dirent creation will fail due to conflicts with the free space in the | | 
|  | |    directory.                                                            | | 
|  | |                                                                          | | 
|  | |    These conflicts could be resolved by appending the directory entry    | | 
|  | |    and amending the xattr code to support updating an xattr key and      | | 
|  | |    reindexing the dabtree, though this would have to be performed with   | | 
|  | |    the parent directory still locked.                                    | | 
|  | |                                                                          | | 
|  | | 3. Same as above, but remove the old parent pointer entry and add a new  | | 
|  | |    one atomically.                                                       | | 
|  | |                                                                          | | 
|  | | 4. Change the ondisk xattr format to                                     | | 
|  | |    ``(parent_inum, name) → (parent_gen)``, which would provide the attr  | | 
|  | |    name uniqueness that we require, without forcing repair code to       | | 
|  | |    update the dirent position.                                           | | 
|  | |    Unfortunately, this requires changes to the xattr code to support     | | 
|  | |    attr names as long as 263 bytes.                                      | | 
|  | |                                                                          | | 
|  | | 5. Change the ondisk xattr format to ``(parent_inum, hash(name)) →       | | 
|  | |    (name, parent_gen)``.                                                 | | 
|  | |    If the hash is sufficiently resistant to collisions (e.g. sha256)     | | 
|  | |    then this should provide the attr name uniqueness that we require.    | | 
|  | |    Names shorter than 247 bytes could be stored directly.                | | 
|  | |                                                                          | | 
|  | | 6. Change the ondisk xattr format to ``(dirent_name) → (parent_ino,      | | 
|  | |    parent_gen)``.  This format doesn't require any of the complicated    | | 
|  | |    nested name hashing of the previous suggestions.  However, it was     | | 
|  | |    discovered that multiple hardlinks to the same inode with the same    | | 
|  | |    filename caused performance problems with hashed xattr lookups, so    | | 
|  | |    the parent inumber is now xor'd into the hash index.                  | | 
|  | |                                                                          | | 
|  | | In the end, it was decided that solution #6 was the most compact and the | | 
|  | | most performant.  A new hash function was designed for parent pointers.  | | 
|  | +--------------------------------------------------------------------------+ | 
|  |  | 
|  |  | 
|  | Case Study: Repairing Directories with Parent Pointers | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and | 
|  | a :ref:`directory entry live update hook <liveupdate>` as follows: | 
|  |  | 
|  | 1. Set up a temporary directory for generating the new directory structure, | 
|  | an xfblob for storing entry names, and an xfarray for stashing the fixed | 
|  | size fields involved in a directory update: ``(child inumber, add vs. | 
|  | remove, name cookie, ftype)``. | 
|  |  | 
|  | 2. Set up an inode scanner and hook into the directory entry code to receive | 
|  | updates on directory operations. | 
|  |  | 
|  | 3. For each parent pointer found in each file scanned, decide if the parent | 
|  | pointer references the directory of interest. | 
|  | If so: | 
|  |  | 
|  | a. Stash the parent pointer name and an addname entry for this dirent in the | 
|  | xfblob and xfarray, respectively. | 
|  |  | 
|  | b. When finished scanning that file or the kernel memory consumption exceeds | 
|  | a threshold, flush the stashed updates to the temporary directory. | 
|  |  | 
|  | 4. For each live directory update received via the hook, decide if the child | 
|  | has already been scanned. | 
|  | If so: | 
|  |  | 
|  | a. Stash the parent pointer name an addname or removename entry for this | 
|  | dirent update in the xfblob and xfarray for later. | 
|  | We cannot write directly to the temporary directory because hook | 
|  | functions are not allowed to modify filesystem metadata. | 
|  | Instead, we stash updates in the xfarray and rely on the scanner thread | 
|  | to apply the stashed updates to the temporary directory. | 
|  |  | 
|  | 5. When the scan is complete, replay any stashed entries in the xfarray. | 
|  |  | 
|  | 6. When the scan is complete, atomically exchange the contents of the temporary | 
|  | directory and the directory being repaired. | 
|  | The temporary directory now contains the damaged directory structure. | 
|  |  | 
|  | 7. Reap the temporary directory. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `parent pointers directory repair | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_ | 
|  | series. | 
|  |  | 
|  | Case Study: Repairing Parent Pointers | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | Online reconstruction of a file's parent pointer information works similarly to | 
|  | directory reconstruction: | 
|  |  | 
|  | 1. Set up a temporary file for generating a new extended attribute structure, | 
|  | an xfblob for storing parent pointer names, and an xfarray for stashing the | 
|  | fixed size fields involved in a parent pointer update: ``(parent inumber, | 
|  | parent generation, add vs. remove, name cookie)``. | 
|  |  | 
|  | 2. Set up an inode scanner and hook into the directory entry code to receive | 
|  | updates on directory operations. | 
|  |  | 
|  | 3. For each directory entry found in each directory scanned, decide if the | 
|  | dirent references the file of interest. | 
|  | If so: | 
|  |  | 
|  | a. Stash the dirent name and an addpptr entry for this parent pointer in the | 
|  | xfblob and xfarray, respectively. | 
|  |  | 
|  | b. When finished scanning the directory or the kernel memory consumption | 
|  | exceeds a threshold, flush the stashed updates to the temporary file. | 
|  |  | 
|  | 4. For each live directory update received via the hook, decide if the parent | 
|  | has already been scanned. | 
|  | If so: | 
|  |  | 
|  | a. Stash the dirent name and an addpptr or removepptr entry for this dirent | 
|  | update in the xfblob and xfarray for later. | 
|  | We cannot write parent pointers directly to the temporary file because | 
|  | hook functions are not allowed to modify filesystem metadata. | 
|  | Instead, we stash updates in the xfarray and rely on the scanner thread | 
|  | to apply the stashed parent pointer updates to the temporary file. | 
|  |  | 
|  | 5. When the scan is complete, replay any stashed entries in the xfarray. | 
|  |  | 
|  | 6. Copy all non-parent pointer extended attributes to the temporary file. | 
|  |  | 
|  | 7. When the scan is complete, atomically exchange the mappings of the attribute | 
|  | forks of the temporary file and the file being repaired. | 
|  | The temporary file now contains the damaged extended attribute structure. | 
|  |  | 
|  | 8. Reap the temporary file. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `parent pointers repair | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_ | 
|  | series. | 
|  |  | 
|  | Digression: Offline Checking of Parent Pointers | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | Examining parent pointers in offline repair works differently because corrupt | 
|  | files are erased long before directory tree connectivity checks are performed. | 
|  | Parent pointer checks are therefore a second pass to be added to the existing | 
|  | connectivity checks: | 
|  |  | 
|  | 1. After the set of surviving files has been established (phase 6), | 
|  | walk the surviving directories of each AG in the filesystem. | 
|  | This is already performed as part of the connectivity checks. | 
|  |  | 
|  | 2. For each directory entry found, | 
|  |  | 
|  | a. If the name has already been stored in the xfblob, then use that cookie | 
|  | and skip the next step. | 
|  |  | 
|  | b. Otherwise, record the name in an xfblob, and remember the xfblob cookie. | 
|  | Unique mappings are critical for | 
|  |  | 
|  | 1. Deduplicating names to reduce memory usage, and | 
|  |  | 
|  | 2. Creating a stable sort key for the parent pointer indexes so that the | 
|  | parent pointer validation described below will work. | 
|  |  | 
|  | c. Store ``(child_ag_inum, parent_inum, parent_gen, name_hash, name_len, | 
|  | name_cookie)`` tuples in a per-AG in-memory slab.  The ``name_hash`` | 
|  | referenced in this section is the regular directory entry name hash, not | 
|  | the specialized one used for parent pointer xattrs. | 
|  |  | 
|  | 3. For each AG in the filesystem, | 
|  |  | 
|  | a. Sort the per-AG tuple set in order of ``child_ag_inum``, ``parent_inum``, | 
|  | ``name_hash``, and ``name_cookie``. | 
|  | Having a single ``name_cookie`` for each ``name`` is critical for | 
|  | handling the uncommon case of a directory containing multiple hardlinks | 
|  | to the same file where all the names hash to the same value. | 
|  |  | 
|  | b. For each inode in the AG, | 
|  |  | 
|  | 1. Scan the inode for parent pointers. | 
|  | For each parent pointer found, | 
|  |  | 
|  | a. Validate the ondisk parent pointer. | 
|  | If validation fails, move on to the next parent pointer in the | 
|  | file. | 
|  |  | 
|  | b. If the name has already been stored in the xfblob, then use that | 
|  | cookie and skip the next step. | 
|  |  | 
|  | c. Record the name in a per-file xfblob, and remember the xfblob | 
|  | cookie. | 
|  |  | 
|  | d. Store ``(parent_inum, parent_gen, name_hash, name_len, | 
|  | name_cookie)`` tuples in a per-file slab. | 
|  |  | 
|  | 2. Sort the per-file tuples in order of ``parent_inum``, ``name_hash``, | 
|  | and ``name_cookie``. | 
|  |  | 
|  | 3. Position one slab cursor at the start of the inode's records in the | 
|  | per-AG tuple slab. | 
|  | This should be trivial since the per-AG tuples are in child inumber | 
|  | order. | 
|  |  | 
|  | 4. Position a second slab cursor at the start of the per-file tuple slab. | 
|  |  | 
|  | 5. Iterate the two cursors in lockstep, comparing the ``parent_ino``, | 
|  | ``name_hash``, and ``name_cookie`` fields of the records under each | 
|  | cursor: | 
|  |  | 
|  | a. If the per-AG cursor is at a lower point in the keyspace than the | 
|  | per-file cursor, then the per-AG cursor points to a missing parent | 
|  | pointer. | 
|  | Add the parent pointer to the inode and advance the per-AG | 
|  | cursor. | 
|  |  | 
|  | b. If the per-file cursor is at a lower point in the keyspace than | 
|  | the per-AG cursor, then the per-file cursor points to a dangling | 
|  | parent pointer. | 
|  | Remove the parent pointer from the inode and advance the per-file | 
|  | cursor. | 
|  |  | 
|  | c. Otherwise, both cursors point at the same parent pointer. | 
|  | Update the parent_gen component if necessary. | 
|  | Advance both cursors. | 
|  |  | 
|  | 4. Move on to examining link counts, as we do today. | 
|  |  | 
|  | The proposed patchset is the | 
|  | `offline parent pointers repair | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-fsck>`_ | 
|  | series. | 
|  |  | 
|  | Rebuilding directories from parent pointers in offline repair would be very | 
|  | challenging because xfs_repair currently uses two single-pass scans of the | 
|  | filesystem during phases 3 and 4 to decide which files are corrupt enough to be | 
|  | zapped. | 
|  | This scan would have to be converted into a multi-pass scan: | 
|  |  | 
|  | 1. The first pass of the scan zaps corrupt inodes, forks, and attributes | 
|  | much as it does now. | 
|  | Corrupt directories are noted but not zapped. | 
|  |  | 
|  | 2. The next pass records parent pointers pointing to the directories noted | 
|  | as being corrupt in the first pass. | 
|  | This second pass may have to happen after the phase 4 scan for duplicate | 
|  | blocks, if phase 4 is also capable of zapping directories. | 
|  |  | 
|  | 3. The third pass resets corrupt directories to an empty shortform directory. | 
|  | Free space metadata has not been ensured yet, so repair cannot yet use the | 
|  | directory building code in libxfs. | 
|  |  | 
|  | 4. At the start of phase 6, space metadata have been rebuilt. | 
|  | Use the parent pointer information recorded during step 2 to reconstruct | 
|  | the dirents and add them to the now-empty directories. | 
|  |  | 
|  | This code has not yet been constructed. | 
|  |  | 
|  | .. _dirtree: | 
|  |  | 
|  | Case Study: Directory Tree Structure | 
|  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | 
|  |  | 
|  | As mentioned earlier, the filesystem directory tree is supposed to be a | 
|  | directed acylic graph structure. | 
|  | However, each node in this graph is a separate ``xfs_inode`` object with its | 
|  | own locks, which makes validating the tree qualities difficult. | 
|  | Fortunately, non-directories are allowed to have multiple parents and cannot | 
|  | have children, so only directories need to be scanned. | 
|  | Directories typically constitute 5-10% of the files in a filesystem, which | 
|  | reduces the amount of work dramatically. | 
|  |  | 
|  | If the directory tree could be frozen, it would be easy to discover cycles and | 
|  | disconnected regions by running a depth (or breadth) first search downwards | 
|  | from the root directory and marking a bitmap for each directory found. | 
|  | At any point in the walk, trying to set an already set bit means there is a | 
|  | cycle. | 
|  | After the scan completes, XORing the marked inode bitmap with the inode | 
|  | allocation bitmap reveals disconnected inodes. | 
|  | However, one of online repair's design goals is to avoid locking the entire | 
|  | filesystem unless it's absolutely necessary. | 
|  | Directory tree updates can move subtrees across the scanner wavefront on a live | 
|  | filesystem, so the bitmap algorithm cannot be applied. | 
|  |  | 
|  | Directory parent pointers enable an incremental approach to validation of the | 
|  | tree structure. | 
|  | Instead of using one thread to scan the entire filesystem, multiple threads can | 
|  | walk from individual subdirectories upwards towards the root. | 
|  | For this to work, all directory entries and parent pointers must be internally | 
|  | consistent, each directory entry must have a parent pointer, and the link | 
|  | counts of all directories must be correct. | 
|  | Each scanner thread must be able to take the IOLOCK of an alleged parent | 
|  | directory while holding the IOLOCK of the child directory to prevent either | 
|  | directory from being moved within the tree. | 
|  | This is not possible since the VFS does not take the IOLOCK of a child | 
|  | subdirectory when moving that subdirectory, so instead the scanner stabilizes | 
|  | the parent -> child relationship by taking the ILOCKs and installing a dirent | 
|  | update hook to detect changes. | 
|  |  | 
|  | The scanning process uses a dirent hook to detect changes to the directories | 
|  | mentioned in the scan data. | 
|  | The scan works as follows: | 
|  |  | 
|  | 1. For each subdirectory in the filesystem, | 
|  |  | 
|  | a. For each parent pointer of that subdirectory, | 
|  |  | 
|  | 1. Create a path object for that parent pointer, and mark the | 
|  | subdirectory inode number in the path object's bitmap. | 
|  |  | 
|  | 2. Record the parent pointer name and inode number in a path structure. | 
|  |  | 
|  | 3. If the alleged parent is the subdirectory being scrubbed, the path is | 
|  | a cycle. | 
|  | Mark the path for deletion and repeat step 1a with the next | 
|  | subdirectory parent pointer. | 
|  |  | 
|  | 4. Try to mark the alleged parent inode number in a bitmap in the path | 
|  | object. | 
|  | If the bit is already set, then there is a cycle in the directory | 
|  | tree. | 
|  | Mark the path as a cycle and repeat step 1a with the next subdirectory | 
|  | parent pointer. | 
|  |  | 
|  | 5. Load the alleged parent. | 
|  | If the alleged parent is not a linked directory, abort the scan | 
|  | because the parent pointer information is inconsistent. | 
|  |  | 
|  | 6. For each parent pointer of this alleged ancestor directory, | 
|  |  | 
|  | a. Record the parent pointer name and inode number in the path object | 
|  | if no parent has been set for that level. | 
|  |  | 
|  | b. If an ancestor has more than one parent, mark the path as corrupt. | 
|  | Repeat step 1a with the next subdirectory parent pointer. | 
|  |  | 
|  | c. Repeat steps 1a3-1a6 for the ancestor identified in step 1a6a. | 
|  | This repeats until the directory tree root is reached or no parents | 
|  | are found. | 
|  |  | 
|  | 7. If the walk terminates at the root directory, mark the path as ok. | 
|  |  | 
|  | 8. If the walk terminates without reaching the root, mark the path as | 
|  | disconnected. | 
|  |  | 
|  | 2. If the directory entry update hook triggers, check all paths already found | 
|  | by the scan. | 
|  | If the entry matches part of a path, mark that path and the scan stale. | 
|  | When the scanner thread sees that the scan has been marked stale, it deletes | 
|  | all scan data and starts over. | 
|  |  | 
|  | Repairing the directory tree works as follows: | 
|  |  | 
|  | 1. Walk each path of the target subdirectory. | 
|  |  | 
|  | a. Corrupt paths and cycle paths are counted as suspect. | 
|  |  | 
|  | b. Paths already marked for deletion are counted as bad. | 
|  |  | 
|  | c. Paths that reached the root are counted as good. | 
|  |  | 
|  | 2. If the subdirectory is either the root directory or has zero link count, | 
|  | delete all incoming directory entries in the immediate parents. | 
|  | Repairs are complete. | 
|  |  | 
|  | 3. If the subdirectory has exactly one path, set the dotdot entry to the | 
|  | parent and exit. | 
|  |  | 
|  | 4. If the subdirectory has at least one good path, delete all the other | 
|  | incoming directory entries in the immediate parents. | 
|  |  | 
|  | 5. If the subdirectory has no good paths and more than one suspect path, delete | 
|  | all the other incoming directory entries in the immediate parents. | 
|  |  | 
|  | 6. If the subdirectory has zero paths, attach it to the lost and found. | 
|  |  | 
|  | The proposed patches are in the | 
|  | `directory tree repair | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-directory-tree>`_ | 
|  | series. | 
|  |  | 
|  |  | 
|  | .. _orphanage: | 
|  |  | 
|  | The Orphanage | 
|  | ------------- | 
|  |  | 
|  | Filesystems present files as a directed, and hopefully acyclic, graph. | 
|  | In other words, a tree. | 
|  | The root of the filesystem is a directory, and each entry in a directory points | 
|  | downwards either to more subdirectories or to non-directory files. | 
|  | Unfortunately, a disruption in the directory graph pointers result in a | 
|  | disconnected graph, which makes files impossible to access via regular path | 
|  | resolution. | 
|  |  | 
|  | Without parent pointers, the directory parent pointer online scrub code can | 
|  | detect a dotdot entry pointing to a parent directory that doesn't have a link | 
|  | back to the child directory and the file link count checker can detect a file | 
|  | that isn't pointed to by any directory in the filesystem. | 
|  | If such a file has a positive link count, the file is an orphan. | 
|  |  | 
|  | With parent pointers, directories can be rebuilt by scanning parent pointers | 
|  | and parent pointers can be rebuilt by scanning directories. | 
|  | This should reduce the incidence of files ending up in ``/lost+found``. | 
|  |  | 
|  | When orphans are found, they should be reconnected to the directory tree. | 
|  | Offline fsck solves the problem by creating a directory ``/lost+found`` to | 
|  | serve as an orphanage, and linking orphan files into the orphanage by using the | 
|  | inumber as the name. | 
|  | Reparenting a file to the orphanage does not reset any of its permissions or | 
|  | ACLs. | 
|  |  | 
|  | This process is more involved in the kernel than it is in userspace. | 
|  | The directory and file link count repair setup functions must use the regular | 
|  | VFS mechanisms to create the orphanage directory with all the necessary | 
|  | security attributes and dentry cache entries, just like a regular directory | 
|  | tree modification. | 
|  |  | 
|  | Orphaned files are adopted by the orphanage as follows: | 
|  |  | 
|  | 1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function | 
|  | to try to ensure that the lost and found directory actually exists. | 
|  | This also attaches the orphanage directory to the scrub context. | 
|  |  | 
|  | 2. If the decision is made to reconnect a file, take the IOLOCK of both the | 
|  | orphanage and the file being reattached. | 
|  | The ``xrep_orphanage_iolock_two`` function follows the inode locking | 
|  | strategy discussed earlier. | 
|  |  | 
|  | 3. Use ``xrep_adoption_trans_alloc`` to reserve resources to the repair | 
|  | transaction. | 
|  |  | 
|  | 4. Call ``xrep_orphanage_compute_name`` to compute the new name in the | 
|  | orphanage. | 
|  |  | 
|  | 5. If the adoption is going to happen, call ``xrep_adoption_reparent`` to | 
|  | reparent the orphaned file into the lost and found and invalidate the dentry | 
|  | cache. | 
|  |  | 
|  | 6. Call ``xrep_adoption_finish`` to commit any filesystem updates, release the | 
|  | orphanage ILOCK, and clean the scrub transaction.  Call | 
|  | ``xrep_adoption_commit`` to commit the updates and the scrub transaction. | 
|  |  | 
|  | 7. If a runtime error happens, call ``xrep_adoption_cancel`` to release all | 
|  | resources. | 
|  |  | 
|  | The proposed patches are in the | 
|  | `orphanage adoption | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_ | 
|  | series. | 
|  |  | 
|  | 6. Userspace Algorithms and Data Structures | 
|  | =========================================== | 
|  |  | 
|  | This section discusses the key algorithms and data structures of the userspace | 
|  | program, ``xfs_scrub``, that provide the ability to drive metadata checks and | 
|  | repairs in the kernel, verify file data, and look for other potential problems. | 
|  |  | 
|  | .. _scrubcheck: | 
|  |  | 
|  | Checking Metadata | 
|  | ----------------- | 
|  |  | 
|  | Recall the :ref:`phases of fsck work<scrubphases>` outlined earlier. | 
|  | That structure follows naturally from the data dependencies designed into the | 
|  | filesystem from its beginnings in 1993. | 
|  | In XFS, there are several groups of metadata dependencies: | 
|  |  | 
|  | a. Filesystem summary counts depend on consistency within the inode indices, | 
|  | the allocation group space btrees, and the realtime volume space | 
|  | information. | 
|  |  | 
|  | b. Quota resource counts depend on consistency within the quota file data | 
|  | forks, inode indices, inode records, and the forks of every file on the | 
|  | system. | 
|  |  | 
|  | c. The naming hierarchy depends on consistency within the directory and | 
|  | extended attribute structures. | 
|  | This includes file link counts. | 
|  |  | 
|  | d. Directories, extended attributes, and file data depend on consistency within | 
|  | the file forks that map directory and extended attribute data to physical | 
|  | storage media. | 
|  |  | 
|  | e. The file forks depends on consistency within inode records and the space | 
|  | metadata indices of the allocation groups and the realtime volume. | 
|  | This includes quota and realtime metadata files. | 
|  |  | 
|  | f. Inode records depends on consistency within the inode metadata indices. | 
|  |  | 
|  | g. Realtime space metadata depend on the inode records and data forks of the | 
|  | realtime metadata inodes. | 
|  |  | 
|  | h. The allocation group metadata indices (free space, inodes, reference count, | 
|  | and reverse mapping btrees) depend on consistency within the AG headers and | 
|  | between all the AG metadata btrees. | 
|  |  | 
|  | i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support | 
|  | for online fsck functionality. | 
|  |  | 
|  | Therefore, a metadata dependency graph is a convenient way to schedule checking | 
|  | operations in the ``xfs_scrub`` program: | 
|  |  | 
|  | - Phase 1 checks that the provided path maps to an XFS filesystem and detect | 
|  | the kernel's scrubbing abilities, which validates group (i). | 
|  |  | 
|  | - Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue. | 
|  |  | 
|  | - Phase 3 scans inodes in parallel. | 
|  | For each inode, groups (f), (e), and (d) are checked, in that order. | 
|  |  | 
|  | - Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6 | 
|  | may run reliably. | 
|  |  | 
|  | - Phase 5 starts by checking groups (b) and (c) in parallel before moving on | 
|  | to checking names. | 
|  |  | 
|  | - Phase 6 depends on groups (i) through (b) to find file data blocks to verify, | 
|  | to read them, and to report which blocks of which files are affected. | 
|  |  | 
|  | - Phase 7 checks group (a), having validated everything else. | 
|  |  | 
|  | Notice that the data dependencies between groups are enforced by the structure | 
|  | of the program flow. | 
|  |  | 
|  | Parallel Inode Scans | 
|  | -------------------- | 
|  |  | 
|  | An XFS filesystem can easily contain hundreds of millions of inodes. | 
|  | Given that XFS targets installations with large high-performance storage, | 
|  | it is desirable to scrub inodes in parallel to minimize runtime, particularly | 
|  | if the program has been invoked manually from a command line. | 
|  | This requires careful scheduling to keep the threads as evenly loaded as | 
|  | possible. | 
|  |  | 
|  | Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single | 
|  | workqueue and scheduled a single workqueue item per AG. | 
|  | Each workqueue item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find | 
|  | inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough | 
|  | information to construct file handles. | 
|  | The file handle was then passed to a function to generate scrub items for each | 
|  | metadata object of each inode. | 
|  | This simple algorithm leads to thread balancing problems in phase 3 if the | 
|  | filesystem contains one AG with a few large sparse files and the rest of the | 
|  | AGs contain many smaller files. | 
|  | The inode scan dispatch function was not sufficiently granular; it should have | 
|  | been dispatching at the level of individual inodes, or, to constrain memory | 
|  | consumption, inode btree records. | 
|  |  | 
|  | Thanks to Dave Chinner, bounded workqueues in userspace enable ``xfs_scrub`` to | 
|  | avoid this problem with ease by adding a second workqueue. | 
|  | Just like before, the first workqueue is seeded with one workqueue item per AG, | 
|  | and it uses INUMBERS to find inode btree chunks. | 
|  | The second workqueue, however, is configured with an upper bound on the number | 
|  | of items that can be waiting to be run. | 
|  | Each inode btree chunk found by the first workqueue's workers are queued to the | 
|  | second workqueue, and it is this second workqueue that queries BULKSTAT, | 
|  | creates a file handle, and passes it to a function to generate scrub items for | 
|  | each metadata object of each inode. | 
|  | If the second workqueue is too full, the workqueue add function blocks the | 
|  | first workqueue's workers until the backlog eases. | 
|  | This doesn't completely solve the balancing problem, but reduces it enough to | 
|  | move on to more pressing issues. | 
|  |  | 
|  | The proposed patchsets are the scrub | 
|  | `performance tweaks | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_ | 
|  | and the | 
|  | `inode scan rebalance | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_ | 
|  | series. | 
|  |  | 
|  | .. _scrubrepair: | 
|  |  | 
|  | Scheduling Repairs | 
|  | ------------------ | 
|  |  | 
|  | During phase 2, corruptions and inconsistencies reported in any AGI header or | 
|  | inode btree are repaired immediately, because phase 3 relies on proper | 
|  | functioning of the inode indices to find inodes to scan. | 
|  | Failed repairs are rescheduled to phase 4. | 
|  | Problems reported in any other space metadata are deferred to phase 4. | 
|  | Optimization opportunities are always deferred to phase 4, no matter their | 
|  | origin. | 
|  |  | 
|  | During phase 3, corruptions and inconsistencies reported in any part of a | 
|  | file's metadata are repaired immediately if all space metadata were validated | 
|  | during phase 2. | 
|  | Repairs that fail or cannot be repaired immediately are scheduled for phase 4. | 
|  |  | 
|  | In the original design of ``xfs_scrub``, it was thought that repairs would be | 
|  | so infrequent that the ``struct xfs_scrub_metadata`` objects used to | 
|  | communicate with the kernel could also be used as the primary object to | 
|  | schedule repairs. | 
|  | With recent increases in the number of optimizations possible for a given | 
|  | filesystem object, it became much more memory-efficient to track all eligible | 
|  | repairs for a given filesystem object with a single repair item. | 
|  | Each repair item represents a single lockable object -- AGs, metadata files, | 
|  | individual inodes, or a class of summary information. | 
|  |  | 
|  | Phase 4 is responsible for scheduling a lot of repair work in as quick a | 
|  | manner as is practical. | 
|  | The :ref:`data dependencies <scrubcheck>` outlined earlier still apply, which | 
|  | means that ``xfs_scrub`` must try to complete the repair work scheduled by | 
|  | phase 2 before trying repair work scheduled by phase 3. | 
|  | The repair process is as follows: | 
|  |  | 
|  | 1. Start a round of repair with a workqueue and enough workers to keep the CPUs | 
|  | as busy as the user desires. | 
|  |  | 
|  | a. For each repair item queued by phase 2, | 
|  |  | 
|  | i.   Ask the kernel to repair everything listed in the repair item for a | 
|  | given filesystem object. | 
|  |  | 
|  | ii.  Make a note if the kernel made any progress in reducing the number | 
|  | of repairs needed for this object. | 
|  |  | 
|  | iii. If the object no longer requires repairs, revalidate all metadata | 
|  | associated with this object. | 
|  | If the revalidation succeeds, drop the repair item. | 
|  | If not, requeue the item for more repairs. | 
|  |  | 
|  | b. If any repairs were made, jump back to 1a to retry all the phase 2 items. | 
|  |  | 
|  | c. For each repair item queued by phase 3, | 
|  |  | 
|  | i.   Ask the kernel to repair everything listed in the repair item for a | 
|  | given filesystem object. | 
|  |  | 
|  | ii.  Make a note if the kernel made any progress in reducing the number | 
|  | of repairs needed for this object. | 
|  |  | 
|  | iii. If the object no longer requires repairs, revalidate all metadata | 
|  | associated with this object. | 
|  | If the revalidation succeeds, drop the repair item. | 
|  | If not, requeue the item for more repairs. | 
|  |  | 
|  | d. If any repairs were made, jump back to 1c to retry all the phase 3 items. | 
|  |  | 
|  | 2. If step 1 made any repair progress of any kind, jump back to step 1 to start | 
|  | another round of repair. | 
|  |  | 
|  | 3. If there are items left to repair, run them all serially one more time. | 
|  | Complain if the repairs were not successful, since this is the last chance | 
|  | to repair anything. | 
|  |  | 
|  | Corruptions and inconsistencies encountered during phases 5 and 7 are repaired | 
|  | immediately. | 
|  | Corrupt file data blocks reported by phase 6 cannot be recovered by the | 
|  | filesystem. | 
|  |  | 
|  | The proposed patchsets are the | 
|  | `repair warning improvements | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_, | 
|  | refactoring of the | 
|  | `repair data dependency | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_ | 
|  | and | 
|  | `object tracking | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_, | 
|  | and the | 
|  | `repair scheduling | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_ | 
|  | improvement series. | 
|  |  | 
|  | Checking Names for Confusable Unicode Sequences | 
|  | ----------------------------------------------- | 
|  |  | 
|  | If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of | 
|  | phase 4, it moves on to phase 5, which checks for suspicious looking names in | 
|  | the filesystem. | 
|  | These names consist of the filesystem label, names in directory entries, and | 
|  | the names of extended attributes. | 
|  | Like most Unix filesystems, XFS imposes the sparest of constraints on the | 
|  | contents of a name: | 
|  |  | 
|  | - Slashes and null bytes are not allowed in directory entries. | 
|  |  | 
|  | - Null bytes are not allowed in userspace-visible extended attributes. | 
|  |  | 
|  | - Null bytes are not allowed in the filesystem label. | 
|  |  | 
|  | Directory entries and attribute keys store the length of the name explicitly | 
|  | ondisk, which means that nulls are not name terminators. | 
|  | For this section, the term "naming domain" refers to any place where names are | 
|  | presented together -- all the names in a directory, or all the attributes of a | 
|  | file. | 
|  |  | 
|  | Although the Unix naming constraints are very permissive, the reality of most | 
|  | modern-day Linux systems is that programs work with Unicode character code | 
|  | points to support international languages. | 
|  | These programs typically encode those code points in UTF-8 when interfacing | 
|  | with the C library because the kernel expects null-terminated names. | 
|  | In the common case, therefore, names found in an XFS filesystem are actually | 
|  | UTF-8 encoded Unicode data. | 
|  |  | 
|  | To maximize its expressiveness, the Unicode standard defines separate control | 
|  | points for various characters that render similarly or identically in writing | 
|  | systems around the world. | 
|  | For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders | 
|  | identically to "Latin Small Letter A" U+0061 "a". | 
|  |  | 
|  | The standard also permits characters to be constructed in multiple ways -- | 
|  | either by using a defined code point, or by combining one code point with | 
|  | various combining marks. | 
|  | For example, the character "Angstrom Sign U+212B "Å" can also be expressed | 
|  | as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above" | 
|  | U+030A "◌̊". | 
|  | Both sequences render identically. | 
|  |  | 
|  | Like the standards that preceded it, Unicode also defines various control | 
|  | characters to alter the presentation of text. | 
|  | For example, the character "Right-to-Left Override" U+202E can trick some | 
|  | programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png". | 
|  | A second category of rendering problems involves whitespace characters. | 
|  | If the character "Zero Width Space" U+200B is encountered in a file name, the | 
|  | name will render identically to a name that does not have the zero width | 
|  | space. | 
|  |  | 
|  | If two names within a naming domain have different byte sequences but render | 
|  | identically, a user may be confused by it. | 
|  | The kernel, in its indifference to upper level encoding schemes, permits this. | 
|  | Most filesystem drivers persist the byte sequence names that are given to them | 
|  | by the VFS. | 
|  |  | 
|  | Techniques for detecting confusable names are explained in great detail in | 
|  | sections 4 and 5 of the | 
|  | `Unicode Security Mechanisms <https://unicode.org/reports/tr39/>`_ | 
|  | document. | 
|  | When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the | 
|  | Unicode normalization form NFD in conjunction with the confusable name | 
|  | detection component of | 
|  | `libicu <https://github.com/unicode-org/icu>`_ | 
|  | to identify names with a directory or within a file's extended attributes that | 
|  | could be confused for each other. | 
|  | Names are also checked for control characters, non-rendering characters, and | 
|  | mixing of bidirectional characters. | 
|  | All of these potential issues are reported to the system administrator during | 
|  | phase 5. | 
|  |  | 
|  | Media Verification of File Data Extents | 
|  | --------------------------------------- | 
|  |  | 
|  | The system administrator can elect to initiate a media scan of all file data | 
|  | blocks. | 
|  | This scan after validation of all filesystem metadata (except for the summary | 
|  | counters) as phase 6. | 
|  | The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map | 
|  | to find areas that are allocated to file data fork extents. | 
|  | Gaps between data fork extents that are smaller than 64k are treated as if | 
|  | they were data fork extents to reduce the command setup overhead. | 
|  | When the space map scan accumulates a region larger than 32MB, a media | 
|  | verification request is sent to the disk as a directio read of the raw block | 
|  | device. | 
|  |  | 
|  | If the verification read fails, ``xfs_scrub`` retries with single-block reads | 
|  | to narrow down the failure to the specific region of the media and recorded. | 
|  | When it has finished issuing verification requests, it again uses the space | 
|  | mapping ioctl to map the recorded media errors back to metadata structures | 
|  | and report what has been lost. | 
|  | For media errors in blocks owned by files, parent pointers can be used to | 
|  | construct file paths from inode numbers for user-friendly reporting. | 
|  |  | 
|  | 7. Conclusion and Future Work | 
|  | ============================= | 
|  |  | 
|  | It is hoped that the reader of this document has followed the designs laid out | 
|  | in this document and now has some familiarity with how XFS performs online | 
|  | rebuilding of its metadata indices, and how filesystem users can interact with | 
|  | that functionality. | 
|  | Although the scope of this work is daunting, it is hoped that this guide will | 
|  | make it easier for code readers to understand what has been built, for whom it | 
|  | has been built, and why. | 
|  | Please feel free to contact the XFS mailing list with questions. | 
|  |  | 
|  | XFS_IOC_EXCHANGE_RANGE | 
|  | ---------------------- | 
|  |  | 
|  | As discussed earlier, a second frontend to the atomic file mapping exchange | 
|  | mechanism is a new ioctl call that userspace programs can use to commit updates | 
|  | to files atomically. | 
|  | This frontend has been out for review for several years now, though the | 
|  | necessary refinements to online repair and lack of customer demand mean that | 
|  | the proposal has not been pushed very hard. | 
|  |  | 
|  | File Content Exchanges with Regular User Files | 
|  | `````````````````````````````````````````````` | 
|  |  | 
|  | As mentioned earlier, XFS has long had the ability to swap extents between | 
|  | files, which is used almost exclusively by ``xfs_fsr`` to defragment files. | 
|  | The earliest form of this was the fork swap mechanism, where the entire | 
|  | contents of data forks could be exchanged between two files by exchanging the | 
|  | raw bytes in each inode fork's immediate area. | 
|  | When XFS v5 came along with self-describing metadata, this old mechanism grew | 
|  | some log support to continue rewriting the owner fields of BMBT blocks during | 
|  | log recovery. | 
|  | When the reverse mapping btree was later added to XFS, the only way to maintain | 
|  | the consistency of the fork mappings with the reverse mapping index was to | 
|  | develop an iterative mechanism that used deferred bmap and rmap operations to | 
|  | swap mappings one at a time. | 
|  | This mechanism is identical to steps 2-3 from the procedure above except for | 
|  | the new tracking items, because the atomic file mapping exchange mechanism is | 
|  | an iteration of an existing mechanism and not something totally novel. | 
|  | For the narrow case of file defragmentation, the file contents must be | 
|  | identical, so the recovery guarantees are not much of a gain. | 
|  |  | 
|  | Atomic file content exchanges are much more flexible than the existing swapext | 
|  | implementations because it can guarantee that the caller never sees a mix of | 
|  | old and new contents even after a crash, and it can operate on two arbitrary | 
|  | file fork ranges. | 
|  | The extra flexibility enables several new use cases: | 
|  |  | 
|  | - **Atomic commit of file writes**: A userspace process opens a file that it | 
|  | wants to update. | 
|  | Next, it opens a temporary file and calls the file clone operation to reflink | 
|  | the first file's contents into the temporary file. | 
|  | Writes to the original file should instead be written to the temporary file. | 
|  | Finally, the process calls the atomic file mapping exchange system call | 
|  | (``XFS_IOC_EXCHANGE_RANGE``) to exchange the file contents, thereby | 
|  | committing all of the updates to the original file, or none of them. | 
|  |  | 
|  | .. _exchrange_if_unchanged: | 
|  |  | 
|  | - **Transactional file updates**: The same mechanism as above, but the caller | 
|  | only wants the commit to occur if the original file's contents have not | 
|  | changed. | 
|  | To make this happen, the calling process snapshots the file modification and | 
|  | change timestamps of the original file before reflinking its data to the | 
|  | temporary file. | 
|  | When the program is ready to commit the changes, it passes the timestamps | 
|  | into the kernel as arguments to the atomic file mapping exchange system call. | 
|  | The kernel only commits the changes if the provided timestamps match the | 
|  | original file. | 
|  | A new ioctl (``XFS_IOC_COMMIT_RANGE``) is provided to perform this. | 
|  |  | 
|  | - **Emulation of atomic block device writes**: Export a block device with a | 
|  | logical sector size matching the filesystem block size to force all writes | 
|  | to be aligned to the filesystem block size. | 
|  | Stage all writes to a temporary file, and when that is complete, call the | 
|  | atomic file mapping exchange system call with a flag to indicate that holes | 
|  | in the temporary file should be ignored. | 
|  | This emulates an atomic device write in software, and can support arbitrary | 
|  | scattered writes. | 
|  |  | 
|  | Vectorized Scrub | 
|  | ---------------- | 
|  |  | 
|  | As it turns out, the :ref:`refactoring <scrubrepair>` of repair items mentioned | 
|  | earlier was a catalyst for enabling a vectorized scrub system call. | 
|  | Since 2018, the cost of making a kernel call has increased considerably on some | 
|  | systems to mitigate the effects of speculative execution attacks. | 
|  | This incentivizes program authors to make as few system calls as possible to | 
|  | reduce the number of times an execution path crosses a security boundary. | 
|  |  | 
|  | With vectorized scrub, userspace pushes to the kernel the identity of a | 
|  | filesystem object, a list of scrub types to run against that object, and a | 
|  | simple representation of the data dependencies between the selected scrub | 
|  | types. | 
|  | The kernel executes as much of the caller's plan as it can until it hits a | 
|  | dependency that cannot be satisfied due to a corruption, and tells userspace | 
|  | how much was accomplished. | 
|  | It is hoped that ``io_uring`` will pick up enough of this functionality that | 
|  | online fsck can use that instead of adding a separate vectored scrub system | 
|  | call to XFS. | 
|  |  | 
|  | The relevant patchsets are the | 
|  | `kernel vectorized scrub | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_ | 
|  | and | 
|  | `userspace vectorized scrub | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_ | 
|  | series. | 
|  |  | 
|  | Quality of Service Targets for Scrub | 
|  | ------------------------------------ | 
|  |  | 
|  | One serious shortcoming of the online fsck code is that the amount of time that | 
|  | it can spend in the kernel holding resource locks is basically unbounded. | 
|  | Userspace is allowed to send a fatal signal to the process which will cause | 
|  | ``xfs_scrub`` to exit when it reaches a good stopping point, but there's no way | 
|  | for userspace to provide a time budget to the kernel. | 
|  | Given that the scrub codebase has helpers to detect fatal signals, it shouldn't | 
|  | be too much work to allow userspace to specify a timeout for a scrub/repair | 
|  | operation and abort the operation if it exceeds budget. | 
|  | However, most repair functions have the property that once they begin to touch | 
|  | ondisk metadata, the operation cannot be cancelled cleanly, after which a QoS | 
|  | timeout is no longer useful. | 
|  |  | 
|  | Defragmenting Free Space | 
|  | ------------------------ | 
|  |  | 
|  | Over the years, many XFS users have requested the creation of a program to | 
|  | clear a portion of the physical storage underlying a filesystem so that it | 
|  | becomes a contiguous chunk of free space. | 
|  | Call this free space defragmenter ``clearspace`` for short. | 
|  |  | 
|  | The first piece the ``clearspace`` program needs is the ability to read the | 
|  | reverse mapping index from userspace. | 
|  | This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl. | 
|  | The second piece it needs is a new fallocate mode | 
|  | (``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and | 
|  | maps it to a file. | 
|  | Call this file the "space collector" file. | 
|  | The third piece is the ability to force an online repair. | 
|  |  | 
|  | To clear all the metadata out of a portion of physical storage, clearspace | 
|  | uses the new fallocate map-freespace call to map any free space in that region | 
|  | to the space collector file. | 
|  | Next, clearspace finds all metadata blocks in that region by way of | 
|  | ``GETFSMAP`` and issues forced repair requests on the data structure. | 
|  | This often results in the metadata being rebuilt somewhere that is not being | 
|  | cleared. | 
|  | After each relocation, clearspace calls the "map free space" function again to | 
|  | collect any newly freed space in the region being cleared. | 
|  |  | 
|  | To clear all the file data out of a portion of the physical storage, clearspace | 
|  | uses the FSMAP information to find relevant file data blocks. | 
|  | Having identified a good target, it uses the ``FICLONERANGE`` call on that part | 
|  | of the file to try to share the physical space with a dummy file. | 
|  | Cloning the extent means that the original owners cannot overwrite the | 
|  | contents; any changes will be written somewhere else via copy-on-write. | 
|  | Clearspace makes its own copy of the frozen extent in an area that is not being | 
|  | cleared, and uses ``FIEDEUPRANGE`` (or the :ref:`atomic file content exchanges | 
|  | <exchrange_if_unchanged>` feature) to change the target file's data extent | 
|  | mapping away from the area being cleared. | 
|  | When all other mappings have been moved, clearspace reflinks the space into the | 
|  | space collector file so that it becomes unavailable. | 
|  |  | 
|  | There are further optimizations that could apply to the above algorithm. | 
|  | To clear a piece of physical storage that has a high sharing factor, it is | 
|  | strongly desirable to retain this sharing factor. | 
|  | In fact, these extents should be moved first to maximize sharing factor after | 
|  | the operation completes. | 
|  | To make this work smoothly, clearspace needs a new ioctl | 
|  | (``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace. | 
|  | With the refcount information exposed, clearspace can quickly find the longest, | 
|  | most shared data extents in the filesystem, and target them first. | 
|  |  | 
|  | **Future Work Question**: How might the filesystem move inode chunks? | 
|  |  | 
|  | *Answer*: To move inode chunks, Dave Chinner constructed a prototype program | 
|  | that creates a new file with the old contents and then locklessly runs around | 
|  | the filesystem updating directory entries. | 
|  | The operation cannot complete if the filesystem goes down. | 
|  | That problem isn't totally insurmountable: create an inode remapping table | 
|  | hidden behind a jump label, and a log item that tracks the kernel walking the | 
|  | filesystem to update directory entries. | 
|  | The trouble is, the kernel can't do anything about open files, since it cannot | 
|  | revoke them. | 
|  |  | 
|  | **Future Work Question**: Can static keys be used to minimize the cost of | 
|  | supporting ``revoke()`` on XFS files? | 
|  |  | 
|  | *Answer*: Yes. | 
|  | Until the first revocation, the bailout code need not be in the call path at | 
|  | all. | 
|  |  | 
|  | The relevant patchsets are the | 
|  | `kernel freespace defrag | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_ | 
|  | and | 
|  | `userspace freespace defrag | 
|  | <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_ | 
|  | series. | 
|  |  | 
|  | Shrinking Filesystems | 
|  | --------------------- | 
|  |  | 
|  | Removing the end of the filesystem ought to be a simple matter of evacuating | 
|  | the data and metadata at the end of the filesystem, and handing the freed space | 
|  | to the shrink code. | 
|  | That requires an evacuation of the space at end of the filesystem, which is a | 
|  | use of free space defragmentation! |