 .. SPDX-License-Identifier: GPL-2.0

 Block and Inode Allocation Policy
 ---------------------------------

 ext4 recognizes (better than ext3, anyway) that data locality is
 generally a desirable quality of a filesystem. On a spinning disk,
 keeping related blocks near each other reduces the amount of movement
 that the head actuator and disk must perform to access a data block,
 thus speeding up disk I/O. On an SSD there are of course no moving parts,
 but locality can increase the size of each transfer request while
 reducing the total number of requests. This locality may also have the
 effect of concentrating writes on a single erase block, which can speed
 up file rewrites significantly. Therefore, it is useful to reduce
 fragmentation whenever possible.

 The first tool that ext4 uses to combat fragmentation is the multi-block
 allocator. When a file is first created, the block allocator
 speculatively allocates 8KiB of disk space to the file on the assumption
 that the space will get written soon. When the file is closed, the
 unused speculative allocations are of course freed, but if the
 speculation is correct (typically the case for full writes of small
 files) then the file data gets written out in a single multi-block
 extent. A second, related trick that ext4 uses is delayed allocation.
 Under this scheme, when a file needs more blocks to absorb file writes,
 the filesystem defers deciding the exact placement on the disk until all
 the dirty buffers are being written out to disk. By not committing to a
 particular placement until it's absolutely necessary (the commit timeout
 is hit, or sync() is called, or the kernel runs out of memory), the hope
 is that the filesystem can make better location decisions.
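
The effect of delayed allocation can be illustrated with a toy model. This is
only a sketch, not how the kernel implements it: ``ToyDisk``, ``eager_writes``
and ``delayed_writes`` are hypothetical names, and the "disk" is a simple
counter that hands out consecutive block numbers.

```python
from dataclasses import dataclass


@dataclass
class ToyDisk:
    """Hypothetical disk model: hands out consecutive block numbers."""
    next_free: int = 0

    def allocate(self, count):
        """Return one contiguous extent as (start_block, length)."""
        start = self.next_free
        self.next_free += count
        return (start, count)


def eager_writes(disk, writes):
    """Pick a placement at every write; interleaved writers fragment."""
    extents = {}
    for name, nblocks in writes:
        extents.setdefault(name, []).append(disk.allocate(nblocks))
    return extents


def delayed_writes(disk, writes):
    """Only count dirty blocks per file, then allocate once at writeback."""
    dirty = {}
    for name, nblocks in writes:
        dirty[name] = dirty.get(name, 0) + nblocks
    return {name: [disk.allocate(total)] for name, total in dirty.items()}


# Two files written in alternation, one block at a time.
writes = [("a", 1), ("b", 1), ("a", 1), ("b", 1)]
print(eager_writes(ToyDisk(), writes))
print(delayed_writes(ToyDisk(), writes))
```

With eager placement each file ends up with two one-block extents; with
deferred placement each file gets a single two-block extent, because the
total size is known before any location is chosen.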

 The third trick that ext4 (and ext3) uses is that it tries to keep a
 file's data blocks in the same block group as its inode. This cuts down
 on the seek penalty when the filesystem first has to read a file's inode
 to learn where the file's data blocks live and then seek over to the
 file's data blocks to begin I/O operations.
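
Whether an inode and a data block share a block group is a matter of integer
division. The sketch below assumes the standard ext4 layout, in which inode
numbers start at 1 and block numbers are offset by the first data block; the
function names and the sample geometry (8192 inodes and 32768 blocks per
group) are illustrative assumptions, not values read from a real filesystem.

```python
def inode_group(ino, inodes_per_group):
    """Block group holding inode `ino`; inode numbers start at 1."""
    return (ino - 1) // inodes_per_group


def block_group(block, blocks_per_group, first_data_block=0):
    """Block group holding `block`.

    first_data_block is 0 for block sizes above 1KiB and 1 for 1KiB blocks.
    """
    return (block - first_data_block) // blocks_per_group


def is_local(ino, block, inodes_per_group, blocks_per_group, first_data_block=0):
    """True when the data block lives in the same group as its inode."""
    return inode_group(ino, inodes_per_group) == block_group(
        block, blocks_per_group, first_data_block
    )


# Assumed geometry: 8192 inodes and 32768 blocks per group.
print(is_local(ino=8193, block=40000, inodes_per_group=8192, blocks_per_group=32768))
```

Here inode 8193 and block 40000 both fall in group 1, so reading the inode
and then the data stays within one group.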

 The fourth trick is that all the inodes in a directory are placed in the
 same block group as the directory, when feasible. The working assumption
 here is that all the files in a directory might be related, so it is
 useful to try to keep them all together.
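
The "when feasible" rule can be sketched as a preference with a fallback.
The function below is hypothetical: the linear scan from the parent's group
is an assumption made for illustration, and the kernel's actual fallback
search is more elaborate.

```python
def pick_group_for_file(parent_group, free_inodes):
    """Choose a block group for a new file's inode.

    free_inodes[g] is the number of free inodes in group g. The parent
    directory's group is tried first; otherwise later groups are scanned
    in order, wrapping around. Returns None if no group has a free inode.
    """
    ngroups = len(free_inodes)
    for offset in range(ngroups):
        group = (parent_group + offset) % ngroups
        if free_inodes[group] > 0:
            return group
    return None


print(pick_group_for_file(2, [5, 5, 5, 5]))  # parent's group has room
print(pick_group_for_file(2, [5, 5, 0, 1]))  # parent's group is full
```

The first call stays in group 2; the second falls through to group 3, the
next group that still has a free inode.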

 The fifth trick is that the disk volume is cut up into 128MiB block
 groups (at the default 4KiB block size); these mini-containers are used
 as outlined above to try to maintain data locality. However, there is a
 deliberate quirk -- when a
 directory is created in the root directory, the inode allocator scans
 the block groups and puts that directory into the least heavily loaded
 block group that it can find. This encourages directories to spread out
 over a disk; as the top-level directory/file blobs fill up one block
 group, the allocators simply move on to the next block group. Allegedly
 this scheme evens out the loading on the block groups, though the author
 suspects that the directories which are so unlucky as to land towards
 the end of a spinning drive get a raw deal performance-wise.
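
Both halves of this paragraph can be sketched in a few lines. The group size
follows from the default layout, where one bitmap block tracks the blocks of
a group, giving ``8 * block_size`` blocks per group. The placement function
is a toy: measuring "least heavily loaded" by free inodes, then free blocks,
is an assumption made here, and both function names are hypothetical.

```python
def block_group_bytes(block_size):
    """Size of one block group, assuming one bitmap block per group."""
    blocks_per_group = 8 * block_size  # one bit per block in the bitmap
    return blocks_per_group * block_size


def pick_group_for_top_level_dir(free_inodes, free_blocks):
    """Toy policy: the group with the most free inodes wins.

    Ties are broken by free blocks, then by the lowest group index.
    """
    return max(
        range(len(free_inodes)),
        key=lambda g: (free_inodes[g], free_blocks[g], -g),
    )


print(block_group_bytes(4096) // (1024 * 1024))  # group size in MiB for 4KiB blocks
print(pick_group_for_top_level_dir([10, 50, 50], [100, 20, 30]))
```

With 4KiB blocks the group size is 128MiB. In the sample call, groups 1 and 2
tie on free inodes, and group 2 is chosen because it has more free blocks.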

 Of course, if all of these mechanisms fail, one can always use e4defrag
 to defragment files.