|
|
Line 1: |
Line 1: |
− | = Problem statement =
| + | 2010-12-13: [http://prdownloads.sourceforge.net/e2fsprogs/e2fsprogs-1.41.13.tar.gz e2fsprogs version 1.41.13] has been released. |
− | | + | |
− | If the vast majority of the files are large, and the system is running
| + | |
− | under tight memory pressure, such that the block allocation bitmaps
| + | |
− | and buddy bitmaps are frequently getting evicted from the system, if
| + | |
− | the filesystem's free space is moderately fragmented, file allocations
| + | |
− | and unlinks may require reading multiple block bitmaps into memory,
| + | |
− | and this can be a surprisingly large amount of overhead, since the
| + | |
− | reading the various block bitmaps is a synchronous operation that
| + | |
− | tends to involve a large number of seeks.
| + | |
− | | + | |
− | One way of solving this problem is to use a large block size than 4k
| + | |
− | --- say, perhaps 256k or 1 megabyte. By reducing the number of blocks | + | |
− | needed to allocate the same file, and at the same time increasing the
| + | |
− | amount of storage that can be covered by a block allocation bitmap, we
| + | |
− | can address this problem, at the cost of increased internal
| + | |
− | fragmentation. However Linux has a fundamental assumption that the
| + | |
− | page size is >= the file system block size.
| + | |
− | | + | |
− | So the goal is to design a system that allows us to get the benefits
| + | |
− | of larger block sizes without the complexities caused by large block
| + | |
− | sizes.
| + | |
− | | + | |
− | = Constraints =
| + | |
− | | + | |
− | We want to keep the changes to the ext4 code base small and easily
| + | |
− | testable. Ideally, no changes to the core kernel should be required.
| + | |
− | What follows from this is a requirement that the fundamental block
| + | |
− | size (as far as the VM is concerned) must remain at 4k.
| + | |
− | | + | |
− | = Design =
| + | |
− | | + | |
− | The solution to this problem is to use an increased allocation size as
| + | |
− | far as the block allocaiton bitmaps are concerned. However, the size
| + | |
− | of allocation bitmaps, and the granularity of blocks as far as the the
| + | |
− | extent tree blocks are concerned, are still based on the original
| + | |
− | maximum 4k block size.
| + | |
− | | + | |
− | We assume the fields previously used for defining the fragment size in
| + | |
− | the ext4 superblock: s_log_frag_size and redefine it to be
| + | |
− | s_log_alloc_size. This defines a size, the allocation blocksize. The
| + | |
− | allocation blocksize is only enabled if a new INCOMPAT feature is
| + | |
− | defined, EXT4_FEATURE_INCOMPAT_BIGALLOC, and it must be greater than
| + | |
− | or equal to the block size.
| + | |
− | | + | |
− | The allocation blocksize controls the allocation granularity of a bit
| + | |
− | in the block bitmap. However, the maximum size of a block bitmap
| + | |
− | remains a single block size. Hence, for a file system with a 4k block
| + | |
− | size and a 1 megabyte allocation blocksize, there can be a maximum of
| + | |
− | 32,768 (4k * 8) allocation blocks, and so a single block group can now
| + | |
− | cover a maximum of 32GiB.
| + | |
− | | + | |
− | To minimize changes to ext, we will constrain an allocation block such
| + | |
− | that it must contain only fixed file system metadata (i.e., block
| + | |
− | bitmaps, inode bitmaps, and inode table) or data blocks belonging to a
| + | |
− | single inode. Furthermore, if allocation block is to be used for data
| + | |
− | blocks for an inode, it must be used contiguously starting at a file
| + | |
− | offset which is a multiple of the allocation blocksize. The first
| + | |
− | block of the allocation block must be mapped in the extent tree (or
| + | |
− | indirect block), so that the ext4_map_blocks() can easily find (and
| + | |
− | continue using) partially used allocation block.
| + | |
− | | + | |
− | Because we are not changing the definition of a block, the only
| + | |
− | changes that need to be made are at the intersection of allocating to
| + | |
− | an inode (or to file system metadata). This is good, because it means
| + | |
− | the bulk of ext4 does not need to be changed
| + | |
− | | + | |
− | | + | |
− | = Kernel Changes required =
| + | |
− | | + | |
− | 1) Globally throughout ext4: uses of EXT4_BLOCKS_PER_GROUP() need to
| + | |
− | be audited to see if they should be EXT4_BLOCKS_PER_GROUP() or
| + | |
− | EXT4_ALLOC_BLOCKS_PER_GROUP().
| + | |
− | | + | |
− | 2) ext4_map_blocks() and its downstream functions need to be changed so
| + | |
− | that they understand the new allocation rules, and in particular
| + | |
− | understand that before allocating a new block, they need to see if a
| + | |
− | partially allocated block has already been allocated, and can be used
| + | |
− | to fulfill the current allocation.
| + | |
− | | + | |
− | 3) mballoc.c will need little or no changes, other than the
| + | |
− | EXT4_BLOCKS_PER_GROUP()/EXT4_ALLOC_BLOCKS_PER_GROUP() audit discussed
| + | |
− | in (1).
| + | |
− | | + | |
− | = E2fsprogs changes required =
| + | |
− | | + | |
− | # Changes to dumpe2fs, debugfs, etc., to understand the new
| + | |
− | superblock field.
| + | |
− | | + | |
− | # Changes to mke2fs to understand how to allocate the inode, block,
| + | |
− | and inode table metadata blocks.
| + | |
− | | + | |
− | # Changes to e2fsck pass1 and pass 1b/c/d to understand that a single
| + | |
− | allocation block can be shared across multiple file system metadata
| + | |
− | blocks or a file's data blocks. (But the fact that allocation block
| + | |
− | can be used contiguously vis-a-vis an inode's logical block space will
| + | |
− | make these changes relatively straightforwarwd.)
| + | |
− | | + | |
− | = Upsides =
| + | |
− | | + | |
− | Block allocation and deallocation for large files should be much
| + | |
− | faster, since a block bitmap will cover much more space, and it is
| + | |
− | much less likely that a single file system operation will need to
| + | |
− | touch multiple block bitmaps.
| + | |
− | | + | |
− | Using a larger allocation blocksize will also reduce external
| + | |
− | fragmentation. By reducing the free space fragmentation, when files
| + | |
− | are written (and later read) seeking between data blocks will be
| + | |
− | greatly reduced.
| + | |
− | | + | |
− | Directory blocks will be contiguous, which will speed up situations
| + | |
− | where a directory is very, very large.
| + | |
− | | + | |
− | = Downsides =
| + | |
− | | + | |
− | Internal fragmentation will be expensive for small files. So this is
| + | |
− | only useful for file systems where most files are large, or where the
| + | |
− | file system performance is more important than the losses caused by
| + | |
− | internal fragmentation.
| + | |
− | | + | |
− | Directories will also be allocated in chucks of the allocation block
| + | |
− | size. If this is especially large (such as 1 MiB), and there are a
| + | |
− | large number of directories, this could be quite expensive.
| + | |
− | Applications which use multi-level directory schemes to keep
| + | |
− | directories small to optimize for ext2's very slow large directory
| + | |
− | performance could be especially vulnerable.
| + | |