Ext4 VM Images

Revision as of 02:11, 28 February 2014 by Djwong (Talk | contribs)




Various ext4 users have a particular use-case for system provisioning where they create a FS image, populate it with whatever files they want, and then want to reduce the image to a minimal one to conserve network bandwidth. At deploy time the minimal image is copied to a disk and expanded with resize2fs to fill the whole disk. There are a number of strategies to handle this operation, each with strengths and weaknesses; this guide attempts to present some of the better known solutions that have come up over the years.

Creating The Smallest Image You Need

The easiest way to create a petite ext4 image is to create a filesystem that is slightly larger than the amount of data to be copied and then to copy the files in. cp writes out the files one by one, so fragmentation should be fairly low. While it is still true that ext4 becomes less efficient as it approaches ~90% full, the kernel driver's coping strategies are still better than those of resize2fs (see below).

The most difficult aspect of this strategy, of course, is estimating the size of the filesystem to create. While the space requirements of file data, extents, and directories can be estimated fairly neatly by summing each inode's i_blocks/i_blocks_high field (provided the source files are themselves on an ext2/3/4 filesystem), there are other metadata in the filesystem that also require space, namely bitmaps, superblocks, group descriptors, and inode tables. Assuming a default mkfs invocation (has_journal,flex_bg,extents,^64bit) with 4KiB blocks, there are four things to sum: block group metadata, block group descriptors, inode tables, and the journal. Each block group will be 128MiB in size (blocksize^2 * 8 bytes); each block group may need three blocks (two bitmaps and a backup superblock). The block group descriptors are usually 32 bytes apiece. Inode tables usually require 256 bytes for every potential inode, and a potential inode is usually reserved for every 16KiB of space. The journal is usually 128MiB. All of these settings can be tweaked via mke2fs.conf or by passing command line options to mkfs.
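The arithmetic above can be sketched roughly as follows. This is a back-of-the-envelope estimate in Python that uses only the defaults quoted in the text; the function name is hypothetical, and charging three blocks to every group is a simplifying assumption. Real mke2fs layouts (backup descriptor copies, reserved GDT blocks, a journal sized to the filesystem) will differ.

```python
import math

def estimate_metadata_bytes(fs_bytes, block_size=4096, desc_size=32,
                            inode_size=256, bytes_per_inode=16384,
                            journal_bytes=128 * 1024 * 1024):
    """Rough estimate of ext4 metadata for a filesystem of fs_bytes.

    All defaults are the figures quoted in the text; none are read
    from a real filesystem.
    """
    # A block bitmap is one block, so a group covers block_size * 8 blocks.
    group_bytes = block_size * block_size * 8
    groups = math.ceil(fs_bytes / group_bytes)

    # Assumption: every group pays for two bitmaps and a backup superblock.
    per_group = groups * 3 * block_size
    descriptors = groups * desc_size
    inode_tables = (fs_bytes // bytes_per_inode) * inode_size

    return per_group + descriptors + inode_tables + journal_bytes
```

With these defaults the inode tables alone account for 256/16384, or about 1.6%, of the filesystem, so the journal dominates the total on small images. Passing journal_bytes=0 models the journal-less case.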

The author (djwong) guesses that adding 5% overhead will be more than enough for ext4. Space can be conserved by creating a journal-less filesystem for deploy and then adding one later, if desired.
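As a minimal sketch of that rule of thumb, the Python below sums the allocated size of a source tree and pads it by 5%. The function names are hypothetical. It relies on st_blocks from os.lstat, which reports allocated space in 512-byte units on Linux; treating that as a stand-in for summing i_blocks is an assumption that only holds well when the source tree lives on an ext2/3/4 filesystem.

```python
import os

def allocated_bytes(root):
    """Sum allocated bytes for every inode under root, counting hard links once."""
    seen = set()
    total = 0

    def count(path):
        nonlocal total
        st = os.lstat(path)              # lstat: do not follow symlinks
        key = (st.st_dev, st.st_ino)
        if key not in seen:
            seen.add(key)
            total += st.st_blocks * 512  # st_blocks is in 512-byte units

    count(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            count(os.path.join(dirpath, name))
    return total

def image_size_bytes(data_bytes, overhead_percent=5, block_size=4096):
    """Pad data_bytes by overhead_percent and round up to a whole block."""
    padded_times_100 = data_bytes * (100 + overhead_percent)
    blocks = -(-padded_times_100 // (100 * block_size))  # ceiling division
    return blocks * block_size
```

A caller would pass image_size_bytes(allocated_bytes("/path/to/tree")) as the size of the image file to create before running mkfs on it.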

Please read the section on Copying Sparse Images Around for details about how to transport the filesystem image.

Sparse Images

In general, there are large parts of a filesystem that are unused and need not be copied around when transmitting filesystem images. There are several ways in which a filesystem image can be said to be sparse: (1) the unused blocks have had zeroes written to them; (2) the underlying storage has been instructed not to map physical storage to the unused blocks and therefore returns zeroes (hopefully) for a disk read; or (3) the image lives on a filesystem and the underlying filesystem does not map a physical block to the logical block in the file. Ideally, we'd be able to detect runs of unmapped blocks and runs of mapped but zeroed blocks to avoid transmitting data unnecessarily.
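To make the "runs of zeroes" idea concrete, here is a small Python sketch that scans an image file and reports runs of all-zero blocks. The function name and the 4KiB block size are illustrative assumptions. Note that a plain read cannot tell an unmapped block from a mapped block full of zeroes, since both come back as zeroes.

```python
def zero_runs(path, block_size=4096):
    """Return [(first_block, block_count), ...] for runs of all-zero blocks."""
    runs = []
    run_start = None
    index = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            if block.count(0) == len(block):   # every byte is zero
                if run_start is None:
                    run_start = index
            elif run_start is not None:
                runs.append((run_start, index - run_start))
                run_start = None
            index += 1
    if run_start is not None:                  # file ended inside a run
        runs.append((run_start, index - run_start))
    return runs
```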

Underlying Storage

If a filesystem's underlying storage supports unmapping blocks, it is possible to use this capability to detach blocks from unused areas of the FS. This can be obtained by hosting the FS atop: a raw image on a reasonably sophisticated filesystem; qcow2; device-mapper's thin provisioning (dm-thinp) scheme; an expensive SCSI array; or a (competently engineered) SSD.

With a recent enough QEMU (1.5?) one can present virtual disks to the guest via virtio-scsi, sym53c8xx, or ahci. If the backing for the virtual disk (file on a FS, raw block device, etc.) supports discard/trim/unmap, qemu will detect this and enable the guest to pass those requests through to the storage. QEMU 1.4 has it off by default, but I think 1.5 has it on by default. Thanks to Calvin Walton for pointing this out.

Good Tools for Making Sparse Images: fstrim/e2fsck/zerofree

fstrim is a tool that instructs online filesystems to issue a discard/trim/unmap command for all unused blocks in the filesystem. If the underlying storage supports it, this will result in the unused blocks being detached from the image or disk. fstrim should work on raw devices; support for loopback-mounted file images was added to the kernel in 2013. Note, however, that as of kernel 3.13 fstrim will not fall back to writing zeroes if the storage does not support discard/trim/unmap. This may change in the near future.

Recent versions of e2fsck have been taught to discard unused blocks after a full filesystem check. The discard operation is performed during pass 5, and only if the filesystem doesn't contain recognizable errors. Keep in mind that discarding the device may prevent further recovery if the filesystem has errors that are not known to the version of e2fsck. To use this method, run e2fsck -E discard.

zerofree is an offline tool for ext4 that reads the block bitmaps and writes zeroes to unused blocks. Note that it does not (yet) support issuing discard/trim/unmap commands to the underlying storage the way that fstrim can. Ted Ts'o says that he uses compress-rootfs to maintain the VM root filesystem that he uses to test upstream ext4 changes. After updating to the latest Debian unstable packages and installing the latest updates from the xfstests and e2fsprogs git repositories, he runs compress-rootfs, which uses the zerofree.c program to compress the qcow2 root filesystem image that he uses with kvm.

Bad Tool for Making Sparse Images: cat /dev/zero > /mnt/bigfile

NOTE: This method is not recommended by the ext4 developers!

While it is possible to zero out unused disk blocks by writing a lot of zeroes to a file and then deleting it, this is not a recommended way to do this! First of all, building such a huge file requires the construction of an extent tree, which means that the filesystem ends up allocating disk blocks in order to zero unused disk blocks. This is not efficient, and if you've not mounted with -o discard, the extent tree blocks are never zeroed. This problem gets much worse on any filesystem that doesn't use extents (ext2/3). The second problem is that ext4 remembers which block groups have never had blocks allocated, which enables e2fsck to skip checking whether those block groups' free block counts correspond to their block bitmaps. Filling the disk with one giant file allocates blocks in every group and so defeats this optimization, which slows down e2fsck.

Kristian Hermansen mentioned this method in a G+ thread.

Copying Sparse Images Around

To copy a sparse FS image from one storage device to another, one can use e2image -rap src_dev dest_dev to copy only the blocks that are in use from src_dev to dest_dev. Ted Ts'o also points out that the -c option compares blocks between src_dev and dest_dev, and copies a block only if there's a discrepancy; apparently he was using this to keep two root FSes in sync with each other on his laptop. This only works with newer versions of e2fsprogs, alas (1.42.9?).
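The compare-then-copy idea can be illustrated with a few lines of Python. This is only a sketch of the general technique on ordinary files, not a description of how e2image implements -c; the function name and block size are hypothetical.

```python
def sync_changed_blocks(src_path, dst_path, block_size=4096):
    """Overwrite only the blocks of dst that differ from src.

    Returns the number of blocks written. dst must already exist.
    """
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        offset = 0
        while True:
            wanted = src.read(block_size)
            if not wanted:
                break
            dst.seek(offset)
            current = dst.read(len(wanted))
            if current != wanted:
                dst.seek(offset)      # rewind before overwriting
                dst.write(wanted)
                written += 1
            offset += len(wanted)
    return written
```

The appeal of this approach is that an already mostly identical destination sees very few writes, at the cost of reading both sides in full.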

A neat way to copy a sparse FS image hosted atop a file or device is to use rsync -S. The -S option tells rsync to look for runs of zeroes in the source file, and not to copy them to the destination file, regardless of whether or not the source has had its unused blocks unmapped. This is a little inefficient, since it doesn't query the file's logical block map.
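The zero-skipping strategy described above can be sketched in Python as follows. This is an illustration of the general technique and not rsync's actual code; the function name and block size are assumptions. Whether the skipped ranges really end up unallocated depends on the destination filesystem supporting sparse files.

```python
import os

def copy_skipping_zeroes(src_path, dst_path, block_size=4096):
    """Copy src to dst, seeking over all-zero blocks instead of writing them.

    Returns the number of blocks that were skipped.
    """
    skipped = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            if block.count(0) == len(block):
                dst.seek(len(block), os.SEEK_CUR)  # leave a hole
                skipped += 1
            else:
                dst.write(block)
        # If the source ends in zeroes, the seeks alone do not extend dst.
        dst.truncate(src.tell())
    return skipped
```

Every source block is still read, which is the inefficiency the paragraph above mentions.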

If you know the source is a file that is also maximally (or satisfactorily) sparse, you can also use cp --sparse=always to copy the file. The --sparse option tells cp to read the extent map of the file to determine which data needn't be copied, though it only seems to query the file's logical block map if there's a large enough discrepancy between the source file's size and blocks allocated.
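For comparison, here is one way a program can ask the filesystem which byte ranges of a file contain data, using lseek with SEEK_DATA and SEEK_HOLE on Linux. This is a sketch of that interface only; it makes no claim about which interface any particular version of cp uses, and the function name is hypothetical. On filesystems without hole support the whole file is simply reported as one data range.

```python
import errno
import os

def data_extents(path):
    """Return [(start, end), ...] byte ranges that the filesystem reports as data."""
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:   # no more data before EOF
                    break
                raise
            # From inside a data range, SEEK_HOLE finds where it ends
            # (end of file counts as a hole).
            end = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return extents
```

A copy tool built on this would read and write only the returned ranges and never touch the holes.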

For those wishing to send a compressed tarball, tar -S creates a tarball with sparse files inside. Like rsync, it looks for runs of zeroes to skip and doesn't query the file's logical block map.

Shrinking an FS to Minimal Size

Note: This method is NOT recommended by the ext4 developers!

One common approach to solving the 'minified FS' problem is to run mkfs on a large-ish partition, copy the desired files into the filesystem, and then run resize2fs -M to shrink the filesystem to the smallest possible size. Surprisingly, this does not necessarily yield the smallest possible image, due to a negative interaction between the ext4 block allocator and the way resize2fs implements block migration. Put simply, ext4 tries to minimize fragmentation by creating top-level directories in different block groups and trying to store the same number of directories in each block group. For normal use this is mostly OK because seeking between mostly contiguous files in different block groups is generally less costly (particularly given the use of disk readahead) than seeking all over heavily fragmented files crammed into as few block groups as possible.

However, when it comes time to minimize the filesystem, resize2fs will find itself with a lot of blocks to move. The block migration algorithm was not designed for efficiency; given a block to move, it simply moves it to the lowest available block and hopes for the best. Unfortunately, the general result of this is heavy file fragmentation. Worse yet, the increased fragmentation requires a more complex extent tree, which in turn will eat more disk blocks. The end result is an inefficient and slow filesystem.
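A toy model shows why "move each block to the lowest free block" fragments files. The Python below is a deliberately simplified simulation and not resize2fs's real algorithm: the disk is a flat list of blocks, '.' marks a free block, and all names are hypothetical.

```python
def parse(layout):
    """'AA..BB' -> [('A', 0), ('A', 1), None, None, ('B', 0), ('B', 1)]"""
    counts = {}
    blocks = []
    for ch in layout:
        if ch == ".":
            blocks.append(None)
        else:
            blocks.append((ch, counts.get(ch, 0)))  # (file, logical block)
            counts[ch] = counts.get(ch, 0) + 1
    return blocks

def shrink(blocks, new_size):
    """Move every block at or past new_size to the lowest free block below it."""
    result = list(blocks[:new_size])
    free = [i for i, b in enumerate(result) if b is None]
    movers = [b for b in blocks[new_size:] if b is not None]
    if len(movers) > len(free):
        raise ValueError("not enough free space below new_size")
    for slot, b in zip(free, movers):
        result[slot] = b
    return result

def extent_count(blocks):
    """Count physically contiguous runs per file, following logical order."""
    where = {b: phys for phys, b in enumerate(blocks) if b is not None}
    counts = {}
    for (name, logical), phys in where.items():
        prev = where.get((name, logical - 1))
        if prev is None or prev + 1 != phys:
            counts[name] = counts.get(name, 0) + 1
    return counts

before = parse("AA..BB..CCCC....")
after = shrink(before, 8)
```

In this example file C starts as one contiguous extent and ends up split across the gaps left behind files A and B, so extent_count(after) reports two extents for C where extent_count(before) reported one.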

Ted Ts'o adds that the only time you should try to do an off-line resize2fs shrink is if you are shrinking the file system by a handful of blocks as part of converting a file system in place to use LVM or LUKS encryption, and you need to make room for some metadata blocks at the end of the partition.


The original discussion got started when Jon Bernard asked about OpenStack's interaction with ext4 resize2fs:

In order to support very large partitions, the filesystem is created with an abnormally large inode table so that large resizes would be possible. I traced it to this commit as best I can tell. I assumed that additional inodes would be allocated along with block groups during an online resize, but that commit contradicts my current understanding.

Ted Ts'o replied (cleaned up by djwong):

Additional inodes *are* allocated as the file system is grown. Whoever thought otherwise was wrong. What happens is that there is a fixed number of inodes per block group. When the file system is resized, either by growing or shrinking the file system, as block groups are added or removed from the file system, the inodes are added or removed along with the block groups.
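The relationship described in the reply can be sketched numerically. The Python below assumes 4KiB blocks and one inode per 16KiB of space; the function name is hypothetical, and a real filesystem's inodes-per-group value is whatever mkfs fixed at creation time.

```python
import math

def inode_count(fs_bytes, block_size=4096, bytes_per_inode=16384):
    """Total inodes when each block group carries a fixed number of inodes."""
    group_bytes = block_size * block_size * 8       # 128MiB with 4KiB blocks
    inodes_per_group = group_bytes // bytes_per_inode
    groups = math.ceil(fs_bytes / group_bytes)
    return groups * inodes_per_group
```

Under these assumptions, doubling the filesystem size doubles the number of block groups and therefore the number of inodes, with no need for an oversized inode table up front.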
