Ext4 VM Images

Revision as of 00:08, 16 February 2014 by Djwong

(djwong is still cleaning this up; readers beware)

Overview

Various ext4 users have a particular use case for system provisioning: they create a filesystem image, populate it with whatever files they want, and then want to reduce the image to a minimal size to conserve network bandwidth. At deploy time the minimal image is copied to a disk and expanded with resize2fs to fill the whole disk. There are a number of strategies to handle this operation, each with strengths and weaknesses.

Creating the Smallest Image You Need

The easiest way to create a petite ext4 image is to create a filesystem that is slightly larger than the amount of data to be copied and then to copy the files in. cp writes out the files one by one, with the result that fragmentation should be fairly low. While it is still true that ext4 becomes less efficient when it approaches ~90% full, the kernel driver's coping strategies are still better than those of resize2fs (see below).

The most difficult aspect of this strategy, of course, is estimating the size of the filesystem to create. While the space requirements of file data, extents, and directories can be estimated fairly neatly by summing each inode's i_blocks/i_blocks_high field (provided the source files are themselves on an ext2/3/4 filesystem), other metadata in the filesystem also requires space, namely bitmaps, superblocks, group descriptors, and inode tables. Assuming a default mkfs invocation (has_journal,flex_bg,extents,^64bit), there are four things to sum: block group metadata, block group descriptors, inode tables, and the journal. With 4KiB blocks, each block group will be 128MiB in size (blocksize^2 * 8); each block group may need three blocks (two bitmaps and a backup superblock). The block group descriptors are usually 32 bytes apiece. Inode tables usually require 256 bytes for every potential inode, and a potential inode is usually reserved for every 16KiB of space. The journal is usually 128MiB. All of these parameters can be tweaked via mke2fs.conf or by passing command line options to mkfs.
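The arithmetic above can be sketched in a few lines of Python. This is only a rough estimate under the defaults quoted in the paragraph (4KiB blocks, 32-byte descriptors, 256-byte inodes, one inode per 16KiB, 128MiB journal); the function name and its parameters are made up for illustration, and it deliberately ignores details such as backup group descriptors and reserved GDT blocks, so treat the result as a lower bound rather than what mkfs will actually consume.

```python
def estimate_ext4_overhead(fs_bytes, block_size=4096, inode_size=256,
                           bytes_per_inode=16384, desc_size=32,
                           journal_bytes=128 * 1024 * 1024):
    """Rough metadata overhead (in bytes) for a filesystem of fs_bytes.

    Hypothetical helper: sums only the four items named in the text.
    """
    # One bitmap block tracks block_size * 8 blocks, which fixes the group size.
    blocks_per_group = block_size * 8
    group_bytes = blocks_per_group * block_size      # blocksize^2 * 8
    groups = -(-fs_bytes // group_bytes)             # ceiling division

    # Up to three blocks per group: two bitmaps and a backup superblock.
    group_metadata = groups * 3 * block_size
    # One descriptor per block group.
    descriptors = groups * desc_size
    # One potential inode per bytes_per_inode of space.
    inode_tables = (fs_bytes // bytes_per_inode) * inode_size

    total = group_metadata + descriptors + inode_tables + journal_bytes
    return {
        "groups": groups,
        "group_metadata": group_metadata,
        "descriptors": descriptors,
        "inode_tables": inode_tables,
        "journal": journal_bytes,
        "total": total,
    }


if __name__ == "__main__":
    for name, value in estimate_ext4_overhead(1024 ** 3).items():
        print(f"{name:>15}: {value}")
```

For a 1GiB image the journal dominates, followed by the inode tables. Passing journal_bytes=0 models the journal-less case.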

The author (djwong) guesses that adding 5% overhead will be more than enough for ext4. Space can be conserved by creating a journal-less filesystem for deploy and then adding one later, if desired.

Please read the section on Copying Sparse Images Around for details about how to transport filesystem images.

Sparse Images

In general, there are large parts of a filesystem that are unused and need not be copied around when transmitting filesystem images.

Underlying Storage

If a filesystem's underlying storage supports unmapping blocks, it is possible to use this capability to detach blocks from unused areas of the FS. This can be obtained by hosting the FS atop: a raw image on a reasonably sophisticated filesystem; qcow2; device-mapper's thin provisioning (dm-thinp) scheme; an expensive SCSI array; or a (competently engineered) SSD.

However, one might wonder, how does one tell the underlying storage to forget about unused disk blocks? The simplest method is to mount the filesystem and run fstrim on it. djwong has a mutant zerofree that he uses to do the same thing offline, though he hasn't yet submitted this for e2fsprogs.

With a recent enough QEMU (1.5?) one can present virtual disks to the guest via virtio-scsi, sym53c8xx, or ahci. If the backing for the virtual disk (file on a FS, raw block device, etc.) supports discard/trim/unmap, qemu will detect this and enable the guest to pass those requests through to the storage. QEMU 1.4 has it off by default, but I think 1.5 has it on by default. Thanks to Calvin Walton for pointing this out.

zerofree/fstrim

fstrim is a tool that instructs online filesystems to issue a discard/trim/unmap command for all unused blocks in the filesystem. If the underlying storage supports it, this will result in the unused blocks being detached from the image or disk. As of kernel 3.13, unfortunately, it will not fall back to writing zeroes if the storage does not support discard/trim/unmap. This may change in the near future.

zerofree[1] is an offline tool for ext4 that reads the block bitmaps and writes zeroes to unused blocks. Note that it does not (yet) support issuing discard/trim/unmap commands to the underlying storage the way that fstrim can.

cat /dev/zero > /mnt/bigfile

NOTE: This method is not recommended by the ext4 developers!

While it is possible to zero out unused disk blocks by writing a lot of zeroes to a file and then deleting it, this is not a recommended way to do this! First of all, building such a huge file requires the construction of an extent tree, which means that the filesystem ends up allocating disk blocks in order to zero unused disk blocks. This is not efficient, and if you've not mounted with -o discard, the extent tree blocks are never zeroed. This problem gets much worse on any filesystem that doesn't use extents (ext2/3). The second problem is that ext4 remembers which block groups have never had blocks allocated, which enables e2fsck to skip checking whether the block group's free block count corresponds to the block bitmap. Filling the filesystem allocates blocks in every block group, so e2fsck loses this shortcut and runs more slowly afterwards.

Copying Sparse Images Around

To copy a sparse FS image from one storage device to another, one can use e2image -rap src_dev dest_dev to copy only the blocks that are in use from src_dev to dest_dev. Ted Ts'o also points out that the -c option compares blocks between src_dev and dest_dev, and copies the block if there's a discrepancy; apparently he was using this to keep two root FSes in sync with each other on his laptop. This only works with newer versions of e2fsprogs, alas (1.42.9?).

A neat way to copy a sparse FS image hosted atop a file is to use rsync -S. The -S option tells rsync to look for runs of zeroes in the source file, and not to copy them to the destination file, regardless of whether or not the source file is sparse. If you know the source file is maximally sparse, you can also use cp --sparse=always to copy the file. The --sparse option tells cp to read the extent map of the file to determine which data needn't be copied, though it only seems to query the extent map functionality if there's a large enough discrepancy between the source file's size and blocks allocated. For those wishing to send a compressed image, tar -S can create a sparse archive.
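To make the idea of "runs of zeroes become holes" concrete, here is a minimal Python sketch in the spirit of cp --sparse=always. It is not how cp, rsync, or tar are implemented; the function name sparse_copy and the 4096-byte block size are assumptions chosen for illustration. Whether the skipped regions actually end up unallocated depends on the destination filesystem supporting sparse files.

```python
import os


def sparse_copy(src_path, dst_path, block_size=4096):
    """Copy src to dst, seeking over all-zero blocks instead of writing them.

    Hypothetical helper for illustration only.
    """
    zero_block = bytes(block_size)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(block_size)
            if not chunk:
                break
            if chunk == zero_block[:len(chunk)]:
                # Leave a hole: advance the file offset without writing.
                dst.seek(len(chunk), os.SEEK_CUR)
            else:
                dst.write(chunk)
        # If the file ends in a hole, extend it to the full length.
        dst.truncate(dst.tell())


def allocated_bytes(path):
    """Bytes actually allocated on disk (st_blocks is in 512-byte units)."""
    return os.stat(path).st_blocks * 512
```

After a copy, comparing os.path.getsize(dst_path) with allocated_bytes(dst_path) gives a quick indication of how much of the image is holes.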

Shrinking an FS to Minimal Size

Note: This method is NOT recommended by the ext4 developers!

One common approach to solving the 'minified FS' problem is to run mkfs on a large-ish partition, copy the desired files into the filesystem, and then run resize2fs -M to shrink the filesystem to the smallest possible size. Surprisingly, this does not necessarily yield the smallest possible image, due to a negative interaction between the ext4 block allocator and the way resize2fs implements block migration. Put simply, ext4 tries to minimize fragmentation by creating top level directories in different block groups and trying to store the same number of directories in each block group. For normal use this is mostly ok because seeking between mostly contiguous files in different block groups is generally less costly (particularly given the use of disk readahead) than seeking all over heavily fragmented files crammed into as few block groups as possible.

However, when it comes time to minimize the filesystem, resize2fs will find itself with a lot of blocks to move. The block migration algorithm was not designed for efficiency; given a block to move, it simply moves it to the lowest available block and hopes for the best. Unfortunately, the general result of this is heavy file fragmentation. Worse yet, the increased fragmentation requires a more complex extent tree, which in turn will eat more disk blocks. The end result is an inefficient and slow filesystem.
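A toy model shows why "move each block to the lowest available block" fragments files. The Python below is not resize2fs code; the list-of-owners layout, the function names, and the example disk are all invented for illustration, and it counts only physically contiguous runs per file.

```python
def shrink_lowest_free(layout, new_size):
    """Toy migration: move every used block at or beyond new_size to the
    lowest free slot below new_size. layout is a list of owner names,
    with None marking a free block. Raises StopIteration if nothing fits.
    """
    layout = list(layout)
    for src in range(new_size, len(layout)):
        owner = layout[src]
        if owner is None:
            continue
        # Lowest free block in the part of the disk that survives.
        dst = next(i for i in range(new_size) if layout[i] is None)
        layout[dst] = owner
        layout[src] = None
    return layout[:new_size]


def count_extents(layout, owner):
    """Number of physically contiguous runs belonging to owner."""
    runs = 0
    previous_matches = False
    for cell in layout:
        matches = cell == owner
        if matches and not previous_matches:
            runs += 1
        previous_matches = matches
    return runs


if __name__ == "__main__":
    # Files A and B sit near the start; file C is contiguous but far out.
    disk = ["A", "A", None, "B", "B", None, None, None,
            None, None, "C", "C", "C", "C", None, None]
    small = shrink_lowest_free(disk, 8)
    print(small)
    print(count_extents(disk, "C"), "->", count_extents(small, "C"))
```

In this example file C starts as a single contiguous run and ends up split across the gaps left between A and B, which is the effect described above on a much smaller scale.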

Ted Ts'o adds that the only time you should try to do an off-line resize2fs shrink is if you are shrinking the file system by a handful of blocks as part of converting a file system in place to use LVM or LUKS encryption, and you need to make room for some metadata blocks at the end of the partition.

Other Discussion

Jon Bernard wrote,

In order to support very large partitions, the filesystem is created
with an abnormally large inode table so that large resizes would be
possible.  I traced it to this commit as best I can tell:

    https://github.com/openstack/diskimage-builder/commit/fb246a02eb2ed330d3cc37f5795b3ed026aabe07

I assumed that additional inodes would be allocated along with block
groups during an online resize, but that commit contradicts my current
understanding. 

Ted Ts'o replied (and djwong has cleaned up somewhat),

Additional inodes *are* allocated as the file system is grown. Whoever thought otherwise was wrong. What happens is that there is a fixed number of inodes per block group. When the file system is resized, either by growing or shrinking the file system, as block groups are added or removed from the file system, the inodes are added or removed along with the block groups.

What causes the least optimal data block layout is copying files into a large file system and then shrinking the file system to its minimum size with resize2fs -M. resize2fs' block migration algorithm is pretty stupid -- all blocks that require moving are moved, one by one, to the lowest available block, without any regard for file fragmentation.

From a fragmentation standpoint it is better to create a file system that is slightly larger than the data you're trying to copy into it. There is some non-optimality that occurs as the file system gets filled beyond about 90% full, but it's not nearly as bad as shrinking the file system -- which you should avoid at all costs.

From a performance point of view, the only time you should try to do an off-line resize2fs shrink is if you are shrinking the file system by a handful of blocks as part of converting a file system in place to use LVM or LUKS encryption, and you need to make room for some metadata blocks at the end of the partition.

The other thing to note is that if you are using a format such as qcow2, or something like the device-mapper's thin-provisioning (thinp) scheme, or if you are willing to deal with sparse files, one approach is to not resize the file system at all. You could just use a tool like zerofree[1] to zero out all of the unused blocks in the file system, and then use "/bin/cp --sparse=always" to cause all zero blocks to be treated as sparse blocks on the destination file.

[1] http://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kvm-xfstests/util/zerofree.c

This is part of how Ted maintains the root filesystem that he uses in a VM for testing ext4 changes upstream. After updating to the latest Debian unstable packages and installing the latest updates from the xfstests and e2fsprogs git repositories, he runs the following script, which uses the zerofree.c program to compress the qcow2 root file system image that he uses with kvm:

http://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kvm-xfstests/compress-rootfs

Also, starting with e2fsprogs 1.42.10, there's another way you can efficiently deploy a large file system image by only copying the blocks which are in use, by using a command like this:

      e2image -rap src_fs dest_fs

(See also the -c flag as described in e2image's man page if you want to use this technique to do incremental image-based backups onto a flash-based backup medium; Ted was using this for a while to keep two laptop SSD's root filesystems in sync with one another.)

So there are lots of ways that you can do what you need, all without playing games with resize2fs. Perhaps some of them would actually be better for your use case.

Calvin Walton notes that with a sufficiently recent QEMU (1.5+), if one configures an FS image as a virtual SCSI disk, it is possible to use fstrim inside the VM to make the backing file sparse.

NOTE: It is not a good idea to "zero" the filesystem image by "cat /dev/zero > /mnt/zerofile; rm -rf /mnt/zerofile"! While this does have the effect of filling most of the filesystem's free blocks with zeroes, it will be necessary to populate a block map or an extent tree; these blocks will not be zeroed. It is much more efficient to zero unused blocks offline or discard/trim unused blocks online, since there's no need to waste time invoking the block allocator on a huge temporary file.
