Ext4 VM Images

From Ext4
Revision as of 23:25, 15 February 2014 by Djwong (Talk | contribs)

Jump to: navigation, search

(djwong is still cleaning this up; readers beware)

Contents

Overview

Warious ext4 users have a particular use-case for system provisioning where they create a FS image, populate it with whatever files they want, and then want to reduce the image to a minimal one to conserve network bandwidth. At deploy time the minimal image is copied to a disk and expanded with resize2fs to fill the whole disk. There are a number of strategies to handle this operation, each with strengths and weaknesses.

Creating a The Smallest Image You Need

The easiest way to create a petite ext4 image is to create a filesystem that is slightly larger than the amount of data to be copied and then to copy the files in. cp writes out the files on by one, with the result that fragmentation should be fairly low. While it is still true that ext4 becomes less efficient when it's approaching ~90% full, the kernel driver's coping strategies are still better than that of resize2fs (see below).

The most difficult aspect of this strategy, of course, is estimating the size of the filesystem to create. While the space requirements of file data, extents, and directories can be estimated fairly neatly by summing each inode's i_blocks/i_blocks_high field (provided the source files are themselves on an ext2/3/4), there are other metadata in the filesystem that also require space, i.e. bitmaps, superblocks, group descriptors, and inode tables. Assuming a default mkfs invocation (has_journal,flexbg,extents,^64bit), there are four things to sum: block group metadata, block group descriptors, inode tables, and the journal. Each block group will be 128MiB in size (blocksize^2 * 8); each block group may need three blocks (two bitmaps and a backup superblock). The block group descriptors are usually 32 bytes apiece. Inode tables usually require 256 bytes for every potential inode, and a potential inode is usually reserved for every 16KiB of space. The journal is usually 128MiB. All of these options can be tweaked via mke2fs.conf or by passing command line options to mkfs.

The author (djwong) guesses that adding 5% overhead will be more than enough for ext4. Space can be conserved by creating a journal-less filesystem for deploy and then adding one later, if desired.

Please read the section on Copying Sparse Images Around for details about how to transport filesystem.

Sparse Images

cat /dev/zero > /mnt/bigfile

zerofree/fstrim

Underlying Storage

Copying Sparse Images Around

Shrinking an FS to Minimal Size

Note: This method is NOT recommended by the ext4 developers!

One common approach to solving the 'minified FS' problem is to run mkfs on a large-ish partition, copy the desired files into the filesystem, and then run resize2fs -M to shrink the file to the smallest possible size. Surprisingly, this does not necessarily yield the smallest possible image, due to a negative interaction between the ext4 block allocator and the way resize2fs implements block migration. Put simply, ext4 tries to minimize fragmentation by creating top level directories in different block groups and trying to store the same number of directories in each block groups. For normal use this is mostly ok because seeking between mostly contiguous files in different block groups is generally less costly (particularly given the use of disk readahead) than seeking all over heavily fragmented files crammed into as few block groups as possible.

However, when it comes time to minimize the filesystem, resize2fs will find itself with a lot of blocks to move. The block migration algorithm was not designed for efficiency; given a block to move, it simply moves it to the lowest available block and hopes for the best. Unfortunately, the general result of this is heavy file fragmentation. Worse yet, the increased fragmentation requires a more complex extent tree, which in turn will eat more disk blocks. The end result is an inefficient and slow filesystem.

Other Discussion

Jon Bernard wrote,

In order to support very large partitions, the filesystem is created
with an abnormally large inode table so that large resizes would be
possible.  I traced it to this commit as best I can tell:

    https://github.com/openstack/diskimage-builder/commit/fb246a02eb2ed330d3cc37f5795b3ed026aabe07

I assumed that additional inodes would be allocated along with block
groups during an online resize, but that commit contradicts my current
understanding. 

Ted T'so replied (and djwong has cleaned up somewhat),

Additional inodes *are* allocated as the file system is grown. Whoever thought otherwise was wrong. What happens is that there is a fixed number of inodes per block group. When the file system is resized, either by growing or shrinking file system, as block groups are added or removed from the file system, the inodes are added or removed along with the block groups.

What causes the least optimal data block layout is copying files into a large file system and then shrinking the file system to its minimum size with resize2fs -M. resize2fs' block migration algorithm is pretty stupid -- all blocks that require moving are moved, one by one to the lowest available block, without any regards to file fragmentation.

From a fragmentation standpoint it is better to create a file system that is slightly larger than the data you're trying to copy into it. There is so some non-optimality that occurs as the file system gets filled beyond about 90% full, but it's not nearly as bad as shrinking the file system -- which you should avoid at all costs.

From a performance point of view, the only time you should try to do an off-line resize2fs shrink is if you are shrinking the file system by a handful of blocks as part of converting a file system in place to use LVM or LUKS encryption, and you need to make room for some metadata blocks at the end of the partition.

The other thing thing to note is that if you are using a format such as qcow2, or something like the device-mapper's thin-provisining (thinp) scheme, or if you are willing to deal with sparse files, one approach is to not resize the file system at all. You could just use a tool like zerofree[1] to zero out all of the unused blocks in the file system, and then use "/bin/cp --sparse=always" to cause all zero blocks to be treated as sparse blocks on the destination file.

[1] http://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kvm-xfstests/util/zerofree.c

This is part of how Ted maintains his root filesystem that he uses in a VM for testing ext4 changes upstream. After updating to the latest Debian unstable package updates and installing the latest updates from the xfstests and e2fsprogs git repositories, he runs the following script which uses the zerofree.c program to compress the qcow2 root file system image that he use with kvm:

http://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kvm-xfstests/compress-rootfs

Also, starting with e2fsprogs 1.42.10, there's another way you can efficiently deploy a large file system image by only copying the blocks which are in use, by using a command like this:

      e2image -rap src_fs dest_fs

(See also the -c flag as described in e2image's man page if you want to use this technique to do incremental image-based backups onto a flash-based backup medium; Ted was using this for a while to keep two laptop SSD's root filesystems in sync with one another.)

So there are lots of ways that you can do what you need, all without playing games with resize2fs. Perhaps some of them would actually be better for your use case.

Calvin Walton notes that with a sufficiently recent QEMU profile (qemu 1.5+), if one configures an FS image as a virtual SCSI disk, it is possible to use fstrim inside the VM to make the backing file sparse.

NOTE: It is not a good idea to "zero" the filesystem image by "cat /dev/zero > /mnt/zerofile; rm -rf /mnt/zerofile"! While this does have the effect of filling most of the filesystem's free blocks with zeroes, it will be necessary to populate a block map or an extent tree; these blocks will not be zeroed. It is much more efficient to zero unused blocks offline or discard/trim unused blocks online, since there's no need to waste time invoking the block allocator on a huge temporary file.

Personal tools