Ext4 Metadata Checksums
Add crc32c to ext4 superblock, inode, block and inode bitmap, extent tree, directory block, htree block, and extended attribute objects with as few disk layout adjustments as possible.
As much as we wish our storage hardware were 100% reliable, it is still quite possible for data to be corrupted on disk, corrupted during transfer over a wire, or written to the wrong place. To protect against this sort of non-hostile corruption, it is desirable to store checksums of metadata objects on the filesystem so that broken metadata does not shred the filesystem. In theory, btrfs has stronger guarantees against corruption (uniform checksums on _all_ metadata blocks, redundant copies of all metadata, etc.), but this retrofit to ext4 will provide stronger protection for users who want to stay with ext4 or refuse to migrate off of it, at the fairly low cost of a single tune2fs/e2fsck run.
This document is intended to record Darrick's metadata checksum design as he works on writing the necessary patches.
The popular sentiment is that a CRC will suffice to detect bit flips and various other kinds of corruption. The existing block group checksum uses the ANSI CRC16 polynomial (0x8005), which probably suffices for 32-byte block group descriptors. However, this crc16 is not the most desirable function for the other metadata objects; longer CRCs are generally better at detecting errors when the data being checksummed gets large. That is the case here, since the bitmaps and the directory blocks are generally 4KiB in size.
The CRC32c polynomial (0x1EDC6F41) seems to have stronger error detection abilities than regular CRC32 (0x04C11DB7). It is implemented in hardware on Intel Core i7 CPUs and can be made to run reasonably quickly on other processors. Therefore, it seems desirable to use it. Further study is required to determine which CRCs (and which implementations) are fastest.
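As a reference point for the polynomial named above, here is a minimal bitwise sketch of CRC32c in Python. It assumes the common parameterization (reflected bit order, initial value and final XOR of 0xFFFFFFFF); the kernel's crc32c helpers may use different seeding conventions, so treat this as an illustration rather than the code the patches would use.

```python
# Bit-reversed (reflected) form of the CRC32c polynomial 0x1EDC6F41.
CRC32C_POLY_REFLECTED = 0x82F63B78


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC32c with the usual 0xFFFFFFFF preset and final inversion.

    Passing a previous result as `crc` continues the checksum over more data.
    """
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF
```

With these parameters, `crc32c(b"123456789")` yields the standard check value 0xE3069283. One bit per loop iteration is far too slow for production use; it only pins down what the faster implementations must compute.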
For the space-constrained block groups (at least in standard 32-bit mode), it has been suggested that because CRC16 is implemented in software, we should find a way to use the fast crc32c function yet somehow shrink the checksum to fit in 16 bits.
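The text does not say how the 32-bit value would be shrunk. Two obvious candidates are sketched below; both function names are hypothetical and neither is claimed to be the chosen scheme.

```python
def stuff_truncate(crc32c_value: int) -> int:
    """Keep only the low 16 bits of a 32-bit checksum (hypothetical helper)."""
    return crc32c_value & 0xFFFF


def stuff_fold(crc32c_value: int) -> int:
    """XOR the high half into the low half (hypothetical helper)."""
    return ((crc32c_value >> 16) ^ crc32c_value) & 0xFFFF
```

For example, the 32-bit value 0xE3069283 becomes 0x9283 when truncated and 0x7185 when folded. Either way the result can only distinguish 2^16 values, so the gain over crc16 would be speed, not detection strength.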
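For the bitmap checksums it seems possible to take advantage of the property crc32(a ^ b) = crc32(a) ^ crc32(b). Strictly, this holds for equal-length inputs and a CRC with no initial or final XOR; with the usual preset and inversion, the right-hand side picks up one extra term, the CRC of an all-zero buffer of the same length.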
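The XOR property can be checked with a short Python sketch. It uses `zlib.crc32` (plain CRC32, not crc32c) only because it ships with the standard library; the same relation holds for any CRC. Because `zlib.crc32` applies the usual preset and final inversion, the relation needs the CRC of an all-zero buffer of the same length as a correction term. The helper names are hypothetical.

```python
import zlib


def flip_bit_delta(block_size: int, bit: int) -> bytes:
    """Return a block that is all zeros except for one set bit."""
    delta = bytearray(block_size)
    delta[bit // 8] |= 1 << (bit % 8)
    return bytes(delta)


def crc_after_xor(old_crc: int, delta: bytes) -> int:
    """CRC of (old_block ^ delta), computed without touching old_block."""
    zero_crc = zlib.crc32(bytes(len(delta)))  # correction for preset/inversion
    return old_crc ^ zlib.crc32(delta) ^ zero_crc


bitmap = bytes(range(256)) * 16              # stand-in for a 4KiB bitmap block
delta = flip_bit_delta(len(bitmap), 1234)    # toggle one allocation bit
updated = bytes(x ^ y for x, y in zip(bitmap, delta))
assert crc_after_xor(zlib.crc32(bitmap), delta) == zlib.crc32(updated)
```

In this naive form the CRC of the delta still costs a full pass over a block-sized buffer, so the sketch shows only that the update is possible, not that it is cheaper.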
I culled crc code from the Linux kernel and e2fsprogs, and linked it all into a big dumb program that crcs a large block of data. Following are bandwidth results (K/s) from various machines and a block size of 512MB:
|3.6GHz Pentium 4||3.6GHz||391,433||345,666||915,502||925,717||512,564||511,035||437,946||n/a|
|Athlon64 X2 4200+||2.2GHz||261,927||298,252||767,435||767,469||392,507||392,520||337,204||n/a|
|3GHz Pentium 4||3GHz||360,264||307,781||793,679||790,873||421,749||421,491||393,766||n/a|
|1GHz Pentium 3||1GHz||67,448||68,429||157,668||157,609||116,705||116,294||107,558||n/a|
|P4 Xeon MP||2.7GHz||174,024||150,326||267,248||267,390||175,788||176,342||185,110||n/a|
At a 4K block size the time slices are so tiny that it's difficult to identify any clear trends.
The crc32-kern and crc32-e2fsprogs implementations seem to use the slice-by-4 optimization. I think Bob Pearson is proposing to replace it with a slice-by-8, though that hasn't gone upstream yet. The crc32c and crc16 implementations use the classic Sarwate algorithm; converting them to use a more efficient algorithm will probably speed them up significantly. Hardware crc32c eats the others for lunch. The crc32-e2fs and crc32-kern implementations seem fairly similar, so it is a surprise that the kernel implementation runs at twice the speed.
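For readers unfamiliar with the terms, the classic Sarwate algorithm processes one byte per step using a single 256-entry lookup table. A minimal Python sketch for crc32c follows, assuming the common reflected parameterization with 0xFFFFFFFF preset and final inversion; it is not the kernel code.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # bit-reversed form of 0x1EDC6F41


def make_sarwate_table() -> list:
    """Precompute the CRC of every possible byte value."""
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ (CRC32C_POLY_REFLECTED if c & 1 else 0)
        table.append(c)
    return table


SARWATE_TABLE = make_sarwate_table()


def crc32c_sarwate(data: bytes, crc: int = 0) -> int:
    """Table-driven CRC32c: one table lookup per input byte."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = SARWATE_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF
```

Slice-by-4 extends the same idea to four tables so that four input bytes are consumed per step, which shortens the dependency chain between lookups.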
It appears to be trivially easy to adapt the kernel's crc32 slice-by-4 code for use with crc32c; this results in a similar performance profile. I would like to push such a beast upstream.
As a side note, it is also desirable to optimize the crc16-t10dif algorithm, not for ext4 but for DIF disks.
Existing Metadata Checksumming
The block group descriptor is protected by a CRC16. On a 64-bit filesystem, it may be possible either to extend the field to 32 bits, or to stuff a 32-bit crc into 16 bits per the "Stuffing" section above.
jbd2 has a (probably infrequently used) journal_checksum feature that ensures the integrity of the journal contents. It nominally allows CRC32, MD5, or SHA1 checksums, though as of Linux 3.0 only CRC32 seems to be supported. This can be easily switched over to CRC32c.
On-Disk Structure Modifications
Darrick will try to implement this without requiring an on-disk format change. Basically, that means that we have to find places where checksums can be crammed into existing data structures.
Andi Kleen posted a patch to checksum the superblock. Darrick plans to massage this patch a little bit, and store the crc32c into the superblock somewhere around offset 0x240.
Inode checksums are only supported on Linux. The checksum is a crc32c field at offset 0x7C, which puts it in the middle of osd2.linux2. The checksum covers just the inode, not any extended attributes that may follow the inode in the inode table.
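The inode scheme might look roughly like the Python sketch below. Several details are assumptions not stated above: a 128-byte base inode, a little-endian checksum field, and the convention that the field is treated as zero while the checksum is computed. All function names are hypothetical.

```python
import struct

INODE_SIZE = 128      # assumption: base inode size
CSUM_OFFSET = 0x7C    # from the text: inside osd2.linux2


def crc32c(data: bytes) -> int:
    """Bitwise CRC32c, common parameterization (illustrative, slow)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def _inode_csum(raw: bytes) -> int:
    body = bytearray(raw[:INODE_SIZE])
    body[CSUM_OFFSET:CSUM_OFFSET + 4] = b"\0" * 4  # assumption: zero the field
    return crc32c(bytes(body))


def inode_set_csum(raw: bytes) -> bytes:
    """Return a copy of the inode table slot with the checksum filled in."""
    out = bytearray(raw)
    struct.pack_into("<I", out, CSUM_OFFSET, _inode_csum(raw))
    return bytes(out)


def inode_csum_ok(raw: bytes) -> bool:
    """Check the stored checksum; bytes past INODE_SIZE are not covered."""
    (stored,) = struct.unpack_from("<I", raw, CSUM_OFFSET)
    return stored == _inode_csum(raw)
```

Because only the first `INODE_SIZE` bytes are covered, changing the extended attribute area that follows the inode leaves the check passing, which matches the coverage described above.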
Inode/Block Bitmap (64-bit)
Each bitmap has its own crc32c checksum; both checksums are stored in the block group descriptor. The inode bitmap checksum is at offset 0x18, and the block bitmap checksum is at offset 0x38. This only works if the 64bit feature is set, unfortunately.
Inode/Block Bitmap (32-bit)
For 32-bit filesystems, Darrick is considering using the 16-bit fields in the block group descriptor at offsets 0x18 and 0x20 to store either crc16 or stuffed crc32c values of the inode and block bitmaps. It's probably better to have a slow crc16 than no crc at all.
Filesystem blocks are always 1024, 2048, or 4096 bytes, and the extent tree header and entry structures are both 12 bytes long. Therefore, because 2^n % 12 >= 4, there is sufficient space to store a crc32c just past the end of the last struct ext4_extent. The checksum is computed over only the part of the extent block that is in use.
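The space argument for extent blocks can be verified with a few lines of Python. This sketch assumes the checksum sits immediately after the last possible entry slot in the block; the function name is hypothetical.

```python
HEADER_SIZE = 12   # extent tree header, from the text
ENTRY_SIZE = 12    # extent tree entry, from the text
CSUM_SIZE = 4      # crc32c


def extent_block_layout(block_size: int):
    """Return (max_entries, leftover_bytes, checksum_offset) for one block."""
    max_entries = (block_size - HEADER_SIZE) // ENTRY_SIZE
    csum_offset = HEADER_SIZE + max_entries * ENTRY_SIZE
    leftover = block_size - csum_offset
    return max_entries, leftover, csum_offset


for bs in (1024, 2048, 4096):
    entries, leftover, offset = extent_block_layout(bs)
    assert leftover >= CSUM_SIZE
    print(bs, entries, leftover, offset)
```

For 1024, 2048, and 4096-byte blocks this gives 84, 169, and 340 entries with 4, 8, and 4 bytes left over, so the checksum fits without giving up an entry slot.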
Regular directory leaf blocks (i.e. blocks that are not secretly htree nodes) are a semi-packed array of variable-length records. A 12-byte directory entry is created at the end of the block with an inode of 0 to make the entry look unused to old ext4 drivers; a name_len of 0; and a rec_len large enough to hold a crc32c. In a cursory analysis of 250,000 directories, just 29 had blocks that did not have sufficient space to hold the 12-byte tail. tune2fs will advise users to run e2fsck -D to rebuild all directories so that all directory blocks may have a checksum.
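The 12-byte tail can be pictured as an ordinary directory entry header (inode, rec_len, name_len, file type) followed by the checksum where a name would normally go. The Python sketch below packs and parses such a tail; the little-endian layout follows the usual ext4 dirent header, while the value of the file type byte and the function names are assumptions made for illustration.

```python
import struct

TAIL_FORMAT = "<IHBBI"   # inode, rec_len, name_len, file_type, checksum
TAIL_SIZE = struct.calcsize(TAIL_FORMAT)  # 12 bytes


def make_dir_tail(csum: int, file_type: int = 0) -> bytes:
    """Build the fake dirent that hides a crc32c at the end of a leaf block."""
    return struct.pack(TAIL_FORMAT, 0, TAIL_SIZE, 0, file_type, csum)


def read_dir_tail(block: bytes):
    """Return the stored checksum, or None if the block has no such tail."""
    inode, rec_len, name_len, _ftype, csum = struct.unpack_from(
        TAIL_FORMAT, block, len(block) - TAIL_SIZE)
    if inode != 0 or rec_len != TAIL_SIZE or name_len != 0:
        return None
    return csum
```

A driver that knows nothing about checksums sees an unused entry (inode 0) spanning the last 12 bytes and skips over it, which is what keeps the layout backward readable.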
The htree root and internal nodes do not hide a checksum in a fake dirent at the end of the block because that would require the removal of two struct dx_entry from each htree block. Instead, the limit count is decreased by 1 and the crc32c stored at the end of the block. Again, tune2fs will advise users to run e2fsck -D to rebuild all directories and perform any necessary htree rebalancing.
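The trade-off between the two htree options comes down to simple arithmetic, sketched below in Python. It assumes an 8-byte struct dx_entry and an 8-byte fake dirent header at the start of an internal htree node; neither size is stated in the text above, and the function names are hypothetical.

```python
DX_ENTRY_SIZE = 8      # assumption: size of struct dx_entry
NODE_HEADER_SIZE = 8   # assumption: fake dirent at the start of an internal node
FAKE_DIRENT_TAIL = 12  # tail size used for regular leaf blocks
CSUM_SIZE = 4          # crc32c


def dx_entries_displaced(tail_bytes: int) -> int:
    """Number of whole dx_entry slots a tail of this size would consume."""
    return -(-tail_bytes // DX_ENTRY_SIZE)  # ceiling division


def dx_node_limit(block_size: int, with_csum: bool) -> int:
    """Entry limit of an internal node, reduced by one when checksummed."""
    limit = (block_size - NODE_HEADER_SIZE) // DX_ENTRY_SIZE
    return limit - 1 if with_csum else limit
```

Under these assumptions a 12-byte fake dirent costs two slots while a bare 4-byte crc32c costs one, and a 4096-byte internal node goes from 511 entries to 510. That lost capacity is also why a very full htree can overflow when checksums are added.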
Unfortunately, in adding htree checksums to a very very large directory, it is possible to overflow the htree.
Extended Attributes (EAs)
For EAs stored in a separate disk block (i.e. not stored after the inode), there is sufficient space to store a crc32c directly in the header.
For EAs stored in the extra space after the inode, there is a 4-byte "h_magic" field. This field is set to 0xEA020000 by the Linux 3.0 kernel driver ... but the kernel never reads that value back, and e2fsprogs skips over it. It does NOT appear that the field is being used by anything. Unless someone else is using that field for undocumented purposes, Darrick intends to turn the field into a crc32c of that extra space, since EAs are the only thing that ever comes after inodes.
Metadata Not Being Upgraded
Direct/indirect/triple-indirect block maps are not targeted for checksums, as this would result in a totally incompatible disk format change and reduce the maximum file size considerably. Files should be converted to extents via chattr +e for increased safety and less overhead.
A user should be able to turn on this feature at mke2fs time simply by specifying -O metadata_csum. Because the 64bit feature allows block group descriptors large enough to hold crc32c checksums for the bitmaps, mke2fs should warn the user if the feature set is metadata_csum,^64bit, once the 64bit feature has been tested thoroughly.
It should be possible to convert existing filesystems with a simple tune2fs -O metadata_csum. tune2fs will apply checksums to all metadata structures that can trivially take them, and tell the user to run e2fsck -D if necessary. e2fsck will gain the ability to reorganize directory tree blocks to accommodate the checksum fields. Obviously, 64bit mode cannot (currently) be enabled on existing filesystems.
It should be possible to disable metadata checksumming on an existing filesystem with tune2fs -O ^metadata_csum, with the same conditions outlined for enabling checksums on an existing filesystem.
debugfs should try to display checksums whenever possible.
It should NOT be possible for old fs code to write to a filesystem with metadata checksums enabled. The metadata_csum flag is implemented as a ROCOMPAT flag, which should keep (non-malicious) old programs from messing things up.
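The protection a ROCOMPAT flag offers can be sketched as the check an implementation performs at mount time: any read-only-compatible feature bit it does not recognize forces a read-only mount. The Python below is an illustration only; the bit value assigned to metadata_csum and the function name are assumptions, not taken from the text.

```python
RO_COMPAT_METADATA_CSUM = 0x0400  # illustrative bit value (assumption)


def allowed_mount_mode(fs_ro_compat: int, supported_ro_compat: int) -> str:
    """Decide how a driver may mount a filesystem given its RO_COMPAT bits."""
    unknown = fs_ro_compat & ~supported_ro_compat
    return "read-only" if unknown else "read-write"


# An old driver that predates metadata_csum does not list the bit as supported.
old_driver = 0x0001
new_driver = 0x0001 | RO_COMPAT_METADATA_CSUM
fs_flags = 0x0001 | RO_COMPAT_METADATA_CSUM
assert allowed_mount_mode(fs_flags, old_driver) == "read-only"
assert allowed_mount_mode(fs_flags, new_driver) == "read-write"
```

This only restrains well-behaved code that honors the feature flags, which is why the text limits the claim to non-malicious old programs.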
Stuff Darrick Hasn't Thought Hard Enough About
- Other filesystems' use of checksums??
- Other ext4 features being concurrently developed?
- Things like Lustre that use some of the ext4 fields without noting it in the ext4 documentation.
- Hardware acceleration of CRC32/CRC16.
- Defensive programming when we have to parse the metadata that is being checksummed (extent tree? dir blocks? htree blocks?)