Ext4 Metadata Checksums
Add crc32c to ext4 superblock, inode, block and inode bitmap, extent tree, directory block, htree block, and extended attribute objects with as few disk layout adjustments as possible.
As much as we wish our storage hardware were 100% reliable, it is still quite possible for data to be corrupted on disk, corrupted during transfer over a wire, or written to the wrong places. To protect against this sort of non-hostile corruption, it is desirable to store checksums of metadata objects on the filesystem so that broken metadata is detected before it can shred the filesystem. In theory, btrfs has stronger guarantees against corruption (uniform checksums on _all_ metadata blocks, redundant copies of all metadata, etc.), but this retrofit to ext4 will provide stronger protections for users who want to stay on ext4, at the fairly low cost of a single tune2fs/e2fsck run.
This document is intended to record Darrick's metadata checksum design as he works on writing the necessary patches.
The popular sentiment is that a CRC will suffice to detect bit flips and various other kinds of corruption. The existing block group checksum uses the ANSI CRC16 polynomial (0x8005), which probably suffices for 32-byte block group descriptors. However, this crc16 may not be the most desirable function for the other metadata objects; longer CRCs are generally better at detecting errors when the data being checksummed gets large. That is expected to be the case here, since the bitmaps and the directory blocks are generally 4KiB in size.
The CRC32c polynomial (0x1EDC6F41) seems to have stronger error detection abilities than regular CRC32 (0x04C11DB7). It is implemented in hardware on Intel Core i7 CPUs and can be made to run reasonably quickly on other processors. Therefore, it seems desirable to use it. Further study is required to determine which CRCs (and which implementations) are fastest.
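As a point of reference for the polynomial above, here is a minimal bit-at-a-time sketch of CRC32c in Python. It uses the bit-reflected form of 0x1EDC6F41 and the common convention of an all-ones initial value and final XOR; whether ext4 ends up seeding the checksum this way is an assumption, and this code is far too slow for real use.

```python
CRC32C_POLY = 0x1EDC6F41
# Bit-reversed form of the polynomial, used by LSB-first (reflected) CRCs.
CRC32C_POLY_REFLECTED = int(f"{CRC32C_POLY:032b}"[::-1], 2)  # 0x82F63B78


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise reflected CRC32c; pass a previous result as `crc` to continue."""
    crc ^= 0xFFFFFFFF  # assumed convention: all-ones initial value
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF  # assumed convention: all-ones final XOR


# Standard check value for CRC32c over the ASCII digits 1..9.
check = crc32c(b"123456789")
```

With these conventions the function can be chained over consecutive buffers, which is how a checksum over "the inode plus whatever follows it" could be accumulated.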
For the space-constrained block group descriptors (at least in standard 32-bit mode), it has been suggested that, because CRC16 is implemented in software, we should find a way to use the fast crc32c function yet somehow shrink the checksum to fit in 16 bits.
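The text leaves open how a 32-bit checksum would be shrunk. Two obvious candidates are truncation to the low 16 bits and XOR-folding the two halves together; the sketch below shows both. Neither is stated here as the chosen method, the function names are made up for illustration, and zlib.crc32 (plain CRC32) stands in for crc32c because the Python standard library does not provide the latter.

```python
import zlib


def truncate16(crc32: int) -> int:
    """Keep only the low 16 bits of a 32-bit checksum."""
    return crc32 & 0xFFFF


def fold16(crc32: int) -> int:
    """XOR the high and low 16-bit halves together."""
    return (crc32 >> 16) ^ (crc32 & 0xFFFF)


# Hypothetical 32-byte block group descriptor contents.
descriptor = bytes(range(32))
full = zlib.crc32(descriptor)  # stand-in for crc32c(descriptor)
stuffed_truncated = truncate16(full)
stuffed_folded = fold16(full)
```

Either reduction yields a 16-bit value that still changes whenever the 32-bit checksum's relevant bits change, but both necessarily give up the detection strength of the full 32 bits.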
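For the bitmap checksums it seems possible to take advantage of the linearity of CRCs: for equal-length inputs, crc32(a ^ b) = crc32(a) ^ crc32(b), up to a constant that depends only on the length when the CRC uses a nonzero initial value or final XOR.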
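A small sketch of how that property could be used to update a bitmap checksum without rescanning the whole bitmap. With the usual nonzero initial value and final XOR, the identity picks up a constant equal to the CRC of an all-zero buffer of the same length. zlib.crc32 (plain CRC32) is used as a stand-in for crc32c, and the helper names are hypothetical.

```python
import zlib

BLOCK_SIZE = 4096


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def crc_of_xor(crc_a: int, crc_b: int, length: int) -> int:
    """CRC of (a ^ b) from the CRCs of a and b, for buffers of `length` bytes."""
    return crc_a ^ crc_b ^ zlib.crc32(bytes(length))


def crc_after_bit_flip(old_crc: int, bit: int, length: int) -> int:
    """New CRC after flipping one bit, computed from the old CRC alone."""
    delta = bytearray(length)
    delta[bit // 8] = 1 << (bit % 8)
    return crc_of_xor(old_crc, zlib.crc32(bytes(delta)), length)


# Deterministic sample bitmap and the same bitmap with bit 1234 toggled.
bitmap = bytes((i * 37) & 0xFF for i in range(BLOCK_SIZE))
flipped = bytearray(bitmap)
flipped[1234 // 8] ^= 1 << (1234 % 8)
```

In a real implementation the CRC of each single-bit delta would presumably be precomputed or derived arithmetically rather than recomputed over a full block as done here.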
I culled crc code from the Linux kernel and e2fsprogs, and linked it all into a big dumb program that crcs a large block of data. Following are bandwidth results (K/s) from various machines and a block size of 512MB:
|Processor||Clock||crc16||crc16-t10dif||crc32-kern-be||crc32-kern-le||crc32-e2fs-be||crc32-e2fs-le||crc32c||crc32c-intel||crc32c-by8-be||crc32c-by8-le||crc32c-intelby8|
|3.6GHz Pentium 4||3.6GHz||391,433||345,666||915,502||925,717||512,564||511,035||437,946||n/a||1,097,068||1,146,004||1,099,856|
|Athlon64 X2 4200+||2.2GHz||261,927||298,252||767,435||767,469||392,507||392,520||337,204||n/a||1,193,278||1,102,328||1,136,237|
|3GHz Pentium 4||3GHz||360,264||307,781||793,679||790,873||421,749||421,491||393,766||n/a||935,662||942,952||910,220|
|1GHz Pentium 3||1GHz||67,448||68,429||157,668||157,609||116,705||116,294||107,558||n/a|
|P4 Xeon MP||2.7GHz||174,024||150,326||267,248||267,390||175,788||176,342||185,110||n/a||319,609||320,717||270,821|
Here is a description of the various CRC implementations tested:
|crc16||ANSI CRC16 algorithm in kernel (Sarwate)|
|crc16-t10dif||T10 CRC16 used for DIF in kernel (Sarwate)|
|crc32-kern-be||BE CRC32 in kernel (slice by 4)|
|crc32-kern-le||LE CRC32 in kernel (slice by 4)|
|crc32-e2fs-be||BE CRC32 in e2fsprogs 1.41 (slice by 4)|
|crc32-e2fs-le||LE CRC32 in e2fsprogs 1.41 (slice by 4)|
|crc32c||Default CRC32C in kernel (Sarwate)|
|crc32c-intel||Accelerated CRC32C on Intel Core i7|
|crc32c-by8-be||Bob Pearson's updated BE CRC32 algorithm, but with CRC32C polynomial (slice by 8)|
|crc32c-by8-le||Bob Pearson's updated LE CRC32 algorithm, but with CRC32C polynomial (slice by 8)|
|crc32c-intelby8||Intel's CRC32C algorithm http://prdownloads.sourceforge.net/slicing-by-8/ (slice by 8)|
At a 4K block size the time slices are so tiny that it's difficult to identify any clear trends.
It is well known that Sarwate's algorithm has been superseded (performance-wise) by the slice-by-N implementations; these results support that conclusion. All slice-by-N implementations #define their polynomial, making it trivially easy to port the code to replace the "default" CRC32C implementation. Obviously, the hardware solution eats all the others for lunch, though it only exceeds the slice-by-8 algorithm by a factor of ~2.5x and the slice-by-4 algorithms by a factor of ~4x. Either way, 1.5GB/s of _metadata_ updates is quite a lot, so the performance hit might not be too severe provided we can replace the current software crc32c code with one of these slice-by algorithms.
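To make the difference between the two software approaches concrete, here is a sketch of a byte-at-a-time table lookup (Sarwate) next to a slice-by-4 variant, both for the reflected CRC32c polynomial. This is an illustration written for this page, not the kernel or e2fsprogs code, and Python timings say nothing about the C implementations benchmarked above.

```python
POLY = 0x82F63B78  # bit-reflected CRC32c polynomial


def make_tables(n: int) -> list:
    """Build n lookup tables; table k advances a byte through k extra zero bytes."""
    t0 = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
        t0.append(crc)
    tables = [t0]
    for _ in range(1, n):
        prev = tables[-1]
        tables.append([(v >> 8) ^ t0[v & 0xFF] for v in prev])
    return tables


T = make_tables(4)


def crc32c_sarwate(data: bytes) -> int:
    """One table lookup per input byte."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ T[0][(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF


def crc32c_slice4(data: bytes) -> int:
    """Four independent lookups per 32-bit word, then a byte-wise tail."""
    crc = 0xFFFFFFFF
    n = len(data) - len(data) % 4
    for i in range(0, n, 4):
        crc ^= int.from_bytes(data[i:i + 4], "little")
        crc = (T[3][crc & 0xFF] ^ T[2][(crc >> 8) & 0xFF]
               ^ T[1][(crc >> 16) & 0xFF] ^ T[0][crc >> 24])
    for b in data[n:]:
        crc = (crc >> 8) ^ T[0][(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF
```

The slice-by-4 loop removes the serial dependency between consecutive bytes: the four lookups for a word do not depend on each other, which is what lets a CPU overlap them.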
As a side note, it is also desirable to optimize the crc16-t10dif algorithm, not for ext4 but for DIF disks.
Also, I hear that the upcoming SPARC T4 will have hardware CRC32c acceleration.
Existing Metadata Checksumming
The block group descriptor is protected by a CRC16. On a 64-bit filesystem, it may be possible either to extend the field to 32 bits, or to stuff a 32-bit crc into 16 bits as discussed above.
jbd2 has a (probably infrequently used) journal_checksum feature that ensures the integrity of the journal contents. Nominally it supports CRC32, MD5, or SHA1 checksums, though as of Linux 3.0 only CRC32 seems to be implemented. This can be easily switched over to CRC32c.
On-Disk Structure Modifications
Darrick will try to implement this without requiring an on-disk format change. Basically, that means that we have to find places where checksums can be crammed into existing data structures.
Andi Kleen posted a patch to checksum the superblock. Darrick plans to massage this patch a little bit, and store the crc32c into the superblock somewhere around offset 0x240.
Inode checksums are only supported on Linux. The checksum is a crc32c field at offset 0x7C, which puts it in the middle of osd2.linux2. The checksum covers the inode and everything else that follows it (afaik in-inode extended attribute blocks).
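The following sketch shows one way the inode checksum could be computed and verified. The 0x7C offset comes from the text; the 256-byte inode size, the little-endian storage, and the choice to zero the checksum field while computing are assumptions made for illustration, as are the function names.

```python
import struct

CSUM_OFFSET = 0x7C  # checksum location given in the text
INODE_SIZE = 256    # assumption: 128-byte core inode plus extra space


def crc32c(data: bytes) -> int:
    """Slow bitwise CRC32c, included only to keep the sketch self-contained."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def inode_checksum(raw: bytes) -> int:
    """Checksum the whole on-disk inode with the checksum field zeroed (assumed)."""
    buf = bytearray(raw)
    buf[CSUM_OFFSET:CSUM_OFFSET + 4] = bytes(4)
    return crc32c(bytes(buf))


def stamp_inode(raw: bytes) -> bytes:
    """Return a copy of the inode with its checksum field filled in."""
    buf = bytearray(raw)
    struct.pack_into("<I", buf, CSUM_OFFSET, inode_checksum(raw))
    return bytes(buf)


def verify_inode(raw: bytes) -> bool:
    """Compare the stored checksum against a freshly computed one."""
    (stored,) = struct.unpack_from("<I", raw, CSUM_OFFSET)
    return stored == inode_checksum(raw)
```

Because the checksum runs over the full inode size, a flipped bit in the extra space after the 128-byte core (where in-inode extended attributes live) is caught as well.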
Inode/Block Bitmap (64-bit)
Each bitmap has its own crc32c checksum; both checksums are stored in the block group descriptor. The inode bitmap checksum is at offset 0x18, and the block bitmap checksum is at offset 0x38. This only works if the 64bit feature is set, unfortunately.
Inode/Block Bitmap (32-bit)
For 32-bit filesystems, Darrick is considering using the 16-bit fields in the block group descriptor at offset 0x18 and 0x20 to store either crc16 or stuffed crc32c values of the inode and block bitmaps. It's probably better to have a slow crc16 over no crc at all.
Filesystem blocks are always 1024, 2048, or 4096 bytes, and the extent tree header and entry structures are both 12 bytes long. Therefore, because 2^n % 12 >= 4, there is sufficient space to store a crc32c just past the end of the last struct ext4_extent. The checksum is computed over only the part of the extent block that is in use.
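The arithmetic behind that claim can be checked directly. The sketch below packs one 12-byte header and as many 12-byte entries as fit into each block size, then reports the bytes left over at the end of the block; the function name is hypothetical.

```python
EXTENT_REC_SIZE = 12  # size of both the extent header and each extent entry
CRC32C_SIZE = 4


def extent_block_layout(block_size: int) -> tuple:
    """Return (max entries, leftover bytes) for a full extent tree block."""
    max_entries = (block_size - EXTENT_REC_SIZE) // EXTENT_REC_SIZE
    used = EXTENT_REC_SIZE * (1 + max_entries)  # header + entries
    return max_entries, block_size - used


layouts = {bs: extent_block_layout(bs) for bs in (1024, 2048, 4096)}
# layouts == {1024: (84, 4), 2048: (169, 8), 4096: (340, 4)}
```

In every case the leftover is at least 4 bytes, so the checksum fits without reducing the number of extent entries per block.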
Regular directory leaf blocks (i.e. blocks that are not secretly htree nodes) are a semi-packed array of variable-length records. A 12-byte directory entry is created at the end of the block with an inode of 0, to make the entry look unused to old ext4 drivers; a name_len of 0; and a rec_len large enough to hold a crc32c. In a cursory analysis of 250,000 directories, just 29 had blocks that did not have sufficient space to hold the 12-byte tail. tune2fs will advise users to run e2fsck -D to rebuild all directories so that all directory blocks may have a checksum.
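A sketch of what such a tail entry could look like, using the standard dirent field order (32-bit inode, 16-bit rec_len, 8-bit name_len, 8-bit file type) followed by the checksum in the space where a name would normally go. The file type value of 0 and the decision to checksum everything before the tail are assumptions, and the sketch does not shrink the rec_len of the preceding entry, which a real implementation would have to do.

```python
import struct

TAIL_FMT = "<IHBBI"  # inode, rec_len, name_len, file_type, crc32c
TAIL_SIZE = struct.calcsize(TAIL_FMT)  # 12 bytes


def crc32c(data: bytes) -> int:
    """Slow bitwise CRC32c, included only to keep the sketch self-contained."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def add_tail(block: bytes) -> bytes:
    """Overwrite the last 12 bytes (assumed free) with a fake dirent + checksum."""
    body = block[:-TAIL_SIZE]
    tail = struct.pack(TAIL_FMT, 0, TAIL_SIZE, 0, 0, crc32c(body))
    return body + tail


def check_tail(block: bytes) -> bool:
    """Recognize the fake dirent and verify the checksum it carries."""
    inode, rec_len, name_len, _ftype, stored = struct.unpack(
        TAIL_FMT, block[-TAIL_SIZE:])
    return (inode == 0 and rec_len == TAIL_SIZE and name_len == 0
            and stored == crc32c(block[:-TAIL_SIZE]))
```

An old driver walking the block sees an entry with inode 0 and skips rec_len bytes, which is exactly how it already treats deleted entries.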
The htree root and internal nodes do not hide a checksum in a fake dirent at the end of the block because that would require the removal of two struct dx_entry from each htree block. Instead, the limit count is decreased by 1 and the crc32c stored at the end of the block. Again, tune2fs will advise users to run e2fsck -D to rebuild all directories and perform any necessary htree rebalancing.
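The "two entries versus one" trade-off follows from the size of struct dx_entry, which is 8 bytes (a 32-bit hash plus a 32-bit block number). A quick check of the arithmetic, with a hypothetical helper name:

```python
DX_ENTRY_SIZE = 8      # struct dx_entry: 32-bit hash + 32-bit block
DIRENT_TAIL_SIZE = 12  # fake dirent carrying a checksum
CRC32C_SIZE = 4        # bare checksum at the end of the block


def entries_displaced(tail_bytes: int) -> int:
    """Number of dx_entry slots that must be given up to free tail_bytes."""
    return -(-tail_bytes // DX_ENTRY_SIZE)  # ceiling division


cost_fake_dirent = entries_displaced(DIRENT_TAIL_SIZE)  # 2
cost_bare_crc = entries_displaced(CRC32C_SIZE)          # 1
```

Storing the bare checksum therefore costs one entry per htree block instead of two, at the price of a tail format that differs from the leaf-block one.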
Unfortunately, in adding htree checksums to a very very large directory, it is possible to overflow the htree.
Extended Attributes (EAs)
For EAs stored in a separate disk block (i.e. not stored after the inode), there is sufficient space to store a crc32c directly in the header.
For EAs stored in the extra space after the inode, Darrick thought incorrectly that the h_magic field was never checked. That turned out to be untrue, so his new proposal is to follow Andreas Dilger's suggestion simply to extend the inode checksum to cover the extra space after the inode structure. That will require a fair amount of changes to e2fsprogs, but not a lot for the kernel.
Metadata Not Being Upgraded
Direct/indirect/triple-indirect block maps are not targeted for checksums, as this results in a totally incompatible disk format change and reduces the maximum file size considerably. Files should be converted to extents via chattr +e for increased safety and less overhead.
A user should be able to turn on this feature at mke2fs time simply by specifying -O metadata_csum. Because the 64bit feature allows block group descriptors that are large enough to hold crc32c checksums for the bitmaps, mke2fs should warn the user if the feature set is metadata_csum,^64bit, once the 64bit feature has been tested thoroughly.
It should be possible to convert existing filesystems with a simple tune2fs -O metadata_csum. tune2fs will apply checksums to all metadata structures that can trivially take them, and tell the user to run e2fsck -D if necessary. e2fsck will gain the ability to reorganize directory tree blocks to accommodate the checksum fields. Obviously, 64bit mode cannot (currently) be enabled on existing filesystems.
It should be possible to disable metadata checksumming on an existing filesystem with tune2fs -O ^metadata_csum, with the same conditions outlined for enabling checksums on an existing filesystem.
debugfs should try to display checksums whenever possible.
It should NOT be possible for old fs code to write to a filesystem with metadata checksums enabled. The metadata_csum flag is implemented as a ROCOMPAT flag, which should keep (non-malicious) old programs from messing things up.
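The protection a ROCOMPAT flag provides can be sketched as follows: a driver that sees a read-only-compatible feature bit it does not recognize may still mount the filesystem read-only, but must refuse a read-write mount. The flag values below are the ones used in mainline headers as far as the editor knows and are not taken from this page; the "old driver" feature set and the function name are hypothetical.

```python
# Assumed flag values (not stated in this document).
RO_COMPAT_SPARSE_SUPER = 0x0001
RO_COMPAT_LARGE_FILE = 0x0002
RO_COMPAT_METADATA_CSUM = 0x0400

# Hypothetical old driver that predates metadata_csum.
OLD_DRIVER_SUPPORTED = RO_COMPAT_SPARSE_SUPER | RO_COMPAT_LARGE_FILE


def mount_decision(feature_ro_compat: int, supported: int, want_rw: bool) -> str:
    """Return 'rw', 'ro', or 'refused' for a mount attempt."""
    unknown = feature_ro_compat & ~supported
    if want_rw:
        # Unknown ro_compat bits mean the driver cannot safely write.
        return "refused" if unknown else "rw"
    return "ro"


fs_features = RO_COMPAT_SPARSE_SUPER | RO_COMPAT_METADATA_CSUM
old_rw = mount_decision(fs_features, OLD_DRIVER_SUPPORTED, want_rw=True)
old_ro = mount_decision(fs_features, OLD_DRIVER_SUPPORTED, want_rw=False)
```

This only guards against well-behaved old code; a program that ignores the feature flags and writes anyway would leave stale checksums behind, which is why the text says "non-malicious".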
Stuff Darrick Hasn't Thought Hard Enough About
- Other filesystems' use of checksums??
- Other ext4 features being concurrently developed?
- Value-adds that use some ext4 fields without noting it in the ext4 documentation.
- Defensive programming when we have to parse the metadata that is being checksummed (extent tree? dir blocks? htree blocks?)