Ext4 Metadata Checksums

Revision as of 22:29, 22 August 2011

Overview

Add crc32c to ext4 superblock, inode, block and inode bitmap, extent tree, directory block, htree block, and extended attribute objects with as few disk layout adjustments as possible.

As much as we wish our storage hardware was 100% reliable, it is still quite possible for data to be corrupted on disk, corrupted during transfer over a wire, or written to the wrong places. To protect against this sort of non-hostile corruption, it is desirable to store checksums of metadata objects on the filesystem to prevent broken metadata from shredding the filesystem. In theory, btrfs has stronger guarantees against corruption (uniform checksums on _all_ metadata blocks, redundant copies of all metadata, etc.) but this retrofit to ext4 will provide stronger protections for users who desire to stay with or refuse to migrate off of ext4, and at the fairly low cost of a single tune2fs/e2fsck.

This document is intended to record Darrick's metadata checksum design as he works on writing the necessary patches.

Algorithm

The popular sentiment is that a CRC will suffice to detect bit flips and various other kinds of corruption. The existing block group checksum uses the ANSI CRC16 polynomial (0x8005), which probably suffices for 32-byte block group descriptors. However, this crc16 may not be the most desirable function for the other metadata objects; longer CRCs are generally better at detecting errors when the data being checksummed gets large. It is expected that this will be the case, since the bitmaps and the directory blocks are generally 4KiB in size.

The CRC32c polynomial (0x1EDC6F41) seems to have stronger error detection abilities over regular CRC32 (0x04C11DB7). It is implemented in hardware on Core i7 Intel CPUs and can be made to run reasonably quickly on other processors. Therefore, it seems desirable to use it. Further study is required to determine which CRCs (and which implementations) are fastest.
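
As a point of reference, here is a minimal bit-at-a-time sketch of CRC32c in Python. It assumes the usual reflected form of the polynomial (0x82F63B78 is 0x1EDC6F41 with its bits reversed) and the conventional all-ones initial value and final XOR; it is meant to show what is being computed, not to be fast.

```python
# Reflected (bit-reversed) form of the CRC32c polynomial 0x1EDC6F41.
CRC32C_POLY_REFLECTED = 0x82F63B78


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC32c; pass a previous result as `crc` to continue a running checksum."""
    crc ^= 0xFFFFFFFF                      # conventional initial inversion
    for byte in data:
        crc ^= byte
        for _ in range(8):                 # one polynomial division step per bit
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF                # conventional final inversion


# "123456789" is the standard check string for CRC catalogues.
print(hex(crc32c(b"123456789")))  # 0xe3069283
```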

CRC Stuffing

For the space-constrained block group descriptors (at least in standard 32-bit mode), it has been suggested that because CRC16 is implemented in software, we should find a way to use the fast crc32c function yet somehow shrink the checksum to fit in 16 bits.
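
The text does not say how the shrinking would be done. Two obvious candidates are sketched below purely as assumptions: keeping only the low 16 bits, or XOR-folding the two halves together. The function names are hypothetical and neither is claimed to be the chosen design.

```python
def stuff_truncate(crc32c_value: int) -> int:
    """Keep only the low 16 bits of a 32-bit checksum (hypothetical option 1)."""
    return crc32c_value & 0xFFFF


def stuff_fold(crc32c_value: int) -> int:
    """XOR the high half into the low half (hypothetical option 2)."""
    return ((crc32c_value >> 16) ^ crc32c_value) & 0xFFFF


# Example input: the CRC32c of b"123456789".
example = 0xE3069283
print(hex(stuff_truncate(example)), hex(stuff_fold(example)))  # 0x9283 0x7185
```

Either way the result fits the existing 16-bit field, at the price of losing the longer CRC's stronger detection guarantees.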

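For the bitmap checksums it seems possible to take advantage of the property crc32(a ^ b) = crc32(a) ^ crc32(b), which holds for equal-length inputs when the CRC is computed with a zero initial value and no final XOR.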
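
The sketch below checks that property on random equal-length buffers. It assumes a bitwise CRC32c with configurable initial value and final XOR (the helper names are made up for this example). With the conventional all-ones init and final XOR the relation picks up one extra term, the CRC of an all-zero buffer of the same length.

```python
import os

POLY = 0x82F63B78  # reflected form of the CRC32c polynomial 0x1EDC6F41


def crc32c_param(data: bytes, init: int = 0, xorout: int = 0) -> int:
    """Bitwise CRC32c with explicit initial value and final XOR."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
    return crc ^ xorout


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def crc32c_std(data: bytes) -> int:
    """Conventional CRC32c: all-ones init and all-ones final XOR."""
    return crc32c_param(data, 0xFFFFFFFF, 0xFFFFFFFF)


a, b = os.urandom(64), os.urandom(64)
zeros = bytes(64)

# Pure CRC (zero init, no final XOR): strictly linear over XOR.
linear_ok = crc32c_param(xor_bytes(a, b)) == crc32c_param(a) ^ crc32c_param(b)

# Conventional CRC: affine, so the CRC of a zero buffer must be mixed back in.
affine_ok = crc32c_std(xor_bytes(a, b)) == crc32c_std(a) ^ crc32c_std(b) ^ crc32c_std(zeros)

print(linear_ok, affine_ok)
```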

Benchmarking

I culled crc code from the Linux kernel and e2fsprogs, and linked it all into a big dumb program that crcs a large block of data. Following are bandwidth results (K/s) from various machines and a block size of 512MB:

machine | clock | crc16 | crc16-t10dif | crc32-kern-be | crc32-kern-le | crc32-e2fs-be | crc32-e2fs-le | crc32c | crc32c-intel | crc32c-by8-be | crc32c-by8-le | crc32c-intelby8
Xeon X5650 | 2.67GHz | 381,856 | 293,951 | 1,039,389 | 1,059,679 | 454,377 | 454,133 | 419,964 | 4,447,431 | 1,684,071 | 1,698,309 | 1,843,101
Core i7-950 | 3.06GHz | 363,599 | 279,431 | 996,363 | 994,851 | 429,477 | 428,275 | 398,382 | 4,131,127 | 1,573,210 | 1,593,776 | 1,714,893
3.6GHz Pentium 4 | 3.6GHz | 391,433 | 345,666 | 915,502 | 925,717 | 512,564 | 511,035 | 437,946 | n/a | 1,097,068 | 1,146,004 | 1,099,856
Core2 6700 | 2.67GHz | 332,726 | 320,891 | 933,688 | 937,826 | 453,658 | 453,377 | 390,229 | n/a | 1,653,483 | 1,652,018 | 1,362,838
1.5GHz POWER5+ | 1.5GHz | 160,096 | 111,396 | 285,927 | 314,650 | 169,446 | 169,447 | 160,106 | n/a | 620,102 | 624,184 | 599,048
1.9GHz POWER5+ | 1.9GHz | 202,266 | 140,844 | 360,555 | 396,713 | 214,207 | 214,202 | 202,224 | n/a | 807,723 | 808,243 | 775,657
Athlon64 X2 4200+ | 2.2GHz | 261,927 | 298,252 | 767,435 | 767,469 | 392,507 | 392,520 | 337,204 | n/a | 1,193,278 | 1,102,328 | 1,136,237
3GHz Pentium 4 | 3GHz | 360,264 | 307,781 | 793,679 | 790,873 | 421,749 | 421,491 | 393,766 | n/a | 935,662 | 942,952 | 910,220
1GHz Pentium 3 | 1GHz | 67,448 | 68,429 | 157,668 | 157,609 | 116,705 | 116,294 | 107,558 | n/a
VIA C7 | 2GHz | 133,243 | 132,670 | 296,732 | 296,757 | 228,180 | 228,417 | 153,906 | n/a | 339,504 | 343,237 | 327,777
VIA C7 | 800MHz | 52,759 | 52,765 | 118,037 | 118,832 | 90,874 | 90,483 | 60,962 | n/a | 138,600 | 137,445 | 132,069
Opteron 8218 | 2.6GHz | 304,453 | 346,510 | 888,013 | 890,044 | 454,597 | 454,210 | 391,157 | n/a | 1,189,312 | 1,176,844 | 1,176,380
Xeon E5450 | 3GHz | 405,184 | 326,124 | 1,052,806 | 1,055,434 | 511,349 | 510,867 | 421,542 | n/a | 1,675,781 | 1,686,921 | 1,816,082
P4 Xeon MP | 2.7GHz | 174,024 | 150,326 | 267,248 | 267,390 | 175,788 | 176,342 | 185,110 | n/a | 319,609 | 320,717 | 270,821
Xeon E3110 | 3GHz | 406,181 | 326,324 | 1,055,929 | 1,057,013 | 518,032 | 516,353 | 422,631 | n/a | 1,676,384 | 1,696,455 | 1,831,592
500MHz PIII | 500MHz | 34,034 | 34,778 | 93,968 | 96,528 | 62,248 | 62,896 | 55,315 | n/a | 121,693 | 121,570 | 116,931
Core2 T7400 | 2.16GHz | 277,295 | 261,794 | 758,097 | 758,311 | 367,066 | 366,937 | 316,754 | n/a | 1,329,832 | 1,328,357 | 1,088,756
Core2 T2300 | 1.66GHz | 210,691 | 232,884 | 586,950 | 587,660 | 298,031 | 297,973 | 239,845 | n/a | 855,838 | 855,600 | 763,868
Core2 T7500 | 2.2GHz | 304,027 | 286,315 | 835,736 | 836,694 | 400,011 | 400,388 | 348,750 | n/a | 1,465,904 | 1,467,464 | 1,181,531
Xeon X5550 | 2.67GHz | 385,203 | 296,862 | 1,053,178 | 1,054,078 | 455,272 | 455,312 | 422,926 | 4,351,392 | 1,667,016 | 1,676,230 | 1,822,632
PowerMac G5 | 2GHz | 212,214 | 147,982 | 377,590 | 417,308 | 225,339 | 225,339 | 212,190 | n/a | 738,237 | 736,327 | 728,993
Xeon X5570 | 2.93GHz | 384,908 | 259,286 | 855,428 | 855,416 | 421,520 | 421,524 | 406,596 | 4,283,526 | 1,818,824 | 1,818,756 | 1,632,126
Xeon X7560 | 2.3GHz | 197,739 | 140,100 | 427,931 | 427,931 | 213,622 | 224,348 | 204,148 | 2,143,132 | 898,852 | 889,125 | 863,381
Opteron 8354 | 2.2GHz | 257,997 | 258,429 | 650,962 | 650,855 | 369,342 | 367,794 | 337,798 | n/a | 984,548 | 984,264 | 996,814
Core i7?? | 2.6GHz | 241,697 | 193,481 | 597,500 | 597,550 | 267,273 | 267,266 | 264,275 | 3,249,929 | 1,257,160 | 1,257,236 | 1,219,009
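
For readers who want to repeat this kind of measurement, here is a rough sketch of the harness idea: checksum one large buffer repeatedly and report K/s. It is not the program used for the table above. It uses the CRC routines that ship with Python (zlib.crc32 and binascii.crc_hqx, a CRC-CCITT16) as stand-ins, and the 1 MiB buffer and repeat count are arbitrary assumptions.

```python
import binascii
import time
import zlib


def throughput_kib_per_s(func, data: bytes, repeats: int = 5) -> float:
    """Run func(data) `repeats` times and return the rate in KiB per second."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(data)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return len(data) * repeats / 1024 / elapsed


block = bytes(range(256)) * 4096  # 1 MiB of non-constant data

results = {
    "crc32 (zlib)": throughput_kib_per_s(zlib.crc32, block),
    "crc16-ccitt (binascii.crc_hqx)": throughput_kib_per_s(
        lambda d: binascii.crc_hqx(d, 0), block
    ),
}

for name, rate in results.items():
    print(f"{name}: {rate:,.0f} K/s")
```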

Here is a description of the various CRC implementations tested:

algorithm | description
crc16 | ANSI CRC16 algorithm in kernel (Sarwate)
crc16-t10dif | T10 CRC16 used for DIF in kernel (Sarwate)
crc32-kern-be | BE CRC32 in kernel (slice by 4)
crc32-kern-le | LE CRC32 in kernel (slice by 4)
crc32-e2fs-be | BE CRC32 in e2fsprogs 1.41 (slice by 4)
crc32-e2fs-le | LE CRC32 in e2fsprogs 1.41 (slice by 4)
crc32c | Default CRC32C in kernel (Sarwate)
crc32c-intel | Accelerated CRC32C on Intel Core i7
crc32c-by8-be | Bob Pearson's updated BE CRC32 algorithm, but with CRC32C polynomial (slice by 8)
crc32c-by8-le | Bob Pearson's updated LE CRC32 algorithm, but with CRC32C polynomial (slice by 8)
crc32c-intelby8 | Intel's CRC32C algorithm http://prdownloads.sourceforge.net/slicing-by-8/ (slice by 8)

At a 4K block size the time slices are so tiny that it's difficult to identify any clear trends.

It is well known that Sarwate's algorithm has been superseded (performance-wise) by the slicing implementations; these results support that conclusion. All slice-by-N implementations #define their polynomial, making it trivially easy to port the code to the CRC32C polynomial used by the "default" implementation. Obviously, the hardware solution eats all the others for lunch, though it only exceeds the slice-by-8 algorithm by a factor of ~2.5x and the slice-by-4 algorithms by a factor of ~4x. Either way, 1.5GB/s of _metadata_ updates is quite a lot, so the performance hit might not be too severe provided we can replace the current software crc32c code with one of these slice-by algorithms.
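
To make the difference concrete, here is a small Python sketch of both table-driven styles for CRC32c: Sarwate's byte-at-a-time lookup and a slice-by-4 variant that consumes one little-endian 32-bit word per step. It is an illustration written for this page, not the kernel or e2fsprogs code, and Python will not reproduce the speed ratios above.

```python
POLY = 0x82F63B78  # reflected form of the CRC32c polynomial 0x1EDC6F41


def make_tables(slices: int = 4):
    """tables[0] is the classic Sarwate table; tables[k] adds k trailing zero bytes."""
    base = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
        base.append(crc)
    tables = [base]
    for _ in range(1, slices):
        prev = tables[-1]
        tables.append([(v >> 8) ^ base[v & 0xFF] for v in prev])
    return tables


T = make_tables()


def crc32c_sarwate(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:                       # one table lookup per byte
        crc = (crc >> 8) ^ T[0][(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF


def crc32c_slice4(data: bytes) -> int:
    crc = 0xFFFFFFFF
    whole = len(data) - len(data) % 4
    for i in range(0, whole, 4):            # four independent lookups per word
        crc ^= int.from_bytes(data[i:i + 4], "little")
        crc = (T[3][crc & 0xFF] ^ T[2][(crc >> 8) & 0xFF]
               ^ T[1][(crc >> 16) & 0xFF] ^ T[0][crc >> 24])
    for byte in data[whole:]:               # leftover tail, byte at a time
        crc = (crc >> 8) ^ T[0][(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF


sample = bytes(range(256)) * 3 + b"xy"
print(hex(crc32c_sarwate(b"123456789")), crc32c_sarwate(sample) == crc32c_slice4(sample))
```

The gain in compiled code comes from the four lookups per word not depending on one another, whereas each Sarwate lookup has to wait for the previous one.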

As a side note, it is also desirable to optimize the crc16-t10dif algorithm, not for ext4 but for DIF disks.

Existing Metadata Checksumming

Block Groups

The block group descriptor is protected by a CRC16. On a 64-bit filesystem, it may be possible either to extend the field to 32 bits, or to stuff a 32-bit crc into 16 bits per the "CRC Stuffing" section above.

Journal

jbd2 has a (probably infrequently used) journal_checksum feature that ensures the integrity of the journal contents. The journal format nominally allows CRC32, MD5, or SHA1 checksums, though as of Linux 3.0 only CRC32 seems to be supported. This can easily be switched over to CRC32c.

On-Disk Structure Modifications

Darrick will try to implement this without requiring an on-disk format change. Basically, that means that we have to find places where checksums can be crammed into existing data structures.

Superblock

Andi Kleen posted a patch to checksum the superblock. Darrick plans to massage this patch a little bit, and store the crc32c into the superblock somewhere around offset 0x240.

Inodes

Inode checksums are only supported on Linux. The checksum is a crc32c field at offset 0x7C, which puts it in the middle of osd2.linux2. The checksum covers the inode and everything else that follows it (afaik in-inode extended attribute blocks).
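
A minimal sketch of what that could look like, under several assumptions the text does not settle: the checksum field is treated as zero while the CRC is computed, it is stored little-endian, no seed is mixed in, and the 256-byte inode is made-up test data. Only the 0x7C offset and the "cover everything after the inode too" rule come from the description above.

```python
import struct

POLY = 0x82F63B78      # reflected form of the CRC32c polynomial 0x1EDC6F41
CSUM_OFFSET = 0x7C     # checksum field offset given in the text


def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def inode_checksum(raw: bytes) -> int:
    """CRC over the whole on-disk inode, including the space after the fixed fields."""
    buf = bytearray(raw)
    buf[CSUM_OFFSET:CSUM_OFFSET + 4] = bytes(4)   # assumption: field counts as zero
    return crc32c(bytes(buf))


def store_inode_checksum(raw: bytes) -> bytes:
    buf = bytearray(raw)
    struct.pack_into("<I", buf, CSUM_OFFSET, inode_checksum(raw))
    return bytes(buf)


def verify_inode(raw: bytes) -> bool:
    (stored,) = struct.unpack_from("<I", raw, CSUM_OFFSET)
    return stored == inode_checksum(raw)


inode = store_inode_checksum(bytes(range(256)))   # hypothetical 256-byte inode
damaged = bytearray(inode)
damaged[200] ^= 0x01                              # flip one bit in the in-inode EA area
print(verify_inode(inode), verify_inode(bytes(damaged)))
```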

Inode/Block Bitmap (64-bit)

Each bitmap has its own crc32c checksum; both checksums are stored in the block group descriptor. The inode bitmap checksum is at offset 0x18, and the block bitmap checksum is at offset 0x38. This only works if the 64bit feature is set, unfortunately.

Inode/Block Bitmap (32-bit)

For 32-bit filesystems, Darrick is considering using the 16-bit fields in the block group descriptor at offset 0x18 and 0x20 to store either crc16 or stuffed crc32c values of the inode and block bitmaps. It's probably better to have a slow crc16 over no crc at all.

Extent Tree

Filesystem blocks are always 1024, 2048, or 4096 bytes, and the extent tree header and entry structures are both 12 bytes long. Therefore, because 2^n % 12 >= 4, there is sufficient space to store a crc32c just past the end of the last struct ext4_extent. The checksum is computed over only the part of the extent block that is in use.
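
The arithmetic can be checked directly. The sketch below assumes the block holds one 12-byte header followed by as many 12-byte entries as fit; the function name is made up for this example.

```python
EXTENT_STRUCT_SIZE = 12   # size of both the extent header and each extent entry
CRC_SIZE = 4              # bytes needed for a crc32c


def extent_block_layout(block_size: int):
    """Return (max_entries, tail_offset, slack_bytes) for an extent tree block."""
    max_entries = (block_size - EXTENT_STRUCT_SIZE) // EXTENT_STRUCT_SIZE
    tail_offset = EXTENT_STRUCT_SIZE * (1 + max_entries)   # just past the last entry
    slack = block_size - tail_offset
    return max_entries, tail_offset, slack


for block_size in (1024, 2048, 4096):
    entries, offset, slack = extent_block_layout(block_size)
    print(block_size, entries, offset, slack, slack >= CRC_SIZE)
```

For 1024, 2048, and 4096 byte blocks this leaves 4, 8, and 4 spare bytes respectively, so no extent entry has to be given up.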

Directory Blocks

Regular directory leaf blocks (i.e. blocks that are not secretly htree nodes) are a semi-packed array of variable-length records. A 12-byte directory entry is created at the end of the block with an inode of 0 to make the entry look unused to old ext4 drivers; a name_len of 0; and a rec_len large enough to hold a crc32c. In a cursory analysis of 250,000 directories, just 29 had blocks that did not have sufficient space to hold the 12-byte tail. tune2fs will advise users to run e2fsck -D to rebuild all directories so that all directory blocks may have a checksum.
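
One way to picture the tail is as an ordinary 8-byte dirent header followed by the 4-byte checksum where a name would normally go. The sketch below assumes that layout, a one-byte name_len plus a one-byte file type (left at 0 here), little-endian fields, and a checksum computed over everything before the tail. The format string and function name are illustrative assumptions.

```python
import struct

POLY = 0x82F63B78                 # reflected form of the CRC32c polynomial 0x1EDC6F41
TAIL_FMT = "<IHBBI"               # inode, rec_len, name_len, file_type, checksum
TAIL_SIZE = struct.calcsize(TAIL_FMT)   # 12 bytes


def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def add_dir_tail(block: bytes) -> bytes:
    """Overwrite the last 12 bytes of a directory block with a fake-dirent tail.

    Assumes the caller has already made sure those 12 bytes are free.
    """
    body = block[:-TAIL_SIZE]
    tail = struct.pack(TAIL_FMT, 0, TAIL_SIZE, 0, 0, crc32c(body))
    return body + tail


block = add_dir_tail(bytes(1024))             # hypothetical empty 1 KiB dir block
print(struct.unpack(TAIL_FMT, block[-TAIL_SIZE:])[:3])  # (0, 12, 0)
```

Because the inode number is 0, code that knows nothing about checksums should simply treat the tail as an unused entry.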

HTree

The htree root and internal nodes do not hide a checksum in a fake dirent at the end of the block because that would require the removal of two struct dx_entry from each htree block. Instead, the limit count is decreased by 1 and the crc32c stored at the end of the block. Again, tune2fs will advise users to run e2fsck -D to rebuild all directories and perform any necessary htree rebalancing.
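
The trade-off is easy to see with numbers. The sketch below assumes 8-byte struct dx_entry records and takes the number of bytes in front of the entry array as a parameter, since that differs between root and internal nodes; the 8-byte header in the example is an assumption for an internal node.

```python
DX_ENTRY_SIZE = 8     # struct dx_entry: 32-bit hash + 32-bit block number
FAKE_DIRENT_SIZE = 12  # the tail used for ordinary directory leaf blocks


def dx_limit(block_size: int, header_bytes: int, with_checksum: bool) -> int:
    """Number of dx_entry slots in an htree block, optionally giving one up for the CRC."""
    limit = (block_size - header_bytes) // DX_ENTRY_SIZE
    return limit - 1 if with_checksum else limit


# A 12-byte fake dirent would not fit in one 8-byte slot, so it would cost two.
slots_for_fake_dirent = -(-FAKE_DIRENT_SIZE // DX_ENTRY_SIZE)   # ceiling division

print(dx_limit(4096, 8, False), dx_limit(4096, 8, True), slots_for_fake_dirent)
# 511 510 2
```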

Unfortunately, in adding htree checksums to a very very large directory, it is possible to overflow the htree.

Extended Attributes (EAs)

For EAs stored in a separate disk block (i.e. not stored after the inode), there is sufficient space to store a crc32c directly in the header.

For EAs stored in the extra space after the inode, Darrick thought incorrectly that the h_magic field was never checked. That turned out to be untrue, so his new proposal is to follow Andreas Dilger's suggestion simply to extend the inode checksum to cover the extra space after the inode structure. That will require a fair amount of changes to e2fsprogs, but not a lot for the kernel.

Metadata Not Being Upgraded

Direct/indirect/triple-indirect block maps are not targeted for checksums, as this results in a totally incompatible disk format change and reduces the maximum file size considerably. Files should be converted to extents via chattr +e for increased safety and less overhead.

Tool Updates

A user should be able to turn on this feature at mke2fs time simply by specifying -O metadata_csum. Because the 64bit feature allows block group descriptors that are large enough to enable crc32c for the bitmaps, mke2fs should warn the user if the feature set is metadata_csum,^64bit, once the 64bit feature has been tested thoroughly.

It should be possible to convert existing filesystems with a simple tune2fs -O metadata_csum. tune2fs will apply checksums to all metadata structures that can trivially take them, and tell the user to run e2fsck -D if necessary. e2fsck will gain the ability to reorganize directory tree blocks to accommodate the checksum fields. Obviously, 64bit mode cannot (currently) be enabled on existing filesystems.

It should be possible to disable metadata checksumming on an existing filesystem with tune2fs -O ^metadata_csum, with the same conditions outlined for enabling checksums on an existing filesystem.

debugfs should try to display checksums whenever possible.

It should NOT be possible for old fs code to write to a filesystem with metadata checksums enabled. The metadata_csum flag is implemented as a ROCOMPAT flag, which should keep (non-malicious) old programs from messing things up.
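
The ROCOMPAT rule can be sketched as follows: a driver that sees a read-only-compatible feature bit it does not understand may still read the filesystem but must not write to it. The bit value and function name below are placeholders chosen for this example, not something this page specifies.

```python
# Placeholder bit for the metadata_csum ROCOMPAT flag (an assumption of this sketch).
RO_COMPAT_METADATA_CSUM = 0x0400


def allowed_mount_mode(fs_ro_compat: int, supported_ro_compat: int, want_write: bool) -> str:
    """Return "rw" or "ro" according to the read-only-compatible feature rule."""
    unknown = fs_ro_compat & ~supported_ro_compat
    if unknown:
        return "ro"            # unknown ROCOMPAT bits: reading is fine, writing is not
    return "rw" if want_write else "ro"


old_driver = 0x0003                              # hypothetical: knows two older flags
new_driver = old_driver | RO_COMPAT_METADATA_CSUM

print(allowed_mount_mode(RO_COMPAT_METADATA_CSUM, old_driver, True),
      allowed_mount_mode(RO_COMPAT_METADATA_CSUM, new_driver, True))
# ro rw
```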

Stuff Darrick Hasn't Thought Hard Enough About

  • Other filesystems' use of checksums??
  • Other ext4 features being concurrently developed?
  • Value-adds that use some ext4 fields without noting it in the ext4 documentation.
  • Defensive programming when we have to parse the metadata that is being checksummed (extent tree? dir blocks? htree blocks?)