Ext4 Disk Layout

Revision as of 23:01, 30 March 2011 by Djwong (Talk | contribs)

This document attempts to describe the on-disk format for ext4 filesystems. The same general ideas should apply to ext2/3 filesystems as well, though they do not support all the features that ext4 supports, and the fields will be shorter.

NOTE: This is a work in progress, based on notes that the author (Djwong) made while picking apart a filesystem by hand. The data structure definitions were pulled out of fs/ext4/ext4.h in 2.6.38.

Miscellany

ext4 divides a storage device into an array of logical blocks both to reduce bookkeeping overhead and to increase throughput by forcing larger transfer sizes. Generally, the block size will be 4KiB (coincidentally, the same size as pages on x86 and the block layer's default block size), though the actual size is calculated as 2 ^ (10 + sb.s_log_block_size) bytes. Throughout this document, disk locations are given in terms of these logical blocks, not raw LBAs, and not 1024-byte blocks.
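The block size arithmetic above can be sketched in Python. This is only an illustration; the helper names are hypothetical and do not come from any ext4 tool.

```python
def block_size_bytes(s_log_block_size: int) -> int:
    """Logical block size in bytes: 2 ^ (10 + s_log_block_size)."""
    return 1 << (10 + s_log_block_size)


def block_to_byte_offset(block_number: int, s_log_block_size: int) -> int:
    """Byte offset on the device at which a logical block starts."""
    return block_number * block_size_bytes(s_log_block_size)
```

For instance, s_log_block_size = 2 gives 4096-byte blocks, so logical block 3 starts at byte 12288.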

Block Groups

An ext4 file system is split into a series of block groups. To reduce performance difficulties due to fragmentation, the block allocator tries very hard to keep each file's blocks within the same group, thereby reducing seek times. The size of a block group is specified in sb.s_blocks_per_group blocks, though it can also be calculated as 8 * block_size_in_bytes. With the default block size of 4KiB, each group will contain 32,768 blocks, for a length of 128MiB. The number of block groups is the size of the device divided by the size of a block group.
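A rough sketch of the group geometry follows, in Python. The function names are hypothetical, and rounding the group count up to include a trailing partial group is an assumption rather than something stated above.

```python
def default_blocks_per_group(block_size: int) -> int:
    """A one-block bitmap holds 8 * block_size bits, one bit per block."""
    return 8 * block_size


def block_group_count(total_blocks: int, blocks_per_group: int) -> int:
    """Number of block groups, counting a trailing partial group (assumption)."""
    return -(-total_blocks // blocks_per_group)  # ceiling division
```

With 4KiB blocks this gives 32,768 blocks per group, and a device of 100,000 blocks would need 4 groups.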

Layout

The layout of a standard block group is approximately as follows (each of these fields is discussed in a separate section below):

Region                Size
Group 0 Padding       1024 bytes
ext4 Super Block      1 block
Group Descriptors     many blocks
Reserved GDT Blocks   many blocks
Data Block Bitmap     1 block
inode Bitmap          1 block
inode Table           many blocks
Data Blocks           many more blocks

For the special case of block group 0, the first 1024 bytes are unused, to allow for the installation of x86 boot sectors and other oddities. The superblock will start at offset 1024 bytes, whichever block that happens to be (usually 0). However, if for some reason the block size = 1024, then block 0 is marked in use and the superblock goes in block 1. For all other block groups, there is no padding.
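The placement rule for the primary superblock can be sketched as follows. The function names are hypothetical; the magic value 0xEF53 and its offset 0x38 are taken from the superblock table later in this document. This is a sketch for poking at a raw image, not a validated probe.

```python
import struct

SUPERBLOCK_OFFSET = 1024  # bytes from the start of the device
S_MAGIC_OFFSET = 0x38     # offset of s_magic inside the superblock
SUPERBLOCK_MAGIC = 0xEF53


def superblock_location(block_size: int) -> tuple:
    """Return (block number, byte offset within that block)."""
    return divmod(SUPERBLOCK_OFFSET, block_size)


def has_superblock_magic(device: bytes) -> bool:
    """Check for the magic signature in the leading bytes of a device image."""
    start = SUPERBLOCK_OFFSET + S_MAGIC_OFFSET
    if len(device) < start + 2:
        return False
    (magic,) = struct.unpack_from("<H", device, start)
    return magic == SUPERBLOCK_MAGIC
```

With 1024-byte blocks the superblock lands in block 1 at offset 0; with 4096-byte blocks it sits in block 0 at offset 1024.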

The ext4 driver primarily works with the superblock and the group descriptors that are found in block group 0. Redundant copies of the superblock and group descriptors are written to some of the block groups across the disk in case the beginning of the disk gets trashed, though not all block groups necessarily host a redundant copy (see following paragraph for more details). If the group does not have a redundant copy, the block group begins with the data block bitmap. Note also that when the filesystem is freshly formatted, mkfs will allocate "reserve GDT block" space after the block group descriptors and before the start of the block bitmaps to allow for future expansion of the filesystem. By default, a filesystem is allowed to increase in size by a factor of 1024x over the original filesystem size.

Flexible Block Groups

Starting in ext4, there is a new feature called flexible block groups (flex_bg). In a flex_bg, several block groups are tied together as one logical block group; the bitmap spaces and the inode table space in the first block group of the flex_bg are expanded to include the bitmaps and inode tables of all other block groups in the flex_bg. For example, if the flex_bg size is 4, then group 0 will contain (in order) the superblock, group descriptors, data block bitmaps for groups 0-3, inode bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining space in group 0 is for file data. The effect of this is to group the block metadata close together for faster loading, and to enable large files to be contiguous on disk. Backup copies of the superblock and group descriptors are always at the beginning of block groups, even if flex_bg is enabled. The number of block groups that make up a flex_bg is given by 2 ^ sb.s_log_groups_per_flex.
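A small sketch of the flex_bg arithmetic, in Python. The helper names are hypothetical, and it assumes that flex groups are aligned runs of consecutive block groups starting at group 0, as in the example above.

```python
def groups_per_flex(s_log_groups_per_flex: int) -> int:
    """Number of block groups in one flex_bg: 2 ^ s_log_groups_per_flex."""
    return 1 << s_log_groups_per_flex


def flex_group_of(group: int, s_log_groups_per_flex: int) -> int:
    """Index of the flex_bg that a block group belongs to."""
    return group >> s_log_groups_per_flex


def first_group_in_flex(group: int, s_log_groups_per_flex: int) -> int:
    """Block group expected to host the packed bitmaps and inode tables."""
    return (group >> s_log_groups_per_flex) << s_log_groups_per_flex
```

With s_log_groups_per_flex = 2, block group 5 belongs to flex group 1, whose first member is block group 4.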

Meta Block Groups

Normally, a complete copy of the entire block group descriptor table is recorded after every copy of the superblock. Because that table has to fit within a single block group, the default group size of 2^27 bytes (128MiB) and 64-byte group descriptors impose a limit of 2^27 / 64 = 2^21 block groups, or 2^48 bytes (256TiB). With the meta block group feature enabled, each block group contains redundant copies of the block group descriptor for that group, thereby enabling the creation of the full 2^32 block groups, for a total size of 2^59 bytes (512PiB).
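The size limits can be checked with a few lines of arithmetic. This sketch assumes the default 2^27-byte groups and 64-byte descriptors mentioned above; the constant names are made up for illustration.

```python
GROUP_SIZE_BYTES = 1 << 27  # default block group size (128 MiB)
GROUP_DESC_BYTES = 64       # size of one group descriptor

# Without meta_bg, the whole descriptor table has to fit in one block group.
MAX_GROUPS_PLAIN = GROUP_SIZE_BYTES // GROUP_DESC_BYTES
MAX_BYTES_PLAIN = MAX_GROUPS_PLAIN * GROUP_SIZE_BYTES

# With meta_bg, the 32-bit group number becomes the limit.
MAX_GROUPS_META_BG = 1 << 32
MAX_BYTES_META_BG = MAX_GROUPS_META_BG * GROUP_SIZE_BYTES
```

This works out to 2^21 groups and 2^48 bytes without meta_bg, and 2^59 bytes with it.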

Lazy Block Group Initialization

Also new in ext4, the inode bitmap and inode tables in a group are uninitialized if the corresponding flag is set in the group descriptor. This is to reduce mkfs time considerably. If the group descriptor checksum feature is enabled, then even the group descriptors can be uninitialized.

The Super Block

The superblock records various information about the enclosing filesystem, such as block counts, inode counts, supported features, maintenance information, and more.

If the sparse_super feature flag is set, redundant copies of the superblock and group descriptors are kept only in the groups whose group number is either 0 or a power of 3, 5, or 7. If the flag is not set, redundant copies are kept in all groups.
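The sparse_super rule can be sketched as a predicate on the group number. The function name is hypothetical. Note that group 1 qualifies because 1 is the zeroth power of 3, 5, and 7.

```python
def has_backup_superblock(group: int, sparse_super: bool = True) -> bool:
    """True if this block group holds a superblock/descriptor copy."""
    if not sparse_super or group == 0:
        return True
    for base in (3, 5, 7):
        power = 1
        while power < group:
            power *= base
        if power == group:
            return True
    return False
```

Among the first 50 groups, this selects groups 0, 1, 3, 5, 7, 9, 25, 27, and 49.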

The ext4 superblock is laid out as follows in struct ext4_super_block:

Offset Size Name Description
0x0 __le32 s_inodes_count Total inode count.
0x4 __le32 s_blocks_count_lo Total block count.
0x8 __le32 s_r_blocks_count_lo Reserved block count.
0xC __le32 s_free_blocks_count_lo Free block count.
0x10 __le32 s_free_inodes_count Free inode count.
0x14 __le32 s_first_data_block First data block.
0x18 __le32 s_log_block_size Block size is 2 ^ (10 + s_log_block_size).
0x1C __le32 s_obso_log_frag_size (Obsolete) fragment size.
0x20 __le32 s_blocks_per_group Blocks per group.
0x24 __le32 s_obso_frags_per_group (Obsolete) fragments per group.
0x28 __le32 s_inodes_per_group Inodes per group.
0x2C __le32 s_mtime Mount time, in seconds since the epoch.
0x30 __le32 s_wtime Write time, in seconds since the epoch.
0x34 __le16 s_mnt_count Number of mounts since the last fsck.
0x36 __le16 s_max_mnt_count Number of mounts beyond which a fsck is needed.
0x38 __le16 s_magic Magic signature, 0xEF53
0x3A __le16 s_state File system state. Valid values are:
0x0001 Cleanly umounted
0x0002 Errors detected
0x0004 Orphans being recovered
0x3C __le16 s_errors Behaviour when detecting errors. One of:
1 Continue
2 Remount read-only
3 Panic
0x3E __le16 s_minor_rev_level Minor revision level.
0x40 __le32 s_lastcheck Time of last check, in seconds since the epoch.
0x44 __le32 s_checkinterval Maximum time between checks, in seconds.
0x48 __le32 s_creator_os OS. One of:
0 Linux
1 Hurd
2 Masix
3 FreeBSD
4 Lites
0x4C __le32 s_rev_level Revision level. One of:
0 Original format
1 v2 format w/ dynamic inode sizes
0x50 __le16 s_def_resuid Default uid for reserved blocks.
0x52 __le16 s_def_resgid Default gid for reserved blocks.
These fields are for EXT4_DYNAMIC_REV superblocks only.

Note: the difference between the compatible feature set and the incompatible feature set is that if there is a bit set in the incompatible feature set that the kernel doesn't know about, it should refuse to mount the filesystem.

e2fsck's requirements are more strict; if it doesn't know about a feature in either the compatible or incompatible feature set, it must abort and not try to meddle with things it doesn't understand...

0x54 __le32 s_first_ino First non-reserved inode.
0x58 __le16 s_inode_size Size of inode structure, in bytes.
0x5A __le16 s_block_group_nr Block group # of this superblock.
0x5C __le32 s_feature_compat Compatible feature set flags. Kernel can still read/write this fs even if it doesn't understand a flag; fsck should not do that. Any of:
0x1 Directory preallocation.
0x2 "imagic inodes". Not clear from the code what this does.
0x4 Has a journal.
0x8 Supports extended attributes.
0x10 Has reserved GDT blocks for filesystem expansion.
0x20 Has directory indices.
0x40 "Lazy BG". Not in 2.6.38, seems to have been for uninitialized block groups?
0x80 "Exclude inode". Not documented or used outside of e2fsprogs.
0x60 __le32 s_feature_incompat Incompatible feature set. If the kernel or fsck doesn't understand one of these bits, it should stop. Any of:
0x1 Compression.
0x2 Directory entries record the file type. See ext4_dir_entry_2 below.
0x4 Filesystem needs recovery.
0x8 Filesystem has a separate journal device.
0x10 Meta block groups. See the earlier discussion of this feature.
0x40 Files in this filesystem use extents.
0x80 Enable a filesystem size of 2^64 blocks.
0x100 Multiple mount protection. Not implemented.
0x200 Flexible block groups. See the earlier discussion of this feature.
0x400 Inodes can be used for large extended attributes. (Not implemented?)
0x1000 Data in directory entry. (Not implemented?)
0x64 __le32 s_feature_ro_compat Readonly-compatible feature set. If the kernel doesn't understand one of these bits, it can still mount read-only. Any of:
0x1 Sparse superblocks. See the earlier discussion of this feature.
0x2 This filesystem has been used to store a file greater than 2GB.
0x8 This filesystem has files whose sizes are represented in units of logical blocks, not 512-byte sectors. This implies a very large file indeed!
0x10 Group descriptors have checksums. In addition to detecting corruption, this is useful for lazy formatting with uninitialized groups.
0x20 Indicates that the old ext3 32,000 subdirectory limit no longer applies.
0x40 Indicates that large inodes exist on this filesystem.
0x80 This filesystem has a snapshot.
0x68 __u8 s_uuid[16] 128-bit UUID for volume.
0x78 char s_volume_name[16] Volume label.
0x88 char s_last_mounted[64] Directory where filesystem was last mounted.
0xC8 __le32 s_algorithm_usage_bitmap For compression
Performance hints. Directory preallocation should only happen if the EXT4_FEATURE_COMPAT_DIR_PREALLOC flag is on.
0xCC __u8 s_prealloc_blocks # of blocks to try to preallocate for ... files?
0xCD __u8 s_prealloc_dir_blocks # of blocks to preallocate for directories.
0xCE __le16 s_reserved_gdt_blocks Number of reserved GDT entries for future filesystem expansion.
Journaling support valid if EXT4_FEATURE_COMPAT_HAS_JOURNAL set.
0xD0 __u8 s_journal_uuid[16] UUID of journal superblock
0xE0 __le32 s_journal_inum inode number of journal file.
0xE4 __le32 s_journal_dev Device number of journal file, if the external journal feature flag is set.
0xE8 __le32 s_last_orphan Start of list of orphaned inodes to delete.
0xEC __le32 s_hash_seed[4] HTREE hash seed.
0xFC __u8 s_def_hash_version Default hash algorithm to use for directory hashes. One of:
0x0 Legacy.
0x1 Half MD4.
0x2 Tea.
0x3 Legacy, unsigned.
0x4 Half MD4, unsigned.
0x5 Tea, unsigned.
0xFD __u8 s_jnl_backup_type ?
0xFE __le16 s_desc_size Size of group descriptors, in bytes, if the 64bit incompat feature flag is set.
0x100 __le32 s_default_mount_opts Default mount options. Any of:
0x0001 Print debugging info upon (re)mount.
0x0002 New files take the gid of the containing directory (instead of the fsgid of the current process).
0x0004 Support userspace-provided extended attributes.
0x0008 Support POSIX access control lists (ACLs).
0x0010 Do not support 32-bit UIDs.
0x0020 All data and metadata are committed to the journal.
0x0040 All data are flushed to the disk before metadata are committed to the journal.
0x0060 Data ordering is not preserved; data may be written after the metadata has been written.
0x0100 Disable write flushes.
0x0200 Track which blocks in a filesystem are metadata and therefore should not be used as data blocks.
0x0400 Enable DISCARD support, where the storage device is told about blocks becoming unused.
0x0800 Disable delayed allocation.
0x104 __le32 s_first_meta_bg First metablock block group, if the meta_bg feature is enabled.
0x108 __le32 s_mkfs_time When the filesystem was created, in seconds since the epoch.
0x10C __le32 s_jnl_blocks[17] Backup copy of the first 68 bytes of the journal inode.
64bit support valid if EXT4_FEATURE_INCOMPAT_64BIT
0x150 __le32 s_blocks_count_hi High 32-bits of the block count.
0x154 __le32 s_r_blocks_count_hi High 32-bits of the reserved block count.
0x158 __le32 s_free_blocks_count_hi High 32-bits of the free block count.
0x15C __le16 s_min_extra_isize All inodes have at least # bytes.
0x15E __le16 s_want_extra_isize New inodes should reserve # bytes.
0x160 __le32 s_flags Miscellaneous flags. Any of:
0x0001 Signed directory hash in use.
0x0002 Unsigned directory hash in use.
0x0004 To test development code.
0x164 __le16 s_raid_stride RAID stride. This is the number of logical blocks read from or written to the disk before moving to the next disk. This affects the placement of filesystem metadata, which will hopefully make RAID storage faster.
0x166 __le16 s_mmp_interval # seconds to wait in multi-mount prevention (MMP) checking. In theory, MMP is a mechanism to record in the superblock which host and device have mounted the filesystem, in order to prevent multiple mounts. This feature does not seem to be implemented...
0x168 __le64 s_mmp_block Block # for multi-mount protection data.
0x170 __le32 s_raid_stripe_width RAID stripe width. This is the number of logical blocks read from or written to the disk before coming back to the current disk. This is used by the block allocator to try to reduce the number of read-modify-write operations in a RAID5/6.
0x174 __u8 s_log_groups_per_flex Size of a flexible block group is 2 ^ s_log_groups_per_flex.
0x175 __u8 s_reserved_char_pad
0x176 __le16 s_reserved_pad
0x178 __le64 s_kbytes_written Number of KiB written to this filesystem over its lifetime.
0x180 __le32 s_snapshot_inum inode number of active snapshot.
0x184 __le32 s_snapshot_id Sequential ID of active snapshot.
0x188 __le64 s_snapshot_r_blocks_count Number of blocks reserved for active snapshot's future use.
0x190 __le32 s_snapshot_list inode number of the head of the on-disk snapshot list.
0x194 __le32 s_error_count Number of errors seen.
0x198 __le32 s_first_error_time First time an error happened, in seconds since the epoch.
0x19C __le32 s_first_error_ino inode involved in first error.
0x1A0 __le64 s_first_error_block Number of the block involved in the first error.
0x1A8 __u8 s_first_error_func[32] Name of function where the error happened.
0x1C8 __le32 s_first_error_line Line number where error happened.
0x1CC __le32 s_last_error_time Time of most recent error, in seconds since the epoch.
0x1D0 __le32 s_last_error_ino inode involved in most recent error.
0x1D4 __le32 s_last_error_line Line number where most recent error happened.
0x1D8 __le64 s_last_error_block Number of block involved in most recent error.
0x1E0 __u8 s_last_error_func[32] Name of function where the most recent error happened.
0x200 __u8 s_mount_opts[64] ASCIIZ string of mount options.
0x240 __le32 s_reserved[112] Padding to the end of the block.

Total size is 1024 bytes.
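As an illustration of how the table above maps onto bytes, here is a minimal Python sketch that decodes a handful of fields with the struct module, using the offsets listed. The function names and the returned dictionary keys are hypothetical, and mount_decision only mirrors the compat/incompat/ro_compat rules described in this section; it is not the kernel's actual logic.

```python
import struct

INCOMPAT_64BIT = 0x80  # "Enable a filesystem size of 2^64 blocks"


def parse_superblock(sb: bytes) -> dict:
    """Decode a few fields from a 1024-byte superblock image."""
    if len(sb) < 1024:
        raise ValueError("a superblock is 1024 bytes long")
    (magic,) = struct.unpack_from("<H", sb, 0x38)
    if magic != 0xEF53:
        raise ValueError("bad magic signature")
    inodes_count, blocks_lo = struct.unpack_from("<II", sb, 0x0)
    (log_block_size,) = struct.unpack_from("<I", sb, 0x18)
    (blocks_per_group,) = struct.unpack_from("<I", sb, 0x20)
    (inodes_per_group,) = struct.unpack_from("<I", sb, 0x28)
    (inode_size,) = struct.unpack_from("<H", sb, 0x58)
    compat, incompat, ro_compat = struct.unpack_from("<III", sb, 0x5C)
    (blocks_hi,) = struct.unpack_from("<I", sb, 0x150)
    if not incompat & INCOMPAT_64BIT:
        blocks_hi = 0  # the high half only counts when 64bit is set
    return {
        "inodes_count": inodes_count,
        "blocks_count": blocks_lo | (blocks_hi << 32),
        "block_size": 1 << (10 + log_block_size),
        "blocks_per_group": blocks_per_group,
        "inodes_per_group": inodes_per_group,
        "inode_size": inode_size,
        "feature_compat": compat,
        "feature_incompat": incompat,
        "feature_ro_compat": ro_compat,
    }


def mount_decision(incompat: int, ro_compat: int,
                   known_incompat: int, known_ro_compat: int) -> str:
    """Apply the feature-set rules; unknown compat bits are ignored."""
    if incompat & ~known_incompat:
        return "refuse"
    if ro_compat & ~known_ro_compat:
        return "read-only"
    return "read-write"
```

The 1024 bytes passed to parse_superblock would typically be read from byte offset 1024 of the device.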

Block Group Descriptors

Each block group on the filesystem has one of these descriptors associated with it. As noted in the Layout section above, the group descriptors (if present) are the second item in the block group. The standard configuration is for each block group to contain a full copy of the block group descriptor table unless the sparse_super feature flag is set.

Notice how the group descriptor records the location of both bitmaps and the inode table (i.e. they can float). This means that within a block group, the only data structures with fixed locations are the superblock and the group descriptor table. The flex_bg mechanism uses this property to group several block groups into a flex group and lay out all of the groups' bitmaps and inode tables into one long run in the first group of the flex group.

If the meta_bg feature flag is set, then several block groups are grouped together into a meta group. Note that in the meta_bg case, however, the first and last two block groups within the larger meta group contain only group descriptors for the groups inside the meta group.

flex_bg and meta_bg do not appear to be mutually exclusive features.

The block group descriptor is laid out in struct ext4_group_desc.

Offset Size Name Description
0x0 __le32 bg_block_bitmap_lo Lower 32-bits of location of block bitmap.
0x4 __le32 bg_inode_bitmap_lo Lower 32-bits of location of inode bitmap.
0x8 __le32 bg_inode_table_lo Lower 32-bits of location of inode table.
0xC __le16 bg_free_blocks_count_lo Lower 16-bits of free block count.
0xE __le16 bg_free_inodes_count_lo Lower 16-bits of free inode count.
0x10 __le16 bg_used_dirs_count_lo Lower 16-bits of directory count.
0x12 __le16 bg_flags Block group flags. Any of:
0x1 inode table and bitmap are not initialized.
0x2 block bitmap is not initialized.
0x4 inode table is zeroed.
0x14 __u32 bg_reserved[2] Likely block/inode bitmap checksum.
0x1C __le16 bg_itable_unused_lo Lower 16-bits of unused inode count.
0x1E __le16 bg_checksum Group descriptor checksum; crc16(sb_uuid+group+desc). Probably only calculated if the rocompat bg_checksum feature flag is set.
0x20 __le32 bg_block_bitmap_hi Upper 32-bits of location of block bitmap.
0x24 __le32 bg_inode_bitmap_hi Upper 32-bits of location of inode bitmap.
0x28 __le32 bg_inode_table_hi Upper 32-bits of location of inode table.
0x2C __le16 bg_free_blocks_count_hi Upper 16-bits of free block count.
0x2E __le16 bg_free_inodes_count_hi Upper 16-bits of free inode count.
0x30 __le16 bg_used_dirs_count_hi Upper 16-bits of directory count.
0x32 __le16 bg_itable_unused_hi Upper 16-bits of unused inode count.
0x34 __u32 bg_reserved2[3] Padding to 64 bytes.

Total size is 64 bytes.
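To show how the _lo and _hi halves combine, here is a hedged Python sketch that decodes one descriptor at the offsets in the table above. The function name and dictionary keys are hypothetical, and it assumes the caller passes 64 bytes only when the 64bit feature is enabled; for 32-byte descriptors the high halves are treated as zero.

```python
import struct


def parse_group_desc(desc: bytes) -> dict:
    """Decode a 32-byte or 64-byte block group descriptor."""
    (bb_lo, ib_lo, it_lo, free_blocks_lo, free_inodes_lo,
     used_dirs_lo, flags) = struct.unpack_from("<IIIHHHH", desc, 0x0)
    itable_unused_lo, checksum = struct.unpack_from("<HH", desc, 0x1C)
    hi = (0,) * 7
    if len(desc) >= 64:
        # The *_hi fields start at 0x20 in the same order as the low halves.
        hi = struct.unpack_from("<IIIHHHH", desc, 0x20)
    return {
        "block_bitmap": bb_lo | (hi[0] << 32),
        "inode_bitmap": ib_lo | (hi[1] << 32),
        "inode_table": it_lo | (hi[2] << 32),
        "free_blocks_count": free_blocks_lo | (hi[3] << 16),
        "free_inodes_count": free_inodes_lo | (hi[4] << 16),
        "used_dirs_count": used_dirs_lo | (hi[5] << 16),
        "itable_unused": itable_unused_lo | (hi[6] << 16),
        "flags": flags,
        "checksum": checksum,
    }
```

Locations are 64-bit values built from two 32-bit halves, while the counters are 32-bit values built from two 16-bit halves.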

Block and inode Bitmaps

The data block bitmap tracks the usage of data blocks within the block group.

The inode bitmap records which entries in the inode table are in use.

As with most bitmaps, one bit represents the usage status of one data block or inode table entry. This implies a block group size of 8 * number_of_bytes_in_a_logical_block.
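A bitmap lookup can be sketched as below. It assumes the little-endian bit order used by the ext2 family, where bit 0 is the least significant bit of byte 0; the helper names are hypothetical.

```python
def bit_is_set(bitmap: bytes, index: int) -> bool:
    """True if the block or inode table entry at this index is in use."""
    return bool(bitmap[index // 8] & (1 << (index % 8)))


def count_free(bitmap: bytes, entries: int) -> int:
    """Count clear bits among the first `entries` bits."""
    return sum(1 for i in range(entries) if not bit_is_set(bitmap, i))
```

A 4096-byte bitmap block covers 32,768 entries, which matches the default group size.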

Inode Table

The inode table is a linear array of struct ext4_inode. The table is sized to have enough blocks to store at least sb.s_inode_size * sb.s_inodes_per_group bytes. Inode numbers start at 1 (there is no inode 0), so the number of the block group containing an inode can be calculated as (inode_number - 1) / sb.s_inodes_per_group, and the index into the group's table is (inode_number - 1) % sb.s_inodes_per_group.
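Locating an inode can be sketched as follows. Because inode numbers start at 1, the sketch subtracts 1 before dividing. The function name is hypothetical, and the returned block number is relative to the start of the group's inode table (whose location comes from the group descriptor).

```python
def locate_inode(inode_number: int, inodes_per_group: int,
                 inode_size: int, block_size: int) -> tuple:
    """Return (group, index in group, table block, offset in that block)."""
    if inode_number < 1:
        raise ValueError("inode numbers start at 1")
    index = inode_number - 1
    group, index_in_group = divmod(index, inodes_per_group)
    table_block, offset = divmod(index_in_group * inode_size, block_size)
    return group, index_in_group, table_block, offset
```

For example, with 8192 inodes per group, inode 8193 is the first entry of group 1.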

The inode table entry is laid out in struct ext4_inode.

Offset Size Name Description
0x0 __le16 i_mode File mode. Any of:
0x1 S_IXOTH (Others may execute)
0x2 S_IWOTH (Others may write)
0x4 S_IROTH (Others may read)
0x8 S_IXGRP (Group members may execute)
0x10 S_IWGRP (Group members may write)
0x20 S_IRGRP (Group members may read)
0x40 S_IXUSR (Owner may execute)
0x80 S_IWUSR (Owner may write)
0x100 S_IRUSR (Owner may read)
0x200 S_ISVTX (Sticky bit)
0x400 S_ISGID (Set GID)
0x800 S_ISUID (Set UID)
0x2 __le16 i_uid Lower 16-bits of Owner UID.
0x4 __le32 i_size_lo Lower 32-bits of size in bytes.
0x8 __le32 i_atime Last access time, in seconds since the epoch.
0xC __le32 i_ctime Last inode change time, in seconds since the epoch.
0x10 __le32 i_mtime Last data modification time, in seconds since the epoch.
0x14 __le32 i_dtime Deletion Time, in seconds since the epoch.
0x18 __le16 i_gid Lower 16-bits of GID.
0x1A __le16 i_links_count Hard link count.
0x1C __le32 i_blocks_lo Lower 32-bits of block count.
0x20 __le32 i_flags Inode flags. Any of:
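0x1 File requires secure deletion (EXT4_SECRM_FL).
0x2 File should be preserved for undeletion (EXT4_UNRM_FL).
0x4 File is compressed (EXT4_COMPR_FL).
0x8 Synchronous writes (EXT4_SYNC_FL).
0x10 Immutable file (EXT4_IMMUTABLE_FL).
0x20 Append-only file (EXT4_APPEND_FL).
0x40 Do not dump this file (EXT4_NODUMP_FL).
0x80 Do not update access time (EXT4_NOATIME_FL).
0x100 Dirty compressed file (EXT4_DIRTY_FL).
0x200 File has one or more compressed clusters (EXT4_COMPRBLK_FL).
0x400 Do not compress file (EXT4_NOCOMPR_FL).
0x800 Compression error (EXT4_ECOMPR_FL).
0x1000 Directory has hashed indexes (EXT4_INDEX_FL).
0x2000 AFS magic directory (EXT4_IMAGIC_FL).
0x4000 File data must always be written through the journal (EXT4_JOURNAL_DATA_FL).
0x8000 File tail should not be merged (EXT4_NOTAIL_FL).
0x10000 Directory modifications are synchronous (EXT4_DIRSYNC_FL).
0x20000 Top of directory hierarchy (EXT4_TOPDIR_FL).
0x40000 This is a huge file (EXT4_HUGE_FILE_FL).
0x80000 Inode uses extents (EXT4_EXTENTS_FL).
0x200000 Inode is used for a large extended attribute (EXT4_EA_INODE_FL).
0x400000 File has blocks allocated past EOF (EXT4_EOFBLOCKS_FL).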
0x24 4 bytes

Union osd1:

Tag Contents
linux1
Offset Size Name Description
0x0 __le32 l_i_version Version
hurd1
Offset Size Name Description
0x0 __le32 h_i_translator ??
masix1
Offset Size Name Description
0x0 __le32 m_i_reserved ??
0x28 __le32 i_block[EXT4_N_BLOCKS=15] Block map or extent tree.
0x64 __le32 i_generation File version (for NFS).
0x68 __le32 i_file_acl_lo Lower 32-bits of file ACL location.
0x6C __le32 i_size_high Upper 32-bits of file size.
0x70 __le32 i_obso_faddr (Obsolete) fragment address.
0x74 12 bytes

Union osd2:

Tag Contents
linux2
Offset Size Name Description
0x0 __le16 l_i_blocks_high Upper 16-bits of the block count.
0x2 __le16 l_i_file_acl_high Upper 16-bits of the file ACL location.
0x4 __le16 l_i_uid_high Upper 16-bits of the Owner UID.
0x6 __le16 l_i_gid_high Upper 16-bits of the GID.
0x8 __u32 l_i_reserved2 ??
hurd2
Offset Size Name Description
0x0 __le16 h_i_reserved1 ??
0x2 __u16 h_i_mode_high Upper 16-bits of the file mode.
0x4 __le16 h_i_uid_high Upper 16-bits of the Owner UID.
0x6 __le16 h_i_gid_high Upper 16-bits of the GID.
0x8 __u32 h_i_author Author code?
masix2
Offset Size Name Description
0x0 __le16 h_i_reserved1 ??
0x2 __u16 m_i_file_acl_high Upper 16-bits of the file ACL location.
0x4 __u32 m_i_reserved2[2] ??
0x80 __le16 i_extra_isize ??
0x82 __le16 i_pad1 ??
0x84 __le32 i_ctime_extra Extra change time bits (nsec << 2 | epoch). This provides sub-second precision.
0x88 __le32 i_mtime_extra Extra modification time bits (nsec << 2 | epoch). This provides sub-second precision.
0x8C __le32 i_atime_extra Extra access time bits (nsec << 2 | epoch). This provides sub-second precision.
0x90 __le32 i_crtime File Creation time, in seconds since the epoch.
0x94 __le32 i_crtime_extra Extra file creation time bits (nsec << 2 | epoch). This provides sub-second precision.
0x98 __le32 i_version_hi Upper 32-bits for version number.

Note that the size of the structure is 156 bytes, though the standard inode size in ext4 is 256 bytes. It was 128 previously. I think(?) the extra space can be used for extended attributes.
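The following Python sketch decodes a few inode fields at the offsets in the table above, combining the split halves. It assumes the Linux variant of union osd2, whose 16-bit members are packed back to back starting at inode offset 0x74, so l_i_uid_high sits at 0x78 and l_i_gid_high at 0x7A. The function name and dictionary keys are hypothetical.

```python
import struct

OSD2_OFFSET = 0x74  # start of union osd2 within the inode


def parse_inode_basics(raw: bytes) -> dict:
    """Decode ownership, size, and permission bits from a raw inode."""
    mode, uid_lo, size_lo = struct.unpack_from("<HHI", raw, 0x0)
    gid_lo, links_count = struct.unpack_from("<HH", raw, 0x18)
    (size_hi,) = struct.unpack_from("<I", raw, 0x6C)
    # linux2: blocks_high, file_acl_high, uid_high, gid_high (2 bytes each)
    uid_hi, gid_hi = struct.unpack_from("<HH", raw, OSD2_OFFSET + 4)
    return {
        "permissions": mode & 0o7777,  # the twelve mode bits listed above
        "uid": uid_lo | (uid_hi << 16),
        "gid": gid_lo | (gid_hi << 16),
        "size": size_lo | (size_hi << 32),
        "links_count": links_count,
    }
```

Only the first 128 bytes of the inode are needed for these fields.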

Directory Entries

struct ext4_dir_entry {
        __le32  inode;                  /* Inode number */
        __le16  rec_len;                /* Directory entry length */
        __le16  name_len;               /* Name length */
        char    name[EXT4_NAME_LEN];    /* File name */
};

/*
 * The new version of the directory entry.  Since EXT4 structures are
 * stored in intel byte order, and the name_len field could never be
 * bigger than 255 chars, it's safe to reclaim the extra byte for the
 * file_type field.
 */
struct ext4_dir_entry_2 {
        __le32  inode;                  /* Inode number */
        __le16  rec_len;                /* Directory entry length */
        __u8    name_len;               /* Name length */
        __u8    file_type;
        char    name[EXT4_NAME_LEN];    /* File name */
};
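
As a rough illustration of how these records chain together inside a directory block, the Python sketch below walks ext4_dir_entry_2 records by following rec_len. It assumes that only name_len bytes of the name are actually stored on disk (not the full EXT4_NAME_LEN), that an inode number of 0 marks an unused entry, and that the block is smaller than 64KiB so rec_len can be read directly. The function name is hypothetical.

```python
import struct

DIRENT_HEADER = 8  # inode (4) + rec_len (2) + name_len (1) + file_type (1)


def iter_dir_entries(block: bytes):
    """Yield (inode, file_type, name) for each live entry in a directory block."""
    offset = 0
    while offset + DIRENT_HEADER <= len(block):
        inode, rec_len, name_len, file_type = struct.unpack_from(
            "<IHBB", block, offset)
        if rec_len < DIRENT_HEADER or offset + rec_len > len(block):
            raise ValueError("corrupt rec_len at offset %d" % offset)
        if inode != 0:
            start = offset + DIRENT_HEADER
            yield inode, file_type, block[start:start + name_len]
        offset += rec_len  # rec_len always points at the next record
```

The last record in a block normally has a rec_len that reaches the end of the block, which is what terminates the loop.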

Extent Tree

/*
 * ext4_inode has i_block array (60 bytes total).
 * The first 12 bytes store ext4_extent_header;
 * the remainder stores an array of ext4_extent.
 */

/*
 * This is the extent on-disk structure.
 * It's used at the bottom of the tree.
 */
struct ext4_extent {
	__le32	ee_block;	/* first logical block extent covers */
	__le16	ee_len;		/* number of blocks covered by extent */
	__le16	ee_start_hi;	/* high 16 bits of physical block */
	__le32	ee_start_lo;	/* low 32 bits of physical block */
};

/*
 * This is index on-disk structure.
 * It's used at all the levels except the bottom.
 */
struct ext4_extent_idx {
	__le32	ei_block;	/* index covers logical blocks from 'block' */
	__le32	ei_leaf_lo;	/* pointer to the physical block of the next *
				 * level. leaf or next index could be there */
	__le16	ei_leaf_hi;	/* high 16 bits of physical block */
	__u16	ei_unused;
};

/*
 * Each block (leaves and indexes), even inode-stored has header.
 */
struct ext4_extent_header {
	__le16	eh_magic;	/* probably will support different formats */
	__le16	eh_entries;	/* number of valid entries */
	__le16	eh_max;		/* capacity of store in entries */
	__le16	eh_depth;	/* has tree real underlying blocks? */
	__le32	eh_generation;	/* generation of the tree */
};
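
To make the three structures concrete, here is a hedged Python sketch that decodes one extent node, either the 60-byte i_block area or a full tree block. It assumes that eh_depth == 0 marks a leaf holding ext4_extent records, and that any other depth holds ext4_extent_idx records. The magic value 0xF30A is an assumption taken from the kernel source, not from this page, and the function name is hypothetical.

```python
import struct

EXT4_EXT_MAGIC = 0xF30A  # assumed value of eh_magic (from the kernel source)
HEADER_SIZE = 12
ENTRY_SIZE = 12


def parse_extent_node(node: bytes) -> tuple:
    """Return (depth, entries) for one extent tree node."""
    magic, entries, _max, depth, _generation = struct.unpack_from(
        "<HHHHI", node, 0)
    if magic != EXT4_EXT_MAGIC:
        raise ValueError("not an extent node")
    result = []
    for i in range(entries):
        offset = HEADER_SIZE + ENTRY_SIZE * i
        if depth == 0:
            # Leaf: (first logical block, length, physical start block)
            ee_block, ee_len, start_hi, start_lo = struct.unpack_from(
                "<IHHI", node, offset)
            result.append((ee_block, ee_len, (start_hi << 32) | start_lo))
        else:
            # Index: (first logical block covered, physical block of child)
            ei_block, leaf_lo, leaf_hi, _unused = struct.unpack_from(
                "<IIHH", node, offset)
            result.append((ei_block, (leaf_hi << 32) | leaf_lo))
    return depth, result
```

With a 12-byte header and 12-byte entries, the 60-byte i_block area has room for at most four entries.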
