Ext4 Disk Layout
This document attempts to describe the on-disk format for ext4 filesystems. The same general ideas should apply to ext2/3 filesystems as well, though they do not support all the features that ext4 supports, and the fields will be shorter.
NOTE: This is a work in progress, based on notes that the author (Djwong) made while picking apart a filesystem by hand. The data structure definitions were pulled out of fs/ext4/ext4.h in 2.6.38.
Miscellany
ext4 divides a storage device into an array of logical blocks, both to reduce bookkeeping overhead and to increase throughput by forcing larger transfer sizes. Generally, the block size will be 4KiB (coincidentally, the same size as pages on x86 and the block layer's default block size), though the actual size is calculated as 2 ^ (10 + sb.s_log_block_size) bytes. Throughout this document, disk locations are given in terms of these logical blocks, not raw LBAs, and not 1024-byte blocks.
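The block size formula above can be sketched in C as follows. The function name ext4_block_size is made up for this example, and the upper bound in the assertion only keeps the shift inside 64 bits; it is not a limit taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Block size in bytes for a given sb.s_log_block_size value. */
static uint64_t ext4_block_size(uint32_t s_log_block_size)
{
    /* Keep the shift within 64 bits; real filesystems use far smaller values. */
    assert(s_log_block_size < 54);
    return (uint64_t)1 << (10 + s_log_block_size);
}
```

A value of 0 gives 1024-byte blocks and a value of 2 gives the usual 4KiB blocks.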
Block Groups
An ext4 file system is split into a series of block groups. To reduce performance difficulties due to fragmentation, the block allocator tries very hard to keep each file's blocks within the same group, thereby reducing seek times. The size of a block group is specified in sb.s_blocks_per_group blocks, though it can also be calculated as 8 * block_size_in_bytes, because the single-block data block bitmap holds one bit per block in the group. With the default block size of 4KiB, each group will contain 32,768 blocks, for a length of 128MiB. The number of block groups is the size of the device divided by the size of a block group.
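The arithmetic in this paragraph can be sketched as below. Both function names are invented for the example, and rounding the group count up (so that a trailing partial group counts as a group) is an assumption; the text above only says "divided by".

```c
#include <assert.h>
#include <stdint.h>

/* One bitmap block has block_size * 8 bits, one bit per block in the group. */
static uint32_t blocks_per_group(uint32_t block_size)
{
    return 8u * block_size;
}

/* Number of groups; a trailing partial group is assumed to count as a group. */
static uint64_t group_count(uint64_t total_blocks, uint32_t per_group)
{
    assert(per_group != 0);
    return (total_blocks + per_group - 1) / per_group;
}
```

With 4KiB blocks this gives 32,768 blocks per group, and 32,768 * 4,096 bytes is 128MiB per group.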
Layout
The layout of a standard block group is approximately as follows (each of these fields is discussed in a separate section below):
Group 0 Padding | ext4 Super Block | Group Descriptors | Reserved GDT Blocks | Data Block Bitmap | inode Bitmap | inode Table | Data Blocks
---|---|---|---|---|---|---|---
1024 bytes | 1 block | many blocks | many blocks | 1 block | 1 block | many blocks | many more blocks
For the special case of block group 0, the first 1024 bytes are unused, to allow for the installation of x86 boot sectors and other oddities. The superblock will start at offset 1024 bytes, whichever block that happens to be (usually 0). However, if for some reason the block size = 1024, then block 0 is marked in use and the superblock goes in block 1. For all other block groups, there is no padding.
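As a small sketch of the rule above, the block that holds the primary superblock follows from the fixed 1024-byte offset. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SUPERBLOCK_OFFSET 1024u /* bytes from the start of the device */

/* Logical block that contains the primary superblock. */
static uint32_t superblock_block(uint32_t block_size)
{
    assert(block_size >= 1024);
    return SUPERBLOCK_OFFSET / block_size;
}

/* Byte offset of the superblock inside that block. */
static uint32_t superblock_offset_in_block(uint32_t block_size)
{
    assert(block_size >= 1024);
    return SUPERBLOCK_OFFSET % block_size;
}
```

For 1024-byte blocks the superblock fills block 1; for any larger block size it sits 1024 bytes into block 0.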
The ext4 driver primarily works with the superblock and the group descriptors that are found in block group 0. Redundant copies of the superblock and group descriptors are written to some of the block groups across the disk in case the beginning of the disk gets trashed, though not all block groups necessarily host a redundant copy (see following paragraph for more details). If the group does not have a redundant copy, the block group begins with the data block bitmap. Note also that when the filesystem is freshly formatted, mkfs will allocate "reserve GDT block" space after the block group descriptors and before the start of the block bitmaps to allow for future expansion of the filesystem. By default, a filesystem is allowed to increase in size by a factor of 1024x over the original filesystem size.
If the sparse_super feature flag is set, redundant copies of the superblock and group descriptors are kept only in the groups whose group number is either 0 or a power of 3, 5, or 7. If the flag is not set, redundant copies are kept in all groups.
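The sparse_super rule can be checked with a helper like the one below. This is a sketch with invented names, not the kernel's own routine. It treats group 1 as a power (3^0), which is an interpretation of "a power of 3, 5, or 7".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if n equals base^k for some k >= 0. */
static bool is_power_of(uint32_t n, uint32_t base)
{
    assert(base >= 2);
    uint64_t p = 1; /* 64-bit so the last multiplication cannot overflow */
    while (p < n)
        p *= base;
    return p == n;
}

/* Does this group hold a redundant superblock and group descriptors? */
static bool group_has_backup(uint32_t group, bool sparse_super)
{
    if (!sparse_super || group == 0)
        return true;
    return is_power_of(group, 3) || is_power_of(group, 5) ||
           is_power_of(group, 7);
}
```

With sparse_super set, the first few groups that carry copies are 0, 1, 3, 5, 7, 9, 25, 27 and 49.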
New for ext4 is the flex_bg feature; the FLEX_BG group size is recorded in sb.s_log_groups_per_flex.
New also for ext4, the inode bitmap and inode tables in a group are uninitialized if the corresponding flag is set in the group descriptor. This is to reduce mkfs time considerably.
The Super Block
The ext4 superblock, struct ext4_super_block, is laid out as follows:
Offset | Field Type | Name | Description
---|---|---|---
0x0 | __le32 | s_inodes_count | Total inode count.
0x4 | __le32 | s_blocks_count_lo | Total block count.
0x8 | __le32 | s_r_blocks_count_lo | Reserved block count.
0xC | __le32 | s_free_blocks_count_lo | Free block count.
0x10 | __le32 | s_free_inodes_count | Free inode count.
0x14 | __le32 | s_first_data_block | First data block.
0x18 | __le32 | s_log_block_size | Block size is 2 ^ (10 + s_log_block_size).
0x1C | __le32 | s_obso_log_frag_size | (Obsolete) fragment size.
0x20 | __le32 | s_blocks_per_group | Blocks per group.
0x24 | __le32 | s_obso_frags_per_group | (Obsolete) fragments per group.
0x28 | __le32 | s_inodes_per_group | Inodes per group.
0x2C | __le32 | s_mtime | Mount time, in seconds since the epoch.
0x30 | __le32 | s_wtime | Write time, in seconds since the epoch.
0x34 | __le16 | s_mnt_count | Number of mounts since the last fsck.
0x36 | __le16 | s_max_mnt_count | Number of mounts beyond which a fsck is needed.
0x38 | __le16 | s_magic | Magic signature, 0xEF53.
0x3A | __le16 | s_state | File system state.
0x3C | __le16 | s_errors | Behaviour when detecting errors.
0x3E | __le16 | s_minor_rev_level | Minor revision level.
0x40 | __le32 | s_lastcheck | Time of last check, in seconds since the epoch.
0x44 | __le32 | s_checkinterval | Maximum time between checks, in seconds.
0x48 | __le32 | s_creator_os | OS that created the filesystem.
0x4C | __le32 | s_rev_level | Revision level.
0x50 | __le16 | s_def_resuid | Default uid for reserved blocks.
0x52 | __le16 | s_def_resgid | Default gid for reserved blocks.
These fields are for EXT4_DYNAMIC_REV superblocks only. | | |
Note: the difference between the compatible feature set and the incompatible feature set is that if there is a bit set in the incompatible feature set that the kernel doesn't know about, it should refuse to mount the filesystem. e2fsck's requirements are more strict; if it doesn't know about a feature in either the compatible or incompatible feature set, it must abort and not try to meddle with things it doesn't understand. | | |
0x54 | __le32 | s_first_ino | First non-reserved inode.
0x58 | __le16 | s_inode_size | Size of inode structure, in bytes.
0x5A | __le16 | s_block_group_nr | Block group # of this superblock.
0x5C | __le32 | s_feature_compat | Compatible feature set flags. Kernel can still read/write this fs even if it doesn't understand a flag; fsck should not do that.
0x60 | __le32 | s_feature_incompat | Incompatible feature set. If the kernel or fsck doesn't understand one of these bits, it should stop.
0x64 | __le32 | s_feature_ro_compat | Readonly-compatible feature set. If the kernel doesn't understand one of these bits, it can still mount read-only.
0x68 | __u8 | s_uuid[16] | 128-bit UUID for volume.
0x78 | char | s_volume_name[16] | Volume name.
0x88 | char | s_last_mounted[64] | Directory where last mounted.
0xC8 | __le32 | s_algorithm_usage_bitmap | For compression.
Performance hints. Directory preallocation should only happen if the EXT4_FEATURE_COMPAT_DIR_PREALLOC flag is on. | | |
0xCC | __u8 | s_prealloc_blocks | Number of blocks to try to preallocate.
0xCD | __u8 | s_prealloc_dir_blocks | Number of blocks to preallocate for directories.
0xCE | __le16 | s_reserved_gdt_blocks | Per group desc for online growth.
Journaling support valid if EXT4_FEATURE_COMPAT_HAS_JOURNAL set. | | |
0xD0 | __u8 | s_journal_uuid[16] | UUID of journal superblock.
0xE0 | __le32 | s_journal_inum | Inode number of journal file.
0xE4 | __le32 | s_journal_dev | Device number of journal file.
0xE8 | __le32 | s_last_orphan | Start of list of inodes to delete.
0xEC | __le32 | s_hash_seed[4] | HTREE hash seed.
0xFC | __u8 | s_def_hash_version | Default hash version to use.
0xFD | __u8 | s_jnl_backup_type |
0xFE | __le16 | s_desc_size | Size of group descriptor.
0x100 | __le32 | s_default_mount_opts |
0x104 | __le32 | s_first_meta_bg | First metablock block group.
0x108 | __le32 | s_mkfs_time | When the filesystem was created.
0x10C | __le32 | s_jnl_blocks[17] | Backup of the journal inode.
64bit support valid if EXT4_FEATURE_INCOMPAT_64BIT set. | | |
0x150 | __le32 | s_blocks_count_hi | Blocks count, high 32 bits.
0x154 | __le32 | s_r_blocks_count_hi | Reserved blocks count, high 32 bits.
0x158 | __le32 | s_free_blocks_count_hi | Free blocks count, high 32 bits.
0x15C | __le16 | s_min_extra_isize | All inodes have at least # bytes.
0x15E | __le16 | s_want_extra_isize | New inodes should reserve # bytes.
0x160 | __le32 | s_flags | Miscellaneous flags.
0x164 | __le16 | s_raid_stride | RAID stride.
0x166 | __le16 | s_mmp_interval | # seconds to wait in MMP checking.
0x168 | __le64 | s_mmp_block | Block for multi-mount protection.
0x170 | __le32 | s_raid_stripe_width | Blocks on all data disks (N * stride).
0x174 | __u8 | s_log_groups_per_flex | FLEX_BG group size.
0x175 | __u8 | s_reserved_char_pad |
0x176 | __le16 | s_reserved_pad |
0x178 | __le64 | s_kbytes_written | Number of lifetime kilobytes written.
0x180 | __le32 | s_snapshot_inum | Inode number of active snapshot.
0x184 | __le32 | s_snapshot_id | Sequential ID of active snapshot.
0x188 | __le64 | s_snapshot_r_blocks_count | Reserved blocks for active snapshot's future use.
0x190 | __le32 | s_snapshot_list | Inode number of the head of the on-disk snapshot list.
0x194 | __le32 | s_error_count | Number of fs errors.
0x198 | __le32 | s_first_error_time | First time an error happened.
0x19C | __le32 | s_first_error_ino | Inode involved in first error.
0x1A0 | __le64 | s_first_error_block | Block involved in first error.
0x1A8 | __u8 | s_first_error_func[32] | Function where the error happened.
0x1C8 | __le32 | s_first_error_line | Line number where error happened.
0x1CC | __le32 | s_last_error_time | Most recent time of an error.
0x1D0 | __le32 | s_last_error_ino | Inode involved in last error.
0x1D4 | __le32 | s_last_error_line | Line number where error happened.
0x1D8 | __le64 | s_last_error_block | Block involved in last error.
0x1E0 | __u8 | s_last_error_func[32] | Function where the error happened.
0x200 | __u8 | s_mount_opts[64] |
0x240 | __le32 | s_reserved[112] | Padding to the end of the block.
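To illustrate how the offsets in the table might be used, here is a minimal sketch that pulls a few fields out of a raw 1024-byte superblock image. The helper names, the sb_summary struct and the sanity limit on s_log_block_size are all invented for the example. The 0x0080 bit used to detect 64bit support is taken from the kernel header and is not listed in this document, so treat it as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXT4_MAGIC          0xEF53u
#define INCOMPAT_64BIT_FLAG 0x0080u /* assumed value, from the kernel header */

/* Read little-endian integers from a byte buffer regardless of host order. */
static uint16_t le16_at(const uint8_t *p, size_t off)
{
    return (uint16_t)(p[off] | (p[off + 1] << 8));
}

static uint32_t le32_at(const uint8_t *p, size_t off)
{
    return (uint32_t)p[off] | ((uint32_t)p[off + 1] << 8) |
           ((uint32_t)p[off + 2] << 16) | ((uint32_t)p[off + 3] << 24);
}

struct sb_summary {
    uint64_t blocks_count;
    uint32_t block_size;
    uint32_t blocks_per_group;
    uint32_t inodes_per_group;
};

/* sb points at the superblock bytes; returns false if it does not look valid. */
static bool parse_superblock(const uint8_t *sb, struct sb_summary *out)
{
    assert(sb != NULL && out != NULL);

    if (le16_at(sb, 0x38) != EXT4_MAGIC)          /* s_magic */
        return false;

    uint32_t log_bs = le32_at(sb, 0x18);          /* s_log_block_size */
    if (log_bs > 21)                              /* would overflow 32 bits */
        return false;
    out->block_size = (uint32_t)1 << (10 + log_bs);
    out->blocks_per_group = le32_at(sb, 0x20);    /* s_blocks_per_group */
    out->inodes_per_group = le32_at(sb, 0x28);    /* s_inodes_per_group */

    uint64_t blocks = le32_at(sb, 0x04);          /* s_blocks_count_lo */
    if (le32_at(sb, 0x60) & INCOMPAT_64BIT_FLAG)  /* s_feature_incompat */
        blocks |= (uint64_t)le32_at(sb, 0x150) << 32; /* s_blocks_count_hi */
    out->blocks_count = blocks;
    return true;
}
```

Reading byte by byte avoids any dependence on host endianness or struct padding, at the cost of repeating the offsets by hand.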
Block Group Descriptors
struct ext4_group_desc {
/*0x0*/
    __le32 bg_block_bitmap_lo;       /* Blocks bitmap block */
    __le32 bg_inode_bitmap_lo;       /* Inodes bitmap block */
    __le32 bg_inode_table_lo;        /* Inodes table block */
    __le16 bg_free_blocks_count_lo;  /* Free blocks count */
    __le16 bg_free_inodes_count_lo;  /* Free inodes count */
/*0x10*/
    __le16 bg_used_dirs_count_lo;    /* Directories count */
    __le16 bg_flags;                 /* EXT4_BG_flags (INODE_UNINIT, etc) */
    __u32  bg_reserved[2];           /* Likely block/inode bitmap checksum */
    __le16 bg_itable_unused_lo;      /* Unused inodes count */
    __le16 bg_checksum;              /* crc16(sb_uuid+group+desc) */
/*0x20*/
    __le32 bg_block_bitmap_hi;       /* Blocks bitmap block MSB */
    __le32 bg_inode_bitmap_hi;       /* Inodes bitmap block MSB */
    __le32 bg_inode_table_hi;        /* Inodes table block MSB */
    __le16 bg_free_blocks_count_hi;  /* Free blocks count MSB */
    __le16 bg_free_inodes_count_hi;  /* Free inodes count MSB */
/*0x30*/
    __le16 bg_used_dirs_count_hi;    /* Directories count MSB */
    __le16 bg_itable_unused_hi;      /* Unused inodes count MSB */
    __u32  bg_reserved2[3];
/*0x40*/
};
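The _lo and _hi (MSB) pairs in the descriptor combine into one wider value. The sketch below shows one way to do that from a raw descriptor buffer. The function names are hypothetical, and the check that the high halves exist only when the descriptor is at least 64 bytes long (per sb.s_desc_size) is an assumption drawn from the offsets in the struct rather than something stated in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t le16_at(const uint8_t *p, size_t off)
{
    return (uint16_t)(p[off] | (p[off + 1] << 8));
}

static uint32_t le32_at(const uint8_t *p, size_t off)
{
    return (uint32_t)p[off] | ((uint32_t)p[off + 1] << 8) |
           ((uint32_t)p[off + 2] << 16) | ((uint32_t)p[off + 3] << 24);
}

/* Block number of the group's data block bitmap. */
static uint64_t bg_block_bitmap(const uint8_t *desc, uint16_t desc_size)
{
    assert(desc != NULL && desc_size >= 32);
    uint64_t lo = le32_at(desc, 0x00);                       /* bg_block_bitmap_lo */
    uint64_t hi = desc_size >= 64 ? le32_at(desc, 0x20) : 0; /* bg_block_bitmap_hi */
    return lo | (hi << 32);
}

/* Block number where the group's inode table starts. */
static uint64_t bg_inode_table(const uint8_t *desc, uint16_t desc_size)
{
    assert(desc != NULL && desc_size >= 32);
    uint64_t lo = le32_at(desc, 0x08);                       /* bg_inode_table_lo */
    uint64_t hi = desc_size >= 64 ? le32_at(desc, 0x28) : 0; /* bg_inode_table_hi */
    return lo | (hi << 32);
}

/* Free block count, joined from two 16-bit halves. */
static uint32_t bg_free_blocks(const uint8_t *desc, uint16_t desc_size)
{
    assert(desc != NULL && desc_size >= 32);
    uint32_t lo = le16_at(desc, 0x0C);                       /* bg_free_blocks_count_lo */
    uint32_t hi = desc_size >= 64 ? le16_at(desc, 0x2C) : 0; /* bg_free_blocks_count_hi */
    return lo | (hi << 16);
}
```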
Block and inode Bitmaps
The data block bitmap tracks the usage of data blocks within the block group.
The inode bitmap records which entries in the inode table are in use.
Inode Table
The inode table is a linear array of struct ext4_inode. The table is sized to have enough blocks to cover sb.s_inode_size * sb.s_inodes_per_group bytes.
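The sizing rule can be sketched as follows, together with a helper that finds an inode's slot. The names are hypothetical, and the helper assumes that inode numbers start at 1 and are assigned to groups in order, which this document does not state.

```c
#include <assert.h>
#include <stdint.h>

struct inode_loc {
    uint32_t group;       /* block group that holds the inode */
    uint32_t index;       /* slot within that group's inode table */
    uint64_t byte_offset; /* offset from the start of the inode table */
};

/* Blocks needed to hold one group's inode table, rounded up. */
static uint32_t inode_table_blocks(uint16_t inode_size,
                                   uint32_t inodes_per_group,
                                   uint32_t block_size)
{
    assert(block_size != 0);
    uint64_t bytes = (uint64_t)inode_size * inodes_per_group;
    return (uint32_t)((bytes + block_size - 1) / block_size);
}

/* Locate an inode, assuming numbering starts at 1. */
static struct inode_loc locate_inode(uint32_t ino, uint16_t inode_size,
                                     uint32_t inodes_per_group)
{
    assert(ino >= 1 && inodes_per_group != 0);
    struct inode_loc loc;
    loc.group = (ino - 1) / inodes_per_group;
    loc.index = (ino - 1) % inodes_per_group;
    loc.byte_offset = (uint64_t)loc.index * inode_size;
    return loc;
}
```

For example, 8,192 inodes of 256 bytes each need 512 blocks of 4KiB.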
struct ext4_inode {
/*0x0*/
    __le16 i_mode;          /* File mode */
    __le16 i_uid;           /* Low 16 bits of Owner Uid */
    __le32 i_size_lo;       /* Size in bytes */
    __le32 i_atime;         /* Access time */
    __le32 i_ctime;         /* Inode Change time */
/*0x10*/
    __le32 i_mtime;         /* Modification time */
    __le32 i_dtime;         /* Deletion Time */
    __le16 i_gid;           /* Low 16 bits of Group Id */
    __le16 i_links_count;   /* Links count */
    __le32 i_blocks_lo;     /* Blocks count */
/*0x20*/
    __le32 i_flags;         /* File flags */
    union {
        struct {
            __le32 l_i_version;
        } linux1;
        struct {
            __u32 h_i_translator;
        } hurd1;
        struct {
            __u32 m_i_reserved1;
        } masix1;
    } osd1;                 /* OS dependent 1 */
/*0x28*/
    __le32 i_block[EXT4_N_BLOCKS]; /* Pointers to blocks */
    __le32 i_generation;    /* File version (for NFS) */
    __le32 i_file_acl_lo;   /* File ACL */
    __le32 i_size_high;
    __le32 i_obso_faddr;    /* Obsoleted fragment address */
    union {
        struct {
            __le16 l_i_blocks_high;   /* were l_i_reserved1 */
            __le16 l_i_file_acl_high;
            __le16 l_i_uid_high;      /* these 2 fields */
            __le16 l_i_gid_high;      /* were reserved2[0] */
            __u32  l_i_reserved2;
        } linux2;
        struct {
            __le16 h_i_reserved1;     /* Obsoleted fragment number/size which are removed in ext4 */
            __u16  h_i_mode_high;
            __u16  h_i_uid_high;
            __u16  h_i_gid_high;
            __u32  h_i_author;
        } hurd2;
        struct {
            __le16 h_i_reserved1;     /* Obsoleted fragment number/size which are removed in ext4 */
            __le16 m_i_file_acl_high;
            __u32  m_i_reserved2[2];
        } masix2;
    } osd2;                 /* OS dependent 2 */
    __le16 i_extra_isize;
    __le16 i_pad1;
    __le32 i_ctime_extra;   /* extra Change time (nsec << 2 | epoch) */
    __le32 i_mtime_extra;   /* extra Modification time (nsec << 2 | epoch) */
    __le32 i_atime_extra;   /* extra Access time (nsec << 2 | epoch) */
    __le32 i_crtime;        /* File Creation time */
    __le32 i_crtime_extra;  /* extra File Creation time (nsec << 2 | epoch) */
    __le32 i_version_hi;    /* high 32 bits for 64-bit version */
};
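The "(nsec << 2 | epoch)" comments on the *_extra fields suggest the decoding sketched below. The names are invented, and treating the base timestamp as a signed 32-bit value extended by the two epoch bits is an assumption about how the pieces combine. The sketch also joins i_size_lo and i_size_high into a 64-bit file size.

```c
#include <assert.h>
#include <stdint.h>

struct ext4_time {
    int64_t  sec;  /* seconds since the epoch */
    uint32_t nsec; /* nanoseconds within the second */
};

/* Combine a 32-bit timestamp with its *_extra companion field. */
static struct ext4_time decode_time(uint32_t base, uint32_t extra)
{
    struct ext4_time t;
    /* Low two bits of extra are assumed to extend the seconds past 32 bits. */
    t.sec = (int64_t)(int32_t)base + ((int64_t)(extra & 3u) << 32);
    /* The remaining 30 bits carry the nanoseconds. */
    t.nsec = extra >> 2;
    assert(t.nsec < 1000000000u || extra == 0xFFFFFFFFu || t.nsec >= 1000000000u);
    return t;
}

/* File size in bytes from i_size_lo and i_size_high. */
static uint64_t inode_size_bytes(uint32_t i_size_lo, uint32_t i_size_high)
{
    return (uint64_t)i_size_lo | ((uint64_t)i_size_high << 32);
}
```

The extra fields only exist when the on-disk inode is large enough, so a real reader would presumably consult i_extra_isize before trusting them.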
Directory Entries
struct ext4_dir_entry {
    __le32 inode;               /* Inode number */
    __le16 rec_len;             /* Directory entry length */
    __le16 name_len;            /* Name length */
    char   name[EXT4_NAME_LEN]; /* File name */
};

/*
 * The new version of the directory entry. Since EXT4 structures are
 * stored in intel byte order, and the name_len field could never be
 * bigger than 255 chars, it's safe to reclaim the extra byte for the
 * file_type field.
 */
struct ext4_dir_entry_2 {
    __le32 inode;               /* Inode number */
    __le16 rec_len;             /* Directory entry length */
    __u8   name_len;            /* Name length */
    __u8   file_type;
    char   name[EXT4_NAME_LEN]; /* File name */
};
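Because rec_len gives the length of each entry, a directory block can be scanned by hopping from one entry to the next. The sketch below looks up a name in a single block using the ext4_dir_entry_2 layout. The function name is invented, and it assumes that entries with inode 0 are unused and that rec_len holds a plain byte count; neither point is stated in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIRENT_HEADER_LEN 8u /* inode (4) + rec_len (2) + name_len (1) + file_type (1) */

static uint16_t le16_at(const uint8_t *p, size_t off)
{
    return (uint16_t)(p[off] | (p[off + 1] << 8));
}

static uint32_t le32_at(const uint8_t *p, size_t off)
{
    return (uint32_t)p[off] | ((uint32_t)p[off + 1] << 8) |
           ((uint32_t)p[off + 2] << 16) | ((uint32_t)p[off + 3] << 24);
}

/* Return the inode number for name, or 0 if absent or the block looks corrupt. */
static uint32_t dir_lookup(const uint8_t *block, uint32_t block_size,
                           const char *name)
{
    assert(block != NULL && name != NULL);
    size_t want = strlen(name);
    uint32_t off = 0;

    while (off + DIRENT_HEADER_LEN <= block_size) {
        uint32_t inode = le32_at(block, off);
        uint16_t rec_len = le16_at(block, off + 4);
        uint8_t name_len = block[off + 6];

        /* An entry must hold its header and name and stay inside the block. */
        if (rec_len < DIRENT_HEADER_LEN + name_len ||
            off + rec_len > block_size)
            return 0;
        if (inode != 0 && name_len == want &&
            memcmp(block + off + DIRENT_HEADER_LEN, name, want) == 0)
            return inode;
        off += rec_len; /* hop to the next entry */
    }
    return 0;
}
```

Names are compared by length and bytes because the on-disk name is not NUL-terminated.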
Extent Tree
(WIP)