This document attempts to describe the on-disk format for ext4 filesystems. The same general ideas should apply to ext2/3 filesystems as well, though they do not support all the features that ext4 supports, and the fields will be shorter.

NOTE: This is a work in progress, based on notes that the author (Djwong) made while picking apart a filesystem by hand. The data structure definitions were pulled out of fs/ext4/ext4.h in 2.6.38.

Miscellany

ext4 divides a storage device into an array of logical blocks both to reduce bookkeeping overhead and to increase throughput by forcing larger transfer sizes. Generally, the block size will be 4KiB (coincidentally, the same size as pages on x86 and the block layer's default block size), though the actual size is calculated as 2 ^ (10 + sb.s_log_block_size) bytes. Throughout this document, disk locations are given in terms of these logical blocks, not raw LBAs, and not 1024-byte blocks.
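As a quick illustration of the block-size formula above, here is a minimal C sketch. The helper name ext4_block_size is hypothetical (it is not a kernel function); it only evaluates 2 ^ (10 + sb.s_log_block_size).

```c
#include <stdint.h>

/* Hypothetical helper, not taken from the kernel: evaluates
 * block size = 2 ^ (10 + s_log_block_size), in bytes. */
uint64_t ext4_block_size(uint32_t s_log_block_size)
{
    return (uint64_t)1 << (10 + s_log_block_size);
}
```

For example, a stored value of 0 gives 1024-byte blocks and a stored value of 2 gives the usual 4KiB blocks.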

Block Groups

An ext4 file system is split into a series of block groups. To reduce performance difficulties due to fragmentation, the block allocator tries very hard to keep each file's blocks within the same group, thereby reducing seek times. The size of a block group is specified in sb.s_blocks_per_group blocks, though it can also be calculated as 8 * block_size_in_bytes, because the group's single-block data block bitmap holds one bit per block. With the default block size of 4KiB, each group will contain 32,768 blocks, for a length of 128MiB. The number of block groups is the size of the device divided by the size of a block group.
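The group-size arithmetic above can be sketched as follows. Both function names are hypothetical, and rounding the group count up (so that a partial trailing group still counts) is an assumption of this sketch rather than something stated in the text.

```c
#include <stdint.h>

/* Hypothetical helpers, for illustration only. */

/* A one-block bitmap holds 8 * block_size bits, one bit per block,
 * so a group spans at most that many blocks. */
uint32_t default_blocks_per_group(uint32_t block_size)
{
    return 8 * block_size;
}

/* Number of block groups needed to cover total_blocks.
 * Assumption: a partial trailing group counts as a group (round up). */
uint64_t block_group_count(uint64_t total_blocks, uint32_t blocks_per_group)
{
    return (total_blocks + blocks_per_group - 1) / blocks_per_group;
}
```

With 4096-byte blocks, default_blocks_per_group returns 32768, and 32768 * 4096 = 134,217,728 bytes per group.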

Layout

The layout of a standard block group is approximately as follows (each of these fields is discussed in a separate section below):

Group 0 Padding	ext4 Super Block	Group Descriptors	Reserved GDT Blocks	Data Block Bitmap	inode Bitmap	inode Table	Data Blocks
1024 bytes	1 block	many blocks	many blocks	1 block	1 block	many blocks	many more blocks

For the special case of block group 0, the first 1024 bytes are unused, to allow for the installation of x86 boot sectors and other oddities. The superblock will start at offset 1024 bytes, whichever block that happens to be (usually 0). However, if the block size is 1024 bytes, then block 0 is marked in use and the superblock goes in block 1. For all other block groups, there is no padding.
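To make the "whichever block that happens to be" rule concrete, the following sketch derives the block number and in-block offset of the primary superblock from the block size. The macro and function names are hypothetical.

```c
#include <stdint.h>

/* The primary superblock always starts 1024 bytes into the device. */
#define SUPERBLOCK_BYTE_OFFSET 1024u

/* Hypothetical helper: logical block that holds the primary superblock. */
uint32_t primary_superblock_block(uint32_t block_size)
{
    return SUPERBLOCK_BYTE_OFFSET / block_size;
}

/* Hypothetical helper: byte offset of the superblock inside that block. */
uint32_t primary_superblock_offset_in_block(uint32_t block_size)
{
    return SUPERBLOCK_BYTE_OFFSET % block_size;
}
```

For 1024-byte blocks this yields block 1 at offset 0; for 4096-byte blocks it yields block 0 at offset 1024.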

The ext4 driver primarily works with the superblock and the group descriptors that are found in block group 0. Redundant copies of the superblock and group descriptors are written to some of the block groups across the disk in case the beginning of the disk gets trashed, though not all block groups necessarily host a redundant copy (see the discussion of the sparse_super feature in the Super Block section for more details). If the group does not have a redundant copy, the block group begins with the data block bitmap. Note also that when the filesystem is freshly formatted, mkfs will allocate "reserved GDT block" space after the block group descriptors and before the start of the block bitmaps to allow for future expansion of the filesystem. By default, a filesystem is allowed to grow to 1024 times its original size.

Flexible Block Groups

Starting in ext4, there is a new feature called flexible block groups (flex_bg). In a flex_bg, several block groups are tied together as one logical block group; the bitmap spaces and the inode table space in the first block group of the flex_bg are expanded to include the bitmaps and inode tables of all other block groups in the flex_bg. For example, if the flex_bg size is 4, then group 0 will contain (in order) the superblock, group descriptors, data block bitmaps for groups 0-3, inode bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining space in group 0 is for file data. The effect of this is to group the block metadata close together for faster loading, and to enable large files to be contiguous on disk. Backup copies of the superblock and group descriptors are always at the beginning of block groups, even if flex_bg is enabled. The number of block groups that make up a flex_bg is given by 2 ^ sb.s_log_groups_per_flex.
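A small sketch of the flex_bg arithmetic follows. The function names are hypothetical; the second helper finds the first group of the flex_bg that a given group belongs to, which per the description above is where that flex_bg's bitmaps and inode tables are packed.

```c
#include <stdint.h>

/* Hypothetical helper: number of block groups per flex_bg,
 * i.e. 2 ^ s_log_groups_per_flex. */
uint32_t flex_bg_size(uint8_t s_log_groups_per_flex)
{
    return (uint32_t)1 << s_log_groups_per_flex;
}

/* Hypothetical helper: first block group of the flex_bg containing
 * `group`, found by rounding the group number down to a multiple of
 * the flex_bg size. */
uint32_t flex_bg_first_group(uint32_t group, uint8_t s_log_groups_per_flex)
{
    return (group >> s_log_groups_per_flex) << s_log_groups_per_flex;
}
```

With s_log_groups_per_flex = 2, groups 0-3 map to group 0 and groups 4-7 map to group 4.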

Meta Block Groups

Normally, a complete copy of the entire block group descriptor table is recorded after every copy of the superblock. Because the whole table must fit within a single block group, the default group size of 2^27 bytes (128MiB) and 64-byte group descriptors allow at most 2^27 / 64 = 2^21 block groups, which limits the filesystem to 2^21 * 2^27 = 2^48 bytes, or 256TiB. With the meta block group feature enabled, each block group contains redundant copies of the block group descriptor for that group, thereby enabling the creation of the full 2^32 block groups, for a total size of 2^32 * 2^27 = 2^59 bytes, or 512PiB.
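The limit arithmetic in this section can be checked with a short sketch. The function names are hypothetical, and the sketch assumes the constraint that the full descriptor table has to fit inside one block group when meta block groups are not in use.

```c
#include <stdint.h>

/* Hypothetical helper: how many descriptors fit in one block group,
 * which bounds the number of groups without meta_bg (assumption: the
 * whole descriptor table must fit inside a single group). */
uint64_t max_groups_without_meta_bg(uint64_t group_bytes, uint32_t desc_size)
{
    return group_bytes / desc_size;
}

/* Hypothetical helper: filesystem size in bytes for a given group count. */
uint64_t filesystem_bytes(uint64_t groups, uint64_t group_bytes)
{
    return groups * group_bytes;
}
```

With 2^27-byte groups and 64-byte descriptors this gives 2^21 groups and 2^48 bytes; with 2^32 groups the product is 2^59 bytes.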

Lazy Block Group Initialization

Also new in ext4, the inode bitmap and inode tables in a group may be left uninitialized if the corresponding flag is set in the group descriptor. This reduces mkfs time considerably. If the group descriptor checksum feature is enabled, then even the group descriptors can be uninitialized.

The Super Block

The superblock records various information about the enclosing filesystem, such as block counts, inode counts, supported features, maintenance information, and more.

If the sparse_super feature flag is set, redundant copies of the superblock and group descriptors are kept only in the groups whose group number is either 0 or a power of 3, 5, or 7. If the flag is not set, redundant copies are kept in all groups.
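The sparse_super rule above can be expressed as a small predicate. The function names are hypothetical. Note that group 1 qualifies because 1 is the zeroth power of 3, 5, and 7.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if n == base^k for some k >= 0 (so n == 1 counts). */
bool is_power_of(uint32_t n, uint32_t base)
{
    while (n > 1 && n % base == 0)
        n /= base;
    return n == 1;
}

/* Hypothetical helper: does `group` carry a copy of the superblock
 * and group descriptors? Group 0 holds the primary copy. */
bool group_has_superblock(uint32_t group, bool sparse_super)
{
    if (!sparse_super || group == 0)
        return true;
    return is_power_of(group, 3) || is_power_of(group, 5) ||
           is_power_of(group, 7);
}
```

Under this rule the first few groups with copies are 0, 1, 3, 5, 7, 9, 25, 27, and 49.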

The ext4 superblock (struct ext4_super_block) is laid out as follows:

Offset

Field Type

Name

Description

0x0

__le32

s_inodes_count

Total inode count.

0x4

__le32

s_blocks_count_lo

Total block count.

0x8

__le32

s_r_blocks_count_lo

Reserved block count.

0xC

__le32

s_free_blocks_count_lo

Free block count.

0x10

__le32

s_free_inodes_count

Free inode count.

0x14

__le32

s_first_data_block

First data block.

0x18

__le32

s_log_block_size

Block size is 2 ^ (10 + s_log_block_size).

0x1C

__le32

s_obso_log_frag_size

(Obsolete) fragment size.

0x20

__le32

s_blocks_per_group

Blocks per group.

0x24

__le32

s_obso_frags_per_group

(Obsolete) fragments per group.

0x28

__le32

s_inodes_per_group

Inodes per group.

0x2C

__le32

s_mtime

Mount time, in seconds since the epoch.

0x30

__le32

s_wtime

Write time, in seconds since the epoch.

0x34

__le16

s_mnt_count

Number of mounts since the last fsck.

0x36

__le16

s_max_mnt_count

Number of mounts beyond which a fsck is needed.

0x38

__le16

s_magic

Magic signature, 0xEF53

0x3A

__le16

s_state

File system state. Valid values are:

0x0001	Cleanly unmounted
0x0002	Errors detected
0x0004	Orphans being recovered

0x3C

__le16

s_errors

Behaviour when detecting errors. One of:

1	Continue
2	Remount read-only
3	Panic

0x3E

__le16

s_minor_rev_level

Minor revision level.

0x40

__le32

s_lastcheck

Time of last check, in seconds since the epoch.

0x44

__le32

s_checkinterval

Maximum time between checks, in seconds.

0x48

__le32

s_creator_os

OS. One of:

0	Linux
1	Hurd
2	Masix
3	FreeBSD
4	Lites

0x4C

__le32

s_rev_level

Revision level. One of:

0	Original format
1	v2 format w/ dynamic inode sizes

0x50

__le16

s_def_resuid

Default uid for reserved blocks.

0x52

__le16

s_def_resgid

Default gid for reserved blocks.

These fields are for EXT4_DYNAMIC_REV superblocks only.

Note: the difference between the compatible feature set and the incompatible feature set is that if there is a bit set in the incompatible feature set that the kernel doesn't know about, it should refuse to mount the filesystem.

e2fsck's requirements are more strict; if it doesn't know about a feature in either the compatible or incompatible feature set, it must abort and not try to meddle with things it doesn't understand...

0x54

__le32

s_first_ino

First non-reserved inode.

0x58

__le16

s_inode_size

Size of inode structure, in bytes.

0x5A

__le16

s_block_group_nr

Block group # of this superblock.

0x5C

__le32

s_feature_compat

Compatible feature set flags. Kernel can still read/write this fs even if it doesn't understand a flag; fsck should not do that. Any of:

0x1	Directory preallocation.
0x2	"imagic inodes". Not clear from the code what this does.
0x4	Has a journal.
0x8	Supports extended attributes.
0x10	Has reserved GDT blocks for filesystem expansion.
0x20	Has directory indices.
0x40	"Lazy BG". Not in 2.6.38, seems to have been for uninitialized block groups?
0x80	"Exclude inode". Not documented or used outside of e2fsprogs.

0x60

__le32

s_feature_incompat

Incompatible feature set. If the kernel or fsck doesn't understand one of these bits, it should stop. Any of:

0x1	Compression.
0x2	Directory entries record the file type. See ext4_dir_entry_2 below.
0x4	Filesystem needs recovery.
0x8	Filesystem has a separate journal device.
0x10	Meta block groups. See the earlier discussion of this feature.
0x40	Files in this filesystem use extents.
0x80	Enable a filesystem size of 2^64 blocks.
0x100	Multiple mount protection. Not implemented.
0x200	Flexible block groups. See the earlier discussion of this feature.
0x400	Inodes can be used for large extended attributes. (Not implemented?)
0x1000	Data in directory entry. (Not implemented?)

0x64

__le32

s_feature_ro_compat

Readonly-compatible feature set. If the kernel doesn't understand one of these bits, it can still mount read-only. Any of:

0x1	Sparse superblocks. See the earlier discussion of this feature.
0x2	This filesystem has been used to store a file greater than 2GB.
0x8	This filesystem has files whose sizes are represented in units of logical blocks, not 512-byte sectors. This implies a very large file indeed!
0x10	Group descriptors have checksums. In addition to detecting corruption, this is useful for lazy formatting with uninitialized groups.
0x20	Indicates that the old ext3 32,000 subdirectory limit no longer applies.
0x40	Indicates that large inodes exist on this filesystem.
0x80	This filesystem has a snapshot.

0x68

__u8

s_uuid[16]

128-bit UUID for volume.

0x78

char

s_volume_name[16]

Volume label.

0x88

char

s_last_mounted[64]

Directory where filesystem was last mounted.

0xC8

__le32

s_algorithm_usage_bitmap

For compression

Performance hints. Directory preallocation should only happen if the EXT4_FEATURE_COMPAT_DIR_PREALLOC flag is on.

0xCC

__u8

s_prealloc_blocks

# of blocks to try to preallocate for ... files?

0xCD

__u8

s_prealloc_dir_blocks

# of blocks to preallocate for directories.

0xCE

__le16

s_reserved_gdt_blocks

Number of reserved GDT entries for future filesystem expansion.

Journaling support valid if EXT4_FEATURE_COMPAT_HAS_JOURNAL set.

0xD0

__u8

s_journal_uuid[16]

UUID of journal superblock

0xE0

__le32

s_journal_inum

inode number of journal file.

0xE4

__le32

s_journal_dev

Device number of journal file, if the external journal feature flag is set.

0xE8

__le32

s_last_orphan

Start of list of orphaned inodes to delete.

0xEC

__le32

s_hash_seed[4]

HTREE hash seed.

0xFC

__u8

s_def_hash_version

Default hash algorithm to use for directory hashes. One of:

0x0	Legacy.
0x1	Half MD4.
0x2	Tea.
0x3	Legacy, unsigned.
0x4	Half MD4, unsigned.
0x5	Tea, unsigned.

0xFD

__u8

s_jnl_backup_type

?

0xFE

__le16

s_desc_size

Size of group descriptors, in bytes, if the 64bit incompat feature flag is set.

0x100

__le32

s_default_mount_opts

Default mount options. Any of:

0x0001	Print debugging info upon (re)mount.
0x0002	New files take the gid of the containing directory (instead of the fsgid of the current process).
0x0004	Support userspace-provided extended attributes.
0x0008	Support POSIX access control lists (ACLs).
0x0010	Do not support 32-bit UIDs.
0x0020	All data and metadata are committed to the journal.
0x0040	All data are flushed to the disk before metadata are committed to the journal.
0x0060	Data ordering is not preserved; data may be written after the metadata has been written.
0x0100	Disable write flushes.
0x0200	Track which blocks in a filesystem are metadata and therefore should not be used as data blocks.
0x0400	Enable DISCARD support, where the storage device is told about blocks becoming unused.
0x0800	Disable delayed allocation.

0x104

__le32

s_first_meta_bg

First metablock block group, if the meta_bg feature is enabled.

0x108

__le32

s_mkfs_time

When the filesystem was created, in seconds since the epoch.

0x10C

__le32

s_jnl_blocks[17]

Backup copy of the first 68 bytes of the journal inode.

64bit support valid if EXT4_FEATURE_INCOMPAT_64BIT is set.

0x150

__le32

s_blocks_count_hi

High 32-bits of the block count.

0x154

__le32

s_r_blocks_count_hi

High 32-bits of the reserved block count.

0x158

__le32

s_free_blocks_count_hi

High 32-bits of the free block count.

0x15C

__le16

s_min_extra_isize

All inodes have at least # bytes.

0x15E

__le16

s_want_extra_isize

New inodes should reserve # bytes.

0x160

__le32

s_flags

Miscellaneous flags??

0x164

__le16

s_raid_stride

RAID stride. This is the number of logical blocks read from or written to the disk before moving to the next disk. This affects the placement of filesystem metadata, which will hopefully make RAID storage faster.

0x166

__le16

s_mmp_interval

Number of seconds to wait in multi-mount protection (MMP) checking. In theory, MMP is a mechanism to record in the superblock which host and device have mounted the filesystem, in order to prevent multiple mounts. This feature does not seem to be implemented...

0x168

__le64

s_mmp_block

Block # for multi-mount protection data.

0x170

__le32

s_raid_stripe_width

RAID stripe width. This is the number of logical blocks read from or written to the disk before coming back to the current disk. This is used by the block allocator to try to reduce the number of read-modify-write operations in a RAID5/6.

0x174

__u8

s_log_groups_per_flex

Size of a flexible block group is 2 ^ s_log_groups_per_flex.

0x175

__u8

s_reserved_char_pad

0x176

__le16

s_reserved_pad

0x178

__le64

s_kbytes_written

Number of kilobytes written to this filesystem over its lifetime.

0x180

__le32

s_snapshot_inum

Inode number of the active snapshot.

0x184

__le32

s_snapshot_id

Sequential ID of the active snapshot.

0x188

__le64

s_snapshot_r_blocks_count

Number of blocks reserved for the active snapshot's future use.

0x190

__le32

s_snapshot_list

Inode number of the head of the on-disk snapshot list.

0x194

__le32

s_error_count

Number of filesystem errors.

0x198

__le32

s_first_error_time

Time of the first error.

0x19C

__le32

s_first_error_ino

Inode involved in the first error.

0x1A0

__le64

s_first_error_block

Block involved in the first error.

0x1A8

__u8

s_first_error_func[32]

Name of the function where the first error happened.

0x1C8

__le32

s_first_error_line

Line number where the first error happened.

0x1CC

__le32

s_last_error_time

Time of the most recent error.

0x1D0

__le32

s_last_error_ino

Inode involved in the most recent error.

0x1D4

__le32

s_last_error_line

Line number where the most recent error happened.

0x1D8

__le64

s_last_error_block

Block involved in the most recent error.

0x1E0

__u8

s_last_error_func[32]

Name of the function where the most recent error happened.

0x200

__u8

s_mount_opts[64]

0x240

__le32

s_reserved[112]

Padding to the end of the 1024-byte superblock.
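As a hedged illustration of how the offsets in the table above might be used, the sketch below reads a few fields from a raw 1024-byte superblock buffer. All function and macro names are hypothetical. It assumes the buffer starts at the superblock itself (device offset 1024), and that the high half of the block count is only meaningful when the 64bit incompat flag (0x80) is set.

```c
#include <stdbool.h>
#include <stdint.h>

#define SB_MAGIC          0xEF53u /* expected value of s_magic */
#define SB_INCOMPAT_64BIT 0x80u   /* 64bit bit in s_feature_incompat */

/* Read little-endian values byte by byte, independent of host order. */
uint16_t get_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* s_magic lives at offset 0x38. */
bool sb_magic_ok(const uint8_t *sb)
{
    return get_le16(sb + 0x38) == SB_MAGIC;
}

/* Combine s_blocks_count_lo (0x4) with s_blocks_count_hi (0x150),
 * consulting s_feature_incompat (0x60) for the 64bit flag. */
uint64_t sb_blocks_count(const uint8_t *sb)
{
    uint64_t count = get_le32(sb + 0x4);

    if (get_le32(sb + 0x60) & SB_INCOMPAT_64BIT)
        count |= (uint64_t)get_le32(sb + 0x150) << 32;
    return count;
}
```

Reading byte by byte avoids alignment and endianness surprises on hosts that are not little-endian.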

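The rules for the three feature sets in the superblock (compatible, incompatible, and read-only compatible) can be summarized in a short decision function. This is only a sketch of the policy described in the field notes; the enum and function names are hypothetical, and the "known" masks stand for whatever set of flags a given implementation understands.

```c
#include <stdint.h>

enum mount_decision {
    MOUNT_READ_WRITE, /* every relevant flag is understood */
    MOUNT_READ_ONLY,  /* unknown ro_compat flag: read-only at most */
    MOUNT_REFUSE      /* unknown incompat flag: do not mount */
};

/* Hypothetical helper. Unknown compat flags never block a mount,
 * so s_feature_compat is not consulted here. */
enum mount_decision check_features(uint32_t s_feature_incompat,
                                   uint32_t s_feature_ro_compat,
                                   uint32_t known_incompat,
                                   uint32_t known_ro_compat)
{
    if (s_feature_incompat & ~known_incompat)
        return MOUNT_REFUSE;
    if (s_feature_ro_compat & ~known_ro_compat)
        return MOUNT_READ_ONLY;
    return MOUNT_READ_WRITE;
}
```

A checker such as e2fsck would be stricter and would also reject unknown bits in the compatible set.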

Block Group Descriptors

struct ext4_group_desc
{
/*0x0*/ __le32  bg_block_bitmap_lo;     /* Blocks bitmap block */
        __le32  bg_inode_bitmap_lo;     /* Inodes bitmap block */
        __le32  bg_inode_table_lo;      /* Inodes table block */
        __le16  bg_free_blocks_count_lo;/* Free blocks count */
        __le16  bg_free_inodes_count_lo;/* Free inodes count */
/*10*/  __le16  bg_used_dirs_count_lo;  /* Directories count */
        __le16  bg_flags;               /* EXT4_BG_flags (INODE_UNINIT, etc) */
        __u32   bg_reserved[2];         /* Likely block/inode bitmap checksum */
        __le16  bg_itable_unused_lo;    /* Unused inodes count */
        __le16  bg_checksum;            /* crc16(sb_uuid+group+desc) */
/*20*/  __le32  bg_block_bitmap_hi;     /* Blocks bitmap block MSB */
        __le32  bg_inode_bitmap_hi;     /* Inodes bitmap block MSB */
        __le32  bg_inode_table_hi;      /* Inodes table block MSB */
        __le16  bg_free_blocks_count_hi;/* Free blocks count MSB */
        __le16  bg_free_inodes_count_hi;/* Free inodes count MSB */
/*30*/  __le16  bg_used_dirs_count_hi;  /* Directories count MSB */
        __le16  bg_itable_unused_hi;    /* Unused inodes count MSB */
        __u32   bg_reserved2[3];
/*40*/
};
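The sketch below shows one way to find a group's descriptor on disk and to join the _lo and _hi halves of a 64-bit field. The names are hypothetical. It assumes that meta block groups are not in use, that the descriptor table starts in the block right after the one holding the primary superblock (s_first_data_block + 1), and that the _hi fields are only meaningful when descriptors are at least 64 bytes long.

```c
#include <stdint.h>

struct desc_location {
    uint64_t block;  /* logical block holding the descriptor */
    uint32_t offset; /* byte offset of the descriptor in that block */
};

/* Hypothetical helper; assumes no meta_bg and a descriptor table that
 * begins at block s_first_data_block + 1. */
struct desc_location group_desc_location(uint32_t group, uint32_t block_size,
                                         uint32_t desc_size,
                                         uint32_t s_first_data_block)
{
    uint32_t per_block = block_size / desc_size;
    struct desc_location loc;

    loc.block = (uint64_t)s_first_data_block + 1 + group / per_block;
    loc.offset = (group % per_block) * desc_size;
    return loc;
}

/* Hypothetical helper: join e.g. bg_block_bitmap_lo and
 * bg_block_bitmap_hi, ignoring the high half for short descriptors. */
uint64_t combine_lo_hi(uint32_t lo, uint32_t hi, uint32_t desc_size)
{
    if (desc_size >= 64)
        return ((uint64_t)hi << 32) | lo;
    return lo;
}
```

With 4096-byte blocks and 32-byte descriptors, 128 descriptors fit per block, so group 129 lands in the second descriptor block at offset 32.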

Block and inode Bitmaps

The data block bitmap tracks the usage of data blocks within the block group.

The inode bitmap records which entries in the inode table are in use.
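Testing a bit in either bitmap can be sketched as below. The function name is hypothetical, and the sketch assumes least-significant-bit-first ordering within each byte, so that bit 0 of byte 0 describes the first block (or inode table entry) of the group.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: is entry `index` marked in use?
 * Assumes bits are numbered from the least significant bit of byte 0. */
bool bitmap_test(const uint8_t *bitmap, uint32_t index)
{
    return (bitmap[index / 8] >> (index % 8)) & 1;
}
```

For a bitmap whose first two bytes are 0x05 and 0x80, entries 0, 2, and 15 are in use.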

Inode Table

The inode table is a linear array of struct ext4_inode. The table is sized to have enough blocks to cover sb.s_inode_size * sb.s_inodes_per_group bytes.
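Because the table is a linear array, locating an inode is simple arithmetic. The sketch below uses hypothetical names and relies on inode numbers starting at 1, so inode n is entry (n - 1) counted across all groups.

```c
#include <stdint.h>

struct inode_location {
    uint32_t group;       /* block group whose table holds the inode */
    uint32_t index;       /* entry index within that group's table */
    uint64_t byte_offset; /* offset from the start of that table */
};

/* Hypothetical helper; inode numbers are 1-based. */
struct inode_location locate_inode(uint32_t ino, uint32_t s_inodes_per_group,
                                   uint16_t s_inode_size)
{
    struct inode_location loc;

    loc.group = (ino - 1) / s_inodes_per_group;
    loc.index = (ino - 1) % s_inodes_per_group;
    loc.byte_offset = (uint64_t)loc.index * s_inode_size;
    return loc;
}

/* Blocks needed to cover s_inode_size * s_inodes_per_group bytes. */
uint32_t inode_table_blocks(uint32_t s_inodes_per_group, uint16_t s_inode_size,
                            uint32_t block_size)
{
    uint64_t bytes = (uint64_t)s_inodes_per_group * s_inode_size;

    return (uint32_t)((bytes + block_size - 1) / block_size);
}
```

The byte offset is relative to the first block of the group's inode table, whose location comes from the group descriptor.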

struct ext4_inode {
/*0x0*/ __le16  i_mode;         /* File mode */
        __le16  i_uid;          /* Low 16 bits of Owner Uid */
        __le32  i_size_lo;      /* Size in bytes */
        __le32  i_atime;        /* Access time */
        __le32  i_ctime;        /* Inode Change time */
/*10*/  __le32  i_mtime;        /* Modification time */
        __le32  i_dtime;        /* Deletion Time */
        __le16  i_gid;          /* Low 16 bits of Group Id */
        __le16  i_links_count;  /* Links count */
        __le32  i_blocks_lo;    /* Blocks count */
/*20*/  __le32  i_flags;        /* File flags */
        union {
                struct {
                        __le32  l_i_version;
                } linux1;
                struct {
                        __u32  h_i_translator;
                } hurd1;
                struct {
                        __u32  m_i_reserved1;
                } masix1;
        } osd1;                         /* OS dependent 1 */
/*28*/  __le32  i_block[EXT4_N_BLOCKS];/* Pointers to blocks */
        __le32  i_generation;   /* File version (for NFS) */
        __le32  i_file_acl_lo;  /* File ACL */
        __le32  i_size_high;
        __le32  i_obso_faddr;   /* Obsoleted fragment address */
        union {
                struct {
                        __le16  l_i_blocks_high; /* were l_i_reserved1 */
                        __le16  l_i_file_acl_high;
                        __le16  l_i_uid_high;   /* these 2 fields */
                        __le16  l_i_gid_high;   /* were reserved2[0] */
                        __u32   l_i_reserved2;
                } linux2;
                struct {
                        __le16  h_i_reserved1;  /* Obsoleted fragment number/size which are removed in ext4 */
                        __u16   h_i_mode_high;
                        __u16   h_i_uid_high;
                        __u16   h_i_gid_high;
                        __u32   h_i_author;
                } hurd2;
                struct {
                        __le16  h_i_reserved1;  /* Obsoleted fragment number/size which are removed in ext4 */
                        __le16  m_i_file_acl_high;
                        __u32   m_i_reserved2[2];
                } masix2;
        } osd2;                         /* OS dependent 2 */
        __le16  i_extra_isize;
        __le16  i_pad1;
        __le32  i_ctime_extra;  /* extra Change time      (nsec << 2 | epoch) */
        __le32  i_mtime_extra;  /* extra Modification time(nsec << 2 | epoch) */
        __le32  i_atime_extra;  /* extra Access time      (nsec << 2 | epoch) */
        __le32  i_crtime;       /* File Creation time */
        __le32  i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
        __le32  i_version_hi;   /* high 32 bits for 64-bit version */
};
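The "(nsec << 2 | epoch)" comments on the *_extra fields suggest the decoding sketched below. The names are hypothetical, and the sketch assumes the 32-bit base field is a signed seconds count whose range is extended by the two low "epoch" bits of the extra field.

```c
#include <stdint.h>

struct decoded_time {
    int64_t sec;   /* seconds since the epoch */
    uint32_t nsec; /* nanoseconds within the second */
};

/* Hypothetical helper: combine e.g. i_mtime with i_mtime_extra.
 * Assumption: the low 2 bits of `extra` extend the seconds field
 * beyond 32 bits and the upper 30 bits hold nanoseconds. */
struct decoded_time decode_time(uint32_t base, uint32_t extra)
{
    struct decoded_time t;

    t.sec = (int64_t)(int32_t)base + ((int64_t)(extra & 3) << 32);
    t.nsec = extra >> 2;
    return t;
}
```

Inodes that are too small to hold the extra fields (see i_extra_isize) only have one-second resolution.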

Directory Entries

struct ext4_dir_entry {
        __le32  inode;                  /* Inode number */
        __le16  rec_len;                /* Directory entry length */
        __le16  name_len;               /* Name length */
        char    name[EXT4_NAME_LEN];    /* File name */
};

/*
 * The new version of the directory entry.  Since EXT4 structures are
 * stored in intel byte order, and the name_len field could never be
 * bigger than 255 chars, it's safe to reclaim the extra byte for the
 * file_type field.
 */
struct ext4_dir_entry_2 {
        __le32  inode;                  /* Inode number */
        __le16  rec_len;                /* Directory entry length */
        __u8    name_len;               /* Name length */
        __u8    file_type;
        char    name[EXT4_NAME_LEN];    /* File name */
};
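Although the structures declare name[EXT4_NAME_LEN], on disk each entry only occupies rec_len bytes, and rec_len is what leads to the next entry. The sketch below walks the ext4_dir_entry_2 records in one directory block and counts the live ones. All names are hypothetical. It assumes a block smaller than 64KiB, that an inode number of 0 marks an unused entry, and that rec_len is a multiple of 4 covering at least the 8-byte header plus the name.

```c
#include <stddef.h>
#include <stdint.h>

uint16_t dirent_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

uint32_t dirent_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: count in-use entries in one directory block.
 * Stops early if an entry looks corrupt. */
size_t count_dir_entries(const uint8_t *block, size_t block_size)
{
    size_t off = 0, count = 0;

    while (off + 8 <= block_size) {
        uint32_t inode = dirent_le32(block + off);
        uint16_t rec_len = dirent_le16(block + off + 4);
        uint8_t name_len = block[off + 6];

        if (rec_len < 8 || rec_len % 4 != 0 ||
            off + rec_len > block_size ||
            (size_t)name_len + 8 > rec_len)
            break; /* corrupt entry: give up on this block */
        if (inode != 0)
            count++;
        off += rec_len;
    }
    return count;
}
```

The last entry in a block normally has a rec_len that reaches the end of the block, which is why the loop ends exactly at block_size for a well-formed block.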

Extent Tree

(WIP)

Ext4 Disk Layout
