Ext4 Disk Layout
Revision as of 23:45, 30 March 2011
This document attempts to describe the on-disk format for ext4 filesystems. The same general ideas should apply to ext2/3 filesystems as well, though they do not support all the features that ext4 supports, and the fields will be shorter.
NOTE: This is a work in progress, based on notes that the author (Djwong) made while picking apart a filesystem by hand. The data structure definitions were pulled out of fs/ext4/ext4.h in 2.6.38.
Miscellany
ext4 divides a storage device into an array of logical blocks both to reduce bookkeeping overhead and to increase throughput by forcing larger transfer sizes. Generally, the block size will be 4KiB (coincidentally, the same size as pages on x86 and the block layer's default block size), though the actual size is calculated as 2 ^ (10 + sb.s_log_block_size) bytes. Throughout this document, disk locations are given in terms of these logical blocks, not raw LBAs, and not 1024-byte blocks. For the sake of convenience, the logical block size will be referred to as $block_size throughout the rest of the document.
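The block size formula above can be sketched in a few lines of C. The function name is hypothetical and not taken from the kernel; only the formula 2 ^ (10 + s_log_block_size) comes from the text.

```c
#include <stdint.h>

/* Logical block size in bytes: 2 ^ (10 + s_log_block_size).
 * Hypothetical helper name; the argument is the raw superblock field. */
uint64_t ext4_block_size(uint32_t s_log_block_size)
{
    return (uint64_t)1024 << s_log_block_size;
}
```

With s_log_block_size = 2 this yields the usual 4KiB block.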
Block Groups
An ext4 file system is split into a series of block groups. To reduce performance difficulties due to fragmentation, the block allocator tries very hard to keep each file's blocks within the same group, thereby reducing seek times. The size of a block group is specified in sb.s_blocks_per_group blocks, though it can also be calculated as 8 * block_size_in_bytes. With the default block size of 4KiB, each group will contain 32,768 blocks, for a length of 128MiB. The number of block groups is the size of the device divided by the size of a block group.
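As a rough sketch of the block group arithmetic, the following C helpers compute the group size and group count. All names are hypothetical, and rounding the group count up (so that a partial final group is counted) is an assumption that the text does not state.

```c
#include <stdint.h>

/* One bitmap block holds 8 * block_size bits, one bit per block in the group. */
uint64_t ext4_blocks_per_group(uint64_t block_size)
{
    return 8 * block_size;
}

/* Size of one block group in bytes. */
uint64_t ext4_group_bytes(uint64_t block_size)
{
    return ext4_blocks_per_group(block_size) * block_size;
}

/* Number of block groups. Rounding up is an assumption of this sketch. */
uint64_t ext4_group_count(uint64_t total_blocks, uint64_t blocks_per_group)
{
    return (total_blocks + blocks_per_group - 1) / blocks_per_group;
}
```

For 4KiB blocks this gives 32,768 blocks per group and 134,217,728 bytes (128MiB) per group.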
Layout
The layout of a standard block group is approximately as follows (each of these fields is discussed in a separate section below):
Group 0 Padding | ext4 Super Block | Group Descriptors | Reserved GDT Blocks | Data Block Bitmap | inode Bitmap | inode Table | Data Blocks |
1024 bytes | 1 block | many blocks | many blocks | 1 block | 1 block | many blocks | many more blocks |
For the special case of block group 0, the first 1024 bytes are unused, to allow for the installation of x86 boot sectors and other oddities. The superblock will start at offset 1024 bytes, whichever block that happens to be (usually 0). However, if for some reason the block size = 1024, then block 0 is marked in use and the superblock goes in block 1. For all other block groups, there is no padding.
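The rule for where the primary superblock lands can be written down directly: it always starts 1024 bytes into the device, so its block number and its offset inside that block follow from the block size. The helper names below are hypothetical.

```c
#include <stdint.h>

#define EXT4_SB_BYTE_OFFSET 1024u  /* superblock starts 1024 bytes into the device */

/* Logical block that holds the primary superblock. */
uint64_t ext4_sb_block(uint64_t block_size)
{
    return EXT4_SB_BYTE_OFFSET / block_size;
}

/* Byte offset of the superblock inside that block. */
uint64_t ext4_sb_offset_in_block(uint64_t block_size)
{
    return EXT4_SB_BYTE_OFFSET % block_size;
}
```

With 1024-byte blocks the superblock occupies block 1 at offset 0; with any larger block size it sits in block 0 at offset 1024.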
The ext4 driver primarily works with the superblock and the group descriptors that are found in block group 0. Redundant copies of the superblock and group descriptors are written to some of the block groups across the disk in case the beginning of the disk gets trashed, though not all block groups necessarily host a redundant copy (see following paragraph for more details). If the group does not have a redundant copy, the block group begins with the data block bitmap. Note also that when the filesystem is freshly formatted, mkfs will allocate "reserve GDT block" space after the block group descriptors and before the start of the block bitmaps to allow for future expansion of the filesystem. By default, a filesystem is allowed to increase in size by a factor of 1024x over the original filesystem size.
Flexible Block Groups
Starting in ext4, there is a new feature called flexible block groups (flex_bg). In a flex_bg, several block groups are tied together as one logical block group; the bitmap spaces and the inode table space in the first block group of the flex_bg are expanded to include the bitmaps and inode tables of all other block groups in the flex_bg. For example, if the flex_bg size is 4, then group 0 will contain (in order) the superblock, group descriptors, data block bitmaps for groups 0-3, inode bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining space in group 0 is for file data. The effect of this is to group the block metadata close together for faster loading, and to enable large files to be contiguous on disk. Backup copies of the superblock and group descriptors are always at the beginning of block groups, even if flex_bg is enabled. The number of block groups that make up a flex_bg is given by 2 ^ sb.s_log_groups_per_flex.
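A minimal sketch of the flex_bg arithmetic follows. The function names are hypothetical. The sketch assumes flex groups are aligned runs of 2 ^ s_log_groups_per_flex consecutive block groups, which matches the "groups 0-3" example above.

```c
#include <stdint.h>

/* Number of block groups tied together in one flex_bg. */
uint32_t ext4_groups_per_flex(uint8_t s_log_groups_per_flex)
{
    return (uint32_t)1 << s_log_groups_per_flex;
}

/* Index of the flex group that a block group belongs to. */
uint32_t ext4_flex_group(uint32_t group, uint8_t s_log_groups_per_flex)
{
    return group >> s_log_groups_per_flex;
}

/* First block group of that flex group; this is the group whose bitmap and
 * inode table areas are expanded to cover the other members. */
uint32_t ext4_flex_first_group(uint32_t group, uint8_t s_log_groups_per_flex)
{
    return (group >> s_log_groups_per_flex) << s_log_groups_per_flex;
}
```

With s_log_groups_per_flex = 2, block group 5 belongs to flex group 1, whose first member is block group 4.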
Meta Block Groups
Normally, a complete copy of the entire block group descriptor table is recorded after every copy of the superblock. Because that table has to fit inside a single block group, the default group size of 2^27 bytes (128MiB) and 64-byte group descriptors impose a limit of 2^27 / 64 = 2^21 block groups, or 256TiB. With the meta block group feature enabled, each block group contains redundant copies of the block group descriptor for that group, thereby enabling the creation of the full 2^32 block groups, for a total size of 2^32 * 2^27 bytes = 512PiB.
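The size limits can be checked with a short calculation. This sketch assumes the limit comes from the descriptor table having to fit in one block group; the helper names are hypothetical. Note that 2^32 groups of 2^27 bytes each come to 2^59 bytes, which is 512PiB.

```c
#include <stdint.h>

/* Number of group descriptors that fit into one block group
 * (assumed to be the limiting factor without meta_bg). */
uint64_t ext4_max_groups_single_gdt(uint64_t group_bytes, uint64_t desc_size)
{
    return group_bytes / desc_size;
}

/* Total filesystem size in bytes for a given number of groups. */
uint64_t ext4_fs_bytes(uint64_t groups, uint64_t group_bytes)
{
    return groups * group_bytes;
}
```

Both limits fit comfortably in 64 bits: 2^48 bytes (256TiB) without meta_bg and 2^59 bytes with it.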
Lazy Block Group Initialization
New also for ext4, the inode bitmap and inode tables in a group are uninitialized if the corresponding flag is set in the group descriptor. This is to reduce mkfs time considerably. If the group descriptor checksum feature is enabled, then even the group descriptors can be uninitialized.
The Super Block
The superblock records various information about the enclosing filesystem, such as block counts, inode counts, supported features, maintenance information, and more.
If the sparse_super feature flag is set, redundant copies of the superblock and group descriptors are kept only in the groups whose group number is either 0 or a power of 3, 5, or 7. If the flag is not set, redundant copies are kept in all groups.
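The sparse_super rule can be sketched as a predicate on the group number. The names are hypothetical; group 1 counts as a power (3^0 = 1), so it also carries a backup under this rule.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if n equals base^k for some k >= 0. */
static bool is_power_of(uint32_t n, uint32_t base)
{
    uint64_t p = 1;

    while (p < n)
        p *= base;
    return p == n;
}

/* Does this block group hold a redundant superblock and group descriptors? */
bool ext4_group_has_sb_backup(uint32_t group, bool sparse_super)
{
    if (!sparse_super || group == 0)
        return true;
    return is_power_of(group, 3) || is_power_of(group, 5) ||
           is_power_of(group, 7);
}
```

On a sparse_super filesystem this selects groups 0, 1, 3, 5, 7, 9, 25, 27, 49, and so on.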
The ext4 superblock is laid out as follows in struct ext4_super_block:
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le32 | s_inodes_count | Total inode count. |
0x4 | __le32 | s_blocks_count_lo | Total block count. |
0x8 | __le32 | s_r_blocks_count_lo | Reserved block count. |
0xC | __le32 | s_free_blocks_count_lo | Free block count. |
0x10 | __le32 | s_free_inodes_count | Free inode count. |
0x14 | __le32 | s_first_data_block | First data block. |
0x18 | __le32 | s_log_block_size | Block size is 2 ^ (10 + s_log_block_size). |
0x1C | __le32 | s_obso_log_frag_size | (Obsolete) fragment size. |
0x20 | __le32 | s_blocks_per_group | Blocks per group. |
0x24 | __le32 | s_obso_frags_per_group | (Obsolete) fragments per group. |
0x28 | __le32 | s_inodes_per_group | Inodes per group. |
0x2C | __le32 | s_mtime | Mount time, in seconds since the epoch. |
0x30 | __le32 | s_wtime | Write time, in seconds since the epoch. |
0x34 | __le16 | s_mnt_count | Number of mounts since the last fsck. |
0x36 | __le16 | s_max_mnt_count | Number of mounts beyond which a fsck is needed. |
0x38 | __le16 | s_magic | Magic signature, 0xEF53. |
0x3A | __le16 | s_state | File system state. |
0x3C | __le16 | s_errors | Behaviour when detecting errors (one of a fixed set of values). |
0x3E | __le16 | s_minor_rev_level | Minor revision level. |
0x40 | __le32 | s_lastcheck | Time of last check, in seconds since the epoch. |
0x44 | __le32 | s_checkinterval | Maximum time between checks, in seconds. |
0x48 | __le32 | s_creator_os | OS (one of a fixed set of values). |
0x4C | __le32 | s_rev_level | Revision level (one of a fixed set of values). |
0x50 | __le16 | s_def_resuid | Default uid for reserved blocks. |
0x52 | __le16 | s_def_resgid | Default gid for reserved blocks. |
These fields are for EXT4_DYNAMIC_REV superblocks only. Note: the difference between the compatible feature set and the incompatible feature set is that if there is a bit set in the incompatible feature set that the kernel doesn't know about, it should refuse to mount the filesystem. e2fsck's requirements are more strict; if it doesn't know about a feature in either the compatible or incompatible feature set, it must abort and not try to meddle with things it doesn't understand... |
0x54 | __le32 | s_first_ino | First non-reserved inode. |
0x58 | __le16 | s_inode_size | Size of inode structure, in bytes. |
0x5A | __le16 | s_block_group_nr | Block group # of this superblock. |
0x5C | __le32 | s_feature_compat | Compatible feature set flags (any combination of flag bits). Kernel can still read/write this fs even if it doesn't understand a flag; fsck should not do that. |
0x60 | __le32 | s_feature_incompat | Incompatible feature set (any combination of flag bits). If the kernel or fsck doesn't understand one of these bits, it should stop. |
0x64 | __le32 | s_feature_ro_compat | Readonly-compatible feature set (any combination of flag bits). If the kernel doesn't understand one of these bits, it can still mount read-only. |
0x68 | __u8 | s_uuid[16] | 128-bit UUID for volume. |
0x78 | char | s_volume_name[16] | Volume label. |
0x88 | char | s_last_mounted[64] | Directory where filesystem was last mounted. |
0xC8 | __le32 | s_algorithm_usage_bitmap | For compression. |
Performance hints. Directory preallocation should only happen if the EXT4_FEATURE_COMPAT_DIR_PREALLOC flag is on. |
0xCC | __u8 | s_prealloc_blocks | # of blocks to try to preallocate for ... files? |
0xCD | __u8 | s_prealloc_dir_blocks | # of blocks to preallocate for directories. |
0xCE | __le16 | s_reserved_gdt_blocks | Number of reserved GDT entries for future filesystem expansion. |
Journaling support valid if EXT4_FEATURE_COMPAT_HAS_JOURNAL set. |
0xD0 | __u8 | s_journal_uuid[16] | UUID of journal superblock. |
0xE0 | __le32 | s_journal_inum | inode number of journal file. |
0xE4 | __le32 | s_journal_dev | Device number of journal file, if the external journal feature flag is set. |
0xE8 | __le32 | s_last_orphan | Start of list of orphaned inodes to delete. |
0xEC | __le32 | s_hash_seed[4] | HTREE hash seed. |
0xFC | __u8 | s_def_hash_version | Default hash algorithm to use for directory hashes (one of a fixed set of values). |
0xFD | __u8 | s_jnl_backup_type | ? |
0xFE | __le16 | s_desc_size | Size of group descriptors, in bytes, if the 64bit incompat feature flag is set. |
0x100 | __le32 | s_default_mount_opts | Default mount options (any combination of flag bits). |
0x104 | __le32 | s_first_meta_bg | First metablock block group, if the meta_bg feature is enabled. |
0x108 | __le32 | s_mkfs_time | When the filesystem was created, in seconds since the epoch. |
0x10C | __le32 | s_jnl_blocks[17] | Backup copy of the first 68 bytes of the journal inode. |
64bit support valid if EXT4_FEATURE_INCOMPAT_64BIT set. |
0x150 | __le32 | s_blocks_count_hi | High 32-bits of the block count. |
0x154 | __le32 | s_r_blocks_count_hi | High 32-bits of the reserved block count. |
0x158 | __le32 | s_free_blocks_count_hi | High 32-bits of the free block count. |
0x15C | __le16 | s_min_extra_isize | All inodes have at least # bytes. |
0x15E | __le16 | s_want_extra_isize | New inodes should reserve # bytes. |
0x160 | __le32 | s_flags | Miscellaneous flags (any combination of flag bits). |
0x164 | __le16 | s_raid_stride | RAID stride. This is the number of logical blocks read from or written to the disk before moving to the next disk. This affects the placement of filesystem metadata, which will hopefully make RAID storage faster. |
0x166 | __le16 | s_mmp_interval | # seconds to wait in multi-mount prevention (MMP) checking. In theory, MMP is a mechanism to record in the superblock which host and device have mounted the filesystem, in order to prevent multiple mounts. This feature does not seem to be implemented... |
0x168 | __le64 | s_mmp_block | Block # for multi-mount protection data. |
0x170 | __le32 | s_raid_stripe_width | RAID stripe width. This is the number of logical blocks read from or written to the disk before coming back to the current disk. This is used by the block allocator to try to reduce the number of read-modify-write operations in a RAID5/6. |
0x174 | __u8 | s_log_groups_per_flex | Size of a flexible block group is 2 ^ s_log_groups_per_flex. |
0x175 | __u8 | s_reserved_char_pad | |
0x176 | __le16 | s_reserved_pad | |
0x178 | __le64 | s_kbytes_written | Number of KiB written to this filesystem over its lifetime. |
0x180 | __le32 | s_snapshot_inum | inode number of active snapshot. |
0x184 | __le32 | s_snapshot_id | Sequential ID of active snapshot. |
0x188 | __le64 | s_snapshot_r_blocks_count | Number of blocks reserved for active snapshot's future use. |
0x190 | __le32 | s_snapshot_list | inode number of the head of the on-disk snapshot list. |
0x194 | __le32 | s_error_count | Number of errors seen. |
0x198 | __le32 | s_first_error_time | First time an error happened, in seconds since the epoch. |
0x19C | __le32 | s_first_error_ino | inode involved in first error. |
0x1A0 | __le64 | s_first_error_block | Number of the block involved in the first error. |
0x1A8 | __u8 | s_first_error_func[32] | Name of function where the first error happened. |
0x1C8 | __le32 | s_first_error_line | Line number where the first error happened. |
0x1CC | __le32 | s_last_error_time | Time of most recent error, in seconds since the epoch. |
0x1D0 | __le32 | s_last_error_ino | inode involved in most recent error. |
0x1D4 | __le32 | s_last_error_line | Line number where most recent error happened. |
0x1D8 | __le64 | s_last_error_block | Number of the block involved in the most recent error. |
0x1E0 | __u8 | s_last_error_func[32] | Name of function where the most recent error happened. |
0x200 | __u8 | s_mount_opts[64] | ASCIIZ string of mount options. |
0x240 | __le32 | s_reserved[112] | Padding to the end of the block. |
Total size is 1024 bytes.
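To illustrate how the offsets in the table are used, here is a small sketch that reads a few fields out of a raw 1024-byte superblock image. The helper names are hypothetical; only the offsets (0x4, 0x38, 0x150), the little-endian field types, and the magic value come from the table. Ignoring s_blocks_count_hi unless the 64bit feature is set is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXT4_SUPER_MAGIC 0xEF53

/* Read little-endian integers byte by byte, so host endianness does not matter. */
uint16_t get_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* sb points at the start of the 1024-byte superblock image. */
bool ext4_sb_magic_ok(const uint8_t *sb)
{
    return get_le16(sb + 0x38) == EXT4_SUPER_MAGIC;
}

/* Total block count: s_blocks_count_lo at 0x4, s_blocks_count_hi at 0x150. */
uint64_t ext4_sb_blocks_count(const uint8_t *sb, bool has_64bit)
{
    uint64_t lo = get_le32(sb + 0x4);
    uint64_t hi = has_64bit ? get_le32(sb + 0x150) : 0;

    return (hi << 32) | lo;
}
```

Checking the magic first is a cheap way to reject a buffer that was read from the wrong offset.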
Block Group Descriptors
Each block group on the filesystem has one of these descriptors associated with it. As noted in the Layout section above, the group descriptors (if present) are the second item in the block group. The standard configuration is for each block group to contain a full copy of the block group descriptor table unless the sparse_super feature flag is set.
Notice how the group descriptor records the location of both bitmaps and the inode table (i.e. they can float). This means that within a block group, the only data structures with fixed locations are the superblock and the group descriptor table. The flex_bg mechanism uses this property to group several block groups into a flex group and lay out all of the groups' bitmaps and inode tables into one long run in the first group of the flex group.
If the meta_bg feature flag is set, then several block groups are grouped together into a meta group. Note that in the meta_bg case, however, the first and last two block groups within the larger meta group contain only group descriptors for the groups inside the meta group.
flex_bg and meta_bg do not appear to be mutually exclusive features.
The block group descriptor is laid out in struct ext4_group_desc.
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le32 | bg_block_bitmap_lo | Lower 32-bits of location of block bitmap. |
0x4 | __le32 | bg_inode_bitmap_lo | Lower 32-bits of location of inode bitmap. |
0x8 | __le32 | bg_inode_table_lo | Lower 32-bits of location of inode table. |
0xC | __le16 | bg_free_blocks_count_lo | Lower 16-bits of free block count. |
0xE | __le16 | bg_free_inodes_count_lo | Lower 16-bits of free inode count. |
0x10 | __le16 | bg_used_dirs_count_lo | Lower 16-bits of directory count. |
0x12 | __le16 | bg_flags | Block group flags (any combination of flag bits). |
0x14 | __u32 | bg_reserved[2] | Likely block/inode bitmap checksum. |
0x1C | __le16 | bg_itable_unused_lo | Lower 16-bits of unused inode count. |
0x1E | __le16 | bg_checksum | Group descriptor checksum; crc16(sb_uuid+group+desc). Probably only calculated if the rocompat bg_checksum feature flag is set. |
0x20 | __le32 | bg_block_bitmap_hi | Upper 32-bits of location of block bitmap. |
0x24 | __le32 | bg_inode_bitmap_hi | Upper 32-bits of location of inode bitmap. |
0x28 | __le32 | bg_inode_table_hi | Upper 32-bits of location of inode table. |
0x2C | __le16 | bg_free_blocks_count_hi | Upper 16-bits of free block count. |
0x2E | __le16 | bg_free_inodes_count_hi | Upper 16-bits of free inode count. |
0x30 | __le16 | bg_used_dirs_count_hi | Upper 16-bits of directory count. |
0x32 | __le16 | bg_itable_unused_hi | Upper 16-bits of unused inode count. |
0x34 | __u32 | bg_reserved2[3] | Padding to 64 bytes. |
Total size is 64 bytes.
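The _lo/_hi field pairs are meant to be joined into one value. A minimal sketch follows, with hypothetical helper names. It assumes the _hi halves are only meaningful when the 64bit feature is set (and the descriptor is therefore large enough to contain them), and treats them as zero otherwise.

```c
#include <stdbool.h>
#include <stdint.h>

/* Join a 32-bit _lo/_hi pair (block bitmap, inode bitmap, inode table location). */
uint64_t ext4_gd_location(uint32_t lo, uint32_t hi, bool has_64bit)
{
    return has_64bit ? ((uint64_t)hi << 32) | lo : lo;
}

/* Join a 16-bit _lo/_hi pair (free blocks, free inodes, used dirs, unused inodes). */
uint32_t ext4_gd_count(uint16_t lo, uint16_t hi, bool has_64bit)
{
    return has_64bit ? ((uint32_t)hi << 16) | lo : lo;
}
```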
Block and inode Bitmaps
The data block bitmap tracks the usage of data blocks within the block group.
The inode bitmap records which entries in the inode table are in use.
As with most bitmaps, one bit represents the usage status of one data block or inode table entry. This implies a block group size of 8 * number_of_bytes_in_a_logical_block.
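Testing a bit in either bitmap might look like the sketch below. The text does not say how bits are ordered within a byte; this sketch assumes little-endian bit order (bit 0 is the least significant bit of byte 0), and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Is entry n (a block or an inode table slot) marked in use?
 * Assumes bit n lives in byte n / 8 at bit position n % 8. */
bool ext4_bitmap_test(const uint8_t *bitmap, size_t n)
{
    return (bitmap[n / 8] >> (n % 8)) & 1;
}

/* Count the in-use entries among the first nbits bits. */
size_t ext4_bitmap_count_used(const uint8_t *bitmap, size_t nbits)
{
    size_t used = 0;

    for (size_t i = 0; i < nbits; i++)
        used += ext4_bitmap_test(bitmap, i);
    return used;
}
```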
Inode Table
The inode table is a linear array of struct ext4_inode. The table is sized to have enough blocks to store at least sb.s_inode_size * sb.s_inodes_per_group bytes. Inode numbers start at 1, so the number of the block group containing an inode can be calculated as (inode_number - 1) / sb.s_inodes_per_group, and the offset into the group's table is (inode_number - 1) % sb.s_inodes_per_group.
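Locating an inode on disk can be sketched as follows. The struct and function names are hypothetical. The sketch assumes inode numbers start at 1 (there is no inode 0), so it subtracts one before dividing, and then converts the table index into a block and byte offset relative to the start of the group's inode table.

```c
#include <stdint.h>

struct ext4_inode_loc {
    uint32_t group;           /* block group that holds the inode */
    uint32_t index;           /* slot within that group's inode table */
    uint64_t block_in_table;  /* block offset from the start of the table */
    uint32_t offset_in_block; /* byte offset within that block */
};

struct ext4_inode_loc ext4_locate_inode(uint32_t ino, uint32_t inodes_per_group,
                                        uint16_t inode_size, uint32_t block_size)
{
    struct ext4_inode_loc loc;
    uint64_t byte;

    loc.group = (ino - 1) / inodes_per_group;
    loc.index = (ino - 1) % inodes_per_group;
    byte = (uint64_t)loc.index * inode_size;
    loc.block_in_table = byte / block_size;
    loc.offset_in_block = (uint32_t)(byte % block_size);
    return loc;
}
```

The absolute block number is then the inode table location from the group descriptor plus block_in_table.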
The inode table entry is laid out in struct ext4_inode.
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le16 | i_mode | File mode (permission and file type bits). |
0x2 | __le16 | i_uid | Lower 16-bits of Owner UID. |
0x4 | __le32 | i_size_lo | Lower 32-bits of size in bytes. |
0x8 | __le32 | i_atime | Last access time, in seconds since the epoch. |
0xC | __le32 | i_ctime | Last inode change time, in seconds since the epoch. |
0x10 | __le32 | i_mtime | Last data modification time, in seconds since the epoch. |
0x14 | __le32 | i_dtime | Deletion Time, in seconds since the epoch. |
0x18 | __le16 | i_gid | Lower 16-bits of GID. |
0x1A | __le16 | i_links_count | Hard link count. |
0x1C | __le32 | i_blocks_lo | Lower 32-bits of block count. |
0x20 | __le32 | i_flags | Inode flags (any combination of flag bits). |
0x24 | 4 bytes | Union osd1 | |
0x28 | __le32 | i_block[EXT4_N_BLOCKS=15] | Block map or extent tree. |
0x64 | __le32 | i_generation | File version (for NFS). |
0x68 | __le32 | i_file_acl_lo | Lower 32-bits of file ACL location. |
0x6C | __le32 | i_size_high | Upper 32-bits of file size. |
0x70 | __le32 | i_obso_faddr | (Obsolete) fragment address. |
0x74 | 12 bytes | Union osd2 | |
0x80 | __le16 | i_extra_isize | Size of this inode - 128. |
0x82 | __le16 | i_pad1 | ?? |
0x84 | __le32 | i_ctime_extra | Extra change time bits. This provides sub-second precision. |
0x88 | __le32 | i_mtime_extra | Extra modification time bits. This provides sub-second precision. |
0x8C | __le32 | i_atime_extra | Extra access time bits. This provides sub-second precision. |
0x90 | __le32 | i_crtime | File creation time, in seconds since the epoch. |
0x94 | __le32 | i_crtime_extra | Extra file creation time bits. This provides sub-second precision. |
0x98 | __le32 | i_version_hi | Upper 32-bits for version number. |
Note that the size of the structure is 156 bytes, though the standard inode size in ext4 is 256 bytes. It was 128 previously. I think(?) the extra space can be used for extended attributes.
Direct/Indirect Block Addressing
In ext2/3, file block numbers were mapped to logical block numbers by means of an (up to) three level 1-1 block map. To find the logical block that stores a particular file block, the code would navigate through this increasingly complicated structure.
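i.i_block Offset | Where It Points |
---|---|
0 -> 11 | Direct map to the first 12 blocks of the file. |
12 | Indirect block: holds ($block_size / 4) block numbers, which map file blocks 12 -> (12 + ($block_size / 4) - 1) (12 -> 1035 with 4KiB blocks). |
13 | Double-indirect block. |
14 | Triple-indirect block. |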
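The following sketch classifies a file block number by how many levels of indirection are needed to reach it. The function name is hypothetical. It assumes 4-byte block numbers, so each indirect block holds $block_size / 4 entries, and it assumes the double- and triple-indirect blocks repeat the same pattern one and two levels deeper, which the table above does not spell out.

```c
#include <stdint.h>

/* Returns 0 for a direct block, 1/2/3 for single/double/triple indirection,
 * or -1 if the file block is beyond what the block map can address. */
int ext4_blockmap_level(uint64_t file_block, uint32_t block_size)
{
    uint64_t ptrs = block_size / 4;  /* block numbers per indirect block */

    if (file_block < 12)
        return 0;
    file_block -= 12;
    if (file_block < ptrs)
        return 1;
    file_block -= ptrs;
    if (file_block < ptrs * ptrs)
        return 2;
    file_block -= ptrs * ptrs;
    if (file_block < ptrs * ptrs * ptrs)
        return 3;
    return -1;
}
```

With 4KiB blocks, file blocks 12 through 1035 go through the single indirect block, and file block 1036 is the first one that needs the double-indirect block.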
Extent Tree
    /*
     * ext4_inode has i_block array (60 bytes total).
     * The first 12 bytes store ext4_extent_header;
     * the remainder stores an array of ext4_extent.
     */

    /*
     * This is the extent on-disk structure.
     * It's used at the bottom of the tree.
     */
    struct ext4_extent {
        __le32  ee_block;       /* first logical block extent covers */
        __le16  ee_len;         /* number of blocks covered by extent */
        __le16  ee_start_hi;    /* high 16 bits of physical block */
        __le32  ee_start_lo;    /* low 32 bits of physical block */
    };

    /*
     * This is index on-disk structure.
     * It's used at all the levels except the bottom.
     */
    struct ext4_extent_idx {
        __le32  ei_block;       /* index covers logical blocks from 'block' */
        __le32  ei_leaf_lo;     /* pointer to the physical block of the next
                                 * level. leaf or next index could be there */
        __le16  ei_leaf_hi;     /* high 16 bits of physical block */
        __u16   ei_unused;
    };

    /*
     * Each block (leaves and indexes), even inode-stored has header.
     */
    struct ext4_extent_header {
        __le16  eh_magic;       /* probably will support different formats */
        __le16  eh_entries;     /* number of valid entries */
        __le16  eh_max;         /* capacity of store in entries */
        __le16  eh_depth;       /* has tree real underlying blocks? */
        __le32  eh_generation;  /* generation of the tree */
    };
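As a sketch of how a leaf extent is used, the helpers below join the split physical block number and map a logical file block through a single extent. The function names are hypothetical, and ee_len is treated as a plain block count, which is a simplifying assumption.

```c
#include <stdint.h>

/* Physical start block: 16 high bits from ee_start_hi, 32 low bits from ee_start_lo. */
uint64_t ext4_extent_start(uint16_t ee_start_hi, uint32_t ee_start_lo)
{
    return ((uint64_t)ee_start_hi << 32) | ee_start_lo;
}

/* Entries that fit in i_block after the header: (60 - 12) / 12. */
unsigned ext4_inline_extent_capacity(void)
{
    return (60 - 12) / 12;
}

/* Map logical block lblk through one extent; returns -1 if it is not covered. */
int64_t ext4_extent_lookup(uint32_t ee_block, uint16_t ee_len,
                           uint64_t start, uint32_t lblk)
{
    if (lblk < ee_block || lblk - ee_block >= ee_len)
        return -1;
    return (int64_t)(start + (lblk - ee_block));
}
```

Since every one of the three structures is 12 bytes, the 60-byte i_block array has room for one header plus four entries.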
Directory Entries
    struct ext4_dir_entry {
        __le32  inode;                  /* Inode number */
        __le16  rec_len;                /* Directory entry length */
        __le16  name_len;               /* Name length */
        char    name[EXT4_NAME_LEN];    /* File name */
    };

    /*
     * The new version of the directory entry. Since EXT4 structures are
     * stored in intel byte order, and the name_len field could never be
     * bigger than 255 chars, it's safe to reclaim the extra byte for the
     * file_type field.
     */
    struct ext4_dir_entry_2 {
        __le32  inode;                  /* Inode number */
        __le16  rec_len;                /* Directory entry length */
        __u8    name_len;               /* Name length */
        __u8    file_type;
        char    name[EXT4_NAME_LEN];    /* File name */
    };
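A directory block can be walked by hopping from entry to entry using rec_len. The sketch below rests on several assumptions that the structure definitions alone do not establish: entries use the ext4_dir_entry_2 layout with an 8-byte fixed header, the block is a plain linear directory block, an inode number of 0 marks an unused entry, and rec_len is a plain byte count. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Look up name in one directory block. Returns the inode number, or 0 if the
 * name is absent or the block looks malformed. */
uint32_t ext4_dir_lookup(const uint8_t *block, size_t block_size, const char *name)
{
    size_t want = strlen(name);
    size_t off = 0;

    while (off + 8 <= block_size) {
        const uint8_t *de = block + off;
        uint32_t inode = (uint32_t)de[0] | ((uint32_t)de[1] << 8) |
                         ((uint32_t)de[2] << 16) | ((uint32_t)de[3] << 24);
        uint16_t rec_len = (uint16_t)(de[4] | (de[5] << 8));
        uint8_t name_len = de[6];

        /* Reject entries that are too short, overrun the block, or whose
         * name does not fit inside the record. */
        if (rec_len < 8 || off + rec_len > block_size || 8u + name_len > rec_len)
            return 0;
        if (inode != 0 && name_len == want && memcmp(de + 8, name, want) == 0)
            return inode;
        off += rec_len;
    }
    return 0;
}
```

Note that names are not NUL-terminated on disk, which is why the comparison uses name_len and memcmp rather than strcmp.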