
Ext4 BOF: Ottawa Linux Symposium
June 28, 2007


Delayed Allocation:

  • Alex's ext4 patches are in the ext4-patch-queue
  • There has been discussion of moving this feature to the VFS layer so other filesystems can also use it, mainly XFS.
  • Christoph Hellwig had a patch for delayed allocation in VFS, but it was very much like the XFS implementation and XFS specific.
  • Alex has started working on his own implementation of delalloc at the VFS layer.
  • Badari can help in completing Alex's VFS delalloc patches. Andreas will ask Alex to send his current VFS delayed allocation patches to Badari.
  • With delayed allocation moved to the VFS, it will be harder to control the feature because it is no longer an ext4 project.
  • Currently delalloc only works in writeback mode. An implementation for ordered mode would be tricky because it would need to use buffer heads.
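
The idea discussed above can be illustrated with a toy model (an assumption-laden sketch, not the ext4 code): writes are only recorded as dirty, and block allocation is deferred until writeback, so logically contiguous dirty blocks can be allocated as one contiguous run instead of one block at a time. The `ToyFS` class, its bump allocator, and the file names are all invented for illustration.

```python
class ToyFS:
    """Toy model of delayed allocation (illustration only)."""

    def __init__(self):
        self.next_free = 0   # trivial bump allocator over "disk" blocks
        self.mapping = {}    # (file, logical block) -> physical block
        self.dirty = {}      # file -> set of dirty, unallocated logical blocks

    def write(self, f, logical):
        # Delayed: just record the dirty block; allocate nothing yet.
        self.dirty.setdefault(f, set()).add(logical)

    def writeback(self, f):
        # Allocate all of the file's dirty blocks in one contiguous run.
        blocks = sorted(self.dirty.pop(f, ()))
        start = self.next_free
        self.next_free += len(blocks)
        for i, logical in enumerate(blocks):
            self.mapping[(f, logical)] = start + i
        return blocks

fs = ToyFS()
for lb in (0, 1, 2, 3):
    fs.write("a.txt", lb)   # four separate writes, no allocation yet
fs.writeback("a.txt")       # one allocation covering all four blocks
```

Because allocation happens once at writeback, the four logical blocks land in one contiguous physical run; with immediate allocation, interleaved writers could have fragmented them.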

Multiple Block Allocation:

  • Aneesh is working on porting Alex's patches to current ext4 patch queue (he has since posted this patch).
  • Currently, when mballoc is on, reservation is turned off. In-core preallocation is a new way to implement reservations on top of extents. Ext4 will use in-core preallocation, while ext3 keeps the old block reservation.
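
A minimal sketch of the in-core preallocation idea, under simplifying assumptions (a fixed window size, a bump allocator, integer inode IDs — none of this mirrors the real mballoc code): the first allocation for an inode reserves a window of contiguous blocks in memory only, and later allocations for that inode are served from the window, so a file's blocks stay contiguous even with interleaved writers.

```python
PREALLOC_WINDOW = 8  # assumed window size, for illustration

class Allocator:
    def __init__(self):
        self.next_free = 0
        self.windows = {}  # inode -> reserved-but-unused blocks (in core only)

    def alloc(self, inode):
        win = self.windows.get(inode)
        if not win:
            # Reserve a fresh window; the reservation exists only in memory,
            # nothing is written to disk until the blocks are actually used.
            start = self.next_free
            self.next_free += PREALLOC_WINDOW
            win = self.windows[inode] = list(range(start, start + PREALLOC_WINDOW))
        return win.pop(0)

a = Allocator()
# Interleaved allocations for two inodes still yield contiguous runs
# per inode: inode 1 gets 0, 1, ...; inode 2 gets 8, 9, ...
blocks = [a.alloc(1), a.alloc(2), a.alloc(1), a.alloc(2)]
```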

Large Extended Attributes:

  • Kalpak has a preliminary patch that supports large single attributes. It is still possible to run out of space with many small attributes.
  • Support for forked files is possible, but there is not much demand for this anymore.

Online Defragmentation:

  • The current implementation uses ioctls for the interface. Other options for a better interface were discussed.
  • Need an interface that performs two functions in migration:
    • Find a target area
    • Migrate (punch operation)
  • The current fallocate syscall would not be able to support these functions, and adding more modes to this interface would probably not be accepted.


Big Block Group Support or Relocation of Metadata

  • Need this to be in Ext4 before the format freeze
  • Bull had patches for big block groups
  • The metablock group feature currently in ext3/4 needs work to actually perform the functions specified: metadata would move from each block group's descriptor into a single descriptor that holds all metadata for a set of block groups.
  • A combination of the BIG_BG and metablock group patches is needed in order to fully support large filesystems.
  • After adding the new metablock group support, migration from ext3 to ext4 would still be possible because metadata would just be moved out of the group descriptors. Moving backwards from ext4 to ext3 would not be possible.
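
The layout change can be sketched with some simple arithmetic (assumptions: 4 KB blocks, 64-byte descriptors, and the meta_bg-style convention of keeping a metagroup's descriptor block in its first group with backups in the second and last groups; the function name is invented):

```python
BLOCK_SIZE = 4096
DESC_SIZE = 64
DESCS_PER_BLOCK = BLOCK_SIZE // DESC_SIZE  # 64 groups per metagroup

def desc_location(group):
    """Return (metagroup, groups holding copies) for a group's descriptor.

    Instead of replicating all descriptors near every group, a "metagroup"
    is the set of block groups whose descriptors fit in one block, and
    that block lives inside the metagroup itself (primary + two backups).
    """
    metagroup = group // DESCS_PER_BLOCK
    base = metagroup * DESCS_PER_BLOCK
    copies = (base, base + 1, base + DESCS_PER_BLOCK - 1)
    return metagroup, copies

mg, copies = desc_location(130)
# Group 130 falls in metagroup 2; its descriptor block is kept in
# groups 128 and 129, with a backup in group 191.
```

Because the descriptor location is computed from the group number rather than replicated everywhere, the total descriptor metadata no longer grows with every group, which is what makes very large filesystems tractable.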

Fsck Scalability:

  • The uninitialized block groups feature shows major e2fsck performance improvements for filesystems with small numbers of files in use.
  • There is still scope for making fsck faster in general - Some ideas:
    • partial fsck: Idea is to try to contain a fault to a single file or group, and then only check that file or group.
    • Create a per-block-group error flag to detect local corruption. This borrows an idea from chunkfs, but can be hard to implement because inodes can point to blocks in a different block group. Also add a dirty bit to each group.
    • If the kernel detects a corrupt extent tree, then just check the groups that have blocks for that file.
    • If a certain bitmap is marked bad, stop allocating from that group, but don't take the entire filesystem offline. This prevents one error from forcing the entire filesystem to be remounted read-only.
    • Add more metadata checksums: checksumming bitmaps would immediately tell us if the group is corrupted.
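
The bitmap-checksumming idea above can be sketched in a few lines (assumption: CRC-32 via `zlib` as a stand-in for whatever checksum the on-disk format would actually use):

```python
import zlib

def bitmap_checksum(bitmap: bytes) -> int:
    # CRC-32 stands in for the real on-disk checksum algorithm.
    return zlib.crc32(bitmap)

# An fsck-style check: recompute the checksum and compare it with the
# stored value. On a mismatch, only this group needs a full check,
# not the whole filesystem.
bitmap = bytes(4096)                 # an all-free block bitmap
stored = bitmap_checksum(bitmap)     # value saved when the bitmap was written

corrupted = b"\xff" + bitmap[1:]     # simulate a flipped byte on "disk"
group_needs_fsck = bitmap_checksum(corrupted) != stored
```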

More Metadata Checksums:

  • Add checksums to the inode, extents and bitmaps.


Dynamic Inode Tables:

  • Static tables in ext3 make the filesystem stable.
  • The price of static inode tables is the space wasted on unused inodes, and high fsck times.
  • Challenges:
    • We need an efficient way to map inode numbers to block numbers.
    • Detection of metadata corruption: with dynamic inode tables, if a directory is corrupt, the entire filesystem must be scanned to find the inodes.
      • Linking inodes together can help in finding corrupt inodes. JFS has a doubly-linked list of inodes; this is used to reconstruct the inode mapping table on the fly upon corruption.
    • Compatibility of 64-bit inode numbers
      • Discussed testing compatibility with a test mode in which the high 32-bits of the inode numbers are constant (or perhaps the reverse of the low 32-bits), and testing if anything breaks.

  • Inode-in-Dir
    • On data corruption, a linked list linking all inodes together could be used, or one inode could be stored per filesystem block (as is done in GFS).
      • The problem with inode-sized blocks arises with larger block sizes: currently a power-of-two number of inodes is stored in each block.
    • Inode-in-dir can be combined with 64-bit inode numbers, storing the block number in the inode number for easier look-up.
    • There are complications with hard links: if the inode is in multiple directories, then multiple directories are referencing one block.
      • To address this, the filename and parent directory could be stored in EAs.
      • Alternatively, keep a parallel object with each directory to store the inodes. This would require seeking back and forth between the directory and the parallel object.
    • Inode relocation: if the data is being moved, it is still desirable to keep the inode near the data blocks.
      • The proposed solution is a per-group inode re-mapping table. When an inode cannot be found at the location its number indicates, the re-mapping table gives the new location. In the common case the re-mapping table is empty.
  • Block Number in Inode Number:
  • It could take 1-2 years to design and implement dynamic inode tables for ext4. We want the ext4 format set before then, so this feature will have to be a future consideration; perhaps a new filesystem will be needed in order to maintain compatibility.
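
The block-number-in-inode-number and re-mapping-table ideas above fit together; here is a hedged sketch (assumptions: 64-bit inode numbers, a power-of-two number of inodes per block, and a normally-empty remap table; all names are invented):

```python
INODES_PER_BLOCK = 4  # assumed power of two

def make_ino(block, slot):
    # The inode number directly encodes the inode's on-disk location.
    return block * INODES_PER_BLOCK + slot

remap = {}  # old (block, slot) -> new (block, slot); empty in the common case

def locate_inode(ino):
    block, slot = divmod(ino, INODES_PER_BLOCK)
    # Common case: no remap entry, the number itself is the location.
    return remap.get((block, slot), (block, slot))

ino = make_ino(100, 3)
assert locate_inode(ino) == (100, 3)   # direct hit, no table consulted

# Relocating the inode near its moved data only updates the small
# per-group table; the inode number (and every directory entry
# referencing it) stays valid.
remap[(100, 3)] = (500, 0)
```

The design choice here is that look-up stays O(1) arithmetic in the common case, with the table paying the cost only for the (hopefully rare) relocated inodes.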

Data Checksumming


Real-world problems in ext3 that are not yet addressed in Ext4:

  • Better support for small files: the average file size on laptops is 1.5 KB.
  • Dealing with filesystem aging
  • Block allocator intelligence, and readahead of metadata, could be better implemented.

Journal in middle of filesystem:

  • Andreas read an article reporting 5-10% filesystem performance improvements when the journal is stored in the middle of the filesystem. This is a trivial change that could be added to ext4. The gain is still not as large as storing the journal on a separate device.

Online Resizing:

  • There is currently a 2 TB limit on online resizing. This issue could be addressed in ext4, though no one has heard any complaints about the limit.

More than one block allocation policy:

  • The idea is to have separate block allocation policies for different workloads.
  • This could make the filesystem too complex.
  • Pluggable allocators.
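
A minimal sketch of what "pluggable allocators" could mean (assumptions: a simple policy table and two toy policies; a real implementation would hook into the filesystem's allocator, not pass sets around):

```python
def first_fit(free_blocks, count, goal=0):
    """Take the lowest-numbered free blocks."""
    return sorted(free_blocks)[:count]

def near_goal(free_blocks, count, goal=0):
    """Prefer free blocks closest to a goal block (e.g. near the inode)."""
    return sorted(free_blocks, key=lambda b: abs(b - goal))[:count]

# The "pluggable" part: policies registered by name, selectable per
# workload without touching the allocation call sites.
POLICIES = {"first_fit": first_fit, "near_goal": near_goal}

def allocate(free_blocks, count, policy="first_fit", goal=0):
    chosen = POLICIES[policy](free_blocks, count, goal)
    free_blocks -= set(chosen)
    return chosen

free = {1, 5, 9, 10, 11}
got = allocate(free, 2, policy="near_goal", goal=9)  # -> [9, 10]
```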

Add disk awareness in the filesystem

  • There is no standard way in RAID arrays to see how the disks are laid out.

Inode Layout:

  • With the larger inode size in ext4, the original 128-byte ext3 inode is left unchanged, and new fields are added after it.
  • It would make sense to change the layout of the ext4_inode:
    • Create more space to store extents, and create a larger extent structure supporting 64-bit logical block numbers. This would support larger file sizes.
    • Move the timestamp fields together (low and high fields).
    • Move the extents to the end of the inode, next to the extra EA space, so the space can be shared: if it is not used for extents, it can be used to store EAs.
  • This cannot be done due to need for compatibility with ext3.

  • We should not set arbitrary limits, such as a 16 TB maximum file size, which would restrict the filesystem in the future.


  • Delayed Allocation
  • Multiple block allocation
  • large files
  • extents support in e2fsprogs (e2fsck etc)
  • BIG_BG or metablock group support


  • Dynamic Inode Tables: This feature would take some time (2 years?) to design and implement, and also cause major format changes.
  • Data Checksumming