Ext4 Developer Interlock Call: 03-07-07 Minutes
Attendees: Mingming Cao, Ted Ts'o, Suparna Bhattacharya, Dave Kleikamp, Jean Noel Cordenner, Eric Sandeen, Akira Fujita, Avantika Mathur
Minutes can be accessed at: http://ext4.wiki.kernel.org/index.php/Ext4_Developer%27s_Conference_Call
Ext4 git tree:
- Andrew Morton asked about updated Ext4 patches in the kernel.org tree; the last update was 2/18
- Ted plans to test the current ext4 patch set before updating the tree
Preallocation fallocate interface:
- There has been a lot of discussion on the mailing list about the fallocate system call, its parameters (in particular mode), and whether a generic function should be written in the kernel or the libc function should be used for filesystems that don't have their own fs-specific function.
- Generic Function: After much discussion during the call, it was concluded that a generic function in the VFS would be desirable, but it is not a priority.
- Mode bit: The mode bit seems like a good way to support preallocate, unpreallocate and other types of allocation within the fallocate system call. However, having the mode bit would cause the syscall to take different parameters for each mode, making it more like ioctl. Some may find this undesirable.
- Policy: Ted proposed an idea of having an integer value which represents which allocation policy is to be used. This value would be set by interpreting the parameters sent by some interface (syscall, ioctl), and the filesystem would then perform allocation based on the policy (prealloc, reserve, unalloc, punch). The default value for normal allocation would be 0.
- The general opinion on the call was that there should be a separate system call for fallocate and punch operations.
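As a rough illustration of the policy idea above, the sketch below maps a request coming from some front end to an integer policy and lets the filesystem side dispatch on it. Only the four policy kinds and "0 means normal allocation" come from the discussion; the enum names, the numeric values other than 0, the request modes, and both helper functions are hypothetical.

```c
/* Hypothetical policy numbers; the minutes only fix 0 as normal allocation. */
enum alloc_policy {
    POLICY_NORMAL   = 0,   /* default: ordinary allocation */
    POLICY_PREALLOC = 1,
    POLICY_RESERVE  = 2,
    POLICY_UNALLOC  = 3,
    POLICY_PUNCH    = 4
};

/* Hypothetical modes that a front end (syscall or ioctl) might accept. */
enum req_mode {
    REQ_NONE = 0,
    REQ_ALLOCATE,
    REQ_RESERVE,
    REQ_DEALLOCATE,
    REQ_PUNCH_HOLE
};

/* Interface layer: interpret the caller's parameters as a policy value. */
enum alloc_policy policy_from_request(enum req_mode mode)
{
    switch (mode) {
    case REQ_ALLOCATE:   return POLICY_PREALLOC;
    case REQ_RESERVE:    return POLICY_RESERVE;
    case REQ_DEALLOCATE: return POLICY_UNALLOC;
    case REQ_PUNCH_HOLE: return POLICY_PUNCH;
    default:             return POLICY_NORMAL;
    }
}

/* Filesystem layer: choose an action based only on the policy value. */
const char *policy_action(enum alloc_policy policy)
{
    switch (policy) {
    case POLICY_PREALLOC: return "preallocate blocks";
    case POLICY_RESERVE:  return "reserve blocks";
    case POLICY_UNALLOC:  return "release preallocation";
    case POLICY_PUNCH:    return "punch hole";
    default:              return "normal allocation";
    }
}
```

In this arrangement the filesystem never sees the raw syscall or ioctl parameters, so the front end can change without touching the allocation code.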
Block Group Number type:
- Avantika is working on patches to change all block group numbers to type unsigned long. Currently there are many locations where block group numbers are of type int and are sometimes assigned negative values. The patches will add a new ext4_grpnum_t type.
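A minimal sketch of what such a type change implies is shown below. The typedef name and its underlying type come from the item above; the GRPNUM_NONE sentinel and both helper functions are hypothetical and only illustrate how code that used negative ints as "no group" markers would need an explicit replacement.

```c
#include <limits.h>

/* Type name and width taken from the minutes. */
typedef unsigned long ext4_grpnum_t;

/* Hypothetical sentinel replacing the old "negative value" convention. */
#define GRPNUM_NONE ULONG_MAX

/* Hypothetical helper: the group after 'cur', wrapping around at 'ngroups'.
 * Returns GRPNUM_NONE when the filesystem has no groups at all. */
ext4_grpnum_t next_group(ext4_grpnum_t cur, ext4_grpnum_t ngroups)
{
    if (ngroups == 0)
        return GRPNUM_NONE;
    return (cur + 1) % ngroups;
}

/* Hypothetical check: would this group number survive being stored in an int?
 * Anything above INT_MAX would not be representable as a non-negative int. */
int fits_in_int(ext4_grpnum_t group)
{
    return group <= (ext4_grpnum_t)INT_MAX;
}
```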
Metadata block groups:
- At the filesystem and storage workshop, it was decided that metadata block groups will be turned on by default in Ext4 to support larger filesystem sizes. With the current format, where group descriptors are saved in the first block group, filesystem size is limited to 256 TB.
- Metadata is stored in one group. Data is stored at the first, second and last block group of the meta block group. The restrictions on where inode table blocks have to be are relaxed; they are put at the beginning of every metadata block group.
- Jean Noel is working on a new version of the patches.
- In an earlier e-mail he had mentioned high CPU utilization with the patches, but this is not the case.
- He will publish the new version of the patches and test results to the mailing list. He has been testing with iozone and looking at oprofile data.
- NFSv4 requires a 64 bit i_version field. The current patches have a 32 bit field, so we need consensus on where the high 32 bits of the field will come from.
- Andreas Dilger had suggested using bits from i_extra_isize.
- Jean Noel will send out an RFC to start discussion on the mailing list.
- Lustre had an additional request: that the i_version value be updated from a global counter. Ted is concerned about bottlenecks in metadata-intensive benchmarks because of the globally accessed, incrementing counter.
- No decision has been made on this issue.
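Wherever the high 32 bits end up being stored, the 64 bit value would have to be assembled from two 32 bit halves. The sketch below shows that assembly with a per-inode increment. The struct, its field names, and the helper functions are hypothetical and do not reflect any layout agreed on the call; it also does not model the global counter that Lustre requested.

```c
#include <stdint.h>

/* Hypothetical storage: the existing 32 bit field plus a second 32 bit
 * field found elsewhere in the inode. */
struct iversion_disk {
    uint32_t lo;   /* existing low 32 bits */
    uint32_t hi;   /* high 32 bits; their location was still undecided */
};

/* Combine both halves into the 64 bit value that NFSv4 needs. */
uint64_t iversion_load(const struct iversion_disk *d)
{
    return ((uint64_t)d->hi << 32) | d->lo;
}

/* Split a 64 bit value back into the two stored halves. */
void iversion_store(struct iversion_disk *d, uint64_t value)
{
    d->lo = (uint32_t)value;
    d->hi = (uint32_t)(value >> 32);
}

/* Per-inode increment; a carry out of the low half lands in the high half. */
void iversion_bump(struct iversion_disk *d)
{
    iversion_store(d, iversion_load(d) + 1);
}
```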
Ext3->Ext4 Migration Tool:
- Aneesh Veetil has been working on a tool to migrate files from block-based to extent-based allocation. He is looking at two options.
- Offline Migration: Modify e2fsprogs code to actually be able to create extents. This involves a lot of duplication of ext4 code (btree). e2fsprogs has code for interpreting extents, but code for creating them would have to be duplicated.
- Online Migration: Use existing filesystem code to convert to extents - similar to online defragmentation.
- Mingming suggested looking into doing a cp, but this involves data movement. Aneesh's approaches perform the migration in place.
- Migration from block based->extents can be done online or offline, but the migration tool will also include migration from 128 byte inodes to large inodes, which should be done offline.
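To make the block-based to extents conversion concrete, the sketch below coalesces a per-block mapping into runs of contiguous blocks without moving any data. It is a simplified model only: the struct extent layout and the blocks_to_extents function are hypothetical, a physical block number of 0 is assumed to mean a hole, and real-format details such as the extent tree and the maximum extent length are ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent: a run of contiguous blocks. */
struct extent {
    uint32_t logical;    /* first logical block in the file */
    uint64_t physical;   /* first physical block on disk */
    uint32_t len;        /* number of blocks in the run */
};

/* blocks[i] is the physical block backing logical block i (0 = hole).
 * Writes up to max_out extents into out and returns how many were written,
 * or -1 if out is too small. No data blocks are moved. */
int blocks_to_extents(const uint64_t *blocks, size_t nblocks,
                      struct extent *out, size_t max_out)
{
    size_t n = 0;

    for (size_t i = 0; i < nblocks; i++) {
        if (blocks[i] == 0)
            continue;                       /* hole: nothing to map */

        /* Extend the previous extent if this block continues it both
         * logically and physically. */
        if (n > 0 &&
            out[n - 1].logical + out[n - 1].len == i &&
            out[n - 1].physical + out[n - 1].len == blocks[i]) {
            out[n - 1].len++;
            continue;
        }

        if (n == max_out)
            return -1;                      /* caller's buffer is full */

        out[n].logical = (uint32_t)i;
        out[n].physical = blocks[i];
        out[n].len = 1;
        n++;
    }
    return (int)n;
}
```

A fragmented file yields many short extents under this scheme, which is one reason the online approach was compared to online defragmentation.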