Life of an ext4 write request

From Ext4
Revision as of 17:37, 10 May 2011 by Nauman




This article describes how various ext4 write requests are handled in Linux 2.6.34 through 2.6.39. Special attention is paid to how ext4_map_blocks() is called, and with which EXT4_GET_BLOCKS_* flags. In addition, how quota is reserved, claimed, and released is discussed.

Flags passed to ext4_map_blocks()

The primary job of ext4_map_blocks() is to translate an inode and logical block number into a physical block number. If a mapping between a particular logical block number and a physical block number does not exist, ext4_map_blocks() may create one, allocating blocks as necessary. However, there are a number of other things ext4_map_blocks() can do, depending on a flags bitmap passed to it. In some cases a particular flag can radically change its behavior, so it is important to document the current ext4_map_blocks() flags and what they do.
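To make the calling convention concrete, here is a minimal userspace sketch of a flag-driven block-mapping call. The flag name mirrors the real EXT4_GET_BLOCKS_CREATE, but the mapping logic is a toy in-memory table, not ext4's extent tree; all other names here are illustrative.

```c
#include <assert.h>

/* Toy model of the ext4_map_blocks() calling convention: translate a
 * logical block to a physical block, optionally creating the mapping. */
#define EXT4_GET_BLOCKS_CREATE 0x0001   /* allocate if unmapped */

struct toy_map {
    unsigned long pblk;   /* physical block number */
    int mapped;
};

#define NBLOCKS 16
static struct toy_map table[NBLOCKS];          /* toy "inode" */
static unsigned long next_free_pblk = 1000;    /* toy allocator cursor */

/* Returns 1 if a mapping now exists, 0 if the block is a hole and
 * EXT4_GET_BLOCKS_CREATE was not passed. */
static int toy_map_blocks(unsigned long lblk, unsigned long *pblk, int flags)
{
    if (table[lblk].mapped) {
        *pblk = table[lblk].pblk;
        return 1;
    }
    if (!(flags & EXT4_GET_BLOCKS_CREATE))
        return 0;                       /* lookup only: report a hole */
    table[lblk].pblk = next_free_pblk++;    /* "allocate" a block */
    table[lblk].mapped = 1;
    *pblk = table[lblk].pblk;
    return 1;
}
```

The real function returns the number of blocks mapped and fills in a struct ext4_map_blocks, but the shape of the contract is the same: behavior depends entirely on the flags.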

ext4_map_blocks() was previously called ext4_get_blocks(), which is the reason for the naming convention of these flags:

to be written
to be written
to be written
to be written
to be written
to be written
to be written
to be written

Life of a nodelalloc buffered write

The write request

Description of how a write request happens from userspace (i.e., the codepath from generic_file_buffered_write() calling ext4_write_begin() and ext4_{writeback,ordered,journalled}_write_end() and/or the codepath from page_mkwrite() calling ext4_page_mkwrite()).

I/O submission

What happens when generic_writepages() calls ext4_write_cache_pages() which then calls ext4_writepage()

Life of a delalloc buffered write

The write request

Description of how a write request happens from userspace (i.e., the codepath from generic_file_buffered_write() calling ext4_da_write_begin() and ext4_da_write_end() and/or the codepath from page_mkwrite() calling ext4_page_mkwrite()).

sys_write() calls the file system's write_begin() function, which maps in the page for it to modify. The kernel then copies the buffer from userspace into the page returned by write_begin(), and finally calls write_end(). In fs/ext4/inode.c, around line 4069, one can find the definition of the write_begin() and write_end() functions used in delayed allocation mode. If we are not doing delayed allocation, there are three struct address_space_operations which can be used depending on which journaling mode we are in: ext4_ordered_aops, ext4_writeback_aops, and ext4_journalled_aops. We end up using ext4_write_begin() in nodelalloc mode. The tricky bit happens when we are in delalloc mode, where the write_begin function is ext4_da_write_begin().

In ext4_da_write_begin(), there is a potential fallback to nodelalloc mode. That happens if we are low on space (and possibly low on quota; but not sure). That's because when we estimate how much space is needed, we can guess wrong, especially when it comes to metadata allocation. We tend to guess high, because in particular for ENOSPC, we don't want to run out of space when we need to allocate an extent tree block. In a delalloc write request, we don't actually do the block allocation until writeback time --- and at that point we can't return an error to userspace. If we fail to allocate space at writeback time, data can potentially be lost without the calling application knowing about it. (This is not the case for direct I/O, of course, since it doesn't use writeback; but delalloc is all about what happens for buffered writes.) So when we come close to running out of disk space, we turn off delayed allocation.

A lot of the common-case handling for write_begin(), in terms of allocating pages and so on, happens in library functions which are the same for most file systems. grab_cache_page_write_begin() at line 3228 and block_write_begin() at line 3236 are both VFS library functions. All the file system has to do is supply a "get_block" function; in this case, the "get_block" function is ext4_da_get_block_prep(). This function takes a logical block number (the sector_t iblock) and fills in a buffer_head structure with the physical block number. But in the delalloc case there is one other thing it can do: it can reserve space so we can do the allocation at writeback time. (In the case of ext2, we would do the block allocation in the "get_block" function.) So this is where we call ext4_da_reserve_space(), and this is where we reserve space with respect to the quota system. The problem is that at this point, we know that this is a newly allocated block --- a block which currently isn't in the inode's extent tree.
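The delalloc "get_block" step can be modeled in a few lines: on a hole, do not allocate, just reserve space and mark the buffer delayed so writeback can allocate later. This is a toy sketch; the struct and counter names are illustrative, and the real code fills a buffer_head and calls ext4_da_reserve_space().

```c
#include <assert.h>
#include <errno.h>

/* Toy buffer state standing in for buffer_head flags. */
struct toy_buffer {
    int mapped;     /* has an on-disk location */
    int delayed;    /* space reserved, allocation deferred */
};

static long free_blocks = 100;   /* toy free-space counter */

/* Sketch of ext4_da_get_block_prep() for a hole: reserve, don't allocate.
 * ENOSPC can only happen here, at write time, not at writeback time. */
static int toy_da_get_block(struct toy_buffer *bh)
{
    if (bh->mapped)
        return 0;                 /* already mapped: nothing to do */
    if (free_blocks <= 0)
        return -ENOSPC;           /* reservation is allowed to fail */
    free_blocks--;                /* reserve one block's worth of space */
    bh->delayed = 1;              /* allocation happens at writeback */
    return 0;
}
```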

But what we don't know is whether the block is part of a cluster (in the bigalloc case) which has already been allocated or not. We have to keep track of whether we have already accounted for this cluster with respect to the quota system. Otherwise, when we write into a new cluster, the first time we call ext4_da_get_block() for the cluster we need to reserve a full cluster's worth of blocks (say that's 1MB); but when we call ext4_da_get_block() for the next block in the same cluster, we don't want to reserve another cluster's worth of blocks. We do the allocation at writeback time. The whole point of delayed allocation is that we don't know which block (or cluster) we will actually use when we first write data into the page cache; we wait until the last possible minute to decide where on disk we will actually locate the data. We can't set the bit in the block/cluster allocation bitmap until that point, because we don't know the block number.
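The per-cluster bookkeeping described above can be sketched as follows. This is a toy model (assumed cluster size: 16 blocks; array names invented): only the first delayed block written into a cluster reserves space, and later blocks in the same cluster see that the cluster is already accounted for.

```c
#include <assert.h>

#define CLUSTER_SHIFT 4     /* toy bigalloc: 16 blocks per cluster */
#define MAX_CLUSTERS 64

static int cluster_reserved[MAX_CLUSTERS];  /* per-cluster accounting */
static long reserved_clusters;              /* quota-visible total */

/* Returns 1 if this call reserved a new cluster, 0 if the cluster was
 * already covered by an earlier reservation. */
static int reserve_for_block(unsigned long lblk)
{
    unsigned long cluster = lblk >> CLUSTER_SHIFT;

    if (cluster_reserved[cluster])
        return 0;               /* second block in the same cluster */
    cluster_reserved[cluster] = 1;
    reserved_clusters++;        /* charge quota once per cluster */
    return 1;
}
```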

So that's why the quota system works this way. First you reserve space: ext4_da_get_block_prep() calls ext4_da_reserve_space(), which calls dquot_reserve_block(); then, in the writeback code, we call dquot_claim_block(). dquot_reserve_block() can fail; if it does, we return ENOSPC to the user. dquot_claim_block() won't ever fail, since we have already reserved the space via dquot_reserve_block(). This fulfills the requirement that we aren't allowed to fail, ever, in the writeback path (except for physical I/O errors from the hard drive).
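The two-phase accounting can be captured in a small model: reservation at write time may fail with ENOSPC, while claiming at writeback time cannot fail by construction, because claimed blocks only ever move out of the reserved pool. The struct and field names here are illustrative, not the real struct dquot layout.

```c
#include <assert.h>
#include <errno.h>

struct toy_quota {
    long limit;      /* hard block limit */
    long used;       /* blocks actually allocated */
    long reserved;   /* blocks reserved but not yet allocated */
};

/* Analogue of dquot_reserve_block(): the only point that may fail. */
static int toy_reserve_block(struct toy_quota *q, long nblocks)
{
    if (q->used + q->reserved + nblocks > q->limit)
        return -ENOSPC;
    q->reserved += nblocks;
    return 0;
}

/* Analogue of dquot_claim_block(): move blocks from reserved to used.
 * Cannot fail, because the space was set aside by the reserve step. */
static void toy_claim_block(struct toy_quota *q, long nblocks)
{
    q->reserved -= nblocks;
    q->used += nblocks;
}
```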

To finish up sys_write(): once write_begin() has set up the buffer_head mappings for the page (or reserved space, in the delalloc case), the kernel writes the user data buffer to the page and then calls write_end(). write_end() does a bunch of accounting, including updating i_size. Most of this isn't relevant for delayed allocation, and a lot of it is done in the VFS library function generic_write_end().

Now we come to the writeback code path. There is a lot of complexity in what happens in the writeback daemons; fs/fs-writeback.c and mm/page-writeback.c are the best sources for that functionality, partially because it has been changing a lot, and partially because there is something like an 8- or 10-deep function call chain, and it's *extremely* confusing. Once writeback decides which inode to write back, and how many pages to ask the file system to write back, it calls the file system's writepages() function. If the file system doesn't have a writepages() function, it must have a writepage() function, and the writeback daemon will call writepage() for each page that it wants written back. If the file system does have a writepages() function, then the writepage() function will only be called by the MM subsystem when it needs to do "direct reclaim".

File system people don't like direct reclaim, since it tends to blow out the stack, and cause recursive loops, and lots of other nastiness. In the case of ext4, our writepage() function will refuse to write back a page that is under delayed allocation. That is, if we don't know where on disk the page is supposed to go, then we will call redirty_page_for_writepage() and return. So the page gets marked as dirty, and the VM won't be able to drop the page from the page cache. The VM folks don't like that we do this, but the FS position is "tough". Long run, we hope to get rid of direct reclaim altogether. ext4_da_writepages() is what gets used at writeback time. At the high level, it does two things:

(a) Figure out how many pages it really should write back.  We don't
    trust the writeback subsystem, because it tends to request too few
    pages to be written back, so we do our own calculation.
(b) Look at all of the pages in the range for which writeback was
    requested (in the struct writeback_control structure), and see how
    many of them are dirty and unlocked.

We gather up a contiguous range of pages which are dirty and unlocked, and lock them. All of that is done in the function write_cache_pages_da() in the 2.6.34 tree. The function write_cache_pages_da() is misnamed, and this is one area which has changed a lot between 2.6.34 and 2.6.39 (the code has been reorganized significantly). After we get the region of pages that we're going to write, we call mpage_da_map_and_submit(). This is where we call ext4_map_blocks(), and this is where we actually do the block allocation. At that point we know how many pages we want to write back, so the file system can do a better job of deciding which blocks to use or allocate. That's the big win of delalloc mode. It's also faster, since we do one allocation for a large number of pages, instead of doing the block allocation one block at a time with no idea of how many blocks we will end up needing (which is what ext2 did). In ext4_map_blocks(), we call dquot_claim_block().
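The page-gathering step can be sketched as a simple scan: find the first dirty, unlocked page in the requested range, then collect and lock the contiguous run starting there, so the blocks for the whole run can be mapped in one ext4_map_blocks() call. This is a toy model; the struct and function names are invented, and the real code works with struct page and the page cache radix tree.

```c
#include <assert.h>

struct toy_page { int dirty; int locked; };

/* Collect the first contiguous run of dirty, unlocked pages, locking
 * each page as it is taken.  Returns the run length and stores the
 * run's starting index in *start. */
static int gather_run(struct toy_page *pages, int npages, int *start)
{
    int i = 0, len = 0;

    /* Skip pages we can't write back (clean, or already locked). */
    while (i < npages && !(pages[i].dirty && !pages[i].locked))
        i++;
    *start = i;

    /* Take the contiguous dirty+unlocked run that follows. */
    while (i < npages && pages[i].dirty && !pages[i].locked) {
        pages[i].locked = 1;
        i++;
        len++;
    }
    return len;
}
```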

ext4_map_blocks() has lots of different flags because it gets called many different ways.

I/O submission

What happens in ext4_da_writepages()

Modifications with dioread_nolock

What changes if dioread_nolock is enabled.

Life of a direct I/O write

(Edited by Nauman. Ted or Jiaying, please verify.)

Ext4 has a specially optimized way of doing direct I/O writes to files, as long as the writes are not size-extending writes.

The most relevant logic can be found in the ext4_ext_direct_IO() function in fs/ext4/inode.c. At the start of a write, we set the EXT4_STATE_DIO_UNWRITTEN inode state bit, which indicates that writes are under way for this file. The extents into which the write happens are not marked initialized, to prevent stale data from being exposed to parallel buffered reads. If extents are being allocated for the write, they are marked uninitialized.

For (non-AIO) direct I/O, submission and completion coincide, so we immediately convert newly written extents to initialized by calling ext4_map_blocks() with the EXT4_GET_BLOCKS_IO_CONVERT_EXT flag set.
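The unwritten-extent trick can be illustrated with a small state model: blocks are allocated, but while the extent stays uninitialized a concurrent read sees zeros rather than stale disk contents; the conversion step (what EXT4_GET_BLOCKS_IO_CONVERT_EXT accomplishes after the I/O lands) flips the state so the written data becomes visible. The struct and function names here are invented for illustration.

```c
#include <assert.h>
#include <string.h>

enum ext_state { EXT_UNINIT, EXT_INIT };

struct toy_extent {
    enum ext_state state;
    char data[8];       /* stands in for on-disk block contents */
};

/* A read through an uninitialized extent must return zeros, never the
 * stale bytes that happen to sit in the allocated blocks. */
static void toy_read(const struct toy_extent *e, char *buf, int len)
{
    if (e->state == EXT_UNINIT)
        memset(buf, 0, len);
    else
        memcpy(buf, e->data, len);
}

/* Conversion after I/O completion: the data is now valid. */
static void toy_convert(struct toy_extent *e)
{
    e->state = EXT_INIT;
}
```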

If a write extends the file size, it is actually handled by ext4_ind_direct_IO(). The name is misleading ('ind' is supposed to indicate indirect mapping, not extent mapping). This function adds the inode to the orphan list, to handle the case where there is a crash in the middle of the write.

Life of an async direct I/O write

In the AIO direct I/O case, we defer the conversion of extents; it is done by ext4_end_io_dio(), which is called when I/O completion happens.
