Life of an ext4 write request

From Ext4
Latest revision as of 17:37, 10 May 2011


== Introduction ==

This article describes how various ext4 write requests are handled in Linux 2.6.34 -- 2.6.39. Special attention will be paid to how ext4_map_blocks() is called, and with what EXT4_GET_BLOCKS_* flags. In addition, how quota is reserved, claimed, and released will be discussed.

== Flags passed to ext4_map_blocks() ==

The primary function of ext4_map_blocks() is to translate an inode and logical block number into a physical block number. If a mapping between a particular logical block number and a physical block number does not exist, ext4_map_blocks() may create such a mapping, allocating blocks as necessary. However, there are a number of other things which ext4_map_blocks() can do, based on a flags bitmap which is passed to it. In some cases, a particular flag can radically change its behavior, so it's important to document the current ext4_map_blocks() flags and what they do.

ext4_map_blocks() was previously called ext4_get_blocks(), which is the reason for the naming convention of these flags:

EXT4_GET_BLOCKS_CREATE 
to be written
EXT4_GET_BLOCKS_UNINIT_EXT 
to be written
EXT4_GET_BLOCKS_CREATE_UNINIT_EXT 
to be written
EXT4_GET_BLOCKS_DELALLOC_RESERVE 
to be written
EXT4_GET_BLOCKS_PRE_IO 
to be written
EXT4_GET_BLOCKS_CONVERT 
to be written
EXT4_GET_BLOCKS_IO_CREATE_EXT 
to be written
EXT4_GET_BLOCKS_IO_CONVERT_EXT 
to be written

== Life of a nodelalloc buffered write ==

== The write request ==

Description of how a write request happens from userspace (i.e., the codepath from generic_file_buffered_write() calling ext4_write_begin() and ext4_{writeback,ordered,journalled}_write_end() and/or the codepath from page_mkwrite() calling ext4_page_mkwrite()).

== I/O submission ==

What happens when generic_writepages() calls ext4_write_cache_pages(), which then calls ext4_writepage().

== Life of a delalloc buffered write ==

== The write request ==

Description of how a write request happens from userspace (i.e., the codepath from generic_file_buffered_write() calling ext4_da_write_begin() and ext4_da_write_end() and/or the codepath from page_mkwrite() calling ext4_page_mkwrite()).

sys_write() calls the file system's write_begin() function, which will map in the page structure for it to modify. It then memcpy()s the buffer from userspace into the memory page that was returned by write_begin(). Finally, it calls write_end(). At fs/ext4/inode.c, around line 4069, one can find the definitions of the write_begin() and write_end() functions used when we are in delayed allocation mode. If we are not doing delayed allocation, then there are three struct address_space_operations which can be used, depending on which journaling mode we are in: ext4_ordered_aops, ext4_writeback_aops, and ext4_journalled_aops. We end up using ext4_write_begin() in nodelalloc mode. The tricky bit happens when we are in delalloc mode, where the write_begin function is ext4_da_write_begin().

In ext4_da_write_begin(), there's a potential fallback to nodelalloc mode. That happens if we are low on space (and possibly if we are low on quota; this is uncertain). That's because when we estimate how much space is needed, we can guess wrong, especially as it comes to metadata allocation. We tend to guess high because, in particular for ENOSPC, we don't want to run out of space when we need to allocate an extent tree block. That's because in a delalloc write request, we don't actually do the block allocation until writeback time --- and at that point we can't return an error to userspace. If we fail to allocate space at writeback time, data can potentially be lost without the calling application knowing about it. (This is not the case for direct I/O, of course, since it doesn't use writeback; but delalloc is all about what happens for buffered writes.) So when we come close to running out of disk space, we will turn off delayed allocation.

A lot of the common case handling for write_begin in terms of allocating pages, etc., happens in library functions which are the same for most file systems. grab_cache_page_write_begin() at line 3228 and block_write_begin() at line 3236 are both VFS library functions. All the file system has to do is supply a "get_block" function. In this case, the "get_block" function is ext4_da_get_block_prep(). This function takes a logical block number (the sector_t iblock) and fills in a buffer_head structure with the physical block number. But in the delalloc case, there's one other thing it can do: it can reserve space so we can do the allocation at writeback time. (In the case of ext2, we would be doing the block allocation in the "get_block" function.) So this is where we call ext4_da_reserve_space(), and this is where we reserve space with respect to the quota system. The problem is that at this point, we know that this is a newly allocated block --- that is, a block which currently isn't in the inode's extent tree.

But what we don't know is whether the block is part of a cluster (in the bigalloc case) which has already been allocated or not. We're going to have to keep track of whether we've already accounted for this cluster with respect to the quota system. Otherwise, when we write into a new cluster, the first time we call ext4_da_get_block() for the cluster, we need to reserve a full cluster's worth of blocks (say that's 1MB); but when we call ext4_da_get_block() for the next block in the cluster, we don't want to reserve another cluster's worth of blocks. We do the allocation at writeback time. The whole point of delayed allocation is that we don't know what block (or cluster) we will actually use when we first write data into the page cache; we wait until the last possible minute to decide where on disk we will actually locate the data. We can't actually set the bit in the block/cluster allocation bitmap until that point, because we don't know the block number.

So that's why the quota system works this way. First you reserve space: ext4_da_get_block_prep() calls ext4_da_reserve_space(), which calls dquot_reserve_block(). Then, in the writeback code, we call dquot_claim_block(). dquot_reserve_block() can fail; if it fails, then we return ENOSPC to the user. dquot_claim_block() won't ever fail, since we've already reserved the space via dquot_reserve_block(). This is to fulfill the requirement that we aren't allowed to fail, ever, in the writeback path (except for physical I/O errors from the hard drive).

To finish up sys_write(): once write_begin() has set up the buffer_head mappings to the page (or reserved space, in the delalloc case), it will write the user data buffer to the page, and then call write_end(). write_end() does a bunch of accounting stuff, including updating i_size. Most of this isn't relevant for delayed allocation, and a lot of it gets done in the VFS library function generic_write_end().

Now we come to the writeback code path. There's lots and lots of complexity in what happens in the writeback daemons; fs/fs-writeback.c and mm/page-writeback.c are the best sources for that functionality, partially because it's been changing a lot, and partially because there's something like an 8- or 10-deep function call chain, and it's *extremely* confusing. Once writeback decides which inode to write back, and how many pages to ask the file system to write back, though, it will call the filesystem's writepages() function. If the file system doesn't have a writepages() function, it must have a writepage() function, and the writeback daemon will call writepage() for each page that it wants written back. If the file system does have a writepages() function, then the writepage() function will only be called by the MM subsystem when it needs to do "direct reclaim".

File system people don't like direct reclaim, since it tends to blow out the stack, cause recursive loops, and lots of other nastiness. In the case of ext4, our writepage() function will refuse to write back a page that is under delayed allocation. That is, if we don't know where on disk the page is supposed to go, we will call redirty_page_for_writepage() and return. So the page gets marked as dirty, and the VM won't be able to drop the page from the page cache. The VM folks don't like that we do this, but the FS position is "tough". In the long run, we hope to get rid of direct reclaim altogether. ext4_da_writepages() is what gets used at writeback time. At a high level, it does two things:

(a) Figure out how many pages it really should write back. We don't trust the writeback subsystem, because it tends to request too few pages to be written back, so we do our own calculation.
(b) Look at all of the pages in the range for which writeback was requested (in the struct writeback_control structure), and see how many of them are dirty and unlocked.

We will gather up a contiguous range of pages which are dirty and unlocked, and lock the pages. All of that is done in the function write_cache_pages_da() in the 2.6.34 tree. The function write_cache_pages_da() is misnamed, and this is one area which has changed a lot between 2.6.34 and 2.6.39 (the code has been reorganized significantly). After we get the region of pages that we're going to write, we then call mpage_da_map_and_submit(). This is where we call ext4_map_blocks(), and this is where we actually do the block allocation. At that point, we know how many pages we want to write back, so the file system can do a better job deciding which blocks to use/allocate. That's the big win of delalloc mode. It's also faster, since we do one allocation for a large number of pages, instead of doing the block allocation one block at a time, with no context of how many blocks we will end up needing (which is what ext2 did). In ext4_map_blocks(), we call dquot_claim_block().

ext4_map_blocks() has lots of different flags because it gets called many different ways.

== I/O submission ==

What happens in ext4_writepages()

== Modifications with dioread_nolock ==

What changes if dioread_nolock is enabled.

== Life of a direct I/O write ==

(Edited by Nauman. Ted or Jiaying, please verify.)

Ext4 has a specially optimized way of doing direct I/O writes to files, as long as the writes are not size-extending writes.

The most relevant logic can be found in the ext4_ext_direct_IO() function in fs/ext4/inode.c. At the start of a write, we set the inode state bit EXT4_STATE_DIO_UNWRITTEN. That indicates that writes are under way for this file. The extents on which the write happens are not marked initialized, to prevent stale data from being exposed to parallel buffered reads. If extents are being allocated for the write, they are marked uninitialized.

For (non-AIO) direct I/O, submission and completion coincide, so we immediately convert newly written extents to initialized by calling ext4_map_blocks() with the flags set to EXT4_GET_BLOCKS_IO_CONVERT_EXT.

If the writes extend the file size, they actually get handled by ext4_ind_direct_IO(). The name is misleading ('ind' is supposed to indicate indirect mapping, not extent mapping). This function adds the inode to the orphan list, to handle the case where there is a crash in the middle of the write.

== Life of an async direct I/O write ==

For the (AIO) direct I/O case, we defer the conversion of extents; it is performed by ext4_end_io_dio(), which gets called when I/O completion happens.
