https://ext4.wiki.kernel.org/index.php?title=Clarifying_Direct_IO%27s_Semantics&feed=atom&action=historyClarifying Direct IO's Semantics - Revision history2024-03-28T20:00:54ZRevision history for this page on the wikiMediaWiki 1.19.24https://ext4.wiki.kernel.org/index.php?title=Clarifying_Direct_IO%27s_Semantics&diff=1770&oldid=prevCwyang: I experimented that allocating writes does not fall back to buffered I/O in my machine, kernel 2.6.30 and ext3.2009-10-19T07:13:01Z<p>I experimented that allocating writes does not fall back to buffered I/O in my machine, kernel 2.6.30 and ext3.</p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 07:13, 19 October 2009</td>
</tr><tr><td colspan="2" class="diff-lineno">Line 19:</td>
<td colspan="2" class="diff-lineno">Line 19:</td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>== Allocating writes ==</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>== Allocating writes ==</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div>Similarly unstated in the Linux man page --- or any specification I could find on the web --- is any mention about what happens if an O_DIRECT write needs to allocate blocks; for example, because the write is extending the size the file, or the write system call is writing into a sparse file's "hole" where a block had not been previously allocated.  <del class="diffchange diffchange-inline">Current </del>Linux <del class="diffchange diffchange-inline">implementation falls </del>back to buffered I/O, such that the data goes through the page cache.  The current implementation does wait until the I/O has been posted (although not necessarily with a barrier such that the data is guaranteed written to stable store by the storage device).  However, Linux does not wait until the metadata associated with the block allocation has been committed to the file system; hence, if the system crashes after an allocating write completes, there is no guarantee the data will be accessible to an application after the system reboots.  To provide this guarantee, the application must use [http://manpages.courier-mta.org/htmlman2/fsync.2.html fsync(2)], or set the O_SYNC or O_DSYNC flag on the file descriptor via [http://manpages.courier-mta.org/htmlman2/fcntl.2.html fcntl(2)].</div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div>Similarly unstated in the Linux man page --- or any specification I could find on the web --- is any mention about what happens if an O_DIRECT write needs to allocate blocks; for example, because the write is extending the size the file, or the write system call is writing into a sparse file's "hole" where a block had not been previously allocated.  <ins class="diffchange diffchange-inline">Certain </ins>Linux <ins class="diffchange diffchange-inline">implementations fall </ins>back to buffered I/O, such that the data goes through the page cache.  The current implementation does wait until the I/O has been posted (although not necessarily with a barrier such that the data is guaranteed written to stable store by the storage device).  However, Linux does not wait until the metadata associated with the block allocation has been committed to the file system; hence, if the system crashes after an allocating write completes, there is no guarantee the data will be accessible to an application after the system reboots.  To provide this guarantee, the application must use [http://manpages.courier-mta.org/htmlman2/fsync.2.html fsync(2)], or set the O_SYNC or O_DSYNC flag on the file descriptor via [http://manpages.courier-mta.org/htmlman2/fcntl.2.html fcntl(2)].</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>Given that with an allocating write, an explicit fsync(2) (or write with O_SYNC/O_DSYNC) is required, there doesn't seem to be much point in waiting until the data I/O is complete if the O_DIRECT write has fallen back to using buffered I/O --- after all, if the data has been copied into the page cache, the data buffered passed into the [http://manpages.courier-mta.org/htmlman2/write.2.html write(2)] system call can be safely reused for other purposes, so it may be that the kernel should be allowed to return as soon as the data has been copied into the page cache.</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>Given that with an allocating write, an explicit fsync(2) (or write with O_SYNC/O_DSYNC) is required, there doesn't seem to be much point in waiting until the data I/O is complete if the O_DIRECT write has fallen back to using buffered I/O --- after all, if the data has been copied into the page cache, the data buffered passed into the [http://manpages.courier-mta.org/htmlman2/write.2.html write(2)] system call can be safely reused for other purposes, so it may be that the kernel should be allowed to return as soon as the data has been copied into the page cache.</div></td></tr>
</table>Cwyanghttps://ext4.wiki.kernel.org/index.php?title=Clarifying_Direct_IO%27s_Semantics&diff=1765&oldid=prevPepsiman: /* Writes into preallocated space */2009-09-03T16:07:07Z<p><span dir="auto"><span class="autocomment">Writes into preallocated space</span></span></p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 16:07, 3 September 2009</td>
</tr><tr><td colspan="2" class="diff-lineno">Line 29:</td>
<td colspan="2" class="diff-lineno">Line 29:</td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>In recent Linux kernels, it is possible to request that the file system allocate blocks without initializing the blocks first, using the [http://manpages.courier-mta.org/htmlman2/fallocate.2.html fallocate(2)] system call.  Since those blocks contain previously unused data blocks, those blocks or extents must be marked as uninitialized, so that reads of these uninitialized blocks will return a zero block instead of the previous contents of those blocks (which might cause a security exposure).  The first time an application writes into preallocated block, the file system must clear the uninitialized bit, so that a subsequent read of that data block will return the written data, instead of a zero block.</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>In recent Linux kernels, it is possible to request that the file system allocate blocks without initializing the blocks first, using the [http://manpages.courier-mta.org/htmlman2/fallocate.2.html fallocate(2)] system call.  Since those blocks contain previously unused data blocks, those blocks or extents must be marked as uninitialized, so that reads of these uninitialized blocks will return a zero block instead of the previous contents of those blocks (which might cause a security exposure).  The first time an application writes into preallocated block, the file system must clear the uninitialized bit, so that a subsequent read of that data block will return the written data, instead of a zero block.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div>This requirement, when applied to a direct I/O write, has similar implications to the allocating write case, described above.  Although the space for the direct I/O has already been reserved, a change to the file system metadata is required to mark the just-written data block or extent as being initialized.  For file systems that use a journal to assure that the file system metadata is consistent, requiring direct I/O write to block until a file system commit is completed would be an unacceptable performance impact.  On the other hand, if the data is not guaranteed to be present after a system crash unless the application uses an explicit fsync(2) call, this could take some application programmers by surprise --- especially since testing that the application data can be recovered after crashes that take place immediately after an allocating write or a write into a preallocated block are cases that might not be well tested by all open source <del class="diffchange diffchange-inline">database</del>.</div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div>This requirement, when applied to a direct I/O write, has similar implications to the allocating write case, described above.  Although the space for the direct I/O has already been reserved, a change to the file system metadata is required to mark the just-written data block or extent as being initialized.  For file systems that use a journal to assure that the file system metadata is consistent, requiring direct I/O write to block until a file system commit is completed would be an unacceptable performance impact.  On the other hand, if the data is not guaranteed to be present after a system crash unless the application uses an explicit fsync(2) call, this could take some application programmers by surprise --- especially since testing that the application data can be recovered after crashes that take place immediately after an allocating write or a write into a preallocated block are cases that might not be well tested by all open source <ins class="diffchange diffchange-inline">databases</ins>.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>The proposed solution is the same for allocating writes; that we document that O_DIRECT does not imply synchronous I/O of any file control data, and that it is unspecified whether data written into newly allocated blocks, or uninitialized regions of the file will survive a system crash until the data is explicitly flushed to disk via a system facility such as fsync(2).  For that reason, the only thing which the application can infer in the case of writes to preallocated (uninitialized) file regions or file regions which require block allocation is that when the write(2) system call, the data buffer passed to write(2) may be reused for other purposes.</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>The proposed solution is the same for allocating writes; that we document that O_DIRECT does not imply synchronous I/O of any file control data, and that it is unspecified whether data written into newly allocated blocks, or uninitialized regions of the file will survive a system crash until the data is explicitly flushed to disk via a system facility such as fsync(2).  For that reason, the only thing which the application can infer in the case of writes to preallocated (uninitialized) file regions or file regions which require block allocation is that when the write(2) system call, the data buffer passed to write(2) may be reused for other purposes.</div></td></tr>
</table>Pepsimanhttps://ext4.wiki.kernel.org/index.php?title=Clarifying_Direct_IO%27s_Semantics&diff=1764&oldid=prevPepsiman: /* Allocating writes */2009-09-03T16:05:52Z<p><span dir="auto"><span class="autocomment">Allocating writes</span></span></p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 16:05, 3 September 2009</td>
</tr><tr><td colspan="2" class="diff-lineno">Line 19:</td>
<td colspan="2" class="diff-lineno">Line 19:</td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>== Allocating writes ==</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>== Allocating writes ==</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div>Similarly unstated in the Linux man page --- or any specification I could find on the web --- is any mention about what happens if an O_DIRECT write needs to allocate blocks; for example, because the write is extending the size the file, or the write system call is writing into a sparse file's "hole" where a block had not been previously allocated.  Current Linux <del class="diffchange diffchange-inline">implementations </del>falls back to buffered I/O, such that the data goes through the page cache.  The current implementation does wait until the I/O has been posted (although not necessarily with a barrier such that the data is guaranteed written to stable store by the storage device).  However, Linux does not wait until the metadata associated with the block allocation has been committed to the file system; hence, if the system crashes after an allocating write completes, there is no guarantee the data will be accessible to an application after the system reboots.  To provide this guarantee, the application must use [http://manpages.courier-mta.org/htmlman2/fsync.2.html fsync(2)], or set the O_SYNC or O_DSYNC flag on the file descriptor via [http://manpages.courier-mta.org/htmlman2/fcntl.2.html fcntl(2)].</div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div>Similarly unstated in the Linux man page --- or any specification I could find on the web --- is any mention about what happens if an O_DIRECT write needs to allocate blocks; for example, because the write is extending the size the file, or the write system call is writing into a sparse file's "hole" where a block had not been previously allocated.  Current Linux <ins class="diffchange diffchange-inline">implementation </ins>falls back to buffered I/O, such that the data goes through the page cache.  The current implementation does wait until the I/O has been posted (although not necessarily with a barrier such that the data is guaranteed written to stable store by the storage device).  However, Linux does not wait until the metadata associated with the block allocation has been committed to the file system; hence, if the system crashes after an allocating write completes, there is no guarantee the data will be accessible to an application after the system reboots.  To provide this guarantee, the application must use [http://manpages.courier-mta.org/htmlman2/fsync.2.html fsync(2)], or set the O_SYNC or O_DSYNC flag on the file descriptor via [http://manpages.courier-mta.org/htmlman2/fcntl.2.html fcntl(2)].</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>Given that with an allocating write, an explicit fsync(2) (or write with O_SYNC/O_DSYNC) is required, there doesn't seem to be much point in waiting until the data I/O is complete if the O_DIRECT write has fallen back to using buffered I/O --- after all, if the data has been copied into the page cache, the data buffered passed into the [http://manpages.courier-mta.org/htmlman2/write.2.html write(2)] system call can be safely reused for other purposes, so it may be that the kernel should be allowed to return as soon as the data has been copied into the page cache.</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>Given that with an allocating write, an explicit fsync(2) (or write with O_SYNC/O_DSYNC) is required, there doesn't seem to be much point in waiting until the data I/O is complete if the O_DIRECT write has fallen back to using buffered I/O --- after all, if the data has been copied into the page cache, the data buffered passed into the [http://manpages.courier-mta.org/htmlman2/write.2.html write(2)] system call can be safely reused for other purposes, so it may be that the kernel should be allowed to return as soon as the data has been copied into the page cache.</div></td></tr>
</table>Pepsimanhttps://ext4.wiki.kernel.org/index.php?title=Clarifying_Direct_IO%27s_Semantics&diff=1763&oldid=prevPepsiman: /* Writes into preallocated space */2009-09-03T16:03:13Z<p><span dir="auto"><span class="autocomment">Writes into preallocated space</span></span></p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 16:03, 3 September 2009</td>
</tr><tr><td colspan="2" class="diff-lineno">Line 27:</td>
<td colspan="2" class="diff-lineno">Line 27:</td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>== Writes into preallocated space ==</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>== Writes into preallocated space ==</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div>In recent Linux kernels, it is possible to request that the file system allocate blocks <del class="diffchange diffchange-inline">with out </del>initializing the blocks first, using the [http://manpages.courier-mta.org/htmlman2/fallocate.2.html fallocate(2)] system call.  Since those blocks contain previously unused data blocks, those blocks or extents must be marked as uninitialized, so that reads of these uninitialized blocks will return a zero block instead of the previous contents of those blocks (which might cause a security exposure).  The first time an application writes into preallocated block, the file system must clear the uninitialized bit, so that a subsequent read of that data block will return the written data, instead of a zero block.</div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div>In recent Linux kernels, it is possible to request that the file system allocate blocks <ins class="diffchange diffchange-inline">without </ins>initializing the blocks first, using the [http://manpages.courier-mta.org/htmlman2/fallocate.2.html fallocate(2)] system call.  Since those blocks contain previously unused data blocks, those blocks or extents must be marked as uninitialized, so that reads of these uninitialized blocks will return a zero block instead of the previous contents of those blocks (which might cause a security exposure).  The first time an application writes into preallocated block, the file system must clear the uninitialized bit, so that a subsequent read of that data block will return the written data, instead of a zero block.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>This requirement, when applied to a direct I/O write, has similar implications to the allocating write case, described above.  Although the space for the direct I/O has already been reserved, a change to the file system metadata is required to mark the just-written data block or extent as being initialized.  For file systems that use a journal to assure that the file system metadata is consistent, requiring direct I/O write to block until a file system commit is completed would be an unacceptable performance impact.  On the other hand, if the data is not guaranteed to be present after a system crash unless the application uses an explicit fsync(2) call, this could take some application programmers by surprise --- especially since testing that the application data can be recovered after crashes that take place immediately after an allocating write or a write into a preallocated block are cases that might not be well tested by all open source database.</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>This requirement, when applied to a direct I/O write, has similar implications to the allocating write case, described above.  Although the space for the direct I/O has already been reserved, a change to the file system metadata is required to mark the just-written data block or extent as being initialized.  For file systems that use a journal to assure that the file system metadata is consistent, requiring direct I/O write to block until a file system commit is completed would be an unacceptable performance impact.  On the other hand, if the data is not guaranteed to be present after a system crash unless the application uses an explicit fsync(2) call, this could take some application programmers by surprise --- especially since testing that the application data can be recovered after crashes that take place immediately after an allocating write or a write into a preallocated block are cases that might not be well tested by all open source database.</div></td></tr>
</table>Pepsimanhttps://ext4.wiki.kernel.org/index.php?title=Clarifying_Direct_IO%27s_Semantics&diff=1761&oldid=prevDavecb: /* Irix */2009-08-27T16:20:44Z<p><span dir="auto"><span class="autocomment">Irix</span></span></p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 16:20, 27 August 2009</td>
</tr><tr><td colspan="2" class="diff-lineno">Line 98:</td>
<td colspan="2" class="diff-lineno">Line 98:</td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>Irix does not seem to document what happens with a write(2) system call needs to allocate blocks, and file system metadata changes are required.  Irix also does not document what kind of page cache coherency guarantees it provides (if any).  The write(2) and read(2) man pages seems to hint that Irix provides at least some page cache coherency, but the exact nature of the coherency guaranteed by Irix does not seem to be stated explicitly anywhere.</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>Irix does not seem to document what happens with a write(2) system call needs to allocate blocks, and file system metadata changes are required.  Irix also does not document what kind of page cache coherency guarantees it provides (if any).  The write(2) and read(2) man pages seems to hint that Irix provides at least some page cache coherency, but the exact nature of the coherency guaranteed by Irix does not seem to be stated explicitly anywhere.</div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">== Solaris ==</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">Solaris provides "forcedirectio" as a mount option, and when it</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">is applied, data is transferred without being copied to the</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">buffer cache. It is recommended as a performance optimization</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">when large amounts of data are transferred sequentially, unlike</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">other discussion of direct I/O. </ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">In practice, forecedirectio indeed does appear to</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">coalesce multiple contiguous logical writes into a substantially </ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">smaller number of larger physical writes. This improves </ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">performance when doing full-table scans or other large</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">I/O operations. </ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">See man mount_ufs(1M), directio(3C)</ins></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>= Conclusion =</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>= Conclusion =</div></td></tr>
</table>Davecbhttps://ext4.wiki.kernel.org/index.php?title=Clarifying_Direct_IO%27s_Semantics&diff=1758&oldid=prevTytso: Add Linus Quote on O_DIRECT2009-08-22T11:34:46Z<p>Add Linus Quote on O_DIRECT</p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 11:34, 22 August 2009</td>
</tr><tr><td colspan="2" class="diff-lineno">Line 102:</td>
<td colspan="2" class="diff-lineno">Line 102:</td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>Most users of direct I/O will hopefully not be affected by the clarifications in this document.  These users tend to not to use allocating writes with Direct I/O, or are already using an explicit fsync(2) after such allocating writes.  However, if there are applications that have been making assumptions about direct I/O implying O_SYNC semantics to meet (for example) database ACID requirements, changing their application to meet the semantics documented herein (which after all, is all applications have been getting anyway) should not be difficult.</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>Most users of direct I/O will hopefully not be affected by the clarifications in this document.  These users tend to not to use allocating writes with Direct I/O, or are already using an explicit fsync(2) after such allocating writes.  However, if there are applications that have been making assumptions about direct I/O implying O_SYNC semantics to meet (for example) database ACID requirements, changing their application to meet the semantics documented herein (which after all, is all applications have been getting anyway) should not be difficult.</div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">Finally, anyone considering use of O_DIRECT should keep in mind Linus's comments on the subject:</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"><blockquote></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances." -- Linus</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></blockquote></ins></div></td></tr>
</table>Tytsohttps://ext4.wiki.kernel.org/index.php?title=Clarifying_Direct_IO%27s_Semantics&diff=1757&oldid=prevTytso: Add "what do other operating systems do" section.2009-08-22T02:41:56Z<p>Add "what do other operating systems do" section.</p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 02:41, 22 August 2009</td>
</tr><tr><td colspan="2" class="diff-lineno">Line 32:</td>
<td colspan="2" class="diff-lineno">Line 32:</td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>The proposed solution is the same for allocating writes; that we document that O_DIRECT does not imply synchronous I/O of any file control data, and that it is unspecified whether data written into newly allocated blocks, or uninitialized regions of the file will survive a system crash until the data is explicitly flushed to disk via a system facility such as fsync(2).  For that reason, the only thing which the application can infer in the case of writes to preallocated (uninitialized) file regions or file regions which require block allocation is that when the write(2) system call, the data buffer passed to write(2) may be reused for other purposes.</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>The proposed solution is the same for allocating writes; that we document that O_DIRECT does not imply synchronous I/O of any file control data, and that it is unspecified whether data written into newly allocated blocks, or uninitialized regions of the file will survive a system crash until the data is explicitly flushed to disk via a system facility such as fsync(2).  For that reason, the only thing which the application can infer in the case of writes to preallocated (uninitialized) file regions or file regions which require block allocation is that when the write(2) system call, the data buffer passed to write(2) may be reused for other purposes.</div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">= What Do Other Operating Systems Do =</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">== AIX ==</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">From IBM's [http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm AIX 6.1 documentation] documentation:</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"><blockquote></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">Although direct I/O writes are done synchronously, they do not provide synchronized I/O data integrity completion, as defined by POSIX. Applications that need this feature should use O_DSYNC in addition to O_DIRECT. O_DSYNC guarantees that all of the data and enough of the metadata (for example, indirect blocks) have written to the stable store to be able to retrieve the data after a system crash. O_DIRECT only writes the data; it does not write the metadata.</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></blockquote></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">AIX does not seem to document any page cache coherency guarantees (if any).</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">== Irix ==</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">From Irix 6.5's [http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=0650&db=man&fname=/usr/share/catman/p_man/cat2/standard/open.z&srch=open open(2)] man page:</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"><blockquote></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">            If set, reads and writes on the resulting file descriptor will be</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">            performed directly to or from the user program buffer, provided</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">            appropriate size and alignment restrictions are met and pages are</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">            not locked in memory by any process.  Refer to the F_SETFL and</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">            F_DIOINFO commands in the fcntl(2) manual entry for information</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">            about how to determine the alignment constraints.  Refer to the</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">            mlock(3C), read(2) and write(2) manual entries for information on</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">            direct I/O involving pages that are locked in memory.  O_DIRECT is</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">            a Silicon Graphics extension and is only supported on local EFS</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">            and XFS file systems, and remote BDS file systems. In Irix 6.5.24</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">            and beyond, O_DIRECT is also supported on remote NFS Version 3</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">            file systems.</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></blockquote></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">From Irix 6.5's [http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=0650&db=man&fname=/usr/share/catman/p_man/cat2/standard/pwrite64.z&srch=write write(2)] man page:</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"><blockquote></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">    When attempting to write to a file with O_DIRECT or FDIRECT set, -1 will</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">    be returned and errno will be set to EINVAL if nbyte or the current file</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">    position is not a multiple of the underlying device's blocksize, nbyte is</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">    too big or buf isn't properly aligned.  See also F_DIOINFO in fcntl(2).</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></blockquote></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"><blockquote></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">    When attempting to write to a file with O_DIRECT or FDIRECT set, the</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">    portion being written can not be locked in memory by any process. In this</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">    case, -1 will be returned and errno will be set to EBUSY.</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></blockquote></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">From Irix 6.5's read(2) man page:</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"><blockquote></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">  When attempting to read from a file with O_DIRECT or FDIRECT set, -1 will</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">    be returned and errno will be set to EINVAL if nbyte or the current file</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">    position is not a multiple of the underlying device's blocksize, nbyte is</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">    too big or buf isn't properly aligned.  See also F_DIOINFO in the</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">    fcntl(2) manual entry.</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></blockquote></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"><blockquote></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">    When attempting to read from a file with O_DIRECT or FDIRECT set, the</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">    read data will by default come from an in-memory page if the data is</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">    locked in memory by a mlock(3C) command.  If you wish to fail the read</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">    instead of having the data not be read directly from disk, set the</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">    systune(1M) variable xfs_dio_retry to zero (0) and the read will return</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">    -1 and errno will be set to EBUSY.</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></blockquote></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins style="color: red; font-weight: bold; text-decoration: none;">Irix does not seem to document what happens with a write(2) system call needs to allocate blocks, and file system metadata changes are required.  Irix also does not document what kind of page cache coherency guarantees it provides (if any).  The write(2) and read(2) man pages seems to hint that Irix provides at least some page cache coherency, but the exact nature of the coherency guaranteed by Irix does not seem to be stated explicitly anywhere.</ins></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>= Conclusion =</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>= Conclusion =</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>Most users of direct I/O will hopefully not be affected by the clarifications in this document.  These users tend to not to use allocating writes with Direct I/O, or are already using an explicit fsync(2) after such allocating writes.  However, if there are applications that have been making assumptions about direct I/O implying O_SYNC semantics to meet (for example) database ACID requirements, changing their application to meet the semantics documented herein (which after all, is all applications have been getting anyway) should not be difficult.</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>Most users of direct I/O will hopefully not be affected by the clarifications in this document.  These users tend to not to use allocating writes with Direct I/O, or are already using an explicit fsync(2) after such allocating writes.  However, if there are applications that have been making assumptions about direct I/O implying O_SYNC semantics to meet (for example) database ACID requirements, changing their application to meet the semantics documented herein (which after all, is all applications have been getting anyway) should not be difficult.</div></td></tr>
</table>Tytsohttps://ext4.wiki.kernel.org/index.php?title=Clarifying_Direct_IO%27s_Semantics&diff=1756&oldid=prevTytso: Add links to kernel man pages for various system calls mentioned in the article.2009-08-22T02:26:20Z<p>Add links to kernel man pages for various system calls mentioned in the article.</p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 02:26, 22 August 2009</td>
</tr><tr><td colspan="2" class="diff-lineno">Line 4:</td>
<td colspan="2" class="diff-lineno">Line 4:</td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>The exact semantics of Direct I/O (O_DIRECT) are not well specified.  It is not a part of POSIX, or SUS, or any other formal standards specification.  The exact meaning of O_DIRECT has historically been negotiated in non-public discussions between powerful enterprise database companies and proprietary Unix systems, and its behaviour has generally been passed down as oral lore rather than as a formal set of requirements and specifications.</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>The exact semantics of Direct I/O (O_DIRECT) are not well specified.  It is not a part of POSIX, or SUS, or any other formal standards specification.  The exact meaning of O_DIRECT has historically been negotiated in non-public discussions between powerful enterprise database companies and proprietary Unix systems, and its behaviour has generally been passed down as oral lore rather than as a formal set of requirements and specifications.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div>The goal of this page is to summarize the current status, and to propose a more fully-fleshed out set of semantics for O_DIRECT which Linux file <del class="diffchange diffchange-inline">filesystem </del>developers can agree, and for which application programmers (especially open source database implementors who may not have had an opportunity to have the same set of discussions with OS implementors as the large enterprise database developers have had).  Once there is consensus, this wiki page should also be used as the basis for updating the [http://www.kernel.org/doc/man-pages/online/pages/man2/open.2.html Linux kernel man page for open(2)].</div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div>The goal of this page is to summarize the current status, and to propose a more fully-fleshed out set of semantics for O_DIRECT which Linux file <ins class="diffchange diffchange-inline">file system </ins>developers can agree, and for which application programmers (especially open source database implementors who may not have had an opportunity to have the same set of discussions with OS implementors as the large enterprise database developers have had).  Once there is consensus, this wiki page should also be used as the basis for updating the [http://www.kernel.org/doc/man-pages/online/pages/man2/open.2.html Linux kernel man page for open(2)].</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>= Ambiguities =</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>= Ambiguities =</div></td></tr>
<tr><td colspan="2" class="diff-lineno">Line 19:</td>
<td colspan="2" class="diff-lineno">Line 19:</td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>== Allocating writes ==</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>== Allocating writes ==</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div>Similarly unstated in the Linux man page --- or any specification I could find on the web --- is any mention about what happens if an O_DIRECT write needs to allocate blocks; for example, because the write is extending the size the file, or the write system call is writing into a sparse file's "hole" where a block had not been previously allocated.  Current Linux implementations falls back to buffered I/O, such that the data goes through the page cache.  The current implementation does wait until the I/O has been posted (although not necessarily with a barrier such that the data is guaranteed written to stable store by the storage device).  However, Linux does not wait until the metadata associated with the block allocation has been committed to the file system; hence, if the system crashes after an allocating write completes, there is no guarantee the data will be accessible to an application after the system reboots.  To provide this guarantee, the application must use fsync(2), or set the O_SYNC or O_DSYNC flag on the file descriptor via fcntl(2).</div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div>Similarly unstated in the Linux man page --- or any specification I could find on the web --- is any mention about what happens if an O_DIRECT write needs to allocate blocks; for example, because the write is extending the size the file, or the write system call is writing into a sparse file's "hole" where a block had not been previously allocated.  Current Linux implementations falls back to buffered I/O, such that the data goes through the page cache.  The current implementation does wait until the I/O has been posted (although not necessarily with a barrier such that the data is guaranteed written to stable store by the storage device).  However, Linux does not wait until the metadata associated with the block allocation has been committed to the file system; hence, if the system crashes after an allocating write completes, there is no guarantee the data will be accessible to an application after the system reboots.  To provide this guarantee, the application must use <ins class="diffchange diffchange-inline">[http://manpages.courier-mta.org/htmlman2/fsync.2.html </ins>fsync(2)<ins class="diffchange diffchange-inline">]</ins>, or set the O_SYNC or O_DSYNC flag on the file descriptor via <ins class="diffchange diffchange-inline">[http://manpages.courier-mta.org/htmlman2/fcntl.2.html </ins>fcntl(2)<ins class="diffchange diffchange-inline">]</ins>.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div>Given that with an allocating write, an explicit fsync(2) (or write with O_SYNC/O_DSYNC) is required, there doesn't seem to be much point in waiting until the data I/O is complete if the O_DIRECT write has fallen back to using buffered I/O --- after all, if the data has been copied into the page cache, the data buffered passed into the write(2) system call can be safely reused for other purposes, so it may be that the kernel should be allowed to return as soon as the data has been copied into the page cache.</div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div>Given that with an allocating write, an explicit fsync(2) (or write with O_SYNC/O_DSYNC) is required, there doesn't seem to be much point in waiting until the data I/O is complete if the O_DIRECT write has fallen back to using buffered I/O --- after all, if the data has been copied into the page cache, the data buffered passed into the <ins class="diffchange diffchange-inline">[http://manpages.courier-mta.org/htmlman2/write.2.html </ins>write(2)<ins class="diffchange diffchange-inline">] </ins>system call can be safely reused for other purposes, so it may be that the kernel should be allowed to return as soon as the data has been copied into the page cache.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>From a specification point of view, the fact that allocating writes can fall back to buffered I/O should be documented, and that any file system control data associated with the block I/O will not be synchronously committed unless the application explicitly requests this via fsync(2) or O_SYNC.  If there is agreement that based on this, the kernel should be allowed to return once the data buffer passed to write(2) can be reused the application, this should be explicitly documented in the open(2) man page as well.</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>From a specification point of view, the fact that allocating writes can fall back to buffered I/O should be documented, and that any file system control data associated with the block I/O will not be synchronously committed unless the application explicitly requests this via fsync(2) or O_SYNC.  If there is agreement that based on this, the kernel should be allowed to return once the data buffer passed to write(2) can be reused the application, this should be explicitly documented in the open(2) man page as well.</div></td></tr>
<tr><td colspan="2" class="diff-lineno">Line 27:</td>
<td colspan="2" class="diff-lineno">Line 27:</td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>== Writes into preallocated space ==</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>== Writes into preallocated space ==</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div>In recent Linux kernels, it is possible to request that the file system allocate blocks with out initializing the blocks first.  Since those blocks contain previously unused data blocks, those blocks or extents must be marked as uninitialized, so that reads of these uninitialized blocks will return a zero block instead of the previous contents of those blocks (which might cause a security exposure).  The first time an application writes into preallocated block, the file system must clear the uninitialized bit, so that a subsequent read of that data block will return the written data, instead of a zero block.</div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div>In recent Linux kernels, it is possible to request that the file system allocate blocks with out initializing the blocks first<ins class="diffchange diffchange-inline">, using the [http://manpages.courier-mta.org/htmlman2/fallocate.2.html fallocate(2)] system call</ins>.  Since those blocks contain previously unused data blocks, those blocks or extents must be marked as uninitialized, so that reads of these uninitialized blocks will return a zero block instead of the previous contents of those blocks (which might cause a security exposure).  The first time an application writes into preallocated block, the file system must clear the uninitialized bit, so that a subsequent read of that data block will return the written data, instead of a zero block.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>This requirement, when applied to a direct I/O write, has similar implications to the allocating write case, described above.  Although the space for the direct I/O has already been reserved, a change to the file system metadata is required to mark the just-written data block or extent as being initialized.  For file systems that use a journal to assure that the file system metadata is consistent, requiring direct I/O write to block until a file system commit is completed would be an unacceptable performance impact.  On the other hand, if the data is not guaranteed to be present after a system crash unless the application uses an explicit fsync(2) call, this could take some application programmers by surprise --- especially since testing that the application data can be recovered after crashes that take place immediately after an allocating write or a write into a preallocated block are cases that might not be well tested by all open source database.</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>This requirement, when applied to a direct I/O write, has similar implications to the allocating write case, described above.  Although the space for the direct I/O has already been reserved, a change to the file system metadata is required to mark the just-written data block or extent as being initialized.  For file systems that use a journal to assure that the file system metadata is consistent, requiring direct I/O write to block until a file system commit is completed would be an unacceptable performance impact.  On the other hand, if the data is not guaranteed to be present after a system crash unless the application uses an explicit fsync(2) call, this could take some application programmers by surprise --- especially since testing that the application data can be recovered after crashes that take place immediately after an allocating write or a write into a preallocated block are cases that might not be well tested by all open source database.</div></td></tr>
</table>Tytsohttps://ext4.wiki.kernel.org/index.php?title=Clarifying_Direct_IO%27s_Semantics&diff=1755&oldid=prevTytso: Removed fallback section, since it's not correct. Also rename extending writes to allocating writes.2009-08-22T00:21:04Z<p>Removed fallback section, since it's not correct. Also rename extending writes to allocating writes.</p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 00:21, 22 August 2009</td>
</tr><tr><td colspan="2" class="diff-lineno">Line 17:</td>
<td colspan="2" class="diff-lineno">Line 17:</td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div></blockquote></div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div></blockquote></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div>== <del class="diffchange diffchange-inline">Fallback behavior </del>==</div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div>== <ins class="diffchange diffchange-inline">Allocating writes </ins>==</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div><del class="diffchange diffchange-inline">The </del>Linux man page <del class="diffchange diffchange-inline">does not state </del>what happens if <del class="diffchange diffchange-inline">the alignment restrictions are not met</del>; <del class="diffchange diffchange-inline">does the kernel [http://lwn.net/Articles/342691/ start running rogue or nethack]; does it send a signal such as SIGSEGV or SIGABORT</del>, <del class="diffchange diffchange-inline">and kill </del>the <del class="diffchange diffchange-inline">running process; </del>or <del class="diffchange diffchange-inline">does it fall </del>back to buffered I/O<del class="diffchange diffchange-inline">?  Today</del>, the <del class="diffchange diffchange-inline">answer </del>is the <del class="diffchange diffchange-inline">latter; but it's </del>not <del class="diffchange diffchange-inline">specified anywhere</del>.</div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins class="diffchange diffchange-inline">Similarly unstated in the </ins>Linux man page <ins class="diffchange diffchange-inline">--- or any specification I could find on the web --- is any mention about </ins>what happens if <ins class="diffchange diffchange-inline">an O_DIRECT write needs to allocate blocks</ins>; <ins class="diffchange diffchange-inline">for example</ins>, <ins class="diffchange diffchange-inline">because </ins>the <ins class="diffchange diffchange-inline">write is extending the size the file, </ins>or <ins class="diffchange diffchange-inline">the write system call is writing into a sparse file's "hole" where a block had not been previously allocated.  Current Linux implementations falls </ins>back to buffered I/O, <ins class="diffchange diffchange-inline">such that the data goes through the page cache.  The current implementation does wait until the I/O has been posted (although not necessarily with a barrier such that </ins>the <ins class="diffchange diffchange-inline">data </ins>is <ins class="diffchange diffchange-inline">guaranteed written to stable store by </ins>the <ins class="diffchange diffchange-inline">storage device).  However, Linux does </ins>not <ins class="diffchange diffchange-inline">wait until the metadata associated with the block allocation has been committed to the file system; hence, if the system crashes after an allocating write completes, there is no guarantee the data will be accessible to an application after the system reboots.  To provide this guarantee, the application must use fsync(2), or set the O_SYNC or O_DSYNC flag on the file descriptor via fcntl(2)</ins>.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div><del class="diffchange diffchange-inline">This </del>is <del class="diffchange diffchange-inline">relatively well understood by most implementors and users of </del>O_DIRECT <del class="diffchange diffchange-inline">as part of </del>the <del class="diffchange diffchange-inline">"oral lore"</del>, so <del class="diffchange diffchange-inline">simply updating </del>the <del class="diffchange diffchange-inline">Linux man page </del>should <del class="diffchange diffchange-inline">not </del>be <del class="diffchange diffchange-inline">controversial</del>.</div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div><ins class="diffchange diffchange-inline">Given that with an allocating write, an explicit fsync(2) (or write with O_SYNC/O_DSYNC) </ins>is <ins class="diffchange diffchange-inline">required, there doesn't seem to be much point in waiting until the data I/O is complete if the </ins>O_DIRECT <ins class="diffchange diffchange-inline">write has fallen back to using buffered I/O --- after all, if </ins>the <ins class="diffchange diffchange-inline">data has been copied into the page cache, the data buffered passed into the write(2) system call can be safely reused for other purposes</ins>, so <ins class="diffchange diffchange-inline">it may be that </ins>the <ins class="diffchange diffchange-inline">kernel </ins>should be <ins class="diffchange diffchange-inline">allowed to return as soon as the data has been copied into the page cache</ins>.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div><del class="diffchange diffchange-inline">== Extending writes ==</del></div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div>From a specification point of view, the fact that <ins class="diffchange diffchange-inline">allocating </ins>writes can fall back to buffered I/O should be documented, and that any file system control data associated with the block I/O will not be synchronously committed unless the application explicitly requests this via fsync(2) or O_SYNC.  If there is agreement that based on this, the kernel should be allowed to return once the data buffer passed to write(2) can be reused the application, this should be explicitly documented in the open(2) man page as well.</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div> </div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div></div></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div><del class="diffchange diffchange-inline">Similarly unstated in the Linux man page --- or any specification I could find on the web --- is any mention about what happens if an O_DIRECT write needs to allocate blocks; for example, because the write is extending the size the file, or the write system call is writing into a sparse file's "hole" where a block had not been previously allocated.  Current Linux implementations falls back to buffered I/O, such that the data goes through the page cache.  The current implementation does wait until the I/O has been posted (although not necessarily with a barrier such that the data is guaranteed written to stable store by the storage device).  However, Linux does not wait until the metadata associated with the block allocation has been committed to the filesystem; hence, if the system crashes after an extending write completes, there is no guarantee the data will be accessible to an application after the system reboots.  To provide this guarantee, the application must use fsync(2), or set the O_SYNC or O_DSYNC flag on the file descriptor via fcntl(2).</del></div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div></div></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div> </div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div></div></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div><del class="diffchange diffchange-inline">Given that with an extending write, an explicit fsync(2) (or write with O_SYNC/O_DSYNC) is required, there doesn't seem to be much point in waiting until the data I/O is complete if the O_DIRECT write has fallen back to using buffered I/O --- after all, if the data has been copied into the page cache, the data buffered passed into the write(2) system call can be safely reused for other purposes, so it may be that the kernel should be allowed to return as soon as the data has been copied into the page cache.</del></div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div></div></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div> </div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div></div></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div>From a specification point of view, the fact that <del class="diffchange diffchange-inline">extending </del>writes can fall back to buffered I/O should be documented, and that any file system control data associated with the block I/O will not be synchronously committed unless the application explicitly requests this via fsync(2) or O_SYNC.  If there is agreement that based on this, the kernel should be allowed to return once the data buffer passed to write(2) can be reused the application, this should be explicitly documented in the open(2) man page as well.</div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>== Writes into preallocated space ==</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>== Writes into preallocated space ==</div></td></tr>
<tr><td colspan="2" class="diff-lineno">Line 35:</td>
<td colspan="2" class="diff-lineno">Line 29:</td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>In recent Linux kernels, it is possible to request that the file system allocate blocks with out initializing the blocks first.  Since those blocks contain previously unused data blocks, those blocks or extents must be marked as uninitialized, so that reads of these uninitialized blocks will return a zero block instead of the previous contents of those blocks (which might cause a security exposure).  The first time an application writes into preallocated block, the file system must clear the uninitialized bit, so that a subsequent read of that data block will return the written data, instead of a zero block.</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>In recent Linux kernels, it is possible to request that the file system allocate blocks with out initializing the blocks first.  Since those blocks contain previously unused data blocks, those blocks or extents must be marked as uninitialized, so that reads of these uninitialized blocks will return a zero block instead of the previous contents of those blocks (which might cause a security exposure).  The first time an application writes into preallocated block, the file system must clear the uninitialized bit, so that a subsequent read of that data block will return the written data, instead of a zero block.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div>This requirement, when applied to a direct I/O write, has similar implications to the <del class="diffchange diffchange-inline">extending </del>write case, described above.  Although the space for the direct I/O has already been reserved, a change to the file system metadata is required to mark the just-written data block or extent as being initialized.  For file systems that use a journal to assure that the file system metadata is consistent, requiring direct I/O write to block until a file system commit is completed would be an unacceptable performance impact.  On the other hand, if the data is not guaranteed to be present after a system crash unless the application uses an explicit fsync(2) call, this could take some application programmers by surprise --- especially since testing that the application data can be recovered after crashes that take place immediately after an <del class="diffchange diffchange-inline">extending </del>write or a write into a preallocated block are cases that might not be well tested by all open source database.</div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div>This requirement, when applied to a direct I/O write, has similar implications to the <ins class="diffchange diffchange-inline">allocating </ins>write case, described above.  Although the space for the direct I/O has already been reserved, a change to the file system metadata is required to mark the just-written data block or extent as being initialized.  For file systems that use a journal to assure that the file system metadata is consistent, requiring direct I/O write to block until a file system commit is completed would be an unacceptable performance impact.  On the other hand, if the data is not guaranteed to be present after a system crash unless the application uses an explicit fsync(2) call, this could take some application programmers by surprise --- especially since testing that the application data can be recovered after crashes that take place immediately after an <ins class="diffchange diffchange-inline">allocating </ins>write or a write into a preallocated block are cases that might not be well tested by all open source database.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div>The proposed solution is the same for <del class="diffchange diffchange-inline">the extending </del>writes; that we document that O_DIRECT does not imply synchronous I/O of any file control data, and that it is unspecified whether data written into newly allocated blocks, or uninitialized regions of the file will survive a system crash until the data is explicitly flushed to disk via a system facility such as fsync(2).  For that reason, the only thing which the application can infer in the case of writes to preallocated (uninitialized) file regions or file regions which require block allocation is that when the write(2) system call, the data buffer passed to write(2) may be reused for other purposes.</div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div>The proposed solution is the same for <ins class="diffchange diffchange-inline">allocating </ins>writes; that we document that O_DIRECT does not imply synchronous I/O of any file control data, and that it is unspecified whether data written into newly allocated blocks, or uninitialized regions of the file will survive a system crash until the data is explicitly flushed to disk via a system facility such as fsync(2).  For that reason, the only thing which the application can infer in the case of writes to preallocated (uninitialized) file regions or file regions which require block allocation is that when the write(2) system call, the data buffer passed to write(2) may be reused for other purposes.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>= Conclusion =</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>= Conclusion =</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div>Most users of direct I/O will hopefully not be affected by the clarifications in this document.  These users tend to not to use <del class="diffchange diffchange-inline">extending </del>writes with Direct I/O, or are already using an explicit fsync(2) after such <del class="diffchange diffchange-inline">extending </del>writes.  However, if there are applications that have been making assumptions about direct I/O implying O_SYNC semantics to meet (for example) database ACID requirements, changing their application to meet the semantics documented herein (which after all, is all applications have been getting anyway) should not be difficult.</div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div>Most users of direct I/O will hopefully not be affected by the clarifications in this document.  These users tend to not to use <ins class="diffchange diffchange-inline">allocating </ins>writes with Direct I/O, or are already using an explicit fsync(2) after such <ins class="diffchange diffchange-inline">allocating </ins>writes.  However, if there are applications that have been making assumptions about direct I/O implying O_SYNC semantics to meet (for example) database ACID requirements, changing their application to meet the semantics documented herein (which after all, is all applications have been getting anyway) should not be difficult.</div></td></tr>
</table>Tytsohttps://ext4.wiki.kernel.org/index.php?title=Clarifying_Direct_IO%27s_Semantics&diff=1754&oldid=prevTytso: "The semantics are _not_ well specified".2009-08-21T23:21:36Z<p>"The semantics are _not_ well specified".</p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 23:21, 21 August 2009</td>
</tr><tr><td colspan="2" class="diff-lineno">Line 2:</td>
<td colspan="2" class="diff-lineno">Line 2:</td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>= Introduction =</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>= Introduction =</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="background: #ffa; color:black; font-size: smaller;"><div>The exact semantics of Direct I/O (O_DIRECT) are well specified.  It is not a part of POSIX, or SUS, or any other formal standards specification.  The exact meaning of O_DIRECT has historically been negotiated in non-public discussions between powerful enterprise database companies and proprietary Unix systems, and its behaviour has generally been passed down as oral lore rather than as a formal set of requirements and specifications.</div></td><td class='diff-marker'>+</td><td style="background: #cfc; color:black; font-size: smaller;"><div>The exact semantics of Direct I/O (O_DIRECT) are <ins class="diffchange diffchange-inline">not </ins>well specified.  It is not a part of POSIX, or SUS, or any other formal standards specification.  The exact meaning of O_DIRECT has historically been negotiated in non-public discussions between powerful enterprise database companies and proprietary Unix systems, and its behaviour has generally been passed down as oral lore rather than as a formal set of requirements and specifications.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>The goal of this page is to summarize the current status, and to propose a more fully-fleshed out set of semantics for O_DIRECT which Linux file filesystem developers can agree, and for which application programmers (especially open source database implementors who may not have had an opportunity to have the same set of discussions with OS implementors as the large enterprise database developers have had).  Once there is consensus, this wiki page should also be used as the basis for updating the [http://www.kernel.org/doc/man-pages/online/pages/man2/open.2.html Linux kernel man page for open(2)].</div></td><td class='diff-marker'> </td><td style="background: #eee; color:black; font-size: smaller;"><div>The goal of this page is to summarize the current status, and to propose a more fully-fleshed out set of semantics for O_DIRECT which Linux file filesystem developers can agree, and for which application programmers (especially open source database implementors who may not have had an opportunity to have the same set of discussions with OS implementors as the large enterprise database developers have had).  Once there is consensus, this wiki page should also be used as the basis for updating the [http://www.kernel.org/doc/man-pages/online/pages/man2/open.2.html Linux kernel man page for open(2)].</div></td></tr>
</table>Tytso