Clarifying Direct I/O's Semantics
The exact semantics of Direct I/O (O_DIRECT) are not well specified. It is not a part of POSIX, or SUS, or any other formal standards specification. The exact meaning of O_DIRECT has historically been negotiated in non-public discussions between powerful enterprise database companies and proprietary Unix systems, and its behaviour has generally been passed down as oral lore rather than as a formal set of requirements and specifications.
The goal of this page is to summarize the current status, and to propose a more fully fleshed-out set of semantics for O_DIRECT on which Linux file system developers can agree, and on which application programmers can rely (especially open source database implementors, who may not have had the opportunity to hold the same discussions with OS implementors that the large enterprise database developers have had). Once there is consensus, this wiki page should also be used as the basis for updating the Linux kernel man page for open(2).
The Linux kernel man page for open(2) states:
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user space buffers. The I/O is synchronous, that is, at the completion of a read(2) or write(2), data is guaranteed to have been transferred. See NOTES below for further discussion....
The O_DIRECT flag may impose alignment restrictions on the length and address of userspace buffers and the file offset of I/Os. In Linux alignment restrictions vary by file system and kernel version and might be absent entirely. However there is currently no file system-independent interface for an application to discover these restrictions for a given file or file system. Some file systems provide their own interfaces for doing so, for example the XFS_IOC_DIOINFO operation in xfsctl(3).
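As a rough illustration of these restrictions, the sketch below checks that a request's buffer address, length, and file offset are all multiples of one alignment value before the request is issued, and allocates a suitable buffer with posix_memalign(3). The 4096-byte value and the names dio_request_is_aligned and dio_alloc are assumptions made for this example; the real requirement depends on the file system and kernel version, as the man page says.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* ASSUMPTION: 4096 is only a conservative guess, not a value Linux guarantees. */
#define DIO_ALIGN_GUESS 4096

/*
 * Hypothetical helper: returns 1 if the buffer address, the length, and the
 * file offset are all multiples of 'align', and 0 otherwise.
 */
int dio_request_is_aligned(const void *buf, size_t len, off_t offset, size_t align)
{
    if (align == 0 || offset < 0)
        return 0;
    return (uintptr_t)buf % align == 0 &&
           len % align == 0 &&
           (uint64_t)offset % align == 0;
}

/*
 * Hypothetical helper: allocates a zero-filled buffer whose address is a
 * multiple of 'align' (which must be a power of two). Release it with free(3).
 */
void *dio_alloc(size_t len, size_t align)
{
    void *buf = NULL;

    if (posix_memalign(&buf, align, len) != 0)
        return NULL;
    memset(buf, 0, len);
    return buf;
}
```

A caller would allocate its I/O buffers with dio_alloc() and reject or pad any request for which dio_request_is_aligned() returns 0, rather than relying on the kernel to return EINVAL.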
Also unstated in the Linux man page, or in any specification I could find on the web, is what happens if an O_DIRECT write needs to allocate blocks; for example, because the write is extending the size of the file, or because the write system call is writing into a sparse file's "hole" where a block had not previously been allocated. Current Linux implementations fall back to buffered I/O, such that the data goes through the page cache. The current implementation does wait until the I/O has been posted (although not necessarily with a barrier such that the data is guaranteed to be written to stable store by the storage device). However, Linux does not wait until the metadata associated with the block allocation has been committed to the file system; hence, if the system crashes after an allocating write completes, there is no guarantee the data will be accessible to an application after the system reboots. To provide this guarantee, the application must use fsync(2), or open the file with the O_SYNC or O_DSYNC flag. (On Linux these flags cannot be turned on afterwards with fcntl(2): F_SETFL silently ignores attempts to change them.)
Given that an allocating write requires an explicit fsync(2) (or a write with O_SYNC/O_DSYNC), there doesn't seem to be much point in waiting until the data I/O is complete if the O_DIRECT write has fallen back to buffered I/O. After all, once the data has been copied into the page cache, the data buffer passed into the write(2) system call can safely be reused for other purposes, so it may be that the kernel should be allowed to return as soon as the data has been copied into the page cache.
From a specification point of view, the fact that allocating writes can fall back to buffered I/O should be documented, as should the fact that any file system control data associated with the block I/O will not be synchronously committed unless the application explicitly requests this via fsync(2) or O_SYNC. If there is agreement that, based on this, the kernel should be allowed to return once the data buffer passed to write(2) can be reused by the application, this should be explicitly documented in the open(2) man page as well.
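The pattern an application would follow under these semantics can be sketched as below: issue the O_DIRECT write, then call fsync(2) before treating the data as durable. This is a minimal sketch, not a reference implementation; the function name direct_write_durable and the 4096-byte block size are assumptions made for the example, and the write is padded with zeros to a whole block to satisfy the assumed alignment.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* exposes O_DIRECT in <fcntl.h> on Linux */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* ASSUMPTION: 4096 satisfies the file system's O_DIRECT alignment rules. */
#define DIO_BLOCK 4096

/*
 * Hypothetical helper: writes up to one block of 'data' (zero-padded to a
 * full block) at a block-aligned 'offset' using O_DIRECT, then calls fsync(2).
 * If 'offset' lies at or beyond end of file, this is an allocating write.
 * Returns 0 only after fsync(2) has succeeded, or -1 with errno set.
 */
int direct_write_durable(const char *path, off_t offset, const void *data, size_t len)
{
    void *buf = NULL;
    ssize_t n;
    int fd, err, saved;

    if (offset < 0 || offset % DIO_BLOCK != 0 || len > DIO_BLOCK) {
        errno = EINVAL;
        return -1;
    }
    err = posix_memalign(&buf, DIO_BLOCK, DIO_BLOCK);
    if (err != 0) {
        errno = err;
        return -1;
    }
    memset(buf, 0, DIO_BLOCK);
    memcpy(buf, data, len);

    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        goto fail_free;

    n = pwrite(fd, buf, DIO_BLOCK, offset);
    if (n != DIO_BLOCK) {
        if (n >= 0)
            errno = EIO;         /* short write: report it as an I/O error */
        goto fail_close;
    }
    /* When pwrite() returns, only reuse of 'buf' is safe; durability needs fsync(). */
    if (fsync(fd) != 0)
        goto fail_close;

    free(buf);
    return close(fd);

fail_close:
    saved = errno;
    close(fd);
    errno = saved;
fail_free:
    saved = errno;
    free(buf);
    errno = saved;
    return -1;
}
```

A call such as direct_write_durable("data.db", 0, record, record_len) (with a hypothetical file name) returns 0 only once fsync(2) has succeeded. Opening the file with O_DSYNC or O_SYNC in addition to O_DIRECT is the alternative the text mentions; it trades the explicit fsync(2) call for a synchronous cost on every write.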
Writes into preallocated space
In recent Linux kernels, it is possible to request that the file system allocate blocks without initializing the blocks first, using the fallocate(2) system call. Since those blocks may still hold their previous contents, the blocks or extents must be marked as uninitialized, so that reads of these uninitialized blocks will return a zero block instead of the previous contents of those blocks (which might cause a security exposure). The first time an application writes into a preallocated block, the file system must clear the uninitialized bit, so that a subsequent read of that data block will return the written data, instead of a zero block.
This requirement, when applied to a direct I/O write, has similar implications to the allocating write case described above. Although the space for the direct I/O has already been reserved, a change to the file system metadata is required to mark the just-written data block or extent as initialized. For file systems that use a journal to ensure that the file system metadata is consistent, requiring a direct I/O write to block until a file system commit has completed would be an unacceptable performance impact. On the other hand, if the data is not guaranteed to be present after a system crash unless the application uses an explicit fsync(2) call, this could take some application programmers by surprise, especially since recovery of application data after a crash that takes place immediately after an allocating write or a write into a preallocated block is a case that might not be well tested by all open source databases.
The proposed solution is the same as for allocating writes: we document that O_DIRECT does not imply synchronous I/O of any file control data, and that it is unspecified whether data written into newly allocated blocks or uninitialized regions of the file will survive a system crash until the data is explicitly flushed to disk via a system facility such as fsync(2). For that reason, the only thing which the application can infer in the case of writes to preallocated (uninitialized) file regions or file regions which require block allocation is that when the write(2) system call returns, the data buffer passed to write(2) may be reused for other purposes.
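The preallocation case can be sketched in the same way: reserve space with fallocate(2), write into it with O_DIRECT, and call fsync(2) before relying on the data surviving a crash. The function name prealloc_direct_write and the 4096-byte block size are assumptions made for this example, and whether the file system actually tracks the reserved blocks as uninitialized extents is file system dependent.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* exposes O_DIRECT and fallocate(2) on Linux */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* ASSUMPTION: 4096 satisfies the file system's O_DIRECT alignment rules. */
#define DIO_BLOCK 4096

/*
 * Hypothetical helper: preallocates 'nblocks' blocks at the start of 'path',
 * writes up to one block of 'data' (zero-padded) into the first of them with
 * O_DIRECT, then calls fsync(2). Returns 0 on success, -1 with errno set.
 */
int prealloc_direct_write(const char *path, int nblocks, const void *data, size_t len)
{
    void *buf = NULL;
    ssize_t n;
    int fd, err, saved;

    if (nblocks < 1 || len > DIO_BLOCK) {
        errno = EINVAL;
        return -1;
    }
    err = posix_memalign(&buf, DIO_BLOCK, DIO_BLOCK);
    if (err != 0) {
        errno = err;
        return -1;
    }
    memset(buf, 0, DIO_BLOCK);
    memcpy(buf, data, len);

    fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        goto fail_free;

    /* Mode 0 reserves the blocks and extends the file size; the reserved
     * range reads back as zeros until it is written. */
    if (fallocate(fd, 0, 0, (off_t)nblocks * DIO_BLOCK) != 0)
        goto fail_close;

    n = pwrite(fd, buf, DIO_BLOCK, 0);
    if (n != DIO_BLOCK) {
        if (n >= 0)
            errno = EIO;         /* short write: report it as an I/O error */
        goto fail_close;
    }
    /* The metadata change that marks the block as initialized is only
     * known to be committed once fsync() has returned successfully. */
    if (fsync(fd) != 0)
        goto fail_close;

    free(buf);
    return close(fd);

fail_close:
    saved = errno;
    close(fd);
    errno = saved;
fail_free:
    saved = errno;
    free(buf);
    errno = saved;
    return -1;
}
```

Between the return of pwrite(2) and the return of fsync(2), the only thing this sketch assumes is that the buffer could be reused; it does not assume the written block would be readable after a crash.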
What Do Other Operating Systems Do
From IBM's AIX 6.1 documentation:
Although direct I/O writes are done synchronously, they do not provide synchronized I/O data integrity completion, as defined by POSIX. Applications that need this feature should use O_DSYNC in addition to O_DIRECT. O_DSYNC guarantees that all of the data and enough of the metadata (for example, indirect blocks) have written to the stable store to be able to retrieve the data after a system crash. O_DIRECT only writes the data; it does not write the metadata.
AIX does not seem to document what page cache coherency guarantees (if any) it provides.
From Irix 6.5's open(2) man page:
If set, reads and writes on the resulting file descriptor will be performed directly to or from the user program buffer, provided appropriate size and alignment restrictions are met and pages are not locked in memory by any process. Refer to the F_SETFL and F_DIOINFO commands in the fcntl(2) manual entry for information about how to determine the alignment constraints. Refer to the mlock(3C), read(2) and write(2) manual entries for information on direct I/O involving pages that are locked in memory. O_DIRECT is a Silicon Graphics extension and is only supported on local EFS and XFS file systems, and remote BDS file systems. In Irix 6.5.24 and beyond, O_DIRECT is also supported on remote NFS Version 3 file systems.
From Irix 6.5's write(2) man page:
When attempting to write to a file with O_DIRECT or FDIRECT set, -1 will be returned and errno will be set to EINVAL if nbyte or the current file position is not a multiple of the underlying device's blocksize, nbyte is too big or buf isn't properly aligned. See also F_DIOINFO in fcntl(2).
When attempting to write to a file with O_DIRECT or FDIRECT set, the portion being written can not be locked in memory by any process. In this case, -1 will be returned and errno will be set to EBUSY.
From Irix 6.5's read(2) man page:
When attempting to read from a file with O_DIRECT or FDIRECT set, -1 will be returned and errno will be set to EINVAL if nbyte or the current file position is not a multiple of the underlying device's blocksize, nbyte is too big or buf isn't properly aligned. See also F_DIOINFO in the fcntl(2) manual entry.
When attempting to read from a file with O_DIRECT or FDIRECT set, the read data will by default come from an in-memory page if the data is locked in memory by a mlock(3C) command. If you wish to fail the read instead of having the data not be read directly from disk, set the systune(1M) variable xfs_dio_retry to zero (0) and the read will return -1 and errno will be set to EBUSY.
Irix does not seem to document what happens when a write(2) system call needs to allocate blocks and file system metadata changes are required. Irix also does not document what kind of page cache coherency guarantees it provides (if any). The write(2) and read(2) man pages seem to hint that Irix provides at least some page cache coherency, but the exact nature of the coherency guaranteed by Irix does not seem to be stated explicitly anywhere.
Most users of direct I/O will hopefully not be affected by the clarifications in this document. These users tend not to use allocating writes with direct I/O, or are already using an explicit fsync(2) after such allocating writes. However, if there are applications that have been assuming that direct I/O implies O_SYNC semantics in order to meet (for example) database ACID requirements, changing those applications to meet the semantics documented herein (which, after all, are all that applications have been getting anyway) should not be difficult.