|
- .. SPDX-License-Identifier: GPL-2.0
- .. _iomap_design:
- ..
- Dumb style notes to maintain the author's sanity:
- Please try to start sentences on separate lines so that
- sentence changes don't bleed colors in diff.
- Heading decorations are documented in sphinx.rst.
- ==============
- Library Design
- ==============
- .. contents:: Table of Contents
- :local:
- Introduction
- ============
- iomap is a filesystem library for handling common file operations.
- The library has two layers:
- 1. A lower layer that provides an iterator over ranges of file offsets.
- This layer tries to obtain mappings of each file range to storage
- from the filesystem, but the storage information is not necessarily
- required.
- 2. An upper layer that acts upon the space mappings provided by the
- lower layer iterator.
- The iteration can involve mappings of a file's logical offset ranges to
- physical extents, but the storage layer information is not necessarily
- required, e.g. for walking cached file information.
- The library exports various APIs for implementing file operations such
- as:
- * Pagecache reads and writes
- * Folio write faults to the pagecache
- * Writeback of dirty folios
- * Direct I/O reads and writes
- * fsdax I/O reads, writes, loads, and stores
- * FIEMAP
- * lseek ``SEEK_DATA`` and ``SEEK_HOLE``
- * swapfile activation
- The origin of this library is the file I/O path that XFS once used; it
- has now been extended to cover several other operations.
- Who Should Read This?
- =====================
- The target audience for this document is filesystem, storage, and
- pagecache programmers and code reviewers.
- If you are working on PCI, machine architectures, or device drivers, you
- are most likely in the wrong place.
- How Is This Better?
- ===================
- Unlike the classic Linux I/O model, which breaks file I/O into small
- units (generally memory pages or blocks) and looks up space mappings on
- the basis of that unit, the iomap model asks the filesystem for the
- largest space mappings that it can create for a given file operation and
- initiates operations on that basis.
- This strategy improves the filesystem's visibility into the size of the
- operation being performed, which enables it to combat fragmentation with
- larger space allocations when possible.
- Larger space mappings improve runtime performance by amortizing the cost
- of mapping function calls into the filesystem across a larger amount of
- data.
- At a high level, an iomap operation `looks like this
- <https://lore.kernel.org/all/ZGbVaewzcCysclPt@dread.disaster.area/>`_:
- 1. For each byte in the operation range...
-    1. Obtain a space mapping via ``->iomap_begin``
-    2. For each sub-unit of work...
-       1. Revalidate the mapping and go back to (1) above, if necessary.
-          So far only the pagecache operations need to do this.
-       2. Do the work
-    3. Increment operation cursor
-    4. Release the mapping via ``->iomap_end``, if necessary
- Each iomap operation will be covered in more detail below.
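The outer loop of the steps above can be sketched in userspace C.
This is only an illustration under stated assumptions: every ``toy_`` name is hypothetical, the "filesystem" pretends that mappings never cross a 4096-byte boundary, and the sub-unit loop with its revalidation step is left out.
It is not the kernel's iterator.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the parts of struct iomap used here. */
struct toy_mapping {
	int64_t  offset;	/* start of the mapped file range, in bytes */
	uint64_t length;	/* length of the mapped file range, in bytes */
};

/* Assumption: the toy filesystem never maps across this boundary. */
#define TOY_EXTENT 4096

static int toy_begin_calls;	/* counts calls, for illustration only */

/* Plays the role of ->iomap_begin: map at least the first byte. */
static int toy_iomap_begin(int64_t pos, uint64_t length,
			   struct toy_mapping *m)
{
	int64_t extent_end = (pos / TOY_EXTENT + 1) * TOY_EXTENT;
	uint64_t avail = (uint64_t)(extent_end - pos);

	m->offset = pos;
	m->length = length < avail ? length : avail;
	toy_begin_calls++;
	return 0;
}

/* Plays the role of ->iomap_end: this toy has nothing to release. */
static int toy_iomap_end(int64_t pos, uint64_t length, uint64_t done)
{
	(void)pos;
	(void)length;
	(void)done;
	return 0;
}

/* Walk [pos, pos + length); return bytes processed or a negative errno. */
static int64_t toy_iterate(int64_t pos, uint64_t length)
{
	int64_t start = pos;

	while (length > 0) {
		struct toy_mapping m;
		uint64_t done;
		int ret;

		ret = toy_iomap_begin(pos, length, &m);
		if (ret)
			return ret;
		/* The mapping may be shorter than requested, never empty. */
		assert(m.length > 0 && m.length <= length);

		/* "Do the work": a real operation would copy data or issue I/O. */
		done = m.length;

		ret = toy_iomap_end(pos, m.length, done);
		if (ret)
			return ret;

		/* Advance the operation cursor past what was processed. */
		pos += done;
		length -= done;
	}
	return pos - start;
}
```

With these assumptions, walking 10000 bytes starting at offset 1000 takes three mappings: one up to the first boundary, one full 4096-byte unit, and one for the tail.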
- This library was covered previously by an `LWN article
- <https://lwn.net/Articles/935934/>`_ and a `KernelNewbies page
- <https://kernelnewbies.org/KernelProjects/iomap>`_.
- The goal of this document is to provide a brief discussion of the
- design and capabilities of iomap, followed by a more detailed catalog
- of the interfaces presented by iomap.
- If you change iomap, please update this design document.
- File Range Iterator
- ===================
- Definitions
- -----------
- * **buffer head**: Shattered remnants of the old buffer cache.
- * ``fsblock``: The block size of a file, also known as ``i_blocksize``.
- * ``i_rwsem``: The VFS ``struct inode`` rwsemaphore.
- Processes hold this in shared mode to read file state and contents.
- Some filesystems may allow shared mode for writes.
- Processes often hold this in exclusive mode to change file state and
- contents.
- * ``invalidate_lock``: The pagecache ``struct address_space``
- rwsemaphore that protects against folio insertion and removal for
- filesystems that support punching out folios below EOF.
- Processes wishing to insert folios must hold this lock in shared
- mode to prevent removal, though concurrent insertion is allowed.
- Processes wishing to remove folios must hold this lock in exclusive
- mode to prevent insertions.
- Concurrent removals are not allowed.
- * ``dax_read_lock``: The RCU read lock that dax takes to prevent a
- device pre-shutdown hook from returning before other threads have
- released resources.
- * **filesystem mapping lock**: This synchronization primitive is
- internal to the filesystem and must protect the file mapping data
- from updates while a mapping is being sampled.
- The filesystem author must determine how this coordination should
- happen; it does not need to be an actual lock.
- * **iomap internal operation lock**: This is a general term for
- synchronization primitives that iomap functions take while holding a
- mapping.
- A specific example would be taking the folio lock while reading or
- writing the pagecache.
- * **pure overwrite**: A write operation that does not require any
- metadata or zeroing operations to perform during either submission
- or completion.
- This implies that the filesystem must have already allocated space
- on disk as ``IOMAP_MAPPED`` and the filesystem must not place any
- constraints on I/O alignment or size.
- The only constraints on I/O alignment are device level (minimum I/O
- size and alignment, typically sector size).
- ``struct iomap``
- ----------------
- The filesystem communicates to the iomap iterator the mapping of
- byte ranges of a file to byte ranges of a storage device with the
- structure below:
- .. code-block:: c
-    struct iomap {
-        u64 addr;
-        loff_t offset;
-        u64 length;
-        u16 type;
-        u16 flags;
-        struct block_device *bdev;
-        struct dax_device *dax_dev;
-        void *inline_data;
-        void *private;
-        const struct iomap_folio_ops *folio_ops;
-        u64 validity_cookie;
-    };
- The fields are as follows:
- * ``offset`` and ``length`` describe the range of file offsets, in
- bytes, covered by this mapping.
- These fields must always be set by the filesystem.
- * ``type`` describes the type of the space mapping:
- * **IOMAP_HOLE**: No storage has been allocated.
- This type must never be returned in response to an ``IOMAP_WRITE``
- operation because writes must allocate and map space, and return
- the mapping.
- The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
- iomap does not support writing (whether via pagecache or direct
- I/O) to a hole.
- * **IOMAP_DELALLOC**: A promise to allocate space at a later time
- ("delayed allocation").
- If the filesystem returns ``IOMAP_F_NEW`` here and the write fails, the
- ``->iomap_end`` function must delete the reservation.
- The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
- * **IOMAP_MAPPED**: The file range maps to specific space on the
- storage device.
- The device is returned in ``bdev`` or ``dax_dev``.
- The device address, in bytes, is returned via ``addr``.
- * **IOMAP_UNWRITTEN**: The file range maps to specific space on the
- storage device, but the space has not yet been initialized.
- The device is returned in ``bdev`` or ``dax_dev``.
- The device address, in bytes, is returned via ``addr``.
- Reads from this type of mapping will return zeroes to the caller.
- For a write or writeback operation, the ioend should update the
- mapping to ``IOMAP_MAPPED``.
- Refer to the sections about ioends for more details.
- * **IOMAP_INLINE**: The file range maps to the memory buffer
- specified by ``inline_data``.
- For write operations, the ``->iomap_end`` function presumably
- handles persisting the data.
- The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
- * ``flags`` describe the status of the space mapping.
- These flags should be set by the filesystem in ``->iomap_begin``:
- * **IOMAP_F_NEW**: The space under the mapping is newly allocated.
- Areas that will not be written to must be zeroed.
- If a write fails and the mapping is a space reservation, the
- reservation must be deleted.
- * **IOMAP_F_DIRTY**: The inode will have uncommitted metadata needed
- to access any data written.
- fdatasync is required to commit these changes to persistent
- storage.
- This needs to take into account metadata changes that *may* be made
- at I/O completion, such as file size updates from direct I/O.
- * **IOMAP_F_SHARED**: The space under the mapping is shared.
- Copy on write is necessary to avoid corrupting other file data.
- * **IOMAP_F_BUFFER_HEAD**: This mapping requires the use of buffer
- heads for pagecache operations.
- Do not add more uses of this.
- * **IOMAP_F_MERGED**: Multiple contiguous block mappings were
- coalesced into this single mapping.
- This is only useful for FIEMAP.
- * **IOMAP_F_XATTR**: The mapping is for extended attribute data, not
- regular file data.
- This is only useful for FIEMAP.
- * **IOMAP_F_PRIVATE**: Starting with this value, the upper bits can
- be set by the filesystem for its own purposes.
- These flags can be set by iomap itself during file operations.
- The filesystem should supply an ``->iomap_end`` function if it needs
- to observe these flags:
- * **IOMAP_F_SIZE_CHANGED**: The file size has changed as a result of
- using this mapping.
- * **IOMAP_F_STALE**: The mapping was found to be stale.
- iomap will call ``->iomap_end`` on this mapping and then
- ``->iomap_begin`` to obtain a new mapping.
- Currently, these flags are only set by pagecache operations.
- * ``addr`` describes the device address, in bytes.
- * ``bdev`` describes the block device for this mapping.
- This only needs to be set for mapped or unwritten operations.
- * ``dax_dev`` describes the DAX device for this mapping.
- This only needs to be set for mapped or unwritten operations, and
- only for a fsdax operation.
- * ``inline_data`` points to a memory buffer for I/O involving
- ``IOMAP_INLINE`` mappings.
- This value is ignored for all other mapping types.
- * ``private`` is a pointer to `filesystem-private information
- <https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_.
- This value will be passed unchanged to ``->iomap_end``.
- * ``folio_ops`` will be covered in the section on pagecache operations.
- * ``validity_cookie`` is a magic freshness value set by the filesystem
- that should be used to detect stale mappings.
- For pagecache operations this is critical for correct operation
- because page faults can occur, which implies that filesystem locks
- should not be held between ``->iomap_begin`` and ``->iomap_end``.
- Filesystems with completely static mappings need not set this value.
- Only pagecache operations revalidate mappings; see the section about
- ``iomap_valid`` for details.
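The per-type rules for ``addr``, the device fields, and ``inline_data`` can be restated as a small checker.
The sketch below is userspace C with hypothetical ``toy_`` names standing in for the kernel definitions.
Two of its checks are inferences rather than rules stated above, and are marked as such in the comments.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins; the real definitions are in include/linux/iomap.h. */
enum toy_type { TOY_HOLE, TOY_DELALLOC, TOY_MAPPED, TOY_UNWRITTEN, TOY_INLINE };
#define TOY_NULL_ADDR UINT64_MAX	/* stand-in for IOMAP_NULL_ADDR */

struct toy_iomap {
	uint64_t	addr;
	int64_t		offset;
	uint64_t	length;
	enum toy_type	type;
	const void	*dev;		/* stand-in for bdev or dax_dev */
	const void	*inline_data;
};

/* Return true if the mapping follows the per-type rules listed above. */
static bool toy_iomap_follows_rules(const struct toy_iomap *m)
{
	switch (m->type) {
	case TOY_HOLE:
	case TOY_DELALLOC:
		return m->addr == TOY_NULL_ADDR;
	case TOY_INLINE:
		/* Inference: an inline mapping needs a buffer to point at. */
		return m->addr == TOY_NULL_ADDR && m->inline_data != NULL;
	case TOY_MAPPED:
	case TOY_UNWRITTEN:
		/* Inference: a returned device address is a real address. */
		return m->dev != NULL && m->addr != TOY_NULL_ADDR;
	}
	return false;
}

/* Describe a range with no storage allocated. */
static void toy_fill_hole(struct toy_iomap *m, int64_t offset, uint64_t length)
{
	*m = (struct toy_iomap){
		.addr = TOY_NULL_ADDR,
		.offset = offset,
		.length = length,
		.type = TOY_HOLE,
	};
}

/* Describe a range backed by initialized space on a device. */
static void toy_fill_mapped(struct toy_iomap *m, int64_t offset,
			    uint64_t length, const void *dev, uint64_t addr)
{
	*m = (struct toy_iomap){
		.addr = addr,
		.offset = offset,
		.length = length,
		.type = TOY_MAPPED,
		.dev = dev,
	};
}
```

A filesystem author might run a check like this in debug builds at the end of ``->iomap_begin``; whether that is worthwhile depends on the filesystem.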
- ``struct iomap_ops``
- --------------------
- Every iomap function requires the filesystem to pass an operations
- structure to obtain a mapping and (optionally) to release the mapping:
- .. code-block:: c
-    struct iomap_ops {
-        int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
-                           unsigned flags, struct iomap *iomap,
-                           struct iomap *srcmap);
-        int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
-                         ssize_t written, unsigned flags,
-                         struct iomap *iomap);
-    };
- ``->iomap_begin``
- ~~~~~~~~~~~~~~~~~
- iomap operations call ``->iomap_begin`` to obtain one file mapping for
- the range of bytes specified by ``pos`` and ``length`` for the file
- ``inode``.
- This mapping should be returned through the ``iomap`` pointer.
- The mapping must cover at least the first byte of the supplied file
- range, but it does not need to cover the entire requested range.
- Each iomap operation describes the requested operation through the
- ``flags`` argument.
- The exact value of ``flags`` will be documented in the
- operation-specific sections below.
- These flags can, at least in principle, apply generally to iomap
- operations:
- * ``IOMAP_DIRECT`` is set when the caller wishes to issue file I/O to
- block storage.
- * ``IOMAP_DAX`` is set when the caller wishes to issue file I/O to
- memory-like storage.
- * ``IOMAP_NOWAIT`` is set when the caller wishes to perform a best
- effort attempt to avoid any operation that would result in blocking
- the submitting task.
- This is similar in intent to ``O_NONBLOCK`` for network APIs - it is
- intended for asynchronous applications to keep doing other work
- instead of waiting for the specific unavailable filesystem resource
- to become available.
- Filesystems implementing ``IOMAP_NOWAIT`` semantics need to use
- trylock algorithms.
- They need to be able to satisfy the entire I/O request range with a
- single iomap mapping.
- They need to avoid reading or writing metadata synchronously.
- They need to avoid blocking memory allocations.
- They need to avoid waiting on transaction reservations to allow
- modifications to take place.
- They probably should not be allocating new space.
- And so on.
- If there is any doubt in the filesystem developer's mind as to
- whether any specific ``IOMAP_NOWAIT`` operation may end up blocking,
- then they should return ``-EAGAIN`` as early as possible rather than
- start the operation and force the submitting task to block.
- ``IOMAP_NOWAIT`` is often set on behalf of ``IOCB_NOWAIT`` or
- ``RWF_NOWAIT``.
- If it is necessary to read existing file contents from a `different
- <https://lore.kernel.org/all/20191008071527.29304-9-hch@lst.de/>`_
- device or address range on a device, the filesystem should return that
- information via ``srcmap``.
- Only pagecache and fsdax operations support reading from one mapping and
- writing to another.
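The ``IOMAP_NOWAIT`` guidance above (use trylock, cover the whole range with one mapping, avoid allocating, bail out early with ``-EAGAIN``) might look roughly like the following in a mapping function.
This is a userspace sketch, not kernel code: the ``toy_`` names, the single-extent file model, and the use of a C11 ``atomic_flag`` as the mapping lock are all assumptions made for illustration.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_NOWAIT 0x1u		/* stand-in for IOMAP_NOWAIT */

/* Hypothetical inode: one extent covering [0, extent_len). */
struct toy_inode {
	atomic_flag	map_lock;	/* stand-in for the fs mapping lock */
	uint64_t	extent_len;
};

static void toy_inode_init(struct toy_inode *ip, uint64_t extent_len)
{
	atomic_flag_clear(&ip->map_lock);
	ip->extent_len = extent_len;
}

/* Plays the role of ->iomap_begin for a write to [pos, pos + length). */
static int toy_iomap_begin(struct toy_inode *ip, uint64_t pos,
			   uint64_t length, unsigned flags,
			   uint64_t *mapped_len)
{
	int ret = 0;

	if (flags & TOY_NOWAIT) {
		/* Trylock: give up at once instead of waiting. */
		if (atomic_flag_test_and_set(&ip->map_lock))
			return -EAGAIN;
	} else {
		/* A real filesystem would sleep here rather than spin. */
		while (atomic_flag_test_and_set(&ip->map_lock))
			;
	}

	if (pos + length > ip->extent_len) {
		/* Covering the whole range would need new space. */
		if (flags & TOY_NOWAIT) {
			ret = -EAGAIN;
			goto out_unlock;
		}
		ip->extent_len = pos + length;	/* pretend to allocate */
	}
	*mapped_len = length;

out_unlock:
	atomic_flag_clear(&ip->map_lock);
	return ret;
}
```

The sketch returns ``-EAGAIN`` before touching any state, so a caller that falls back to a blocking retry starts from a clean slate.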
- ``->iomap_end``
- ~~~~~~~~~~~~~~~
- After the operation completes, the ``->iomap_end`` function, if present,
- is called to signal that iomap is finished with a mapping.
- Typically, implementations will use this function to tear down any
- context that was set up in ``->iomap_begin``.
- For example, a write might wish to commit the reservations for the bytes
- that were operated upon and unreserve any space that was not operated
- upon.
- ``written`` might be zero if no bytes were touched.
- ``flags`` will contain the same value passed to ``->iomap_begin``.
- iomap ops for reads are not likely to need to supply this function.
- Both functions should return a negative errno code on error, or zero on
- success.
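As an illustration of the write example above, the following userspace sketch commits the part of a reservation that was written and releases the rest.
The accounting structure and every ``toy_`` name are hypothetical; real filesystems track reservations in their own ways.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>		/* ssize_t */

#define TOY_F_NEW 0x1u		/* stand-in for IOMAP_F_NEW */

/* Hypothetical per-filesystem space accounting. */
struct toy_fs {
	uint64_t reserved_bytes;	/* promised but not yet written */
	uint64_t committed_bytes;	/* written and accounted for */
};

/*
 * Plays the role of ->iomap_end for a write.
 * length is the size of the mapping handed out by ->iomap_begin and
 * written is the number of bytes the operation actually touched.
 */
static int toy_iomap_end(struct toy_fs *fs, uint64_t length,
			 ssize_t written, unsigned map_flags)
{
	uint64_t used = written > 0 ? (uint64_t)written : 0;

	/* Nothing was reserved for this mapping, so nothing to undo. */
	if (!(map_flags & TOY_F_NEW))
		return 0;
	if (used > length || length > fs->reserved_bytes)
		return -EINVAL;

	/* Commit what was written and give back the remainder. */
	fs->committed_bytes += used;
	fs->reserved_bytes -= length;
	return 0;
}
```

When ``written`` is zero the whole reservation is released, which matches the rule that a failed write into a new reservation must delete it.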
- Preparing for File Operations
- =============================
- iomap only handles mapping and I/O.
- Filesystems must still call out to the VFS to check input parameters
- and file state before initiating an I/O operation.
- It does not handle obtaining filesystem freeze protection, updating of
- timestamps, stripping privileges, or access control.
- Locking Hierarchy
- =================
- iomap requires that filesystems supply their own locking model.
- There are three categories of synchronization primitives, as far as
- iomap is concerned:
- * The **upper** level primitive is provided by the filesystem to
- coordinate access to different iomap operations.
- The exact primitive is specific to the filesystem and operation,
- but is often a VFS inode, pagecache invalidation, or folio lock.
- For example, a filesystem might take ``i_rwsem`` before calling
- ``iomap_file_buffered_write`` and ``iomap_file_unshare`` to prevent
- these two file operations from clobbering each other.
- Pagecache writeback may lock a folio to prevent other threads from
- accessing the folio until writeback is underway.
- * The **lower** level primitive is taken by the filesystem in the
- ``->iomap_begin`` and ``->iomap_end`` functions to coordinate
- access to the file space mapping information.
- The fields of the iomap object should be filled out while holding
- this primitive.
- The upper level synchronization primitive, if any, remains held
- while acquiring the lower level synchronization primitive.
- For example, XFS takes ``ILOCK_EXCL`` and ext4 takes ``i_data_sem``
- while sampling mappings.
- Filesystems with immutable mapping information may not require
- synchronization here.
- * The **operation** primitive is taken by an iomap operation to
- coordinate access to its own internal data structures.
- The upper level synchronization primitive, if any, remains held
- while acquiring this primitive.
- The lower level primitive is not held while acquiring this
- primitive.
- For example, pagecache write operations will obtain a file mapping,
- then grab and lock a folio to copy new contents.
- It may also lock an internal folio state object to update metadata.
- The exact locking requirements are specific to the filesystem; for
- certain operations, some of these locks can be elided.
- All further mentions of locking are *recommendations*, not mandates.
- Each filesystem author must figure out the locking for themselves.
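One possible ordering of the three categories for a buffered write can be traced with a toy model.
The sketch below does no real locking; it only records the order in which hypothetical primitives would be taken and dropped, using ``U``/``u`` for the upper level, ``L``/``l`` for the lower level, and ``O``/``o`` for the operation level.
All names are illustrative assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Records the order in which the toy primitives are taken and dropped. */
struct toy_trace {
	char	events[16];
	size_t	n;
};

static void toy_trace_init(struct toy_trace *t)
{
	memset(t, 0, sizeof(*t));
}

static void toy_note(struct toy_trace *t, char event)
{
	/* Keep the buffer NUL-terminated so it can be read as a string. */
	if (t->n + 1 < sizeof(t->events)) {
		t->events[t->n++] = event;
		t->events[t->n] = '\0';
	}
}

/* Lower level: held only while the mapping is sampled. */
static void toy_iomap_begin(struct toy_trace *t)
{
	toy_note(t, 'L');	/* take the filesystem mapping lock */
	/* ... fill out the mapping here ... */
	toy_note(t, 'l');	/* drop it before returning to iomap */
}

/* Operation level: for example, the folio lock of a pagecache write. */
static void toy_copy_into_folio(struct toy_trace *t)
{
	toy_note(t, 'O');
	/* ... copy new contents into the folio ... */
	toy_note(t, 'o');
}

/* Upper level: for example, i_rwsem held around the whole write. */
static void toy_buffered_write(struct toy_trace *t)
{
	toy_note(t, 'U');
	toy_iomap_begin(t);
	toy_copy_into_folio(t);
	toy_note(t, 'u');
}
```

Running ``toy_buffered_write`` on a freshly initialized trace records ``ULlOou``: the upper primitive brackets everything, and the lower primitive is released before the operation primitive is taken.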
- Bugs and Limitations
- ====================
- * No support for fscrypt.
- * No support for compression.
- * No support for fsverity yet.
- * Strong assumptions that I/O should work the way it does on XFS.
- * Does iomap *actually* work for non-regular file data?
- Patches welcome!
|