.. SPDX-License-Identifier: GPL-2.0
.. _iomap_operations:

..
        Dumb style notes to maintain the author's sanity:
        Please try to start sentences on separate lines so that
        sentence changes don't bleed colors in diff.
        Heading decorations are documented in sphinx.rst.

=========================
Supported File Operations
=========================

.. contents:: Table of Contents
   :local:

Below is a discussion of the high level file operations that iomap
implements.

Buffered I/O
============

Buffered I/O is the default file I/O path in Linux.
File contents are cached in memory ("pagecache") to satisfy reads and
writes.
Dirty cache will be written back to disk at some point that can be
forced via ``fsync`` and variants.

iomap implements nearly all the folio and pagecache management that
filesystems have to implement themselves under the legacy I/O model.
This means that the filesystem need not know the details of allocating,
mapping, managing uptodate and dirty state, or writeback of pagecache
folios.
Under the legacy I/O model, this was managed very inefficiently with
linked lists of buffer heads instead of the per-folio bitmaps that iomap
uses.
Unless the filesystem explicitly opts in to buffer heads, they will not
be used, which makes buffered I/O much more efficient, and the pagecache
maintainer much happier.

``struct address_space_operations``
-----------------------------------

The following iomap functions can be referenced directly from the
address space operations structure:

 * ``iomap_dirty_folio``
 * ``iomap_release_folio``
 * ``iomap_invalidate_folio``
 * ``iomap_is_partially_uptodate``

The following address space operations can be wrapped easily:

 * ``read_folio``
 * ``readahead``
 * ``writepages``
 * ``bmap``
 * ``swap_activate``
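
Wiring these up might look like the following sketch; the ``myfs_*``
names and the ``myfs_iomap_read_ops`` structure are hypothetical, and
exact callback signatures can vary between kernel versions:

.. code-block:: c

 /* Hypothetical glue for a filesystem "myfs"; not a complete driver. */
 static int myfs_read_folio(struct file *file, struct folio *folio)
 {
     return iomap_read_folio(folio, &myfs_iomap_read_ops);
 }

 static void myfs_readahead(struct readahead_control *rac)
 {
     iomap_readahead(rac, &myfs_iomap_read_ops);
 }

 const struct address_space_operations myfs_aops = {
     .read_folio             = myfs_read_folio,
     .readahead              = myfs_readahead,
     .dirty_folio            = iomap_dirty_folio,
     .release_folio          = iomap_release_folio,
     .invalidate_folio       = iomap_invalidate_folio,
     .is_partially_uptodate  = iomap_is_partially_uptodate,
 };

The first four functions in the list above can be assigned directly;
``read_folio`` and ``readahead`` need thin wrappers so that the
filesystem can supply its ``struct iomap_ops``.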

``struct iomap_folio_ops``
--------------------------

The ``->iomap_begin`` function for pagecache operations may set the
``struct iomap::folio_ops`` field to an ops structure to override
default behaviors of iomap:

.. code-block:: c

 struct iomap_folio_ops {
     struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
                                unsigned len);
     void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
                       struct folio *folio);
     bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
 };

iomap calls these functions:

 - ``get_folio``: Called to allocate and return an active reference to
   a locked folio prior to starting a write.
   If this function is not provided, iomap will call
   ``iomap_get_folio``.
   This could be used to `set up per-folio filesystem state
   <https://lore.kernel.org/all/20190429220934.10415-5-agruenba@redhat.com/>`_
   for a write.

 - ``put_folio``: Called to unlock and put a folio after a pagecache
   operation completes.
   If this function is not provided, iomap will ``folio_unlock`` and
   ``folio_put`` on its own.
   This could be used to `commit per-folio filesystem state
   <https://lore.kernel.org/all/20180619164137.13720-6-hch@lst.de/>`_
   that was set up by ``->get_folio``.

 - ``iomap_valid``: The filesystem may not hold locks between
   ``->iomap_begin`` and ``->iomap_end`` because pagecache operations
   can take folio locks, fault on userspace pages, initiate writeback
   for memory reclamation, or engage in other time-consuming actions.
   If a file's space mapping data are mutable, it is possible that the
   mapping for a particular pagecache folio can `change in the time it
   takes
   <https://lore.kernel.org/all/20221123055812.747923-8-david@fromorbit.com/>`_
   to allocate, install, and lock that folio.

   For the pagecache, races can happen if writeback doesn't take
   ``i_rwsem`` or ``invalidate_lock`` and updates mapping information.
   Races can also happen if the filesystem allows concurrent writes.
   For such files, the mapping *must* be revalidated after the folio
   lock has been taken so that iomap can manage the folio correctly.

   fsdax does not need this revalidation because there's no writeback
   and no support for unwritten extents.

   Filesystems subject to this kind of race must provide a
   ``->iomap_valid`` function to decide if the mapping is still valid.
   If the mapping is not valid, the mapping will be sampled again.

   To support making the validity decision, the filesystem's
   ``->iomap_begin`` function may set ``struct iomap::validity_cookie``
   at the same time that it populates the other iomap fields.

   A simple validation cookie implementation is a sequence counter.
   If the filesystem bumps the sequence counter every time it modifies
   the inode's extent map, it can be placed in the ``struct
   iomap::validity_cookie`` during ``->iomap_begin``.
   If the value in the cookie is found to be different from the value
   the filesystem holds when the mapping is passed back to
   ``->iomap_valid``, then the iomap should be considered stale and the
   validation failed.
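
The sequence counter scheme described above might be sketched as
follows; the ``myfs`` inode structure, its ``extent_seq`` field, and
the lookup helper are all hypothetical:

.. code-block:: c

 /* Hypothetical revalidation using a per-inode extent map generation. */
 static bool myfs_iomap_valid(struct inode *inode, const struct iomap *iomap)
 {
     /* Stale if the extent map changed since ->iomap_begin sampled it. */
     return iomap->validity_cookie == READ_ONCE(MYFS_I(inode)->extent_seq);
 }

 static const struct iomap_folio_ops myfs_iomap_folio_ops = {
     .iomap_valid    = myfs_iomap_valid,
 };

 static int myfs_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
         unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
     /* ... look up the mapping for pos/length and fill out *iomap ... */

     /* Sample the extent map generation while it cannot change. */
     iomap->validity_cookie = READ_ONCE(MYFS_I(inode)->extent_seq);
     iomap->folio_ops = &myfs_iomap_folio_ops;
     return 0;
 }

Every path in the filesystem that modifies the inode's extent map would
then increment ``extent_seq``, causing in-flight mappings to fail
revalidation and be sampled again.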

These ``struct kiocb`` flags are significant for buffered I/O with iomap:

 * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.

Internal per-Folio State
------------------------

If the fsblock size matches the size of a pagecache folio, it is assumed
that all disk I/O operations will operate on the entire folio.
The uptodate (memory contents are at least as new as what's on disk) and
dirty (memory contents are newer than what's on disk) status of the
folio are all that's needed for this case.

If the fsblock size is less than the size of a pagecache folio, iomap
tracks the per-fsblock uptodate and dirty state itself.
This enables iomap to handle both "bs < ps" `filesystems
<https://lore.kernel.org/all/20230725122932.144426-1-ritesh.list@gmail.com/>`_
and large folios in the pagecache.

iomap internally tracks two state bits per fsblock:

 * ``uptodate``: iomap will try to keep folios fully up to date.
   If there are read(ahead) errors, those fsblocks will not be marked
   uptodate.
   The folio itself will be marked uptodate when all fsblocks within the
   folio are uptodate.

 * ``dirty``: iomap will set the per-block dirty state when programs
   write to the file.
   The folio itself will be marked dirty when any fsblock within the
   folio is dirty.

iomap also tracks the number of read and write disk I/Os that are in
flight.
This structure is much lighter weight than ``struct buffer_head``
because there is only one per folio, and the per-fsblock overhead is two
bits vs. 104 bytes.

Filesystems wishing to turn on large folios in the pagecache should call
``mapping_set_large_folios`` when initializing the incore inode.
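
For example, a filesystem might opt in to large folios while setting up
the address space; the surrounding helper is hypothetical, only the
final call is the point:

.. code-block:: c

 /* Hypothetical incore inode setup path for "myfs". */
 static void myfs_setup_inode(struct inode *inode)
 {
     /* ... set i_op, i_mapping->a_ops, flags, etc. ... */

     /* Allow the pagecache to cache this file with large folios. */
     mapping_set_large_folios(inode->i_mapping);
 }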

Buffered Readahead and Reads
----------------------------

The ``iomap_readahead`` function initiates readahead to the pagecache.
The ``iomap_read_folio`` function reads one folio's worth of data into
the pagecache.
The ``flags`` argument to ``->iomap_begin`` will be set to zero.
The pagecache takes whatever locks it needs before calling the
filesystem.

Buffered Writes
---------------

The ``iomap_file_buffered_write`` function writes an ``iocb`` to the
pagecache.
``IOMAP_WRITE`` or ``IOMAP_WRITE | IOMAP_NOWAIT`` will be passed as
the ``flags`` argument to ``->iomap_begin``.
Callers commonly take ``i_rwsem`` in either shared or exclusive mode
before calling this function.
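
A ``->write_iter`` implementation for buffered I/O might look like this
sketch; ``myfs_iomap_write_ops`` is hypothetical, and the locking and
write-sync details vary by filesystem:

.. code-block:: c

 /* Hypothetical ->write_iter wrapping iomap_file_buffered_write. */
 static ssize_t myfs_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
     struct inode *inode = file_inode(iocb->ki_filp);
     ssize_t ret;

     inode_lock(inode);
     ret = generic_write_checks(iocb, from);
     if (ret > 0)
         ret = iomap_file_buffered_write(iocb, from,
                 &myfs_iomap_write_ops);
     inode_unlock(inode);

     if (ret > 0)
         ret = generic_write_sync(iocb, ret);
     return ret;
 }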

mmap Write Faults
~~~~~~~~~~~~~~~~~

The ``iomap_page_mkwrite`` function handles a write fault to a folio in
the pagecache.
``IOMAP_WRITE | IOMAP_FAULT`` will be passed as the ``flags`` argument
to ``->iomap_begin``.
Callers commonly take the mmap ``invalidate_lock`` in shared or
exclusive mode before calling this function.
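
A ``->page_mkwrite`` handler might be sketched as follows; the
``myfs_iomap_write_ops`` structure is hypothetical:

.. code-block:: c

 /* Hypothetical ->page_mkwrite handler wrapping iomap_page_mkwrite. */
 static vm_fault_t myfs_page_mkwrite(struct vm_fault *vmf)
 {
     struct inode *inode = file_inode(vmf->vma->vm_file);
     vm_fault_t ret;

     sb_start_pagefault(inode->i_sb);
     filemap_invalidate_lock_shared(inode->i_mapping);
     ret = iomap_page_mkwrite(vmf, &myfs_iomap_write_ops);
     filemap_invalidate_unlock_shared(inode->i_mapping);
     sb_end_pagefault(inode->i_sb);
     return ret;
 }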

Buffered Write Failures
~~~~~~~~~~~~~~~~~~~~~~~

After a short write to the pagecache, the areas not written will not
become marked dirty.
The filesystem must arrange to `cancel
<https://lore.kernel.org/all/20221123055812.747923-6-david@fromorbit.com/>`_
such `reservations
<https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/>`_
because writeback will not consume the reservation.

The ``iomap_write_delalloc_release`` function can be called from a
``->iomap_end`` function to find all the clean areas of the folios
caching a fresh (``IOMAP_F_NEW``) delalloc mapping.
It takes the ``invalidate_lock``.

The filesystem must supply a function ``punch`` to be called for
each file range in this state.
This function must *only* remove delayed allocation reservations, in
case another thread racing with the current thread writes successfully
to the same region and triggers writeback to flush the dirty data out to
disk.

Zeroing for File Operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Filesystems can call ``iomap_zero_range`` to perform zeroing of the
pagecache for non-truncation file operations that are not aligned to
the fsblock size.
``IOMAP_ZERO`` will be passed as the ``flags`` argument to
``->iomap_begin``.
Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
mode before calling this function.

Unsharing Reflinked File Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Filesystems can call ``iomap_file_unshare`` to force a file that shares
storage with another file to preemptively copy the shared data to newly
allocated storage.
``IOMAP_WRITE | IOMAP_UNSHARE`` will be passed as the ``flags`` argument
to ``->iomap_begin``.
Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
mode before calling this function.

Truncation
----------

Filesystems can call ``iomap_truncate_page`` to zero the bytes in the
pagecache from EOF to the end of the fsblock during a file truncation
operation.
``truncate_setsize`` or ``truncate_pagecache`` will take care of
everything after the EOF block.
``IOMAP_ZERO`` will be passed as the ``flags`` argument to
``->iomap_begin``.
Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
mode before calling this function.

Pagecache Writeback
-------------------

Filesystems can call ``iomap_writepages`` to respond to a request to
write dirty pagecache folios to disk.
The ``mapping`` and ``wbc`` parameters should be passed unchanged.
The ``wpc`` pointer should be allocated by the filesystem and must
be initialized to zero.

The pagecache will lock each folio before trying to schedule it for
writeback.
It does not lock ``i_rwsem`` or ``invalidate_lock``.

The dirty bit will be cleared for all folios run through the
``->map_blocks`` machinery described below even if the writeback fails.
This is to prevent dirty folio clots when storage devices fail; an
``-EIO`` is recorded for userspace to collect via ``fsync``.

The ``ops`` structure must be specified and is as follows:
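
A minimal ``->writepages`` wrapper might look like this sketch; the
``myfs_writeback_ops`` structure is hypothetical:

.. code-block:: c

 /* Hypothetical ->writepages; wpc lives on the stack, zero-initialized. */
 static int myfs_writepages(struct address_space *mapping,
         struct writeback_control *wbc)
 {
     struct iomap_writepage_ctx wpc = { };

     return iomap_writepages(mapping, wbc, &wpc, &myfs_writeback_ops);
 }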

``struct iomap_writeback_ops``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: c

 struct iomap_writeback_ops {
     int (*map_blocks)(struct iomap_writepage_ctx *wpc, struct inode *inode,
                       loff_t offset, unsigned len);
     int (*prepare_ioend)(struct iomap_ioend *ioend, int status);
     void (*discard_folio)(struct folio *folio, loff_t pos);
 };

The fields are as follows:

 - ``map_blocks``: Sets ``wpc->iomap`` to the space mapping of the file
   range (in bytes) given by ``offset`` and ``len``.
   iomap calls this function for each dirty fsblock in each dirty folio,
   though it will `reuse mappings
   <https://lore.kernel.org/all/20231207072710.176093-15-hch@lst.de/>`_
   for runs of contiguous dirty fsblocks within a folio.
   Do not return ``IOMAP_INLINE`` mappings here; the ``->iomap_end``
   function must deal with persisting written data.
   Do not return ``IOMAP_DELALLOC`` mappings here; iomap currently
   requires mapping to allocated space.
   Filesystems can skip a potentially expensive mapping lookup if the
   mappings have not changed.
   This revalidation must be open-coded by the filesystem; it is
   unclear if ``iomap::validity_cookie`` can be reused for this
   purpose.
   This function must be supplied by the filesystem.

 - ``prepare_ioend``: Enables filesystems to transform the writeback
   ioend or perform any other preparatory work before the writeback I/O
   is submitted.
   This might include pre-write space accounting updates, or installing
   a custom ``->bi_end_io`` function for internal purposes, such as
   deferring the ioend completion to a workqueue to run metadata update
   transactions from process context.
   This function is optional.

 - ``discard_folio``: iomap calls this function after ``->map_blocks``
   fails to schedule I/O for any part of a dirty folio.
   The function should throw away any reservations that may have been
   made for the write.
   The folio will be marked clean and an ``-EIO`` recorded in the
   pagecache.
   Filesystems can use this callback to `remove
   <https://lore.kernel.org/all/20201029163313.1766967-1-bfoster@redhat.com/>`_
   delalloc reservations to avoid having delalloc reservations for
   clean pagecache.
   This function is optional.
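
The mapping-reuse optimization in ``->map_blocks`` might be sketched
like this; the ``myfs_*`` helpers, including the open-coded
revalidation, are hypothetical:

.. code-block:: c

 /* Hypothetical ->map_blocks that reuses a still-valid cached mapping. */
 static int myfs_map_blocks(struct iomap_writepage_ctx *wpc,
         struct inode *inode, loff_t offset, unsigned len)
 {
     /* Reuse the previous lookup if it still covers this range. */
     if (offset >= wpc->iomap.offset &&
         offset < wpc->iomap.offset + wpc->iomap.length &&
         myfs_writeback_mapping_valid(wpc))
         return 0;

     /* Otherwise look up (and allocate, if delalloc) real space. */
     return myfs_convert_and_map_blocks(inode, offset, len, &wpc->iomap);
 }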

Pagecache Writeback Completion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To handle the bookkeeping that must happen after disk I/O for writeback
completes, iomap creates chains of ``struct iomap_ioend`` objects that
wrap the ``bio`` that is used to write pagecache data to disk.
By default, iomap finishes writeback ioends by clearing the writeback
bit on the folios attached to the ``ioend``.
If the write failed, it will also set the error bits on the folios and
the address space.
This can happen in interrupt or process context, depending on the
storage device.

Filesystems that need to update internal bookkeeping (e.g. unwritten
extent conversions) should provide a ``->prepare_ioend`` function to
set the ``->bi_end_io`` field of the ``struct iomap_ioend``'s bio to
its own function.
This function should call ``iomap_finish_ioends`` after finishing its
own work (e.g. unwritten extent conversion).

Some filesystems may wish to `amortize the cost of running metadata
transactions
<https://lore.kernel.org/all/20220120034733.221737-1-david@fromorbit.com/>`_
for post-writeback updates by batching them.
They may also require transactions to run from process context, which
implies punting batches to a workqueue.
iomap ioends contain a ``list_head`` to enable batching.

Given a batch of ioends, iomap has a few helpers to assist with
amortization:

 * ``iomap_sort_ioends``: Sort all the ioends in the list by file
   offset.

 * ``iomap_ioend_try_merge``: Given an ioend that is not in any list and
   a separate list of sorted ioends, merge as many of the ioends from
   the head of the list into the given ioend.
   ioends can only be merged if the file range and storage addresses are
   contiguous; the unwritten and shared status are the same; and the
   write I/O outcome is the same.
   The merged ioends become their own list.

 * ``iomap_finish_ioends``: Finish an ioend that possibly has other
   ioends linked to it.
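
Draining a batch from a workqueue might be sketched as follows; the
``myfs`` inode fields and locking are hypothetical, and field layouts
of ``struct iomap_ioend`` vary between kernel versions:

.. code-block:: c

 /* Hypothetical workqueue function draining a batch of ioends. */
 static void myfs_end_io_work(struct work_struct *work)
 {
     struct myfs_inode *mi = container_of(work, struct myfs_inode,
                                          ioend_work);
     struct iomap_ioend *ioend;
     LIST_HEAD(tmp);

     spin_lock_irq(&mi->ioend_lock);
     list_splice_init(&mi->ioend_list, &tmp);
     spin_unlock_irq(&mi->ioend_lock);

     iomap_sort_ioends(&tmp);
     while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
                                              io_list))) {
         list_del_init(&ioend->io_list);
         iomap_ioend_try_merge(ioend, &tmp);
         /* run metadata transactions (e.g. unwritten conversion) here */
         iomap_finish_ioends(ioend, 0);
     }
 }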

Direct I/O
==========

In Linux, direct I/O is defined as file I/O that is issued directly to
storage, bypassing the pagecache.
The ``iomap_dio_rw`` function implements O_DIRECT (direct I/O) reads and
writes for files.

.. code-block:: c

 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
                      const struct iomap_ops *ops,
                      const struct iomap_dio_ops *dops,
                      unsigned int dio_flags, void *private,
                      size_t done_before);

The filesystem can provide the ``dops`` parameter if it needs to perform
extra work before or after the I/O is issued to storage.
The ``done_before`` parameter tells iomap how much of the request has
already been transferred.
It is used to continue a request asynchronously when `part of the
request
<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c03098d4b9ad76bca2966a8769dcfe59f7f85103>`_
has already been completed synchronously.
The ``done_before`` parameter should be set if writes for the ``iocb``
have been initiated prior to the call.
The direction of the I/O is determined from the ``iocb`` passed in.

The ``dio_flags`` argument can be set to any combination of the
following values:

 * ``IOMAP_DIO_FORCE_WAIT``: Wait for the I/O to complete even if the
   kiocb is not synchronous.

 * ``IOMAP_DIO_OVERWRITE_ONLY``: Perform a pure overwrite for this range
   or fail with ``-EAGAIN``.
   This can be used by filesystems with complex unaligned I/O
   write paths to provide an optimised fast path for unaligned writes.
   If a pure overwrite can be performed, then serialisation against
   other I/Os to the same filesystem block(s) is unnecessary as there is
   no risk of stale data exposure or data loss.
   If a pure overwrite cannot be performed, then the filesystem can
   perform the serialisation steps needed to provide exclusive access
   to the unaligned I/O range so that it can perform allocation and
   sub-block zeroing safely.
   Filesystems can use this flag to try to reduce locking contention,
   but a lot of `detailed checking
   <https://lore.kernel.org/linux-ext4/20230314130759.642710-1-bfoster@redhat.com/>`_
   is required to do it `correctly
   <https://lore.kernel.org/linux-ext4/20230810165559.946222-1-bfoster@redhat.com/>`_.

 * ``IOMAP_DIO_PARTIAL``: If a page fault occurs, return whatever
   progress has already been made.
   The caller may deal with the page fault and retry the operation.
   If the caller decides to retry the operation, it should pass the
   accumulated return values of all previous calls as the
   ``done_before`` parameter to the next call.

These ``struct kiocb`` flags are significant for direct I/O with iomap:

 * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.

 * ``IOCB_SYNC``: Ensure that the device has persisted data to disk
   before completing the call.
   In the case of pure overwrites, the I/O may be issued with FUA
   enabled.

 * ``IOCB_HIPRI``: Poll for I/O completion instead of waiting for an
   interrupt.
   Only meaningful for asynchronous I/O, and only if the entire I/O can
   be issued as a single ``struct bio``.

 * ``IOCB_DIO_CALLER_COMP``: Try to run I/O completion from the caller's
   process context.
   See ``linux/fs.h`` for more details.

Filesystems should call ``iomap_dio_rw`` from ``->read_iter`` and
``->write_iter``, and set ``FMODE_CAN_ODIRECT`` in the ``->open``
function for the file.
They should not set ``->direct_IO``, which is deprecated.

If a filesystem wishes to perform its own work before direct I/O
completion, it should call ``__iomap_dio_rw``.
If its return value is not an error pointer or a NULL pointer, the
filesystem should pass the return value to ``iomap_dio_complete`` after
finishing its internal work.
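
A direct read path might be sketched like this; ``myfs_iomap_read_ops``
is hypothetical, and real filesystems layer in buffered fallback and
``IOCB_NOWAIT`` handling:

.. code-block:: c

 /* Hypothetical ->read_iter handling only the O_DIRECT case. */
 static ssize_t myfs_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
     struct inode *inode = file_inode(iocb->ki_filp);
     ssize_t ret;

     if (!(iocb->ki_flags & IOCB_DIRECT))
         return generic_file_read_iter(iocb, to);

     inode_lock_shared(inode);
     ret = iomap_dio_rw(iocb, to, &myfs_iomap_read_ops, NULL,
             0, NULL, 0);
     inode_unlock_shared(inode);
     return ret;
 }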

Return Values
-------------

``iomap_dio_rw`` can return one of the following:

 * A non-negative number of bytes transferred.

 * ``-ENOTBLK``: Fall back to buffered I/O.
   iomap itself will return this value if it cannot invalidate the page
   cache before issuing the I/O to storage.
   The ``->iomap_begin`` or ``->iomap_end`` functions may also return
   this value.

 * ``-EIOCBQUEUED``: The asynchronous direct I/O request has been
   queued and will be completed separately.

 * Any of the other negative error codes.

Direct Reads
------------

A direct I/O read initiates a read I/O from the storage device to the
caller's buffer.
Dirty parts of the pagecache are flushed to storage before initiating
the read I/O.
The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT`` with
any combination of the following enhancements:

 * ``IOMAP_NOWAIT``, as defined previously.

Callers commonly hold ``i_rwsem`` in shared mode before calling this
function.

Direct Writes
-------------

A direct I/O write initiates a write I/O to the storage device from the
caller's buffer.
Dirty parts of the pagecache are flushed to storage before initiating
the write I/O.
The pagecache is invalidated both before and after the write I/O.
The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT |
IOMAP_WRITE`` with any combination of the following enhancements:

 * ``IOMAP_NOWAIT``, as defined previously.

 * ``IOMAP_OVERWRITE_ONLY``: Allocating blocks and zeroing partial
   blocks is not allowed.
   The entire file range must map to a single written or unwritten
   extent.
   The file I/O range must be aligned to the filesystem block size
   if the mapping is unwritten and the filesystem cannot handle zeroing
   the unaligned regions without exposing stale contents.

Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
calling this function.

``struct iomap_dio_ops``
------------------------

.. code-block:: c

 struct iomap_dio_ops {
     void (*submit_io)(const struct iomap_iter *iter, struct bio *bio,
                       loff_t file_offset);
     int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
                   unsigned flags);
     struct bio_set *bio_set;
 };

The fields of this structure are as follows:

 - ``submit_io``: iomap calls this function when it has constructed a
   ``struct bio`` object for the I/O requested, and wishes to submit it
   to the block device.
   If no function is provided, ``submit_bio`` will be called directly.
   Filesystems that would like to perform additional work before
   submission (e.g. data replication for btrfs) should implement this
   function.

 - ``end_io``: This is called after the ``struct bio`` completes.
   This function should perform post-write conversions of unwritten
   extent mappings, handle write failures, etc.
   The ``flags`` argument may be set to a combination of the following:

   * ``IOMAP_DIO_UNWRITTEN``: The mapping was unwritten, so the ioend
     should mark the extent as written.

   * ``IOMAP_DIO_COW``: Writing to the space in the mapping required a
     copy on write operation, so the ioend should switch mappings.

 - ``bio_set``: This allows the filesystem to provide a custom bio_set
   for allocating direct I/O bios.
   This enables filesystems to `stash additional per-bio information
   <https://lore.kernel.org/all/20220505201115.937837-3-hch@lst.de/>`_
   for private use.
   If this field is NULL, generic ``struct bio`` objects will be used.

Filesystems that want to perform extra work after an I/O completion
should set a custom ``->bi_end_io`` function via ``->submit_io``.
Afterwards, the custom endio function must call
``iomap_dio_bio_end_io`` to finish the direct I/O.
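
An ``->end_io`` implementation might be sketched like this; the
``myfs_convert_unwritten`` helper is hypothetical:

.. code-block:: c

 /* Hypothetical ->end_io converting unwritten extents after a write. */
 static int myfs_dio_end_io(struct kiocb *iocb, ssize_t size, int error,
         unsigned flags)
 {
     struct inode *inode = file_inode(iocb->ki_filp);

     if (error)
         return error;
     if (!size)
         return 0;

     if (flags & IOMAP_DIO_UNWRITTEN)
         return myfs_convert_unwritten(inode, iocb->ki_pos, size);
     return 0;
 }

 static const struct iomap_dio_ops myfs_dio_write_ops = {
     .end_io     = myfs_dio_end_io,
 };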

DAX I/O
=======

Some storage devices can be directly mapped as memory.
These devices support a new access mode known as "fsdax" that allows
loads and stores through the CPU and memory controller.

fsdax Reads
-----------

A fsdax read performs a memcpy from storage device to the caller's
buffer.
The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX`` with any
combination of the following enhancements:

 * ``IOMAP_NOWAIT``, as defined previously.

Callers commonly hold ``i_rwsem`` in shared mode before calling this
function.

fsdax Writes
------------

A fsdax write initiates a memcpy to the storage device from the caller's
buffer.
The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX |
IOMAP_WRITE`` with any combination of the following enhancements:

 * ``IOMAP_NOWAIT``, as defined previously.

 * ``IOMAP_OVERWRITE_ONLY``: The caller requires a pure overwrite to be
   performed from this mapping.
   This requires the filesystem extent mapping to already exist as an
   ``IOMAP_MAPPED`` type and span the entire range of the write I/O
   request.
   If the filesystem cannot map this request in a way that allows the
   iomap infrastructure to perform a pure overwrite, it must fail the
   mapping operation with ``-EAGAIN``.

Callers commonly hold ``i_rwsem`` in exclusive mode before calling this
function.

fsdax mmap Faults
~~~~~~~~~~~~~~~~~

The ``dax_iomap_fault`` function handles read and write faults to fsdax
storage.
For a read fault, ``IOMAP_DAX | IOMAP_FAULT`` will be passed as the
``flags`` argument to ``->iomap_begin``.
For a write fault, ``IOMAP_DAX | IOMAP_FAULT | IOMAP_WRITE`` will be
passed as the ``flags`` argument to ``->iomap_begin``.

Callers commonly hold the same locks as they do to call their iomap
pagecache counterparts.

fsdax Truncation, fallocate, and Unsharing
------------------------------------------

For fsdax files, the following functions are provided to replace their
iomap pagecache I/O counterparts.
The ``flags`` argument to ``->iomap_begin`` are the same as the
pagecache counterparts, with ``IOMAP_DAX`` added.

 * ``dax_file_unshare``
 * ``dax_zero_range``
 * ``dax_truncate_page``

Callers commonly hold the same locks as they do to call their iomap
pagecache counterparts.

fsdax Deduplication
-------------------

Filesystems implementing the ``FIDEDUPERANGE`` ioctl must call the
``dax_remap_file_range_prep`` function with their own iomap read ops.

Seeking Files
=============

iomap implements the two iterating whence modes of the ``llseek`` system
call.

SEEK_DATA
---------

The ``iomap_seek_data`` function implements the SEEK_DATA "whence" value
for llseek.
``IOMAP_REPORT`` will be passed as the ``flags`` argument to
``->iomap_begin``.

For unwritten mappings, the pagecache will be searched.
Regions of the pagecache with a folio mapped and uptodate fsblocks
within those folios will be reported as data areas.

Callers commonly hold ``i_rwsem`` in shared mode before calling this
function.

SEEK_HOLE
---------

The ``iomap_seek_hole`` function implements the SEEK_HOLE "whence" value
for llseek.
``IOMAP_REPORT`` will be passed as the ``flags`` argument to
``->iomap_begin``.

For unwritten mappings, the pagecache will be searched.
Regions of the pagecache with no folio mapped, or a !uptodate fsblock
within a folio will be reported as sparse hole areas.

Callers commonly hold ``i_rwsem`` in shared mode before calling this
function.
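
Both whence modes might be wired into ``->llseek`` as in this sketch;
``myfs_iomap_read_ops`` is hypothetical:

.. code-block:: c

 /* Hypothetical ->llseek using iomap for SEEK_DATA and SEEK_HOLE. */
 static loff_t myfs_llseek(struct file *file, loff_t offset, int whence)
 {
     struct inode *inode = file_inode(file);
     loff_t ret;

     switch (whence) {
     case SEEK_DATA:
     case SEEK_HOLE:
         inode_lock_shared(inode);
         if (whence == SEEK_DATA)
             ret = iomap_seek_data(inode, offset,
                     &myfs_iomap_read_ops);
         else
             ret = iomap_seek_hole(inode, offset,
                     &myfs_iomap_read_ops);
         inode_unlock_shared(inode);
         break;
     default:
         return generic_file_llseek(file, offset, whence);
     }

     if (ret < 0)
         return ret;
     return vfs_setpos(file, ret, inode->i_sb->s_maxbytes);
 }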

Swap File Activation
====================

The ``iomap_swapfile_activate`` function finds all the base-page aligned
regions in a file and sets them up as swap space.
The file will be ``fsync()``'d before activation.
``IOMAP_REPORT`` will be passed as the ``flags`` argument to
``->iomap_begin``.
All mappings must be mapped or unwritten; they cannot be dirty or
shared, and cannot span multiple block devices.
Callers must hold ``i_rwsem`` in exclusive mode; this is already
provided by ``swapon``.

File Space Mapping Reporting
============================

iomap implements two of the file space mapping system calls.

FS_IOC_FIEMAP
-------------

The ``iomap_fiemap`` function exports file extent mappings to userspace
in the format specified by the ``FS_IOC_FIEMAP`` ioctl.
``IOMAP_REPORT`` will be passed as the ``flags`` argument to
``->iomap_begin``.
Callers commonly hold ``i_rwsem`` in shared mode before calling this
function.

FIBMAP (deprecated)
-------------------

``iomap_bmap`` implements FIBMAP.
The calling conventions are the same as for FIEMAP.
This function is only provided to maintain compatibility for filesystems
that implemented FIBMAP prior to conversion.
This ioctl is deprecated; do **not** add a FIBMAP implementation to
filesystems that do not have it.
Callers should probably hold ``i_rwsem`` in shared mode before calling
this function, but this is unclear.