  1. .. SPDX-License-Identifier: GPL-2.0
  2. Journal (jbd2)
  3. --------------
  4. Introduced in ext3, the ext4 filesystem employs a journal to protect the
  5. filesystem against metadata inconsistencies in the case of a system crash. Up
  6. to 10,240,000 file system blocks (see man mke2fs(8) for more details on journal
  7. size limits) can be reserved inside the filesystem as a place to land
  8. “important” data writes on-disk as quickly as possible. Once the important
  9. data transaction is fully written to the disk and flushed from the disk write
  10. cache, a record of the data being committed is also written to the journal. At
  11. some later point in time, the journal code writes the transactions to their
  12. final locations on disk (this could involve a lot of seeking or a lot of small
  13. read-write-erases) before erasing the commit record. Should the system
  14. crash during the second slow write, the journal can be replayed all the
  15. way to the latest commit record, guaranteeing the atomicity of whatever
  16. gets written through the journal to the disk. The effect of this is to
  17. guarantee that the filesystem does not become stuck midway through a
  18. metadata update.
  19. For performance reasons, ext4 by default only writes filesystem metadata
  20. through the journal. This means that file data blocks are *not*
  21. guaranteed to be in any consistent state after a crash. If this default
  22. guarantee level (``data=ordered``) is not satisfactory, there is a mount
  23. option to control journal behavior. If ``data=journal``, all data and
  24. metadata are written to disk through the journal. This is slower but
  25. safest. If ``data=writeback``, dirty data blocks are not flushed to the
  26. disk before the metadata are written to disk through the journal.
  27. In case of ``data=ordered`` mode, Ext4 also supports fast commits which
  28. help reduce commit latency significantly. The default ``data=ordered``
  29. mode works by logging metadata blocks to the journal. In fast commit
  30. mode, Ext4 only stores the minimal delta needed to recreate the
  31. affected metadata in fast commit space that is shared with JBD2.
  32. Once the fast commit area fills up, or if a fast commit is not possible,
  33. or if the JBD2 commit timer goes off, Ext4 performs a traditional full commit.
  34. A full commit invalidates all the fast commits that happened before
  35. it and thus it makes the fast commit area empty for further fast
  36. commits. This feature needs to be enabled at mkfs time.
  37. The journal inode is typically inode 8. The first 68 bytes of the
  38. journal inode are replicated in the ext4 superblock. The journal itself
  39. is a normal (but hidden) file within the filesystem. The file usually
  40. consumes an entire block group, though mke2fs tries to put it in the
  41. middle of the disk.
  42. All fields in jbd2 are written to disk in big-endian order. This is the
  43. opposite of ext4.
  44. NOTE: Both ext4 and ocfs2 use jbd2.
  45. The maximum size of a journal embedded in an ext4 filesystem is 2^32
  46. blocks. jbd2 itself does not seem to care.
  47. Layout
  48. ~~~~~~
  49. Generally speaking, the journal has this format:
  50. .. list-table::
  51. :widths: 16 48 16
  52. :header-rows: 1
  53. * - Superblock
  54. - descriptor_block (data_blocks or revocation_block) [more data or
  55. revocations] commit_block
  56. - [more transactions...]
  57. * -
  58. - One transaction
  59. -
  60. Notice that a transaction begins with either a descriptor and some data,
  61. or a block revocation list. A finished transaction always ends with a
  62. commit. If there is no commit record (or the checksums don't match), the
  63. transaction will be discarded during replay.
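The replay rule above can be sketched in a few lines of Python. This is a simplified model and not the kernel's recovery code: the log is represented as a hypothetical list of ``(block_type, sequence, checksum_ok)`` tuples for the blocks that carry a journal header (data blocks are left out), the block type numbers are the ones from the block type table in this document, and checksum verification is reduced to a boolean.

```python
DESCRIPTOR, COMMIT, REVOKE = 1, 2, 5  # block types from the block type table


def committed_transactions(blocks):
    """Return the sequence numbers of transactions that end in a valid commit.

    `blocks` is a list of (block_type, sequence, checksum_ok) tuples in log
    order. This tuple layout is an assumption made for the example.
    """
    committed = []
    pending = None  # sequence number of the transaction being scanned
    for block_type, seq, checksum_ok in blocks:
        if not checksum_ok:
            break  # a checksum mismatch ends the scan; the rest is discarded
        if block_type in (DESCRIPTOR, REVOKE):
            pending = seq
        elif block_type == COMMIT and pending == seq:
            committed.append(seq)
            pending = None
    return committed


log = [
    (DESCRIPTOR, 7, True), (COMMIT, 7, True),
    (REVOKE, 8, True), (DESCRIPTOR, 8, True), (COMMIT, 8, True),
    (DESCRIPTOR, 9, True),  # crash happened before the commit block
]
print(committed_transactions(log))  # [7, 8]
```

Transaction 9 has no commit record, so it is dropped, which matches the rule stated above.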
  64. External Journal
  65. ~~~~~~~~~~~~~~~~
  66. Optionally, an ext4 filesystem can be created with an external journal
  67. device (as opposed to an internal journal, which uses a reserved inode).
  68. In this case, on the filesystem device, ``s_journal_inum`` should be
  69. zero and ``s_journal_uuid`` should be set. On the journal device there
  70. will be an ext4 super block in the usual place, with a matching UUID.
  71. The journal superblock will be in the next full block after the
  72. superblock.
  73. .. list-table::
  74. :widths: 12 12 12 32 12
  75. :header-rows: 1
  76. * - 1024 bytes of padding
  77. - ext4 Superblock
  78. - Journal Superblock
  79. - descriptor_block (data_blocks or revocation_block) [more data or
  80. revocations] commit_block
  81. - [more transactions...]
  82. * -
  83. -
  84. -
  85. - One transaction
  86. -
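As a rough illustration of "the next full block after the superblock", the byte offset of the journal superblock on an external journal device can be computed from the block size. The sketch assumes the ext4 superblock is 1024 bytes long and starts right after the 1024 bytes of padding shown in the table; the function name is made up for the example.

```python
EXT4_SB_OFFSET = 1024  # the 1024 bytes of padding before the ext4 superblock
EXT4_SB_SIZE = 1024    # assumed size of the ext4 superblock itself


def journal_sb_offset(block_size):
    """Byte offset of the first full block after the ext4 superblock."""
    end_of_ext4_sb = EXT4_SB_OFFSET + EXT4_SB_SIZE
    blocks = -(-end_of_ext4_sb // block_size)  # ceiling division
    return blocks * block_size


for bs in (1024, 2048, 4096):
    print(bs, journal_sb_offset(bs))
```

With 1 KiB blocks the journal superblock lands in block 2; with larger blocks it lands in block 1.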
  87. Block Header
  88. ~~~~~~~~~~~~
  89. Every block in the journal starts with a common 12-byte header
  90. ``struct journal_header_s``:
  91. .. list-table::
  92. :widths: 8 8 24 40
  93. :header-rows: 1
  94. * - Offset
  95. - Type
  96. - Name
  97. - Description
  98. * - 0x0
  99. - __be32
  100. - h_magic
  101. - jbd2 magic number, 0xC03B3998.
  102. * - 0x4
  103. - __be32
  104. - h_blocktype
  105. - Description of what this block contains. See the jbd2_blocktype_ table
  106. below.
  107. * - 0x8
  108. - __be32
  109. - h_sequence
  110. - The transaction ID that goes with this block.
  111. .. _jbd2_blocktype:
  112. The journal block type can be any one of:
  113. .. list-table::
  114. :widths: 16 64
  115. :header-rows: 1
  116. * - Value
  117. - Description
  118. * - 1
  119. - Descriptor. This block precedes a series of data blocks that were
  120. written through the journal during a transaction.
  121. * - 2
  122. - Block commit record. This block signifies the completion of a
  123. transaction.
  124. * - 3
  125. - Journal superblock, v1.
  126. * - 4
  127. - Journal superblock, v2.
  128. * - 5
  129. - Block revocation records. This speeds up recovery by enabling the
  130. journal to skip writing blocks that were subsequently rewritten.
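A minimal sketch of decoding the common block header with Python's ``struct`` module follows. The ``>`` prefix selects big-endian, matching the on-disk byte order of jbd2 fields. The function name and the returned tuple are choices made for this example only.

```python
import struct

JBD2_MAGIC = 0xC03B3998
BLOCK_TYPES = {
    1: "descriptor",
    2: "commit",
    3: "superblock v1",
    4: "superblock v2",
    5: "revocation",
}


def parse_block_header(block):
    """Return (type_name, sequence), or None if the magic does not match."""
    magic, blocktype, sequence = struct.unpack_from(">III", block, 0)
    if magic != JBD2_MAGIC:
        return None  # not a journal metadata block (for example a data block)
    return BLOCK_TYPES.get(blocktype, "unknown"), sequence


# A fabricated commit block for transaction 41, padded to 1024 bytes.
sample = struct.pack(">III", JBD2_MAGIC, 2, 41) + bytes(1012)
print(parse_block_header(sample))  # ('commit', 41)
```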
  131. Super Block
  132. ~~~~~~~~~~~
  133. The super block for the journal is much simpler than ext4's.
  134. The key data kept within are the size of the journal, and where to find the
  135. start of the log of transactions.
  136. The journal superblock is recorded as ``struct journal_superblock_s``,
  137. which is 1024 bytes long:
  138. .. list-table::
  139. :widths: 8 8 24 40
  140. :header-rows: 1
  141. * - Offset
  142. - Type
  143. - Name
  144. - Description
  145. * -
  146. -
  147. -
  148. - Static information describing the journal.
  149. * - 0x0
  150. - journal_header_t (12 bytes)
  151. - s_header
  152. - Common header identifying this as a superblock.
  153. * - 0xC
  154. - __be32
  155. - s_blocksize
  156. - Journal device block size.
  157. * - 0x10
  158. - __be32
  159. - s_maxlen
  160. - Total number of blocks in this journal.
  161. * - 0x14
  162. - __be32
  163. - s_first
  164. - First block of log information.
  165. * -
  166. -
  167. -
  168. - Dynamic information describing the current state of the log.
  169. * - 0x18
  170. - __be32
  171. - s_sequence
  172. - First commit ID expected in log.
  173. * - 0x1C
  174. - __be32
  175. - s_start
  176. - Block number of the start of log. Contrary to the comments, this field
  177. being zero does not imply that the journal is clean!
  178. * - 0x20
  179. - __be32
  180. - s_errno
  181. - Error value, as set by jbd2_journal_abort().
  182. * -
  183. -
  184. -
  185. - The remaining fields are only valid in a v2 superblock.
  186. * - 0x24
  187. - __be32
  188. - s_feature_compat
  189. - Compatible feature set. See the table jbd2_compat_ below.
  190. * - 0x28
  191. - __be32
  192. - s_feature_incompat
  193. - Incompatible feature set. See the table jbd2_incompat_ below.
  194. * - 0x2C
  195. - __be32
  196. - s_feature_ro_compat
  197. - Read-only compatible feature set. There aren't any of these currently.
  198. * - 0x30
  199. - __u8
  200. - s_uuid[16]
  201. - 128-bit uuid for journal. This is compared against the copy in the ext4
  202. super block at mount time.
  203. * - 0x40
  204. - __be32
  205. - s_nr_users
  206. - Number of file systems sharing this journal.
  207. * - 0x44
  208. - __be32
  209. - s_dynsuper
  210. - Location of dynamic super block copy. (Not used?)
  211. * - 0x48
  212. - __be32
  213. - s_max_transaction
  214. - Limit of journal blocks per transaction. (Not used?)
  215. * - 0x4C
  216. - __be32
  217. - s_max_trans_data
  218. - Limit of data blocks per transaction. (Not used?)
  219. * - 0x50
  220. - __u8
  221. - s_checksum_type
  222. - Checksum algorithm used for the journal. See jbd2_checksum_type_ for
  223. more info.
  224. * - 0x51
  225. - __u8[3]
  226. - s_padding2
  227. -
  228. * - 0x54
  229. - __be32
  230. - s_num_fc_blocks
  231. - Number of fast commit blocks in the journal.
  232. * - 0x58
  233. - __be32
  234. - s_head
  235. - Block number of the head (first unused block) of the journal, only
  236. up-to-date when the journal is empty.
  237. * - 0x5C
  238. - __u32
  239. - s_padding[40]
  240. -
  241. * - 0xFC
  242. - __be32
  243. - s_checksum
  244. - Checksum of the entire superblock, with this field set to zero.
  245. * - 0x100
  246. - __u8
  247. - s_users[16*48]
  248. - ids of all file systems sharing the log. e2fsprogs/Linux don't allow
  249. shared external journals, but I imagine Lustre (or ocfs2?), which use
  250. the jbd2 code, might.
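The table above maps directly onto a ``struct`` format string. The sketch below transcribes the offsets and widths from the table (padding fields are skipped with ``x``) and checks that the total comes to 1024 bytes. The namedtuple and function names are illustrative, and the sample superblock is fabricated.

```python
import struct
from collections import namedtuple

# Big-endian; one group per row group of the table. "3x" and "160x" skip
# s_padding2 and s_padding.
SB_FORMAT = ">3I 3I 3I 3I 16s 4I B 3x 2I 160x I 768s"
SB_FIELDS = (
    "h_magic h_blocktype h_sequence "
    "s_blocksize s_maxlen s_first "
    "s_sequence s_start s_errno "
    "s_feature_compat s_feature_incompat s_feature_ro_compat "
    "s_uuid s_nr_users s_dynsuper s_max_transaction s_max_trans_data "
    "s_checksum_type s_num_fc_blocks s_head s_checksum s_users"
)
JournalSuperblock = namedtuple("JournalSuperblock", SB_FIELDS)


def parse_journal_superblock(raw):
    """Decode the first 1024 bytes of `raw` as a journal superblock."""
    return JournalSuperblock._make(struct.unpack_from(SB_FORMAT, raw, 0))


# Fabricated v2 superblock: 4096-byte blocks, 32768 blocks, CRC32C checksums.
sample = struct.pack(
    SB_FORMAT,
    0xC03B3998, 4, 0,      # s_header
    4096, 32768, 1,        # s_blocksize, s_maxlen, s_first
    100, 0, 0,             # s_sequence, s_start, s_errno
    0, 0x12, 0,            # feature sets (incompat: 64BIT | CSUM_V3)
    bytes(16),             # s_uuid
    1, 0, 0, 0,            # s_nr_users .. s_max_trans_data
    4,                     # s_checksum_type
    0, 0,                  # s_num_fc_blocks, s_head
    0,                     # s_checksum
    bytes(768),            # s_users
)
sb = parse_journal_superblock(sample)
print(struct.calcsize(SB_FORMAT), sb.s_blocksize, sb.s_maxlen)
```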
  251. .. _jbd2_compat:
  252. The journal compat features are any combination of the following:
  253. .. list-table::
  254. :widths: 16 64
  255. :header-rows: 1
  256. * - Value
  257. - Description
  258. * - 0x1
  259. - Journal maintains checksums on the data blocks.
  260. (JBD2_FEATURE_COMPAT_CHECKSUM)
  261. .. _jbd2_incompat:
  262. The journal incompat features are any combination of the following:
  263. .. list-table::
  264. :widths: 16 64
  265. :header-rows: 1
  266. * - Value
  267. - Description
  268. * - 0x1
  269. - Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE)
  270. * - 0x2
  271. - Journal can deal with 64-bit block numbers.
  272. (JBD2_FEATURE_INCOMPAT_64BIT)
  273. * - 0x4
  274. - Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
  275. * - 0x8
  276. - This journal uses v2 of the checksum on-disk format. Each journal
  277. metadata block gets its own checksum, and the block tags in the
  278. descriptor table contain checksums for each of the data blocks in the
  279. journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2)
  280. * - 0x10
  281. - This journal uses v3 of the checksum on-disk format. This is the same as
  282. v2, but the journal block tag size is fixed regardless of the size of
  283. block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3)
  284. * - 0x20
  285. - Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT)
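Decoding the incompat bitmask is a matter of testing each bit from the table. The sketch below also reports bits it does not recognize; treating unknown incompat bits as a reason to refuse the journal is the usual convention for "incompatible" feature sets, but that policy is an assumption here and is left to the caller.

```python
INCOMPAT_FEATURES = {
    0x1: "REVOKE",
    0x2: "64BIT",
    0x4: "ASYNC_COMMIT",
    0x8: "CSUM_V2",
    0x10: "CSUM_V3",
    0x20: "FAST_COMMIT",
}


def decode_incompat(mask):
    """Return (known_feature_names, unknown_bits) for an incompat bitmask."""
    names = [name for bit, name in sorted(INCOMPAT_FEATURES.items()) if mask & bit]
    known_bits = sum(INCOMPAT_FEATURES)  # OR of all bits; they do not overlap
    return names, mask & ~known_bits


print(decode_incompat(0x13))  # (['REVOKE', '64BIT', 'CSUM_V3'], 0)
```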
  286. .. _jbd2_checksum_type:
  287. Journal checksum type codes are one of the following. crc32 or crc32c are the
  288. most likely choices.
  289. .. list-table::
  290. :widths: 16 64
  291. :header-rows: 1
  292. * - Value
  293. - Description
  294. * - 1
  295. - CRC32
  296. * - 2
  297. - MD5
  298. * - 3
  299. - SHA1
  300. * - 4
  301. - CRC32C
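Python's standard library has no CRC32C routine, so a bit-by-bit version is sketched below for experimenting with type 4 checksums. It implements only the generic CRC32C (Castagnoli) update step. How jbd2 seeds the checksum and whether it applies a final inversion is not described in this document, so the example exposes the raw update and merely verifies it against the standard CRC32C check value.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected


def crc32c_update(crc, data):
    """Feed `data` into a running CRC32C value, with no pre/post inversion."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc & 0xFFFFFFFF


# Standard CRC32C: start from all ones and invert the result.
check = crc32c_update(0xFFFFFFFF, b"123456789") ^ 0xFFFFFFFF
print(hex(check))  # 0xe3069283
```

A table-driven or hardware-accelerated implementation would be preferable for anything larger than a few blocks.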
  302. Descriptor Block
  303. ~~~~~~~~~~~~~~~~
  304. The descriptor block contains an array of journal block tags that
  305. describe the final locations of the data blocks that follow in the
  306. journal. Descriptor blocks are open-coded instead of being completely
  307. described by a data structure, but here is the block structure anyway.
  308. Descriptor blocks consume at least 36 bytes, but use a full block:
  309. .. list-table::
  310. :widths: 8 8 24 40
  311. :header-rows: 1
  312. * - Offset
  313. - Type
  314. - Name
  315. - Description
  316. * - 0x0
  317. - journal_header_t
  318. - (open coded)
  319. - Common block header.
  320. * - 0xC
  321. - struct journal_block_tag_s
  322. - open coded array[]
  323. - Enough tags either to fill up the block or to describe all the data
  324. blocks that follow this descriptor block.
  325. Journal block tags have any of the following formats, depending on which
  326. journal feature and block tag flags are set.
  327. If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is
  328. defined as ``struct journal_block_tag3_s``, which looks like the
  329. following. The size is 16 or 32 bytes.
  330. .. list-table::
  331. :widths: 8 8 24 40
  332. :header-rows: 1
  333. * - Offset
  334. - Type
  335. - Name
  336. - Description
  337. * - 0x0
  338. - __be32
  339. - t_blocknr
  340. - Lower 32-bits of the location of where the corresponding data block
  341. should end up on disk.
  342. * - 0x4
  343. - __be32
  344. - t_flags
  345. - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
  346. more info.
  347. * - 0x8
  348. - __be32
  349. - t_blocknr_high
  350. - Upper 32-bits of the location of where the corresponding data block
  351. should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is
  352. not enabled.
  353. * - 0xC
  354. - __be32
  355. - t_checksum
  356. - Checksum of the journal UUID, the sequence number, and the data block.
  357. * -
  358. -
  359. -
  360. - This field appears to be open coded. It always comes at the end of the
  361. tag, after t_checksum. This field is not present if the "same UUID" flag
  362. is set.
  363. * - 0x10
  364. - char
  365. - uuid[16]
  366. - A UUID to go with this tag. This field appears to be copied from the
  367. ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
  368. field.
  369. .. _jbd2_tag_flags:
  370. The journal tag flags are any combination of the following:
  371. .. list-table::
  372. :widths: 16 64
  373. :header-rows: 1
  374. * - Value
  375. - Description
  376. * - 0x1
  377. - On-disk block is escaped. The first four bytes of the data block just
  378. happened to match the jbd2 magic number.
  379. * - 0x2
  380. - This block has the same UUID as previous, therefore the UUID field is
  381. omitted.
  382. * - 0x4
  383. - The data block was deleted by the transaction. (Not used?)
  384. * - 0x8
  385. - This is the last tag in this descriptor block.
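The following sketch walks the tags of a descriptor block in the ``JBD2_FEATURE_INCOMPAT_CSUM_V3`` layout. It assumes the 16-byte tag is followed by a 16-byte UUID unless the "same UUID" flag (0x2) is set, and that the walk ends at the tag carrying the "last tag" flag (0x8). Checksums are ignored, and the sample block is fabricated.

```python
import struct

FLAG_ESCAPE, FLAG_SAME_UUID, FLAG_DELETED, FLAG_LAST_TAG = 0x1, 0x2, 0x4, 0x8
HEADER_SIZE = 12  # common block header
TAG3_SIZE = 16    # t_blocknr, t_flags, t_blocknr_high, t_checksum
UUID_SIZE = 16


def walk_tags_v3(block):
    """Return a list of (final_block_number, flags), one entry per tag."""
    tags = []
    offset = HEADER_SIZE
    while offset + TAG3_SIZE <= len(block):
        blocknr, flags, blocknr_high, _checksum = struct.unpack_from(
            ">IIII", block, offset)
        tags.append(((blocknr_high << 32) | blocknr, flags))
        offset += TAG3_SIZE
        if not flags & FLAG_SAME_UUID:
            offset += UUID_SIZE  # skip the UUID that trails this tag
        if flags & FLAG_LAST_TAG:
            break
    return tags


header = struct.pack(">III", 0xC03B3998, 1, 7)
tag1 = struct.pack(">IIII", 100, 0, 0, 0) + bytes(16)  # carries a UUID
tag2 = struct.pack(">IIII", 200, FLAG_SAME_UUID | FLAG_LAST_TAG, 1, 0)
block = (header + tag1 + tag2).ljust(1024, b"\0")
print(walk_tags_v3(block))  # [(100, 0), (4294967496, 10)]
```

The n-th tag describes the n-th data block that follows the descriptor block in the journal.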
  386. If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag
  387. is defined as ``struct journal_block_tag_s``, which looks like the
  388. following. The size is 8, 12, 24, or 28 bytes:
  389. .. list-table::
  390. :widths: 8 8 24 40
  391. :header-rows: 1
  392. * - Offset
  393. - Type
  394. - Name
  395. - Description
  396. * - 0x0
  397. - __be32
  398. - t_blocknr
  399. - Lower 32-bits of the location of where the corresponding data block
  400. should end up on disk.
  401. * - 0x4
  402. - __be16
  403. - t_checksum
  404. - Checksum of the journal UUID, the sequence number, and the data block.
  405. Note that only the lower 16 bits are stored.
  406. * - 0x6
  407. - __be16
  408. - t_flags
  409. - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
  410. more info.
  411. * -
  412. -
  413. -
  414. - This next field is only present if the super block indicates support for
  415. 64-bit block numbers.
  416. * - 0x8
  417. - __be32
  418. - t_blocknr_high
  419. - Upper 32-bits of the location of where the corresponding data block
  420. should end up on disk.
  421. * -
  422. -
  423. -
  424. - This field appears to be open coded. It always comes at the end of the
  425. tag, after t_flags or t_blocknr_high. This field is not present if the
  426. "same UUID" flag is set.
  427. * - 0x8 or 0xC
  428. - char
  429. - uuid[16]
  430. - A UUID to go with this tag. This field appears to be copied from the
  431. ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
  432. field.
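The tag sizes quoted in this section follow from the two tables. The helper below is derived only from those tables (it is not a transcription of the kernel's own size calculation) and simply reproduces the 16 or 32 byte sizes for v3 tags and the 8, 12, 24, or 28 byte sizes otherwise.

```python
def tag_size(csum_v3, has_64bit, same_uuid):
    """Size in bytes of one journal block tag, including any trailing UUID."""
    if csum_v3:
        size = 16                      # fixed-size struct journal_block_tag3_s
    else:
        size = 8                       # t_blocknr, t_checksum, t_flags
        if has_64bit:
            size += 4                  # t_blocknr_high
    if not same_uuid:
        size += 16                     # uuid[16] follows the tag
    return size


sizes_v3 = sorted({tag_size(True, b64, same) for b64 in (False, True)
                   for same in (False, True)})
sizes_old = sorted({tag_size(False, b64, same) for b64 in (False, True)
                    for same in (False, True)})
print(sizes_v3, sizes_old)  # [16, 32] [8, 12, 24, 28]
```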
  433. If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
  434. JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the block is a
  435. ``struct jbd2_journal_block_tail``, which looks like this:
  436. .. list-table::
  437. :widths: 8 8 24 40
  438. :header-rows: 1
  439. * - Offset
  440. - Type
  441. - Name
  442. - Description
  443. * - 0x0
  444. - __be32
  445. - t_checksum
  446. - Checksum of the journal UUID + the descriptor block, with this field set
  447. to zero.
  448. Data Block
  449. ~~~~~~~~~~
  450. In general, the data blocks being written to disk through the journal
  451. are written verbatim into the journal file after the descriptor block.
  452. However, if the first four bytes of the block match the jbd2 magic
  453. number then those four bytes are replaced with zeroes and the “escaped”
  454. flag is set in the descriptor block tag.
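Escaping and its inverse can be sketched as a pair of small functions. The flag value 0x1 is the "escaped" bit from the tag flags table; the function names are invented for the example.

```python
import struct

JBD2_MAGIC_BYTES = struct.pack(">I", 0xC03B3998)
FLAG_ESCAPE = 0x1


def escape_block(data):
    """Return (journal_copy, tag_flags) for a data block entering the journal."""
    if data[:4] == JBD2_MAGIC_BYTES:
        # Zero the colliding bytes so the copy cannot be mistaken for a
        # journal metadata block, and remember that via the tag flag.
        return bytes(4) + data[4:], FLAG_ESCAPE
    return data, 0


def unescape_block(journal_copy, tag_flags):
    """Restore the original contents during replay."""
    if tag_flags & FLAG_ESCAPE:
        return JBD2_MAGIC_BYTES + journal_copy[4:]
    return journal_copy


original = JBD2_MAGIC_BYTES + b"payload that happens to start with the magic"
copy, flags = escape_block(original)
print(flags, unescape_block(copy, flags) == original)  # 1 True
```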
  455. Revocation Block
  456. ~~~~~~~~~~~~~~~~
  457. A revocation block is used to prevent replay of a block in an earlier
  458. transaction. This is used to mark blocks that were journalled at one
  459. time but are no longer journalled. Typically this happens if a metadata
  460. block is freed and re-allocated as a file data block; in this case, a
  461. journal replay after the file block was written to disk will cause
  462. corruption.
  463. **NOTE**: This mechanism is NOT used to express “this journal block is
  464. superseded by this other journal block”, as the author (djwong)
  465. mistakenly thought. Any block being added to a transaction will cause
  466. the removal of all existing revocation records for that block.
  467. Revocation blocks are described in
  468. ``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
  469. length, but use a full block:
  470. .. list-table::
  471. :widths: 8 8 24 40
  472. :header-rows: 1
  473. * - Offset
  474. - Type
  475. - Name
  476. - Description
  477. * - 0x0
  478. - journal_header_t
  479. - r_header
  480. - Common block header.
  481. * - 0xC
  482. - __be32
  483. - r_count
  484. - Number of bytes used in this block.
  485. * - 0x10
  486. - __be32 or __be64
  487. - blocks[0]
  488. - Blocks to revoke.
  489. After r_count is a linear array of block numbers that are effectively
  490. revoked by this transaction. The size of each block number is 8 bytes if
  491. the superblock advertises 64-bit block number support, or 4 bytes
  492. otherwise.
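A sketch of reading the revoked block numbers out of a revocation block follows. It assumes ``r_count`` counts bytes from the start of the block, so the 16-byte header is included and the entries occupy bytes 0x10 up to ``r_count``. The sample blocks are fabricated.

```python
import struct

REVOKE_HEADER_SIZE = 16  # 12-byte common header + r_count


def parse_revoke_block(block, has_64bit):
    """Return the list of revoked block numbers stored in `block`."""
    (r_count,) = struct.unpack_from(">I", block, 12)
    width, fmt = (8, ">Q") if has_64bit else (4, ">I")
    return [
        struct.unpack_from(fmt, block, offset)[0]
        for offset in range(REVOKE_HEADER_SIZE, r_count - width + 1, width)
    ]


header = struct.pack(">III", 0xC03B3998, 5, 9)
block32 = (header + struct.pack(">I", 16 + 3 * 4)
           + struct.pack(">3I", 5, 6, 7)).ljust(1024, b"\0")
block64 = (header + struct.pack(">I", 16 + 2 * 8)
           + struct.pack(">2Q", 2 ** 33, 3)).ljust(1024, b"\0")
print(parse_revoke_block(block32, False))  # [5, 6, 7]
print(parse_revoke_block(block64, True))   # [8589934592, 3]
```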
  493. If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
  494. JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the revocation
  495. block is a ``struct jbd2_journal_revoke_tail``, which has this format:
  496. .. list-table::
  497. :widths: 8 8 24 40
  498. :header-rows: 1
  499. * - Offset
  500. - Type
  501. - Name
  502. - Description
  503. * - 0x0
  504. - __be32
  505. - r_checksum
  506. - Checksum of the journal UUID + revocation block
  507. Commit Block
  508. ~~~~~~~~~~~~
  509. The commit block is a sentry that indicates that a transaction has been
  510. completely written to the journal. Once this commit block reaches the
  511. journal, the data stored with this transaction can be written to their
  512. final locations on disk.
  513. The commit block is described by ``struct commit_header``, which is 60
  514. bytes long (but uses a full block):
  515. .. list-table::
  516. :widths: 8 8 24 40
  517. :header-rows: 1
  518. * - Offset
  519. - Type
  520. - Name
  521. - Description
  522. * - 0x0
  523. - journal_header_s
  524. - (open coded)
  525. - Common block header.
  526. * - 0xC
  527. - unsigned char
  528. - h_chksum_type
  529. - The type of checksum to use to verify the integrity of the data blocks
  530. in the transaction. See jbd2_checksum_type_ for more info.
  531. * - 0xD
  532. - unsigned char
  533. - h_chksum_size
  534. - The number of bytes used by the checksum. Most likely 4.
  535. * - 0xE
  536. - unsigned char
  537. - h_padding[2]
  538. -
  539. * - 0x10
  540. - __be32
  541. - h_chksum[JBD2_CHECKSUM_BYTES]
  542. - 32 bytes of space to store checksums. If
  543. JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3
  544. is set, the first ``__be32`` is the checksum of the journal UUID and
  545. the entire commit block, with this field zeroed. If
  546. JBD2_FEATURE_COMPAT_CHECKSUM is set, the first ``__be32`` is the
  547. crc32 of all the blocks already written to the transaction.
  548. * - 0x30
  549. - __be64
  550. - h_commit_sec
  551. - The time that the transaction was committed, in seconds since the epoch.
  552. * - 0x38
  553. - __be32
  554. - h_commit_nsec
  555. - Nanoseconds component of the above timestamp.
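The commit block table translates into the format string below. Adding up the fields in the table (12 + 1 + 1 + 2 + 32 + 8 + 4) gives 60 bytes, which the sketch checks with ``struct.calcsize``. The dictionary keys and the sample values are made up for illustration.

```python
import struct

# header, h_chksum_type, h_chksum_size, 2 pad bytes, 8 x __be32 checksum
# words, h_commit_sec, h_commit_nsec
COMMIT_FORMAT = ">3I B B 2x 8I Q I"


def parse_commit_block(block):
    """Decode the fixed part of a commit block into a dictionary."""
    fields = struct.unpack_from(COMMIT_FORMAT, block, 0)
    return {
        "magic": fields[0],
        "blocktype": fields[1],
        "sequence": fields[2],
        "chksum_type": fields[3],
        "chksum_size": fields[4],
        "chksum": fields[5:13],
        "commit_sec": fields[13],
        "commit_nsec": fields[14],
    }


sample = struct.pack(
    COMMIT_FORMAT,
    0xC03B3998, 2, 41,                # common header: commit block, tid 41
    4, 4,                             # CRC32C, 4 checksum bytes
    0xDEADBEEF, 0, 0, 0, 0, 0, 0, 0,  # h_chksum words
    1700000000, 500,                  # commit time
).ljust(1024, b"\0")
commit = parse_commit_block(sample)
print(struct.calcsize(COMMIT_FORMAT), commit["sequence"], commit["commit_sec"])
```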
  556. Fast commits
  557. ~~~~~~~~~~~~
  558. The fast commit area is organized as a log of tag-length-value (TLV) entries.
  559. Each TLV begins with a ``struct ext4_fc_tl`` that stores the tag and the length
  560. of the entire field. It is followed by a variable-length, tag-specific value.
  561. Here is the list of supported tags and their meanings:
  562. .. list-table::
  563. :widths: 8 20 20 32
  564. :header-rows: 1
  565. * - Tag
  566. - Meaning
  567. - Value struct
  568. - Description
  569. * - EXT4_FC_TAG_HEAD
  570. - Fast commit area header
  571. - ``struct ext4_fc_head``
  572. - Stores the TID of the transaction after which these fast commits should
  573. be applied.
  574. * - EXT4_FC_TAG_ADD_RANGE
  575. - Add extent to inode
  576. - ``struct ext4_fc_add_range``
  577. - Stores the inode number and extent to be added in this inode
  578. * - EXT4_FC_TAG_DEL_RANGE
  579. - Remove logical offsets from inode
  580. - ``struct ext4_fc_del_range``
  581. - Stores the inode number and the logical offset range that needs to be
  582. removed
  583. * - EXT4_FC_TAG_CREAT
  584. - Create directory entry for a newly created file
  585. - ``struct ext4_fc_dentry_info``
  586. - Stores the parent inode number, inode number and directory entry of the
  587. newly created file
  588. * - EXT4_FC_TAG_LINK
  589. - Link a directory entry to an inode
  590. - ``struct ext4_fc_dentry_info``
  591. - Stores the parent inode number, inode number and directory entry
  592. * - EXT4_FC_TAG_UNLINK
  593. - Unlink a directory entry of an inode
  594. - ``struct ext4_fc_dentry_info``
  595. - Stores the parent inode number, inode number and directory entry
  596. * - EXT4_FC_TAG_PAD
  597. - Padding (unused area)
  598. - None
  599. - Unused bytes in the fast commit area.
  600. * - EXT4_FC_TAG_TAIL
  601. - Mark the end of a fast commit
  602. - ``struct ext4_fc_tail``
  603. - Stores the TID of the commit, CRC of the fast commit of which this tag
  604. represents the end of
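A generic TLV walk over a fast commit area might look like the sketch below. Several details are assumptions because this document does not spell them out: the header is taken to be a 16-bit tag followed by a 16-bit length, both little-endian like the rest of ext4; the length is taken to count only the value bytes after the header; and the numeric tag values are illustrative placeholders, not values taken from the kernel headers.

```python
import struct

TL_FORMAT = "<HH"  # assumed layout of struct ext4_fc_tl: tag, length
TL_SIZE = struct.calcsize(TL_FORMAT)

# Placeholder tag numbers, for illustration only.
TAG_ADD_RANGE, TAG_PAD, TAG_TAIL = 1, 7, 8


def walk_tlvs(area):
    """Return [(tag, value_bytes), ...] up to and including the tail tag."""
    entries = []
    offset = 0
    while offset + TL_SIZE <= len(area):
        tag, length = struct.unpack_from(TL_FORMAT, area, offset)
        value = area[offset + TL_SIZE:offset + TL_SIZE + length]
        if len(value) < length:
            break  # truncated entry; stop rather than guess
        entries.append((tag, value))
        offset += TL_SIZE + length
        if tag == TAG_TAIL:
            break  # the tail tag marks the end of one fast commit
    return entries


area = (
    struct.pack(TL_FORMAT, TAG_ADD_RANGE, 4) + b"\x01\x02\x03\x04"
    + struct.pack(TL_FORMAT, TAG_PAD, 2) + b"\x00\x00"
    + struct.pack(TL_FORMAT, TAG_TAIL, 8) + bytes(8)
    + b"\xff" * 16  # leftover bytes after the tail are not interpreted
)
print([tag for tag, _ in walk_tlvs(area)])  # [1, 7, 8]
```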
  605. Fast Commit Replay Idempotence
  606. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  607. Fast commit tags are idempotent in nature provided the recovery code follows
  608. certain rules. The guiding principle that the commit path follows while
  609. committing is that it stores the result of a particular operation instead of
  610. storing the procedure.
  611. Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
  612. was associated with inode 10. During fast commit, instead of storing this
  613. operation as a procedure "rename a to b", we store the resulting file system
  614. state as a "series" of outcomes:
  615. - Link dirent b to inode 10
  616. - Unlink dirent a
  617. - Inode 10 with valid refcount
  618. Now when the recovery code runs, it needs to "enforce" this state on the file
  619. system. This is what guarantees idempotence of fast commit replay.
  620. Let's take an example of a procedure that is not idempotent and see how fast
  621. commits make it idempotent. Consider the following sequence of operations:
  622. 1) rm A
  623. 2) mv B A
  624. 3) read A
  625. If we store this sequence of operations as is then the replay is not idempotent.
  626. Let's say while in replay, we crash after (2). During the second replay,
  627. file A (which was actually created as a result of "mv B A" operation) would get
  628. deleted. Thus, file named A would be absent when we try to read A. So, this
  629. sequence of operations is not idempotent. However, as mentioned above, instead
  630. of storing the procedure fast commits store the outcome of each procedure. Thus
  631. the fast commit log for above procedure would be as follows:
  632. (Let's assume dirent A was linked to inode 10 and dirent B was linked to
  633. inode 11 before the replay)
  634. 1) Unlink A
  635. 2) Link A to inode 11
  636. 3) Unlink B
  637. 4) Inode 11
  638. If we crash after (3) we will have file A linked to inode 11. During the second
  639. replay, we will remove file A (inode 11). But we will create it back and make
  640. it point to inode 11. We won't find B, so we'll just skip that step. At this
  641. point, the refcount for inode 11 is not reliable, but that gets fixed by the
  642. replay of last inode 11 tag. Thus, by converting a non-idempotent procedure
  643. into a series of idempotent outcomes, fast commits ensured idempotence during
  644. the replay.
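The argument above can be checked with a toy model in which a directory is a dictionary from names to inode numbers and the log holds outcomes ("link", "unlink") rather than procedures. This is only an illustration: the inode tag that repairs the refcount is left out, and the tuple format of the log is invented for the example.

```python
def replay(dirents, log, stop=None):
    """Apply the first `stop` log entries (all of them if None) to `dirents`."""
    for op, name, inode in log[:stop]:
        if op == "unlink":
            dirents.pop(name, None)  # a missing dirent is simply skipped
        elif op == "link":
            dirents[name] = inode    # enforce the outcome, whatever was there
    return dirents


before = {"A": 10, "B": 11}
log = [("unlink", "A", None), ("link", "A", 11), ("unlink", "B", None)]

clean = replay(dict(before), log)

# Crash after every possible prefix of the first replay, then replay the
# whole log again from the start.
after_crash = []
for crash_point in range(len(log) + 1):
    partial = replay(dict(before), log, stop=crash_point)
    after_crash.append(replay(partial, log))

print(clean, all(state == clean for state in after_crash))  # {'A': 11} True
```

Whatever point the first replay reached, the second replay converges on the same directory state.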
  645. Journal Checkpoint
  646. ~~~~~~~~~~~~~~~~~~
  647. Checkpointing the journal ensures all transactions and their associated buffers
  648. are submitted to the disk. In-progress transactions are waited upon and included
  649. in the checkpoint. Checkpointing is used internally during critical updates to
  650. the filesystem including journal recovery, filesystem resizing, and freeing of
  651. the journal_t structure.
  652. A journal checkpoint can be triggered from userspace via the ioctl
  653. EXT4_IOC_CHECKPOINT. This ioctl takes a single, u64 argument for flags.
  654. Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
  655. can be used to verify input to the ioctl. It returns an error if there is any
  656. invalid input, otherwise it returns success without performing
  657. any checkpointing. This can be used to check whether the ioctl exists on a
  658. system and to verify there are no issues with arguments or flags. The
  659. other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
  660. EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
  661. discarded or zero-filled, respectively, after the journal checkpoint is
  662. complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
  663. cannot both be set. The ioctl may be useful when snapshotting a system or for
  664. complying with content deletion SLOs.
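The flag rules described above can be modelled without issuing the ioctl at all. In the sketch below the numeric flag values are assumptions (this text does not give them, so they should be checked against the ext4 UAPI header before real use), and the function only mirrors the validation rules stated in this section: unknown bits are rejected, and DISCARD and ZEROOUT cannot be combined.

```python
# Assumed numeric values, for illustration only.
FLAG_DISCARD = 0x1
FLAG_ZEROOUT = 0x2
FLAG_DRY_RUN = 0x4
FLAG_VALID = FLAG_DISCARD | FLAG_ZEROOUT | FLAG_DRY_RUN


def check_checkpoint_flags(flags):
    """Return None if `flags` is acceptable, else a short reason string."""
    if flags & ~FLAG_VALID:
        return "unknown flag bits set"
    if flags & FLAG_DISCARD and flags & FLAG_ZEROOUT:
        return "DISCARD and ZEROOUT cannot both be set"
    return None


print(check_checkpoint_flags(FLAG_DRY_RUN | FLAG_DISCARD))  # None
print(check_checkpoint_flags(FLAG_DISCARD | FLAG_ZEROOUT))
```

A caller that wants to probe for support would pass the dry-run flag first and only then repeat the call without it.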