.. SPDX-License-Identifier: GPL-2.0

Journal (jbd2)
--------------

The ext4 filesystem employs a journal, a feature introduced in ext3, to
protect the filesystem against corruption in the case of a system crash.
A small contiguous region of disk (default 128MiB) is reserved inside the
filesystem as a place to land “important” data writes on-disk as quickly
as possible. Once the important data transaction is fully written to the
disk and flushed from the disk write cache, a record of the data being
committed is also written to the journal. At some later point in time,
the journal code writes the transactions to their final locations on
disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
gets written through the journal to the disk. The effect of this is to
guarantee that the filesystem does not become stuck midway through a
metadata update.

For performance reasons, ext4 by default only writes filesystem metadata
through the journal. This means that file data blocks are *not*
guaranteed to be in any consistent state after a crash. If this default
guarantee level (``data=ordered``) is not satisfactory, there is a mount
option to control journal behavior. If ``data=journal``, all data and
metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
consumes an entire block group, though mke2fs tries to put it in the
middle of the disk.

All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4.

NOTE: Both ext4 and ocfs2 use jbd2.

The maximum size of a journal embedded in an ext4 filesystem is 2^32
blocks. jbd2 itself does not seem to care.

Layout
~~~~~~

Generally speaking, the journal has this format:

.. list-table::
   :widths: 16 48 16
   :header-rows: 1

   * - Superblock
     - descriptor\_block (data\_blocks or revocation\_block) [more data or
       revocations] commit\_block
     - [more transactions...]
   * -
     - One transaction
     -

Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
commit. If there is no commit record (or the checksums don't match), the
transaction will be discarded during replay.

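The commit rule above can be sketched as a small scan. This is only an
illustration, not the kernel's recovery code: it assumes the block headers
have already been decoded into a hypothetical list of
``(blocktype, sequence)`` pairs in log order (1 = descriptor, 2 = commit,
5 = revocation), and it ignores checksums, data blocks, and log wraparound.

```python
DESCRIPTOR, COMMIT, REVOKE = 1, 2, 5  # jbd2 block type codes


def committed_transactions(headers):
    """Return sequence numbers of transactions that end with a commit.

    `headers` is a hypothetical, pre-decoded list of
    (blocktype, sequence) pairs in log order.
    """
    committed = []
    current = None  # sequence number of the transaction being read
    for blocktype, seq in headers:
        if current is None:
            current = seq
        if seq != current:
            break  # unexpected sequence number: stop scanning
        if blocktype == COMMIT:
            committed.append(seq)
            current = None  # the next block starts a new transaction
    return committed


# Transaction 7 is complete; transaction 8 has no commit and is dropped.
log = [(DESCRIPTOR, 7), (REVOKE, 7), (COMMIT, 7), (DESCRIPTOR, 8)]
print(committed_transactions(log))
```

In this sketch a transaction without a trailing commit block never reaches
the result list, which mirrors the "discarded during replay" behaviour
described above.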
External Journal
~~~~~~~~~~~~~~~~

Optionally, an ext4 filesystem can be created with an external journal
device (as opposed to an internal journal, which uses a reserved inode).
In this case, on the filesystem device, ``s_journal_inum`` should be
zero and ``s_journal_uuid`` should be set. On the journal device there
will be an ext4 super block in the usual place, with a matching UUID.
The journal superblock will be in the next full block after the
superblock.

.. list-table::
   :widths: 12 12 12 32 12
   :header-rows: 1

   * - 1024 bytes of padding
     - ext4 Superblock
     - Journal Superblock
     - descriptor\_block (data\_blocks or revocation\_block) [more data or
       revocations] commit\_block
     - [more transactions...]
   * -
     -
     -
     - One transaction
     -

Block Header
~~~~~~~~~~~~

Every block in the journal starts with a common 12-byte header
``struct journal_header_s``:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - h\_magic
     - jbd2 magic number, 0xC03B3998.
   * - 0x4
     - \_\_be32
     - h\_blocktype
     - Description of what this block contains. See the jbd2_blocktype_ table
       below.
   * - 0x8
     - \_\_be32
     - h\_sequence
     - The transaction ID that goes with this block.

.. _jbd2_blocktype:

The journal block type can be any one of:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - Descriptor. This block precedes a series of data blocks that were
       written through the journal during a transaction.
   * - 2
     - Block commit record. This block signifies the completion of a
       transaction.
   * - 3
     - Journal superblock, v1.
   * - 4
     - Journal superblock, v2.
   * - 5
     - Block revocation records. This speeds up recovery by enabling the
       journal to skip writing blocks that were subsequently rewritten.

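As a rough illustration of the 12-byte header and the block type codes, the
following sketch decodes the start of a journal block with Python's
``struct`` module. The function name and the short type labels are made up
for this example; only the offsets, the magic number, and the type values
come from the tables above.

```python
import struct

JBD2_MAGIC = 0xC03B3998
BLOCK_TYPES = {
    1: "descriptor",
    2: "commit",
    3: "superblock v1",
    4: "superblock v2",
    5: "revoke",
}


def parse_block_header(block):
    """Decode h_magic, h_blocktype and h_sequence (all big-endian)."""
    magic, blocktype, sequence = struct.unpack_from(">III", block, 0)
    if magic != JBD2_MAGIC:
        return None  # not a journal metadata block
    return BLOCK_TYPES.get(blocktype, "unknown"), sequence


# Build a fake 4096-byte commit block for transaction 41.
sample = struct.pack(">III", JBD2_MAGIC, 2, 41) + bytes(4084)
print(parse_block_header(sample))
```

A block whose first four bytes are not the magic number is reported as
``None`` here; in a real journal such a block would be a data block.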
Super Block
~~~~~~~~~~~

The super block for the journal is much simpler than ext4's. The key
data kept within are the size of the journal and where to find the
start of the log of transactions.

The journal superblock is recorded as ``struct journal_superblock_s``,
which is 1024 bytes long:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * -
     -
     -
     - Static information describing the journal.
   * - 0x0
     - journal\_header\_t (12 bytes)
     - s\_header
     - Common header identifying this as a superblock.
   * - 0xC
     - \_\_be32
     - s\_blocksize
     - Journal device block size.
   * - 0x10
     - \_\_be32
     - s\_maxlen
     - Total number of blocks in this journal.
   * - 0x14
     - \_\_be32
     - s\_first
     - First block of log information.
   * -
     -
     -
     - Dynamic information describing the current state of the log.
   * - 0x18
     - \_\_be32
     - s\_sequence
     - First commit ID expected in log.
   * - 0x1C
     - \_\_be32
     - s\_start
     - Block number of the start of log. Contrary to the comments, this field
       being zero does not imply that the journal is clean!
   * - 0x20
     - \_\_be32
     - s\_errno
     - Error value, as set by jbd2\_journal\_abort().
   * -
     -
     -
     - The remaining fields are only valid in a v2 superblock.
   * - 0x24
     - \_\_be32
     - s\_feature\_compat
     - Compatible feature set. See the table jbd2_compat_ below.
   * - 0x28
     - \_\_be32
     - s\_feature\_incompat
     - Incompatible feature set. See the table jbd2_incompat_ below.
   * - 0x2C
     - \_\_be32
     - s\_feature\_ro\_compat
     - Read-only compatible feature set. There aren't any of these currently.
   * - 0x30
     - \_\_u8
     - s\_uuid[16]
     - 128-bit uuid for journal. This is compared against the copy in the ext4
       super block at mount time.
   * - 0x40
     - \_\_be32
     - s\_nr\_users
     - Number of file systems sharing this journal.
   * - 0x44
     - \_\_be32
     - s\_dynsuper
     - Location of dynamic super block copy. (Not used?)
   * - 0x48
     - \_\_be32
     - s\_max\_transaction
     - Limit of journal blocks per transaction. (Not used?)
   * - 0x4C
     - \_\_be32
     - s\_max\_trans\_data
     - Limit of data blocks per transaction. (Not used?)
   * - 0x50
     - \_\_u8
     - s\_checksum\_type
     - Checksum algorithm used for the journal. See jbd2_checksum_type_ for
       more info.
   * - 0x51
     - \_\_u8[3]
     - s\_padding2
     -
   * - 0x54
     - \_\_u32
     - s\_padding[42]
     -
   * - 0xFC
     - \_\_be32
     - s\_checksum
     - Checksum of the entire superblock, with this field set to zero.
   * - 0x100
     - \_\_u8
     - s\_users[16\*48]
     - ids of all file systems sharing the log. e2fsprogs/Linux don't allow
       shared external journals, but I imagine Lustre (or ocfs2?), which use
       the jbd2 code, might.

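To check that the offsets in the table add up to 1024 bytes, the layout can
be written as a single ``struct`` format string. This is a sketch for
illustration: the format string, the ``JournalSuperblock`` tuple, and the
sample values (a 128MiB journal of 4096-byte blocks) are assumptions of this
example, while the field names and order follow the table.

```python
import struct
from collections import namedtuple

# ">" selects big-endian with no implicit padding; "x" bytes are skipped.
# header | static | dynamic | features | uuid | v2 fields | type | pad | csum | users
SB_FORMAT = ">3I 3I 3I 3I 16s 4I B 3x 168x I 768s"

JournalSuperblock = namedtuple("JournalSuperblock", [
    "h_magic", "h_blocktype", "h_sequence",
    "s_blocksize", "s_maxlen", "s_first",
    "s_sequence", "s_start", "s_errno",
    "s_feature_compat", "s_feature_incompat", "s_feature_ro_compat",
    "s_uuid", "s_nr_users", "s_dynsuper", "s_max_transaction",
    "s_max_trans_data", "s_checksum_type", "s_checksum", "s_users",
])


def parse_superblock(raw):
    """Decode the first 1024 bytes of a journal superblock."""
    return JournalSuperblock._make(struct.unpack_from(SB_FORMAT, raw, 0))


# A made-up v2 superblock: 32768 blocks of 4096 bytes, first log block 1.
sample = struct.pack(SB_FORMAT, 0xC03B3998, 4, 0, 4096, 32768, 1, 5, 0, 0,
                     0, 0x12, 0, bytes(16), 1, 0, 0, 0, 4, 0, bytes(768))
sb = parse_superblock(sample)
print(struct.calcsize(SB_FORMAT), sb.s_maxlen * sb.s_blocksize // 2**20)
```

The padding fields are skipped rather than returned, so the tuple carries
only the fields that have a description in the table.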
.. _jbd2_compat:

The journal compat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal maintains checksums on the data blocks.
       (JBD2\_FEATURE\_COMPAT\_CHECKSUM)

.. _jbd2_incompat:

The journal incompat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal has block revocation records. (JBD2\_FEATURE\_INCOMPAT\_REVOKE)
   * - 0x2
     - Journal can deal with 64-bit block numbers.
       (JBD2\_FEATURE\_INCOMPAT\_64BIT)
   * - 0x4
     - Journal commits asynchronously. (JBD2\_FEATURE\_INCOMPAT\_ASYNC\_COMMIT)
   * - 0x8
     - This journal uses v2 of the checksum on-disk format. Each journal
       metadata block gets its own checksum, and the block tags in the
       descriptor table contain checksums for each of the data blocks in the
       journal. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2)
   * - 0x10
     - This journal uses v3 of the checksum on-disk format. This is the same as
       v2, but the journal block tag size is fixed regardless of the size of
       block numbers. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3)

.. _jbd2_checksum_type:

Journal checksum type codes are one of the following. crc32 or crc32c are the
most likely choices.

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - CRC32
   * - 2
     - MD5
   * - 3
     - SHA1
   * - 4
     - CRC32C

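The feature fields are plain bit masks, so decoding them is a matter of
testing each bit. The sketch below uses shortened labels (for example
``64BIT`` for JBD2\_FEATURE\_INCOMPAT\_64BIT) and a hypothetical helper
name; the bit values and checksum codes are taken from the tables above.

```python
INCOMPAT_FEATURES = {
    0x1: "REVOKE",
    0x2: "64BIT",
    0x4: "ASYNC_COMMIT",
    0x8: "CSUM_V2",
    0x10: "CSUM_V3",
}
CHECKSUM_TYPES = {1: "CRC32", 2: "MD5", 3: "SHA1", 4: "CRC32C"}


def decode_incompat(mask):
    """Split s_feature_incompat into known names and leftover bits."""
    names = [name for bit, name in sorted(INCOMPAT_FEATURES.items())
             if mask & bit]
    known = 0
    for bit in INCOMPAT_FEATURES:
        known |= bit
    return names, mask & ~known


names, unknown = decode_incompat(0x12)
print(names, hex(unknown), CHECKSUM_TYPES.get(4, "unknown"))
```

Returning the leftover bits separately lets a caller decide what to do with
a journal that advertises a feature this table does not list.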
Descriptor Block
~~~~~~~~~~~~~~~~

The descriptor block contains an array of journal block tags that
describe the final locations of the data blocks that follow in the
journal. Descriptor blocks are open-coded instead of being completely
described by a data structure, but here is the block structure anyway.
Descriptor blocks consume at least 36 bytes, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_t
     - (open coded)
     - Common block header.
   * - 0xC
     - struct journal\_block\_tag\_s
     - open coded array[]
     - Enough tags either to fill up the block or to describe all the data
       blocks that follow this descriptor block.

Journal block tags have any of the following formats, depending on which
journal feature and block tag flags are set.

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the journal block tag is
defined as ``struct journal_block_tag3_s``, which looks like the
following. The size is 16 or 32 bytes.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - \_\_be32
     - t\_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * - 0x8
     - \_\_be32
     - t\_blocknr\_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk. This is zero if JBD2\_FEATURE\_INCOMPAT\_64BIT is
       not enabled.
   * - 0xC
     - \_\_be32
     - t\_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_checksum. This field is not present if the "same UUID" flag
       is set.
   * - 0x10
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

.. _jbd2_tag_flags:

The journal tag flags are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - On-disk block is escaped. The first four bytes of the data block just
       happened to match the jbd2 magic number.
   * - 0x2
     - This block has the same UUID as previous, therefore the UUID field is
       omitted.
   * - 0x4
     - The data block was deleted by the transaction. (Not used?)
   * - 0x8
     - This is the last tag in this descriptor block.

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is NOT set, the journal block tag
is defined as ``struct journal_block_tag_s``, which looks like the
following. The size is 8, 12, 24, or 28 bytes:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - \_\_be16
     - t\_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
       Note that only the lower 16 bits are stored.
   * - 0x6
     - \_\_be16
     - t\_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * -
     -
     -
     - This next field is only present if the super block indicates support for
       64-bit block numbers.
   * - 0x8
     - \_\_be32
     - t\_blocknr\_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_flags or t_blocknr_high. This field is not present if the
       "same UUID" flag is set.
   * - 0x8 or 0xC
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

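The tag sizes quoted for the two formats follow from three inputs: whether
CSUM\_V3 is set, whether 64-bit block numbers are enabled, and whether the
"same UUID" flag (0x2) is set on the tag. The helper below is a sketch that
reproduces the sizes stated in this document; it is not a transcription of
the kernel's own size calculation, and the function name is hypothetical.

```python
FLAG_SAME_UUID = 0x2  # UUID field omitted from this tag


def tag_size(csum_v3, has_64bit, t_flags):
    """Return the on-disk size in bytes of one journal block tag."""
    if csum_v3:
        size = 16  # fixed-size tag: blocknr, flags, blocknr_high, checksum
    else:
        size = 12 if has_64bit else 8  # t_blocknr_high is optional
    if not t_flags & FLAG_SAME_UUID:
        size += 16  # trailing uuid[16]
    return size


for v3 in (True, False):
    for wide in (False, True):
        print(v3, wide, tag_size(v3, wide, 0), tag_size(v3, wide, FLAG_SAME_UUID))
```

With CSUM\_V3 the result is always 16 or 32 bytes; without it the four
combinations give 8, 12, 24, and 28 bytes, matching the text above.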
If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the end of the block is a
``struct jbd2_journal_block_tail``, which looks like this:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_checksum
     - Checksum of the journal UUID + the descriptor block, with this field set
       to zero.

Data Block
~~~~~~~~~~

In general, the data blocks being written to disk through the journal
are written verbatim into the journal file after the descriptor block.
However, if the first four bytes of the block match the jbd2 magic
number then those four bytes are replaced with zeroes and the “escaped”
flag is set in the descriptor block tag.

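The escaping rule is small enough to sketch in a few lines. The function
names are invented for this example, and the reverse step (putting the magic
number back when the block is replayed) is the natural counterpart of the
rule rather than something spelled out above.

```python
import struct

JBD2_MAGIC_BYTES = struct.pack(">I", 0xC03B3998)  # big-endian on disk
FLAG_ESCAPED = 0x1


def escape_block(data):
    """Return (journal copy, tag flags) for a data block."""
    if data[:4] == JBD2_MAGIC_BYTES:
        return bytes(4) + data[4:], FLAG_ESCAPED
    return data, 0


def unescape_block(data, flags):
    """Undo escape_block() using the flags stored in the block tag."""
    if flags & FLAG_ESCAPED:
        return JBD2_MAGIC_BYTES + data[4:]
    return data


original = JBD2_MAGIC_BYTES + b"payload that happens to start with the magic"
stored, flags = escape_block(original)
print(stored[:4].hex(), flags, unescape_block(stored, flags) == original)
```

Without this step, a data block that starts with the magic number would be
indistinguishable from a journal metadata block when the log is scanned.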
Revocation Block
~~~~~~~~~~~~~~~~

A revocation block is used to prevent replay of a block in an earlier
transaction. This is used to mark blocks that were journalled at one
time but are no longer journalled. Typically this happens if a metadata
block is freed and re-allocated as a file data block; in this case, a
journal replay after the file block was written to disk will cause
corruption.

**NOTE**: This mechanism is NOT used to express “this journal block is
superseded by this other journal block”, as the author (djwong)
mistakenly thought. Any block being added to a transaction will cause
the removal of all existing revocation records for that block.

Revocation blocks are described in
``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
length, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_t
     - r\_header
     - Common block header.
   * - 0xC
     - \_\_be32
     - r\_count
     - Number of bytes used in this block.
   * - 0x10
     - \_\_be32 or \_\_be64
     - blocks[0]
     - Blocks to revoke.

After r\_count is a linear array of block numbers that are effectively
revoked by this transaction. The size of each block number is 8 bytes if
the superblock advertises 64-bit block number support, or 4 bytes
otherwise.

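Reading the revoked block numbers back out is a simple loop over the array.
This sketch assumes that ``r_count`` is measured from the start of the block
and therefore includes the 16-byte header; the function name and the sample
block are made up for illustration.

```python
import struct


def revoked_blocks(block, has_64bit):
    """Return the block numbers listed in one revocation block."""
    (r_count,) = struct.unpack_from(">I", block, 0x0C)
    entry = ">Q" if has_64bit else ">I"
    width = struct.calcsize(entry)  # 8 or 4 bytes per block number
    # Assumption: r_count counts bytes from offset 0, header included.
    return [struct.unpack_from(entry, block, offset)[0]
            for offset in range(0x10, r_count - width + 1, width)]


# A fake revocation block for transaction 9 that revokes three blocks.
header = struct.pack(">IIII", 0xC03B3998, 5, 9, 16 + 3 * 4)
sample = (header + struct.pack(">III", 100, 200, 300)).ljust(4096, b"\0")
print(revoked_blocks(sample, has_64bit=False))
```

The loop bound stops before any entry that would extend past ``r_count``,
so a block with ``r_count`` equal to 16 yields an empty list.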
If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the end of the revocation
block is a ``struct jbd2_journal_revoke_tail``, which has this format:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - r\_checksum
     - Checksum of the journal UUID + revocation block.

Commit Block
~~~~~~~~~~~~

The commit block is a sentry that indicates that a transaction has been
completely written to the journal. Once this commit block reaches the
journal, the data stored with this transaction can be written to their
final locations on disk.

The commit block is described by ``struct commit_header``, which is 60
bytes long according to the offsets below (but uses a full block):

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_s
     - (open coded)
     - Common block header.
   * - 0xC
     - unsigned char
     - h\_chksum\_type
     - The type of checksum to use to verify the integrity of the data blocks
       in the transaction. See jbd2_checksum_type_ for more info.
   * - 0xD
     - unsigned char
     - h\_chksum\_size
     - The number of bytes used by the checksum. Most likely 4.
   * - 0xE
     - unsigned char
     - h\_padding[2]
     -
   * - 0x10
     - \_\_be32
     - h\_chksum[JBD2\_CHECKSUM\_BYTES]
     - 32 bytes of space to store checksums. If
       JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3
       is set, the first ``__be32`` is the checksum of the journal UUID and
       the entire commit block, with this field zeroed. If
       JBD2\_FEATURE\_COMPAT\_CHECKSUM is set, the first ``__be32`` is the
       crc32 of all the blocks already written to the transaction.
   * - 0x30
     - \_\_be64
     - h\_commit\_sec
     - The time that the transaction was committed, in seconds since the epoch.
   * - 0x38
     - \_\_be32
     - h\_commit\_nsec
     - Nanoseconds component of the above timestamp.
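
As with the other block types, the commit block fields can be pulled out at
fixed offsets. The following sketch uses a hypothetical function name and a
fabricated sample block; it reads only the fields listed in the table and
does not verify any checksum.

```python
import struct


def parse_commit_block(block):
    """Decode the checksum info and timestamp of a commit block."""
    chksum_type, chksum_size = struct.unpack_from(">BB", block, 0x0C)
    (first_chksum,) = struct.unpack_from(">I", block, 0x10)
    commit_sec, commit_nsec = struct.unpack_from(">QI", block, 0x30)
    return {
        "h_chksum_type": chksum_type,
        "h_chksum_size": chksum_size,
        "h_chksum0": first_chksum,  # first __be32 of h_chksum[]
        "h_commit_sec": commit_sec,
        "h_commit_nsec": commit_nsec,
    }


# Fabricate a commit block for transaction 41 with a CRC32C-style checksum.
block = bytearray(4096)
struct.pack_into(">III", block, 0x00, 0xC03B3998, 2, 41)
struct.pack_into(">BB", block, 0x0C, 4, 4)
struct.pack_into(">I", block, 0x10, 0xDEADBEEF)
struct.pack_into(">QI", block, 0x30, 1700000000, 500)
info = parse_commit_block(bytes(block))
print(info)
```

The last field ends at offset 0x3C, so the structure described by the table
occupies the first 60 bytes of the block.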