  1. .. SPDX-License-Identifier: GPL-2.0
  2. Index Nodes
  3. -----------
  4. In a regular UNIX filesystem, the inode stores all the metadata
  5. pertaining to the file (time stamps, block maps, extended attributes,
  6. etc.), not the directory entry. To find the information associated with a
  7. file, one must traverse the directory files to find the directory entry
  8. associated with a file, then load the inode to find the metadata for
  9. that file. ext4 appears to cheat (for performance reasons) a little bit
  10. by storing a copy of the file type (normally stored in the inode) in the
  11. directory entry. (Compare all this to FAT, which stores all the file
  12. information directly in the directory entry, but does not support hard
  13. links and is in general more seek-happy than ext4 due to its simpler
  14. block allocator and extensive use of linked lists.)
  15. The inode table is a linear array of ``struct ext4_inode``. The table is
  16. sized to have enough blocks to store at least
  17. ``sb.s_inode_size * sb.s_inodes_per_group`` bytes. The number of the
  18. block group containing an inode can be calculated as
  19. ``(inode_number - 1) / sb.s_inodes_per_group``, and the offset into the
  20. group's table is ``(inode_number - 1) % sb.s_inodes_per_group``. There
  21. is no inode 0.
  22. The inode checksum is calculated against the FS UUID, the inode number,
  23. and the inode structure itself.
  24. The inode table entry is laid out in ``struct ext4_inode``.
  25. .. list-table::
  26. :widths: 8 8 24 40
  27. :header-rows: 1
  28. :class: longtable
  29. * - Offset
  30. - Size
  31. - Name
  32. - Description
  33. * - 0x0
  34. - \_\_le16
  35. - i\_mode
  36. - File mode. See the table i_mode_ below.
  37. * - 0x2
  38. - \_\_le16
  39. - i\_uid
  40. - Lower 16-bits of Owner UID.
  41. * - 0x4
  42. - \_\_le32
  43. - i\_size\_lo
  44. - Lower 32-bits of size in bytes.
  45. * - 0x8
  46. - \_\_le32
  47. - i\_atime
  48. - Last access time, in seconds since the epoch. However, if the EA\_INODE
  49. inode flag is set, this inode stores an extended attribute value and
  50. this field contains the checksum of the value.
  51. * - 0xC
  52. - \_\_le32
  53. - i\_ctime
  54. - Last inode change time, in seconds since the epoch. However, if the
  55. EA\_INODE inode flag is set, this inode stores an extended attribute
  56. value and this field contains the lower 32 bits of the attribute value's
  57. reference count.
  58. * - 0x10
  59. - \_\_le32
  60. - i\_mtime
  61. - Last data modification time, in seconds since the epoch. However, if the
  62. EA\_INODE inode flag is set, this inode stores an extended attribute
  63. value and this field contains the number of the inode that owns the
  64. extended attribute.
  65. * - 0x14
  66. - \_\_le32
  67. - i\_dtime
  68. - Deletion Time, in seconds since the epoch.
  69. * - 0x18
  70. - \_\_le16
  71. - i\_gid
  72. - Lower 16-bits of GID.
  73. * - 0x1A
  74. - \_\_le16
  75. - i\_links\_count
  76. - Hard link count. Normally, ext4 does not permit an inode to have more
  77. than 65,000 hard links. This applies to files as well as directories,
  78. which means that there cannot be more than 64,998 subdirectories in a
  79. directory (each subdirectory's '..' entry counts as a hard link, as does
  80. the '.' entry in the directory itself). With the DIR\_NLINK feature
  81. enabled, ext4 supports more than 64,998 subdirectories by setting this
  82. field to 1 to indicate that the number of hard links is not known.
  83. * - 0x1C
  84. - \_\_le32
  85. - i\_blocks\_lo
  86. - Lower 32-bits of “block” count. If the huge\_file feature flag is not
  87. set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks
  88. on disk. If huge\_file is set and EXT4\_HUGE\_FILE\_FL is NOT set in
  89. ``inode.i_flags``, then the file consumes ``i_blocks_lo + (i_blocks_hi
  90. << 32)`` 512-byte blocks on disk. If huge\_file is set and
  91. EXT4\_HUGE\_FILE\_FL IS set in ``inode.i_flags``, then this file
  92. consumes ``i_blocks_lo + (i_blocks_hi << 32)`` filesystem blocks on
  93. disk.
  94. * - 0x20
  95. - \_\_le32
  96. - i\_flags
  97. - Inode flags. See the table i_flags_ below.
  98. * - 0x24
  99. - 4 bytes
  100. - i\_osd1
  101. - See the table i_osd1_ for more details.
  102. * - 0x28
  103. - 60 bytes
  104. - i\_block[EXT4\_N\_BLOCKS=15]
  105. - Block map or extent tree. See the section “The Contents of inode.i\_block”.
  106. * - 0x64
  107. - \_\_le32
  108. - i\_generation
  109. - File version (for NFS).
  110. * - 0x68
  111. - \_\_le32
  112. - i\_file\_acl\_lo
  113. - Lower 32-bits of extended attribute block. ACLs are of course one of
  114. many possible extended attributes; I think the name of this field is a
  115. result of the first use of extended attributes being for ACLs.
  116. * - 0x6C
  117. - \_\_le32
  118. - i\_size\_high / i\_dir\_acl
  119. - Upper 32-bits of file/directory size. In ext2/3 this field was named
  120. i\_dir\_acl, though it was usually set to zero and never used.
  121. * - 0x70
  122. - \_\_le32
  123. - i\_obso\_faddr
  124. - (Obsolete) fragment address.
  125. * - 0x74
  126. - 12 bytes
  127. - i\_osd2
  128. - See the table i_osd2_ for more details.
  129. * - 0x80
  130. - \_\_le16
  131. - i\_extra\_isize
  132. - Size of this inode - 128. Alternatively, the size of the extended inode
  133. fields beyond the original ext2 inode, including this field.
  134. * - 0x82
  135. - \_\_le16
  136. - i\_checksum\_hi
  137. - Upper 16-bits of the inode checksum.
  138. * - 0x84
  139. - \_\_le32
  140. - i\_ctime\_extra
  141. - Extra change time bits. This provides sub-second precision. See Inode
  142. Timestamps section.
  143. * - 0x88
  144. - \_\_le32
  145. - i\_mtime\_extra
  146. - Extra modification time bits. This provides sub-second precision.
  147. * - 0x8C
  148. - \_\_le32
  149. - i\_atime\_extra
  150. - Extra access time bits. This provides sub-second precision.
  151. * - 0x90
  152. - \_\_le32
  153. - i\_crtime
  154. - File creation time, in seconds since the epoch.
  155. * - 0x94
  156. - \_\_le32
  157. - i\_crtime\_extra
  158. - Extra file creation time bits. This provides sub-second precision.
  159. * - 0x98
  160. - \_\_le32
  161. - i\_version\_hi
  162. - Upper 32-bits for version number.
  163. * - 0x9C
  164. - \_\_le32
  165. - i\_projid
  166. - Project ID.
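
As an illustration of the layout in the table above, the following is a minimal sketch that unpacks the first 128 bytes of an inode record with Python's ``struct`` module. The names ``CORE_FORMAT``, ``CoreInode``, ``parse_core_inode`` and ``file_size`` are hypothetical helpers made up for this example; the OS-dependent ``i_osd1`` and ``i_osd2`` areas and ``i_block`` are left as raw bytes.

```python
import struct
from collections import namedtuple

# Field order and sizes follow the table above (offsets 0x0 through 0x7F).
# "<" selects little-endian with no padding, matching the __le16/__le32 types.
CORE_FORMAT = "<HHIIIIIHHII4s60sIIII12s"

CoreInode = namedtuple("CoreInode", [
    "i_mode", "i_uid", "i_size_lo", "i_atime", "i_ctime", "i_mtime",
    "i_dtime", "i_gid", "i_links_count", "i_blocks_lo", "i_flags",
    "i_osd1", "i_block", "i_generation", "i_file_acl_lo", "i_size_high",
    "i_obso_faddr", "i_osd2",
])


def parse_core_inode(raw):
    """Unpack the first 128 bytes of an on-disk inode record."""
    if len(raw) < struct.calcsize(CORE_FORMAT):
        raise ValueError("inode record shorter than 128 bytes")
    return CoreInode._make(struct.unpack_from(CORE_FORMAT, raw, 0))


def file_size(inode):
    """Combine the lower and upper 32 bits of the size in bytes."""
    return (inode.i_size_high << 32) | inode.i_size_lo
```

Fields at offset 0x80 and beyond are deliberately not parsed here, since their presence depends on ``i_extra_isize``.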
  167. .. _i_mode:
  168. The ``i_mode`` value is a combination of the following flags:
  169. .. list-table::
  170. :widths: 16 64
  171. :header-rows: 1
  172. * - Value
  173. - Description
  174. * - 0x1
  175. - S\_IXOTH (Others may execute)
  176. * - 0x2
  177. - S\_IWOTH (Others may write)
  178. * - 0x4
  179. - S\_IROTH (Others may read)
  180. * - 0x8
  181. - S\_IXGRP (Group members may execute)
  182. * - 0x10
  183. - S\_IWGRP (Group members may write)
  184. * - 0x20
  185. - S\_IRGRP (Group members may read)
  186. * - 0x40
  187. - S\_IXUSR (Owner may execute)
  188. * - 0x80
  189. - S\_IWUSR (Owner may write)
  190. * - 0x100
  191. - S\_IRUSR (Owner may read)
  192. * - 0x200
  193. - S\_ISVTX (Sticky bit)
  194. * - 0x400
  195. - S\_ISGID (Set GID)
  196. * - 0x800
  197. - S\_ISUID (Set UID)
  198. * -
  199. - These are mutually-exclusive file types:
  200. * - 0x1000
  201. - S\_IFIFO (FIFO)
  202. * - 0x2000
  203. - S\_IFCHR (Character device)
  204. * - 0x4000
  205. - S\_IFDIR (Directory)
  206. * - 0x6000
  207. - S\_IFBLK (Block device)
  208. * - 0x8000
  209. - S\_IFREG (Regular file)
  210. * - 0xA000
  211. - S\_IFLNK (Symbolic link)
  212. * - 0xC000
  213. - S\_IFSOCK (Socket)
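
The split between permission bits and the mutually exclusive file types can be sketched as follows. This is only an illustration: ``split_mode`` and ``FILE_TYPES`` are hypothetical names, and the sketch assumes the file type occupies the top four bits (mask 0xF000) and the permission, sticky, set-GID and set-UID bits occupy the low twelve bits, as the values in the table suggest.

```python
# File type values taken from the table above.
FILE_TYPES = {
    0x1000: "FIFO",
    0x2000: "character device",
    0x4000: "directory",
    0x6000: "block device",
    0x8000: "regular file",
    0xA000: "symbolic link",
    0xC000: "socket",
}


def split_mode(i_mode):
    """Return (file type name, low 12 permission bits) for an i_mode value."""
    file_type = FILE_TYPES.get(i_mode & 0xF000, "unknown")
    permissions = i_mode & 0x0FFF
    return file_type, permissions
```

For example, ``split_mode(0x81A4)`` yields a regular file with permission bits ``0o644``.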
  214. .. _i_flags:
  215. The ``i_flags`` field is a combination of these values:
  216. .. list-table::
  217. :widths: 16 64
  218. :header-rows: 1
  219. * - Value
  220. - Description
  221. * - 0x1
  222. - This file requires secure deletion (EXT4\_SECRM\_FL). (not implemented)
  223. * - 0x2
  224. - This file should be preserved, should undeletion be desired
  225. (EXT4\_UNRM\_FL). (not implemented)
  226. * - 0x4
  227. - File is compressed (EXT4\_COMPR\_FL). (not really implemented)
  228. * - 0x8
  229. - All writes to the file must be synchronous (EXT4\_SYNC\_FL).
  230. * - 0x10
  231. - File is immutable (EXT4\_IMMUTABLE\_FL).
  232. * - 0x20
  233. - File can only be appended (EXT4\_APPEND\_FL).
  234. * - 0x40
  235. - The dump(1) utility should not dump this file (EXT4\_NODUMP\_FL).
  236. * - 0x80
  237. - Do not update access time (EXT4\_NOATIME\_FL).
  238. * - 0x100
  239. - Dirty compressed file (EXT4\_DIRTY\_FL). (not used)
  240. * - 0x200
  241. - File has one or more compressed clusters (EXT4\_COMPRBLK\_FL). (not used)
  242. * - 0x400
  243. - Do not compress file (EXT4\_NOCOMPR\_FL). (not used)
  244. * - 0x800
  245. - Encrypted inode (EXT4\_ENCRYPT\_FL). This bit value previously was
  246. EXT4\_ECOMPR\_FL (compression error), which was never used.
  247. * - 0x1000
  248. - Directory has hashed indexes (EXT4\_INDEX\_FL).
  249. * - 0x2000
  250. - AFS magic directory (EXT4\_IMAGIC\_FL).
  251. * - 0x4000
  252. - File data must always be written through the journal
  253. (EXT4\_JOURNAL\_DATA\_FL).
  254. * - 0x8000
  255. - File tail should not be merged (EXT4\_NOTAIL\_FL). (not used by ext4)
  256. * - 0x10000
  257. - All directory entry data should be written synchronously (see
  258. ``dirsync``) (EXT4\_DIRSYNC\_FL).
  259. * - 0x20000
  260. - Top of directory hierarchy (EXT4\_TOPDIR\_FL).
  261. * - 0x40000
  262. - This is a huge file (EXT4\_HUGE\_FILE\_FL).
  263. * - 0x80000
  264. - Inode uses extents (EXT4\_EXTENTS\_FL).
  265. * - 0x200000
  266. - Inode stores a large extended attribute value in its data blocks
  267. (EXT4\_EA\_INODE\_FL).
  268. * - 0x400000
  269. - This file has blocks allocated past EOF (EXT4\_EOFBLOCKS\_FL).
  270. (deprecated)
  271. * - 0x01000000
  272. - Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline)
  273. * - 0x04000000
  274. - Snapshot is being deleted (``EXT4_SNAPFILE_DELETED_FL``). (not in
  275. mainline)
  276. * - 0x08000000
  277. - Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in
  278. mainline)
  279. * - 0x10000000
  280. - Inode has inline data (EXT4\_INLINE\_DATA\_FL).
  281. * - 0x20000000
  282. - Create children with the same project ID (EXT4\_PROJINHERIT\_FL).
  283. * - 0x80000000
  284. - Reserved for ext4 library (EXT4\_RESERVED\_FL).
  285. * -
  286. - Aggregate flags:
  287. * - 0x4BDFFF
  288. - User-visible flags.
  289. * - 0x4B80FF
  290. - User-modifiable flags. Note that while EXT4\_JOURNAL\_DATA\_FL and
  291. EXT4\_EXTENTS\_FL can be set with setattr, they are not in the kernel's
  292. EXT4\_FL\_USER\_MODIFIABLE mask, since it needs to handle the setting of
  293. these flags in a special manner and they are masked out of the set of
  294. flags that are saved directly to i\_flags.
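
A small sketch of how the ``i_flags`` bits in the table above might be turned into names when inspecting an inode. ``FLAG_NAMES`` and ``decode_flags`` are hypothetical names for this example; bits that do not appear in the table are returned separately rather than guessed at.

```python
# Bit values and names copied from the i_flags table above.
FLAG_NAMES = {
    0x1: "EXT4_SECRM_FL", 0x2: "EXT4_UNRM_FL", 0x4: "EXT4_COMPR_FL",
    0x8: "EXT4_SYNC_FL", 0x10: "EXT4_IMMUTABLE_FL", 0x20: "EXT4_APPEND_FL",
    0x40: "EXT4_NODUMP_FL", 0x80: "EXT4_NOATIME_FL", 0x100: "EXT4_DIRTY_FL",
    0x200: "EXT4_COMPRBLK_FL", 0x400: "EXT4_NOCOMPR_FL",
    0x800: "EXT4_ENCRYPT_FL", 0x1000: "EXT4_INDEX_FL",
    0x2000: "EXT4_IMAGIC_FL", 0x4000: "EXT4_JOURNAL_DATA_FL",
    0x8000: "EXT4_NOTAIL_FL", 0x10000: "EXT4_DIRSYNC_FL",
    0x20000: "EXT4_TOPDIR_FL", 0x40000: "EXT4_HUGE_FILE_FL",
    0x80000: "EXT4_EXTENTS_FL", 0x200000: "EXT4_EA_INODE_FL",
    0x400000: "EXT4_EOFBLOCKS_FL", 0x01000000: "EXT4_SNAPFILE_FL",
    0x04000000: "EXT4_SNAPFILE_DELETED_FL",
    0x08000000: "EXT4_SNAPFILE_SHRUNK_FL",
    0x10000000: "EXT4_INLINE_DATA_FL", 0x20000000: "EXT4_PROJINHERIT_FL",
    0x80000000: "EXT4_RESERVED_FL",
}


def decode_flags(i_flags):
    """Return (names of set flags in ascending bit order, unrecognized bits)."""
    names = [name for bit, name in sorted(FLAG_NAMES.items()) if i_flags & bit]
    known = 0
    for bit in FLAG_NAMES:
        known |= bit
    return names, i_flags & ~known
```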
  295. .. _i_osd1:
  296. The ``osd1`` field has multiple meanings depending on the creator:
  297. Linux:
  298. .. list-table::
  299. :widths: 8 8 24 40
  300. :header-rows: 1
  301. * - Offset
  302. - Size
  303. - Name
  304. - Description
  305. * - 0x0
  306. - \_\_le32
  307. - l\_i\_version
  308. - Inode version. However, if the EA\_INODE inode flag is set, this inode
  309. stores an extended attribute value and this field contains the upper 32
  310. bits of the attribute value's reference count.
  311. Hurd:
  312. .. list-table::
  313. :widths: 8 8 24 40
  314. :header-rows: 1
  315. * - Offset
  316. - Size
  317. - Name
  318. - Description
  319. * - 0x0
  320. - \_\_le32
  321. - h\_i\_translator
  322. - ??
  323. Masix:
  324. .. list-table::
  325. :widths: 8 8 24 40
  326. :header-rows: 1
  327. * - Offset
  328. - Size
  329. - Name
  330. - Description
  331. * - 0x0
  332. - \_\_le32
  333. - m\_i\_reserved
  334. - ??
  335. .. _i_osd2:
  336. The ``osd2`` field has multiple meanings depending on the filesystem creator:
  337. Linux:
  338. .. list-table::
  339. :widths: 8 8 24 40
  340. :header-rows: 1
  341. * - Offset
  342. - Size
  343. - Name
  344. - Description
  345. * - 0x0
  346. - \_\_le16
  347. - l\_i\_blocks\_high
  348. - Upper 16-bits of the block count. Please see the note attached to
  349. i\_blocks\_lo.
  350. * - 0x2
  351. - \_\_le16
  352. - l\_i\_file\_acl\_high
  353. - Upper 16-bits of the extended attribute block (historically, the file
  354. ACL location). See the Extended Attributes section below.
  355. * - 0x4
  356. - \_\_le16
  357. - l\_i\_uid\_high
  358. - Upper 16-bits of the Owner UID.
  359. * - 0x6
  360. - \_\_le16
  361. - l\_i\_gid\_high
  362. - Upper 16-bits of the GID.
  363. * - 0x8
  364. - \_\_le16
  365. - l\_i\_checksum\_lo
  366. - Lower 16-bits of the inode checksum.
  367. * - 0xA
  368. - \_\_le16
  369. - l\_i\_reserved
  370. - Unused.
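
To illustrate how the Linux ``osd2`` halves combine with their counterparts in the main inode table, here is a minimal sketch. The function names are hypothetical, ``huge_file_feature`` stands for "the huge\_file feature flag is set on the filesystem", and ``fs_block_size`` is the filesystem block size in bytes supplied by the caller. The block-count rules follow the note attached to i\_blocks\_lo.

```python
import struct

EXT4_HUGE_FILE_FL = 0x40000  # value from the i_flags table


def parse_linux_osd2(osd2):
    """Unpack the 12-byte Linux variant of i_osd2 (six little-endian u16)."""
    names = ("l_i_blocks_high", "l_i_file_acl_high", "l_i_uid_high",
             "l_i_gid_high", "l_i_checksum_lo", "l_i_reserved")
    return dict(zip(names, struct.unpack("<6H", osd2)))


def combine_id(low16, high16):
    """Join the lower and upper 16 bits of a UID or GID."""
    return (high16 << 16) | low16


def allocated_bytes(i_blocks_lo, l_i_blocks_high, i_flags,
                    huge_file_feature, fs_block_size):
    """Bytes consumed on disk, following the i_blocks_lo description."""
    if not huge_file_feature:
        return i_blocks_lo * 512
    count = i_blocks_lo + (l_i_blocks_high << 32)
    # With EXT4_HUGE_FILE_FL the count is in filesystem blocks, else 512-byte units.
    unit = fs_block_size if i_flags & EXT4_HUGE_FILE_FL else 512
    return count * unit
```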
  371. Hurd:
  372. .. list-table::
  373. :widths: 8 8 24 40
  374. :header-rows: 1
  375. * - Offset
  376. - Size
  377. - Name
  378. - Description
  379. * - 0x0
  380. - \_\_le16
  381. - h\_i\_reserved1
  382. - ??
  383. * - 0x2
  384. - \_\_u16
  385. - h\_i\_mode\_high
  386. - Upper 16-bits of the file mode.
  387. * - 0x4
  388. - \_\_le16
  389. - h\_i\_uid\_high
  390. - Upper 16-bits of the Owner UID.
  391. * - 0x6
  392. - \_\_le16
  393. - h\_i\_gid\_high
  394. - Upper 16-bits of the GID.
  395. * - 0x8
  396. - \_\_u32
  397. - h\_i\_author
  398. - Author code?
  399. Masix:
  400. .. list-table::
  401. :widths: 8 8 24 40
  402. :header-rows: 1
  403. * - Offset
  404. - Size
  405. - Name
  406. - Description
  407. * - 0x0
  408. - \_\_le16
  409. - h\_i\_reserved1
  410. - ??
  411. * - 0x2
  412. - \_\_u16
  413. - m\_i\_file\_acl\_high
  414. - Upper 16-bits of the extended attribute block (historically, the file
  415. ACL location).
  416. * - 0x4
  417. - \_\_u32
  418. - m\_i\_reserved2[2]
  419. - ??
  420. Inode Size
  421. ~~~~~~~~~~
  422. In ext2 and ext3, the inode structure size was fixed at 128 bytes
  423. (``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of
  424. 128 bytes. Starting with ext4, it is possible to allocate a larger
  425. on-disk inode at format time for all inodes in the filesystem to provide
  426. space beyond the end of the original ext2 inode. The on-disk inode
  427. record size is recorded in the superblock as ``s_inode_size``. The
  428. number of bytes actually used by struct ext4\_inode beyond the original
  429. 128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each
  430. inode, which allows struct ext4\_inode to grow for a new kernel without
  431. having to upgrade all of the on-disk inodes. Access to fields beyond
  432. EXT2\_GOOD\_OLD\_INODE\_SIZE should be verified to be within
  433. ``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as
  434. of October 2013) the inode structure is 156 bytes
  435. (``i_extra_isize = 28``). The extra space between the end of the inode
  436. structure and the end of the inode record can be used to store extended
  437. attributes. Each inode record can be as large as the filesystem block
  438. size, though this is not terribly efficient.
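
The bounds check described above can be sketched as follows. This is an illustrative helper, not the kernel's implementation: ``field_is_present`` is a hypothetical name, and field offsets and sizes are taken from the inode table earlier in this document.

```python
EXT2_GOOD_OLD_INODE_SIZE = 128


def field_is_present(field_offset, field_size, s_inode_size, i_extra_isize):
    """True if a field beyond the first 128 bytes lies within i_extra_isize."""
    if s_inode_size <= EXT2_GOOD_OLD_INODE_SIZE:
        return False  # no room for any extended fields
    used = EXT2_GOOD_OLD_INODE_SIZE + i_extra_isize
    if used > s_inode_size:
        return False  # i_extra_isize claims more than the record holds
    return field_offset + field_size <= used
```

With 256-byte records and ``i_extra_isize = 28``, ``i_crtime`` (offset 0x90, 4 bytes) passes the check while ``i_projid`` (offset 0x9C, 4 bytes) does not.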
  439. Finding an Inode
  440. ~~~~~~~~~~~~~~~~
  441. Each block group contains ``sb->s_inodes_per_group`` inodes. Because
  442. inode 0 is defined not to exist, this formula can be used to find the
  443. block group that an inode lives in:
  444. ``bg = (inode_num - 1) / sb->s_inodes_per_group``. The particular inode
  445. can be found within the block group's inode table at
  446. ``index = (inode_num - 1) % sb->s_inodes_per_group``. To get the byte
  447. address within the inode table, use
  448. ``offset = index * sb->s_inode_size``.
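
The three formulas above can be put together in a short sketch. ``locate_inode`` is a hypothetical helper name; its parameters correspond to ``sb->s_inodes_per_group`` and ``sb->s_inode_size``.

```python
def locate_inode(inode_num, inodes_per_group, inode_size):
    """Return (block group, index in group, byte offset in the inode table)."""
    if inode_num < 1:
        raise ValueError("there is no inode 0")
    bg, index = divmod(inode_num - 1, inodes_per_group)
    return bg, index, index * inode_size
```

For instance, with 8192 inodes per group and 256-byte records, inode 12 is at index 11 of block group 0, at byte offset 2816 within that group's inode table.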
  449. Inode Timestamps
  450. ~~~~~~~~~~~~~~~~
  451. Four timestamps are recorded in the lower 128 bytes of the inode
  452. structure -- inode change time (ctime), access time (atime), data
  453. modification time (mtime), and deletion time (dtime). The four fields
  454. are 32-bit signed integers that represent seconds since the Unix epoch
  455. (1970-01-01 00:00:00 GMT), which means that the fields will overflow in
  456. January 2038. For inodes that are not linked from any directory but are
  457. still open (orphan inodes), the dtime field is overloaded for use with
  458. the orphan list. The superblock field ``s_last_orphan`` points to the
  459. first inode in the orphan list; dtime is then the number of the next
  460. orphaned inode, or zero if there are no more orphans.
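
A minimal sketch of following the orphan list as described above. ``walk_orphans`` is a hypothetical name, and ``dtime_of`` is an assumed caller-supplied function that returns the ``i_dtime`` field of a given inode number; the loop guard is an addition of this example, not something the on-disk format specifies.

```python
def walk_orphans(s_last_orphan, dtime_of):
    """Yield orphan inode numbers, starting from the superblock's s_last_orphan."""
    seen = set()
    ino = s_last_orphan
    while ino != 0:
        if ino in seen:
            raise ValueError("orphan list loops back to inode %d" % ino)
        seen.add(ino)
        yield ino
        # For an orphan, dtime holds the number of the next orphaned inode.
        ino = dtime_of(ino)
```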
  461. If the inode structure size ``sb->s_inode_size`` is larger than 128
  462. bytes and the ``i_extra_isize`` field is large enough to encompass the
  463. respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime
  464. inode fields are widened to 64 bits. Within this “extra” 32-bit field,
  465. the lower two bits are used to extend the 32-bit seconds field to be 34
  466. bits wide; the upper 30 bits are used to provide nanosecond timestamp
  467. accuracy. Therefore, timestamps should not overflow until May 2446.
  468. dtime was not widened. There is also a fifth timestamp to record inode
  469. creation time (crtime); this field is 64-bits wide and decoded in the
  470. same manner as 64-bit [cma]time. Neither crtime nor dtime is accessible
  471. through the regular stat() interface, though debugfs will report them.
  472. We use the 32-bit signed time value plus (2^32 \* (extra epoch bits)).
  473. In other words:
  474. .. list-table::
  475. :widths: 20 20 20 20 20
  476. :header-rows: 1
  477. * - Extra epoch bits
  478. - MSB of 32-bit time
  479. - Adjustment for signed 32-bit to 64-bit tv\_sec
  480. - Decoded 64-bit tv\_sec
  481. - valid time range
  482. * - 0 0
  483. - 1
  484. - 0
  485. - ``-0x80000000 - -0x00000001``
  486. - 1901-12-13 to 1969-12-31
  487. * - 0 0
  488. - 0
  489. - 0
  490. - ``0x000000000 - 0x07fffffff``
  491. - 1970-01-01 to 2038-01-19
  492. * - 0 1
  493. - 1
  494. - 0x100000000
  495. - ``0x080000000 - 0x0ffffffff``
  496. - 2038-01-19 to 2106-02-07
  497. * - 0 1
  498. - 0
  499. - 0x100000000
  500. - ``0x100000000 - 0x17fffffff``
  501. - 2106-02-07 to 2174-02-25
  502. * - 1 0
  503. - 1
  504. - 0x200000000
  505. - ``0x180000000 - 0x1ffffffff``
  506. - 2174-02-25 to 2242-03-16
  507. * - 1 0
  508. - 0
  509. - 0x200000000
  510. - ``0x200000000 - 0x27fffffff``
  511. - 2242-03-16 to 2310-04-04
  512. * - 1 1
  513. - 1
  514. - 0x300000000
  515. - ``0x280000000 - 0x2ffffffff``
  516. - 2310-04-04 to 2378-04-22
  517. * - 1 1
  518. - 0
  519. - 0x300000000
  520. - ``0x300000000 - 0x37fffffff``
  521. - 2378-04-22 to 2446-05-10
  522. This is a somewhat odd encoding since there are effectively seven times
  523. as many positive values as negative values. There have also been
  524. long-standing bugs decoding and encoding dates beyond 2038, which don't
  525. seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels
  526. incorrectly use the extra epoch bits 1,1 for dates between 1901 and
  527. 1970. At some point the kernel will be fixed and e2fsck will fix this
  528. situation, assuming that it is run before 2310.
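
The decoding rule from the table above (the signed 32-bit time value plus 2^32 times the extra epoch bits, with the upper 30 bits of the extra field holding nanoseconds) can be sketched as follows. ``decode_timestamp`` is a hypothetical helper name; both arguments are the raw unsigned 32-bit values read from disk.

```python
def decode_timestamp(seconds_field, extra_field):
    """Return (64-bit tv_sec, nanoseconds) from an i_*time / i_*time_extra pair."""
    # Reinterpret the raw 32-bit value as a signed integer.
    if seconds_field >= 0x80000000:
        seconds_field -= 1 << 32
    epoch_bits = extra_field & 0x3   # lower two bits extend the seconds field
    nanoseconds = extra_field >> 2   # upper 30 bits hold the nanoseconds
    return seconds_field + (epoch_bits << 32), nanoseconds
```

The boundary values in the table make convenient checks: a seconds field of 0x80000000 with epoch bits ``0 1`` decodes to 0x080000000, and 0x7fffffff with epoch bits ``1 1`` decodes to 0x37fffffff.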