  1. .. SPDX-License-Identifier: GPL-2.0
  2. Idmappings
  3. ==========
  4. Most filesystem developers will have encountered idmappings. They are used when
  5. reading from or writing ownership to disk, reporting ownership to userspace, or
  6. for permission checking. This document is aimed at filesystem developers that
  7. want to know how idmappings work.
  8. Formal notes
  9. ------------
  10. An idmapping is essentially a translation of a range of ids into another or the
  11. same range of ids. The notational convention for idmappings that is widely used
  12. in userspace is::
  13. u:k:r
  14. ``u`` indicates the first element in the upper idmapset ``U`` and ``k``
  15. indicates the first element in the lower idmapset ``K``. The ``r`` parameter
  16. indicates the range of the idmapping, i.e. how many ids are mapped. From now
  17. on, we will always prefix ids with ``u`` or ``k`` to make it clear whether
  18. we're talking about an id in the upper or lower idmapset.
  19. To see what this looks like in practice, let's take the following idmapping::
  20. u22:k10000:r3
  21. and write down the mappings it will generate::
  22. u22 -> k10000
  23. u23 -> k10001
  24. u24 -> k10002
  25. From a mathematical viewpoint ``U`` and ``K`` are well-ordered sets and an
  26. idmapping is an order isomorphism from ``U`` into ``K``. So ``U`` and ``K`` are
  27. order isomorphic. In fact, ``U`` and ``K`` are always well-ordered subsets of
  28. the set of all possible ids usable on a given system.
  29. Looking at this mathematically briefly will help us highlight some properties
  30. that make it easier to understand how we can translate between idmappings. For
  31. example, we know that the inverse idmapping is an order isomorphism as well::
  32. k10000 -> u22
  33. k10001 -> u23
  34. k10002 -> u24
  35. Given that we are dealing with order isomorphisms plus the fact that we're
  36. dealing with subsets we can embed idmappings into each other, i.e. we can
  37. sensibly translate between different idmappings. For example, assume we've been
  38. given the three idmappings::
  39. 1. u0:k10000:r10000
  40. 2. u0:k20000:r10000
  41. 3. u0:k30000:r10000
  42. and id ``k11000`` which has been generated by the first idmapping by mapping
  43. ``u1000`` from the upper idmapset down to ``k11000`` in the lower idmapset.
  44. Because we're dealing with order isomorphic subsets it is meaningful to ask
  45. what id ``k11000`` corresponds to in the second or third idmapping. The
  46. straightforward algorithm to use is to apply the inverse of the first idmapping,
  47. mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using
  48. either the second or the third idmapping. The second idmapping would map
  49. ``u1000`` down to ``k21000``. The third idmapping would map ``u1000`` down to
  50. ``k31000``.
  51. If we were given the same task for the following three idmappings::
  52. 1. u0:k10000:r10000
  53. 2. u0:k20000:r200
  54. 3. u0:k30000:r300
  55. we would fail to translate as the sets aren't order isomorphic over the full
  56. range of the first idmapping anymore (however, they are order isomorphic over
  57. the full range of the second idmapping). Neither the second nor the third
  58. idmapping contains ``u1000`` in its upper idmapset ``U``. This is equivalent to not having
  59. an id mapped. We can simply say that ``u1000`` is unmapped in the second and
  60. third idmapping. The kernel will report unmapped ids as the overflowuid
  61. ``(uid_t)-1`` or overflowgid ``(gid_t)-1`` to userspace.
  62. The algorithm to calculate what a given id maps to is pretty simple. First, we
  63. need to verify that the range can contain our target id. We will skip this step
  64. for simplicity. After that if we want to know what ``id`` maps to we can do
  65. simple calculations:
  66. - If we want to map from left to right::
  67. u:k:r
  68. id - u + k = n
  69. - If we want to map from right to left::
  70. u:k:r
  71. id - k + u = n
  72. Instead of "left to right" we can also say "down" and instead of "right to
  73. left" we can also say "up". Obviously mapping down and up invert each other.
  74. To see whether the simple formulas above work, consider the following two
  75. idmappings::
  76. 1. u0:k20000:r10000
  77. 2. u500:k30000:r10000
  78. Assume we are given ``k21000`` in the lower idmapset of the first idmapping. We
  79. want to know what id this was mapped from in the upper idmapset of the first
  80. idmapping. So we're mapping up in the first idmapping::
  81. id - k + u = n
  82. k21000 - k20000 + u0 = u1000
  83. Now assume we are given the id ``u1100`` in the upper idmapset of the second
  84. idmapping and we want to know what this id maps down to in the lower idmapset
  85. of the second idmapping. This means we're mapping down in the second
  86. idmapping::
  87. id - u + k = n
  88. u1100 - u500 + k30000 = k30600
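The two formulas, together with the range check that was skipped above, can be sketched in C. This is a minimal userspace model and not kernel code: ``struct idmapping``, ``map_down()``, ``map_up()`` and ``ID_UNMAPPED`` are names invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a single u:k:r idmapping (not a kernel type). */
struct idmapping {
	uint32_t u; /* first id in the upper idmapset */
	uint32_t k; /* first id in the lower idmapset */
	uint32_t r; /* number of ids that are mapped */
};

/* Sentinel for "unmapped" in this sketch, modelled on (uid_t)-1. */
#define ID_UNMAPPED UINT32_MAX

/* Is id within [first, first + r)? Written so that it cannot overflow. */
static bool in_range(uint32_t first, uint32_t r, uint32_t id)
{
	return id >= first && id - first < r;
}

/* Map down (left to right): id - u + k = n */
static uint32_t map_down(const struct idmapping *m, uint32_t id)
{
	assert(m);
	if (!in_range(m->u, m->r, id))
		return ID_UNMAPPED;
	return id - m->u + m->k;
}

/* Map up (right to left): id - k + u = n */
static uint32_t map_up(const struct idmapping *m, uint32_t id)
{
	assert(m);
	if (!in_range(m->k, m->r, id))
		return ID_UNMAPPED;
	return id - m->k + m->u;
}
```

With ``u0:k20000:r10000`` and ``u500:k30000:r10000`` this reproduces the two worked examples: ``map_up()`` turns ``k21000`` into ``u1000`` in the first idmapping and ``map_down()`` turns ``u1100`` into ``k30600`` in the second.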
  89. General notes
  90. -------------
  91. In the context of the kernel an idmapping can be interpreted as mapping a range
  92. of userspace ids into a range of kernel ids::
  93. userspace-id:kernel-id:range
  94. A userspace id is always an element in the upper idmapset of an idmapping of
  95. type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the lower
  96. idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on
  97. "userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t``
  98. types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``.
  99. The kernel is mostly concerned with kernel ids. They are used when performing
  100. permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` field.
  101. A userspace id on the other hand is an id that is reported to userspace by the
  102. kernel, or is passed by userspace to the kernel, or a raw device id that is
  103. written or read from disk.
  104. Note that we are only concerned with idmappings as the kernel stores them,
  105. not with how userspace would specify them.
  106. For the rest of this document we will prefix all userspace ids with ``u`` and
  107. all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So
  108. an idmapping will be written as ``u0:k10000:r10000``.
  109. For example, within this idmapping, the id ``u1000`` is an id in the upper
  110. idmapset or "userspace idmapset" starting with ``u0``. And it is mapped to
  111. ``k11000`` which is a kernel id in the lower idmapset or "kernel idmapset"
  112. starting with ``k10000``.
  113. A kernel id is always created by an idmapping. Such idmappings are associated
  114. with user namespaces. Since we mainly care about how idmappings work we're not
  115. going to be concerned with how idmappings are created nor how they are used
  116. outside of the filesystem context. This is best left to an explanation of user
  117. namespaces.
  118. The initial user namespace is special. It always has an idmapping of the
  119. following form::
  120. u0:k0:r4294967295
  121. which is an identity idmapping over the full range of ids available on this
  122. system.
  123. Other user namespaces usually have non-identity idmappings such as::
  124. u0:k10000:r10000
  125. When a process creates or wants to change ownership of a file, or when the
  126. ownership of a file is read from disk by a filesystem, the userspace id is
  127. immediately translated into a kernel id according to the idmapping associated
  128. with the relevant user namespace.
  129. For instance, consider a file that is stored on disk by a filesystem as being
  130. owned by ``u1000``:
  131. - If a filesystem were to be mounted in the initial user namespace (as most
  132. filesystems are) then the initial idmapping will be used. As we saw this is
  133. simply the identity idmapping. This would mean id ``u1000`` read from disk
  134. would be mapped to id ``k1000``. So an inode's ``i_uid`` and ``i_gid`` field
  135. would contain ``k1000``.
  136. - If a filesystem were to be mounted with an idmapping of ``u0:k10000:r10000``
  137. then ``u1000`` read from disk would be mapped to ``k11000``. So an inode's
  138. ``i_uid`` and ``i_gid`` would contain ``k11000``.
  139. Translation algorithms
  140. ----------------------
  141. We've already seen briefly that it is possible to translate between different
  142. idmappings. We'll now take a closer look at how that works.
  143. Crossmapping
  144. ~~~~~~~~~~~~
  145. This translation algorithm is used by the kernel in quite a few places. For
  146. example, it is used when reporting back the ownership of a file to userspace
  147. via the ``stat()`` system call family.
  148. If we've been given ``k11000`` from one idmapping we can map that id up in
  149. another idmapping. In order for this to work both idmappings need to contain
  150. the same kernel id in their kernel idmapsets. For example, consider the
  151. following idmappings::
  152. 1. u0:k10000:r10000
  153. 2. u20000:k10000:r10000
  154. and we are mapping ``u1000`` down to ``k11000`` in the first idmapping. We can
  155. then translate ``k11000`` into a userspace id in the second idmapping using the
  156. kernel idmapset of the second idmapping::
  157. /* Map the kernel id up into a userspace id in the second idmapping. */
  158. from_kuid(u20000:k10000:r10000, k11000) = u21000
  159. Note how we can get back to the kernel id in the first idmapping by inverting
  160. the algorithm::
  161. /* Map the userspace id down into a kernel id in the second idmapping. */
  162. make_kuid(u20000:k10000:r10000, u21000) = k11000
  163. /* Map the kernel id up into a userspace id in the first idmapping. */
  164. from_kuid(u0:k10000:r10000, k11000) = u1000
  165. This algorithm allows us to answer the question what userspace id a given
  166. kernel id corresponds to in a given idmapping. In order to be able to answer
  167. this question both idmappings need to contain the same kernel id in their
  168. respective kernel idmapsets.
  169. For example, when the kernel reads a raw userspace id from disk it maps it down
  170. into a kernel id according to the idmapping associated with the filesystem.
  171. Let's assume the filesystem was mounted with an idmapping of
  172. ``u0:k20000:r10000`` and it reads a file owned by ``u1000`` from disk. This
  173. means ``u1000`` will be mapped to ``k21000`` which is what will be stored in
  174. the inode's ``i_uid`` and ``i_gid`` field.
  175. When someone in userspace calls ``stat()`` or a related function to get
  176. ownership information about the file the kernel can't simply map the id back up
  177. according to the filesystem's idmapping as this would give the wrong owner if
  178. the caller is using an idmapping.
  179. So the kernel will map the id back up in the idmapping of the caller. Let's
  180. assume the caller has the somewhat unconventional idmapping
  181. ``u3000:k20000:r10000``. Then ``k21000`` would map back up to ``u4000``.
  182. Consequently the user would see that this file is owned by ``u4000``.
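As a rough illustration, the crossmapping performed when reporting ownership can be modelled in userspace C. The names ``struct idmapping``, ``model_make_kuid()``, ``model_from_kuid()``, ``model_stat_uid()`` and ``ID_UNMAPPED`` are hypothetical and only mirror the notation used in this document. The real kernel helpers take a user namespace, not a ``u:k:r`` triple.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a u:k:r idmapping (not a kernel type). */
struct idmapping { uint32_t u, k, r; };

/* Sentinel for "unmapped" in this sketch. */
#define ID_UNMAPPED UINT32_MAX

/* Map a userspace id down into a kernel id. */
static uint32_t model_make_kuid(const struct idmapping *m, uint32_t uid)
{
	if (uid < m->u || uid - m->u >= m->r)
		return ID_UNMAPPED;
	return uid - m->u + m->k;
}

/* Map a kernel id up into a userspace id. */
static uint32_t model_from_kuid(const struct idmapping *m, uint32_t kuid)
{
	if (kuid < m->k || kuid - m->k >= m->r)
		return ID_UNMAPPED;
	return kuid - m->k + m->u;
}

/*
 * Crossmapping: the id read from disk is mapped down in the filesystem's
 * idmapping and the resulting kernel id is mapped up in the caller's
 * idmapping.
 */
static uint32_t model_stat_uid(const struct idmapping *fs,
			       const struct idmapping *caller,
			       uint32_t disk_uid)
{
	uint32_t kuid;

	assert(fs && caller);
	kuid = model_make_kuid(fs, disk_uid);
	if (kuid == ID_UNMAPPED)
		return ID_UNMAPPED;
	return model_from_kuid(caller, kuid);
}
```

For a filesystem idmapping of ``u0:k20000:r10000`` and a caller idmapping of ``u3000:k20000:r10000``, ``model_stat_uid()`` turns the on-disk ``u1000`` into ``u4000``. A caller whose kernel idmapset does not contain ``k21000`` gets ``ID_UNMAPPED`` back.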
  183. Remapping
  184. ~~~~~~~~~
  185. It is possible to translate a kernel id from one idmapping to another one via
  186. the userspace idmapset of the two idmappings. This is equivalent to remapping
  187. a kernel id.
  188. Let's look at an example. We are given the following two idmappings::
  189. 1. u0:k10000:r10000
  190. 2. u0:k20000:r10000
  191. and we are given ``k11000`` in the first idmapping. In order to translate this
  192. kernel id in the first idmapping into a kernel id in the second idmapping we
  193. need to perform two steps:
  194. 1. Map the kernel id up into a userspace id in the first idmapping::
  195. /* Map the kernel id up into a userspace id in the first idmapping. */
  196. from_kuid(u0:k10000:r10000, k11000) = u1000
  197. 2. Map the userspace id down into a kernel id in the second idmapping::
  198. /* Map the userspace id down into a kernel id in the second idmapping. */
  199. make_kuid(u0:k20000:r10000, u1000) = k21000
  200. As you can see we used the userspace idmapset in both idmappings to translate
  201. the kernel id in one idmapping to a kernel id in another idmapping.
  202. This allows us to answer the question what kernel id we would need to use to
  203. get the same userspace id in another idmapping. In order to be able to answer
  204. this question both idmappings need to contain the same userspace id in their
  205. respective userspace idmapsets.
  206. Note how we can easily get back to the kernel id in the first idmapping by
  207. inverting the algorithm:
  208. 1. Map the kernel id up into a userspace id in the second idmapping::
  209. /* Map the kernel id up into a userspace id in the second idmapping. */
  210. from_kuid(u0:k20000:r10000, k21000) = u1000
  211. 2. Map the userspace id down into a kernel id in the first idmapping::
  212. /* Map the userspace id down into a kernel id in the first idmapping. */
  213. make_kuid(u0:k10000:r10000, u1000) = k11000
  214. Another way to look at this translation is to treat it as inverting one
  215. idmapping and applying another idmapping if both idmappings have the relevant
  216. userspace id mapped. This will come in handy when working with idmapped mounts.
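The two remapping steps can be sketched as a single helper. This is again a userspace model under stated assumptions: ``struct idmapping``, ``model_from_kuid()``, ``model_make_kuid()``, ``model_remap_kuid()`` and ``ID_UNMAPPED`` are illustrative names and not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a u:k:r idmapping (not a kernel type). */
struct idmapping { uint32_t u, k, r; };

/* Sentinel for "unmapped" in this sketch. */
#define ID_UNMAPPED UINT32_MAX

/* Map a kernel id up into a userspace id. */
static uint32_t model_from_kuid(const struct idmapping *m, uint32_t kuid)
{
	if (kuid < m->k || kuid - m->k >= m->r)
		return ID_UNMAPPED;
	return kuid - m->k + m->u;
}

/* Map a userspace id down into a kernel id. */
static uint32_t model_make_kuid(const struct idmapping *m, uint32_t uid)
{
	if (uid < m->u || uid - m->u >= m->r)
		return ID_UNMAPPED;
	return uid - m->u + m->k;
}

/*
 * Remapping: invert the "from" idmapping to reach the shared userspace id,
 * then apply the "to" idmapping. Fails if either step is unmapped.
 */
static uint32_t model_remap_kuid(const struct idmapping *from,
				 const struct idmapping *to,
				 uint32_t kuid)
{
	uint32_t uid;

	assert(from && to);
	uid = model_from_kuid(from, kuid);
	if (uid == ID_UNMAPPED)
		return ID_UNMAPPED;
	return model_make_kuid(to, uid);
}
```

Remapping ``k11000`` from ``u0:k10000:r10000`` to ``u0:k20000:r10000`` yields ``k21000``, and swapping the two arguments maps ``k21000`` back to ``k11000``. If the target idmapping does not contain ``u1000``, for example ``u0:k20000:r200``, the helper reports the id as unmapped.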
  217. Invalid translations
  218. ~~~~~~~~~~~~~~~~~~~~
  219. It is never valid to use an id in the kernel idmapset of one idmapping as the
  220. id in the userspace idmapset of another or the same idmapping. While the kernel
  221. idmapset always indicates an idmapset in the kernel id space the userspace
  222. idmapset indicates a userspace id. So the following translations are forbidden::
  223. /* Map the userspace id down into a kernel id in the first idmapping. */
  224. make_kuid(u0:k10000:r10000, u1000) = k11000
  225. /* INVALID: Map the kernel id down into a kernel id in the second idmapping. */
  226. make_kuid(u10000:k20000:r10000, k11000) = k21000
  227. ~~~~~~
  228. and equally wrong::
  229. /* Map the kernel id up into a userspace id in the first idmapping. */
  230. from_kuid(u0:k10000:r10000, k11000) = u1000
  231. /* INVALID: Map the userspace id up into a userspace id in the second idmapping. */
  232. from_kuid(u20000:k0:r10000, u1000) = u21000
  233. ~~~~~
  234. Since userspace ids have type ``uid_t`` and ``gid_t`` and kernel ids have type
  235. ``kuid_t`` and ``kgid_t`` the compiler will throw an error when they are
  236. conflated. So the two examples above would cause a compilation failure.
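The reason the compiler can catch this is that a kernel id is wrapped in a struct instead of being a bare integer. The sketch below shows the idea in userspace C. ``model_uid_t``, ``model_kuid_t``, ``struct idmapping`` and the two helpers are stand-ins invented for this illustration and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for uid_t (plain integer) and kuid_t (wrapped). */
typedef uint32_t model_uid_t;
typedef struct { uint32_t val; } model_kuid_t;

/* Hypothetical model of a u:k:r idmapping (not a kernel type). */
struct idmapping { uint32_t u, k, r; };

#define MODEL_INVALID_UID  ((model_uid_t)-1)
#define MODEL_INVALID_KUID ((model_kuid_t){ .val = UINT32_MAX })

/* Map a userspace id down; the result has a distinct type. */
static model_kuid_t model_make_kuid(const struct idmapping *m, model_uid_t uid)
{
	assert(m);
	if (uid < m->u || uid - m->u >= m->r)
		return MODEL_INVALID_KUID;
	return (model_kuid_t){ .val = uid - m->u + m->k };
}

/* Map a kernel id up; only the wrapped type is accepted as input. */
static model_uid_t model_from_kuid(const struct idmapping *m, model_kuid_t kuid)
{
	assert(m);
	if (kuid.val < m->k || kuid.val - m->k >= m->r)
		return MODEL_INVALID_UID;
	return kuid.val - m->k + m->u;
}

/*
 * With these types the invalid translations are rejected at compile time:
 *
 *   model_make_kuid(&second, kuid);  error: a struct is not an integer
 *   model_from_kuid(&second, uid);   error: an integer is not a struct
 */
```

The valid round trip still works: mapping ``u1000`` down in ``u0:k10000:r10000`` gives a ``model_kuid_t`` holding ``11000``, and mapping that value up again returns ``1000``.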
  237. Idmappings when creating filesystem objects
  238. -------------------------------------------
  239. The concepts of mapping an id down or mapping an id up are expressed in the two
  240. kernel functions filesystem developers are rather familiar with and which we've
  241. already used in this document::
  242. /* Map the userspace id down into a kernel id. */
  243. make_kuid(idmapping, uid)
  244. /* Map the kernel id up into a userspace id. */
  245. from_kuid(idmapping, kuid)
  246. We will take an abbreviated look into how idmappings figure into creating
  247. filesystem objects. For simplicity we will only look at what happens when the
  248. VFS has already completed path lookup right before it calls into the filesystem
  249. itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is
  250. called. We will also assume that the directory we're creating filesystem
  251. objects in is readable and writable for everyone.
  252. When creating a filesystem object the kernel will look at the caller's
  253. filesystem ids. These are just regular ``uid_t`` and ``gid_t`` userspace ids
  254. but they are exclusively used when determining file ownership which is why they
  255. are called "filesystem ids". They are usually identical to the uid and gid of
  256. the caller but can differ. We will just assume they are always identical to not
  257. get lost in too many details.
  258. When the caller enters the kernel two things happen:
  259. 1. Map the caller's userspace ids down into kernel ids in the caller's
  260. idmapping.
  261. (To be precise, the kernel will simply look at the kernel ids stashed in the
  262. credentials of the current task but for our education we'll pretend this
  263. translation happens just in time.)
  264. 2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
  265. filesystem's idmapping.
  266. The second step is important as regular filesystems will ultimately need to map
  267. the kernel id back up into a userspace id when writing to disk.
  268. So with the second step the kernel guarantees that a valid userspace id can be
  269. written to disk. If it can't the kernel will refuse the creation request to not
  270. even remotely risk filesystem corruption.
  271. The astute reader will have realized that this is simply a variation of the
  272. crossmapping algorithm we introduced in a previous section. First, the
  273. kernel maps the caller's userspace id down into a kernel id according to the
  274. caller's idmapping and then maps that kernel id up according to the
  275. filesystem's idmapping.
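The two steps can be modelled as one function that returns the id a filesystem would write to disk, or a sentinel if the creation request has to be refused. This is a simplified userspace sketch: ``struct idmapping``, ``model_make_kuid()``, ``model_from_kuid()``, ``model_create_disk_uid()`` and ``ID_UNMAPPED`` are hypothetical names. The real kernel reads the kernel ids from the task's credentials and does not compute them on the fly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a u:k:r idmapping (not a kernel type). */
struct idmapping { uint32_t u, k, r; };

/* Sentinel for "unmapped" in this sketch. */
#define ID_UNMAPPED UINT32_MAX

/* Map a userspace id down into a kernel id. */
static uint32_t model_make_kuid(const struct idmapping *m, uint32_t uid)
{
	if (uid < m->u || uid - m->u >= m->r)
		return ID_UNMAPPED;
	return uid - m->u + m->k;
}

/* Map a kernel id up into a userspace id. */
static uint32_t model_from_kuid(const struct idmapping *m, uint32_t kuid)
{
	if (kuid < m->k || kuid - m->k >= m->r)
		return ID_UNMAPPED;
	return kuid - m->k + m->u;
}

/*
 * Step 1: map the caller's filesystem id down in the caller's idmapping.
 * Step 2: check that the kernel id maps up in the filesystem's idmapping.
 * Returns the userspace id that would land on disk, or ID_UNMAPPED if the
 * request must be denied.
 */
static uint32_t model_create_disk_uid(const struct idmapping *caller,
				      const struct idmapping *fs,
				      uint32_t fsuid)
{
	uint32_t kuid;

	assert(caller && fs);
	kuid = model_make_kuid(caller, fsuid);
	if (kuid == ID_UNMAPPED)
		return ID_UNMAPPED;
	return model_from_kuid(fs, kuid);
}
```

If both idmappings are the identity idmapping, ``u1000`` ends up on disk unchanged. If the caller uses ``u0:k10000:r10000`` and the filesystem uses ``u0:k20000:r10000``, the function returns ``ID_UNMAPPED`` because ``k11000`` is not in the filesystem's kernel idmapset.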
  276. From an implementation point of view it's worth mentioning how idmappings are represented.
  277. All idmappings are taken from the corresponding user namespace:
  278. - caller's idmapping (usually taken from ``current_user_ns()``)
  279. - filesystem's idmapping (``sb->s_user_ns``)
  280. - mount's idmapping (``mnt_idmap(vfsmnt)``)
  281. Let's see some examples with caller/filesystem idmapping but without mount
  282. idmappings. This will exhibit some problems we can hit. After that we will
  283. revisit/reconsider these examples, this time using mount idmappings, to see how
  284. they can solve the problems we observed before.
  285. Example 1
  286. ~~~~~~~~~
  287. ::
  288. caller id: u1000
  289. caller idmapping: u0:k0:r4294967295
  290. filesystem idmapping: u0:k0:r4294967295
  291. Both the caller and the filesystem use the identity idmapping:
  292. 1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
  293. make_kuid(u0:k0:r4294967295, u1000) = k1000
  294. 2. Verify that the caller's kernel ids can be mapped to userspace ids in the
  295. filesystem's idmapping.
  296. For this second step the kernel will call the function
  297. ``fsuidgid_has_mapping()`` which ultimately boils down to calling
  298. ``from_kuid()``::
  299. from_kuid(u0:k0:r4294967295, k1000) = u1000
  300. In this example both idmappings are the same so there's nothing exciting going
  301. on. Ultimately the userspace id that lands on disk will be ``u1000``.
  302. Example 2
  303. ~~~~~~~~~
  304. ::
  305. caller id: u1000
  306. caller idmapping: u0:k10000:r10000
  307. filesystem idmapping: u0:k20000:r10000
  308. 1. Map the caller's userspace ids down into kernel ids in the caller's
  309. idmapping::
  310. make_kuid(u0:k10000:r10000, u1000) = k11000
  311. 2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
  312. filesystem's idmapping::
  313. from_kuid(u0:k20000:r10000, k11000) = u-1
  314. It's immediately clear that while the caller's userspace id could be
  315. successfully mapped down into kernel ids in the caller's idmapping the kernel
  316. ids could not be mapped up according to the filesystem's idmapping. So the
  317. kernel will deny this creation request.
  318. Note that while this example is less common, because most filesystems can't be
  319. mounted with non-initial idmappings, this is a general problem as we can see in
  320. the next examples.
  321. Example 3
  322. ~~~~~~~~~
  323. ::
  324. caller id: u1000
  325. caller idmapping: u0:k10000:r10000
  326. filesystem idmapping: u0:k0:r4294967295
  327. 1. Map the caller's userspace ids down into kernel ids in the caller's
  328. idmapping::
  329. make_kuid(u0:k10000:r10000, u1000) = k11000
  330. 2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
  331. filesystem's idmapping::
  332. from_kuid(u0:k0:r4294967295, k11000) = u11000
  333. We can see that the translation always succeeds. The userspace id that the
  334. filesystem will ultimately put to disk will always be identical to the value of
  335. the kernel id that was created in the caller's idmapping. This has mainly two
  336. consequences.
  337. First, that we can't allow a caller to ultimately write to disk with another
  338. userspace id. We could only do this if we were to mount the whole filesystem
  339. with the caller's or another idmapping. But that solution is limited to a few
  340. filesystems and not very flexible. But this is a use-case that is pretty
  341. important in containerized workloads.
  342. Second, the caller will usually not be able to create any files or access
  343. directories that have stricter permissions because none of the filesystem's
  344. kernel ids map up into valid userspace ids in the caller's idmapping:
  345. 1. Map raw userspace ids down to kernel ids in the filesystem's idmapping::
  346. make_kuid(u0:k0:r4294967295, u1000) = k1000
  347. 2. Map kernel ids up to userspace ids in the caller's idmapping::
  348. from_kuid(u0:k10000:r10000, k1000) = u-1
  349. Example 4
  350. ~~~~~~~~~
  351. ::
  352. file id: u1000
  353. caller idmapping: u0:k10000:r10000
  354. filesystem idmapping: u0:k0:r4294967295
  355. In order to report ownership to userspace the kernel uses the crossmapping
  356. algorithm introduced in a previous section:
  357. 1. Map the userspace id on disk down into a kernel id in the filesystem's
  358. idmapping::
  359. make_kuid(u0:k0:r4294967295, u1000) = k1000
  360. 2. Map the kernel id up into a userspace id in the caller's idmapping::
  361. from_kuid(u0:k10000:r10000, k1000) = u-1
  362. The crossmapping algorithm fails in this case because the kernel id in the
  363. filesystem idmapping cannot be mapped up to a userspace id in the caller's
  364. idmapping. Thus, the kernel will report the ownership of this file as the
  365. overflowid.
  366. Example 5
  367. ~~~~~~~~~
  368. ::
  369. file id: u1000
  370. caller idmapping: u0:k10000:r10000
  371. filesystem idmapping: u0:k20000:r10000
  372. In order to report ownership to userspace the kernel uses the crossmapping
  373. algorithm introduced in a previous section:
  374. 1. Map the userspace id on disk down into a kernel id in the filesystem's
  375. idmapping::
  376. make_kuid(u0:k20000:r10000, u1000) = k21000
  377. 2. Map the kernel id up into a userspace id in the caller's idmapping::
  378. from_kuid(u0:k10000:r10000, k21000) = u-1
  379. Again, the crossmapping algorithm fails in this case because the kernel id in
  380. the filesystem idmapping cannot be mapped to a userspace id in the caller's
  381. idmapping. Thus, the kernel will report the ownership of this file as the
  382. overflowid.
  383. Note how in the last two examples things would be simple if the caller were
  384. using the initial idmapping. For a filesystem mounted with the initial
  385. idmapping it would be trivial. So we only consider a filesystem with an
  386. idmapping of ``u0:k20000:r10000``:
  387. 1. Map the userspace id on disk down into a kernel id in the filesystem's
  388. idmapping::
  389. make_kuid(u0:k20000:r10000, u1000) = k21000
  390. 2. Map the kernel id up into a userspace id in the caller's idmapping::
  391. from_kuid(u0:k0:r4294967295, k21000) = u21000
  392. Idmappings on idmapped mounts
  393. -----------------------------
  394. The examples we've seen in the previous section where the caller's idmapping
  395. and the filesystem's idmapping are incompatible cause various issues for
  396. workloads. For a more complex but common example, consider two containers
  397. started on the host. To completely prevent the two containers from affecting
  398. each other, an administrator may often use different non-overlapping idmappings
  399. for the two containers::
  400. container1 idmapping: u0:k10000:r10000
  401. container2 idmapping: u0:k20000:r10000
  402. filesystem idmapping: u0:k30000:r10000
  403. An administrator wanting to provide easy read-write access to the following set
  404. of files::
  405. dir id: u0
  406. dir/file1 id: u1000
  407. dir/file2 id: u2000
  408. to both containers currently can't.
  409. Of course the administrator has the option to recursively change ownership via
  410. ``chown()``. For example, they could change ownership so that ``dir`` and all
  411. files below it can be crossmapped from the filesystem's into the container's
  412. idmapping. Let's assume they change ownership so it is compatible with the
  413. first container's idmapping::
  414. dir id: u10000
  415. dir/file1 id: u11000
  416. dir/file2 id: u12000
  417. This would still leave ``dir`` rather useless to the second container. In fact,
  418. ``dir`` and all files below it would continue to appear owned by the overflowid
  419. for the second container.
  420. Or consider another increasingly popular example. Some service managers such as
  421. systemd implement a concept called "portable home directories". A user may want
  422. to use their home directories on different machines where they are assigned
  423. different login userspace ids. Most users will have ``u1000`` as the login id
  424. on their machine at home and all files in their home directory will usually be
  425. owned by ``u1000``. At uni or at work they may have another login id such as
  426. ``u1125``. This makes it rather difficult to interact with their home directory
  427. on their work machine.
  428. In both cases changing ownership recursively has grave implications. The most
  429. obvious one is that ownership is changed globally and permanently. In the home
  430. directory case this change in ownership would even need to happen every time the
  431. user switches from their home to their work machine. For really large sets of
  432. files this becomes increasingly costly.
  433. If the user is lucky, they are dealing with a filesystem that is mountable
  434. inside user namespaces. But this would also change ownership globally and the
  435. change in ownership is tied to the lifetime of the filesystem mount, i.e. the
  436. superblock. The only way to change ownership is to completely unmount the
  437. filesystem and mount it again in another user namespace. This is usually
  438. impossible because it would mean that all users currently accessing the
  439. filesystem can't anymore. And it means that ``dir`` still can't be shared
  440. between two containers with different idmappings.
  441. But usually the user doesn't even have this option since most filesystems
  442. aren't mountable inside containers. And not having them mountable might be
  443. desirable as it doesn't require the filesystem to deal with malicious
  444. filesystem images.
  445. But the use cases mentioned above and more can be handled by idmapped mounts.
  446. They allow exposing the same set of dentries with different ownership at
  447. different mounts. This is achieved by marking the mounts with a user namespace
  448. through the ``mount_setattr()`` system call. The idmapping associated with it
  449. is then used to translate from the caller's idmapping to the filesystem's
  450. idmapping and vice versa using the remapping algorithm we introduced above.
  451. Idmapped mounts make it possible to change ownership in a temporary and
  452. localized way. The ownership changes are restricted to a specific mount and the
  453. ownership changes are tied to the lifetime of the mount. All other users and
  454. locations where the filesystem is exposed are unaffected.
  455. Filesystems that support idmapped mounts don't have any real reason to support
  456. being mountable inside user namespaces. A filesystem could be exposed
  457. completely under an idmapped mount to get the same effect. This has the
  458. advantage that filesystems can leave the creation of the superblock to
  459. privileged users in the initial user namespace.
  460. However, it is perfectly possible to combine idmapped mounts with filesystems
  461. mountable inside user namespaces. We will touch on this further below.
  462. Filesystem types vs idmapped mount types
  463. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  464. With the introduction of idmapped mounts we need to distinguish between
  465. filesystem ownership and mount ownership of a VFS object such as an inode. The
  466. owner of an inode might be different when looked at from a filesystem
  467. perspective than when looked at from an idmapped mount. Such fundamental
  468. conceptual distinctions should almost always be clearly expressed in the code.
  469. So, to distinguish idmapped mount ownership from filesystem ownership separate
  470. types have been introduced.
  471. If a uid or gid has been generated using the filesystem or caller's idmapping
  472. then we will use the ``kuid_t`` and ``kgid_t`` types. However, if a uid or gid
  473. has been generated using a mount idmapping then we will be using the dedicated
  474. ``vfsuid_t`` and ``vfsgid_t`` types.
  475. All VFS helpers that generate or take uids and gids as arguments use the
  476. ``vfsuid_t`` and ``vfsgid_t`` types and we will be able to rely on the compiler
  477. to catch errors that originate from conflating filesystem and VFS uids and gids.
  478. The ``vfsuid_t`` and ``vfsgid_t`` types are often mapped from and to ``kuid_t``
  479. and ``kgid_t`` types similar to how ``kuid_t`` and ``kgid_t`` types are mapped
  480. from and to ``uid_t`` and ``gid_t`` types::
  481. uid_t <--> kuid_t <--> vfsuid_t
  482. gid_t <--> kgid_t <--> vfsgid_t
  483. Whenever we report ownership based on a ``vfsuid_t`` or ``vfsgid_t`` type,
  484. e.g., during ``stat()``, or store ownership information in a shared VFS object
  485. based on a ``vfsuid_t`` or ``vfsgid_t`` type, e.g., during ``chown()`` we can
  486. use the ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()`` helpers.
  487. To illustrate why these helpers exist, consider what happens when we
  488. change ownership of an inode from an idmapped mount. After we have generated
  489. a ``vfsuid_t`` or ``vfsgid_t`` based on the mount idmapping we later commit
  490. this ``vfsuid_t`` or ``vfsgid_t`` to become the new filesystem-wide ownership.
  491. Thus, we are turning the ``vfsuid_t`` or ``vfsgid_t`` into a global ``kuid_t``
  492. or ``kgid_t``. And this can be done by using ``vfsuid_into_kuid()`` and
  493. ``vfsgid_into_kgid()``.
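The idea of keeping mount-specific and filesystem-wide ids in separate types can be modelled outside the kernel as follows. This is a rough Python analogue, not the kernel's definitions: the ``Kuid`` and ``Vfsuid`` classes are hypothetical names for this sketch, and where C rejects a mix-up at compile time, Python would need a static type checker or the runtime check shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kuid:
    """Filesystem-wide ("global") kernel uid, modelled on kuid_t."""
    val: int


@dataclass(frozen=True)
class Vfsuid:
    """Uid as seen through one idmapped mount, modelled on vfsuid_t."""
    val: int


def vfsuid_into_kuid(vfsuid: Vfsuid) -> Kuid:
    """Commit a mount-specific id as a global kernel id.

    The numeric value is carried over unchanged; only the type changes.
    """
    if not isinstance(vfsuid, Vfsuid):
        raise TypeError("expected a Vfsuid")
    return Kuid(vfsuid.val)


new_owner = vfsuid_into_kuid(Vfsuid(11000))
```

Because the two classes never compare equal, accidentally treating a ``Vfsuid`` as a ``Kuid`` shows up as a failed comparison or a type error rather than as silently wrong ownership.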
  494. Note, whenever a shared VFS object, e.g., a cached ``struct inode`` or a cached
  495. ``struct posix_acl``, stores ownership information, a filesystem or "global"
  496. ``kuid_t`` and ``kgid_t`` must be used. Ownership expressed via ``vfsuid_t``
  497. and ``vfsgid_t`` is specific to an idmapped mount.
  498. We already noted that ``vfsuid_t`` and ``vfsgid_t`` types are generated based
  499. on mount idmappings whereas ``kuid_t`` and ``kgid_t`` types are generated based
  500. on filesystem idmappings. To prevent abusing filesystem idmappings to generate
  501. ``vfsuid_t`` or ``vfsgid_t`` types or mount idmappings to generate ``kuid_t``
  502. or ``kgid_t`` types, filesystem idmappings and mount idmappings are different
  503. types as well.
  504. All helpers that map to or from ``vfsuid_t`` and ``vfsgid_t`` types require
  505. a mount idmapping to be passed which is of type ``struct mnt_idmap``. Passing
  506. a filesystem or caller idmapping will cause a compilation error.
  507. Similar to how we prefix all userspace ids in this document with ``u`` and all
  508. kernel ids with ``k`` we will prefix all VFS ids with ``v``. So a mount
  509. idmapping will be written as: ``u0:v10000:r10000``.
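The ``u:k:r`` and ``u:v:r`` notation used throughout this document is easy to parse mechanically. The following Python sketch is only a convenience for experimenting with the examples; the ``Idmapping`` tuple and ``parse_idmapping()`` are hypothetical names and are not part of any kernel interface.

```python
import re
from typing import NamedTuple


class Idmapping(NamedTuple):
    upper: int  # first userspace id (the "u" field)
    lower: int  # first kernel or VFS id (the "k" or "v" field)
    count: int  # length of the range (the "r" field)
    kind: str   # "k" for caller/filesystem idmappings, "v" for mount idmappings


_PATTERN = re.compile(r"u(\d+):([kv])(\d+):r(\d+)")


def parse_idmapping(text: str) -> Idmapping:
    """Parse a string such as "u0:v10000:r10000" into its fields."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not an idmapping: {text!r}")
    upper, kind, lower, count = match.groups()
    return Idmapping(int(upper), int(lower), int(count), kind)


mount_idmapping = parse_idmapping("u0:v10000:r10000")
```

Keeping the ``k`` or ``v`` letter in the parsed result preserves the distinction between mount idmappings and caller or filesystem idmappings.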
  510. Remapping helpers
  511. ~~~~~~~~~~~~~~~~~
  512. Idmapping functions were added that translate between idmappings. They make use
  513. of the remapping algorithm we've introduced earlier. We're going to look at:
  514. - ``i_uid_into_vfsuid()`` and ``i_gid_into_vfsgid()``
  515. The ``i_*id_into_vfs*id()`` functions translate the filesystem's kernel ids into
  516. VFS ids in the mount's idmapping::
  517. /* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */
  518. from_kuid(filesystem, kid) = uid
  519. /* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */
  520. make_kuid(mount, uid) = kuid
  521. - ``mapped_fsuid()`` and ``mapped_fsgid()``
  522. The ``mapped_fs*id()`` functions translate the caller's kernel ids into
  523. kernel ids in the filesystem's idmapping. This translation is achieved by
  524. remapping the caller's VFS ids using the mount's idmapping::
  525. /* Map the caller's VFS id up into a userspace id in the mount's idmapping. */
  526. from_kuid(mount, kid) = uid
  527. /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
  528. make_kuid(filesystem, uid) = kuid
  529. - ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()``
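  530. Whenever ownership held in a ``vfsuid_t`` or ``vfsgid_t`` is reported or stored in a shared VFS object, these helpers turn it into a ``kuid_t`` or ``kgid_t``.
  531. Note that ``i_*id_into_vfs*id()`` and ``mapped_fs*id()`` invert each other. Consider the following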
  532. idmappings::
  533. caller idmapping: u0:k10000:r10000
  534. filesystem idmapping: u0:k20000:r10000
  535. mount idmapping: u0:v10000:r10000
  536. Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id
  537. to ``k21000`` according to its idmapping. This is what is stored in the
  538. inode's ``i_uid`` and ``i_gid`` fields.
  539. When the caller queries the ownership of this file via ``stat()`` the kernel
  540. would usually simply use the crossmapping algorithm and map the filesystem's
  541. kernel id up to a userspace id in the caller's idmapping.
  542. But when the caller is accessing the file on an idmapped mount the kernel will
  543. first call ``i_uid_into_vfsuid()`` thereby translating the filesystem's kernel
  544. id into a VFS id in the mount's idmapping::
  545. i_uid_into_vfsuid(k21000):
  546. /* Map the filesystem's kernel id up into a userspace id. */
  547. from_kuid(u0:k20000:r10000, k21000) = u1000
  548. /* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */
  549. make_kuid(u0:v10000:r10000, u1000) = v11000
  550. Finally, when the kernel reports the owner to the caller it will turn the
  551. VFS id in the mount's idmapping into a userspace id in the caller's
  552. idmapping::
  553. k11000 = vfsuid_into_kuid(v11000)
  554. from_kuid(u0:k10000:r10000, k11000) = u1000
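The ``stat()`` walk-through above can be replayed with a small numeric model. This is a Python sketch under simplifying assumptions, not kernel code: ids are plain integers, unmappable ids become ``None``, and ``Idmap`` and ``stat_uid()`` are names invented for this example.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    upper: int  # first userspace id, the "u" field
    lower: int  # first kernel or VFS id, the "k" or "v" field
    count: int  # length of the range, the "r" field


def make_kuid(idmap: Idmap, uid: int) -> Optional[int]:
    """Map a userspace id down; None if it is outside the idmapping."""
    if idmap.upper <= uid < idmap.upper + idmap.count:
        return idmap.lower + (uid - idmap.upper)
    return None


def from_kuid(idmap: Idmap, kid: int) -> Optional[int]:
    """Map a kernel or VFS id up; None if it is outside the idmapping."""
    if idmap.lower <= kid < idmap.lower + idmap.count:
        return idmap.upper + (kid - idmap.lower)
    return None


def i_uid_into_vfsuid(mount: Idmap, filesystem: Idmap, i_uid: int) -> Optional[int]:
    """Filesystem kernel id -> VFS id in the mount's idmapping."""
    uid = from_kuid(filesystem, i_uid)
    return None if uid is None else make_kuid(mount, uid)


def stat_uid(caller: Idmap, mount: Idmap, filesystem: Idmap, i_uid: int) -> Optional[int]:
    """Userspace id reported to the caller for an inode owned by i_uid."""
    vfsuid = i_uid_into_vfsuid(mount, filesystem, i_uid)
    if vfsuid is None:
        return None
    # vfsuid_into_kuid() keeps the number, so v11000 becomes k11000.
    return from_kuid(caller, vfsuid)


caller = Idmap(0, 10000, 10000)
filesystem = Idmap(0, 20000, 10000)
mount = Idmap(0, 10000, 10000)

vfsuid = i_uid_into_vfsuid(mount, filesystem, 21000)
reported = stat_uid(caller, mount, filesystem, 21000)
```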
  555. We can test whether this algorithm really works by verifying what happens when
  556. we create a new file. Let's say the user is creating a file with ``u1000``.
  557. The kernel maps this to ``k11000`` in the caller's idmapping. Usually the
  558. kernel would now apply the crossmapping, verifying that ``k11000`` can be
  559. mapped to a userspace id in the filesystem's idmapping. Since ``k11000`` can't
  560. be mapped up in the filesystem's idmapping directly this creation request
  561. fails.
  562. But when the caller is accessing the file on an idmapped mount the kernel will
  563. first call ``mapped_fs*id()`` thereby translating the caller's kernel id ``k11000``,
  564. taken as the VFS id ``v11000``, into a kernel id in the filesystem's idmapping::
  565. mapped_fsuid(v11000):
  566. /* Map the caller's VFS id up into a userspace id in the mount's idmapping. */
  567. from_kuid(u0:v10000:r10000, v11000) = u1000
  568. /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
  569. make_kuid(u0:k20000:r10000, u1000) = k21000
  570. When finally writing to disk the kernel will then map ``k21000`` up into a
  571. userspace id in the filesystem's idmapping::
  572. /* mapped_fsuid() already returned a kernel id in the filesystem's idmapping. */
  573. from_kuid(u0:k20000:r10000, k21000) = u1000
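The creation path can be checked the same way. The sketch below is a Python model that tracks numeric ids only and ignores the distinction between kernel and VFS id types; ``Idmap`` is a hypothetical helper type and ``None`` stands in for an id that cannot be mapped.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    upper: int  # first userspace id, the "u" field
    lower: int  # first kernel or VFS id, the "k" or "v" field
    count: int  # length of the range, the "r" field


def make_kuid(idmap: Idmap, uid: int) -> Optional[int]:
    """Map a userspace id down; None if it is outside the idmapping."""
    if idmap.upper <= uid < idmap.upper + idmap.count:
        return idmap.lower + (uid - idmap.upper)
    return None


def from_kuid(idmap: Idmap, kid: int) -> Optional[int]:
    """Map a kernel or VFS id up; None if it is outside the idmapping."""
    if idmap.lower <= kid < idmap.lower + idmap.count:
        return idmap.upper + (kid - idmap.lower)
    return None


def mapped_fsuid(mount: Idmap, filesystem: Idmap, caller_kid: int) -> Optional[int]:
    """Caller's id -> id in the filesystem's idmapping, via the mount."""
    uid = from_kuid(mount, caller_kid)
    return None if uid is None else make_kuid(filesystem, uid)


caller = Idmap(0, 10000, 10000)
filesystem = Idmap(0, 20000, 10000)
mount = Idmap(0, 10000, 10000)

caller_kid = make_kuid(caller, 1000)
# Plain crossmapping: the caller's id has no mapping in the filesystem.
without_mount = from_kuid(filesystem, caller_kid)
# With the idmapped mount the id is remapped first and then maps up cleanly.
fs_kid = mapped_fsuid(mount, filesystem, caller_kid)
on_disk = from_kuid(filesystem, fs_kid)
```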
  574. As we can see, we end up with an invertible and therefore information
  575. preserving algorithm. A file created from ``u1000`` on an idmapped mount will
  576. also be reported as being owned by ``u1000`` and vice versa.
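The invertibility claim can be tested exhaustively for the idmappings of this example. The following Python sketch is a numeric model only; ``Idmap`` and ``round_trips()`` are names made up for the check, and the loop covers every id in the mount's range.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    upper: int  # first userspace id, the "u" field
    lower: int  # first kernel or VFS id, the "k" or "v" field
    count: int  # length of the range, the "r" field


def make_kuid(idmap: Idmap, uid: int) -> Optional[int]:
    """Map a userspace id down; None if it is outside the idmapping."""
    if idmap.upper <= uid < idmap.upper + idmap.count:
        return idmap.lower + (uid - idmap.upper)
    return None


def from_kuid(idmap: Idmap, kid: int) -> Optional[int]:
    """Map a kernel or VFS id up; None if it is outside the idmapping."""
    if idmap.lower <= kid < idmap.lower + idmap.count:
        return idmap.upper + (kid - idmap.lower)
    return None


filesystem = Idmap(0, 20000, 10000)
mount = Idmap(0, 10000, 10000)


def round_trips(kid: int) -> bool:
    """True if creating with `kid` and reading back yields `kid` again."""
    # mapped_fsuid(): mount idmapping up, filesystem idmapping down.
    uid = from_kuid(mount, kid)
    if uid is None:
        return False
    fs_kid = make_kuid(filesystem, uid)
    if fs_kid is None:
        return False
    # i_uid_into_vfsuid(): filesystem idmapping up, mount idmapping down.
    back = make_kuid(mount, from_kuid(filesystem, fs_kid))
    return back == kid


all_round_trip = all(round_trips(kid) for kid in range(10000, 20000))
```

Ids outside the mount's range do not round-trip in this model because they cannot be mapped in the first place.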
  577. Let's now briefly reconsider the failing examples from earlier in the context
  578. of idmapped mounts.
  579. Example 2 reconsidered
  580. ~~~~~~~~~~~~~~~~~~~~~~
  581. ::
  582. caller id: u1000
  583. caller idmapping: u0:k10000:r10000
  584. filesystem idmapping: u0:k20000:r10000
  585. mount idmapping: u0:v10000:r10000
  586. When the caller is using a non-initial idmapping the common case is to attach
  587. the same idmapping to the mount. We now perform three steps:
  588. 1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
  589. make_kuid(u0:k10000:r10000, u1000) = k11000
  590. 2. Translate the caller's VFS id into a kernel id in the filesystem's
  591. idmapping::
  592. mapped_fsuid(v11000):
  593. /* Map the VFS id up into a userspace id in the mount's idmapping. */
  594. from_kuid(u0:v10000:r10000, v11000) = u1000
  595. /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
  596. make_kuid(u0:k20000:r10000, u1000) = k21000
  597. 3. Verify that the caller's kernel ids can be mapped to userspace ids in the
  598. filesystem's idmapping::
  599. from_kuid(u0:k20000:r10000, k21000) = u1000
  600. So the ownership that lands on disk will be ``u1000``.
  601. Example 3 reconsidered
  602. ~~~~~~~~~~~~~~~~~~~~~~
  603. ::
  604. caller id: u1000
  605. caller idmapping: u0:k10000:r10000
  606. filesystem idmapping: u0:k0:r4294967295
  607. mount idmapping: u0:v10000:r10000
  608. The same translation algorithm works with the third example.
  609. 1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
  610. make_kuid(u0:k10000:r10000, u1000) = k11000
  611. 2. Translate the caller's VFS id into a kernel id in the filesystem's
  612. idmapping::
  613. mapped_fsuid(v11000):
  614. /* Map the VFS id up into a userspace id in the mount's idmapping. */
  615. from_kuid(u0:v10000:r10000, v11000) = u1000
  616. /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
  617. make_kuid(u0:k0:r4294967295, u1000) = k1000
  618. 3. Verify that the caller's kernel ids can be mapped to userspace ids in the
  619. filesystem's idmapping::
  620. from_kuid(u0:k0:r4294967295, k1000) = u1000
  621. So the ownership that lands on disk will be ``u1000``.
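Examples 2 and 3 differ only in the filesystem's idmapping, so one function can replay both. This is a Python sketch with numeric ids, where ``None`` marks a failed mapping; ``Idmap`` and ``create_owner_on_disk()`` are hypothetical names that simply follow the three numbered steps.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    upper: int  # first userspace id, the "u" field
    lower: int  # first kernel or VFS id, the "k" or "v" field
    count: int  # length of the range, the "r" field


def make_kuid(idmap: Idmap, uid: int) -> Optional[int]:
    """Map a userspace id down; None if it is outside the idmapping."""
    if idmap.upper <= uid < idmap.upper + idmap.count:
        return idmap.lower + (uid - idmap.upper)
    return None


def from_kuid(idmap: Idmap, kid: int) -> Optional[int]:
    """Map a kernel or VFS id up; None if it is outside the idmapping."""
    if idmap.lower <= kid < idmap.lower + idmap.count:
        return idmap.upper + (kid - idmap.lower)
    return None


def create_owner_on_disk(uid: int, caller: Idmap, mount: Idmap,
                         filesystem: Idmap) -> Optional[int]:
    """Userspace id that lands on disk, or None if any step fails."""
    # 1. Caller's userspace id -> kernel id in the caller's idmapping.
    kid = make_kuid(caller, uid)
    if kid is None:
        return None
    # 2. mapped_fsuid(): up through the mount, down into the filesystem.
    mount_uid = from_kuid(mount, kid)
    if mount_uid is None:
        return None
    fs_kid = make_kuid(filesystem, mount_uid)
    if fs_kid is None:
        return None
    # 3. Verify the result maps up in the filesystem's idmapping.
    return from_kuid(filesystem, fs_kid)


caller = Idmap(0, 10000, 10000)
mount = Idmap(0, 10000, 10000)

example_2 = create_owner_on_disk(1000, caller, mount, Idmap(0, 20000, 10000))
example_3 = create_owner_on_disk(1000, caller, mount, Idmap(0, 0, 4294967295))
```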
  622. Example 4 reconsidered
  623. ~~~~~~~~~~~~~~~~~~~~~~
  624. ::
  625. file id: u1000
  626. caller idmapping: u0:k10000:r10000
  627. filesystem idmapping: u0:k0:r4294967295
  628. mount idmapping: u0:v10000:r10000
  629. In order to report ownership to userspace the kernel now does three steps using
  630. the translation algorithm we introduced earlier:
  631. 1. Map the userspace id on disk down into a kernel id in the filesystem's
  632. idmapping::
  633. make_kuid(u0:k0:r4294967295, u1000) = k1000
  634. 2. Translate the kernel id into a VFS id in the mount's idmapping::
  635. i_uid_into_vfsuid(k1000):
  636. /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
  637. from_kuid(u0:k0:r4294967295, k1000) = u1000
  638. /* Map the userspace id down into a VFS id in the mount's idmapping. */
  639. make_kuid(u0:v10000:r10000, u1000) = v11000
  640. 3. Map the VFS id up into a userspace id in the caller's idmapping::
  641. k11000 = vfsuid_into_kuid(v11000)
  642. from_kuid(u0:k10000:r10000, k11000) = u1000
  643. Earlier, the caller's kernel id couldn't be crossmapped in the filesystem's
  644. idmapping. With the idmapped mount in place it now can be crossmapped into the
  645. filesystem's idmapping via the mount's idmapping. The file is now reported as
  646. owned by ``u1000`` according to the mount's idmapping.
  647. Example 5 reconsidered
  648. ~~~~~~~~~~~~~~~~~~~~~~
  649. ::
  650. file id: u1000
  651. caller idmapping: u0:k10000:r10000
  652. filesystem idmapping: u0:k20000:r10000
  653. mount idmapping: u0:v10000:r10000
  654. Again, in order to report ownership to userspace the kernel now does three
  655. steps using the translation algorithm we introduced earlier:
  656. 1. Map the userspace id on disk down into a kernel id in the filesystem's
  657. idmapping::
  658. make_kuid(u0:k20000:r10000, u1000) = k21000
  659. 2. Translate the kernel id into a VFS id in the mount's idmapping::
  660. i_uid_into_vfsuid(k21000):
  661. /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
  662. from_kuid(u0:k20000:r10000, k21000) = u1000
  663. /* Map the userspace id down into a VFS id in the mount's idmapping. */
  664. make_kuid(u0:v10000:r10000, u1000) = v11000
  665. 3. Map the VFS id up into a userspace id in the caller's idmapping::
  666. k11000 = vfsuid_into_kuid(v11000)
  667. from_kuid(u0:k10000:r10000, k11000) = u1000
  668. Earlier, the file's kernel id couldn't be crossmapped in the filesystem's
  669. idmapping. With the idmapped mount in place it now can be crossmapped into the
  670. filesystem's idmapping via the mount's idmapping. The file is now owned by
  671. ``u1000`` according to the mount's idmapping.
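The reporting path of Examples 4 and 5 can be compared with and without the idmapped mount. The Python sketch below is a numeric model; ``Idmap`` and ``reported_owner()`` are hypothetical names, and the fallback value ``65534`` is an assumption standing in for the overflow id that is reported when an id cannot be mapped.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    upper: int  # first userspace id, the "u" field
    lower: int  # first kernel or VFS id, the "k" or "v" field
    count: int  # length of the range, the "r" field


def make_kuid(idmap: Idmap, uid: int) -> Optional[int]:
    """Map a userspace id down; None if it is outside the idmapping."""
    if idmap.upper <= uid < idmap.upper + idmap.count:
        return idmap.lower + (uid - idmap.upper)
    return None


def from_kuid(idmap: Idmap, kid: int) -> Optional[int]:
    """Map a kernel or VFS id up; None if it is outside the idmapping."""
    if idmap.lower <= kid < idmap.lower + idmap.count:
        return idmap.upper + (kid - idmap.lower)
    return None


OVERFLOW_UID = 65534  # assumed overflow id for unmappable owners


def reported_owner(disk_uid: int, caller: Idmap, filesystem: Idmap,
                   mount: Optional[Idmap] = None) -> int:
    """Userspace id shown to the caller for a file owned by disk_uid."""
    # 1. On-disk id -> kernel id in the filesystem's idmapping.
    kid = make_kuid(filesystem, disk_uid)
    # 2. With an idmapped mount: i_uid_into_vfsuid(), then keep the number.
    if kid is not None and mount is not None:
        kid = make_kuid(mount, from_kuid(filesystem, kid))
    # 3. Map up into the caller's idmapping, or fall back to the overflow id.
    uid = None if kid is None else from_kuid(caller, kid)
    return OVERFLOW_UID if uid is None else uid


caller = Idmap(0, 10000, 10000)
mount = Idmap(0, 10000, 10000)
fs_example_4 = Idmap(0, 0, 4294967295)
fs_example_5 = Idmap(0, 20000, 10000)

example_4 = reported_owner(1000, caller, fs_example_4, mount)
example_5 = reported_owner(1000, caller, fs_example_5, mount)
example_4_plain = reported_owner(1000, caller, fs_example_4)
example_5_plain = reported_owner(1000, caller, fs_example_5)
```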
  672. Changing ownership on a home directory
  673. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  674. We've seen above how idmapped mounts can be used to translate between
  675. idmappings when either the caller, the filesystem or both use a non-initial
  676. idmapping. A wide range of use cases exists when the caller is using
  677. a non-initial idmapping. This mostly happens in the context of containerized
  678. workloads. The consequence is, as we have seen, that for both filesystems
  679. mounted with the initial idmapping and filesystems mounted with non-initial
  680. idmappings, access to the filesystem isn't working because the kernel ids can't
  681. be crossmapped between the caller's and the filesystem's idmapping.
  682. As we've seen above idmapped mounts provide a solution to this by remapping the
  683. caller's or filesystem's idmapping according to the mount's idmapping.
  684. Aside from containerized workloads, idmapped mounts have the advantage that
  685. they also work when both the caller and the filesystem use the initial
  686. idmapping which means users on the host can change the ownership of directories
  687. and files on a per-mount basis.
  688. Consider our previous example where a user has their home directory on portable
  689. storage. At home they have id ``u1000`` and all files in their home directory
  690. are owned by ``u1000`` whereas at uni or work they have login id ``u1125``.
  691. Taking their home directory with them becomes problematic. They can't easily
  692. access their files, they might not be able to write to disk without applying
  693. lax permissions or ACLs and even if they can, they will end up with an annoying
  694. mix of files and directories owned by ``u1000`` and ``u1125``.
  695. Idmapped mounts solve this problem. A user can create an idmapped
  696. mount for their home directory on their work computer or their computer at home
  697. depending on what ownership they would prefer to end up on the portable storage
  698. itself.
  699. Let's assume they want all files on disk to belong to ``u1000``. When the user
  700. plugs in their portable storage at their workstation they can set up a job that
  701. creates an idmapped mount with the minimal idmapping ``u1000:v1125:r1``. So now
  702. when they create a file the kernel performs the following steps we already know
  703. from above::
  704. caller id: u1125
  705. caller idmapping: u0:k0:r4294967295
  706. filesystem idmapping: u0:k0:r4294967295
  707. mount idmapping: u1000:v1125:r1
  708. 1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
  709. make_kuid(u0:k0:r4294967295, u1125) = k1125
  710. 2. Translate the caller's VFS id into a kernel id in the filesystem's
  711. idmapping::
  712. mapped_fsuid(v1125):
  713. /* Map the VFS id up into a userspace id in the mount's idmapping. */
  714. from_kuid(u1000:v1125:r1, v1125) = u1000
  715. /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
  716. make_kuid(u0:k0:r4294967295, u1000) = k1000
  717. 3. Verify that the caller's kernel ids can be mapped to userspace ids in the
  718. filesystem's idmapping::
  719. from_kuid(u0:k0:r4294967295, k1000) = u1000
  720. So ultimately the file will be created with ``u1000`` on disk.
  721. Now let's briefly look at what ownership the caller with id ``u1125`` will see
  722. on their work computer:
  723. ::
  724. file id: u1000
  725. caller idmapping: u0:k0:r4294967295
  726. filesystem idmapping: u0:k0:r4294967295
  727. mount idmapping: u1000:v1125:r1
  728. 1. Map the userspace id on disk down into a kernel id in the filesystem's
  729. idmapping::
  730. make_kuid(u0:k0:r4294967295, u1000) = k1000
  731. 2. Translate the kernel id into a VFS id in the mount's idmapping::
  732. i_uid_into_vfsuid(k1000):
  733. /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
  734. from_kuid(u0:k0:r4294967295, k1000) = u1000
  735. /* Map the userspace id down into a VFS id in the mount's idmapping. */
  736. make_kuid(u1000:v1125:r1, u1000) = v1125
  737. 3. Map the VFS id up into a userspace id in the caller's idmapping::
  738. k1125 = vfsuid_into_kuid(v1125)
  739. from_kuid(u0:k0:r4294967295, k1125) = u1125
  740. So ultimately the file will be reported to the caller as belonging to ``u1125``
  741. which is the caller's userspace id on their workstation in our example.
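Both directions of the home directory example can be replayed with the single-id mount idmapping ``u1000:v1125:r1``. This Python sketch tracks numeric ids only and uses ``None`` for ids that have no mapping; ``Idmap``, ``uid_written_to_disk()`` and ``uid_shown_to_caller()`` are hypothetical names chosen for this illustration.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    upper: int  # first userspace id, the "u" field
    lower: int  # first kernel or VFS id, the "k" or "v" field
    count: int  # length of the range, the "r" field


def make_kuid(idmap: Idmap, uid: int) -> Optional[int]:
    """Map a userspace id down; None if it is outside the idmapping."""
    if idmap.upper <= uid < idmap.upper + idmap.count:
        return idmap.lower + (uid - idmap.upper)
    return None


def from_kuid(idmap: Idmap, kid: int) -> Optional[int]:
    """Map a kernel or VFS id up; None if it is outside the idmapping."""
    if idmap.lower <= kid < idmap.lower + idmap.count:
        return idmap.upper + (kid - idmap.lower)
    return None


initial = Idmap(0, 0, 4294967295)  # caller and filesystem idmapping
mount = Idmap(1000, 1125, 1)       # mount idmapping u1000:v1125:r1


def uid_written_to_disk(caller_uid: int) -> Optional[int]:
    """Owner stored on disk when caller_uid creates a file on the mount."""
    kid = make_kuid(initial, caller_uid)
    if kid is None:
        return None
    uid = from_kuid(mount, kid)          # mapped_fsuid(), first half
    if uid is None:
        return None
    fs_kid = make_kuid(initial, uid)     # mapped_fsuid(), second half
    return None if fs_kid is None else from_kuid(initial, fs_kid)


def uid_shown_to_caller(disk_uid: int) -> Optional[int]:
    """Owner reported on the mount for a file stored with disk_uid."""
    kid = make_kuid(initial, disk_uid)
    if kid is None:
        return None
    vfsuid = make_kuid(mount, from_kuid(initial, kid))  # i_uid_into_vfsuid()
    return None if vfsuid is None else from_kuid(initial, vfsuid)


written = uid_written_to_disk(1125)
shown = uid_shown_to_caller(1000)
```

Because the idmapping covers exactly one id, every other id is unmapped on this mount in the model, for example ``uid_written_to_disk(1126)`` yields ``None``.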
  742. The raw userspace id that is put on disk is ``u1000`` so when the user takes
  743. their home directory back to their home computer where they are assigned
  744. ``u1000`` using the initial idmapping and mount the filesystem with the initial
  745. idmapping they will see all those files owned by ``u1000``.