===========
NFS LOCALIO
===========

Overview
========

The LOCALIO auxiliary RPC protocol allows the Linux NFS client and
server to reliably handshake to determine if they are on the same
host. Select "NFS client and server support for LOCALIO auxiliary
protocol" in menuconfig to enable CONFIG_NFS_LOCALIO in the kernel
config (both CONFIG_NFS_FS and CONFIG_NFSD must also be enabled).
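
For example, a minimal kernel config fragment with LOCALIO enabled
(with both the NFS client and server built in) might look like::

  CONFIG_NFS_FS=y
  CONFIG_NFSD=y
  CONFIG_NFS_LOCALIO=y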

Once an NFS client and server handshake as "local", the client will
bypass the network RPC protocol for read, write and commit operations.
Due to this XDR and RPC bypass, these operations will operate faster.

The LOCALIO auxiliary protocol's implementation, which uses the same
connection as NFS traffic, follows the pattern established by the NFS
ACL protocol extension.

The LOCALIO auxiliary protocol is needed to allow robust discovery of
clients local to their servers. In a private implementation that
preceded use of this LOCALIO protocol, a fragile sockaddr network
address based match against all local network interfaces was attempted.
But unlike the LOCALIO protocol, the sockaddr-based matching didn't
handle use of iptables or containers.

The robust handshake between local client and server is just the
beginning; the ultimate use case this locality makes possible is that
the client is able to open files and issue reads, writes and commits
directly to the server without having to go over the network. The
requirement is to perform these loopback NFS operations as efficiently
as possible; this is particularly useful for container use cases
(e.g. kubernetes) where it is possible to run an IO job local to the
server.
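
As a concrete example, a loopback NFS mount on a host that is also
running knfsd is all that is needed to exercise LOCALIO (the exported
directory /export and the mount point /mnt are assumptions here)::

  mount -t nfs -o vers=4.2 localhost:/export /mnt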

The performance advantage realized from LOCALIO's ability to bypass
using XDR and RPC for reads, writes and commits can be extreme, e.g.:

fio for 20 secs with directio, qd of 8, 16 libaio threads:

- With LOCALIO::

    4K read:    IOPS=979k,  BW=3825MiB/s (4011MB/s)(74.7GiB/20002msec)
    4K write:   IOPS=165k,  BW=646MiB/s  (678MB/s)(12.6GiB/20002msec)
    128K read:  IOPS=402k,  BW=49.1GiB/s (52.7GB/s)(982GiB/20002msec)
    128K write: IOPS=11.5k, BW=1433MiB/s (1503MB/s)(28.0GiB/20004msec)

- Without LOCALIO::

    4K read:    IOPS=79.2k, BW=309MiB/s  (324MB/s)(6188MiB/20003msec)
    4K write:   IOPS=59.8k, BW=234MiB/s  (245MB/s)(4671MiB/20002msec)
    128K read:  IOPS=33.9k, BW=4234MiB/s (4440MB/s)(82.7GiB/20004msec)
    128K write: IOPS=11.5k, BW=1434MiB/s (1504MB/s)(28.0GiB/20011msec)

fio for 20 secs with directio, qd of 8, 1 libaio thread:

- With LOCALIO::

    4K read:    IOPS=230k,  BW=898MiB/s  (941MB/s)(17.5GiB/20001msec)
    4K write:   IOPS=22.6k, BW=88.3MiB/s (92.6MB/s)(1766MiB/20001msec)
    128K read:  IOPS=38.8k, BW=4855MiB/s (5091MB/s)(94.8GiB/20001msec)
    128K write: IOPS=11.4k, BW=1428MiB/s (1497MB/s)(27.9GiB/20001msec)

- Without LOCALIO::

    4K read:    IOPS=77.1k, BW=301MiB/s  (316MB/s)(6022MiB/20001msec)
    4K write:   IOPS=32.8k, BW=128MiB/s  (135MB/s)(2566MiB/20001msec)
    128K read:  IOPS=24.4k, BW=3050MiB/s (3198MB/s)(59.6GiB/20001msec)
    128K write: IOPS=11.4k, BW=1430MiB/s (1500MB/s)(27.9GiB/20001msec)
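
For reference, a fio invocation along these lines matches the 16-thread
case above (the file name, file size, and the use of random reads are
assumptions, not details taken from the runs above)::

  fio --name=localio-test --filename=/mnt/fio.dat --size=8G \
      --ioengine=libaio --direct=1 --iodepth=8 --numjobs=16 \
      --rw=randread --bs=4k --runtime=20 --time_based --group_reporting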

FAQ
===

1. What are the use cases for LOCALIO?

   a. Workloads where the NFS client and server are on the same host
      realize improved IO performance. In particular, it is common when
      running containerised workloads for jobs to find themselves
      running on the same host as the knfsd server being used for
      storage.

2. What are the requirements for LOCALIO?

   a. Bypass use of the network RPC protocol as much as possible. This
      includes bypassing XDR and RPC for open, read, write and commit
      operations.
   b. Allow client and server to autonomously discover if they are
      running local to each other without making any assumptions about
      the local network topology.
   c. Support the use of containers by being compatible with relevant
      namespaces (e.g. network, user, mount).
   d. Support all versions of NFS. NFSv3 is of particular importance
      because it has wide enterprise usage and pNFS flexfiles makes use
      of it for the data path.

3. Why doesn't LOCALIO just compare IP addresses or hostnames when
   deciding if the NFS client and server are co-located on the same
   host?

   Since one of the main use cases is containerised workloads, we
   cannot assume that IP addresses will be shared between the client
   and server. This sets up a requirement for a handshake protocol that
   needs to go over the same connection as the NFS traffic in order to
   identify that the client and the server really are running on the
   same host. The handshake uses a secret that is sent over the wire,
   and can be verified by both parties by comparing with a value stored
   in shared kernel memory if they are truly co-located.

4. Does LOCALIO improve pNFS flexfiles?

   Yes, LOCALIO complements pNFS flexfiles by allowing it to take
   advantage of NFS client and server locality. Policy that initiates
   client IO as closely to the server where the data is stored naturally
   benefits from the data path optimization LOCALIO provides.

5. Why not develop a new pNFS layout to enable LOCALIO?

   A new pNFS layout could be developed, but doing so would put the
   onus on the server to somehow discover that the client is co-located
   when deciding to hand out the layout.

   There is value in a simpler approach (as provided by LOCALIO) that
   allows the NFS client to negotiate and leverage locality without
   requiring more elaborate modeling and discovery of such locality in a
   more centralized manner.

6. Why is having the client perform a server-side file OPEN, without
   using RPC, beneficial? Is the benefit pNFS specific?

   Avoiding the use of XDR and RPC for file opens is beneficial to
   performance regardless of whether pNFS is used. Especially when
   dealing with small files, it is best to avoid going over the wire
   whenever possible; otherwise doing so could reduce or even negate
   the benefits of avoiding the wire for the small file I/O itself.
   Given LOCALIO's requirements, the current approach of having the
   client perform a server-side file open, without using RPC, is ideal.
   If in the future requirements change then we can adapt accordingly.

7. Why is LOCALIO only supported with UNIX Authentication (AUTH_UNIX)?

   Strong authentication is usually tied to the connection itself. It
   works by establishing a context that is cached by the server, and
   that acts as the key for discovering the authorisation token, which
   can then be passed to rpc.mountd to complete the authentication
   process. On the other hand, in the case of AUTH_UNIX, the credential
   that was passed over the wire is used directly as the key in the
   upcall to rpc.mountd. This simplifies the authentication process, and
   so makes AUTH_UNIX easier to support.

8. How do export options that translate RPC user IDs behave for LOCALIO
   operations (e.g. root_squash, all_squash)?

   Export options that translate user IDs are managed by nfsd_setuser(),
   which is called by nfsd_setuser_and_check_port(), which is in turn
   called by __fh_verify(). So they are handled exactly the same way for
   LOCALIO as they are for non-LOCALIO.

9. How does LOCALIO make certain that object lifetimes are managed
   properly given NFSD and NFS operate in different contexts?

   See the detailed "NFS Client and Server Interlock" section below.

RPC
===

The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL"
RPC method that allows the Linux NFS client to verify the local Linux
NFS server can see the nonce (single-use UUID) the client generated and
made available in nfs_common. This protocol isn't part of an IETF
standard, nor does it need to be, considering it is a Linux-to-Linux
auxiliary RPC protocol that amounts to an implementation detail.

The UUID_IS_LOCAL method encodes the client-generated uuid_t in terms of
the fixed UUID_SIZE (16 bytes). The fixed size opaque encode and decode
XDR methods are used instead of the less efficient variable sized
methods.

The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned
by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ )::

  Linux Kernel Organization   400122  nfslocalio

The LOCALIO protocol spec in rpcgen syntax is::

  /* raw RFC 9562 UUID */
  #define UUID_SIZE 16
  typedef u8 uuid_t<UUID_SIZE>;

  program NFS_LOCALIO_PROGRAM {
      version LOCALIO_V1 {
          void
              NULL(void) = 0;
          void
              UUID_IS_LOCAL(uuid_t) = 1;
      } = 1;
  } = 400122;

LOCALIO uses the same transport connection as NFS traffic. As such,
LOCALIO is not registered with rpcbind.
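
Consequently, program 400122 does not show up in the local portmapper;
a query such as the following is expected to produce no output::

  rpcinfo -p | grep 400122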

NFS Common and Client/Server Handshake
======================================

fs/nfs_common/nfslocalio.c provides interfaces that enable an NFS client
to generate a nonce (single-use UUID) and associated short-lived
nfs_uuid_t struct, and to register it with nfs_common for subsequent
lookup and verification by the NFS server; if matched, the NFS server
populates members in the nfs_uuid_t struct. The NFS client then uses
nfs_common to transfer the nfs_uuid_t from nfs_common's uuids_list to
the nn->nfsd_serv clients_list. See:
fs/nfs/localio.c:nfs_local_probe()

nfs_common's nfs_uuids list is the basis for LOCALIO enablement; as
such, it has members that point to nfsd memory for direct use by the
client (e.g. 'net' is the server's network namespace; through it the
client can access nn->nfsd_serv with proper rcu read access). It is
this client and server synchronization that enables advanced usage and
allows object lifetimes to span from the host kernel's nfsd to
per-container knfsd instances that are connected to nfs clients running
on the same local host.
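
To make that concrete, a simplified sketch of the nfs_uuid_t structure
follows. The member set reflects the description above (the auth_domain
member follows the Security section below); consult
include/linux/nfslocalio.h for the authoritative definition::

  /* Simplified sketch, not the verbatim kernel definition. */
  typedef struct {
      uuid_t uuid;             /* single-use nonce for the handshake */
      struct list_head list;   /* on nfs_common's uuids_list, then on
                                * the server's per-net clients_list */
      struct net __rcu *net;   /* the server's network namespace;
                                * cleared by NFSD's net-ns pre_exit */
      struct auth_domain *dom; /* the server's auth_domain */
  } nfs_uuid_t;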

NFS Client and Server Interlock
===============================

LOCALIO provides the nfs_uuid_t object and associated interfaces to
allow proper network namespace (net-ns) and NFSD object refcounting:

    We don't want to keep a long-term counted reference on each NFSD's
    net-ns in the client because that prevents a server container from
    completely shutting down.

    So we avoid taking a reference at all and rely on the per-cpu
    reference to the server (detailed below) being sufficient to keep
    the net-ns active. This involves allowing the NFSD's net-ns exit
    code to iterate all active clients and clear their ->net pointers
    (which are needed to find the per-cpu-refcount for the nfsd_serv).

Details:

    - Embed nfs_uuid_t in nfs_client. nfs_uuid_t provides a list_head
      that can be used to find the client. It does add the 16-byte
      uuid_t to nfs_client so it is bigger than needed (given that
      uuid_t is only used during the initial NFS client and server
      LOCALIO handshake to determine if they are local to each other).
      If that is really a problem we can find a fix.

    - When the nfs server confirms that the uuid_t is local, it moves
      the nfs_uuid_t onto a per-net-ns list in NFSD's nfsd_net.

    - When each server's net-ns is shutting down - in a "pre_exit"
      handler - all these nfs_uuid_t have their ->net cleared. There is
      an rcu_synchronize() call between pre_exit() handlers and exit()
      handlers so any caller that sees nfs_uuid_t ->net as not NULL can
      safely manage the per-cpu-refcount for nfsd_serv.

    - The client's nfs_uuid_t is passed to nfsd_open_local_fh() so it
      can safely dereference ->net in a private rcu_read_lock() section
      to allow safe access to the associated nfsd_net and nfsd_serv.

So LOCALIO required the introduction and use of NFSD's percpu_ref to
interlock nfsd_destroy_serv() and nfsd_open_local_fh(), to ensure each
nn->nfsd_serv is not destroyed while in use by nfsd_open_local_fh(), and
warrants a more detailed explanation:

    nfsd_open_local_fh() uses nfsd_serv_try_get() before opening its
    nfsd_file handle and then the caller (NFS client) must drop the
    reference for the nfsd_file and associated nn->nfsd_serv using
    nfs_file_put_local() once it has completed its IO.

    This interlock works only because nfsd_open_local_fh() is afforded
    the ability to safely deal with the possibility that the NFSD's
    net-ns (and nfsd_net by association) may have been destroyed by
    nfsd_destroy_serv() via nfsd_shutdown_net() -- which is only
    possible given the nfs_uuid_t ->net pointer management detailed
    above.
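
A minimal sketch of that percpu_ref based interlock follows; the member
name nfsd_serv_ref is an assumption here, see fs/nfsd/nfssvc.c for the
real code::

  /* Sketch only: get a reference unless the server is being torn down. */
  bool nfsd_serv_try_get(struct nfsd_net *nn)
  {
      /* Fails once nfsd_destroy_serv() has begun killing the ref. */
      return percpu_ref_tryget_live(&nn->nfsd_serv_ref);
  }

  /* Sketch only: drop the reference once the caller's IO has completed. */
  void nfsd_serv_put(struct nfsd_net *nn)
  {
      /* The final put allows nfsd_destroy_serv() to complete. */
      percpu_ref_put(&nn->nfsd_serv_ref);
  }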

All told, this elaborate interlock of the NFS client and server has been
verified to fix an easy-to-hit crash that would occur if an NFSD
instance running in a container, with a LOCALIO client mounted, is shut
down. Upon restart of the container and associated NFSD, the client
would go on to crash due to a NULL pointer dereference caused by the
LOCALIO client attempting to call nfsd_open_local_fh(), using
nn->nfsd_serv, without having a proper reference on nn->nfsd_serv.

NFS Client issues IO instead of Server
======================================

Because LOCALIO is focused on protocol bypass to achieve improved IO
performance, alternatives to the traditional NFS wire protocol (SUNRPC
with XDR) must be provided to access the backing filesystem.

See fs/nfs/localio.c:nfs_local_open_fh() and
fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes
focused use of select nfs server objects to allow a client local to a
server to open a file pointer without needing to go over the network.

The client's fs/nfs/localio.c:nfs_local_open_fh() will call into the
server's fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access
both the associated nfsd network namespace and nn->nfsd_serv under RCU.
If nfsd_open_local_fh() finds that the client no longer sees valid nfsd
objects (be it struct net or nn->nfsd_serv) it returns -ENXIO to
nfs_local_open_fh() and the client will try to reestablish the LOCALIO
resources needed by calling nfs_local_probe() again. This recovery is
needed if/when an nfsd instance running in a container reboots while a
LOCALIO client is connected to it.
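
Sketched in code (simplified and hypothetical; the real logic lives in
fs/nfsd/localio.c:nfsd_open_local_fh()), that access pattern is
roughly::

  struct nfsd_net *nn;
  struct net *net;

  rcu_read_lock();
  net = rcu_dereference(uuid->net);
  if (!net) {                      /* server's net-ns already exiting */
      rcu_read_unlock();
      return -ENXIO;
  }
  nn = net_generic(net, nfsd_net_id);
  if (!nfsd_serv_try_get(nn)) {    /* nfsd_serv being destroyed */
      rcu_read_unlock();
      return -ENXIO;
  }
  rcu_read_unlock();
  /* ... nn->nfsd_serv is now safe to use until the client drops its
   * reference with nfs_file_put_local() ... */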

Once the client has an open nfsd_file pointer it will issue reads,
writes and commits directly to the underlying local filesystem (normally
done by the nfs server). As such, for these operations, the NFS client
is issuing IO to the underlying local filesystem that it is sharing with
the NFS server. See: fs/nfs/localio.c:nfs_local_doio() and
fs/nfs/localio.c:nfs_local_commit().

With normal NFS that makes use of RPC to issue IO to the server, if an
application uses O_DIRECT the NFS client will bypass the pagecache but
the NFS server will not, because the NFS server's use of buffered IO
affords applications the ability to be less precise with their alignment
when issuing IO to the NFS client. LOCALIO can be configured to use
O_DIRECT semantics by setting the 'localio_O_DIRECT_semantics' nfs
module parameter to Y, e.g.::

  echo Y > /sys/module/nfs/parameters/localio_O_DIRECT_semantics

Once enabled, it will cause LOCALIO to use O_DIRECT semantics (this may
cause IO to fail if applications do not properly align their IO).
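
To make the setting persistent across module reloads, a modprobe
options file can be used (the file name here is arbitrary)::

  # /etc/modprobe.d/nfs-localio.conf
  options nfs localio_O_DIRECT_semantics=Y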

Security
========

LOCALIO is only supported when UNIX-style authentication (AUTH_UNIX, aka
AUTH_SYS) is used.

Care is taken to ensure the same NFS security mechanisms are used
(authentication, etc) regardless of whether LOCALIO or regular NFS
access is used. The auth_domain established as part of the traditional
NFS client access to the NFS server is also used for LOCALIO.

Relative to containers, LOCALIO gives the client access to the network
namespace the server has. This is required to allow the client to access
the server's per-namespace nfsd_net struct. With traditional NFS, the
client is afforded this same level of access (albeit in terms of the NFS
protocol via SUNRPC). No other namespaces (user, mount, etc) have been
altered or purposely extended from the server to the client.

Testing
=======

The LOCALIO auxiliary protocol and associated NFS LOCALIO read, write
and commit access have proven stable against various test scenarios:

- Client and server both on the same host.

- All permutations of client and server support enablement for both
  local and remote client and server.

- Testing against NFS storage products that don't support the LOCALIO
  protocol was also performed.

- Client on host, server within a container (for both v3 and v4.2).
  The container testing was in terms of podman managed containers and
  includes successful container stop/restart scenarios.

- Formalizing these test scenarios in terms of existing test
  infrastructure is ongoing. Initial regular coverage is provided in
  terms of ktest running xfstests against a LOCALIO-enabled NFS loopback
  mount configuration, and includes lockdep and KASAN coverage, see:
  https://evilpiepirate.org/~testdashboard/ci?user=snitzer&branch=snitm-nfs-next
  https://github.com/koverstreet/ktest

- Various kdevops testing (in terms of "Chuck's BuildBot") has been
  performed to regularly verify the LOCALIO changes haven't caused any
  regressions to non-LOCALIO NFS use cases.

- All of Hammerspace's various sanity tests pass with LOCALIO enabled
  (this includes numerous pNFS and flexfiles tests).