===========
NFS LOCALIO
===========

Overview
========

The LOCALIO auxiliary RPC protocol allows the Linux NFS client and
server to reliably handshake to determine if they are on the same
host. Select "NFS client and server support for LOCALIO auxiliary
protocol" in menuconfig to enable CONFIG_NFS_LOCALIO in the kernel
config (both CONFIG_NFS_FS and CONFIG_NFSD must also be enabled).
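
For example, a kernel config needs at least the following to be set
(illustrative; building the NFS client and server as modules with =m
works equally well)::

    CONFIG_NFS_FS=y
    CONFIG_NFSD=y
    CONFIG_NFS_LOCALIO=y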

Once an NFS client and server handshake as "local", the client will
bypass the network RPC protocol for read, write and commit operations.
Due to this XDR and RPC bypass, these operations complete faster.

The LOCALIO auxiliary protocol's implementation, which uses the same
connection as NFS traffic, follows the pattern established by the NFS
ACL protocol extension.

The LOCALIO auxiliary protocol is needed to allow robust discovery of
clients local to their servers. In a private implementation that
preceded use of this LOCALIO protocol, a fragile sockaddr network
address based match against all local network interfaces was attempted.
But unlike the LOCALIO protocol, the sockaddr-based matching didn't
handle use of iptables or containers.

The robust handshake between local client and server is just the
beginning; the ultimate use case this locality makes possible is that
the client is able to open files and issue reads, writes and commits
directly to the server without having to go over the network. The
requirement is to perform these loopback NFS operations as efficiently
as possible; this is particularly useful for container use cases
(e.g. Kubernetes) where it is possible to run an IO job local to the
server.

The performance advantage realized from LOCALIO's ability to bypass
using XDR and RPC for reads, writes and commits can be extreme, e.g.:

fio for 20 secs with directio, qd of 8, 16 libaio threads:

- With LOCALIO:
    4K read:    IOPS=979k,  BW=3825MiB/s (4011MB/s)(74.7GiB/20002msec)
    4K write:   IOPS=165k,  BW=646MiB/s  (678MB/s)(12.6GiB/20002msec)
    128K read:  IOPS=402k,  BW=49.1GiB/s (52.7GB/s)(982GiB/20002msec)
    128K write: IOPS=11.5k, BW=1433MiB/s (1503MB/s)(28.0GiB/20004msec)

- Without LOCALIO:
    4K read:    IOPS=79.2k, BW=309MiB/s  (324MB/s)(6188MiB/20003msec)
    4K write:   IOPS=59.8k, BW=234MiB/s  (245MB/s)(4671MiB/20002msec)
    128K read:  IOPS=33.9k, BW=4234MiB/s (4440MB/s)(82.7GiB/20004msec)
    128K write: IOPS=11.5k, BW=1434MiB/s (1504MB/s)(28.0GiB/20011msec)

fio for 20 secs with directio, qd of 8, 1 libaio thread:

- With LOCALIO:
    4K read:    IOPS=230k,  BW=898MiB/s  (941MB/s)(17.5GiB/20001msec)
    4K write:   IOPS=22.6k, BW=88.3MiB/s (92.6MB/s)(1766MiB/20001msec)
    128K read:  IOPS=38.8k, BW=4855MiB/s (5091MB/s)(94.8GiB/20001msec)
    128K write: IOPS=11.4k, BW=1428MiB/s (1497MB/s)(27.9GiB/20001msec)

- Without LOCALIO:
    4K read:    IOPS=77.1k, BW=301MiB/s  (316MB/s)(6022MiB/20001msec)
    4K write:   IOPS=32.8k, BW=128MiB/s  (135MB/s)(2566MiB/20001msec)
    128K read:  IOPS=24.4k, BW=3050MiB/s (3198MB/s)(59.6GiB/20001msec)
    128K write: IOPS=11.4k, BW=1430MiB/s (1500MB/s)(27.9GiB/20001msec)
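
For reference, results like the above can be gathered with an fio
invocation along the following lines (illustrative, not the exact
command used; vary --bs, --rw and --numjobs to match each row)::

    fio --name=localio-test --filename=/mnt/nfs/testfile --size=8G \
        --ioengine=libaio --direct=1 --iodepth=8 --numjobs=16 \
        --runtime=20 --time_based --rw=randread --bs=4k --group_reporting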

FAQ
===

1. What are the use cases for LOCALIO?

   a. Workloads where the NFS client and server are on the same host
      realize improved IO performance. In particular, it is common when
      running containerised workloads for jobs to find themselves
      running on the same host as the knfsd server being used for
      storage.

2. What are the requirements for LOCALIO?

   a. Bypass use of the network RPC protocol as much as possible. This
      includes bypassing XDR and RPC for open, read, write and commit
      operations.
   b. Allow client and server to autonomously discover if they are
      running local to each other without making any assumptions about
      the local network topology.
   c. Support the use of containers by being compatible with relevant
      namespaces (e.g. network, user, mount).
   d. Support all versions of NFS. NFSv3 is of particular importance
      because it has wide enterprise usage and pNFS flexfiles makes use
      of it for the data path.

3. Why doesn't LOCALIO just compare IP addresses or hostnames when
   deciding if the NFS client and server are co-located on the same
   host?

   Since one of the main use cases is containerised workloads, we cannot
   assume that IP addresses will be shared between the client and
   server. This sets up a requirement for a handshake protocol that
   needs to go over the same connection as the NFS traffic in order to
   identify that the client and the server really are running on the
   same host. The handshake uses a secret that is sent over the wire,
   and can be verified by both parties by comparing with a value stored
   in shared kernel memory if they are truly co-located.

4. Does LOCALIO improve pNFS flexfiles?

   Yes, LOCALIO complements pNFS flexfiles by allowing it to take
   advantage of NFS client and server locality. Policy that initiates
   client IO as close as possible to the server where the data is
   stored naturally benefits from the data path optimization LOCALIO
   provides.

5. Why not develop a new pNFS layout to enable LOCALIO?

   A new pNFS layout could be developed, but doing so would put the
   onus on the server to somehow discover that the client is co-located
   when deciding to hand out the layout.

   There is value in a simpler approach (as provided by LOCALIO) that
   allows the NFS client to negotiate and leverage locality without
   requiring more elaborate modeling and discovery of such locality in a
   more centralized manner.

6. Why is having the client perform a server-side file OPEN, without
   using RPC, beneficial? Is the benefit pNFS specific?

   Avoiding the use of XDR and RPC for file opens is beneficial to
   performance regardless of whether pNFS is used. Especially when
   dealing with small files, it's best to avoid going over the wire
   whenever possible; otherwise the extra round trips could reduce or
   even negate the benefits of avoiding the wire for the small file
   I/O itself.

   Given LOCALIO's requirements, the current approach of having the
   client perform a server-side file open, without using RPC, is ideal.
   If in the future requirements change then we can adapt accordingly.

7. Why is LOCALIO only supported with UNIX Authentication (AUTH_UNIX)?

   Strong authentication is usually tied to the connection itself. It
   works by establishing a context that is cached by the server, and
   that acts as the key for discovering the authorisation token, which
   can then be passed to rpc.mountd to complete the authentication
   process. On the other hand, in the case of AUTH_UNIX, the credential
   that was passed over the wire is used directly as the key in the
   upcall to rpc.mountd. This simplifies the authentication process, and
   so makes AUTH_UNIX easier to support.

8. How do export options that translate RPC user IDs behave for LOCALIO
   operations (e.g. root_squash, all_squash)?

   Export options that translate user IDs are managed by nfsd_setuser(),
   which is called by nfsd_setuser_and_check_port(), which is called by
   __fh_verify(). So they get handled exactly the same way for LOCALIO
   as they do for non-LOCALIO.
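
   For example, with a generic exports entry such as the following,
   root_squash applies to LOCALIO reads and writes exactly as it does
   to IO that goes over the wire (the export path and network here are
   illustrative)::

      /export  192.168.1.0/24(rw,root_squash)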

9. How does LOCALIO make certain that object lifetimes are managed
   properly given NFSD and NFS operate in different contexts?

   See the detailed "NFS Client and Server Interlock" section below.

RPC
===

The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL"
RPC method that allows the Linux NFS client to verify the local Linux
NFS server can see the nonce (single-use UUID) the client generated and
made available in nfs_common. This protocol isn't part of an IETF
standard, nor does it need to be, considering it is a Linux-to-Linux
auxiliary RPC protocol that amounts to an implementation detail.

The UUID_IS_LOCAL method encodes the client generated uuid_t in terms of
the fixed UUID_SIZE (16 bytes). The fixed size opaque encode and decode
XDR methods are used instead of the less efficient variable sized
methods.
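
As a rough sketch, not the exact kernel code (the function name here is
hypothetical), the client-side argument encoder amounts to a single
fixed-size opaque write using the SUNRPC helper for exactly this
purpose::

    static void localio_xdr_enc_uuidargs(struct rpc_rqst *req,
                                         struct xdr_stream *xdr,
                                         const void *data)
    {
            const u8 *uuid = data;

            /* Fixed 16-byte opaque: avoids the length discriminator
             * that a variable-sized opaque would carry on the wire.
             */
            xdr_stream_encode_opaque_fixed(xdr, uuid, UUID_SIZE);
    }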

The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned
by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ )::

    Linux Kernel Organization       400122  nfslocalio

The LOCALIO protocol spec in rpcgen syntax is::

    /* raw RFC 9562 UUID */
    #define UUID_SIZE 16
    typedef u8 uuid_t<UUID_SIZE>;

    program NFS_LOCALIO_PROGRAM {
        version LOCALIO_V1 {
            void
                NULL(void) = 0;
            void
                UUID_IS_LOCAL(uuid_t) = 1;
        } = 1;
    } = 400122;

LOCALIO uses the same transport connection as NFS traffic. As such,
LOCALIO is not registered with rpcbind.

NFS Common and Client/Server Handshake
======================================

fs/nfs_common/nfslocalio.c provides interfaces that enable an NFS client
to generate a nonce (single-use UUID) and an associated short-lived
nfs_uuid_t struct, and to register it with nfs_common for subsequent
lookup and verification by the NFS server; if matched, the NFS server
populates members in the nfs_uuid_t struct. The NFS client then uses
nfs_common to transfer the nfs_uuid_t from nfs_common's uuids_list to
the nn->nfsd_serv clients_list. See:
fs/nfs/localio.c:nfs_local_probe()

nfs_common's nfs_uuids list is the basis for LOCALIO enablement; as such
it has members that point to nfsd memory for direct use by the client
(e.g. 'net' is the server's network namespace; through it the client can
access nn->nfsd_serv with proper RCU read access). It is this client
and server synchronization that enables advanced usage and allows the
lifetime of objects to span from the host kernel's nfsd to per-container
knfsd instances that are connected to NFS clients running on the same
local host.
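
For reference, a simplified sketch of the nfs_uuid_t structure (only
the members discussed in this document are shown; the authoritative
definition lives alongside fs/nfs_common/nfslocalio.c)::

    typedef struct {
            uuid_t uuid;             /* single-use nonce for the handshake */
            struct list_head list;   /* nfs_common's uuids_list, then the
                                      * server's per-net clients_list */
            struct net __rcu *net;   /* the server's network namespace;
                                      * cleared in its "pre_exit" handler */
            struct auth_domain *dom; /* auth_domain used for LOCALIO */
    } nfs_uuid_t;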

NFS Client and Server Interlock
===============================

LOCALIO provides the nfs_uuid_t object and associated interfaces to
allow proper network namespace (net-ns) and NFSD object refcounting:

    We don't want to keep a long-term counted reference on each NFSD's
    net-ns in the client because that prevents a server container from
    completely shutting down.

    So we avoid taking a reference at all and rely on the per-cpu
    reference to the server (detailed below) being sufficient to keep
    the net-ns active. This involves allowing the NFSD's net-ns exit
    code to iterate all active clients and clear their ->net pointers
    (which are needed to find the per-cpu-refcount for the nfsd_serv).

Details:

    - Embed nfs_uuid_t in nfs_client. An nfs_uuid_t provides a list_head
      that can be used to find the client. It does add the 16-byte
      uuid_t to nfs_client so it is bigger than needed (given that
      uuid_t is only used during the initial NFS client and server
      LOCALIO handshake to determine if they are local to each other).
      If that is really a problem we can find a fix.

    - When the NFS server confirms that the uuid_t is local, it moves
      the nfs_uuid_t onto a per-net-ns list in NFSD's nfsd_net.

    - When each server's net-ns is shutting down (in a "pre_exit"
      handler), all these nfs_uuid_t have their ->net cleared. There is
      a synchronize_rcu() call between the pre_exit() handlers and the
      exit() handlers, so any caller that sees nfs_uuid_t ->net as not
      NULL can safely manage the per-cpu-refcount for nfsd_serv.

    - The client's nfs_uuid_t is passed to nfsd_open_local_fh() so it
      can safely dereference ->net in a private rcu_read_lock() section
      to allow safe access to the associated nfsd_net and nfsd_serv.

So LOCALIO required the introduction and use of NFSD's percpu_ref to
interlock nfsd_destroy_serv() and nfsd_open_local_fh(), to ensure each
nn->nfsd_serv is not destroyed while in use by nfsd_open_local_fh().
This warrants a more detailed explanation:

    nfsd_open_local_fh() uses nfsd_serv_try_get() before opening its
    nfsd_file handle and then the caller (the NFS client) must drop the
    reference for the nfsd_file and associated nn->nfsd_serv using
    nfs_file_put_local() once it has completed its IO.

    This interlock works reliably only because nfsd_open_local_fh() is
    able to safely deal with the possibility that the NFSD's net-ns
    (and nfsd_net by association) may have been destroyed by
    nfsd_destroy_serv() via nfsd_shutdown_net(), which is only possible
    given the nfs_uuid_t ->net pointer management detailed above.
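
Putting these pieces together, the server-side open roughly follows the
pattern below. This is a sketch under the assumptions above, not the
exact kernel code: do_nfsd_open() is a hypothetical placeholder, while
net_generic()/nfsd_net_id are the standard per-net lookup mechanism and
nfsd_serv_try_get()/nfs_file_put_local() are the interfaces named in
this document::

    static struct nfsd_file *localio_open_sketch(nfs_uuid_t *nfs_uuid,
                                                 struct nfs_fh *fh)
    {
            struct nfsd_file *nf;
            struct nfsd_net *nn;
            struct net *net;

            rcu_read_lock();
            net = rcu_dereference(nfs_uuid->net);
            if (!net) {
                    /* server net-ns already ran its pre_exit handler */
                    rcu_read_unlock();
                    return ERR_PTR(-ENXIO);
            }
            nn = net_generic(net, nfsd_net_id);
            if (!nfsd_serv_try_get(nn)) {
                    /* nn->nfsd_serv's percpu_ref is already dying */
                    rcu_read_unlock();
                    return ERR_PTR(-ENXIO);
            }
            rcu_read_unlock();

            nf = do_nfsd_open(nn, fh);  /* hypothetical open helper */
            if (IS_ERR(nf))
                    nfsd_serv_put(nn);  /* drop the ref on failure */

            /*
             * On success, the NFS client must eventually call
             * nfs_file_put_local() to drop both the nfsd_file and the
             * nn->nfsd_serv reference taken above.
             */
            return nf;
    }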

All told, this elaborate interlock of the NFS client and server has been
verified to fix an easy-to-hit crash that would occur if an NFSD
instance running in a container, with a LOCALIO client mounted, was
shutdown. Upon restart of the container and associated NFSD, the client
would crash with a NULL pointer dereference caused by the LOCALIO
client attempting nfsd_open_local_fh(), using nn->nfsd_serv, without
having a proper reference on nn->nfsd_serv.

NFS Client issues IO instead of Server
======================================

Because LOCALIO is focused on protocol bypass to achieve improved IO
performance, alternatives to the traditional NFS wire protocol (SUNRPC
with XDR) must be provided to access the backing filesystem.

See fs/nfs/localio.c:nfs_local_open_fh() and
fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes
focused use of select nfs server objects to allow a client local to a
server to open a file pointer without needing to go over the network.

The client's fs/nfs/localio.c:nfs_local_open_fh() will call into the
server's fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access
both the associated nfsd network namespace and nn->nfsd_serv in terms of
RCU. If nfsd_open_local_fh() finds that the client no longer sees valid
nfsd objects (be it struct net or nn->nfsd_serv) it returns -ENXIO
to nfs_local_open_fh() and the client will try to reestablish the
LOCALIO resources it needs by calling nfs_local_probe() again. This
recovery is needed if/when an nfsd instance running in a container is
rebooted while a LOCALIO client is connected to it.
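
Schematically, the recovery amounts to a retry pattern along these
lines (a sketch of the behaviour only; argument lists are abbreviated
relative to the real functions in fs/nfs/localio.c)::

    nf = nfs_local_open_fh(clp, cred, fh, mode);
    if (IS_ERR(nf) && PTR_ERR(nf) == -ENXIO) {
            /* server objects went away, e.g. container restart */
            nfs_local_probe(clp);   /* redo the LOCALIO handshake */
            nf = nfs_local_open_fh(clp, cred, fh, mode);
    }
    if (IS_ERR(nf))
            nf = NULL;              /* fall back to the normal RPC path */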

Once the client has an open nfsd_file pointer it will issue reads,
writes and commits directly to the underlying local filesystem (normally
done by the NFS server). As such, for these operations, the NFS client
is issuing IO to the underlying local filesystem that it is sharing with
the NFS server. See: fs/nfs/localio.c:nfs_local_doio() and
fs/nfs/localio.c:nfs_local_commit().

With normal NFS that makes use of RPC to issue IO to the server, if an
application uses O_DIRECT the NFS client will bypass the pagecache but
the NFS server will not; the NFS server's use of buffered IO allows
applications to be less precise with their alignment when issuing IO to
the NFS client. LOCALIO can be configured to use O_DIRECT semantics by
setting the 'localio_O_DIRECT_semantics' nfs module parameter to Y,
e.g.::

    echo Y > /sys/module/nfs/parameters/localio_O_DIRECT_semantics

Once enabled, it will cause LOCALIO to use O_DIRECT semantics (this may
cause IO to fail if applications do not properly align their IO).
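
To make the setting persist across module reloads, the standard module
parameter configuration mechanism can be used as well, e.g. (the file
name is illustrative)::

    # /etc/modprobe.d/nfs-localio.conf
    options nfs localio_O_DIRECT_semantics=Y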

Security
========

LOCALIO is only supported when UNIX-style authentication (AUTH_UNIX, aka
AUTH_SYS) is used.

Care is taken to ensure the same NFS security mechanisms are used
(authentication, etc) regardless of whether LOCALIO or regular NFS
access is used. The auth_domain established as part of the traditional
NFS client access to the NFS server is also used for LOCALIO.

Relative to containers, LOCALIO gives the client access to the network
namespace the server has. This is required to allow the client to access
the server's per-namespace nfsd_net struct. With traditional NFS, the
client is afforded this same level of access (albeit in terms of the NFS
protocol via SUNRPC). No other namespaces (user, mount, etc) have been
altered or purposely extended from the server to the client.

Testing
=======

The LOCALIO auxiliary protocol and associated NFS LOCALIO read, write
and commit access have proven stable against various test scenarios:

- Client and server both on the same host.

- All permutations of client and server support enablement for both
  local and remote client and server.

- Testing against NFS storage products that don't support the LOCALIO
  protocol was also performed.

- Client on host, server within a container (for both v3 and v4.2).
  The container testing was in terms of podman managed containers and
  includes a successful container stop/restart scenario.

- Formalizing these test scenarios in terms of existing test
  infrastructure is ongoing. Initial regular coverage is provided in
  terms of ktest running xfstests against a LOCALIO-enabled NFS loopback
  mount configuration, and includes lockdep and KASAN coverage, see:
  https://evilpiepirate.org/~testdashboard/ci?user=snitzer&branch=snitm-nfs-next
  https://github.com/koverstreet/ktest

- Various kdevops testing (in terms of "Chuck's BuildBot") has been
  performed to regularly verify the LOCALIO changes haven't caused any
  regressions to non-LOCALIO NFS use cases.

- All of Hammerspace's various sanity tests pass with LOCALIO enabled
  (this includes numerous pNFS and flexfiles tests).