123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312 |
- .. SPDX-License-Identifier: GPL-2.0
- ======
- AF_XDP
- ======
- Overview
- ========
- AF_XDP is an address family that is optimized for high performance
- packet processing.
- This document assumes that the reader is familiar with BPF and XDP. If
- not, the Cilium project has an excellent reference guide at
- http://cilium.readthedocs.io/en/latest/bpf/.
- Using the XDP_REDIRECT action from an XDP program, the program can
- redirect ingress frames to other XDP enabled netdevs, using the
- bpf_redirect_map() function. AF_XDP sockets enable the possibility for
- XDP programs to redirect frames to a memory buffer in a user-space
- application.
- An AF_XDP socket (XSK) is created with the normal socket()
- syscall. Associated with each XSK are two rings: the RX ring and the
- TX ring. A socket can receive packets on the RX ring and it can send
- packets on the TX ring. These rings are registered and sized with the
- setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory
- to have at least one of these rings for each socket. An RX or TX
- descriptor ring points to a data buffer in a memory area called a
- UMEM. RX and TX can share the same UMEM so that a packet does not have
- to be copied between RX and TX. Moreover, if a packet needs to be kept
- for a while due to a possible retransmit, the descriptor that points
- to that packet can be changed to point to another and reused right
- away. This again avoids copying data.
- The UMEM consists of a number of equally sized chunks. A descriptor in
- one of the rings references a frame by referencing its addr. The addr
- is simply an offset within the entire UMEM region. The user space
- allocates memory for this UMEM using whatever means it feels is most
- appropriate (malloc, mmap, huge pages, etc). This memory area is then
- registered with the kernel using the new setsockopt XDP_UMEM_REG. The
- UMEM also has two rings: the FILL ring and the COMPLETION ring. The
- fill ring is used by the application to send down addr for the kernel
- to fill in with RX packet data. References to these frames will then
- appear in the RX ring once each packet has been received. The
- completion ring, on the other hand, contains frame addr that the
- kernel has transmitted completely and can now be used again by user
- space, for either TX or RX. Thus, the frame addrs appearing in the
- completion ring are addrs that were previously transmitted using the
- TX ring. In summary, the RX and FILL rings are used for the RX path
- and the TX and COMPLETION rings are used for the TX path.
- The socket is then finally bound with a bind() call to a device and a
- specific queue id on that device, and it is not until bind is
- completed that traffic starts to flow.
- The UMEM can be shared between processes, if desired. If a process
- wants to do this, it simply skips the registration of the UMEM and its
- corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind
- call and submits the XSK of the process it would like to share UMEM
- with as well as its own newly created XSK socket. The new process will
- then receive frame addr references in its own RX ring that point to
- this shared UMEM. Note that since the ring structures are
- single-consumer / single-producer (for performance reasons), the new
- process has to create its own socket with associated RX and TX rings,
- since it cannot share this with the other process. This is also the
- reason that there is only one set of FILL and COMPLETION rings per
- UMEM. It is the responsibility of a single process to handle the UMEM.
- How is then packets distributed from an XDP program to the XSKs? There
- is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The
- user-space application can place an XSK at an arbitrary place in this
- map. The XDP program can then redirect a packet to a specific index in
- this map and at this point XDP validates that the XSK in that map was
- indeed bound to that device and ring number. If not, the packet is
- dropped. If the map is empty at that index, the packet is also
- dropped. This also means that it is currently mandatory to have an XDP
- program loaded (and one XSK in the XSKMAP) to be able to get any
- traffic to user space through the XSK.
- AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
- driver does not have support for XDP, or XDP_SKB is explicitly chosen
- when loading the XDP program, XDP_SKB mode is employed that uses SKBs
- together with the generic XDP support and copies out the data to user
- space. A fallback mode that works for any network device. On the other
- hand, if the driver has support for XDP, it will be used by the AF_XDP
- code to provide better performance, but there is still a copy of the
- data into user space.
- Concepts
- ========
- In order to use an AF_XDP socket, a number of associated objects need
- to be setup.
- Jonathan Corbet has also written an excellent article on LWN,
- "Accelerating networking with AF_XDP". It can be found at
- https://lwn.net/Articles/750845/.
- UMEM
- ----
- UMEM is a region of virtual contiguous memory, divided into
- equal-sized frames. An UMEM is associated to a netdev and a specific
- queue id of that netdev. It is created and configured (chunk size,
- headroom, start address and size) by using the XDP_UMEM_REG setsockopt
- system call. A UMEM is bound to a netdev and queue id, via the bind()
- system call.
- An AF_XDP is socket linked to a single UMEM, but one UMEM can have
- multiple AF_XDP sockets. To share an UMEM created via one socket A,
- the next socket B can do this by setting the XDP_SHARED_UMEM flag in
- struct sockaddr_xdp member sxdp_flags, and passing the file descriptor
- of A to struct sockaddr_xdp member sxdp_shared_umem_fd.
- The UMEM has two single-producer/single-consumer rings, that are used
- to transfer ownership of UMEM frames between the kernel and the
- user-space application.
- Rings
- -----
- There are a four different kind of rings: Fill, Completion, RX and
- TX. All rings are single-producer/single-consumer, so the user-space
- application need explicit synchronization of multiple
- processes/threads are reading/writing to them.
- The UMEM uses two rings: Fill and Completion. Each socket associated
- with the UMEM must have an RX queue, TX queue or both. Say, that there
- is a setup with four sockets (all doing TX and RX). Then there will be
- one Fill ring, one Completion ring, four TX rings and four RX rings.
- The rings are head(producer)/tail(consumer) based rings. A producer
- writes the data ring at the index pointed out by struct xdp_ring
- producer member, and increasing the producer index. A consumer reads
- the data ring at the index pointed out by struct xdp_ring consumer
- member, and increasing the consumer index.
- The rings are configured and created via the _RING setsockopt system
- calls and mmapped to user-space using the appropriate offset to mmap()
- (XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and
- XDP_UMEM_PGOFF_COMPLETION_RING).
- The size of the rings need to be of size power of two.
- UMEM Fill Ring
- ~~~~~~~~~~~~~~
- The Fill ring is used to transfer ownership of UMEM frames from
- user-space to kernel-space. The UMEM addrs are passed in the ring. As
- an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has
- 16 chunks and can pass addrs between 0 and 64k.
- Frames passed to the kernel are used for the ingress path (RX rings).
- The user application produces UMEM addrs to this ring. Note that the
- kernel will mask the incoming addr. E.g. for a chunk size of 2k, the
- log2(2048) LSB of the addr will be masked off, meaning that 2048, 2050
- and 3000 refers to the same chunk.
- UMEM Completetion Ring
- ~~~~~~~~~~~~~~~~~~~~~~
- The Completion Ring is used transfer ownership of UMEM frames from
- kernel-space to user-space. Just like the Fill ring, UMEM indicies are
- used.
- Frames passed from the kernel to user-space are frames that has been
- sent (TX ring) and can be used by user-space again.
- The user application consumes UMEM addrs from this ring.
- RX Ring
- ~~~~~~~
- The RX ring is the receiving side of a socket. Each entry in the ring
- is a struct xdp_desc descriptor. The descriptor contains UMEM offset
- (addr) and the length of the data (len).
- If no frames have been passed to kernel via the Fill ring, no
- descriptors will (or can) appear on the RX ring.
- The user application consumes struct xdp_desc descriptors from this
- ring.
- TX Ring
- ~~~~~~~
- The TX ring is used to send frames. The struct xdp_desc descriptor is
- filled (index, length and offset) and passed into the ring.
- To start the transfer a sendmsg() system call is required. This might
- be relaxed in the future.
- The user application produces struct xdp_desc descriptors to this
- ring.
- XSKMAP / BPF_MAP_TYPE_XSKMAP
- ----------------------------
- On XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP) that
- is used in conjunction with bpf_redirect_map() to pass the ingress
- frame to a socket.
- The user application inserts the socket into the map, via the bpf()
- system call.
- Note that if an XDP program tries to redirect to a socket that does
- not match the queue configuration and netdev, the frame will be
- dropped. E.g. an AF_XDP socket is bound to netdev eth0 and
- queue 17. Only the XDP program executing for eth0 and queue 17 will
- successfully pass data to the socket. Please refer to the sample
- application (samples/bpf/) in for an example.
- Usage
- =====
- In order to use AF_XDP sockets there are two parts needed. The
- user-space application and the XDP program. For a complete setup and
- usage example, please refer to the sample application. The user-space
- side is xdpsock_user.c and the XDP side xdpsock_kern.c.
- Naive ring dequeue and enqueue could look like this::
- // struct xdp_rxtx_ring {
- // __u32 *producer;
- // __u32 *consumer;
- // struct xdp_desc *desc;
- // };
- // struct xdp_umem_ring {
- // __u32 *producer;
- // __u32 *consumer;
- // __u64 *desc;
- // };
- // typedef struct xdp_rxtx_ring RING;
- // typedef struct xdp_umem_ring RING;
- // typedef struct xdp_desc RING_TYPE;
- // typedef __u64 RING_TYPE;
- int dequeue_one(RING *ring, RING_TYPE *item)
- {
- __u32 entries = *ring->producer - *ring->consumer;
- if (entries == 0)
- return -1;
- // read-barrier!
- *item = ring->desc[*ring->consumer & (RING_SIZE - 1)];
- (*ring->consumer)++;
- return 0;
- }
- int enqueue_one(RING *ring, const RING_TYPE *item)
- {
- u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer);
- if (free_entries == 0)
- return -1;
- ring->desc[*ring->producer & (RING_SIZE - 1)] = *item;
- // write-barrier!
- (*ring->producer)++;
- return 0;
- }
- For a more optimized version, please refer to the sample application.
- Sample application
- ==================
- There is a xdpsock benchmarking/test application included that
- demonstrates how to use AF_XDP sockets with both private and shared
- UMEMs. Say that you would like your UDP traffic from port 4242 to end
- up in queue 16, that we will enable AF_XDP on. Here, we use ethtool
- for this::
- ethtool -N p3p2 rx-flow-hash udp4 fn
- ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
- action 16
- Running the rxdrop benchmark in XDP_DRV mode can then be done
- using::
- samples/bpf/xdpsock -i p3p2 -q 16 -r -N
- For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
- can be displayed with "-h", as usual.
- Credits
- =======
- - Björn Töpel (AF_XDP core)
- - Magnus Karlsson (AF_XDP core)
- - Alexander Duyck
- - Alexei Starovoitov
- - Daniel Borkmann
- - Jesper Dangaard Brouer
- - John Fastabend
- - Jonathan Corbet (LWN coverage)
- - Michael S. Tsirkin
- - Qi Z Zhang
- - Willem de Bruijn
|