| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267 |
- =========================
- BPF Graph Data Structures
- =========================
- This document describes implementation details of new-style "graph" data
- structures (linked_list, rbtree), with particular focus on the verifier's
- implementation of semantics specific to those data structures.
- Although no specific verifier code is referred to in this document, the document
- assumes that the reader has general knowledge of BPF verifier internals, BPF
- maps, and BPF program writing.
- Note that the intent of this document is to describe the current state of
- these graph data structures. **No guarantees** of stability for either
- semantics or APIs are made or implied here.
- .. contents::
- :local:
- :depth: 2
- Introduction
- ------------
- The BPF map API has historically been the main way to expose data structures
- of various types for use within BPF programs. Some data structures fit naturally
- with the map API (HASH, ARRAY), others less so. Consequently, programs
- interacting with the latter group of data structures can be hard to parse
- for kernel programmers without previous BPF experience.
- Luckily, some restrictions which necessitated the use of BPF map semantics are
- no longer relevant. With the introduction of kfuncs, kptrs, and the any-context
- BPF allocator, it is now possible to implement BPF data structures whose API
- and semantics more closely match those exposed to the rest of the kernel.
- Two such data structures - linked_list and rbtree - have many verification
- details in common. Because both have "root"s ("head" for linked_list) and
- "node"s, the verifier code and this document refer to common functionality
- as "graph_api", "graph_root", "graph_node", etc.
- Unless otherwise stated, examples and semantics below apply to both graph data
- structures.
- Unstable API
- ------------
- Data structures implemented using the BPF map API have historically used BPF
- helper functions - either standard map API helpers like ``bpf_map_update_elem``
- or map-specific helpers. The new-style graph data structures instead use kfuncs
- to define their manipulation helpers. Because there are no stability guarantees
- for kfuncs, the API and semantics for these data structures can be evolved in
- a way that breaks backwards compatibility if necessary.
- Root and node types for the new data structures are opaquely defined in the
- ``uapi/linux/bpf.h`` header.
- Locking
- -------
- The new-style data structures are intrusive and are defined similarly to their
- vanilla kernel counterparts:
- .. code-block:: c
- struct node_data {
- long key;
- long data;
- struct bpf_rb_node node;
- };
- struct bpf_spin_lock glock;
- struct bpf_rb_root groot __contains(node_data, node);
- The "root" type for both linked_list and rbtree expects to be in a map_value
- which also contains a ``bpf_spin_lock`` - in the above example both global
- variables are placed in a single-value arraymap. The verifier considers this
- spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in
- the same map_value and will enforce that the correct lock is held when
- verifying BPF programs that manipulate the tree. Since this lock checking
- happens at verification time, there is no runtime penalty.
- Non-owning references
- ---------------------
- **Motivation**
- Consider the following BPF code:
- .. code-block:: c
- struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
- bpf_spin_lock(&lock);
- bpf_rbtree_add(&tree, n); /* PASSED */
- bpf_spin_unlock(&lock);
- From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new``
- has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of
- ``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the
- program has ownership of the pointee's (object pointed to by ``n``) lifetime.
- The BPF program must pass off ownership before exiting - either via
- ``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with
- ``bpf_rbtree_add``.
- (``ACQUIRED`` and ``PASSED`` comments in the example denote statements where
- "ownership is acquired" and "ownership is passed", respectively)
- What should the verifier do with ``n`` after ownership is passed off? If the
- object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier
- should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as
- the object is no longer valid. The underlying memory may have been reused for
- some other allocation, unmapped, etc.
- When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less
- obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``,
- but that would result in programs with useful, common coding patterns being
- rejected, e.g.:
- .. code-block:: c
- int x;
- struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
- bpf_spin_lock(&lock);
- bpf_rbtree_add(&tree, n); /* PASSED */
- x = n->data;
- n->data = 42;
- bpf_spin_unlock(&lock);
- Both the read from and write to ``n->data`` would be rejected. The verifier
- can do better, though, by taking advantage of two details:
- * Graph data structure APIs can only be used when the ``bpf_spin_lock``
- associated with the graph root is held
- * Both graph data structures have pointer stability
- * Because graph nodes are allocated with ``bpf_obj_new`` and
- adding / removing from the root involves fiddling with the
- ``bpf_{list,rb}_node`` field of the node struct, a graph node will
- remain at the same address after either operation.
- Because the associated ``bpf_spin_lock`` must be held by any program adding
- or removing, if we're in the critical section bounded by that lock, we know
- that no other program can add or remove until the end of the critical section.
- This combined with pointer stability means that, until the critical section
- ends, we can safely access the graph node through ``n`` even after it was used
- to pass ownership.
- The verifier considers such a reference a *non-owning reference*. The ref
- returned by ``bpf_obj_new`` is accordingly considered an *owning reference*.
- Both terms currently only have meaning in the context of graph nodes and API.
- **Details**
- Let's enumerate the properties of both types of references.
- *owning reference*
- * This reference controls the lifetime of the pointee
- * Ownership of pointee must be 'released' by passing it to some graph API
- kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee
- * If not released before program ends, verifier considers program invalid
- * Access to the pointee's memory will not page fault
- *non-owning reference*
- * This reference does not own the pointee
- * It cannot be used to add the graph node to a graph root, nor ``free``'d via
- ``bpf_obj_drop``
- * No explicit control of lifetime, but can infer valid lifetime based on
- non-owning ref existence (see explanation below)
- * Access to the pointee's memory will not page fault
- From verifier's perspective non-owning references can only exist
- between spin_lock and spin_unlock. Why? After spin_unlock another program
- can do arbitrary operations on the data structure like removing and ``free``-ing
- via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
- ``free``'d, and reused via bpf_obj_new would point to an entirely different thing.
- Or the memory could go away.
- To prevent this logic violation all non-owning references are invalidated by the
- verifier after a critical section ends. This is necessary to ensure the "will
- not page fault" property of non-owning references. So if the verifier hasn't
- invalidated a non-owning ref, accessing it will not page fault.
- Currently ``bpf_obj_drop`` is not allowed in the critical section, so
- if there's a valid non-owning ref, we must be in a critical section, and can
- conclude that the ref's memory hasn't been dropped-and- ``free``'d or
- dropped-and-reused.
- Any reference to a node that is in an rbtree _must_ be non-owning, since
- the tree has control of the pointee's lifetime. Similarly, any ref to a node
- that isn't in rbtree _must_ be owning. This results in a nice property:
- graph API add / remove implementations don't need to check if a node
- has already been added (or already removed), as the ownership model
- allows the verifier to prevent such a state from being valid by simply checking
- types.
- However, pointer aliasing poses an issue for the above "nice property".
- Consider the following example:
- .. code-block:: c
- struct node_data *n, *m, *o, *p;
- n = bpf_obj_new(typeof(*n)); /* 1 */
- bpf_spin_lock(&lock);
- bpf_rbtree_add(&tree, n); /* 2 */
- m = bpf_rbtree_first(&tree); /* 3 */
- o = bpf_rbtree_remove(&tree, n); /* 4 */
- p = bpf_rbtree_remove(&tree, m); /* 5 */
- bpf_spin_unlock(&lock);
- bpf_obj_drop(o);
- bpf_obj_drop(p); /* 6 */
- Assume the tree is empty before this program runs. If we track verifier state
- changes here using numbers in above comments:
- 1) n is an owning reference
- 2) n is a non-owning reference, it's been added to the tree
- 3) n and m are non-owning references, they both point to the same node
- 4) o is an owning reference, n and m non-owning, all point to same node
- 5) o and p are owning, n and m non-owning, all point to the same node
- 6) a double-free has occurred, since o and p point to same node and o was
- ``free``'d in previous statement
- States 4 and 5 violate our "nice property", as there are non-owning refs to
- a node which is not in an rbtree. Statement 5 will try to remove a node which
- has already been removed as a result of this violation. State 6 is a dangerous
- double-free.
- At a minimum we should prevent state 6 from being possible. If we can't also
- prevent state 5 then we must abandon our "nice property" and check whether a
- node has already been removed at runtime.
- We prevent both by generalizing the "invalidate non-owning references" behavior
- of ``bpf_spin_unlock`` and doing similar invalidation after
- ``bpf_rbtree_remove``. The logic here being that any graph API kfunc which:
- * takes an arbitrary node argument
- * removes it from the data structure
- * returns an owning reference to the removed node
- May result in a state where some other non-owning reference points to the same
- node. So ``remove``-type kfuncs must be considered a non-owning reference
- invalidation point as well.
|