|  | ========================= | 
|  | BPF Graph Data Structures | 
|  | ========================= | 
|  |  | 
|  | This document describes implementation details of new-style "graph" data | 
|  | structures (linked_list, rbtree), with particular focus on the verifier's | 
|  | implementation of semantics specific to those data structures. | 
|  |  | 
|  | Although no specific verifier code is referred to in this document, the document | 
|  | assumes that the reader has general knowledge of BPF verifier internals, BPF | 
|  | maps, and BPF program writing. | 
|  |  | 
|  | Note that the intent of this document is to describe the current state of | 
|  | these graph data structures. **No guarantees** of stability for either | 
|  | semantics or APIs are made or implied here. | 
|  |  | 
|  | .. contents:: | 
|  | :local: | 
|  | :depth: 2 | 
|  |  | 
|  | Introduction | 
|  | ------------ | 
|  |  | 
|  | The BPF map API has historically been the main way to expose data structures | 
|  | of various types for use within BPF programs. Some data structures fit naturally | 
|  | with the map API (HASH, ARRAY), others less so. Consequently, programs | 
|  | interacting with the latter group of data structures can be hard to parse | 
|  | for kernel programmers without previous BPF experience. | 
|  |  | 
|  | Luckily, some restrictions which necessitated the use of BPF map semantics are | 
|  | no longer relevant. With the introduction of kfuncs, kptrs, and the any-context | 
|  | BPF allocator, it is now possible to implement BPF data structures whose API | 
|  | and semantics more closely match those exposed to the rest of the kernel. | 
|  |  | 
|  | Two such data structures - linked_list and rbtree - have many verification | 
|  | details in common. Because both have "root"s ("head" for linked_list) and | 
|  | "node"s, the verifier code and this document refer to common functionality | 
|  | as "graph_api", "graph_root", "graph_node", etc. | 
|  |  | 
|  | Unless otherwise stated, examples and semantics below apply to both graph data | 
|  | structures. | 
|  |  | 
|  | Unstable API | 
|  | ------------ | 
|  |  | 
|  | Data structures implemented using the BPF map API have historically used BPF | 
|  | helper functions - either standard map API helpers like ``bpf_map_update_elem`` | 
|  | or map-specific helpers. The new-style graph data structures instead use kfuncs | 
|  | to define their manipulation helpers. Because there are no stability guarantees | 
|  | for kfuncs, the API and semantics for these data structures can be evolved in | 
|  | a way that breaks backwards compatibility if necessary. | 
|  |  | 
|  | Root and node types for the new data structures are opaquely defined in the | 
|  | ``uapi/linux/bpf.h`` header. | 
|  |  | 
|  | Locking | 
|  | ------- | 
|  |  | 
|  | The new-style data structures are intrusive and are defined similarly to their | 
|  | vanilla kernel counterparts: | 
|  |  | 
|  | .. code-block:: c | 
|  |  | 
|  | struct node_data { | 
|  | long key; | 
|  | long data; | 
|  | struct bpf_rb_node node; | 
|  | }; | 
|  |  | 
|  | struct bpf_spin_lock glock; | 
|  | struct bpf_rb_root groot __contains(node_data, node); | 
|  |  | 
|  | The "root" type for both linked_list and rbtree expects to be in a map_value | 
|  | which also contains a ``bpf_spin_lock`` - in the above example both global | 
|  | variables are placed in a single-value arraymap. The verifier considers this | 
|  | spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in | 
|  | the same map_value and will enforce that the correct lock is held when | 
|  | verifying BPF programs that manipulate the tree. Since this lock checking | 
|  | happens at verification time, there is no runtime penalty. | 
|  |  | 
|  | Non-owning references | 
|  | --------------------- | 
|  |  | 
|  | **Motivation** | 
|  |  | 
|  | Consider the following BPF code: | 
|  |  | 
|  | .. code-block:: c | 
|  |  | 
|  | struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */ | 
|  |  | 
|  | bpf_spin_lock(&lock); | 
|  |  | 
|  | bpf_rbtree_add(&tree, n); /* PASSED */ | 
|  |  | 
|  | bpf_spin_unlock(&lock); | 
|  |  | 
|  | From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new`` | 
|  | has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of | 
|  | ``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the | 
|  | program has ownership of the pointee's (object pointed to by ``n``) lifetime. | 
|  | The BPF program must pass off ownership before exiting - either via | 
|  | ``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with | 
|  | ``bpf_rbtree_add``. | 
|  |  | 
|  | (``ACQUIRED`` and ``PASSED`` comments in the example denote statements where | 
|  | "ownership is acquired" and "ownership is passed", respectively) | 
|  |  | 
|  | What should the verifier do with ``n`` after ownership is passed off? If the | 
|  | object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier | 
|  | should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as | 
|  | the object is no longer valid. The underlying memory may have been reused for | 
|  | some other allocation, unmapped, etc. | 
|  |  | 
|  | When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less | 
|  | obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``, | 
|  | but that would result in programs with useful, common coding patterns being | 
|  | rejected, e.g.: | 
|  |  | 
|  | .. code-block:: c | 
|  |  | 
|  | int x; | 
|  | struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */ | 
|  |  | 
|  | bpf_spin_lock(&lock); | 
|  |  | 
|  | bpf_rbtree_add(&tree, n); /* PASSED */ | 
|  | x = n->data; | 
|  | n->data = 42; | 
|  |  | 
|  | bpf_spin_unlock(&lock); | 
|  |  | 
|  | Both the read from and write to ``n->data`` would be rejected. The verifier | 
|  | can do better, though, by taking advantage of two details: | 
|  |  | 
|  | * Graph data structure APIs can only be used when the ``bpf_spin_lock`` | 
|  | associated with the graph root is held | 
|  |  | 
|  | * Both graph data structures have pointer stability | 
|  |  | 
|  | * Because graph nodes are allocated with ``bpf_obj_new`` and | 
|  | adding / removing from the root involves fiddling with the | 
|  | ``bpf_{list,rb}_node`` field of the node struct, a graph node will | 
|  | remain at the same address after either operation. | 
|  |  | 
|  | Because the associated ``bpf_spin_lock`` must be held by any program adding | 
|  | or removing, if we're in the critical section bounded by that lock, we know | 
|  | that no other program can add or remove until the end of the critical section. | 
|  | This combined with pointer stability means that, until the critical section | 
|  | ends, we can safely access the graph node through ``n`` even after it was used | 
|  | to pass ownership. | 
|  |  | 
|  | The verifier considers such a reference a *non-owning reference*. The ref | 
|  | returned by ``bpf_obj_new`` is accordingly considered an *owning reference*. | 
|  | Both terms currently only have meaning in the context of graph nodes and API. | 
|  |  | 
|  | **Details** | 
|  |  | 
|  | Let's enumerate the properties of both types of references. | 
|  |  | 
|  | *owning reference* | 
|  |  | 
|  | * This reference controls the lifetime of the pointee | 
|  |  | 
|  | * Ownership of pointee must be 'released' by passing it to some graph API | 
|  | kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee | 
|  |  | 
|  | * If not released before program ends, verifier considers program invalid | 
|  |  | 
|  | * Access to the pointee's memory will not page fault | 
|  |  | 
|  | *non-owning reference* | 
|  |  | 
|  | * This reference does not own the pointee | 
|  |  | 
|  | * It cannot be used to add the graph node to a graph root, nor ``free``'d via | 
|  | ``bpf_obj_drop`` | 
|  |  | 
|  | * No explicit control of lifetime, but can infer valid lifetime based on | 
|  | non-owning ref existence (see explanation below) | 
|  |  | 
|  | * Access to the pointee's memory will not page fault | 
|  |  | 
|  | From verifier's perspective non-owning references can only exist | 
|  | between spin_lock and spin_unlock. Why? After spin_unlock another program | 
|  | can do arbitrary operations on the data structure like removing and ``free``-ing | 
|  | via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd, | 
|  | ``free``'d, and reused via bpf_obj_new would point to an entirely different thing. | 
|  | Or the memory could go away. | 
|  |  | 
|  | To prevent this logic violation all non-owning references are invalidated by the | 
|  | verifier after a critical section ends. This is necessary to ensure the "will | 
|  | not page fault" property of non-owning references. So if the verifier hasn't | 
|  | invalidated a non-owning ref, accessing it will not page fault. | 
|  |  | 
|  | Currently ``bpf_obj_drop`` is not allowed in the critical section, so | 
|  | if there's a valid non-owning ref, we must be in a critical section, and can | 
|  | conclude that the ref's memory hasn't been dropped-and- ``free``'d or | 
|  | dropped-and-reused. | 
|  |  | 
|  | Any reference to a node that is in an rbtree _must_ be non-owning, since | 
|  | the tree has control of the pointee's lifetime. Similarly, any ref to a node | 
|  | that isn't in rbtree _must_ be owning. This results in a nice property: | 
|  | graph API add / remove implementations don't need to check if a node | 
|  | has already been added (or already removed), as the ownership model | 
|  | allows the verifier to prevent such a state from being valid by simply checking | 
|  | types. | 
|  |  | 
|  | However, pointer aliasing poses an issue for the above "nice property". | 
|  | Consider the following example: | 
|  |  | 
|  | .. code-block:: c | 
|  |  | 
|  | struct node_data *n, *m, *o, *p; | 
|  | n = bpf_obj_new(typeof(*n));     /* 1 */ | 
|  |  | 
|  | bpf_spin_lock(&lock); | 
|  |  | 
|  | bpf_rbtree_add(&tree, n);        /* 2 */ | 
|  | m = bpf_rbtree_first(&tree);     /* 3 */ | 
|  |  | 
|  | o = bpf_rbtree_remove(&tree, n); /* 4 */ | 
|  | p = bpf_rbtree_remove(&tree, m); /* 5 */ | 
|  |  | 
|  | bpf_spin_unlock(&lock); | 
|  |  | 
|  | bpf_obj_drop(o); | 
|  | bpf_obj_drop(p); /* 6 */ | 
|  |  | 
|  | Assume the tree is empty before this program runs. If we track verifier state | 
|  | changes here using numbers in above comments: | 
|  |  | 
|  | 1) n is an owning reference | 
|  |  | 
|  | 2) n is a non-owning reference, it's been added to the tree | 
|  |  | 
|  | 3) n and m are non-owning references, they both point to the same node | 
|  |  | 
|  | 4) o is an owning reference, n and m non-owning, all point to same node | 
|  |  | 
|  | 5) o and p are owning, n and m non-owning, all point to the same node | 
|  |  | 
|  | 6) a double-free has occurred, since o and p point to same node and o was | 
|  | ``free``'d in previous statement | 
|  |  | 
|  | States 4 and 5 violate our "nice property", as there are non-owning refs to | 
|  | a node which is not in an rbtree. Statement 5 will try to remove a node which | 
|  | has already been removed as a result of this violation. State 6 is a dangerous | 
|  | double-free. | 
|  |  | 
|  | At a minimum we should prevent state 6 from being possible. If we can't also | 
|  | prevent state 5 then we must abandon our "nice property" and check whether a | 
|  | node has already been removed at runtime. | 
|  |  | 
|  | We prevent both by generalizing the "invalidate non-owning references" behavior | 
|  | of ``bpf_spin_unlock`` and doing similar invalidation after | 
|  | ``bpf_rbtree_remove``. The logic here being that any graph API kfunc which: | 
|  |  | 
|  | * takes an arbitrary node argument | 
|  |  | 
|  | * removes it from the data structure | 
|  |  | 
|  | * returns an owning reference to the removed node | 
|  |  | 
|  | May result in a state where some other non-owning reference points to the same | 
|  | node. So ``remove``-type kfuncs must be considered a non-owning reference | 
|  | invalidation point as well. |