Reflections on our eBPF demo application
In our last post on eBPF we presented a simple eBPF application which implements an L2TPv3 Ethernet pseudowire dataplane.
We covered the high level design of the application, and how to build and run the code.
In this post we will dive into some more specific details about the application at the code level, and reflect on some of the challenges we met when working on it.
When is C not really C?
eBPF code looks a lot like C code, which makes it quite approachable if you already know C.
That's a good thing!
But there is a potential drawback: it's easy to assume that since eBPF code looks like normal C code, it is possible to design an eBPF program as you would any C program.
However, that's not quite the case.
The eBPF environment is quite specialised and deliberately limited, for example:
- code can compile to a limited maximum number of instructions,
- code cannot infinitely loop,
- code cannot arbitrarily access memory.
These limitations, and more besides, are enforced by the verifier, which checks code when it is loaded into the kernel.
This effectively adds another stage to the development process: it's not unusual to compile what you think should be a working program only to have the verifier throw it out because you've accidentally violated one of its constraints.
In some cases you'll end up changing how you write code specifically to keep the verifier happy; or even having to rework designs because what you initially planned to do isn't actually possible within the constraints of the verifier.
The best practice is to work in small steps, keep code very simple initially, and be prepared to prototype different approaches in order to find a workable solution.
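To make the constraints a little more concrete, here is a minimal sketch of the shape of code the verifier tends to accept: a loop with a compile-time upper bound, and an explicit bounds check before every read. It is written as plain C so that it can be compiled and tried outside the kernel. The function name, the MAX_VLAN_DEPTH cap and the VLAN-skipping task itself are assumptions chosen for illustration, and are not taken from our demo application.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_VLAN_DEPTH 2        /* hypothetical cap: the loop can never run longer */
#define ETHERTYPE_VLAN 0x8100   /* 802.1Q tag protocol identifier */

/*
 * Return the EtherType (host byte order) found after skipping at most
 * MAX_VLAN_DEPTH 802.1Q tags, or -1 if the frame is too short.
 * If more tags are stacked than the cap allows, the VLAN EtherType
 * itself is returned and the caller can decide what to do.
 */
static int inner_ethertype(const uint8_t *data, const uint8_t *data_end)
{
    const uint8_t *p = data + 12;   /* EtherType follows the two MAC addresses */
    int proto;
    int i;

    /* Bounds check before the first read */
    if (p + 2 > data_end)
        return -1;
    proto = (p[0] << 8) | p[1];
    p += 2;

    /* Constant loop bound: termination is obvious at a glance */
    for (i = 0; i < MAX_VLAN_DEPTH; i++) {
        if (proto != ETHERTYPE_VLAN)
            break;
        /* Bounds check before every further read */
        if (p + 4 > data_end)
            return -1;
        proto = (p[2] << 8) | p[3];
        p += 4;
    }
    return proto;
}
```

In ordinary C you might write the loop as "while the EtherType is a VLAN tag, keep going". The version above gives up that generality in exchange for a termination argument which is trivial to follow.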
Context is everything
The input argument(s) to your eBPF program, and the in-kernel helpers it can access, are determined by where in the kernel it is running.
That is, if you write an XDP eBPF program, it will have access to different information and helpers than, for example, a socket operations eBPF program.
The reasons for this are fairly obvious of course: a program running at the lowest level of the network stack will necessarily run in a different context to one running as a part of the sockets API.
Still, it can make exploring the helper API difficult. Pay special attention to which helpers are accessible from which context (the bpf-helpers(7) manpage points you in the right direction) in order to keep tabs on what your program can call.
What's in a name?
Part of the magic of the eBPF build process is leaning heavily on ELF sections in order to give meaning to different parts of the program.
Some section names are arbitrary: for example, in our eBPF demo program we used the section names cls_act/encap and cls_act/decap. You might imagine that the traffic control tooling demanded the cls_act prefix for clsact queue discipline eBPF programs, but that's not the case. When loading the eBPF program using tc(8), you can name the ELF section to be loaded, and there are no requirements on the section name.
Other section names are more prescriptive. For example, the .maps section name is used for eBPF maps, which is understood and expected by various eBPF components such as libbpf.
As a general rule, be aware that section names may be important.
If in doubt, following section naming conventions used in the Linux kernel source tree under tools/testing/selftests/bpf is a reasonable approach to avoid trouble.
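As a rough illustration of how section names end up attached to code and data, the sketch below defines local equivalents of the SEC(), __uint() and __type() macros that libbpf provides, then uses them on two program entry points and a map. It is plain C so it can be compiled with an ordinary ELF toolchain. The function bodies, the map name session_map and its size are hypothetical, and the TC_ACT_OK value is redefined locally rather than pulled from kernel headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins mirroring the macros in libbpf's bpf_helpers.h */
#define SEC(name) __attribute__((section(name), used))
#define __uint(name, val) int (*name)[val]
#define __type(name, val) __typeof__(val) *name

#define TC_ACT_OK 0             /* "let the packet continue" for tc programs */
#define BPF_MAP_TYPE_HASH 1     /* value of the enum in linux/bpf.h */

struct __sk_buff;               /* opaque here: only used as a pointer type */

/* Arbitrary section names: tc is simply told which section to load */
SEC("cls_act/encap")
int encap_prog(struct __sk_buff *skb)
{
    (void)skb;
    return TC_ACT_OK;
}

SEC("cls_act/decap")
int decap_prog(struct __sk_buff *skb)
{
    (void)skb;
    return TC_ACT_OK;
}

/*
 * Prescriptive section name: libbpf looks for map definitions in ".maps".
 * The macros encode the map parameters in the types of the struct
 * members, so nothing needs initialising at run time.
 */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 64);    /* hypothetical size */
    __type(key, uint32_t);
    __type(value, uint32_t);
} session_map SEC(".maps");
```

With an object file built this way, a command along the lines of "tc filter add dev eth0 ingress bpf da obj demo.o sec cls_act/decap" picks out the section to load by name (the device and file names here are placeholders, and a clsact qdisc must already exist on the device).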
Helper limitations
Our demo application's primary job is modifying network packets, either to add or remove headers.
It sounds as though it should be simple, but we found ourselves needing to do a lot of reading around and experimenting to determine which helper APIs could be used.
bpf_skb_adjust_room limitations
For decapsulation, we need to remove headers from the front of the incoming packet.
On reviewing the helper API, we saw that bpf_skb_adjust_room() would allow us to shrink the packet, and we could see examples in the kernel of the API being used in that fashion.
But we found that bpf_skb_adjust_room() had a number of unexpected behaviours which limited how we could use it. At the time of writing, bpf_skb_adjust_room():
- is restricted as to where in the packet the room adjustment applies (you cannot remove bytes from the very start of the packet, for example);
- is also restricted as to the size of the room adjustment depending on the nature of the packet (the resulting packet post-adjustment must have a minimum size which is informed by whether the packet has an IPv4 or IPv6 header).
The first limitation was an issue since we wanted to strip encapsulating headers from the start of the packet, but bpf_skb_adjust_room() doesn't support that.
We managed to work around this in the end by being a bit creative with how we strip the encapsulating headers.
The second limitation was an issue since by default the packet size checks inside bpf_skb_adjust_room() were such that small packets could be rejected as "too small" when IPv6 was used in the encapsulating headers. This was rejecting ARP packets, and hence breaking the datapath.
Fortunately, there is a flag, BPF_F_ADJ_ROOM_DECAP_L3_IPV4, which allows us to work around the packet length assumptions in bpf_skb_adjust_room(), at least for our use case. Sadly, not all kernels have BPF_F_ADJ_ROOM_DECAP_L3_IPV4, so not all kernels can use IPv6 with our demo.
Figuring out how to make use of bpf_skb_adjust_room() was not easy. We ended up having to dig into the kernel code to see how things worked in the helper; and reviewed the git history to get context on why the helper is implemented the way it is.
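To show why the size check bites for small packets, here is a deliberately simplified userspace model of the shrink check as we understand it from reading the helper. It assumes the minimum remaining length is the size of an IPv4 header (20 bytes) or an IPv6 header (40 bytes) depending on the outer protocol, and that the decap flag overrides that minimum. The function and type names are hypothetical, the real helper performs further checks, and the details may differ between kernel versions.

```c
#include <stdint.h>

#define IPV4_HDR_LEN 20u    /* size of a minimal IPv4 header */
#define IPV6_HDR_LEN 40u    /* size of the fixed IPv6 header */

enum outer_proto { OUTER_IPV4, OUTER_IPV6 };

/*
 * Hypothetical model, not kernel code.
 * len_cur:   bytes in the packet from the network header onwards
 * shrink_by: bytes we are asking to remove
 * Returns 1 if the shrink would be accepted, 0 if it would be rejected.
 */
static int shrink_allowed(enum outer_proto proto, int decap_l3_ipv4,
                          uint32_t len_cur, uint32_t shrink_by)
{
    uint32_t len_min = (proto == OUTER_IPV6) ? IPV6_HDR_LEN : IPV4_HDR_LEN;

    /* Models BPF_F_ADJ_ROOM_DECAP_L3_IPV4: assume IPv4-sized minimum */
    if (decap_l3_ipv4)
        len_min = IPV4_HDR_LEN;

    /* Cannot remove everything (or more than everything) */
    if (shrink_by >= len_cur)
        return 0;

    /* What is left must still be at least len_min bytes */
    return (len_cur - shrink_by) >= len_min;
}
```

Taking illustrative numbers rather than measurements from the demo: if 28 bytes of ARP payload would be left after the adjustment, the model rejects the shrink with an IPv6 outer header (28 is less than 40), but accepts it once the flag lowers the minimum to 20.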
Packet checksumming
The other main pain point for the demo program is around packet checksums.
The problem arises specifically when using IPv6 and UDP in the outer encapsulation headers: because the IPv6 header has no checksum of its own, a valid checksum in the UDP header is mandatory.
When running as a part of the native kernel environment, code has support from the network stack for checksumming. There are facilities for performing partial checksums, keeping track of what has been checksummed and what has not, and offloading checksumming to hardware where such support is available.
In the eBPF environment, this isn't so much the case. At the time of writing the only approach we have found is to simply perform the checksum for the entire encapsulated packet in the eBPF code directly.
Given that in a lot of cases the encapsulated packet has been checksummed already this is a sub-optimal approach.
Furthermore, trying to implement the checksum in eBPF was not without its difficulties.
Although writing a loop to step through the packet seems fairly straightforward, you may remember that one of the verifier's jobs is to prevent unbounded looping in eBPF code. It can be difficult to convince the verifier that a given loop will terminate!
Happily, this is a problem other people have already solved by means of an eBPF helper, bpf_loop(). This API provides a verifier-blessed way to perform arbitrary looping, albeit while still having to call out a bounded maximum number of loop iterations.
The downside is that this places another kernel-version restriction on our eBPF program: the kernel needs to have bpf_loop() in order for us to be able to use IPv6/UDP encapsulation.
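The callback style that bpf_loop() imposes can be sketched in userspace. The example below is not our demo code: loop_bounded() is a hypothetical stand-in which copies the documented behaviour of bpf_loop() (call the callback up to nr_loops times, stop early if it returns non-zero, return the number of iterations performed), and the callback computes the standard Internet one's-complement checksum over a buffer. MAX_CSUM_WORDS and the other names are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CSUM_WORDS 1024     /* hypothetical cap on 16-bit words summed */

struct csum_ctx {
    const uint8_t *cur;         /* read cursor */
    const uint8_t *end;         /* one past the last valid byte */
    uint32_t sum;               /* running 32-bit accumulator */
};

/* Userspace stand-in for bpf_loop(): bounded, with early exit */
static long loop_bounded(uint32_t nr_loops,
                         long (*cb)(uint32_t index, void *ctx), void *ctx)
{
    uint32_t i;

    for (i = 0; i < nr_loops; i++) {
        if (cb(i, ctx))
            return i + 1;
    }
    return i;
}

/* One iteration: add the next big-endian 16-bit word, if there is one */
static long csum_step(uint32_t index, void *data)
{
    struct csum_ctx *c = data;

    (void)index;
    if (c->end - c->cur >= 2) {
        c->sum += ((uint32_t)c->cur[0] << 8) | c->cur[1];
        c->cur += 2;
        return 0;               /* keep looping */
    }
    if (c->cur < c->end) {      /* odd trailing byte, padded with zero */
        c->sum += (uint32_t)c->cur[0] << 8;
        c->cur++;
    }
    return 1;                   /* nothing left: stop early */
}

/* Checksum of at most 2 * MAX_CSUM_WORDS bytes; longer input is truncated */
static uint16_t inet_csum(const uint8_t *buf, size_t len)
{
    struct csum_ctx c = { buf, buf + len, 0 };

    loop_bounded(MAX_CSUM_WORDS, csum_step, &c);

    /* Two folds are enough given the cap above, so no loop is needed */
    c.sum = (c.sum & 0xffff) + (c.sum >> 16);
    c.sum = (c.sum & 0xffff) + (c.sum >> 16);
    return (uint16_t)~c.sum;
}
```

Note how all loop state has to live in a context structure passed to the callback, and how the maximum iteration count is stated up front even though the callback normally exits much earlier.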
Reading packet contents
Our application depends on parsing the headers of an incoming encapsulated packet in order to determine the L2TP session context for that packet.
In order to do that, we need to read the packet contents!
Exactly how this is done is a little unclear on the face of things.
Our program has access to a struct __sk_buff pointer, which is a mirror of an internal kernel structure, and contains a __u32 field called data.
The data field would seem to be the thing to access to get at packet data, but:
- it's in the middle of the structure so it's clearly not an allocated array,
- __u32 looks like it's too small to represent a pointer on 64 bit hardware.
Most C programmers will be scratching their heads at this point and wondering: "What's going on?".
The answer lies in the verifier, which will convert accesses to the struct __sk_buff members into accesses into the internal kernel structure, performing checks along the way.
Although the data field doesn't obviously point to a data buffer, accesses to the field will result in accesses to the packet data buffer once the verifier has interpreted the read instructions.
So we just need to convince the compiler that the data field is actually a pointer, and we can then access it as a normal memory buffer, more or less. Our eBPF code has a macro which does this:
#define skb_ptr(_p) ((char *)(long)(_p))
This would be called on a data field like so:
char *ptr = skb_ptr(skb->data);
This technique is borrowed from the kernel bpf tests.
However, when reading from such a pointer we need to take some extra steps to show the verifier that we are not reading outside the bounds of the packet buffer.
To do so we need to explicitly check that the pointer hasn't exceeded the data_end field in struct __sk_buff.
We ended up implementing a wrapper function to do that for us:
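/*
 * Return a pointer to the next nbytes of packet data and advance the read
 * cursor *dptr. Return NULL, leaving the cursor untouched, if the read
 * would run past the end of the packet buffer.
 */
static void *skb_pullb(struct __sk_buff *skb, char **dptr, size_t nbytes)
{
    char *end = skb_ptr(skb->data_end);
    void *out = NULL;

    /* Explicit check against data_end, so the verifier can see the read is in bounds */
    if (*dptr + nbytes <= end) {
        out = *dptr;
        *dptr += nbytes;
    }
    return out;
}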
This function has the added convenience of passing out the current 'read cursor', which is useful when sequentially reading headers from a packet buffer since it avoids having to manually keep track of your current read offset. For example:
char *data = skb_ptr(skb->data);
struct ethhdr *eth;
struct iphdr *ip;

if (!(eth = skb_pullb(skb, &data, sizeof(*eth))))
    return false;
if (!(ip = skb_pullb(skb, &data, sizeof(*ip))))
    return false;
With these two elements in place we can access the packet buffer in much the same way you would in normal C code.
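The read-cursor behaviour can be tried out in userspace with a small stand-in. In the sketch below, struct fake_skb, skb_ptr_sketch(), skb_pullb_sketch() and count_headers() are hypothetical names, not part of our demo. The fake structure stores full-width addresses, because outside the kernel there is no verifier to rewrite 32-bit field accesses, and the bounds check is written as a length comparison to keep the pointer arithmetic well defined in ordinary C.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for struct __sk_buff: holds real addresses */
struct fake_skb {
    uintptr_t data;
    uintptr_t data_end;
};

#define skb_ptr_sketch(_p) ((char *)(uintptr_t)(_p))

/* Same contract as the wrapper above: pointer to nbytes, or NULL */
static void *skb_pullb_sketch(struct fake_skb *skb, char **dptr, size_t nbytes)
{
    char *end = skb_ptr_sketch(skb->data_end);
    void *out = NULL;

    if (nbytes <= (size_t)(end - *dptr)) {
        out = *dptr;
        *dptr += nbytes;    /* cursor only moves on success */
    }
    return out;
}

/*
 * Try to pull a 14-byte Ethernet header followed by a 20-byte IPv4
 * header from a buffer of the given length. Returns how many of the
 * two headers were available: 0, 1 or 2.
 */
static int count_headers(char *buf, size_t len)
{
    struct fake_skb skb = { (uintptr_t)buf, (uintptr_t)(buf + len) };
    char *cursor = skb_ptr_sketch(skb.data);

    if (!skb_pullb_sketch(&skb, &cursor, 14))
        return 0;
    if (!skb_pullb_sketch(&skb, &cursor, 20))
        return 1;
    return 2;
}
```

A failed pull returns NULL and leaves the cursor where it was, so a truncated packet is caught at the first header that does not fit rather than producing an out-of-bounds read later on.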
Wrapping it up
This post, and the two that preceded it, have described eBPF and what it does, signposted some interesting open source projects using eBPF, and presented our experiences about the nuts and bolts of building a small real-world eBPF program.
A changing landscape
Looking back on the experience of getting to grips with eBPF, it feels that although eBPF has existed for roughly ten years now, it is still a relatively nascent technology.
By this we mean that eBPF is still in flux:
- the verifier is growing in capabilities,
- the in-kernel bpf-helpers API is still evolving,
- the userspace tooling landscape seems to be shifting (bcc vs. CO-RE and libbpf, for example).
This flux, perhaps, is indicative of a range of users seeking to deploy their own particular kernel-space innovation via eBPF, rather than trying to upstream those features in the kernel proper.
Different use-cases have different requirements and different contexts, and drive different featuresets -- all within the overall scope of eBPF, and so change is inevitable.
As more people use eBPF, it seems reasonable to expect tooling and features to become more rounded and general, and the overall development experience to be more solid and predictable.
Kernel innovation
One of the draws of eBPF is that it permits rapid development and innovation in code that will run in kernel space.
With eBPF it is possible to implement and deploy the feature you want in a short space of time without having to worry about any of the overhead of upstreaming changes, let alone waiting for changes to make it out to distro kernels.
As fantastic as this sounds, it may perhaps have an unintended consequence in that useful features could end up buried in various eBPF silos rather than being in the kernel itself for all to use.
To an extent, we may lament the loss of these features in the kernel -- but on the other hand, the agency that eBPF grants to implement any given niche tool is undeniable, and surely there will be eBPF-fueled features out there which would never be considered appropriate for upstreaming.
Thanks for reading
Hopefully this series of posts on eBPF has given you some insights into the technology, and perhaps inspired you to have a go at some eBPF coding of your own.
If so -- happy hacking, and may the verifier go with you.