The Extended Berkeley Packet Filter in action
Our recent post about eBPF gave an introduction to the technology.
In this followup article we will expand on the theme by taking a look at some practical examples of eBPF use.
We will look at some mature open source projects from active areas of eBPF deployment.
We will also explore a smaller example program we have developed with the goal of illustrating some of the fundamentals of eBPF development.
The many faces of eBPF
It's probably fair to say that to date, eBPF has found most use in Cloud networking and in observability tooling, and it's in these areas that we find the most prominent open source projects.
One project we should mention at the outset is Cilium, which has been hugely influential in eBPF development and adoption, and which is widely used in Cloud networking.
Although Cilium is a fantastic project, it can be a lot to take in for a newcomer to eBPF. We'd be remiss not to mention it, but we have found smaller, more self-contained examples easier to get to grips with when learning about eBPF.
So while Cilium is an obvious success story, and well worth exploring, we also wanted to point out some smaller projects: still powerful tools, but perhaps easier to immediately understand.
eBPF for the Cloud
Cloud networking presents many challenges to traditional network stacks due to the dynamic nature and scale of typical Cloud software architectures.
eBPF has been widely adopted in this space due to the rapid pace of change it allows. Where it may take months or years to get features upstreamed into the Linux kernel, an eBPF program which meets your needs can be developed and deployed in a fraction of that time.
Some examples of interesting Cloud networking projects include:
- Cloudflare's Tubular. Tubular allows socket lookup to be programmed with eBPF, enabling much more flexible binding of incoming traffic to sockets and working around some of the limitations of the BSD sockets API at scale.
- Facebook's Katran. Katran is a Layer-4 load balancer, which allows TCP traffic to be flexibly distributed across a number of Layer-7 endpoints.
eBPF for observability
Due to its ability to intelligently extract data from a running kernel, eBPF's other big use case at the time of writing is in tracing, debugging, and analysis tools:
- OpenTelemetry eBPF profiler is a whole-system multi-language profiler. It runs with low overhead, and allows for generation of stack traces without debug symbols being installed. It supports mixed stacktraces which can span from kernel space through system libraries up to interpreted language runtimes.
- 0x.tools is a lightweight application performance analysis toolkit. It uses eBPF to track application state via a snapshotting approach which hooks into the scheduler, allowing it to gain sophisticated insight into application behaviour with relatively little overhead.
A small demo application
Having taken a look at some compelling open source projects, the remainder of this article will present a small demo eBPF application which we've implemented.
The goal of this application is to do something useful as simply as possible in order to show how the basics of an eBPF application fit together.
A network eBPF program
Our primary interest is in how eBPF can be used for networking, and we have plenty of prior experience with the L2TP network protocol.
As such, we decided to write a program which implements an L2TP datapath, although the techniques we have used could apply to similar protocols.
What exactly is L2TP?
L2TP generally exists to connect Layer 2 networks over an intermediate Layer 3 network.
The L2TP specifications allow for multiple different types of such connections. The earlier v2 version of the protocol carries only PPP L2 traffic.
Subsequently the v3 version generalised the protocol to allow the transmission of any L2 traffic type. The generic term for a single connection carrying L2 traffic is a "pseudowire".
It's fairly common to use L2TP with PPP pseudowires, using either L2TPv2 or L2TPv3. However, PPP is a fairly involved protocol suite in its own right, and overly complex to manage for a demonstration.
As such, we decided to implement the dataplane for an L2TPv3 Ethernet pseudowire.
Although the name sounds complex, don't be alarmed! The actual behaviour of an L2TPv3 Ethernet pseudowire in terms of data transmission is fairly simple, as we'll see.
High-level implementation notes
Now that we have a concrete goal in mind, we can consider what the L2TPv3 Ethernet pseudowire needs to do in terms of network packets.
Remember, the goal is to send a Layer 2 frame across a Layer 3 network.
So at one side of the pseudowire an L2 frame is received, and then it's the pseudowire's job to send that frame across the L3 network, before presenting it again on the other side as an L2 frame.
To call this out explicitly, we need to:
- take Ethernet frames arriving on a network interface attached to an L2 network,
- add L2TP encapsulation to the frame,
- transmit the encapsulated frame over the L3 network.
Then at the other side of the L3 network:
- take encapsulated L2TPv3 frames arriving from the L3 network,
- remove L2TP encapsulation from the frame,
- transmit the decapsulated frame on the L2 network.
Of course, L2TP connections need to handle bi-directional traffic, so we would need to implement this on both sides of the L3 network.
L2TP encapsulation headers vary depending on configuration options and whether UDP is used. Our demo supports both UDP and direct encapsulation in IP, over IPv4 and IPv6.
Each L2TP data packet contains a session ID which identifies the pseudowire. There are other optional fields in the header, such as a cookie, which our demo ignores.
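To make the wire format concrete, here is a sketch of the two data header layouts as C structs, assuming no cookie is present. The layouts follow RFC 3931, but the struct names are ours, and the demo's own definitions may differ:

```c
#include <assert.h>
#include <stdint.h>

/* L2TPv3 data header when carried over UDP: a 16-bit flags/version
   field (T bit clear for data messages, Ver = 3) and a reserved field,
   followed by the 32-bit session ID. The optional cookie would follow
   the session ID; our demo ignores it. */
struct l2tpv3_udp_hdr {
    uint16_t flags_ver;   /* T bit, reserved bits, Ver = 3 */
    uint16_t reserved;
    uint32_t session_id;  /* identifies the pseudowire; network byte order */
} __attribute__((packed));

/* When carried directly over IP, the data packet begins with the
   session ID itself (session ID 0 is reserved for control messages). */
struct l2tpv3_ip_hdr {
    uint32_t session_id;
} __attribute__((packed));
```

So the L2TPv3 data header costs 8 bytes over UDP and just 4 bytes over IP when no cookie is in use.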
Bringing eBPF to bear
The first decision when designing an eBPF program is which program type best fits your use case.
In our case we want to intercept packets early on in the network stack to allow us to deal with encapsulation and then forward the packet to a different interface.
There are a couple of options open to us. XDP programs see packets very early, potentially before they enter the network stack proper. Alternatively, the traffic control subsystem offers hook points which run later, after packets have been through some initial processing.
Either approach could be viable, but for this example we used the traffic control hooks.
Let's take a look at the code
Our full eBPF example can be seen here.
The eBPF code itself implements both the encapsulation and decapsulation parts of the datapath. The encap() function implements encapsulation, while decap() implements decapsulation. These functions are executed by the traffic control subsystem hooks.
In order to allow the different functions to be attached separately, the eBPF code uses the SEC() macro (defined in bpf_helpers.h) to place the different parts of the program into different ELF sections.
The different sections are called out when we load the eBPF program using tc, allowing us to attach cls_act/encap to encapsulate packets arriving on one interface, and cls_act/decap to decapsulate packets arriving on another interface.
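As a sketch of how this fits together, the following stub (with a local stand-in for the SEC() macro, so it builds outside the eBPF toolchain) shows the two entry points landing in their own ELF sections:

```c
/* A stand-in for the SEC() macro from bpf_helpers.h: it simply places
   the annotated symbol into a named ELF section, which is what lets
   tc attach encap and decap independently. */
#define SEC(name) __attribute__((section(name), used))

#define TC_ACT_OK 0  /* from linux/pkt_cls.h: let the packet continue */

struct __sk_buff;    /* the real definition lives in linux/bpf.h */

SEC("cls_act/encap")
int encap(struct __sk_buff *skb)
{
    /* the real program adds L2TP encapsulation and redirects here */
    return TC_ACT_OK;
}

SEC("cls_act/decap")
int decap(struct __sk_buff *skb)
{
    /* the real program strips L2TP encapsulation and redirects here */
    return TC_ACT_OK;
}
```

The section name is then named when attaching with tc, with something like `tc filter add dev <iface> ingress bpf direct-action obj ebpf_clsact.o sec cls_act/encap` (interface name hypothetical).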
The eBPF program is configured at runtime by means of two BPF hashmaps. Both contain L2TP session information, but each is indexed differently, depending on whether the encapsulation path or the decapsulation path is looking up the information.
The encapsulation path (the encap() function in the eBPF code) is the simpler of the two: the eBPF code needs to encapsulate all frames arriving on a specific interface.
To obtain the L2TP context for the encapsulation processing, the code searches the eth_session_map, which is indexed by interface ifindex.
Once the session information has been obtained, the encap() function applies encapsulating headers, starting with Ethernet, then IP (either v4 or v6), then UDP (if enabled) and then finally L2TP. The L2TP header layout differs slightly for UDP and IP encapsulation types.
The contents of these headers are largely informed by the session information from the eth_session_map.
Finally the encap() function redirects the encapsulated packet to the interface specified by the session information from the eth_session_map.
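To get a feel for what this header stacking costs, we can tally the per-frame overhead of each supported combination. This is a sketch using standard header sizes, and assumes no optional L2TP cookie:

```c
#include <assert.h>

/* Per-frame encapsulation overhead in bytes, built from standard
   header sizes. Assumes no optional L2TP cookie is in use. */
enum {
    ETH_HLEN      = 14,  /* outer Ethernet header */
    IPV4_HLEN     = 20,  /* outer IPv4 header, no options */
    IPV6_HLEN     = 40,  /* outer IPv6 header, no extension headers */
    UDP_HLEN      = 8,
    L2TP_UDP_HLEN = 8,   /* flags/version word + session ID */
    L2TP_IP_HLEN  = 4,   /* session ID only */
};

static int overhead(int ipv6, int udp)
{
    int bytes = ETH_HLEN + (ipv6 ? IPV6_HLEN : IPV4_HLEN);

    if (udp)
        bytes += UDP_HLEN + L2TP_UDP_HLEN;
    else
        bytes += L2TP_IP_HLEN;
    return bytes;
}
```

For example, IPv4 with UDP encapsulation adds 14 + 20 + 8 + 8 = 50 bytes to every frame, while IPv4 with direct IP encapsulation adds only 38.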
The decapsulation path is slightly more complicated, in that the incoming encapsulated packet must be parsed to build a key which is then used to look up session information in the l2tp_session_map.
In order to uniquely identify a given session from an incoming frame, we need to match on:
- source and destination IP address,
- source and destination UDP port (when using UDP encapsulation),
- IP protocol,
- L2TP version,
- the L2TP session ID.
This information can all be extracted from the headers of an incoming L2TP packet.
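A lookup key covering those fields might look like the following hypothetical sketch (the demo's actual key layout in l2tp_session_map may differ). Since BPF hashmap keys are compared bytewise, any padding must be zeroed before lookups or updates:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decap-side lookup key: everything needed to uniquely
   identify a session from an incoming packet's headers. */
struct l2tp_session_key {
    uint32_t src_ip[4];   /* room for IPv6; IPv4 uses the first word */
    uint32_t dst_ip[4];
    uint32_t session_id;
    uint16_t src_port;    /* zero for direct IP encapsulation */
    uint16_t dst_port;
    uint8_t  ip_proto;    /* IPPROTO_UDP, or 115 for L2TPv3 over IP */
    uint8_t  l2tp_version;
    uint8_t  pad[2];      /* explicit padding, kept zeroed */
};

/* Zero the whole key first so padding bytes compare equal between the
   key built at packet-parse time and the key stored from userspace. */
static void key_init(struct l2tp_session_key *k)
{
    memset(k, 0, sizeof(*k));
}
```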
Since the interface carrying L2TP packets may be carrying other traffic too, the decap() function bails out early with the return code TC_ACT_OK if no key can be parsed.
This signals to the traffic control framework that the packet should be allowed to continue its normal journey through the stack, and allows for non-L2TP traffic to be handled as usual.
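For reference, the traffic control verdicts involved are plain integers; the values below are reproduced from the kernel's linux/pkt_cls.h for illustration:

```c
#include <assert.h>

/* Traffic control verdicts from linux/pkt_cls.h (values as defined in
   the kernel UAPI). A tc eBPF program communicates with the framework
   purely through its return code. */
enum tc_verdict {
    TC_ACT_OK       = 0,  /* continue normal processing in the stack */
    TC_ACT_SHOT     = 2,  /* drop the packet */
    TC_ACT_REDIRECT = 7,  /* packet redirected, e.g. via bpf_redirect() */
};
```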
Assuming an incoming frame is parsed successfully, and session information looked up in the l2tp_session_map, the decap() function goes on to remove the outer encapsulation headers.
Finally, the decapsulated frame is redirected to the interface specified by the session information from the l2tp_session_map.
Configuring the eBPF program
The encapsulation and decapsulation paths in the eBPF program are configured using BPF maps, which can be updated from userspace.
In this way we could configure multiple different L2TPv3 Ethernet pseudowires, up to the limit of entries in the map (called out by the MAX_SESSIONS macro in ebpf_clsact.c).
Our example project includes a simple userspace application for writing to the BPF maps: map_session.c.
This application is driven by command-line arguments which fully specify the session.
It then uses the bpf(2) system call to update the maps for encapsulation and decapsulation handling.
The maps themselves are accessed using bpf_obj_get() which is a thin libbpf wrapper around the bpf(2) BPF_OBJ_GET subcommand.
As you can see, the code required to update the BPF maps is minimal: the bulk of map_session.c is concerned with parsing and sanity checking the command line arguments!
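For reference, the syscall-level shape of that map access is small. Here's a hedged sketch using the raw bpf(2) syscall directly rather than the libbpf wrapper; the pin path is hypothetical and error handling is elided:

```c
#include <assert.h>
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a pinned BPF map by filesystem path, returning a map fd on
   success or -1 on failure. libbpf's bpf_obj_get() is essentially
   this call. */
static int obj_get(const char *path)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.pathname = (unsigned long)path;
    return (int)syscall(SYS_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
}
```

With a valid fd in hand, the BPF_MAP_UPDATE_ELEM subcommand adds or updates a session entry; a call against a path where nothing is pinned (e.g. `obj_get("/sys/kernel/no-such-map")`) simply returns a negative value.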
Running the code
Before running the code, you must first build it: the instructions in the project README give more details.
Included in the repository alongside the eBPF and userspace code is a set of scripts which instantiate an L2TP datapath and test it using ping.
The l2tp-test.sh script creates a number of network namespaces to emulate a simple L2TP configuration: the script itself contains a full description of the test environment.
For convenience, the Makefile has a check target which will run through the supported scenarios:
- IPv4 or IPv6 addresses,
- L2TPv3 UDP or IP encapsulation.
These are tested for the following configurations:
- default: this uses the existing kernel L2TP subsystem, configured using iproute2 commands,
- ebpf-ns1, ebpf-ns2: these use eBPF for one of the L2TP peers, and the kernel L2TP subsystem for the other,
- ebpf-ns1-ns2: this uses eBPF for both L2TP peers.
Depending on your development environment, you may not be able to use eBPF with IPv6 addresses -- although the eBPF code supports it, it depends on specific eBPF helpers which may not be available in all distro kernels at the time of writing.
If you'd like to get some more context on what the eBPF code is doing at runtime, you can build with debugging enabled:
$ make V=1
and then in a separate terminal:
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
This will display the extra logging the eBPF code generates using bpf_trace_printk().
Comparisons with the kernel L2TP subsystem
Our small demo implements a functioning (if not fully-featured) L2TPv3 Ethernet pseudowire in a few hundred lines of code.
The kernel L2TP subsystem, by contrast, involves a few thousand lines.
Granted, the kernel does a lot more than the demo does, but it is still striking how much functionality the demo implements with relatively little code.
Not only that, but the kernel L2TP subsystem has to deal with a lot more complexity due to the fact that it is kernel code: it has to understand socket lifetimes, memory management, different execution contexts, kernel locking primitives, and more besides.
The eBPF code gets to sidestep the vast majority of that complexity, which makes it much quicker and easier to write.
Conclusions
In this post we've looked at some interesting open source eBPF projects for Cloud networking and application profiling.
We've also given an overview of a small demo project which implements an L2TPv3 Ethernet pseudowire datapath using eBPF, written to be as simple as possible to illustrate the fundamental tools and APIs for dealing with eBPF code, loading it, and running it.
Hopefully these different projects give you some idea of the utility of eBPF -- perhaps you can even start to imagine some uses for it in your own work!
In the next post we will deal in a little more depth with some important implementation details of our demo project, and share some of the challenges we had when working on it.