Linux Kernel XDP & AF_XDP: Achieving 2 Billion Packets Per Second
Published: December 31, 2025
By: Stantia Labs Team
Category: Technology
Read Time: 12 minutes
Discover how Linux kernel XDP and AF_XDP bypass the traditional network stack to achieve unprecedented packet processing rates—enabling Relay to handle 2 billion packets per second with minimal CPU overhead.
The Problem: The Linux Network Stack Bottleneck
For decades, packet processing on Linux has followed the same path: packets arrive at the network interface card (NIC), are copied into kernel buffers, traverse the entire network stack (IP, TCP, UDP layers), and eventually reach user-space applications. This design works well for traditional workloads but becomes a critical bottleneck when you need to process billions of packets per second.
Each stage of this journey adds latency and consumes CPU cycles:
- Multiple context switches: Transitions between kernel and user space (system calls, wakeups) cost hundreds of nanoseconds or more each unless they are amortized over many packets.
- Memory copying: The stack allocates an sk_buff for every packet and copies payload data across the kernel/user boundary, spending CPU cycles and memory bandwidth.
- Cache misses: Per-packet metadata and the long code path through the stack put constant pressure on the L1/L2 caches, especially at high packet rates.
- Protocol bloat: Not all applications need full TCP/IP processing, yet they pay the cost regardless.
For real-time communication systems like Relay, which must handle millions of concurrent connections and billions of packets daily, the traditional Linux network stack becomes a CPU-eating, latency-introducing bottleneck.
Enter XDP: The Express Data Path
XDP (eXpress Data Path) is a Linux kernel technology that allows programs to run directly on incoming packets at the earliest possible point—in the NIC driver, before the kernel's network stack processes them. Introduced in Linux 4.8 and continuously improved, XDP enables filtering, forwarding, and modification of packets with minimal overhead.
Here's how XDP works:
XDP Execution Model
XDP programs are eBPF (extended Berkeley Packet Filter) bytecode that run in a sandboxed kernel context. When a packet arrives:
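- The NIC writes the packet via DMA into a receive buffer in host memory and updates its RX ring
- The NIC driver polls the RX ring
- The XDP program executes (decision point)
- The program returns an action: XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT, or XDP_ABORTED (the error path, which also drops the packet)
- The packet is dropped immediately (no stack overhead), transmitted back out the interface it arrived on, redirected to another interface, CPU, or AF_XDP socket, or passed to the stack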
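The decision point can be illustrated with a small user-space model. This is plain Python operating on raw frame bytes, not an eBPF program (real XDP programs are usually written in restricted C and compiled to eBPF). The policy shown (drop UDP traffic to one blocked port, pass everything else) and the names `xdp_verdict`, `build_udp_frame`, and `BLOCKED_UDP_PORT` are assumptions made for illustration. The action codes match the kernel's `enum xdp_action`.

```python
import struct

# Action codes as defined by the kernel's enum xdp_action.
XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT = range(5)

ETH_HLEN = 14            # Ethernet header length without VLAN tags
ETH_P_IP = 0x0800        # EtherType for IPv4
IPPROTO_UDP = 17
BLOCKED_UDP_PORT = 9999  # hypothetical policy, for illustration only


def xdp_verdict(frame: bytes) -> int:
    """Return an XDP-style action for one Ethernet frame."""
    # Every access is bounds-checked first, as the eBPF verifier requires.
    if len(frame) < ETH_HLEN:
        return XDP_DROP
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IP:
        return XDP_PASS                      # not IPv4: let the stack handle it
    if len(frame) < ETH_HLEN + 20:
        return XDP_DROP                      # truncated IPv4 header
    ihl = (frame[ETH_HLEN] & 0x0F) * 4       # IPv4 header length in bytes
    if ihl < 20:
        return XDP_DROP
    if frame[ETH_HLEN + 9] != IPPROTO_UDP:
        return XDP_PASS
    l4 = ETH_HLEN + ihl
    if len(frame) < l4 + 8:
        return XDP_DROP                      # truncated UDP header
    (dport,) = struct.unpack_from("!H", frame, l4 + 2)
    return XDP_DROP if dport == BLOCKED_UDP_PORT else XDP_PASS


def build_udp_frame(dport: int) -> bytes:
    """Build a minimal Ethernet/IPv4/UDP frame (checksums left as zero)."""
    eth = bytes(12) + struct.pack("!H", ETH_P_IP)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, 0, 64, IPPROTO_UDP, 0,
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    udp = struct.pack("!HHHH", 12345, dport, 8, 0)
    return eth + ip + udp
```

Calling `xdp_verdict(build_udp_frame(9999))` yields `XDP_DROP`, while a frame addressed to any other port yields `XDP_PASS`. The explicit length checks before each header read reflect the style the verifier enforces on real programs.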
This is revolutionary because:
- Zero copies: XDP programs operate on the packet in the driver's receive buffer, exactly where the NIC's DMA placed it. No copy is needed.
- Early filtering: Unwanted packets (DDoS attacks, malformed data) can be dropped before consuming any stack resources.
- Low, predictable latency: Decisions happen at the earliest point in the software packet path, before an sk_buff is even allocated.
- Programmable in eBPF: No need to modify kernel source or recompile. Just load a new eBPF program.
AF_XDP: User-Space Packet Processing at Wire Speed
While XDP is powerful for early filtering and forwarding, real-time communication systems like Relay need to actually process packets—apply business logic, route messages, maintain state. This is where AF_XDP comes in.
AF_XDP (Address Family XDP) is a socket address family that lets user-space applications receive and send packets through rings shared with the NIC driver, bypassing the kernel network stack. An XDP program steers selected packets to the socket with XDP_REDIRECT. It's the missing link between kernel-space XDP and user-space processing.
How AF_XDP Works
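AF_XDP sets up several shared-memory rings and a shared buffer pool between the kernel and the user-space application:
RX Ring (Receive)
Kernel posts descriptors for received packets; the application reads the packet data in place from the shared buffer pool
TX Ring (Transmit)
Application posts descriptors for frames it has written into the shared buffer pool; the driver transmits them
Fill and Completion Rings
Through the fill ring the application hands empty buffers to the kernel for reception; through the completion ring the kernel returns buffers whose transmission has finished
Shared Buffer Pool (UMEM)
Pre-allocated memory registered with the kernel and divided into fixed-size frames, ideally on the NIC's local NUMA node, so nothing is allocated per packet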
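The critical advantage: in zero-copy mode the application accesses packets in place, without a copy into separate user-space buffers. The rings carry only descriptors that say which buffers are ready to read or write. Drivers without zero-copy support fall back to copy mode, which adds one copy per packet.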
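A toy model of the descriptor flow may help here. It does not open a real AF_XDP socket; the class `UmemModel`, its method names, and the 2048-byte frame size are assumptions made for illustration. It includes a fill ring, through which the application hands empty frames to the kernel, and an RX ring, through which the kernel reports received packets. The point is that the rings carry only small descriptors (an offset and a length into one shared buffer), while the payload is written once and then read in place.

```python
from collections import deque

FRAME_SIZE = 2048  # assumed frame size for this sketch


class UmemModel:
    """Toy model of a shared buffer pool with fill and RX rings."""

    def __init__(self, num_frames: int):
        self.buf = bytearray(num_frames * FRAME_SIZE)   # one shared region
        self.fill = deque()   # frame offsets the application gives the kernel
        self.rx = deque()     # (offset, length) descriptors of received packets
        for i in range(num_frames):
            self.fill.append(i * FRAME_SIZE)

    def kernel_receive(self, payload: bytes) -> bool:
        """Kernel side: place a packet in a free frame and post a descriptor."""
        if not self.fill or len(payload) > FRAME_SIZE:
            return False                      # no usable frame: packet is dropped
        addr = self.fill.popleft()
        self.buf[addr:addr + len(payload)] = payload
        self.rx.append((addr, len(payload)))
        return True

    def app_poll(self):
        """Application side: return (offset, view) for the next packet, or None."""
        if not self.rx:
            return None
        addr, length = self.rx.popleft()
        # memoryview slices reference the shared buffer; no bytes are copied.
        return addr, memoryview(self.buf)[addr:addr + length]

    def app_release(self, addr: int) -> None:
        """Application side: recycle a frame by returning it to the fill ring."""
        self.fill.append(addr)
```

When the fill ring runs empty, `kernel_receive` returns `False`. That mirrors an operational concern with real AF_XDP sockets: an application that does not recycle frames quickly enough loses packets.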
Why XDP + AF_XDP Enables 2 Billion Packets Per Second
Achieving 2 billion packets per second requires eliminating every possible bottleneck. Here's how XDP and AF_XDP work together:
1. Minimal Memory Operations
Traditional stack: 2-3 buffer copies per packet. XDP/AF_XDP: zero copies in zero-copy mode. At 2B pps, even 1 nanosecond per copy adds up to 2 seconds of CPU time for every second of traffic, which is two cores doing nothing but copying.
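To put the per-copy cost in concrete terms, here is a back-of-the-envelope sketch. The function names are illustrative, and the 64-byte packet size and three copies used in the example are assumed inputs, not measurements.

```python
def cores_consumed(pps: float, ns_per_packet: float) -> float:
    """CPU-seconds spent per wall-clock second, i.e. cores kept fully busy."""
    return pps * ns_per_packet * 1e-9


def copy_bandwidth_gbytes(pps: float, packet_bytes: int, copies: int) -> float:
    """Bytes written by the copies alone, in GB/s (each copy also reads as much)."""
    return pps * packet_bytes * copies / 1e9
```

For example, `cores_consumed(2e9, 1.0)` comes to about 2 cores, and `copy_bandwidth_gbytes(2e9, 64, 3)` comes to 384 GB/s of writes for 64-byte packets copied three times.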
2. Lock-Free Ring Buffers
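AF_XDP uses lock-free shared memory rings. Each ring has a single producer and a single consumer, so there are no spinlocks and no atomic read-modify-write operations: application and kernel exchange producer and consumer indices, ordered by memory barriers, without blocking.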
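The index arithmetic behind such a ring can be sketched as follows. A single-threaded Python model cannot demonstrate memory ordering, so treat this only as an illustration of how two free-running indices and a power-of-two mask replace locking. `SpscRing` is a hypothetical name, not part of any AF_XDP library.

```python
class SpscRing:
    """Single-producer/single-consumer ring with free-running indices."""

    def __init__(self, size: int):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.mask = size - 1
        self.slots = [None] * size
        self.producer = 0   # written only by the producer
        self.consumer = 0   # written only by the consumer

    def push(self, item) -> bool:
        if self.producer - self.consumer > self.mask:
            return False                       # ring is full
        self.slots[self.producer & self.mask] = item
        # In C, the index store below would use release ordering so the
        # consumer never observes the new index before the slot contents.
        self.producer += 1
        return True

    def pop(self):
        if self.consumer == self.producer:
            return None                        # ring is empty
        item = self.slots[self.consumer & self.mask]
        self.consumer += 1
        return item
```

Because each index has exactly one writer, neither side ever needs to take a lock. The only coordination required is that each side sees the other's index update after the data it guards.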
3. NUMA Awareness
Modern servers have multiple NUMA nodes. An AF_XDP application can pin its threads to cores and place its buffer pools in memory local to the NIC's NUMA node, minimizing cross-node traffic.
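A minimal sketch of the discovery and pinning step on Linux, using only the standard library. The sysfs paths are the usual locations on Linux, but they can be absent (for example on virtual interfaces), and the helper names are assumptions made for illustration.

```python
import os
from pathlib import Path


def nic_numa_node(ifname: str):
    """Return the NUMA node sysfs reports for a NIC, or None if unknown."""
    path = Path("/sys/class/net") / ifname / "device" / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None
    return node if node >= 0 else None     # -1 means no NUMA information


def parse_cpulist(text: str) -> set:
    """Parse a kernel cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def cpus_on_node(node: int) -> set:
    """Read the CPUs that belong to one NUMA node from sysfs."""
    text = Path(f"/sys/devices/system/node/node{node}/cpulist").read_text()
    return parse_cpulist(text)


def pin_caller(cpu: int) -> None:
    """Restrict the caller (pid 0) to a single CPU. Linux only."""
    os.sched_setaffinity(0, {cpu})
```

A worker would typically call `nic_numa_node`, pick a CPU from `cpus_on_node`, and pin itself before allocating and first touching its buffer pool. Under Linux's default policy, pages are usually placed on the node of the CPU that first touches them.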
4. Batching & Vectorization
Process packets in batches (32-64 at a time) to amortize per-call and ring-update costs. With a whole batch in hand, SIMD instructions can process header fields from several packets at once.
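The amortization argument can be made concrete with a small sketch. The cost figures in the example (640 ns of fixed overhead per call, 5 ns of work per packet) are made-up inputs chosen for easy arithmetic, and both function names are illustrative.

```python
def amortized_ns(fixed_ns: float, per_packet_ns: float, batch: int) -> float:
    """Per-packet cost when a fixed overhead is paid once per batch."""
    return fixed_ns / batch + per_packet_ns


def drain_in_batches(packets, batch_size: int = 64):
    """Yield lists of up to batch_size packets from any iterable."""
    batch = []
    for pkt in packets:
        batch.append(pkt)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch          # final partial batch
```

With those assumed inputs, `amortized_ns(640, 5, 1)` is 645 ns per packet, while `amortized_ns(640, 5, 64)` is 15 ns: the fixed overhead shrinks from dominant to a minor share.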
5. CPU Efficiency
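Under favorable conditions (small packets, minimal per-packet work), a single CPU core can sustain ~100M+ pps with AF_XDP, which leaves roughly 10 ns per packet. Relay scales to 2B pps across 20+ cores, keeping per-core throughput manageable and power-efficient.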
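The per-core figures translate into a time and cycle budget per packet. The sketch below does that arithmetic; the 3 GHz clock used in the example is an assumption, and the function names are illustrative.

```python
import math


def ns_per_packet(pps_per_core: float) -> float:
    """Time available to handle one packet on one core, in nanoseconds."""
    return 1e9 / pps_per_core


def cycles_per_packet(pps_per_core: float, clock_hz: float) -> float:
    """CPU cycles available per packet at a given clock frequency."""
    return clock_hz / pps_per_core


def cores_needed(total_pps: float, pps_per_core: float) -> int:
    """Cores required to reach total_pps at a given per-core rate."""
    return math.ceil(total_pps / pps_per_core)
```

At 100M pps per core, `ns_per_packet(100e6)` is 10 ns and `cycles_per_packet(100e6, 3e9)` is 30 cycles, and `cores_needed(2e9, 100e6)` is 20. A single miss to main memory costs several times that budget, which is why cache locality matters so much in this design.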
Real-World Performance Comparison
Let's compare three approaches for processing network packets:
| Approach | Throughput | Latency (p99) | CPU Cores |
|---|---|---|---|
| Traditional Linux Stack | ~10M pps | ~100µs | 8-16 |
| DPDK (User-space driver) | ~1B pps | ~10µs | 6-10 |
| AF_XDP (Optimized) | ~2B+ pps | ~1µs | 20+ |
* Performance depends on packet size, hardware, and workload complexity. Numbers are representative of modern systems (Intel Xeon, 100G NICs).
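The dependence on packet size and hardware can be quantified with the standard Ethernet line-rate calculation. Every frame occupies an extra 20 bytes on the wire (preamble, start-of-frame delimiter, and inter-frame gap). The function names below are illustrative.

```python
import math

# 7 bytes preamble + 1 byte start-of-frame delimiter + 12 bytes inter-frame gap
ETH_OVERHEAD_BYTES = 20


def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Maximum frames per second a link can carry at a given frame size."""
    return link_bps / ((frame_bytes + ETH_OVERHEAD_BYTES) * 8)


def ports_needed(target_pps: float, link_bps: float, frame_bytes: int) -> int:
    """Number of ports required to carry target_pps at that frame size."""
    return math.ceil(target_pps / line_rate_pps(link_bps, frame_bytes))
```

A 100G port tops out at about 148.8M pps with minimum-size 64-byte frames, so `ports_needed(2e9, 100e9, 64)` gives 14 ports. Larger frames lower the per-port packet rate further, so aggregate packet-rate figures are only meaningful alongside a stated frame size and port count.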
Relay's Implementation: XDP + AF_XDP Architecture
At Stantia Labs, Relay leverages both XDP and AF_XDP to achieve its 2 billion packets per second capability:
Relay's Packet Processing Pipeline
Early filtering & routing decisions
Drop invalid packets, detect routing destination, apply DDoS protection
User-space packet reception
Receive filtered packets directly from NIC, zero-copy
Relay message handling
Decode MQTT/AMQP/gRPC, apply logic, maintain state, route to subscribers
User-space packet transmission
Write response packets directly to NIC TX ring, zero-copy
This end-to-end zero-copy pipeline, combined with careful CPU affinity, NUMA optimization, and lock-free data structures, is what allows Relay to sustain 2 billion packets per second on modern hardware.
Challenges & Trade-offs
XDP and AF_XDP are powerful but come with considerations:
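NIC Driver Support
Not all NIC drivers support XDP in all modes. Mellanox/NVIDIA, Intel, and Broadcom cards have varying levels of native-mode XDP and AF_XDP zero-copy support. Where the driver lacks it, AF_XDP still works in copy mode on top of generic XDP, at lower performance.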
Kernel Version Requirements
XDP arrived in Linux 4.8 and AF_XDP in 4.18, and both have evolved significantly since. Linux 5.0+ is recommended for production use, and AF_XDP is best on 5.2+.
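A deployment script might check the running kernel against these thresholds before enabling the fast path. The sketch below is one hedged way to do it. The default minimum of 5.2 is taken from the recommendation above, and the helper names are assumptions made for illustration.

```python
import platform
import re


def parse_kernel_release(release: str):
    """Extract (major, minor) from strings like '5.15.0-91-generic'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if not m:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


def meets_minimum(release: str, minimum=(5, 2)) -> bool:
    """True if the release string is at least the given (major, minor)."""
    return parse_kernel_release(release) >= minimum


def running_kernel_ok(minimum=(5, 2)) -> bool:
    """Check the running kernel; only meaningful on Linux."""
    return platform.system() == "Linux" and meets_minimum(platform.release(), minimum)
```

The version number is only a rough proxy. Distribution kernels often backport features, so probing for the specific capability is more reliable where that is possible.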
Operational Complexity
eBPF programs require careful debugging. Relay abstracts this complexity away, but understanding the stack is valuable for optimization.
The Future: XDP Adoption
XDP and AF_XDP are becoming industry standards:
- Cloud providers are deploying AF_XDP in their data centers for networking, load balancing, and DDoS protection.
- 5G infrastructure leverages XDP for packet processing in NFV (Network Function Virtualization).
- Edge computing uses AF_XDP to maximize throughput with minimal power consumption.
- IoT gateways adopt XDP for efficient filtering and forwarding of sensor data.
As Linux continues to improve kernel networking capabilities, expect even more optimization opportunities. eBPF itself is becoming a fundamental computing model in the kernel—not just for networking, but for observability, security, and tracing.
Conclusion: The Path to 2 Billion Packets Per Second
Traditional approaches to packet processing, whether POSIX sockets or raw sockets with Berkeley Packet Filter programs attached, become bottlenecks at billion-packet-per-second scales. XDP and AF_XDP represent a paradigm shift: moving packet processing as close to the hardware as possible while maintaining the flexibility of a general-purpose operating system.
Relay's ability to handle 2 billion packets per second isn't magic—it's the result of:
- Leveraging kernel-space XDP for early filtering and intelligent routing decisions
- Using AF_XDP for zero-copy packet reception and transmission
- Eliminating all unnecessary memory copies and context switches
- Optimizing for modern CPU architectures (NUMA, vector instructions)
- Maintaining lock-free, wait-free data structures throughout
If you're building systems that demand extreme performance—IoT platforms, financial infrastructure, real-time communication networks—understanding XDP and AF_XDP is essential. And if you want to leverage these capabilities without becoming a kernel networking expert, Relay is built from the ground up to make these capabilities accessible to your applications.
Ready to Build with Relay?
Relay's XDP and AF_XDP optimizations are transparent to your application—just send and receive packets at 2 billion per second.