Linux Kernel XDP & AF_XDP: Achieving 2 Billion Packets Per Second

Published

December 31, 2025

Stantia Labs Team

The Problem: The Linux Network Stack Bottleneck

For decades, packet processing on Linux has followed the same path: packets arrive at the network interface card (NIC), are copied into kernel buffers, traverse the entire network stack (IP, TCP, UDP layers), and eventually reach user-space applications. This design works well for traditional workloads but becomes a critical bottleneck when you need to process billions of packets per second.

Each stage of this journey adds latency and consumes CPU cycles:

• Context switches: Every transition between kernel and user space costs CPU time, and without batching that cost is paid for each packet.
• Memory copying: Packet data is copied between kernel and user-space buffers, and each packet needs its own socket buffer (sk_buff) allocated and freed, which costs CPU cycles and memory bandwidth.
• Cache misses: The stack touches many data structures per packet (socket buffers, routing tables, connection state), so at high packet rates the working set spills out of the L1/L2 caches.
• Protocol bloat: Not all applications need full TCP/IP processing, yet they pay the cost regardless.

For real-time communication systems like Relay, which must handle millions of concurrent connections and billions of packets daily, the traditional Linux network stack becomes a CPU-eating, latency-introducing bottleneck.

Enter XDP: The Express Data Path

XDP (eXpress Data Path) is a Linux kernel technology that allows programs to run directly on incoming packets at the earliest possible point—in the NIC driver, before the kernel's network stack processes them. Introduced in Linux 4.8 and continuously improved, XDP enables filtering, forwarding, and modification of packets with minimal overhead.

Here's how XDP works:

XDP Execution Model

XDP programs are eBPF (extended Berkeley Packet Filter) bytecode that run in a sandboxed kernel context. When a packet arrives:

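1. Packet DMA completes into the NIC's receive ring buffer
2. NIC driver polls the ring buffer
3. XDP program executes (decision point)
4. Program returns an action: XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT, or XDP_ABORTED (error path, packet dropped)
5. Packet is dropped immediately (no stack overhead), sent back out the same interface (XDP_TX), redirected to another interface, CPU, or AF_XDP socket (XDP_REDIRECT), or passed to the stack (XDP_PASS)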

This is revolutionary because:

✓ Zero copies: XDP programs work directly on the packet buffer the NIC wrote via DMA, so no copying is needed.
✓ Early filtering: Unwanted packets (DDoS attacks, malformed data) can be dropped before consuming any stack resources.
✓ Low, predictable latency: Decisions happen at the earliest point in the packet path, before any socket buffer is allocated.
✓ Programmable in eBPF: No need to modify kernel source or recompile; just load a new eBPF program.
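
To make the execution model concrete, here is a minimal user-space sketch of the decision an XDP program makes. Real XDP programs are written in restricted C and compiled to eBPF; this Python model only mirrors the logic. The action values match the kernel's xdp_action enum, but the policy itself (pass UDP to one port, drop other UDP), the port number 9000, and the helper build_udp_frame are hypothetical and chosen purely for illustration.

```python
import struct

# Numeric values match enum xdp_action in the kernel's <linux/bpf.h>.
XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT = range(5)

ETH_HLEN = 14        # destination MAC + source MAC + EtherType
ETH_P_IP = 0x0800    # EtherType for IPv4
IPPROTO_UDP = 17


def xdp_verdict(frame: bytes, allowed_udp_port: int = 9000) -> int:
    """Hypothetical policy: pass UDP to one port, drop other UDP, pass the rest."""
    # Every read is preceded by a bounds check, as the eBPF verifier requires.
    if len(frame) < ETH_HLEN:
        return XDP_DROP
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IP:
        return XDP_PASS                      # not IPv4: leave it to the stack
    if len(frame) < ETH_HLEN + 20:
        return XDP_DROP                      # truncated IPv4 header
    ihl = (frame[ETH_HLEN] & 0x0F) * 4       # IPv4 header length in bytes
    if ihl < 20:
        return XDP_DROP                      # malformed header length
    if frame[ETH_HLEN + 9] != IPPROTO_UDP:
        return XDP_PASS                      # non-UDP traffic goes to the stack
    l4 = ETH_HLEN + ihl
    if len(frame) < l4 + 8:
        return XDP_DROP                      # truncated UDP header
    (dst_port,) = struct.unpack_from("!H", frame, l4 + 2)
    return XDP_PASS if dst_port == allowed_udp_port else XDP_DROP


def build_udp_frame(dst_port: int) -> bytes:
    """Illustration-only helper: a minimal Ethernet/IPv4/UDP frame."""
    eth = b"\x00" * 12 + struct.pack("!H", ETH_P_IP)
    ip = (bytes([0x45, 0]) + struct.pack("!H", 28) + b"\x00" * 4
          + bytes([64, IPPROTO_UDP]) + b"\x00\x00"
          + bytes([10, 0, 0, 1]) + bytes([10, 0, 0, 2]))
    udp = struct.pack("!HHHH", 12345, dst_port, 8, 0)
    return eth + ip + udp
```

Calling xdp_verdict(build_udp_frame(9000)) returns XDP_PASS, while a frame for any other UDP port returns XDP_DROP before the stack would ever see it.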

AF_XDP: User-Space Packet Processing at Wire Speed

While XDP is powerful for early filtering and forwarding, real-time communication systems like Relay need to actually process packets—apply business logic, route messages, maintain state. This is where AF_XDP comes in.

AF_XDP (Address Family XDP) is a socket address family that lets user-space applications receive and send packets through rings shared with the NIC driver, bypassing the kernel network stack. It's the missing link between kernel-space XDP and user-space processing: an XDP program steers selected packets to an AF_XDP socket with XDP_REDIRECT.

How AF_XDP Works

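AF_XDP sets up a set of shared-memory rings between the kernel and the user-space application. The rings carry descriptors that point into one shared buffer area, called the UMEM:

RX Ring (Receive)

Kernel hands the application descriptors of buffers that hold received packets

TX Ring (Transmit)

Application queues descriptors of buffers it wants the NIC to send

Fill and Completion Rings

Fill ring: application gives free buffers to the kernel for receive. Completion ring: kernel returns buffers whose transmission has finished

Shared Buffer Pool (UMEM)

Pre-allocated memory, ideally on the NIC's NUMA node, so there is zero allocation per packet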

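The critical advantage: in zero-copy mode the application works on packets in place, in memory the NIC reads and writes directly, so nothing is copied into separate user-space buffers. The kernel and the application only exchange descriptors that say which buffers are ready to read or write. Drivers without zero-copy support fall back to a copy mode that keeps the same API at lower throughput.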
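
The buffer hand-off is easier to follow as a small model. In the kernel's AF_XDP interface the shared buffer pool is called the UMEM, and a fill ring sits alongside the RX, TX, and completion rings. The sketch below simulates both sides in plain Python, using deques in place of the real fixed-size descriptor rings. The class and function names, the frame size, and the frame count are assumptions made for illustration; this is not the libxdp or kernel API.

```python
from collections import deque

FRAME_SIZE = 2048   # assumed UMEM frame size in bytes
NUM_FRAMES = 8      # assumed number of frames in the pool


class UmemModel:
    """Toy model of one AF_XDP socket: a UMEM plus its four rings."""

    def __init__(self):
        self.umem = bytearray(FRAME_SIZE * NUM_FRAMES)
        self.fill = deque()        # user -> kernel: free frame addresses for RX
        self.rx = deque()          # kernel -> user: (addr, length) of received packets
        self.tx = deque()          # user -> kernel: (addr, length) to transmit
        self.completion = deque()  # kernel -> user: addresses whose TX finished

    # --- simulated kernel side ---
    def kernel_receive(self, payload: bytes) -> bool:
        if not self.fill or len(payload) > FRAME_SIZE:
            return False           # no free frame posted: the packet is dropped
        addr = self.fill.popleft()
        self.umem[addr:addr + len(payload)] = payload   # stands in for NIC DMA
        self.rx.append((addr, len(payload)))
        return True

    def kernel_transmit(self) -> list:
        sent = []
        while self.tx:
            addr, length = self.tx.popleft()
            sent.append(bytes(self.umem[addr:addr + length]))
            self.completion.append(addr)   # frame can be reused by the application
        return sent


# --- user side ---
def echo_once(sock: UmemModel) -> bool:
    """Take one RX descriptor and queue the same frame for TX, payload untouched."""
    if not sock.rx:
        return False
    addr, length = sock.rx.popleft()
    sock.tx.append((addr, length))   # only the descriptor moves, not the data
    return True


def recycle(sock: UmemModel) -> None:
    """Return completed TX frames to the fill ring so they can receive again."""
    while sock.completion:
        sock.fill.append(sock.completion.popleft())
```

A frame address travels fill → RX → TX → completion → fill, while the packet bytes stay at the same offset in the UMEM the whole time.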

Why XDP + AF_XDP Enables 2 Billion Packets Per Second

Achieving 2 billion packets per second requires eliminating every possible bottleneck. Here's how XDP and AF_XDP work together:

1. Minimal Memory Operations

Traditional stack: 2-3 buffer copies per packet. XDP/AF_XDP: Zero copies. At 2B pps, even 1 nanosecond per copy adds up to 2 seconds of CPU time for every second of traffic, which is two full cores doing nothing but copying.
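
The cost of a per-packet operation can be worked out directly: packet rate times time per packet gives the CPU time consumed per second of wall-clock time, which is the number of cores kept fully busy. A short sketch, with the function name chosen for illustration:

```python
def cores_consumed(pps: float, ns_per_packet: float) -> float:
    """CPU-seconds spent per wall-clock second, i.e. the number of fully busy cores."""
    return pps * ns_per_packet / 1e9


# One 1 ns copy per packet at 2 billion packets per second:
one_copy = cores_consumed(2e9, 1.0)

# Three such copies per packet, the upper end of the 2-3 copies quoted above:
three_copies = cores_consumed(2e9, 3.0)
```

Under these assumptions one copy costs about two cores and three copies cost about six, before any useful work is done on the packet.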

2. Lock-Free Ring Buffers

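AF_XDP uses lock-free, single-producer/single-consumer shared memory rings. No spinlocks and no system call per packet: application and kernel communicate through producer and consumer indices, ordered with memory barriers, without blocking each other.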
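
Here is a minimal sketch of the index-based ring idea, assuming a single producer and a single consumer. Each side writes only its own index, which is why no lock is needed. Python cannot express the memory ordering a C implementation has to get right, so the comments mark where that ordering would apply. The class is illustrative, not the kernel's ring layout.

```python
class SpscRing:
    """Single-producer/single-consumer ring with a power-of-two size."""

    def __init__(self, size: int):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.size = size
        self.mask = size - 1          # index & mask replaces a modulo
        self.slots = [None] * size
        self.producer = 0             # written only by the producer
        self.consumer = 0             # written only by the consumer

    def push(self, item) -> bool:
        if self.producer - self.consumer == self.size:
            return False              # ring is full
        self.slots[self.producer & self.mask] = item
        # In C, the index update below would be a release store so that the
        # slot contents become visible before the new producer index does.
        self.producer += 1
        return True

    def pop(self):
        # In C, reading the producer index would be an acquire load.
        if self.consumer == self.producer:
            return None               # ring is empty
        item = self.slots[self.consumer & self.mask]
        self.consumer += 1
        return item
```

The indices only ever increase and are masked on access, so a full ring (difference equals size) and an empty ring (indices equal) are never confused.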

3. NUMA Awareness

Modern systems have multiple NUMA nodes. An AF_XDP application can pin its threads to cores on the NIC's NUMA node and allocate its buffer pool from that node's local memory, minimizing cross-node traffic.
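
A sketch of the placement step, assuming the NUMA topology is already known. On Linux the node-to-core mapping is usually read from /sys/devices/system/node/node*/cpulist and the NIC's node from /sys/class/net/<iface>/device/numa_node; here both are passed in as plain values. The function names and the example topology are hypothetical.

```python
import os


def plan_queue_cores(numa_cores: dict, nic_node: int, num_queues: int) -> dict:
    """Map each NIC queue index to a distinct core on the NIC's own NUMA node."""
    local = sorted(numa_cores[nic_node])
    if num_queues > len(local):
        raise ValueError("more queues than cores on the NIC's NUMA node")
    return {queue: local[queue] for queue in range(num_queues)}


def pin_current_thread(core: int) -> bool:
    """Pin the calling thread to one core; returns False where that is not possible."""
    if not hasattr(os, "sched_setaffinity"):
        return False                          # not available on this platform
    if core not in os.sched_getaffinity(0):
        return False                          # core is outside the allowed set
    os.sched_setaffinity(0, {core})
    return True


# Example topology (assumed): two nodes with four cores each, NIC on node 1.
plan = plan_queue_cores({0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}, nic_node=1, num_queues=2)
```

Each worker thread would then call pin_current_thread with the core assigned to its queue before it starts polling.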

4. Batching & Vectorization

Process packets in batches (32-64 at a time) to amortize per-call costs such as ring index updates and wake-up system calls. Batches also give the CPU a chance to use SIMD instructions for packet processing.
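
The effect of batching can be shown with a simple cost model: a fixed per-call overhead is divided across the batch, while the per-packet work stays constant. The numbers used below (640 ns per call, 5 ns per packet) are made-up values for illustration, not measurements.

```python
from itertools import islice


def batches(packets, batch_size: int = 64):
    """Yield lists of up to batch_size packets from any iterable."""
    it = iter(packets)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def amortized_ns(per_call_ns: float, per_packet_ns: float, batch_size: int) -> float:
    """Average cost per packet when one call handles batch_size packets."""
    return per_call_ns / batch_size + per_packet_ns


sizes = [len(b) for b in batches(range(150), 64)]   # 150 packets in batches of 64
unbatched = amortized_ns(640, 5, 1)                 # one call per packet
batched = amortized_ns(640, 5, 64)                  # one call per 64 packets
```

With these assumed costs, the average drops from 645 ns to 15 ns per packet, which is why the fixed overhead matters far more than the per-packet work at small batch sizes.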

5. CPU Efficiency

Relay budgets roughly 100M pps per CPU core with AF_XDP, which leaves about 10 ns of processing time per packet. It scales to 2B pps across 20+ cores, keeping per-core throughput manageable and power-efficient.
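
The per-core figure translates into a per-packet time budget, which is a useful sanity check when sizing a deployment. The sketch below assumes a 3 GHz clock purely for illustration; the function names are hypothetical.

```python
import math


def per_packet_budget(pps_per_core: float, clock_hz: float = 3.0e9):
    """Return (nanoseconds, CPU cycles) available per packet on one core."""
    ns = 1e9 / pps_per_core
    cycles = clock_hz / pps_per_core
    return ns, cycles


def cores_needed(total_pps: float, pps_per_core: float) -> int:
    """Smallest whole number of cores that covers the total packet rate."""
    return math.ceil(total_pps / pps_per_core)


budget_ns, budget_cycles = per_packet_budget(100e6)   # 100M pps per core
cores = cores_needed(2e9, 100e6)                      # 2B pps in total
```

At 100M pps per core the budget works out to 10 ns, or 30 cycles at the assumed 3 GHz, and 2B pps needs 20 such cores. Every per-packet operation has to fit inside that window.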

Real-World Performance Comparison

Let's compare three approaches for processing network packets:

Approach	Throughput	Latency (p99)	CPU Cores
Traditional Linux Stack	~10M pps	~100µs	8-16
DPDK (User-space driver)	~1B pps	~10µs	6-10
AF_XDP (Optimized)	~2B+ pps	~1µs	4-8

* Performance depends on packet size, hardware, and workload complexity. Numbers are representative of modern systems (Intel Xeon, 100G NICs).
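
Packet size matters because it caps how many packets a link can carry at all. On Ethernet, every frame occupies an extra 20 bytes on the wire (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap), so the maximum packet rate of a link follows from simple arithmetic. The sketch below applies that to the 100G NICs mentioned above; the function names are chosen for illustration.

```python
import math

WIRE_OVERHEAD_BYTES = 20   # preamble (7) + start-of-frame delimiter (1) + inter-frame gap (12)


def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Maximum frames per second on an Ethernet link for a given frame size."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)


def ports_needed(target_pps: float, link_bps: float, frame_bytes: int) -> int:
    """Number of links required to carry target_pps at a given frame size."""
    return math.ceil(target_pps / line_rate_pps(link_bps, frame_bytes))


min_frame_mpps = line_rate_pps(100e9, 64) / 1e6    # minimum-size 64-byte frames
ports_small = ports_needed(2e9, 100e9, 64)
ports_large = ports_needed(2e9, 100e9, 1518)       # full-size 1518-byte frames
```

A single 100G port tops out at roughly 148.8M minimum-size frames per second, so 2B pps of such frames needs at least 14 ports, and far more if the frames are larger.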

Relay's Implementation: XDP + AF_XDP Architecture

At Stantia Labs, Relay leverages both XDP and AF_XDP to achieve its 2 billion packets per second capability:

Relay's Packet Processing Pipeline

XDP Layer

Early filtering & routing decisions

Drop invalid packets, detect routing destination, apply DDoS protection

AF_XDP RX

User-space packet reception

Receive filtered packets directly from NIC, zero-copy

Processing

Relay message handling

Decode MQTT/AMQP/gRPC, apply logic, maintain state, route to subscribers

AF_XDP TX

User-space packet transmission

Write response packets directly to NIC TX ring, zero-copy

This end-to-end zero-copy pipeline, combined with careful CPU affinity, NUMA optimization, and lock-free data structures, is what allows Relay to sustain 2 billion packets per second on modern hardware.

Challenges & Trade-offs

XDP and AF_XDP are powerful but come with considerations:

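NIC and Driver Support

XDP support lives in the NIC driver, and not all drivers support every XDP mode. Mellanox/NVIDIA, Intel, and Broadcom cards have varying levels of support. AF_XDP works on most modern NICs, but zero-copy mode needs driver support; other drivers fall back to a slower copy mode.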

Kernel Version Requirements

XDP support evolved significantly after its introduction in Linux 4.8, and AF_XDP first appeared in Linux 4.18. Linux 5.0+ is recommended for production use of XDP, and AF_XDP is best on 5.2+.
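
A deployment script can check the running kernel against these recommendations before enabling the fast path. The sketch below parses the release string and compares it with a minimum version; the function names are hypothetical. Distribution kernels often backport features, so treat the version number as a hint rather than proof that a feature is present or absent.

```python
import platform
import re


def parse_kernel_release(release: str) -> tuple:
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(release: str, minimum: tuple = (5, 2)) -> bool:
    """True if the release is at least the given (major, minor) version."""
    return parse_kernel_release(release) >= minimum


def running_kernel_ok() -> bool:
    """Check the kernel this process is running on against the 5.2 recommendation."""
    return meets_minimum(platform.release())
```

For example, meets_minimum("4.18.0-553.el8") is False while meets_minimum("5.15.0-91-generic") is True.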

Operational Complexity

eBPF programs require careful debugging. Relay abstracts this complexity away, but understanding the stack is valuable for optimization.

The Future: XDP Adoption

XDP and AF_XDP are becoming industry standards:

• Cloud providers are deploying AF_XDP in their data centers for networking, load balancing, and DDoS protection.
• 5G infrastructure leverages XDP for packet processing in NFV (Network Function Virtualization).
• Edge computing uses AF_XDP to maximize throughput with minimal power consumption.
• IoT gateways adopt XDP for efficient filtering and forwarding of sensor data.

As Linux continues to improve kernel networking capabilities, expect even more optimization opportunities. eBPF itself is becoming a fundamental computing model in the kernel—not just for networking, but for observability, security, and tracing.

Conclusion: The Path to 2 Billion Packets Per Second

Traditional approaches to packet processing, whether POSIX sockets or raw sockets with classic Berkeley Packet Filter (BPF) programs, become bottlenecks at billion-packet-per-second scales. XDP and AF_XDP represent a paradigm shift: moving packet processing as close to the hardware as possible while maintaining the flexibility of a general-purpose operating system.

Relay's ability to handle 2 billion packets per second isn't magic—it's the result of:

✓ Leveraging kernel-space XDP for early filtering and intelligent routing decisions
✓ Using AF_XDP for zero-copy packet reception and transmission
✓ Eliminating all unnecessary memory copies and context switches
✓ Optimizing for modern CPU architectures (NUMA, vector instructions)
✓ Maintaining lock-free, wait-free data structures throughout

If you're building systems that demand extreme performance—IoT platforms, financial infrastructure, real-time communication networks—understanding XDP and AF_XDP is essential. And if you want to leverage these capabilities without becoming a kernel networking expert, Relay is built from the ground up to make these capabilities accessible to your applications.

Ready to Build with Relay?

Relay's XDP and AF_XDP optimizations are transparent to your application—just send and receive packets at 2 billion per second.
