io_uring, Async, and Thread-Per-Core: Scaling to Millions of Clients at Minimal Cost

Published

January 1, 2026

Stantia Labs Team

The Cost Challenge: Scaling to Millions of Clients

Running a real-time communication platform that connects millions of clients is a mathematical problem: each connected client requires memory, CPU cycles, and I/O handling. Traditional approaches to connection management—one thread per connection, blocking I/O, context switching—become economically unsustainable at scale.

Consider the economics:

•Thread overhead: Each OS thread reserves ~1-2 MB of memory for stack and metadata. One million threads = 1-2 TB reserved just for stacks.
•Context switching: With thousands of runnable threads, the CPU spends a large share of its time switching between them rather than doing useful work.
•Cache thrashing: Each thread switch pollutes the CPU caches with another thread's working set. At massive scale, you're constantly reloading instruction and data caches.
•Lock contention: Multiple threads competing for locks on shared data structures cause CPU stalls and poor scalability.

The result? A traditional architecture might require 100+ servers (at $10K/month each = $1M/month) to handle what a properly optimized platform can do on 5-10 servers. This is why cost-efficient scaling is not just an engineering problem—it's a business imperative.
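To make the stack arithmetic concrete, here is a minimal sketch in Python. The function name `stack_memory_tb` is illustrative, and the 1-2 MB per-thread figure is the one quoted above. Linux commits stack pages lazily, so resident usage is often lower; treat the result as reserved memory, not a measured footprint.

```python
def stack_memory_tb(threads: int, stack_mb: float) -> float:
    """Memory reserved for thread stacks, in decimal terabytes."""
    mb_per_tb = 1_000_000
    return threads * stack_mb / mb_per_tb


# One million threads at the 1 MB and 2 MB ends of the quoted range.
for stack_mb in (1, 2):
    reserved = stack_memory_tb(1_000_000, stack_mb)
    print(f"{stack_mb} MB stacks -> {reserved:.1f} TB reserved")
```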

Enter io_uring: Asynchronous I/O Reimagined

io_uring is a modern Linux subsystem (introduced in 5.1, but production-ready in 5.2+) that fundamentally changes how applications handle I/O. Instead of blocking syscalls or complex polling mechanisms, io_uring uses a pair of ring buffers shared between the application and the kernel: one for submitting requests and one for receiving completions.

How io_uring Works

The core innovation is simplicity through shared memory:

The io_uring Model

Step 1: Application submits work. It writes entries to the SQ (Submission Queue) in shared memory, with no syscall per operation.

Step 2: Kernel processes asynchronously. The kernel reads the SQ and executes the batched operations.

Step 3: Results appear in the CQ. The kernel fills the Completion Queue with results, which the application can read without a syscall.

Step 4: Application harvests results. A single syscall (or pure polling) is enough to harvest thousands of completions.
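The four steps can be modelled in a few lines of Python. This is a toy, in-process model and not the real kernel interface: `ToyRing`, `Sqe`, `Cqe`, and `enter_calls` are hypothetical names chosen for illustration, and the "kernel" here completes every request immediately with result 0.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sqe:
    """Simplified submission entry: what to do, plus a caller-chosen tag."""
    op: str
    user_data: int


@dataclass
class Cqe:
    """Simplified completion entry: the tag from the request, plus a result."""
    user_data: int
    result: int


class ToyRing:
    """In-process stand-in for a submission/completion queue pair."""

    def __init__(self) -> None:
        self.sq: deque = deque()
        self.cq: deque = deque()
        self.enter_calls = 0  # counts simulated kernel entries

    def prepare(self, op: str, user_data: int) -> None:
        # Step 1: queue work in "shared memory"; no kernel entry happens here.
        self.sq.append(Sqe(op, user_data))

    def submit(self) -> int:
        # One simulated kernel entry covers everything queued so far.
        self.enter_calls += 1
        submitted = len(self.sq)
        while self.sq:
            sqe = self.sq.popleft()  # Step 2: the "kernel" consumes the SQ
            self.cq.append(Cqe(sqe.user_data, result=0))  # Step 3: post result
        return submitted

    def reap(self) -> list:
        # Step 4: harvest every available completion in one pass.
        done = list(self.cq)
        self.cq.clear()
        return done


ring = ToyRing()
for conn_id in range(1000):
    ring.prepare("recv", conn_id)
submitted = ring.submit()
completions = ring.reap()
print(submitted, len(completions), ring.enter_calls)  # prints: 1000 1000 1
```

In the real interface the kernel entry is the `io_uring_enter` syscall, and completions may arrive in any order, which is why each request carries a `user_data` tag that is echoed back in its completion.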

The magic: no syscall overhead per operation. Traditional `read()` and `write()` syscalls cost ~100-200 nanoseconds each in entry and exit overhead alone. With io_uring, you batch hundreds of I/O operations behind a single syscall, so that fixed cost is amortized to the order of a nanosecond per operation.
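The amortization is plain division, sketched below. The function name is illustrative, the 200 ns figure is the upper end of the range quoted above, and the model covers only the fixed syscall entry cost, not the kernel work each operation still requires.

```python
def amortized_syscall_ns(syscall_ns: float, batch_size: int) -> float:
    """Fixed syscall cost spread evenly over a batch of operations."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return syscall_ns / batch_size


for batch in (1, 100, 1000):
    per_op = amortized_syscall_ns(200, batch)
    print(f"batch of {batch:>4}: {per_op:g} ns of syscall overhead per op")
```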

io_uring vs Traditional I/O

Aspect	Traditional (epoll)	io_uring
Syscall/op	~1 (read/write)	~0.001 (batched)
Latency (p99)	~1-5µs	~100-500ns
Throughput/core	~100K ops/sec	~10M ops/sec
CPU efficiency	~50-60%	~80-90%
Ops/watt	~1K	~10K

Async/Await: Programming Model for Scale

io_uring enables efficient I/O, but the programming model matters enormously. Modern async/await (as found in languages like Python, Rust, and JavaScript) allows developers to write sequential-looking code that actually runs concurrently, without explicit threading.

Why does this matter for cost?

Minimal Stack Memory

An async task can consume as little as ~200 bytes of RAM, not 2 MB. One million async tasks = 200 MB, not 2 TB. That's roughly 10,000x less per-task memory.

Cooperative Scheduling

Tasks yield at await points, not randomly. No context switching surprises. CPU caches stay hot. Predictable performance.
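A small asyncio sketch shows what "yield at await points" means in practice. The worker names and the shared `log` list are illustrative; `asyncio.sleep(0)` is used purely as an explicit yield point.

```python
import asyncio


async def worker(name: str, log: list) -> None:
    for step in range(2):
        # Everything between awaits runs without being interrupted
        # by another task on the same event loop.
        log.append(f"{name}:{step}")
        await asyncio.sleep(0)  # explicit yield point: other tasks may run now


async def main() -> list:
    log: list = []
    await asyncio.gather(worker("a", log), worker("b", log))
    return log


if __name__ == "__main__":
    print(asyncio.run(main()))  # ['a:0', 'b:0', 'a:1', 'b:1']
```

The interleaving is fully determined by where the awaits sit, which is what makes the behavior predictable compared with preemptive thread scheduling.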

Lock-Free Patterns

With proper async library design, you avoid locks entirely. Actors, channels, and message passing replace shared-memory concurrency.
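As one possible shape for the actor and channel pattern, the sketch below gives a single task exclusive ownership of some state and has every other task talk to it through an `asyncio.Queue`. The names `counter_actor`, `client`, and `run` are hypothetical, and using `None` as a shutdown message is just a convention chosen for this example.

```python
import asyncio


async def counter_actor(inbox: asyncio.Queue, totals: dict) -> None:
    # Only this task ever touches `totals`, so no lock is needed.
    while True:
        msg = await inbox.get()
        if msg is None:  # shutdown signal
            break
        client_id, amount = msg
        totals[client_id] = totals.get(client_id, 0) + amount


async def client(inbox: asyncio.Queue, client_id: int, messages: int) -> None:
    # Clients never mutate shared state; they only send messages.
    for _ in range(messages):
        await inbox.put((client_id, 1))


async def run(clients: int = 100, messages: int = 10) -> dict:
    inbox: asyncio.Queue = asyncio.Queue()
    totals: dict = {}
    actor = asyncio.create_task(counter_actor(inbox, totals))
    await asyncio.gather(*(client(inbox, i, messages) for i in range(clients)))
    await inbox.put(None)
    await actor
    return totals


if __name__ == "__main__":
    result = asyncio.run(run())
    print(len(result), sum(result.values()))
```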

The cost impact: a system handling 1M concurrent connections with async can run on 5-10 cores, where the same workload with threads would need 100+ cores.

Thread-Per-Core: The Missing Piece

Here's where the true magic happens. Instead of creating thousands of threads competing for CPU time, thread-per-core architecture runs exactly one thread per CPU core, so application threads never have to be switched in and out to share a core.

The Thread-Per-Core Model

With N CPU cores, you create exactly N threads—one pinned to each core. Each thread:

✓Runs an event loop (async reactor) handling multiple connections
✓Uses io_uring to batch I/O operations from all its connections
✓Never blocks—if one connection stalls, others continue
✓Never context switches—stays on its core, keeps caches hot
✓Uses NUMA-local memory for its connections
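The structure described in this list can be sketched in Python as one pinned thread per usable CPU, each running its own event loop. This is only a structural illustration under stated assumptions: `usable_cpus`, `core_worker`, and `start_workers` are hypothetical names, pinning relies on `os.sched_setaffinity` (Linux only, skipped elsewhere), and CPython's global interpreter lock means these threads would not actually run Python code in parallel.

```python
import asyncio
import os
import threading


def usable_cpus() -> list:
    """CPUs this process may run on (falls back to a plain count off Linux)."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return list(range(os.cpu_count() or 1))


async def event_loop_main(cpu: int) -> int:
    # Placeholder for the per-core reactor that would own its connections.
    await asyncio.sleep(0)
    return cpu


def core_worker(cpu: int, pinned: dict) -> None:
    affinity = None
    if hasattr(os, "sched_setaffinity"):
        try:
            os.sched_setaffinity(0, {cpu})  # pid 0 = the calling thread on Linux
            affinity = os.sched_getaffinity(0)
        except OSError:
            affinity = None  # pinning not permitted in this environment
    asyncio.run(event_loop_main(cpu))  # one event loop per thread
    pinned[cpu] = affinity


def start_workers() -> dict:
    pinned: dict = {}
    threads = [
        threading.Thread(target=core_worker, args=(cpu, pinned))
        for cpu in usable_cpus()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pinned


if __name__ == "__main__":
    for cpu, affinity in sorted(start_workers().items()):
        print(f"worker for cpu {cpu}: affinity={affinity}")
```

A production runtime in a compiled language would follow the same outline: discover the usable cores, start one thread per core, pin it, and give it a private reactor and private connection state.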

Example: 64-Core System

CPU Cores	64
Threads	64 (one per core)
Connections per thread	~15,000 (with async/io_uring)
Total concurrent clients	960,000
Memory footprint	~60-80 GB (vs 1TB+ for thread-per-connection)
Context switches/sec	0 (zero!)
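The figures in this example can be cross-checked with a few lines of arithmetic. The helper names are illustrative and the inputs are the numbers from the example; the per-connection budget that falls out covers everything held per client (buffers, protocol state, and so on), not just the async task itself.

```python
def total_connections(cores: int, connections_per_thread: int) -> int:
    """One thread per core, so capacity is a straight product."""
    return cores * connections_per_thread


def bytes_per_connection(total_gb: float, connections: int) -> float:
    """Average memory budget per connection, using decimal gigabytes."""
    return total_gb * 1e9 / connections


clients = total_connections(64, 15_000)
print(clients)  # 960000
for footprint_gb in (60, 80):
    budget_kb = bytes_per_connection(footprint_gb, clients) / 1000
    print(f"{footprint_gb} GB total -> about {budget_kb:.1f} KB per connection")
```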

The Synergy: io_uring + Async + Thread-Per-Core

These three technologies are not independent—they form a powerful synergy for scaling real-time systems:

1. I/O Efficiency with io_uring

io_uring batches I/O operations across all connections on a thread, reducing syscall overhead to negligible levels.

Example: 15K connections submitting 1 operation each = 1 syscall instead of 15K

2. Async Scheduling Efficiency

Async/await lets a single thread manage thousands of concurrent operations without extra threads or locks.

Example: One async task per connection, 15K tasks on one thread = 3MB memory

3. CPU Affinity Optimization

Thread-per-core + CPU pinning ensures zero context switches and maximum cache utilization.

Example: 64-core system running 64 pinned threads = every core doing useful work, none oversubscribed

Real-World Performance Impact

Consider handling 1 million concurrent clients, each receiving 1 message per second:

Traditional Architecture (Thread-Per-Connection)

⚠1M threads × 2MB stack = 2TB memory
⚠1M context switches/second (at minimum)
⚠~300+ servers (8-core, 32GB RAM each)
⚠~$3M+/month infrastructure cost

Relay Architecture (io_uring + Async + Thread-Per-Core)

✓64-128 threads, ~100-150GB memory
✓0 context switches (zero!)
✓~2-4 servers (64+ cores, 256GB RAM)
✓~$50K-100K/month infrastructure cost

Roughly a 30-60x cost reduction for the same workload. That's not incremental improvement; it's transformational.

Overcoming the Bottlenecks

With io_uring + async + thread-per-core, what becomes the bottleneck? Not CPU—you're running at 85-95% efficiency. The real constraints are:

Memory Bandwidth

Modern CPUs can process operations faster than DRAM can feed them. Solution: optimize for L3 cache locality, use NUMA-aware allocation.

Network Bandwidth

100 Gbps NICs are now common in data centers. Relay uses multiple NICs and RSS (Receive Side Scaling) to distribute load across cores.
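The idea behind RSS is that every packet of a given flow hashes to the same receive queue, and therefore the same core. The sketch below illustrates that property in software. It is not how a NIC does it: hardware RSS typically uses a Toeplitz hash with a configured key, whereas this example substitutes BLAKE2b from Python's standard library, and `core_for_flow` is a hypothetical helper.

```python
import hashlib
import socket
import struct


def core_for_flow(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
                  n_cores: int) -> int:
    """Map an IPv4 4-tuple to a core index; the same flow always gets the same core."""
    key = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!HH", src_port, dst_port)
    )
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % n_cores


# Every packet of one flow lands on one core, so that core's caches
# and connection state never need to be shared with another thread.
first = core_for_flow("10.0.0.1", 40000, "10.0.0.2", 1883, 64)
again = core_for_flow("10.0.0.1", 40000, "10.0.0.2", 1883, 64)
print(first == again)  # True
```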

Disk/Storage I/O

For systems requiring persistence, io_uring excels at batching disk operations. NVMe SSDs with proper queue depth reach GB/sec throughput.
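Why queue depth matters follows from Little's law: operations in flight equal throughput multiplied by latency. The sketch below applies it with illustrative numbers (4 KiB blocks, 100 µs per operation) that are assumptions and not measurements, and it assumes latency stays constant as queue depth grows, which only holds until the device saturates.

```python
def iops(queue_depth: int, latency_us: float) -> float:
    """Little's law: throughput = operations in flight / latency."""
    return queue_depth * 1_000_000 / latency_us


def throughput_gb_per_s(queue_depth: int, latency_us: float,
                        block_bytes: int) -> float:
    """Bandwidth implied by that operation rate, in decimal GB/s."""
    return iops(queue_depth, latency_us) * block_bytes / 1e9


for qd in (1, 32, 128):
    rate = iops(qd, 100)
    bandwidth = throughput_gb_per_s(qd, 100, 4096)
    print(f"queue depth {qd:>3}: {rate:>11,.0f} IOPS, {bandwidth:.2f} GB/s")
```

With one request in flight the same device delivers a small fraction of its bandwidth, which is why batching submissions to keep the queue full matters for storage as much as for networking.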

Implementation Challenges & Solutions

Building systems with these technologies requires addressing real challenges:

Challenge: Kernel Version Support

io_uring matured significantly between Linux 5.2 and 5.10. Solution: target Linux 5.10+ and test thoroughly on deployment targets.
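A startup check along these lines can catch an unsupported deployment target early. This is a minimal sketch: `parse_kernel_version` and `meets_minimum` are hypothetical helpers, and a version comparison is only a coarse gate, since distribution kernels backport features and administrators can restrict io_uring. Probing for the specific operations you need (liburing offers `io_uring_get_probe` for this) is the more reliable test.

```python
import platform
import re


def parse_kernel_version(release: str) -> tuple:
    """Extract (major, minor) from a release string like '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(release: str, minimum: tuple = (5, 10)) -> bool:
    """True if the release is at or above the minimum (major, minor)."""
    return parse_kernel_version(release) >= minimum


if __name__ == "__main__" and platform.system() == "Linux":
    release = platform.release()
    status = "ok" if meets_minimum(release) else "too old for the 5.10+ target"
    print(f"kernel {release}: {status}")
```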

Challenge: Async Library Maturity

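Not all async runtimes are created equal, and not all of them use io_uring or a thread-per-core design by default. Solution: use battle-tested async frameworks (Tokio in Rust, asyncio in Python, etc.) rather than building from scratch, and check how each one maps onto io_uring and core pinning before committing to it.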

Challenge: Debugging Complexity

Thread-per-core with async makes traditional debugging tools less effective. Solution: invest in observability (tracing, metrics, profiling) from the start.
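One inexpensive signal worth collecting from day one is event-loop lag: how much later than requested a timer fires. Because each core runs a single cooperative loop, one handler that blocks stalls every connection on that core, and lag makes that visible. The sketch below uses asyncio; `measure_loop_lag`, `blocking_handler`, and `demo` are illustrative names, and the blocking call is deliberate.

```python
import asyncio
import time


async def measure_loop_lag(interval: float, samples: int) -> list:
    """Record how much longer than `interval` each sleep actually took."""
    lags = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lags.append(max(0.0, time.perf_counter() - start - interval))
    return lags


async def blocking_handler(seconds: float) -> None:
    # Deliberate bug: a synchronous sleep stalls the whole event loop.
    time.sleep(seconds)


async def demo() -> list:
    monitor = asyncio.create_task(measure_loop_lag(0.01, 5))
    await asyncio.sleep(0)        # let the monitor arm its first timer
    await blocking_handler(0.05)  # the loop cannot run timers during this
    return await monitor


if __name__ == "__main__":
    lags = asyncio.run(demo())
    print(f"worst loop lag: {max(lags) * 1000:.1f} ms")
```

Exporting the worst lag per core as a metric gives an early warning of blocking calls long before they show up as client-visible latency.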

Cost Breakdown: Where Savings Come From

Let's quantify the cost savings:

Fewer Servers: 300 → 3 servers	$216K → $2K/month
Lower Power Consumption: 80+ cores busy → 64 cores, efficiency gains	40-50% reduction
Reduced Network: Fewer servers = fewer NIC switching	30-40% reduction
Operational Overhead: 300 servers vs 3 servers	100x easier to manage
Total Cost Reduction	90%+ for the same workload

Relay's Implementation

Relay is architected from first principles around these technologies:

✓io_uring integration: All network and disk I/O goes through io_uring for batched, efficient processing
✓Native async runtime: Relay's core handles thousands of concurrent connections per thread
✓Thread-per-core execution: Relay pins threads to cores, eliminates context switching, maximizes cache efficiency
✓NUMA awareness: Memory allocation respects NUMA topology for optimal latency
✓CPU affinity control: Applications can control thread placement and scheduling

The result: Relay achieves 2 billion packets per second while maintaining sub-millisecond latency, all while consuming minimal CPU and memory resources.

The Business Impact

For companies building real-time communication platforms, the business impact is profound:

✓Lower infrastructure costs: 90% reduction in server spending
✓Better unit economics: Cost per user drops dramatically as you scale
✓Environmental impact: Fewer servers = less power consumption, smaller carbon footprint
✓Operational simplicity: Managing 3 highly optimized servers beats managing 300 generic ones
✓Competitive advantage: Ability to offer lower prices or higher margins than competitors

Looking Forward: The Evolution of System Design

io_uring, async programming, and thread-per-core architecture represent a fundamental shift in how we think about systems design. Rather than adding more resources to handle load, modern systems architecture is about eliminating waste.

Future developments to watch:

•Wider adoption of io_uring_prep_accept_direct: accepting connections straight into io_uring's registered-file table for even faster connection acceptance on high-concurrency servers
•eBPF integration: More sophisticated packet filtering and routing at the kernel level
•Heterogeneous compute: Offloading processing to specialized hardware (SmartNICs, GPUs) via io_uring
•Cross-kernel optimization: Better integration between userspace async runtimes and kernel schedulers

Conclusion: Efficient Scale Is Possible

Running a platform that serves millions of concurrent clients doesn't require massive infrastructure budgets. By combining modern Linux technologies—io_uring for efficient I/O, async/await for scalable concurrency, and thread-per-core for CPU efficiency—you can achieve unprecedented scale with minimal resources.

Relay demonstrates that these aren't theoretical optimizations—they're practical tools that reduce operational cost by 90% while improving performance. For any company building real-time systems, understanding and leveraging these technologies is the difference between a sustainable business and one that loses to better-optimized competitors.

The era of brute-force scaling is ending. The era of intelligent, efficient systems is here.
