io_uring, Async, and Thread-Per-Core: Scaling to Millions of Clients at Minimal Cost
Published: January 1, 2026
By: Stantia Labs Team
Category: Architecture
Read Time: 14 minutes
Discover how modern Linux I/O technologies combine with thread-per-core architecture to enable real-time communication platforms to serve millions of concurrent clients while minimizing infrastructure costs and computational overhead.
The Cost Challenge: Scaling to Millions of Clients
Running a real-time communication platform that connects millions of clients is a mathematical problem: each connected client requires memory, CPU cycles, and I/O handling. Traditional approaches to connection management—one thread per connection, blocking I/O, context switching—become economically unsustainable at scale.
Consider the economics:
- Thread overhead: Each OS thread reserves ~1-2 MB of memory for stack and metadata. One million threads = 1-2 TB of RAM set aside just for stacks.
- Context switching: With thousands of runnable threads, the CPU spends much of its time context switching rather than doing useful work.
- Cache thrashing: Each thread switch can evict the working set from CPU caches. At massive scale, you're constantly reloading instruction and data caches.
- Lock contention: Multiple threads competing for locks on shared data structures cause CPU stalls and poor scalability.
The result? A traditional architecture might require 100+ servers (at $10K/month each = $1M/month) to handle what a properly optimized platform can do on 5-10 servers. This is why cost-efficient scaling is not just an engineering problem—it's a business imperative.
Enter io_uring: Asynchronous I/O Reimagined
io_uring is a modern Linux subsystem (introduced in kernel 5.1 and hardened considerably in the releases that followed) that fundamentally changes how applications handle I/O. Instead of blocking syscalls or complex polling mechanisms, io_uring uses a shared-memory queue interface between application and kernel.
How io_uring Works
The core innovation is simplicity through shared memory:
The io_uring Model
1. Application submits work: write to the SQ (Submission Queue) without syscall overhead
2. Kernel processes asynchronously: the kernel reads the SQ, batches operations, executes them
3. Results appear in the CQ: the Completion Queue is filled with results, no interrupt needed
4. Application polls results: a single syscall (or polling) harvests thousands of completions
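The flow above can be imitated in a few lines of plain Python. This is a toy, in-process model and not real io_uring: no kernel is involved, and the names `ToyRing`, `Sqe`, `Cqe`, `prepare`, `submit_and_wait`, and `reap` are made up for illustration. It only shows the shape of the interaction, where many operations are queued and a single simulated syscall covers the whole batch.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sqe:
    """Submission queue entry: one requested operation (toy stand-in)."""
    user_data: int   # caller-chosen tag, echoed back in the completion
    payload: bytes


@dataclass
class Cqe:
    """Completion queue entry: the result of one operation."""
    user_data: int
    result: int      # here: number of bytes "written"


class ToyRing:
    """In-process imitation of the SQ/CQ pair; no kernel involved."""

    def __init__(self):
        self.sq = deque()
        self.cq = deque()
        self.enter_calls = 0  # counts simulated syscalls

    def prepare(self, user_data, payload):
        # Step 1: queue work in shared memory, no syscall.
        self.sq.append(Sqe(user_data, payload))

    def submit_and_wait(self):
        # Steps 2-3: one simulated syscall hands over the whole batch;
        # the "kernel" drains the SQ and posts one CQE per SQE.
        self.enter_calls += 1
        submitted = len(self.sq)
        while self.sq:
            sqe = self.sq.popleft()
            self.cq.append(Cqe(sqe.user_data, len(sqe.payload)))
        return submitted

    def reap(self):
        # Step 4: harvest every available completion in one pass.
        done = list(self.cq)
        self.cq.clear()
        return done


ring = ToyRing()
for conn_id in range(1000):
    ring.prepare(conn_id, b"ping")
ring.submit_and_wait()
completions = ring.reap()
print(len(completions), "completions for", ring.enter_calls, "simulated syscall")
```

In a real program the two queues are memory-mapped rings shared with the kernel, typically driven through the liburing C library or a language binding for it.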
The magic: no syscall overhead per operation. Traditional `read()` and `write()` syscalls cost ~100-200 nanoseconds each. With io_uring, you batch hundreds of I/O operations into a single syscall, amortizing the cost to roughly 1 nanosecond per operation; batches of a thousand or more bring it down to ~0.1-0.2 nanoseconds.
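The amortization is plain division, and a short sketch makes the dependence on batch size explicit. It takes the ~100-200 ns syscall cost quoted above as given and ignores the per-operation work the kernel still has to do; `amortized_ns` is a hypothetical helper name.

```python
def amortized_ns(syscall_ns: float, batch_size: int) -> float:
    """Per-operation syscall cost when one call covers batch_size operations."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return syscall_ns / batch_size


for batch in (1, 100, 1000):
    low, high = amortized_ns(100, batch), amortized_ns(200, batch)
    print(f"batch={batch:>5}: {low:.2f}-{high:.2f} ns per operation")
# batch=    1: 100.00-200.00 ns per operation
# batch=  100: 1.00-2.00 ns per operation
# batch= 1000: 0.10-0.20 ns per operation
```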
io_uring vs Traditional I/O
| Aspect | Traditional (epoll) | io_uring |
|---|---|---|
| Syscall/op | ~1 (read/write) | ~0.001 (batched) |
| Latency (p99) | ~1-5µs | ~100-500ns |
| Throughput/core | ~100K ops/sec | ~10M ops/sec |
| CPU efficiency | ~50-60% | ~80-90% |
| Ops/watt | ~1K | ~10K |
Async/Await: Programming Model for Scale
io_uring enables efficient I/O, but the programming model matters enormously. Modern async/await (available in languages like Python, Rust, and JavaScript) allows developers to write sequential-looking code that actually runs concurrently, without explicit threading.
Why does this matter for cost?
Minimal Stack Memory
An async task consumes ~200 bytes of RAM, not 2MB. One million async tasks = 200MB, not 2TB. That's roughly 10,000x less memory per connection.
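The arithmetic is easy to check. The sketch below uses the article's own per-thread and per-task figures with decimal units; the real per-task size depends heavily on the runtime and on how much state each connection holds, so treat the result as an order-of-magnitude estimate.

```python
CONNECTIONS = 1_000_000
THREAD_STACK_BYTES = 2 * 10**6   # ~2 MB per OS thread (figure from the text)
ASYNC_TASK_BYTES = 200           # ~200 bytes per async task (figure from the text)

thread_total = CONNECTIONS * THREAD_STACK_BYTES
task_total = CONNECTIONS * ASYNC_TASK_BYTES

print(f"threads: {thread_total / 10**12:.1f} TB")   # threads: 2.0 TB
print(f"tasks:   {task_total / 10**6:.0f} MB")      # tasks:   200 MB
print(f"ratio:   {thread_total // task_total:,}x")  # ratio:   10,000x
```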
Cooperative Scheduling
Tasks yield at await points, not randomly. No context switching surprises. CPU caches stay hot. Predictable performance.
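A small asyncio sketch shows the idea: ten thousand tasks share one OS thread and hand control back only at their `await` points. The names `handle_connection` and `run_connections` are hypothetical, and `asyncio.sleep(0)` stands in for real network I/O.

```python
import asyncio
import threading


async def handle_connection(conn_id, thread_ids):
    # Record which OS thread runs this task.
    thread_ids.add(threading.get_ident())
    # An await point: the task yields here and the loop runs other tasks.
    await asyncio.sleep(0)
    return conn_id


async def run_connections(n_connections):
    thread_ids = set()
    results = await asyncio.gather(
        *(handle_connection(i, thread_ids) for i in range(n_connections))
    )
    return results, thread_ids


results, thread_ids = asyncio.run(run_connections(10_000))
print(len(results), "tasks ran on", len(thread_ids), "OS thread")
```

Because a task is never preempted between await points, any code that runs for a long time without awaiting stalls every other task on that thread.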
Lock-Free Patterns
With proper async library design, you avoid locks entirely. Actors, channels, and message passing replace shared-memory concurrency.
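As a minimal sketch of message passing in place of a shared, lock-protected counter: one "actor" task owns the state and everyone else sends it messages over a queue. This uses Python's `asyncio.Queue`; `counter_actor`, `producer`, and `run_actor_demo` are illustrative names, and `None` is an arbitrary shutdown sentinel chosen for this example.

```python
import asyncio


async def counter_actor(inbox):
    """Sole owner of `total`; other tasks can only send it messages."""
    total = 0
    while True:
        msg = await inbox.get()
        if msg is None:        # sentinel: no more messages
            return total
        total += msg


async def producer(inbox, n):
    for _ in range(n):
        await inbox.put(1)     # send a message instead of touching shared state


async def run_actor_demo():
    inbox = asyncio.Queue()
    actor = asyncio.create_task(counter_actor(inbox))
    # 50 producers, 100 messages each, no lock anywhere.
    await asyncio.gather(*(producer(inbox, 100) for _ in range(50)))
    await inbox.put(None)
    return await actor


actor_total = asyncio.run(run_actor_demo())
print(actor_total)  # 5000
```

The same pattern carries over to a thread-per-core design: each core owns its data outright, and cross-core work travels as messages rather than through locks.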
The cost impact: a system handling 1M concurrent connections with async can run on 5-10 cores, where the same workload with threads would need 100+ cores.
Thread-Per-Core: The Missing Piece
Here's where the true magic happens. Instead of creating thousands of threads competing for CPU time, thread-per-core architecture runs exactly one thread per CPU core, with essentially no context switching between application threads.
The Thread-Per-Core Model
With N CPU cores, you create exactly N threads—one pinned to each core. Each thread:
- ✓ Runs an event loop (async reactor) handling multiple connections
- ✓ Uses io_uring to batch I/O operations from all its connections
- ✓ Never blocks: if one connection stalls, others continue
- ✓ Avoids context switches: stays on its core, keeps caches hot
- ✓ Uses NUMA-local memory for its connections
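The layout can be sketched in Python: one worker thread per available core, each pinned with `os.sched_setaffinity` (Linux only) and running its own event loop. This only illustrates the structure. CPython's global interpreter lock prevents these threads from running Python code in parallel, so a production system would use a compiled runtime. The names `available_cpus`, `event_loop_body`, `worker`, and `run_thread_per_core` are hypothetical.

```python
import asyncio
import os
import threading


def available_cpus():
    # sched_getaffinity is Linux-specific; fall back to a plain count elsewhere.
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return list(range(os.cpu_count() or 1))


async def event_loop_body(cpu):
    # Placeholder for the per-core reactor that would own its connections.
    await asyncio.sleep(0)
    return f"worker on cpu {cpu}"


def worker(cpu, results):
    if hasattr(os, "sched_setaffinity"):
        try:
            # pid 0 means "the calling thread" on Linux, so only this worker is pinned.
            os.sched_setaffinity(0, {cpu})
        except OSError:
            pass  # pinning is best effort in this sketch (e.g. restricted sandboxes)
    results[cpu] = asyncio.run(event_loop_body(cpu))


def run_thread_per_core():
    results = {}
    threads = [
        threading.Thread(target=worker, args=(cpu, results))
        for cpu in available_cpus()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


pinned = run_thread_per_core()
print(len(pinned), "workers, one per available core")
```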
Example: 64-Core System
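One way to read this example, assuming the one million connections used elsewhere in the article are spread evenly across 64 cores: each core ends up owning about 15K connections. The sketch below uses a simple modulo rule to assign connections to cores. `shard_for` is a hypothetical helper, and a real system might instead let the kernel or the NIC pick the core.

```python
def shard_for(conn_id: int, n_cores: int) -> int:
    """Map a connection to the core that owns it for its whole lifetime."""
    return conn_id % n_cores


N_CORES = 64
N_CONNECTIONS = 1_000_000
TASK_BYTES = 200  # per-task figure quoted earlier in the article

per_core = [0] * N_CORES
for conn_id in range(N_CONNECTIONS):
    per_core[shard_for(conn_id, N_CORES)] += 1

print(min(per_core), max(per_core))  # 15625 15625
print(max(per_core) * TASK_BYTES / 10**6, "MB of task state per core")  # 3.125 MB of task state per core
```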
The Synergy: io_uring + Async + Thread-Per-Core
These three technologies are not independent—they form a powerful synergy for scaling real-time systems:
1. I/O Efficiency with io_uring
io_uring batches I/O operations across all connections on a thread, reducing syscall overhead to negligible levels.
Example: 15K connections submitting 1 operation each = 1 syscall instead of 15K
2. Async Scheduling Efficiency
Async/await lets a single thread manage thousands of concurrent operations without additional threads or locks.
Example: One async task per connection, 15K tasks on one thread = 3MB memory
3. CPU Affinity Optimization
Thread-per-core + CPU pinning keeps context switches between application threads near zero and cache utilization high.
Example: 64-core system running 64 threads = every core busy with application work instead of scheduler overhead
Real-World Performance Impact
Consider handling 1 million concurrent clients, each receiving 1 message per second:
Traditional Architecture (Thread-Per-Connection)
- ⚠ 1M threads × 2MB stack = 2TB memory
- ⚠ 1M context switches/second (at minimum)
- ⚠ ~300+ servers (8-core, 32GB RAM each)
- ⚠ ~$3M+/month infrastructure cost
Relay Architecture (io_uring + Async + Thread-Per-Core)
- ✓ 64-128 threads, ~100-150GB memory
- ✓ Near-zero context switches between application threads
- ✓ ~2-4 servers (64+ cores, 256GB RAM)
- ✓ ~$50K-100K/month infrastructure cost
A 30-60x cost reduction for the same workload ($3M+ versus $50K-100K per month). That's not incremental improvement; it's transformational.
Overcoming the Bottlenecks
With io_uring + async + thread-per-core, what becomes the bottleneck? Not CPU—you're running at 85-95% efficiency. The real constraints are:
Memory Bandwidth
Modern CPUs can process operations faster than DRAM can feed them. Solution: optimize for L3 cache locality, use NUMA-aware allocation.
Network Bandwidth
100 Gbps NICs are now common in data centers. Relay uses multiple NICs and RSS (Receive Side Scaling) to distribute load across cores.
Disk/Storage I/O
For systems requiring persistence, io_uring excels at batching disk operations. NVMe SSDs with proper queue depth reach GB/sec throughput.
Implementation Challenges & Solutions
Building systems with these technologies requires addressing real challenges:
Challenge: Kernel Version Support
io_uring matured significantly between Linux 5.2 and 5.10. Solution: target Linux 5.10+ and test thoroughly on deployment targets.
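A startup check along these lines can catch an unsupported host early. It is a sketch that only compares the kernel release string against the 5.10 floor suggested above; `parse_kernel_release` and `kernel_is_supported` are hypothetical helpers. A passing check does not prove io_uring is usable, since some container sandboxes block the io_uring syscalls regardless of kernel version.

```python
import platform
import re

MIN_KERNEL = (5, 10)  # floor suggested in the text


def parse_kernel_release(release):
    """Extract (major, minor) from strings such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_is_supported(release, minimum=MIN_KERNEL):
    # Tuple comparison orders by major first, then minor.
    return parse_kernel_release(release) >= minimum


if platform.system() == "Linux":
    release = platform.release()
    print(release, "supported" if kernel_is_supported(release) else "too old")
```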
Challenge: Async Library Maturity
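Not all async runtimes are created equal. Solution: use battle-tested async frameworks (Tokio in Rust, asyncio in Python, etc.) rather than building from scratch. Check the runtime's execution model first: Tokio's default scheduler is work-stealing and built on epoll, so a thread-per-core io_uring design in Rust usually means a purpose-built runtime such as Glommio or Monoio, or the tokio-uring crate.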
Challenge: Debugging Complexity
Thread-per-core with async makes traditional debugging tools less effective. Solution: invest in observability (tracing, metrics, profiling) from the start.
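One inexpensive metric worth having from day one is event-loop lag: how late a timer fires compared with when it was due. A rising value means some task is hogging its core without yielding. The sketch below measures it with asyncio. `measure_loop_lag`, `busy_task`, and `lag_demo` are illustrative names, and the 50 ms busy loop exists only to provoke visible lag.

```python
import asyncio
import time


async def measure_loop_lag(samples, interval):
    """Return how late each wake-up was, in seconds."""
    lags = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lags.append(max(0.0, time.perf_counter() - start - interval))
    return lags


async def busy_task():
    # Deliberately hogs the loop without awaiting, to create visible lag.
    deadline = time.perf_counter() + 0.05
    while time.perf_counter() < deadline:
        pass


async def lag_demo():
    probe = asyncio.create_task(measure_loop_lag(samples=3, interval=0.01))
    await asyncio.sleep(0)   # let the probe start its first sleep
    await busy_task()        # blocks the loop for about 50 ms
    return await probe


lags = asyncio.run(lag_demo())
print([f"{lag * 1000:.1f} ms" for lag in lags])
```

The first sample absorbs the blocked period, while later samples fall back toward zero. Exported per core, this number points directly at the thread that is misbehaving.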
Cost Breakdown: Where Savings Come From
Let's quantify the cost savings:
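A rough way to do so, using only the figures quoted in the comparison above ($10K per server per month, ~300 traditional servers, and $50K-100K per month for the optimized deployment), is the arithmetic below. `cost_ratio` and `savings_percent` are hypothetical helper names, and actual prices vary widely by provider and hardware.

```python
SERVER_COST_PER_MONTH = 10_000   # $/server/month, figure from the introduction
TRADITIONAL_SERVERS = 300        # thread-per-connection estimate from the text
OPTIMIZED_COST_RANGE = (50_000, 100_000)  # $/month range from the text


def cost_ratio(traditional, optimized):
    """How many times cheaper the optimized deployment is."""
    return traditional / optimized


def savings_percent(traditional, optimized):
    """Share of the traditional bill that is no longer spent."""
    return (1 - optimized / traditional) * 100


traditional_cost = TRADITIONAL_SERVERS * SERVER_COST_PER_MONTH
print(f"traditional: ${traditional_cost:,}/month")  # traditional: $3,000,000/month

for optimized in OPTIMIZED_COST_RANGE:
    ratio = cost_ratio(traditional_cost, optimized)
    saved = savings_percent(traditional_cost, optimized)
    print(f"optimized ${optimized:,}/month: {ratio:.0f}x cheaper, {saved:.1f}% saved")
# optimized $50,000/month: 60x cheaper, 98.3% saved
# optimized $100,000/month: 30x cheaper, 96.7% saved
```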
Relay's Implementation
Relay is architected from first principles around these technologies:
- ✓ io_uring integration: All network and disk I/O goes through io_uring for batched, efficient processing
- ✓ Native async runtime: Relay's core handles thousands of concurrent connections per thread
- ✓ Thread-per-core execution: Relay pins threads to cores, eliminates context switching, maximizes cache efficiency
- ✓ NUMA awareness: Memory allocation respects NUMA topology for optimal latency
- ✓ CPU affinity control: Applications can control thread placement and scheduling
The result: Relay achieves 2 billion packets per second while maintaining sub-millisecond latency, all while consuming minimal CPU and memory resources.
The Business Impact
For companies building real-time communication platforms, the business impact is profound:
- ✓ Lower infrastructure costs: 90% reduction in server spending
- ✓ Better unit economics: Cost per user drops dramatically as you scale
- ✓ Environmental impact: Fewer servers = less power consumption, smaller carbon footprint
- ✓ Operational simplicity: Managing 3 highly optimized servers beats managing 300 generic ones
- ✓ Competitive advantage: Ability to offer lower prices or higher margins than competitors
Looking Forward: The Evolution of System Design
io_uring, async programming, and thread-per-core architecture represent a fundamental shift in how we think about systems design. Rather than adding more resources to handle load, modern systems architecture is about eliminating waste.
Future developments to watch:
- Direct-descriptor accept (io_uring_prep_accept_direct): already exposed by liburing on recent kernels; wider adoption promises even faster connection acceptance for high-concurrency servers
- eBPF integration: More sophisticated packet filtering and routing at the kernel level
- Heterogeneous compute: Offloading processing to specialized hardware (SmartNICs, GPUs) via io_uring
- Cross-kernel optimization: Better integration between userspace async runtimes and kernel schedulers
Conclusion: Efficient Scale Is Possible
Running a platform that serves millions of concurrent clients doesn't require massive infrastructure budgets. By combining modern Linux technologies—io_uring for efficient I/O, async/await for scalable concurrency, and thread-per-core for CPU efficiency—you can achieve unprecedented scale with minimal resources.
Relay demonstrates that these aren't theoretical optimizations—they're practical tools that reduce operational cost by 90% while improving performance. For any company building real-time systems, understanding and leveraging these technologies is the difference between a sustainable business and one that loses to better-optimized competitors.
The era of brute-force scaling is ending. The era of intelligent, efficient systems is here.
Ready to Scale Efficiently?
Relay's architecture handles millions of concurrent clients while minimizing cost and resource consumption. Deploy a real-time communication platform that works with your budget, not against it.