Scaling QUIC Ingress: eBPF Socket Steering for HTTP/3 Connection Migration
When a remote edge node drops off the network for a few hundred milliseconds and comes back with a new IP address, a naive UDP proxy deployment will silently kill the session that was supposed to survive exactly that kind of disruption. This article looks at why that happens, and how eBPF-based socket steering at the kernel layer fixes it — using the real mechanisms Linux and Cloudflare actually ship, not just the theory.
Why QUIC, and why it breaks naive load balancing
Real-time telemetry — industrial sensor networks, autonomous-vehicle sensor fusion, mobile edge workloads — has largely moved off TCP and onto HTTP/3’s QUIC transport. TCP’s strict in-order delivery means a single lost packet stalls every stream multiplexed on that connection (head-of-line blocking). QUIC avoids this by running its own loss recovery and stream multiplexing directly over UDP, so a dropped packet on one stream doesn’t stall the others.
QUIC also supports 0-RTT — but it’s worth being precise about what that means: 0-RTT lets a returning client resume a previous session and send application data immediately, using a pre-shared key from an earlier handshake. A brand-new client still needs a full 1-RTT TLS 1.3 handshake; 0-RTT is a resumption optimization, not a property of every QUIC handshake.
The feature that matters most for this article is connection migration. A TCP connection is pinned to a 4-tuple — source IP, source port, destination IP, destination port. Change any of those (a phone switching from Wi-Fi to 5G, a robot roaming between access points) and the connection is gone; the client has to renegotiate from scratch. QUIC decouples the session from the network path by identifying it with a Connection ID (CID) instead of the 4-tuple. Per RFC 9000, a CID can be up to 20 bytes and is opaque to the peer — the server picks it, hands it to the client, and can keep recognizing that client even after its IP and port change mid-session.
That’s a huge win for a single client talking to a single server. It becomes a problem the moment the server side is actually a fleet of load-balanced worker processes.
The 4-tuple hash breaks under migration
Reverse proxies like NGINX, Envoy, and HAProxy scale across CPU cores by running multiple workers (processes or threads), each with its own socket bound to the same port via SO_REUSEPORT. For TCP, this is easy: the kernel completes the handshake, accept() hands the connection to exactly one worker, and the kernel keeps delivering that connection’s data to the same worker for the life of the connection.
UDP has no handshake and no persistent kernel-side connection state, so SO_REUSEPORT falls back to a much simpler mechanism: for every incoming datagram, the kernel hashes the 4-tuple and picks a socket from the reuseport group by that hash. As long as the 4-tuple stays fixed, every packet lands on the same worker.
The instant a client’s IP changes — the entire point of QUIC connection migration — the 4-tuple changes, the hash changes, and the kernel routes the packet to a different worker that has never seen this client, holds no TLS keys for it, and has no choice but to drop the packet. QUIC’s headline feature is neutralized by a load-balancing mechanism that predates it.
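The failure mode can be sketched in a few lines of Python. This is a toy model, not the kernel's code: SHA-256 stands in for the kernel's internal hash so the example is deterministic, and `NUM_WORKERS`, `pick_worker_by_4tuple`, and the addresses are all hypothetical names chosen for illustration.

```python
import hashlib
import ipaddress
import struct

NUM_WORKERS = 4  # hypothetical size of the SO_REUSEPORT group


def pick_worker_by_4tuple(src_ip, src_port, dst_ip, dst_port,
                          num_workers=NUM_WORKERS):
    """Model of default reuseport selection: hash the 4-tuple, index the group.

    The real kernel uses its own hash function; SHA-256 is only a
    deterministic stand-in for this illustration.
    """
    key = (
        ipaddress.ip_address(src_ip).packed
        + struct.pack("!H", src_port)
        + ipaddress.ip_address(dst_ip).packed
        + struct.pack("!H", dst_port)
    )
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_workers


# Stable 4-tuple: every datagram of the session lands on the same worker.
before = pick_worker_by_4tuple("198.51.100.7", 40000, "203.0.113.1", 443)
again = pick_worker_by_4tuple("198.51.100.7", 40000, "203.0.113.1", 443)

# The client migrates through 64 different source addresses. Nothing ties
# the new hash to the old one, so the packets scatter across the group.
workers_after_migration = {
    pick_worker_by_4tuple(f"192.0.2.{host}", 40000, "203.0.113.1", 443)
    for host in range(1, 65)
}
```

With four workers, a migrated client keeps its original worker only by luck, roughly one time in four. Every other migration ends with the packet on a worker that has no state for the connection.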
Teaching the kernel about QUIC with eBPF
Rather than hard-coding QUIC awareness into the kernel, Linux lets you attach a custom eBPF program to a reuseport group so that the program, not the default hash, makes the socket-selection decision. This capability is BPF_PROG_TYPE_SK_REUSEPORT, added by Martin KaFai Lau in Linux 4.19, and it pairs with the bpf_sk_select_reuseport() helper, which assigns an incoming packet to a specific socket in a BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map (and, since Linux 5.8, SOCKHASH/SOCKMAP maps as well). The program returns a verdict rather than an index: if it returns SK_PASS without having selected a socket (for example, because the helper rejected an out-of-range index), the kernel silently falls back to the default 4-tuple hash, so the mechanism degrades safely. Returning SK_DROP discards the packet instead.
This lets you replace “hash the 4-tuple” with “read the QUIC Connection ID out of the packet and route on that instead” — entirely in kernel space, before the packet ever reaches a userspace socket buffer.
The steering pipeline
- Worker embeds its identity in the CID. During the very first handshake packet, before any migration has happened, the default hash is harmless: there’s no established state yet to misroute. The worker that lands the handshake (say, Worker 2) generates the Server Connection ID it hands back to the client, and encodes its own worker index somewhere inside those bytes alongside cryptographic entropy.
- The eBPF program parses the QUIC header in-kernel. On every subsequent packet, the sk_reuseport program inspects the raw payload via struct sk_reuseport_md, distinguishes QUIC’s long header (handshake packets) from the short header (steady-state 1-RTT packets), and extracts the Destination Connection ID field.
- Worker ID lookup, not a hash-table scan. Because the worker ID is embedded directly in the CID rather than requiring a lookup in a table mapping millions of CIDs to sockets, the eBPF program just masks out the relevant bits to recover the integer.
- bpf_sk_select_reuseport() does the routing. The extracted worker ID is used as the index into the socket array, and the kernel delivers the datagram straight to that worker’s socket, regardless of what the client’s current IP address is.
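The steering steps can be modeled in userspace Python to make the parsing logic concrete. This is a sketch under stated assumptions, not an eBPF program: the CID layout (octet 0 reserved, octet 1 holding the worker index in the clear, 8 bytes total), the constants, and every function name are hypothetical. One detail worth noting is that a short header carries no CID length on the wire, so the parser has to know the length the workers issue; the sketch assumes a fixed `CID_LEN`.

```python
import hashlib
import os

NUM_WORKERS = 4         # hypothetical reuseport group size
CID_LEN = 8             # hypothetical fixed length of server-issued CIDs
LONG_HEADER_BIT = 0x80  # top bit of the first byte: 1 = long header


def issue_cid(worker_index):
    """Worker side: build a CID that carries this worker's index.

    Hypothetical layout: octet 0 reserved, octet 1 = worker index,
    remaining octets random. A real deployment would encrypt or
    obfuscate the index instead of exposing it in the clear.
    """
    return bytes([0x00, worker_index]) + os.urandom(CID_LEN - 2)


def extract_dcid(datagram):
    """Return the Destination Connection ID, or None if the packet is truncated."""
    if len(datagram) < 1:
        return None
    if datagram[0] & LONG_HEADER_BIT:
        # Long header: flags(1) | version(4) | DCID length(1) | DCID ...
        if len(datagram) < 6:
            return None
        dcid_len = datagram[5]
        dcid = datagram[6:6 + dcid_len]
        return dcid if len(dcid) == dcid_len else None
    # Short header: flags(1) | DCID ... with no length field on the wire.
    dcid = datagram[1:1 + CID_LEN]
    return dcid if len(dcid) == CID_LEN else None


def fallback_hash(src_addr):
    """Stand-in for the kernel's default 4-tuple hash."""
    digest = hashlib.sha256(repr(src_addr).encode()).digest()
    return digest[0] % NUM_WORKERS


def steer(datagram, src_addr):
    """Pick a worker from the CID; fall back to the address hash otherwise."""
    dcid = extract_dcid(datagram)
    if dcid is not None and len(dcid) >= 2 and dcid[1] < NUM_WORKERS:
        return dcid[1]
    return fallback_hash(src_addr)


def short_header_packet(dcid, payload=b"\x00" * 16):
    # 0x40: short header form with the fixed bit set
    return bytes([0x40]) + dcid + payload


def long_header_packet(dcid, scid=b"", payload=b"\x00" * 16):
    # 0xC0: long header form with the fixed bit set, followed by version 1
    return (bytes([0xC0]) + (1).to_bytes(4, "big")
            + bytes([len(dcid)]) + dcid
            + bytes([len(scid)]) + scid + payload)


cid = issue_cid(2)
packet = short_header_packet(cid)
worker_before = steer(packet, ("198.51.100.7", 40000))
worker_after = steer(packet, ("192.0.2.55", 51234))  # client has migrated
```

Both calls resolve to worker 2 because the decision depends only on the CID bytes. The sketch treats any CID whose second octet is out of range as unroutable and falls back to the hash. A production program would also handle a client's first Initial packet explicitly, since its Destination Connection ID is chosen by the client rather than issued by a worker.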
This “encode routing info directly in the CID” idea isn’t a bespoke trick. It’s exactly the problem the IETF’s draft-ietf-quic-load-balancers (“QUIC-LB”) spec set out to standardize, with a defined octet layout: a reserved first octet for config-rotation and self-encoded-length bits, the server/worker ID starting at the second octet, and then an encrypted or obfuscated nonce. Its status matters, though: QUIC-LB never advanced past Internet-Draft status and is now listed as expired/inactive by the IETF datatracker, so it never became an RFC. That doesn’t make the technique fictional, since plenty of real load balancers and proxies implement their own variant of the same idea, but it is a well-documented, unofficial convention rather than an adopted standard.
eBPF isn’t a general-purpose scripting environment
It’s worth being concrete about why the eBPF program has to be this narrow and cheap, rather than hand-waving about “restrictions.” The in-kernel verifier statically proves a program will terminate and stay memory-safe before it’s ever allowed to load:
- Each program is capped at 512 bytes of stack space.
- Loops of any kind were rejected outright until Linux 5.3 added support for “bounded loops” that the verifier can prove will terminate; before that, loops had to be unrolled at compile time. Unbounded loops are still rejected.
- The verifier enforces an overall complexity budget (on the order of a million simulated instruction-states per program), and a hot-path program exhausts it quickly if it contains unbounded-looking loops or excessive branching.
None of this is exotic for a header-parsing task like CID extraction, but it does explain why the CID-encoding scheme is deliberately simple (a few bytes, masked out directly) rather than something that needs a real data structure to resolve.
Handling restarts: what actually ships in production
“Socket generations” are not just a design idea here. Cloudflare has shipped this approach as an open-source project called udpgrm (UDP Graceful Restart Marshal), described in a May 2025 engineering blog post, and it’s worth walking through because it resolves the upgrade problem more rigorously than a hand-rolled generation counter would.
The core issue: when you restart or reload a QUIC-terminating proxy, you get two sets of SO_REUSEPORT sockets in the same group — one from the old binary, draining its existing connections, and one from the new binary, accepting new ones. A naive CID-based eBPF router would just extract “Worker 2” and blindly hand the packet to new Worker 2, breaking every in-flight connection that belonged to the old Worker 2.
udpgrm’s model:
- A socket generation is the set of reuseport-group sockets belonging to one logical instance of the server (i.e., one deployment).
- A working generation pointer tells the eBPF program which generation should receive brand-new flows.
- A flow dissector decides, per packet, whether it belongs to a new flow (for QUIC, an Initial packet) or an established one, and if established, which specific socket generation originally owns it — even if that’s an older, draining generation.
- Flow state and socket references live in a SOCKHASH map that the daemon populates and keeps in sync from userspace, decoupling that bookkeeping from the application itself.
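The generation model above can be illustrated with a small Python toy. It is not udpgrm's code or API: the `ReuseportGroup` class, its methods, the `(generation, worker_index)` cookie shape, and the choice to drop packets for retired generations are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ReuseportGroup:
    """Toy model of one reuseport group holding several socket generations."""
    # generation number -> list of socket names (stand-ins for real sockets)
    generations: dict = field(default_factory=dict)
    # generation that receives brand-new flows
    working_generation: int = 0

    def register(self, generation, sockets):
        self.generations[generation] = list(sockets)

    def retire(self, generation):
        self.generations.pop(generation, None)

    def route(self, is_new_flow, cookie=None, flow_hash=0):
        """Return the socket that should receive a packet, or None to drop it.

        cookie is a hypothetical (generation, worker_index) pair recovered
        from the packet, for example from the QUIC Connection ID.
        """
        if is_new_flow:
            # New flows always go to the current working generation.
            sockets = self.generations.get(self.working_generation, [])
            if not sockets:
                return None
            return sockets[flow_hash % len(sockets)]
        if cookie is None:
            return None
        # Established flows stay with the generation that owns them,
        # even if that generation is old and draining.
        generation, worker_index = cookie
        sockets = self.generations.get(generation)
        if sockets is None or worker_index >= len(sockets):
            return None
        return sockets[worker_index]


group = ReuseportGroup()
group.register(0, ["old-w0", "old-w1", "old-w2"])

# A new binary starts: it registers its sockets and becomes the working generation.
group.register(1, ["new-w0", "new-w1", "new-w2"])
group.working_generation = 1

in_flight = group.route(is_new_flow=False, cookie=(0, 2))  # stays on old Worker 2
fresh = group.route(is_new_flow=True, flow_hash=7)         # 7 % 3 selects index 1
```

The key point is that the worker index alone is ambiguous during a restart. Carrying the generation alongside it lets old Worker 2 keep its in-flight connections while new Worker 2 only sees flows that started after the switch.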
udpgrm ships three built-in dissector modes plus a “bespoke” template: a FLOW dissector that tracks a fixed-size 4-tuple hash table (useful for protocols with no native connection identifier), a CBPF cookie-based dissector where the routing identifier is embedded directly in the packet — exactly the QUIC-CID scheme described above, which Cloudflare calls a “udpgrm cookie” — and a NOOP mode for stateless protocols like DNS that don’t need any of this. The daemon integrates with systemd via a small setsockopt/getsockopt-based control protocol and a “decoy” process trick to work around systemd’s assumption that only one instance of a service runs at a time.
The practical takeaway for anyone building this themselves: don’t reinvent generation tracking and flow dissection from scratch unless you have a very specific reason to — udpgrm (or a similar production-tested reuseport-eBPF daemon) already solves the graceful-restart half of this problem, which is genuinely the harder half to get right.
Where this leaves enterprise HTTP/3 ingress
The shift from TCP to QUIC solves a real, longstanding transport-layer problem — but it exposes an assumption baked deep into how Linux load-balances UDP: that a “flow” is defined by its 4-tuple. QUIC explicitly rejects that assumption, and the kernel’s default SO_REUSEPORT behavior hasn’t caught up on its own. BPF_PROG_TYPE_SK_REUSEPORT and bpf_sk_select_reuseport() are the real, current mechanisms for closing that gap; QUIC-LB is the (now-lapsed) standardization attempt for the CID encoding convention; and udpgrm is a concrete, open-source example of what a production-grade version of the full pipeline — migration-aware routing and zero-downtime restarts — actually looks like today.
Sources
- RFC 9000, “QUIC: A UDP-Based Multiplexed and Secure Transport” (IETF)
- draft-ietf-quic-load-balancers, “QUIC-LB: Generating Routable QUIC Connection IDs” (expired Internet-Draft)
- Cloudflare Blog, “QUIC restarts, slow problems: udpgrm to the rescue”, Marek Majkowski, May 7, 2025
- udpgrm GitHub repository
- eBPF Docs, Program Type BPF_PROG_TYPE_SK_REUSEPORT
- eBPF Docs, Helper Function bpf_sk_select_reuseport
- eBPF Docs, Loops
- Linux kernel commit, “bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT”, Martin KaFai Lau
- Vincent Bernat, “Using eBPF to load-balance traffic across UDP sockets with Go”
Changelog
Metadata removed:
- Stripped the SEO-style title/hook-line pairing and the unverified trailing “presentation” blurb that read like leftover CMS metadata rather than sourced content.
Corrections:
- Clarified that QUIC’s 0-RTT applies to session resumption with a pre-shared key, not to every handshake; a first-time connection still requires a full 1-RTT handshake.
- Corrected the CID worker-ID encoding example: the original draft said the worker ID sits in “the first two bytes” of the CID. The actual convention this mirrors (IETF QUIC-LB) reserves the first octet for config-rotation/length bits and starts the server/worker ID at the second octet.
- Added the accurate standardization status of that CID-encoding scheme: draft-ietf-quic-load-balancers never progressed to RFC and is currently listed as an expired Internet-Draft. It’s a well-known convention, not an adopted standard.
- Replaced the vague “similar to Cloudflare’s udpgrm framework” aside with a verified, detailed description of udpgrm’s actual mechanics (working generation, flow dissectors, SOCKHASH-based state, systemd integration), sourced directly from Cloudflare’s engineering blog and the project’s public README.
- Confirmed and kept: BPF_PROG_TYPE_SK_REUSEPORT, bpf_sk_select_reuseport(), BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the 20-byte QUIC CID limit, and the general 4-tuple-hash-breaks-under-migration mechanism, all verified against RFC 9000, the eBPF documentation project, and the original 2018 kernel commit.
Extensions:
- Added sourced, concrete detail on eBPF verifier constraints (512-byte stack limit, pre-5.3 loop rejection, complexity budget) to explain why the steering program has to stay minimal, rather than asserting it without support.
- Added a full section on udpgrm’s dissector modes (FLOW, CBPF, NOOP, BESPOKE) and its systemd integration approach, since this is the actual production implementation of the “socket generations” concept the original draft only gestured at.
- Added a Sources section with direct links to every primary source used (RFC, IETF draft, Cloudflare engineering blog, eBPF docs, kernel commit).