WireGuard UDP Load Balancing
The Nature of WireGuard (UDP + Stateful Crypto)
WireGuard operates over UDP and establishes stateful cryptographic sessions between peers (https://www.wireguard.com/protocol/).
Key properties relevant to high availability (HA):
WireGuard uses UDP (connectionless transport).
Session keys are negotiated via handshake and stored in memory.
Transport packets are encrypted using per-session symmetric keys.
A transport packet can only be decrypted by the instance that holds the active session keys.
What this means for HA
If a client completes a handshake with Gateway A:
Only Gateway A can decrypt subsequent packets.
If packets are routed to Gateway B, they will be dropped.
Recovery requires the client to initiate a new handshake.
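The drop behavior above can be modeled in a few lines. This is a minimal Python sketch under simplifying assumptions:
- Each gateway keeps its sessions in a dictionary keyed by the 32-bit receiver index that every WireGuard transport data message carries.
- Real decryption is replaced by a lookup.

The header layout used here (message type 4, three reserved bytes, little-endian receiver index, little-endian 64-bit counter) follows the WireGuard protocol description. The Gateway class and its result strings are hypothetical.

```python
import struct

# Message type 4 is a WireGuard transport data message.
TRANSPORT_DATA = 4
# 16-byte header plus the 16-byte Poly1305 tag; an empty payload is a keepalive.
MIN_TRANSPORT_LEN = 32


def parse_transport_header(packet: bytes):
    """Return (receiver_index, counter) for a transport data message, or None."""
    if len(packet) < MIN_TRANSPORT_LEN or packet[0] != TRANSPORT_DATA:
        return None
    # Layout: type (u8) | reserved (3 bytes) | receiver index (u32 LE) | counter (u64 LE)
    receiver_index, counter = struct.unpack_from("<IQ", packet, 4)
    return receiver_index, counter


class Gateway:
    """Hypothetical gateway model: sessions live only in this instance's memory."""

    def __init__(self, name: str):
        self.name = name
        self.sessions = {}  # receiver_index -> placeholder for session keys

    def complete_handshake(self, receiver_index: int) -> None:
        # Real WireGuard derives keys from the Noise handshake; a string stands in here.
        self.sessions[receiver_index] = f"keys@{self.name}"

    def receive(self, packet: bytes) -> str:
        header = parse_transport_header(packet)
        if header is None:
            return "dropped: not a transport packet"
        receiver_index, _counter = header
        if receiver_index not in self.sessions:
            # No session for this index: the packet is discarded, nothing is sent back.
            return "dropped: unknown session"
        return "decrypted"


if __name__ == "__main__":
    gw_a, gw_b = Gateway("A"), Gateway("B")
    gw_a.complete_handshake(0x1234)
    # A keepalive-sized (32-byte) transport packet addressed to receiver index 0x1234.
    pkt = struct.pack("<B3xIQ", TRANSPORT_DATA, 0x1234, 0) + bytes(16)
    print(gw_a.receive(pkt))  # decrypted
    print(gw_b.receive(pkt))  # dropped: unknown session
```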
Therefore, simple round-robin load balancing is not sufficient.
You need:
Health-aware load balancing
Deterministic upstream selection (sticky routing)
Fast backend ejection on failure
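A quick way to see why per-packet round-robin is not enough is to count outcomes. The following Python sketch uses a hypothetical helper that is not part of any load balancer. It counts how many of one client's datagrams reach the gateway that holds its session when every datagram goes to the next gateway in turn. With N gateways, only about 1/N of the packets can be decrypted.

```python
from itertools import cycle


def round_robin_delivery(num_packets: int, gateways: list, session_owner: str) -> float:
    """Share of one client's packets that land on the gateway holding its session
    when every datagram is sent to the next gateway in round-robin order."""
    next_gateway = cycle(gateways)
    delivered = sum(1 for _ in range(num_packets) if next(next_gateway) == session_owner)
    return delivered / num_packets


if __name__ == "__main__":
    # Two gateways: every second packet reaches a gateway without the session keys.
    print(round_robin_delivery(100, ["A", "B"], session_owner="A"))  # 0.5
    # Three gateways: only about one packet in three can be decrypted.
    print(round_robin_delivery(99, ["A", "B", "C"], session_owner="A"))
```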
Why a Load Balancer with Health Checks Is Required
WireGuard uses UDP, which is connectionless and provides no built-in failure detection. As a result:
UDP has no connection state, acknowledgements, or reset signals.
If a gateway crashes, the load balancer does not automatically detect it.
The load balancer may continue forwarding traffic to a dead gateway.
This results in silent packet drops and delayed failover.
To prevent this, a Layer 4 load balancer must:
Perform active health checks against each gateway.
Immediately mark failed gateways as unhealthy.
Re-route traffic to healthy instances.
Without properly configured health checks, high availability cannot be reliably achieved in a multi-gateway WireGuard setup.
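The health-check behavior described above can be sketched as a small state machine. This Python example is illustrative only:
- BackendHealth, healthy_backends, and worst_case_detection_seconds are hypothetical names.
- The threshold semantics (a number of consecutive failures or successes before the state flips) are an assumption. They are modeled on how active health checkers such as Envoy's typically work.
- The detection bound is deliberately conservative and ignores jitter.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    """Hypothetical active health-check state for one gateway."""

    unhealthy_threshold: int = 2  # consecutive failed probes before ejection
    healthy_threshold: int = 1    # consecutive passing probes before re-admission
    healthy: bool = True
    consecutive_failures: int = 0
    consecutive_successes: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the (possibly updated) health state."""
        if probe_ok:
            self.consecutive_failures = 0
            self.consecutive_successes += 1
            if not self.healthy and self.consecutive_successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self.consecutive_successes = 0
            self.consecutive_failures += 1
            if self.healthy and self.consecutive_failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy


def healthy_backends(states: dict) -> list:
    """Only backends currently marked healthy are eligible for traffic."""
    return [name for name, state in states.items() if state.healthy]


def worst_case_detection_seconds(interval: float, timeout: float, unhealthy_threshold: int) -> float:
    """Conservative bound, ignoring jitter: every one of the failing probes may
    wait a full interval and then the full timeout before counting as failed."""
    return unhealthy_threshold * (interval + timeout)


if __name__ == "__main__":
    states = {"A": BackendHealth(), "B": BackendHealth()}
    for ok in (False, False):
        states["A"].record(ok)
    print(healthy_backends(states))  # ['B']
    print(worst_case_detection_seconds(interval=1.0, timeout=1.0, unhealthy_threshold=2))  # 4.0
```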
Why Sticky Sessions Are Mandatory
WireGuard sessions are bound to a specific gateway instance.
Once a client completes a handshake:
Subsequent transport packets must reach the same gateway.
If one client's datagrams are spread across gateways (per-packet balancing), decryption fails on every gateway that does not hold the session.
This results in silent packet drops.
Therefore, the load balancer must use deterministic routing:
Hash-based routing (e.g., source IP hashing)
Ring-hash or consistent-hash load balancing
Never per-packet load balancing
Sticky routing ensures:
All packets from a client reach the same gateway
Failover only occurs when the backend is unhealthy
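Deterministic routing with limited disruption on failover can be illustrated with a minimal consistent-hash ring. This Python sketch assumes SHA-256 as the hash function and 100 virtual nodes per backend. Envoy's RING_HASH uses its own hash function (xxHash by default) and ring sizing, so real placements will differ.

The sketch demonstrates the two properties that matter here:
- The same source IP always maps to the same gateway.
- Removing an unhealthy gateway only moves the clients that were already on it.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # 64-bit value taken from SHA-256 (an assumption; not Envoy's hash function).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class RingHash:
    """Minimal consistent-hash ring with virtual nodes (a sketch, not Envoy's code)."""

    def __init__(self, backends, vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, backend) points
        for backend in backends:
            self.add(backend)

    def add(self, backend: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{backend}#{i}"), backend))

    def remove(self, backend: str) -> None:
        # Called when health checks eject a backend; other points stay where they are.
        self._ring = [point for point in self._ring if point[1] != backend]

    def pick(self, source_ip: str) -> str:
        """Walk clockwise from the client's hash to the first backend point."""
        if not self._ring:
            raise LookupError("no healthy backends")
        idx = bisect.bisect_left(self._ring, (_hash(source_ip), ""))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = RingHash(["gw-a", "gw-b", "gw-c"])
    clients = [f"203.0.113.{i}" for i in range(1, 201)]
    before = {ip: ring.pick(ip) for ip in clients}
    ring.remove("gw-b")  # gw-b failed its health checks
    after = {ip: ring.pick(ip) for ip in clients}
    moved = [ip for ip in clients if before[ip] != after[ip]]
    # Only clients that were on gw-b change gateway; everyone else keeps their session.
    print(all(before[ip] == "gw-b" for ip in moved))  # True
```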
Recommended Load Balancer: Envoy
We recommend Envoy for UDP load balancing.
Reasons:
Native UDP proxy support
Health checks with fine-grained timing controls
Proper backend ejection on failure
Production-grade L4 behavior
Recommended configuration characteristics:
lb_policy: RING_HASH
hash_policies: source_ip: true (set on the UDP proxy filter)
Aggressive health check intervals
Low unhealthy_threshold values for fast failover
Example configuration can be found here.
Envoy ensures:
Traffic is consistently routed to one backend
Dead backends are removed quickly
Failover completes within a time bounded by the health-check interval and thresholds
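The routing characteristics above can be expressed as Envoy configuration. The Python sketch below builds a static listener and cluster as plain dictionaries and prints them as JSON; Envoy accepts JSON as well as YAML.

Field names follow the Envoy v3 API: a UdpProxyConfig with a matcher route and hash_policies, plus a RING_HASH cluster. The listener and cluster names, port, and gateway addresses are hypothetical. Validate the output against the Envoy version you deploy.

Health checks are omitted here; they would be attached under the cluster's health_checks field.

```python
import json

WG_PORT = 51820  # conventional WireGuard port; a placeholder for your deployment


def udp_proxy_listener(cluster: str, port: int = WG_PORT) -> dict:
    """UDP listener that hashes on the client's source IP (Envoy v3 field names)."""
    return {
        "name": "wireguard_udp",
        "address": {"socket_address": {"protocol": "UDP", "address": "0.0.0.0", "port_value": port}},
        "listener_filters": [{
            "name": "envoy.filters.udp_listener.udp_proxy",
            "typed_config": {
                "@type": "type.googleapis.com/envoy.extensions.filters.udp.udp_proxy.v3.UdpProxyConfig",
                "stat_prefix": "wireguard",
                # Route every datagram to a single cluster...
                "matcher": {"on_no_match": {"action": {
                    "name": "route",
                    "typed_config": {
                        "@type": "type.googleapis.com/envoy.extensions.filters.udp.udp_proxy.v3.Route",
                        "cluster": cluster,
                    },
                }}},
                # ...and choose the upstream host from a hash of the client's source IP.
                "hash_policies": [{"source_ip": True}],
            },
        }],
    }


def gateway_cluster(name: str, gateway_ips: list, port: int = WG_PORT) -> dict:
    """Ring-hash cluster of WireGuard gateways (health_checks attached separately)."""
    return {
        "name": name,
        "type": "STATIC",
        "lb_policy": "RING_HASH",
        "load_assignment": {
            "cluster_name": name,
            "endpoints": [{"lb_endpoints": [
                {"endpoint": {"address": {"socket_address": {"address": ip, "port_value": port}}}}
                for ip in gateway_ips
            ]}],
        },
    }


if __name__ == "__main__":
    # Hypothetical gateway addresses.
    config = {"static_resources": {
        "listeners": [udp_proxy_listener("wireguard_gateways")],
        "clusters": [gateway_cluster("wireguard_gateways", ["10.0.0.11", "10.0.0.12"])],
    }}
    print(json.dumps(config, indent=2))
```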
Operational Gotchas and Lessons Learned
During implementation and testing, several subtle issues were discovered.
This section documents them to prevent future confusion.
Envoy health-check timing is state-dependent
Envoy has multiple health-check timing parameters:
interval
no_traffic_interval
no_traffic_healthy_interval
unhealthy_interval
healthy_edge_interval
unhealthy_edge_interval
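If only interval is configured, the other parameters keep their defaults, so the effective check rate depends on the host's health state and on whether the cluster has seen traffic. For example, no_traffic_interval defaults to 60 seconds.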
Symptoms:
Backend appears healthy even when its container is dead.
Failover occurs only after long delays.
Solution: Explicitly configure all relevant health-check intervals to the same low value.
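All the timing parameters can be pinned in one place. This Python sketch builds a single Envoy v3 health check entry in JSON form. Every interval field is set to the same value, and the unhealthy threshold is one probe. The function name and the /healthz path are hypothetical. The sketch assumes each gateway exposes an HTTP health endpoint; adjust the checker type and thresholds to your environment.

```python
import json

# The six timing fields listed above, all pinned to one value so the effective
# probe rate no longer depends on traffic or host state.
TIMING_FIELDS = (
    "interval",
    "no_traffic_interval",
    "no_traffic_healthy_interval",
    "unhealthy_interval",
    "healthy_edge_interval",
    "unhealthy_edge_interval",
)


def _duration(seconds: float) -> str:
    # Durations in Envoy's JSON/YAML config are strings such as "1s" or "0.5s".
    return f"{seconds:g}s"


def pinned_health_check(interval_s: float, timeout_s: float, path: str = "/healthz") -> dict:
    """Envoy v3 HealthCheck entry (JSON form) with every interval set to interval_s."""
    check = {field: _duration(interval_s) for field in TIMING_FIELDS}
    check.update({
        "timeout": _duration(timeout_s),
        "unhealthy_threshold": 1,  # eject after one failed probe
        "healthy_threshold": 1,
        # Hypothetical HTTP health endpoint exposed by each gateway.
        "http_health_check": {"path": path},
    })
    return check


if __name__ == "__main__":
    print(json.dumps({"health_checks": [pinned_health_check(1, 0.5)]}, indent=2))
```

The WireGuard port itself does not speak HTTP, so the probe usually needs to target a different port. Envoy's per-endpoint health_check_config.port_value can redirect health checks to that port.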
NGINX Is Not Suitable for This Use Case
The open-source NGINX stream/UDP proxy:
Does not implement active health checks, and therefore cannot reliably detect a failed UDP backend
Does not switch to another upstream unless it sees a hard socket error
As a result, traffic is never rerouted to a healthy backend after a failure.
WireGuard May Require a Keepalive Interval to Fully Recover After Failover
Even after the load balancer successfully redirects traffic to a healthy gateway, the tunnel may not resume immediately.
Key points:
When a gateway fails, the existing WireGuard session becomes invalid.
The new gateway cannot decrypt transport packets from the old session and drops them silently.
A new handshake must be initiated by the client.
WireGuard only attempts a new handshake when triggered by traffic or by the PersistentKeepalive interval.
As a result:
Tunnel recovery may take up to the configured keepalive interval.
Shorter keepalive intervals result in faster failover recovery.
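On the client side, the keepalive is set per peer. This Python sketch renders a hypothetical [Peer] section whose Endpoint is the load balancer address. PublicKey, Endpoint, AllowedIPs, and PersistentKeepalive are standard WireGuard configuration keys; the key and address values are placeholders. The range check mirrors the accepted 1 to 65535 second range for a non-zero keepalive.

```python
def render_client_peer(public_key: str, lb_endpoint: str, allowed_ips: str, keepalive_s: int) -> str:
    """Render a [Peer] section that points the client at the load balancer."""
    if not 1 <= keepalive_s <= 65535:
        raise ValueError("PersistentKeepalive must be between 1 and 65535 seconds")
    return "\n".join([
        "[Peer]",
        f"PublicKey = {public_key}",
        f"Endpoint = {lb_endpoint}",
        f"AllowedIPs = {allowed_ips}",
        # A shorter interval means the client sends traffic sooner after failover,
        # giving it an earlier opportunity to re-handshake with the new gateway.
        f"PersistentKeepalive = {keepalive_s}",
    ])


if __name__ == "__main__":
    # Placeholder key and hypothetical load balancer address.
    print(render_client_peer("<gateway-public-key>", "lb.example.com:51820", "10.8.0.0/24", 5))
```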
Expected Failover Behavior
With proper configuration:
Gateway A crashes.
The load balancer's health check marks it unhealthy within the configured detection window (health-check interval and unhealthy threshold).
Traffic is immediately routed to Gateway B.
The client initiates a new WireGuard handshake.
Tunnel resumes operation.
Typical recovery time: 3-30 seconds, depending on the Envoy health-check settings and the client keepalive interval.
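The recovery window can be estimated from the same parameters. This Python sketch uses the document's model: health-check detection, followed by a re-handshake triggered no later than the next keepalive. It is only a rough bound. It assumes each probe can take the full timeout and ignores jitter and handshake round-trip time. The function name is hypothetical.

```python
def estimate_recovery_seconds(hc_interval: float, hc_timeout: float,
                              unhealthy_threshold: int, keepalive: float) -> tuple:
    """Rough (best, worst) recovery window in seconds.

    Detection: at least (threshold - 1) intervals plus one timeout, at most
    threshold * (interval + timeout). Re-handshake: the client's next keepalive
    may follow immediately or up to one keepalive interval later.
    """
    best = (unhealthy_threshold - 1) * hc_interval + hc_timeout
    worst = unhealthy_threshold * (hc_interval + hc_timeout) + keepalive
    return best, worst


if __name__ == "__main__":
    # 1 s probes, 1 s timeout, eject after 2 failures, 25 s client keepalive.
    print(estimate_recovery_seconds(1.0, 1.0, 2, 25.0))  # (2.0, 29.0)
```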
Summary
High availability for WireGuard gateways requires:
Layer 4 UDP load balancing
Sticky sessions (consistent hashing)
Aggressive health checks
Correct dataplane-aware health endpoint
Envoy with ring hash and properly configured health checks is the recommended solution.
The key principle is:
WireGuard sessions are stateful and bound to a specific instance. HA must respect that constraint.
With correct configuration, gateway failover can be fast and reliable.