2.0
High availability
Requirements
Users should be able to run multiple instances of all components:
Core
Proxy
Gateway
In case of failure of any component, other instances should take over the work
Recovery should be as fast as possible in terms of dropped requests/jobs. In a perfect scenario no requests or jobs are dropped, even if the components processing them fail mid-processing.
High-level overview of Defguard workflows

Considered options
1. Active-active
Every core connects to every proxy and gateway (full mesh at the app layer).
Any core can initiate actions and handle requests from proxies/gateways.
DB-based queue for async jobs (e.g. MFA disconnect).

Pros
No need for failure detection and leader election.
“True HA”, nothing special has to happen when one of the components dies (connections already exist).
Scales “core CPU” if the control plane is truly parallelizable.
Avoids the core load-balancing issue (selecting the active core when handling UI requests).
Cons
Requires a job queue implementation to avoid duplicating scheduled work (e.g. MFA disconnects).
Requires additional routing logic in the Proxy and Gateway components so that the same request is not sent to multiple cores.
Requires the core to be fully stateless.
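The duplicate-work concern above boils down to an atomic claim on a shared job table. Below is a minimal in-memory sketch in Rust (Defguard's implementation language); the `JobQueue`/`claim` names are illustrative, and a real deployment would claim rows in the database instead (e.g. Postgres `SELECT ... FOR UPDATE SKIP LOCKED`), not use an in-process mutex.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

// In-memory stand-in for the jobs table; not the actual Defguard schema.
#[derive(Debug, PartialEq)]
enum JobState {
    Pending,
    Claimed(String),
}

struct JobQueue {
    jobs: Mutex<HashMap<u32, JobState>>,
}

impl JobQueue {
    fn new(ids: &[u32]) -> Self {
        Self {
            jobs: Mutex::new(ids.iter().map(|&id| (id, JobState::Pending)).collect()),
        }
    }

    // Atomically claim one pending job for `worker`; the lock guarantees
    // that no two cores ever claim the same job.
    fn claim(&self, worker: &str) -> Option<u32> {
        let mut jobs = self.jobs.lock().unwrap();
        let id = jobs
            .iter()
            .find(|(_, state)| **state == JobState::Pending)
            .map(|(id, _)| *id)?;
        jobs.insert(id, JobState::Claimed(worker.to_string()));
        Some(id)
    }
}

fn main() {
    let queue = Arc::new(JobQueue::new(&[1, 2, 3, 4, 5, 6]));
    // Three "cores" race to drain the queue concurrently.
    let handles: Vec<_> = ["core-1", "core-2", "core-3"]
        .iter()
        .map(|name| {
            let q = Arc::clone(&queue);
            let name = name.to_string();
            thread::spawn(move || {
                let mut claimed = Vec::new();
                while let Some(id) = q.claim(&name) {
                    claimed.push(id); // process the job (e.g. MFA disconnect) here
                }
                claimed
            })
        })
        .collect();

    let total: usize = handles.into_iter().map(|h| h.join().unwrap().len()).sum();
    // Every job is processed exactly once across all cores.
    assert_eq!(total, 6);
}
```

The key property is that claiming is a single atomic step, so "which core runs the job" is decided by the shared store rather than by coordination between cores.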
2. Active-Failover
Run N cores.
Exactly one leader stores a “lease row” in the DB, establishes gRPC connections to all gateways and proxies, and performs all write/control actions.
The leader periodically renews the heartbeat timestamp on the lease row.
Standby cores monitor the heartbeat; if it is not renewed in time, they race to acquire the lease.
Once a core successfully stores the lease row, it establishes gRPC connections and acts as the new leader.

Pros
Cleanest correctness story (no split-brain control plane).
Simplifies gRPC connection topology: each proxy/gateway sees one controlling core.
Cons
This is a “failover” rather than a true “HA” solution
Requires robust failure detection, leader election (DB lease / k8s lease) and careful failover to avoid two leaders during partitions.
Doesn’t scale control-plane (core) throughput horizontally; only gives us failover (maybe that’s all we need for now?).
How do we route HTTP requests to the active core only? (Standby instances can reply with error codes to health checks, so the load balancer routes requests only to the leader.)
What if only the core-proxy connection fails while the core itself is still healthy?
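The lease flow described in option 2 can be sketched with a single compare-and-set step. The Rust model below uses an in-memory mutex as a stand-in for a conditional DB `UPDATE`; the `LeaseStore` name and the TTL value are illustrative assumptions, not the actual implementation.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

// In-memory stand-in for the single-row lease table.
struct Lease {
    holder: Option<String>,
    renewed_at: Instant,
}

struct LeaseStore {
    row: Mutex<Lease>,
    ttl: Duration,
}

impl LeaseStore {
    fn new(ttl: Duration) -> Self {
        Self {
            row: Mutex::new(Lease {
                holder: None,
                renewed_at: Instant::now(),
            }),
            ttl,
        }
    }

    // Acquire (or renew) the lease; succeeds only if the lease is free,
    // expired, or already held by this core. The mutex models the atomicity
    // a DB transaction would provide, preventing two simultaneous leaders.
    fn try_acquire(&self, core: &str) -> bool {
        let mut row = self.row.lock().unwrap();
        let expired = row.renewed_at.elapsed() >= self.ttl;
        if row.holder.is_none() || expired || row.holder.as_deref() == Some(core) {
            row.holder = Some(core.to_string());
            row.renewed_at = Instant::now();
            true
        } else {
            false
        }
    }
}

fn main() {
    let store = LeaseStore::new(Duration::from_millis(50));
    assert!(store.try_acquire("core-1")); // first core becomes leader
    assert!(!store.try_acquire("core-2")); // standby cannot steal a live lease
    std::thread::sleep(Duration::from_millis(60)); // leader stops heartbeating
    assert!(store.try_acquire("core-2")); // standby takes over after expiry
}
```

The same `try_acquire` call serves both as the leader's heartbeat and as the standbys' takeover attempt, which keeps the election logic in one place. Choosing the TTL is the usual trade-off: too short risks spurious failovers under DB latency, too long delays recovery.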
3. Vertically connected components
Each core connects to one proxy and one gateway.
Deal-breaking issue: for mobile-assisted MFA the core has to be able to route responses to all proxies, not just the one the request originated from. This is not possible with this approach.
4. Decoupling the components via external queue
Introduce an external message bus/queue (e.g. Redis or RabbitMQ) into the stack.
Refactor the components to use the queue instead of gRPC.
Deal-breaking issues:
Major rewrite of all communication
Increases deployment complexity
Decision
Option 1, the active-active approach, is selected.
Rationale
The active-active approach provides true high availability rather than availability through failover. With multiple core instances operating concurrently, the system is self-correcting and can continue to function during partial failures without waiting for explicit leader detection or role transitions.
Although active-active operation introduces the need to coordinate scheduled and background tasks, this coordination is more constrained and predictable than the failure detection, leader election, and fencing mechanisms required by an active-passive design.