# 2.0

## High availability

### Requirements

1. Users should be able to run multiple instances of all components.
   1. Core
   2. Edge
   3. Gateway
2. If any component instance fails, the remaining instances should take over its work.
3. Recovery should minimize the number of dropped requests and jobs. Ideally, no requests or jobs are dropped, even if the component processing them fails mid-processing.

### High-level overview of Defguard workflows

<figure><img src="/files/Rf6AM1JhPTRoE301QKiW" alt=""><figcaption></figcaption></figure>

### Considered options

#### 1. Active-active

* Every Core connects to every Edge and Gateway (full mesh at the app layer).
* Any Core can initiate actions and handle requests from Edges/Gateways.
* A DB-based queue for async jobs, such as MFA disconnects.

<figure><img src="/files/KdTiLzSliC2HufjcGgQ1" alt=""><figcaption></figcaption></figure>

**Pros**

* No need for failure detection and leader election.
* "True HA": nothing special has to happen when one of the components dies, because the connections already exist.
* Scales Core CPU if the control plane is truly parallelizable.
* Avoids the load-balancing problem of having to route UI requests to the single active Core.

**Cons**

* Requires a job queue implementation to avoid duplicating scheduled work, e.g. MFA disconnects.
* Requires additional routing logic for Edge and Gateway components so they do not send the same request to multiple Cores.
* Requires a truly stateless Core.

#### 2. Active-failover

* Run N Cores.
* Exactly **one leader** stores the "lease row" in the database, then establishes gRPC connections to **all** Gateways and Edges and performs all write and control actions.
* The leader periodically updates the heartbeat on the lease row.
* Standby Cores monitor the heartbeat. If the heartbeat is not renewed, they race to acquire the lease.
* Once a Core successfully stores the lease row, it establishes gRPC connections and acts as the new leader.

<figure><img src="/files/xYfk8eV05NXNxZqqRefc" alt=""><figcaption></figcaption></figure>

**Pros**

* Cleanest correctness story (no split-brain control plane).
* Simplifies gRPC connection topology: each Edge/Gateway sees **one controlling Core**.

**Cons**

* This is a "failover" rather than a true "HA" solution.
* Requires robust failure detection, leader election (DB lease / k8s lease), and careful failover to avoid two leaders during partitions.
* Does not scale control-plane (Core) throughput horizontally; it only gives us failover, which may be all we need for now.
* HTTP requests must be routed only to the active Core. One option is for standby instances to reply to health checks with error codes, so that the load balancer routes requests only to the leader.
* Open question: what happens if only the Core-Edge connection fails while the Core itself keeps working?
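
The health-check idea mentioned in the list above could look roughly like this. It is a hypothetical sketch: the `health_status` helper and the choice of `503` for standby instances are assumptions, not something the source specifies.

```python
from http import HTTPStatus


def health_status(is_leader: bool) -> HTTPStatus:
    """Status a Core would return from its health endpoint (hypothetical)."""
    # Standby instances report 503 so a load balancer takes them out of rotation.
    return HTTPStatus.OK if is_leader else HTTPStatus.SERVICE_UNAVAILABLE
```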

#### 3. Vertically connected components

Each Core connects to one Edge and one Gateway.

**Deal-breaking issue:** For mobile-assisted MFA, Core has to be able to route responses to all Edges, not just the one from which the request originates. This is not possible with this approach.

#### 4. Decoupling the components via external queue

* Introduce an external message bus or queue into the stack, such as Redis or RabbitMQ.
* Refactor the components to use the queue instead of gRPC.

**Deal-breaking issues:**

* A major rewrite of all communication.
* Increased deployment complexity.

### Decision

Option 1, the active-active approach, is selected.

### Rationale

The active-active approach provides true high availability rather than availability through failover. With multiple Core instances operating concurrently, the system is self-correcting and can continue to function during partial failures without waiting for explicit leader detection or role transitions.

Although active-active operation introduces the need to coordinate scheduled and background tasks, this coordination is more constrained and predictable than the failure detection, leader election, and fencing mechanisms required by the active-failover design (option 2).

## Access Control List changes

### Separating Alias kinds

In Defguard 2.0, the new UI clearly separates the previously existing alias kinds into two distinct sections:

* **Component** aliases are now just **Aliases**.
* **Destination** aliases are now **Destinations**.

When creating or editing rules, the UI now also clearly distinguishes **Aliases**, which are combined with the manually configured destination, from predefined **Destinations**.

This better reflects the different roles of both types of aliases:

* **Aliases** are reusable fragments used to configure a rule-local destination.
* **Destinations** are complete predefined destinations, each converted into a separate set of firewall rules, just like the manually configured destination.

This distinction already existed in practice in previous versions, but it was not expressed clearly enough in the UI.

Both Aliases and Destinations are still stored in the same underlying database model. The new split is enforced at the UI and API level.

### Explicit destination configuration

In previous Defguard versions, the ACL rule logic regarding destinations broadly reflected how most firewalls, such as `nftables` and PF (Packet Filter), work. As a result, some behaviors were implicit.

In particular, rules and aliases could omit destination addresses, ports, or protocols, which implicitly meant "match any". This matched firewall semantics, but it introduced ambiguity in the data model and in the UI:

* The UI showed a placeholder value ("All addresses/ports/protocols"), but the intent to match all addresses, ports, or protocols was not represented in the data model itself.
* The logic for generating firewall rules had to assume user intent, especially for rules using aliases.
* Validation and editing logic had to handle a number of edge cases.

As ACL functionality evolved, this approach became harder to maintain consistently. Defguard 2.0 introduces a more explicit ACL model to reduce ambiguity while preserving the effective firewall behavior of existing rules.

#### Database model changes

In Defguard 2.0, the ACL database model makes destination semantics explicit instead of inferring "match any" from empty fields.

Both the aliases and rules database tables now include explicit boolean flags for configuring destinations:

* **any\_address**
* **any\_port**
* **any\_protocol**

In addition, rules now include **use\_manual\_destination\_settings**, which defines how destination configuration should be interpreted:

* When **true**, the rule uses its own destination fields together with the referenced **Aliases** (previously component aliases).
* When **false**, the rule uses only the referenced **Destinations**.

The rule model also adds **allow\_all\_groups** and **deny\_all\_groups** to align group handling with the already explicit "all" flags used for other source types.

Overall, the 2.0 schema preserves the effective firewall behavior of existing ACL rules while making the model clearer, easier to validate, and less dependent on implicit assumptions.
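
As a rough illustration of what the explicit flags mean for code that reads the model, here is a minimal sketch. The `RuleDestination` class and `effective_ports` helper are hypothetical, the field types are assumptions, and only the flag names come from the schema described above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RuleDestination:
    """Hypothetical in-memory view of the destination-related rule fields."""

    addresses: list[str] = field(default_factory=list)
    ports: list[int] = field(default_factory=list)
    protocols: list[int] = field(default_factory=list)  # assumed numeric IDs
    any_address: bool = False
    any_port: bool = False
    any_protocol: bool = False
    use_manual_destination_settings: bool = True


def effective_ports(rule: RuleDestination) -> Optional[list[int]]:
    """Return None for "any port", otherwise the explicitly listed ports."""
    # "Match any" is read from the flag, never inferred from an empty list.
    return None if rule.any_port else list(rule.ports)
```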

#### Backfill logic

The Defguard 2.0 database migration includes SQL backfill logic that converts legacy implicit ACL behavior into the new explicit model while preserving the meaning of existing rules.

For **Destinations** (previously destination aliases), the migration backfills the new `any_*` flags from the legacy fields:

* **any\_address** is set to **true** when the alias had no destination addresses and no destination ranges.
* **any\_port** is set to **true** when the alias had no ports.
* **any\_protocol** is set to **true** when the alias had no protocols.
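
The three conditions above can be expressed as a small function. The actual migration is written in SQL; this Python version is only a sketch of the same logic, and the `LegacyAlias` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LegacyAlias:
    """Hypothetical shape of a pre-2.0 destination alias."""

    addresses: list[str] = field(default_factory=list)
    address_ranges: list[tuple[str, str]] = field(default_factory=list)
    ports: list[int] = field(default_factory=list)
    protocols: list[int] = field(default_factory=list)


def backfill_destination_flags(alias: LegacyAlias) -> dict[str, bool]:
    """Derive the explicit any_* flags from empty legacy fields."""
    return {
        "any_address": not alias.addresses and not alias.address_ranges,
        "any_port": not alias.ports,
        "any_protocol": not alias.protocols,
    }
```

For example, a legacy alias that listed only port 443 would end up with `any_address` and `any_protocol` set to true and `any_port` set to false.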

For **Rules**, the migration evaluates both rule-local destination settings and linked aliases.

The rule flags are backfilled as follows:

* **any\_address** is set to **true** only if the rule had no direct destination addresses or ranges and no linked component alias contributed addresses.
* **any\_port** is set to **true** only if the rule had no direct ports and no linked component alias contributed ports.
* **any\_protocol** is set to **true** only if the rule had no direct protocols and no linked component alias contributed protocols.
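
The rule-level backfill differs from the alias-level one in that linked component aliases are taken into account. A minimal sketch of that logic follows; the real migration is SQL, and the `DestinationFields` class and `backfill_rule_flags` helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DestinationFields:
    """Hypothetical destination fields shared by legacy rules and aliases."""

    addresses: list[str] = field(default_factory=list)
    address_ranges: list[tuple[str, str]] = field(default_factory=list)
    ports: list[int] = field(default_factory=list)
    protocols: list[int] = field(default_factory=list)


def backfill_rule_flags(
    rule: DestinationFields, component_aliases: list[DestinationFields]
) -> dict[str, bool]:
    """A flag is true only if neither the rule nor any alias contributes values."""
    sources = [rule, *component_aliases]
    return {
        "any_address": not any(s.addresses or s.address_ranges for s in sources),
        "any_port": not any(s.ports for s in sources),
        "any_protocol": not any(s.protocols for s in sources),
    }
```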

The migration sets **use\_manual\_destination\_settings** to **false** only when a legacy rule was effectively driven entirely by destination aliases, meaning:

* the rule had no direct destination addresses, ranges, ports, or protocols,
* no linked component alias contributed any destination fragments,
* at least one linked destination alias existed.

In every other case, **use\_manual\_destination\_settings** remains **true**, preserving the previous behavior of rules that relied on direct destination settings or component aliases.
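
The decision described above reduces to a single boolean expression. The sketch below assumes the three inputs have already been computed from the legacy data; the function and parameter names are hypothetical.

```python
def backfill_use_manual_destination_settings(
    rule_has_direct_fields: bool,
    component_alias_contributes: bool,
    destination_alias_count: int,
) -> bool:
    """Return the migrated value of use_manual_destination_settings."""
    # A rule is "destination-only" when nothing manual is configured and
    # at least one destination alias is linked.
    destination_only = (
        not rule_has_direct_fields
        and not component_alias_contributes
        and destination_alias_count >= 1
    )
    return not destination_only
```

Note that a rule with no direct fields, no component aliases, and no destination aliases keeps the value true, which matches the "in every other case" wording.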

As a result, legacy empty destination fields become explicit "match any" flags, rules based only on destination aliases become explicit Destination-based rules, and mixed or manual rules continue to behave as they did before migration.

#### Renamed columns

The database migration also renames some columns in ACL-related tables to better reflect their purpose:

* `destination` -> `addresses`
* `all_networks` -> `all_locations`

