# 06 - Prod Terraform IaC

This phase creates HA-focused IaaS resources in the prod Hetzner Cloud Project with Terraform. The document is self-contained and can be handed to the prod Terraform agent on its own.

## Scope

Terraform creates the following in the prod environment:

- Private network: `iklim-prod-net`
- Subnets:
  - App/Swarm subnet: `10.20.10.0/24`
  - DB subnet: `10.20.20.0/24`
- Firewall:
  - Public ingress: only `22/tcp`, `80/tcp`, `443/tcp`
  - Private ingress: prod rules in `01-private-network-port-matrix.md`
- SSH key
- Placement groups:
  - `iklim-prod-app-spread`
  - `iklim-prod-db-spread`
- Floating IP: stable IPv4 for the app entry point, assigned to `iklim-app-01`
- Servers:
  - `iklim-app-01`
  - `iklim-app-02`
  - `iklim-app-03`
  - `iklim-db-01`
  - `iklim-db-02`
  - `iklim-db-03`
- Ansible inventory output

DB cluster software is not installed with Terraform. Terraform prepares the DB nodes only at the machine, network, and firewall level.

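The network part of the scope could look roughly like the sketch below for `network.tf`. The resource labels (`prod`, `app`, `db`) and the parent range `10.20.0.0/16` are assumptions made for illustration; the source only fixes the network name and the two subnet ranges.

```hcl
# network.tf (sketch, not the project's actual file)

resource "hcloud_network" "prod" {
  name     = "iklim-prod-net"
  ip_range = "10.20.0.0/16" # assumption: parent range that covers both subnets
}

# App/Swarm subnet
resource "hcloud_network_subnet" "app" {
  network_id   = hcloud_network.prod.id
  type         = "cloud"
  network_zone = "eu-central" # network zone that contains fsn1
  ip_range     = "10.20.10.0/24"
}

# DB subnet
resource "hcloud_network_subnet" "db" {
  network_id   = hcloud_network.prod.id
  type         = "cloud"
  network_zone = "eu-central"
  ip_range     = "10.20.20.0/24"
}
```
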
## Version Requirements

```text
Terraform >= 1.6
hcloud provider ~> 1.49
```

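These constraints would typically be pinned in `versions.tf`, with the provider configured in `providers.tf`. The following is a minimal sketch; it assumes a `hcloud_token` variable is declared in `variables.tf`.

```hcl
# versions.tf (sketch)
terraform {
  required_version = ">= 1.6"

  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = "~> 1.49"
    }
  }
}

# providers.tf (sketch)
provider "hcloud" {
  token = var.hcloud_token # assumed to be declared in variables.tf
}
```
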
## Recommended File Structure

```text
terraform/
  hetzner/
    prod/
      versions.tf
      providers.tf
      variables.tf
      locals.tf
      network.tf
      firewall.tf
      placement.tf
      servers.tf
      floating_ip.tf
      outputs.tf
      terraform.tfvars.example
```

`terraform.tfvars`, state files, and tokens must not be committed to the repo.

## Variables

The `environment` constant is defined in `locals.tf`; it cannot be overridden through `tfvars`.

Minimum variables:

```hcl
hcloud_token              = "secret"
location                  = "fsn1"
image                     = "rocky-10"
server_type_app           = "cpx42"
server_type_db            = "cpx32"
admin_ssh_public_key_path = "~/.ssh/id_rsa.pub"
admin_allowed_cidrs       = ["X.X.X.X/32"]
```

The server types were chosen based on the current test environment metrics in `../hetzner-sizing-report.md` and the prod cluster topology. `cpx42` is recommended for prod app nodes because of Java microservice memory pressure; the more economical `cpx32` is recommended for prod DB nodes because the cluster starts with 3 nodes. Once capacity needs are validated with metrics, nodes can be added or rescaled in place.

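The values above need matching declarations in `variables.tf`. The sketch below is one possible shape: the descriptions, the choice to leave out defaults, and the `hcloud_ssh_key` resource label and key name are assumptions. Note that `file()` does not expand `~`, so the path is passed through `pathexpand()` first.

```hcl
# variables.tf (sketch)
variable "hcloud_token" {
  description = "API token of the prod Hetzner Cloud Project"
  type        = string
  sensitive   = true # keeps the token out of plan/apply output
}

variable "location" {
  type = string
}

variable "image" {
  type = string
}

variable "server_type_app" {
  type = string
}

variable "server_type_db" {
  type = string
}

variable "admin_ssh_public_key_path" {
  type = string
}

variable "admin_allowed_cidrs" {
  description = "CIDRs allowed to reach 22/tcp"
  type        = list(string)
}

# Hypothetical resource label and key name, shown only to illustrate path handling
resource "hcloud_ssh_key" "admin" {
  name       = "iklim-prod-admin"
  public_key = file(pathexpand(var.admin_ssh_public_key_path))
}
```
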
## Server Roles and Private IP Plan

| Server | Private IP | Role |
| --- | --- | --- |
| `iklim-app-01` | `10.20.10.11` | Swarm manager + app worker + runner; primary, receives FIP |
| `iklim-app-02` | `10.20.10.12` | Swarm manager + app worker + runner |
| `iklim-app-03` | `10.20.10.13` | Swarm manager + app worker + runner |
| `iklim-db-01` | `10.20.20.11` | Manual DB cluster node |
| `iklim-db-02` | `10.20.20.12` | Manual DB cluster node |
| `iklim-db-03` | `10.20.20.13` | Manual DB cluster node |

Private IPs are statically defined in `locals.tf` as the `app_private_ips` and `db_private_ips` maps. The server list is derived from these maps with `for_each`.

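A minimal sketch of how the two maps and the `for_each` derivation could fit together is shown below. The map names come from the text above and the `hcloud_server.app` address matches the one used later in this document; the `labels` block is an assumption, and the DB servers would follow the same pattern with `local.db_private_ips`.

```hcl
# locals.tf (sketch)
locals {
  environment = "prod"

  app_private_ips = {
    "iklim-app-01" = "10.20.10.11"
    "iklim-app-02" = "10.20.10.12"
    "iklim-app-03" = "10.20.10.13"
  }

  db_private_ips = {
    "iklim-db-01" = "10.20.20.11"
    "iklim-db-02" = "10.20.20.12"
    "iklim-db-03" = "10.20.20.13"
  }
}

# servers.tf (sketch): one server per map key, so the map is the single source of truth
resource "hcloud_server" "app" {
  for_each = local.app_private_ips

  name        = each.key
  image       = var.image
  server_type = var.server_type_app
  location    = var.location

  # Assumption: labels are not specified in the source document
  labels = {
    environment = local.environment
    role        = "app"
  }
}
```

Keying the resource by server name keeps addresses such as `hcloud_server.app["iklim-app-01"]` stable when entries are added to or removed from the map.
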
## Recommended Resources and Cost

| Server | Role | Server Type | CPU | RAM | SSD | Monthly |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| `iklim-app-01` | Swarm manager + app worker + runner | `cpx42` | 8 AMD | 16 GB | 320 GB | $29.99 |
| `iklim-app-02` | Swarm manager + app worker + runner | `cpx42` | 8 AMD | 16 GB | 320 GB | $29.99 |
| `iklim-app-03` | Swarm manager + app worker + runner | `cpx42` | 8 AMD | 16 GB | 320 GB | $29.99 |
| `iklim-db-01` | DB cluster node | `cpx32` | 4 AMD | 8 GB | 160 GB | $16.49 |
| `iklim-db-02` | DB cluster node | `cpx32` | 4 AMD | 8 GB | 160 GB | $16.49 |
| `iklim-db-03` | DB cluster node | `cpx32` | 4 AMD | 8 GB | 160 GB | $16.49 |
| **Total** | 6 servers | | **36 vCPU** | **72 GB** | **1,440 GB** | **$139.44** |

## Placement Group Decision

Prod uses two separate spread placement groups:

```text
iklim-prod-app-spread: iklim-app-01/02/03
iklim-prod-db-spread:  iklim-db-01/02/03
```

The goal is to place the Swarm quorum nodes on different physical hosts from each other, and likewise the DB nodes.

Notes:

- Hetzner does not provide direct cabinet selection.
- A spread placement group targets different physical hosts.
- Disaster recovery across different locations/regions is outside the scope of this phase.
- Multi-location DR must be designed separately later, when scale grows.

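The two groups could be declared in `placement.tf` roughly as follows. The resource labels `app` and `db` are assumptions; the group names are the ones listed above.

```hcl
# placement.tf (sketch)
resource "hcloud_placement_group" "app" {
  name = "iklim-prod-app-spread"
  type = "spread" # asks Hetzner to place members on different physical hosts
}

resource "hcloud_placement_group" "db" {
  name = "iklim-prod-db-spread"
  type = "spread"
}

# A server joins a group through its placement_group_id argument, for example:
#   placement_group_id = hcloud_placement_group.app.id
```
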
## Floating IP

An IPv4 floating IP named `iklim-prod-app-fip` is created and assigned to `iklim-app-01`. The DNS A record points to this IP. If failover is needed, the floating IP can be reassigned to another app node.

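A possible `floating_ip.tf` is sketched below. It assumes the app servers are addressable as `hcloud_server.app["iklim-app-01"]` and that `var.location` is declared; the resource label `app` is hypothetical.

```hcl
# floating_ip.tf (sketch)
resource "hcloud_floating_ip" "app" {
  name          = "iklim-prod-app-fip"
  type          = "ipv4"
  home_location = var.location
}

# Keeping the assignment in its own resource means a failover only
# changes server_id instead of touching the IP itself.
resource "hcloud_floating_ip_assignment" "app" {
  floating_ip_id = hcloud_floating_ip.app.id
  server_id      = hcloud_server.app["iklim-app-01"].id
}
```

On Hetzner Cloud the floating IP also has to be configured on the server's network interface before the host answers on it; that host-level step is not covered by this Terraform sketch.
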
## Public Firewall

Public ingress:

| Port | Source | Target |
| --- | --- | --- |
| `22/tcp` | `admin_allowed_cidrs` | All prod nodes |
| `80/tcp` | `0.0.0.0/0`, `::/0` | `iklim-app-*` through Floating IP |
| `443/tcp` | `0.0.0.0/0`, `::/0` | `iklim-app-*` through Floating IP |

The following ports will not be opened publicly in prod:

- `8200/tcp` Vault
- `5432/tcp` PostgreSQL
- `27017/tcp` MongoDB
- `5672/tcp`, `15672/tcp`, `61613/tcp`, `15674/tcp` RabbitMQ
- `2377/tcp`, `7946/tcp`, `7946/udp`, `4789/udp` Docker Swarm
- `9180/tcp` APISIX Admin API
- `9090/tcp` Prometheus
- `3000/tcp` Grafana

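The public ingress table could translate into a firewall resource along these lines. This is a sketch for the app nodes only: the resource label `app_public` and the firewall name are hypothetical, and `var.admin_allowed_cidrs` is assumed to be a `list(string)`. Because a Hetzner firewall drops any inbound traffic that no rule allows, the ports listed as not opened simply have no rule.

```hcl
# firewall.tf (sketch, public ingress for app nodes)
resource "hcloud_firewall" "app_public" {
  name = "iklim-prod-app-fw" # hypothetical name

  # SSH only from the admin CIDRs
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "22"
    source_ips = var.admin_allowed_cidrs
  }

  # HTTP and HTTPS from anywhere (IPv4 and IPv6)
  dynamic "rule" {
    for_each = toset(["80", "443"])
    content {
      direction  = "in"
      protocol   = "tcp"
      port       = rule.value
      source_ips = ["0.0.0.0/0", "::/0"]
    }
  }
}
```
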
## Private Firewall

Firewall placement follows the Swarm placement model:

- DB/cluster services on `iklim-db-*` nodes: Patroni/PostgreSQL, MongoDB, and etcd.
- App/service-node infrastructure on `iklim-app-*` nodes: Vault, RabbitMQ, APISIX, Prometheus, Grafana, SWAG, and the Redis/Sentinel services from `docker-stack-infra_db-prod.yml`.

RabbitMQ ports are therefore documented under the app firewall. Redis and Redis Sentinel do not publish host-mode ports in the current prod stack; they stay on the Docker overlay network and do not need Hetzner firewall openings.

### App (Swarm) Firewall: Private Ingress

Traffic from the app subnet (`10.20.10.0/24`):

| Port | Service | Access method |
| --- | --- | --- |
| `2377/tcp` | Docker Swarm control plane | From app subnet |
| `7946/tcp,udp` | Docker Swarm node discovery | From app subnet |
| `4789/udp` | Docker Swarm VXLAN overlay | From app subnet |
| `8200/tcp` | Vault | Docker overlay / private network |
| `5672/tcp` | RabbitMQ AMQP | From app subnet |
| `61613/tcp` | RabbitMQ STOMP | From app subnet |
| `15674/tcp` | RabbitMQ Web STOMP | From app subnet |
| `15672/tcp` | RabbitMQ Management | Behind SWAG `443`, IP restricted |
| `9000/tcp` | APISIX Dashboard | Behind SWAG `443`, IP restricted |
| `9180/tcp` | APISIX Admin API | Only the Dashboard accesses it, over the Docker overlay |
| `9090/tcp` | Prometheus | Behind SWAG `443`, IP restricted |
| `3000/tcp` | Grafana | Behind SWAG `443`, IP restricted |

Traffic from the DB subnet, allowed because `iklim-db-*` nodes join Swarm as workers:

| Port | Service | Source |
| --- | --- | --- |
| `2377/tcp` | Docker Swarm control plane | `10.20.20.0/24` |
| `7946/tcp,udp` | Docker Swarm node discovery | `10.20.20.0/24` |
| `4789/udp` | Docker Swarm VXLAN overlay | `10.20.20.0/24` |

### DB Firewall: Private Ingress

Admin access:

| Port | Service | Source |
| --- | --- | --- |
| `22/tcp` | SSH | `admin_allowed_cidrs` |

Traffic from the app subnet (`10.20.10.0/24`):

| Port | Service | Note |
| --- | --- | --- |
| `5432/tcp` | PostgreSQL (Patroni primary) | App subnet access |
| `27017/tcp` | MongoDB replica set endpoint | App subnet access |
| `2379/tcp` | etcd client (Patroni + APISIX) | App subnet access |
| `2377/tcp` | Docker Swarm control plane | From app subnet |
| `7946/tcp,udp` | Docker Swarm node discovery | From app subnet |
| `4789/udp` | Docker Swarm VXLAN overlay | From app subnet |

Mutual access inside the DB subnet (`10.20.20.0/24`):

| Port | Service | Note |
| --- | --- | --- |
| `5432/tcp` | PostgreSQL Patroni replication | Between DB nodes |
| `27017/tcp` | MongoDB replica set internal | Between DB nodes |
| `2379/tcp` | etcd client | Patroni -> etcd access |
| `2380/tcp` | etcd peer | etcd cluster internal |
| `8008/tcp` | Patroni REST API | Patroni leader election and health check |

For the services marked "IP restricted" in the app firewall table, the restriction is enforced in the SWAG nginx configuration, not in the Hetzner firewall.

## Outputs

The following values are printed at the end of `terraform apply` and can be read again at any time with `terraform output`:

| Output | Description |
| --- | --- |
| `ansible_inventory_yaml` | Ansible inventory YAML, to be saved as `ansible/prod/inventory/generated/prod.yml` |
| `prod_private_ips` | Private IP map of all nodes, with `app` and `db` subkeys |
| `prod_public_ips` | Public IPv4 map of all nodes |
| `prod_floating_ip` | Floating IP address for the Swarm entry point; the DNS A record points to this IP |

To extract the Ansible inventory:

```bash
terraform output -raw ansible_inventory_yaml > \
  ../../../ansible/prod/inventory/generated/prod.yml
```

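The outputs in the table could be defined in `outputs.tf` along the lines below. This is a sketch under several assumptions: servers are declared as `hcloud_server.app` and `hcloud_server.db` with `for_each`, the floating IP is `hcloud_floating_ip.app`, and the inventory layout (group names, `ansible_host`, `private_ip`) is hypothetical, since the source does not specify it.

```hcl
# outputs.tf (sketch)
output "prod_private_ips" {
  value = {
    app = local.app_private_ips
    db  = local.db_private_ips
  }
}

output "prod_public_ips" {
  value = merge(
    { for name, s in hcloud_server.app : name => s.ipv4_address },
    { for name, s in hcloud_server.db : name => s.ipv4_address }
  )
}

output "prod_floating_ip" {
  value = hcloud_floating_ip.app.ip_address
}

# Hypothetical inventory layout; adjust to what the Ansible playbooks expect
output "ansible_inventory_yaml" {
  value = yamlencode({
    all = {
      children = {
        app = {
          hosts = {
            for name, s in hcloud_server.app : name => {
              ansible_host = s.ipv4_address
              private_ip   = local.app_private_ips[name]
            }
          }
        }
        db = {
          hosts = {
            for name, s in hcloud_server.db : name => {
              ansible_host = s.ipv4_address
              private_ip   = local.db_private_ips[name]
            }
          }
        }
      }
    }
  })
}
```

Building the inventory with `yamlencode` keeps it derived from state, so `terraform output -raw` always reflects the servers that actually exist.
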
## Lifecycle and Resize Policy

### `server_type` Change (Resize)

Changing `server_type` does **not** trigger Terraform destroy+create. The `hcloud` provider supports this natively: it stops the server, calls the Hetzner Resize API, and starts it again. Update the value in `terraform.tfvars` and run `terraform apply`.

There is downtime, because the server stops and starts, but disk, installed software, and Docker volumes are preserved. No `ignore_changes` or manual step is required.

### Which Changes Force Server Recreation?

| Changed field | Behavior | Note |
| --- | --- | --- |
| `server_type` | In-place resize (provider native) | `terraform apply` is enough |
| `hcloud_server_network` | Only attachment is updated | Because a separate resource is used |
| `hcloud_firewall_attachment` | Only attachment is updated | Because a separate resource is used |
| `placement_group_id` | Hetzner API does not allow changing it -> destroy+create | Do not change |
| `image` | Disk image changes -> destroy+create | Do not change |
| `location` | Cannot be moved to another datacenter -> destroy+create | Do not change |

### Network and Firewall Attachment Separation

The `network` block and `firewall_ids` are not embedded inside `hcloud_server`. Instead, separate resources are defined:

- `hcloud_server_network`: private IP assignment, one per node with `for_each`
- `hcloud_firewall_attachment`: firewall relationship, using the server list derived with `for_each`

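The separation could be expressed roughly as follows for the app nodes. The sketch assumes resources named `hcloud_server.app`, `hcloud_network_subnet.app`, and `hcloud_firewall.app_public` exist elsewhere in the configuration; those labels are illustrative, not taken from the source.

```hcl
# Private IP assignment: one attachment per app node, IP taken from the locals map
resource "hcloud_server_network" "app" {
  for_each = local.app_private_ips

  server_id = hcloud_server.app[each.key].id
  subnet_id = hcloud_network_subnet.app.id
  ip        = each.value
}

# Firewall relationship: a single attachment listing every app server
resource "hcloud_firewall_attachment" "app" {
  firewall_id = hcloud_firewall.app_public.id
  server_ids  = [for s in hcloud_server.app : s.id]
}
```

Because both relationships live outside `hcloud_server`, changing an IP or a firewall only touches the attachment resource and leaves the server itself untouched in the plan.
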
### `prevent_destroy` Protection

Each server gets `lifecycle { prevent_destroy = true }`. To intentionally delete a server, temporarily remove the lifecycle block first.

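As an illustration, a DB server resource with the protection in place might look like the sketch below. It assumes `local.db_private_ips`, `var.image`, `var.server_type_db`, `var.location`, and a placement group `hcloud_placement_group.db` are defined elsewhere.

```hcl
# servers.tf (sketch, DB nodes)
resource "hcloud_server" "db" {
  for_each = local.db_private_ips

  name               = each.key
  image              = var.image
  server_type        = var.server_type_db
  location           = var.location
  placement_group_id = hcloud_placement_group.db.id

  # Any plan that would destroy one of these servers fails with an error
  lifecycle {
    prevent_destroy = true
  }
}
```

`prevent_destroy` only accepts a literal value, which is why removing the block temporarily is the way to allow an intentional deletion.
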
## How to Run

### Preparation

**1. Create tfvars once:**

```bash
cd Environment_Infrastructure/terraform/hetzner/prod
cp terraform.tfvars.example terraform.tfvars
# Fill terraform.tfvars with real values
# (hcloud_token, admin_allowed_cidrs, etc.)
```

`terraform.tfvars` is not committed; it is excluded through `.gitignore`.

**2. Install the provider once:**

```bash
terraform init
```

### First Apply

```bash
# Show what will be created; do not make changes
terraform plan

# Approve and create
terraform apply
```

After `apply`, 6 servers, 2 firewalls, 1 floating IP, and the network resources are visible in the Hetzner Cloud Console.

### Get Ansible Inventory

```bash
terraform output -raw ansible_inventory_yaml > \
  ../../../ansible/prod/inventory/generated/prod.yml
```

### Gitea Variable: `PROD_FLOATING_IP`

The deploy pipeline needs this variable to manage DNS records automatically. It is set once after `terraform apply`:

```bash
# -raw prints the bare address without surrounding quotes
terraform output -raw prod_floating_ip
```

Add the resulting IP address in Gitea -> project settings -> **Variables** with the name `PROD_FLOATING_IP`. The pipeline reads it with `vars.PROD_FLOATING_IP` and updates GoDaddy A records idempotently.

### Resize (Change Server Type)

Change the `server_type_app` or `server_type_db` value in `terraform.tfvars`, then apply:

```bash
terraform apply
```

The server is stopped, the Hetzner Resize API is called, and the server is started again. Disk and Docker volumes are preserved. There is downtime.

### Server Deletion (Forced)

Because `prevent_destroy = true` is set, a normal `terraform destroy` fails. First, temporarily comment out the `lifecycle` block in `servers.tf`:

```hcl
# lifecycle {
#   prevent_destroy = true
# }
```

Then:

```bash
# Single quotes keep the shell from stripping the double quotes inside the resource address
terraform destroy -target='hcloud_server.app["iklim-app-01"]'
```

After completing the operation, add the lifecycle block back.

### State Management

Local state (`terraform.tfstate`) is used for now. The state file is not committed to the repo. As soon as more than one person works on this infrastructure, remote state must be used, for example on Hetzner Object Storage or HCP Terraform.

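If the Hetzner Object Storage route is chosen, the state could be moved to Terraform's S3 backend, since Object Storage is S3-compatible. The sketch below is an assumption-heavy starting point: the bucket name and key are hypothetical, the endpoint follows the `<location>.your-objectstorage.com` pattern and should be confirmed in the Hetzner console, and the exact set of `skip_*` arguments should be checked against the S3 backend documentation for the installed Terraform version. Credentials are expected in the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables.

```hcl
# Sketch of a remote state backend on S3-compatible storage
terraform {
  backend "s3" {
    bucket = "iklim-prod-tfstate"             # hypothetical bucket name
    key    = "hetzner/prod/terraform.tfstate" # hypothetical object key
    region = "fsn1"                           # placeholder; not an AWS region

    endpoints = {
      s3 = "https://fsn1.your-objectstorage.com" # assumption: verify the endpoint
    }

    # Disable AWS-specific checks that a non-AWS S3 service cannot answer
    skip_credentials_validation = true
    skip_metadata_api_check     = true
    skip_region_validation      = true
    skip_requesting_account_id  = true
    use_path_style              = true
  }
}
```

After adding a backend block, `terraform init -migrate-state` moves the existing local state to the new location. State locking on S3-compatible storage needs separate consideration before several people apply concurrently.
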
## Acceptance Criteria

- `terraform plan` works only with the prod Hetzner Project token.
- 6 servers are created: `iklim-app-01/02/03`, `iklim-db-01/02/03`.
- Swarm nodes are inside the `iklim-prod-app-spread` placement group.
- DB nodes are inside the `iklim-prod-db-spread` placement group.
- Public firewall allows only `22`, `80`, and `443` ingress.
- Private firewall is compatible with `01-private-network-port-matrix.md`.
- DB replication ports are accessible only from the DB subnet.
- Floating IP is created and assigned to `iklim-app-01`.
- Terraform state and secret tfvars are not committed.