Environment_Infrastructure/setup/06-prod-terraform-iac.md
Murat ÖZDEMİR 67dc2986dd docs(infra): restructure and update infrastructure setup documentation
- Anglicized setup and facts markdown file names for better consistency.

- Updated 01-swarm-init-multinode.md to highlight Ansible automation of Swarm initialization and labeling.

- Overhauled 03-infra-stack-changes.md to describe the single monolithic file strategy and reflect current Redis, RabbitMQ, and etcd cluster configurations.

- Fixed minor overrides and typos in Patroni templates and Ansible bootstrap documents.

- Restructured README and roadmap mapping to align with the renamed setup documents.
2026-06-15 16:42:18 +03:00


# 06 - Prod Terraform IaC
The purpose of this phase is to create HA-focused IaaS resources inside the prod Hetzner Cloud Project with Terraform. This document can be given to the prod Terraform agent on its own.
## Scope
Terraform creates the following in the prod environment:
- Private network: `iklim-prod-net`
- Subnets:
  - App/Swarm subnet: `10.20.10.0/24`
  - DB subnet: `10.20.20.0/24`
- Firewall:
  - Public ingress: only `22/tcp`, `80/tcp`, `443/tcp`
  - Private ingress: prod rules in `01-private-network-port-matrix.md`
- SSH key
- Placement groups:
  - `iklim-prod-app-spread`
  - `iklim-prod-db-spread`
- Floating IP: stable IPv4 for the app entry point, assigned to `iklim-app-01`
- Servers:
  - `iklim-app-01`
  - `iklim-app-02`
  - `iklim-app-03`
  - `iklim-db-01`
  - `iklim-db-02`
  - `iklim-db-03`
- Ansible inventory output

DB cluster software will not be installed with Terraform. DB nodes will be prepared only at the machine, network, and firewall level.
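
The network part of the scope could look like the following in `network.tf`. This is a minimal sketch, not the project's actual file: the enclosing range `10.20.0.0/16` and the resource labels `prod`, `app`, and `db` are assumptions, and `eu-central` is the Hetzner network zone that contains `fsn1`.

```hcl
# Private network that encloses both prod subnets.
# The /16 range is an assumption; the source only fixes the two /24 subnets.
resource "hcloud_network" "prod" {
  name     = "iklim-prod-net"
  ip_range = "10.20.0.0/16"
}

# App/Swarm subnet
resource "hcloud_network_subnet" "app" {
  network_id   = hcloud_network.prod.id
  type         = "cloud"
  network_zone = "eu-central"
  ip_range     = "10.20.10.0/24"
}

# DB subnet
resource "hcloud_network_subnet" "db" {
  network_id   = hcloud_network.prod.id
  type         = "cloud"
  network_zone = "eu-central"
  ip_range     = "10.20.20.0/24"
}
```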
## Version Requirements
```text
Terraform >= 1.6
hcloud provider ~> 1.49
```
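These constraints would typically be pinned in `versions.tf`. A minimal sketch, assuming the provider is pulled from the public Terraform Registry under `hetznercloud/hcloud`:

```hcl
terraform {
  # Matches "Terraform >= 1.6"
  required_version = ">= 1.6"

  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      # Matches "hcloud provider ~> 1.49": any 1.x release from 1.49 up, but not 2.0
      version = "~> 1.49"
    }
  }
}
```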
## Recommended File Structure
```text
terraform/
  hetzner/
    prod/
      versions.tf
      providers.tf
      variables.tf
      locals.tf
      network.tf
      firewall.tf
      placement.tf
      servers.tf
      floating_ip.tf
      outputs.tf
      terraform.tfvars.example
```
`terraform.tfvars`, state files, and tokens will not be committed to the repo.
## Variables
The `environment` constant is in `locals.tf`; it is not overridden with `tfvars`.
Minimum variables:
```hcl
hcloud_token = "secret"
location = "fsn1"
image = "rocky-10"
server_type_app = "cpx42"
server_type_db = "cpx32"
admin_ssh_public_key_path = "~/.ssh/id_rsa.pub"
admin_allowed_cidrs = ["X.X.X.X/32"]
```
The server types were chosen based on the current test environment metrics in `../hetzner-sizing-report.md` and the prod cluster topology. `cpx42` is recommended for prod app nodes because of the memory pressure from the Java microservices, and the more economical `cpx32` is recommended for prod DB nodes because the cluster starts with 3 nodes. Once metrics confirm the real capacity needs, nodes can be added or existing nodes can be resized in place.
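
The tfvars keys above need matching declarations. The following is a sketch of how `variables.tf` and `providers.tf` might declare them, plus an SSH key resource that consumes the key path. The descriptions, the CIDR validation, and the key name `iklim-prod-admin` are assumptions, not taken from the project.

```hcl
# variables.tf (sketch)
variable "hcloud_token" {
  description = "API token of the prod Hetzner Cloud Project"
  type        = string
  sensitive   = true # keeps the token out of plan/apply output
}

variable "location" {
  type = string
}

variable "image" {
  type = string
}

variable "server_type_app" {
  type = string
}

variable "server_type_db" {
  type = string
}

variable "admin_ssh_public_key_path" {
  type = string
}

variable "admin_allowed_cidrs" {
  description = "CIDRs allowed to reach 22/tcp"
  type        = list(string)

  # Assumed guard: reject entries that are not valid CIDR blocks.
  validation {
    condition     = alltrue([for c in var.admin_allowed_cidrs : can(cidrhost(c, 0))])
    error_message = "Every entry in admin_allowed_cidrs must be a valid CIDR block."
  }
}

# providers.tf (sketch)
provider "hcloud" {
  token = var.hcloud_token
}

# SSH key uploaded to the project; the name is hypothetical.
resource "hcloud_ssh_key" "admin" {
  name       = "iklim-prod-admin"
  public_key = file(pathexpand(var.admin_ssh_public_key_path)) # pathexpand resolves a leading "~"
}
```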
## Server Roles and Private IP Plan
| Server | Private IP | Role |
| --- | --- | --- |
| `iklim-app-01` | `10.20.10.11` | Swarm manager + app worker + runner; primary, receives FIP |
| `iklim-app-02` | `10.20.10.12` | Swarm manager + app worker + runner |
| `iklim-app-03` | `10.20.10.13` | Swarm manager + app worker + runner |
| `iklim-db-01` | `10.20.20.11` | Manual DB cluster node |
| `iklim-db-02` | `10.20.20.12` | Manual DB cluster node |
| `iklim-db-03` | `10.20.20.13` | Manual DB cluster node |
Private IPs are statically defined inside `locals.tf` as the `app_private_ips` and `db_private_ips` maps. The server list is derived from these maps with `for_each`.
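
The maps described above could be written as follows. This is a sketch based on the IP plan table: the value `"prod"` for `environment` and the helper `all_private_ips` are assumptions.

```hcl
# locals.tf (sketch)
locals {
  # Assumed value; the source only says the constant lives in locals.tf.
  environment = "prod"

  app_private_ips = {
    "iklim-app-01" = "10.20.10.11"
    "iklim-app-02" = "10.20.10.12"
    "iklim-app-03" = "10.20.10.13"
  }

  db_private_ips = {
    "iklim-db-01" = "10.20.20.11"
    "iklim-db-02" = "10.20.20.12"
    "iklim-db-03" = "10.20.20.13"
  }

  # Hypothetical convenience map of all six nodes.
  all_private_ips = merge(local.app_private_ips, local.db_private_ips)
}
```

A resource can then iterate with `for_each = local.app_private_ips`, where `each.key` is the server name and `each.value` is its private IP.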
## Recommended Resources and Cost
| Server | Role | Server Type | CPU | RAM | SSD | Monthly |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| `iklim-app-01` | Swarm manager + app worker + runner | `cpx42` | 8 AMD | 16 GB | 320 GB | $29.99 |
| `iklim-app-02` | Swarm manager + app worker + runner | `cpx42` | 8 AMD | 16 GB | 320 GB | $29.99 |
| `iklim-app-03` | Swarm manager + app worker + runner | `cpx42` | 8 AMD | 16 GB | 320 GB | $29.99 |
| `iklim-db-01` | DB cluster node | `cpx32` | 4 AMD | 8 GB | 160 GB | $16.49 |
| `iklim-db-02` | DB cluster node | `cpx32` | 4 AMD | 8 GB | 160 GB | $16.49 |
| `iklim-db-03` | DB cluster node | `cpx32` | 4 AMD | 8 GB | 160 GB | $16.49 |
| **Total** | 6 servers | | **36 vCPU** | **72 GB** | **1,440 GB** | **$139.44** |
## Placement Group Decision
Prod uses two separate spread placement groups:
```text
iklim-prod-app-spread: iklim-app-01/02/03
iklim-prod-db-spread: iklim-db-01/02/03
```
The goal is that no two Swarm quorum nodes share a physical host, and that no two DB nodes share a physical host.
Notes:
- Hetzner does not provide direct cabinet selection.
- A spread placement group targets different physical hosts.
- Disaster recovery across different locations/regions is outside the scope of this phase.
- Multi-location DR must be designed separately later when scale grows.
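
The two groups above could be declared in `placement.tf` roughly like this. The resource labels `app` and `db` are assumptions.

```hcl
# Spread groups: Hetzner places members on different physical hosts.
resource "hcloud_placement_group" "app" {
  name = "iklim-prod-app-spread"
  type = "spread"
}

resource "hcloud_placement_group" "db" {
  name = "iklim-prod-db-spread"
  type = "spread"
}
```

A server joins a group through its `placement_group_id` argument, for example `placement_group_id = hcloud_placement_group.app.id`.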
## Floating IP
An IPv4 floating IP named `iklim-prod-app-fip` is created and assigned to `iklim-app-01`. The DNS A record is pointed to this IP. If failover is needed, the floating IP can be moved to another app node.
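
A sketch of what `floating_ip.tf` might contain, assuming the app servers are declared as `hcloud_server.app` with `for_each` keyed by server name and that a `location` variable exists (both are assumptions about the surrounding configuration):

```hcl
resource "hcloud_floating_ip" "app" {
  name          = "iklim-prod-app-fip"
  type          = "ipv4"
  home_location = var.location
}

# Keeping the assignment in its own resource lets a failover change
# only server_id, without touching the IP itself.
resource "hcloud_floating_ip_assignment" "app" {
  floating_ip_id = hcloud_floating_ip.app.id
  server_id      = hcloud_server.app["iklim-app-01"].id
}
```

With this layout, moving the IP to another node means pointing `server_id` at, for example, `hcloud_server.app["iklim-app-02"].id` and applying.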
## Public Firewall
Public ingress:
| Port | Source | Target |
| --- | --- | --- |
| `22/tcp` | `admin_allowed_cidrs` | All prod nodes |
| `80/tcp` | `0.0.0.0/0`, `::/0` | `iklim-app-*` through Floating IP |
| `443/tcp` | `0.0.0.0/0`, `::/0` | `iklim-app-*` through Floating IP |
The following ports will not be opened publicly in prod:
- `8200/tcp` Vault
- `5432/tcp` PostgreSQL
- `27017/tcp` MongoDB
- `5672/tcp`, `15672/tcp`, `61613/tcp`, `15674/tcp` RabbitMQ
- `2377/tcp`, `7946/tcp`, `7946/udp`, `4789/udp` Docker Swarm
- `9180/tcp` APISIX Admin API
- `9090/tcp` Prometheus
- `3000/tcp` Grafana
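
The public ingress table above maps onto firewall rules along these lines. This is a sketch only: the firewall name `iklim-prod-app-fw` and the resource label `app` are hypothetical, and `admin_allowed_cidrs` is assumed to be declared as a `list(string)` variable.

```hcl
resource "hcloud_firewall" "app" {
  name = "iklim-prod-app-fw" # hypothetical name

  rule {
    description = "SSH from admin networks"
    direction   = "in"
    protocol    = "tcp"
    port        = "22"
    source_ips  = var.admin_allowed_cidrs
  }

  rule {
    description = "HTTP from anywhere"
    direction   = "in"
    protocol    = "tcp"
    port        = "80"
    source_ips  = ["0.0.0.0/0", "::/0"]
  }

  rule {
    description = "HTTPS from anywhere"
    direction   = "in"
    protocol    = "tcp"
    port        = "443"
    source_ips  = ["0.0.0.0/0", "::/0"]
  }
}
```

Because no rule opens the ports listed above, such as `8200/tcp` or `5432/tcp`, inbound traffic to them from public sources is not allowed by this firewall.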
## Private Firewall
Firewall placement follows the Swarm placement model:
- DB/cluster services on `iklim-db-*` nodes: Patroni/PostgreSQL, MongoDB, and etcd.
- App/service-node infrastructure on `iklim-app-*` nodes: Vault, RabbitMQ, APISIX, Prometheus, Grafana, SWAG, and the Redis/Sentinel services from `docker-stack-infra_db-prod.yml`.
RabbitMQ ports are therefore documented under the app firewall. Redis and Redis Sentinel do not publish host-mode ports in the current prod stack; they stay on the Docker overlay network and do not need Hetzner firewall openings.
### App (swarm) Firewall — Private Ingress
From the app subnet (`10.20.10.0/24`):
| Port | Service | Access method |
| --- | --- | --- |
| `2377/tcp` | Docker Swarm control plane | From app subnet |
| `7946/tcp,udp` | Docker Swarm node discovery | From app subnet |
| `4789/udp` | Docker Swarm VXLAN overlay | From app subnet |
| `8200/tcp` | Vault | Docker overlay / private network |
| `5672/tcp` | RabbitMQ AMQP | From app subnet |
| `61613/tcp` | RabbitMQ STOMP | From app subnet |
| `15674/tcp` | RabbitMQ Web STOMP | From app subnet |
| `15672/tcp` | RabbitMQ Management | Behind SWAG `443` — IP restricted |
| `9000/tcp` | APISIX Dashboard | Behind SWAG `443` — IP restricted |
| `9180/tcp` | APISIX Admin API | Only Dashboard accesses it from Docker overlay |
| `9090/tcp` | Prometheus | Behind SWAG `443` — IP restricted |
| `3000/tcp` | Grafana | Behind SWAG `443` — IP restricted |
From the DB subnet (`10.20.20.0/24`), because `iklim-db-*` nodes join the Swarm as workers:
| Port | Service | Source |
| --- | --- | --- |
| `2377/tcp` | Docker Swarm control plane | `10.20.20.0/24` |
| `7946/tcp,udp` | Docker Swarm node discovery | `10.20.20.0/24` |
| `4789/udp` | Docker Swarm VXLAN overlay | `10.20.20.0/24` |
### DB Firewall — Private Ingress
Admin access:
| Port | Service | Source |
| --- | --- | --- |
| `22/tcp` | SSH | `admin_allowed_cidrs` |
From the app subnet (`10.20.10.0/24`):
| Port | Service | Note |
| --- | --- | --- |
| `5432/tcp` | PostgreSQL (Patroni primary) | App subnet access |
| `27017/tcp` | MongoDB replica set endpoint | App subnet access |
| `2379/tcp` | etcd client (Patroni + APISIX) | App subnet access |
| `2377/tcp` | Docker Swarm control plane | From app subnet |
| `7946/tcp,udp` | Docker Swarm node discovery | From app subnet |
| `4789/udp` | Docker Swarm VXLAN overlay | From app subnet |
Mutual access inside the DB subnet (`10.20.20.0/24`):
| Port | Service | Note |
| --- | --- | --- |
| `5432/tcp` | PostgreSQL Patroni replication | Between DB nodes |
| `27017/tcp` | MongoDB replica set internal | Between DB nodes |
| `2379/tcp` | etcd client | Patroni -> etcd access |
| `2380/tcp` | etcd peer | etcd cluster internal |
| `8008/tcp` | Patroni REST API | Patroni leader election and health check |
IP restriction is done in the SWAG nginx configuration, not in the Hetzner firewall.
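
One way to keep the DB firewall tables maintainable is to hold the port lists as data and generate the rules with `dynamic` blocks. The sketch below assumes the rules are expressed as `hcloud_firewall` rules as this section describes. The local names, the firewall name `iklim-prod-db-fw`, and the rule descriptions are hypothetical, and the Swarm ports are left out for brevity. Whether Hetzner Cloud Firewalls are enforced on private-network traffic is not shown here and should be confirmed against Hetzner's documentation.

```hcl
locals {
  # TCP ports reachable from the app subnet (from the table above).
  db_from_app_tcp = {
    "5432"  = "PostgreSQL (Patroni primary)"
    "27017" = "MongoDB replica set endpoint"
    "2379"  = "etcd client"
  }

  # TCP ports reachable between DB nodes (from the table above).
  db_internal_tcp = {
    "5432"  = "PostgreSQL Patroni replication"
    "27017" = "MongoDB replica set internal"
    "2379"  = "etcd client"
    "2380"  = "etcd peer"
    "8008"  = "Patroni REST API"
  }
}

resource "hcloud_firewall" "db" {
  name = "iklim-prod-db-fw" # hypothetical name

  rule {
    description = "SSH from admin networks"
    direction   = "in"
    protocol    = "tcp"
    port        = "22"
    source_ips  = var.admin_allowed_cidrs
  }

  # One rule per port for traffic coming from the app subnet.
  dynamic "rule" {
    for_each = local.db_from_app_tcp
    content {
      description = "${rule.value} from app subnet"
      direction   = "in"
      protocol    = "tcp"
      port        = rule.key
      source_ips  = ["10.20.10.0/24"]
    }
  }

  # One rule per port for traffic between DB nodes.
  dynamic "rule" {
    for_each = local.db_internal_tcp
    content {
      description = "${rule.value} inside DB subnet"
      direction   = "in"
      protocol    = "tcp"
      port        = rule.key
      source_ips  = ["10.20.20.0/24"]
    }
  }
}
```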
## Outputs
The following values are printed at the end of `terraform apply` and can be read again at any time with `terraform output`:
| Output | Description |
| --- | --- |
| `ansible_inventory_yaml` | Ansible inventory YAML — written to `ansible/prod/inventory/generated/prod.yml` |
| `prod_private_ips` | Private IP map of all nodes, with `app` and `db` subkeys |
| `prod_public_ips` | Public IPv4 map of all nodes |
| `prod_floating_ip` | Floating IP address for the Swarm entry point; DNS A record points to this IP |
To extract the Ansible inventory:
```bash
terraform output -raw ansible_inventory_yaml > \
../../../ansible/prod/inventory/generated/prod.yml
```
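The outputs in the table could be defined in `outputs.tf` along these lines. This is a sketch under several assumptions: servers are declared as `hcloud_server.app` and `hcloud_server.db` with `for_each`, the floating IP as `hcloud_floating_ip.app`, and the inventory layout (groups `app` and `db`, host variables `ansible_host` and `private_ip`, public IPv4 as the connection address) is illustrative rather than the project's real inventory schema.

```hcl
output "prod_private_ips" {
  value = {
    app = local.app_private_ips
    db  = local.db_private_ips
  }
}

output "prod_public_ips" {
  value = merge(
    { for name, s in hcloud_server.app : name => s.ipv4_address },
    { for name, s in hcloud_server.db : name => s.ipv4_address }
  )
}

output "prod_floating_ip" {
  value = hcloud_floating_ip.app.ip_address
}

# yamlencode renders the object as a YAML string, so
# "terraform output -raw" can redirect it straight into a file.
output "ansible_inventory_yaml" {
  value = yamlencode({
    all = {
      children = {
        app = {
          hosts = {
            for name, s in hcloud_server.app :
            name => { ansible_host = s.ipv4_address, private_ip = local.app_private_ips[name] }
          }
        }
        db = {
          hosts = {
            for name, s in hcloud_server.db :
            name => { ansible_host = s.ipv4_address, private_ip = local.db_private_ips[name] }
          }
        }
      }
    }
  })
}
```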
## Lifecycle and Resize Policy
### `server_type` Change (Resize)
Changing `server_type` does **not** trigger Terraform destroy+create. The `hcloud` provider supports this natively: it stops the server, calls the Hetzner Resize API, and starts it again. Update the value in `terraform.tfvars` and run `terraform apply`.
There is downtime, because the server stops and starts, but disk, installed software, and Docker volumes are preserved. No `ignore_changes` or manual step is required.
### Which Changes Force Server Recreation?
| Changed field | Behavior | Note |
| --- | --- | --- |
| `server_type` | In-place resize (provider native) | `terraform apply` is enough |
| `hcloud_server_network` | Only attachment is updated | Because a separate resource is used |
| `hcloud_firewall_attachment` | Only attachment is updated | Because a separate resource is used |
| `placement_group_id` | Hetzner API does not allow changing it -> destroy+create | Do not change |
| `image` | Disk image changes -> destroy+create | Do not change |
| `location` | Cannot be moved to another datacenter -> destroy+create | Do not change |
### Network and Firewall Attachment Separation
The `network` block and `firewall_ids` are not embedded inside `hcloud_server`. Instead, separate resources are defined:
- `hcloud_server_network` — private IP assignment, for each node with `for_each`
- `hcloud_firewall_attachment` — firewall relationship, using the server list derived with `for_each`
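
The separation described above might look like this for the app nodes. The sketch assumes resources named `hcloud_server.app`, `hcloud_network_subnet.app`, and `hcloud_firewall.app`, and the `local.app_private_ips` map. Those labels are assumptions about the surrounding files.

```hcl
# Private IP assignment, one attachment per app node.
resource "hcloud_server_network" "app" {
  for_each = local.app_private_ips

  server_id = hcloud_server.app[each.key].id
  subnet_id = hcloud_network_subnet.app.id
  ip        = each.value
}

# Firewall relationship, built from the same for_each-derived server list.
resource "hcloud_firewall_attachment" "app" {
  firewall_id = hcloud_firewall.app.id
  server_ids  = [for s in hcloud_server.app : s.id]
}
```

Since neither relationship lives inside `hcloud_server`, changing an attachment shows up in the plan as a change to these resources rather than to the server.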
### `prevent_destroy` Protection
Each server gets `lifecycle { prevent_destroy = true }`. To intentionally delete a server, temporarily remove the lifecycle block first.
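
A sketch of an app server definition carrying this protection. It assumes the `local.app_private_ips` map, the variables from `terraform.tfvars`, and resources labelled `hcloud_ssh_key.admin` and `hcloud_placement_group.app`. The labels are hypothetical.

```hcl
resource "hcloud_server" "app" {
  # One server per key of the private IP map; each.key is the server name.
  for_each = local.app_private_ips

  name               = each.key
  server_type        = var.server_type_app
  image              = var.image
  location           = var.location
  ssh_keys           = [hcloud_ssh_key.admin.id]
  placement_group_id = hcloud_placement_group.app.id

  # Any plan that would destroy this server fails with an error.
  lifecycle {
    prevent_destroy = true
  }
}
```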
## How to Run
### Preparation
**1. Create tfvars once:**
```bash
cd Environment_Infrastructure/terraform/hetzner/prod
cp terraform.tfvars.example terraform.tfvars
# Fill terraform.tfvars with real values
# (hcloud_token, admin_allowed_cidrs, etc.)
```
`terraform.tfvars` is not committed; it is protected with `.gitignore`.
**2. Install the provider once:**
```bash
terraform init
```
### First Apply
```bash
# Show what will be created; do not make changes
terraform plan
# Approve and create
terraform apply
```
After `apply`, 6 servers, 2 firewalls, 1 floating IP, and network resources are visible in Hetzner.
### Get Ansible Inventory
```bash
terraform output -raw ansible_inventory_yaml > \
../../../ansible/prod/inventory/generated/prod.yml
```
### Gitea Variable: `PROD_FLOATING_IP`
The deploy pipeline needs this variable to manage DNS records automatically. It is set once after `terraform apply`:
```bash
terraform output prod_floating_ip
```
Add the resulting IP address in Gitea -> project settings -> **Variables** with the name `PROD_FLOATING_IP`. The pipeline reads it with `vars.PROD_FLOATING_IP` and updates GoDaddy A records idempotently.
### Resize (Change Server Type)
Change the `server_type_app` or `server_type_db` value inside `terraform.tfvars`:
```bash
terraform apply
```
The server is stopped, the Hetzner Resize API is called, and the server is started again. Disk and Docker volumes are preserved. There is downtime.
### Server Deletion (Forced)
Because `prevent_destroy = true` exists, normal `terraform destroy` fails. First, temporarily remove the `lifecycle` block inside `servers.tf`:
```hcl
# lifecycle {
# prevent_destroy = true
# }
```
Then:
```bash
# Single quotes keep the shell from stripping the inner double quotes
terraform destroy -target='hcloud_server.app["iklim-app-01"]'
```
After completing the operation, add the lifecycle block back.
### State Management
Local state (`terraform.tfstate`) is used for now, and the state file is not committed to the repo. Once more than one person applies changes, remote state in Hetzner Object Storage or HCP Terraform must be used.
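
For the HCP Terraform option, the switch could be as small as a `cloud` block. This is a sketch, and the organization and workspace names are placeholders that do not come from the project:

```hcl
terraform {
  cloud {
    organization = "example-org" # hypothetical

    workspaces {
      name = "iklim-hetzner-prod" # hypothetical
    }
  }
}
```

After adding such a block, `terraform init` has to be run again so Terraform can initialize the new state location.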
## Acceptance Criteria
- `terraform plan` runs successfully using only the prod Hetzner Project token.
- 6 servers are created: `iklim-app-01/02/03`, `iklim-db-01/02/03`.
- App nodes (`iklim-app-*`, the Swarm managers) are inside the `iklim-prod-app-spread` placement group.
- DB nodes are inside the `iklim-prod-db-spread` placement group.
- Public firewall allows only `22`, `80`, and `443` ingress.
- Private firewall is compatible with `01-private-network-port-matrix.md`.
- DB replication ports are accessible only from the DB subnet.
- Floating IP is created and assigned to `iklim-app-01`.
- Terraform state and secret tfvars are not committed.