Environment_Infrastructure/setup/06-prod-terraform-iaac.md
Murat ÖZDEMİR 8780c7c05e docs(db): implement direct cluster access strategy for production
- Updated roadmap (03-infra-stack-changes.md) to deprecate database proxies in prod.
- Detailed direct subnet access via WireGuard for production developers.
- Provided multi-host connection parameters for Patroni and MongoDB Replica Sets in setup guide (08-prod-db-cluster-kurulum.md).
- Added environment comparison table to developer access guide.
2026-05-18 14:25:26 +03:00

06 - Prod Terraform IaC

The purpose of this phase is to create HA-focused IaaS resources inside the prod Hetzner Cloud Project with Terraform. This document can be given to the prod Terraform agent on its own.

Scope

Terraform creates the following in the prod environment:

  • Private network: iklim-prod-net
  • Subnets:
    • App/Swarm subnet: 10.20.10.0/24
    • DB subnet: 10.20.20.0/24
  • Firewall:
    • Public ingress: only 22/tcp, 80/tcp, 443/tcp
    • Private ingress: prod rules in 01-private-network-port-matrisi.md
  • SSH key
  • Placement groups:
    • iklim-prod-app-spread
    • iklim-prod-db-spread
  • Floating IP: stable IPv4 for the app entry point, assigned to iklim-app-01
  • Servers:
    • iklim-app-01
    • iklim-app-02
    • iklim-app-03
    • iklim-db-01
    • iklim-db-02
    • iklim-db-03
  • Ansible inventory output

DB cluster software will not be installed with Terraform. DB nodes will be prepared only at the machine, network, and firewall level.
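
As an illustration of the network part of this scope, network.tf might look roughly like the sketch below. The resource labels (prod, app, db), the parent range 10.20.0.0/16, and the eu-central network zone (the zone that contains fsn1) are assumptions made for this example; only the network name and the two subnet ranges come from this document.

```hcl
# network.tf (sketch, not the actual file)

resource "hcloud_network" "prod" {
  name     = "iklim-prod-net"
  ip_range = "10.20.0.0/16" # assumption: any parent range covering both subnets works
}

# App/Swarm subnet
resource "hcloud_network_subnet" "app" {
  network_id   = hcloud_network.prod.id
  type         = "cloud"
  network_zone = "eu-central" # assumption: zone of the fsn1 location
  ip_range     = "10.20.10.0/24"
}

# DB subnet
resource "hcloud_network_subnet" "db" {
  network_id   = hcloud_network.prod.id
  type         = "cloud"
  network_zone = "eu-central"
  ip_range     = "10.20.20.0/24"
}
```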

Version Requirements

| Tool | Version |
| --- | --- |
| Terraform | >= 1.6 |
| hcloud provider | ~> 1.49 |

Directory Structure

terraform/
  hetzner/
    prod/
      versions.tf
      providers.tf
      variables.tf
      locals.tf
      network.tf
      firewall.tf
      placement.tf
      servers.tf
      floating_ip.tf
      outputs.tf
      terraform.tfvars.example

terraform.tfvars, state files, and tokens will not be committed to the repo.
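
The version requirements listed earlier could be pinned in versions.tf, with the provider configured in providers.tf, along the lines of this minimal sketch. It assumes the provider is pulled from the public registry as hetznercloud/hcloud and that the token is passed through the hcloud_token variable.

```hcl
# versions.tf (sketch)
terraform {
  required_version = ">= 1.6"

  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = "~> 1.49"
    }
  }
}

# providers.tf (sketch): the token is read from tfvars and never hard-coded
provider "hcloud" {
  token = var.hcloud_token
}
```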

Variables

The environment name is a constant defined in locals.tf; it cannot be overridden through tfvars.

Minimum variables:

hcloud_token              = "secret"
location                  = "fsn1"
image                     = "rocky-10"
server_type_swarm         = "cpx42"
server_type_db            = "cpx32"
admin_ssh_public_key_path = "~/.ssh/id_ed25519.pub"
admin_allowed_cidrs       = ["X.X.X.X/32"]

The server types were chosen based on the current test environment metrics in ../hetzner-sizing-report.md and on the prod cluster topology. cpx42 is recommended for prod app nodes because of the memory pressure from the Java microservices. The more economical cpx32 is recommended for prod DB nodes because the cluster starts with 3 nodes. Once metrics confirm the real capacity needs, nodes can be added or existing nodes can be rescaled in place.
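
A possible shape for the matching declarations in variables.tf is sketched below. The variable names come from the list above; the descriptions, the sensitive flag, the defaults, and the CIDR validation are illustrative assumptions rather than the contents of the real file.

```hcl
# variables.tf (sketch)

variable "hcloud_token" {
  description = "API token of the prod Hetzner Cloud Project"
  type        = string
  sensitive   = true # keeps the token out of plan/apply output
}

variable "location" {
  type    = string
  default = "fsn1"
}

variable "image" {
  type    = string
  default = "rocky-10"
}

variable "server_type_swarm" {
  type    = string
  default = "cpx42"
}

variable "server_type_db" {
  type    = string
  default = "cpx32"
}

variable "admin_ssh_public_key_path" {
  type    = string
  default = "~/.ssh/id_ed25519.pub"
}

variable "admin_allowed_cidrs" {
  description = "CIDR blocks allowed to reach 22/tcp"
  type        = list(string)

  # Assumption: reject malformed CIDRs early instead of failing at the API
  validation {
    condition     = alltrue([for c in var.admin_allowed_cidrs : can(cidrhost(c, 0))])
    error_message = "Every entry in admin_allowed_cidrs must be a valid CIDR block."
  }
}
```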

Server Roles and Private IP Plan

| Server | Private IP | Role |
| --- | --- | --- |
| iklim-app-01 | 10.20.10.11 | Swarm manager + app worker + runner; primary, receives FIP |
| iklim-app-02 | 10.20.10.12 | Swarm manager + app worker + runner |
| iklim-app-03 | 10.20.10.13 | Swarm manager + app worker + runner |
| iklim-db-01 | 10.20.20.11 | Manual DB cluster node |
| iklim-db-02 | 10.20.20.12 | Manual DB cluster node |
| iklim-db-03 | 10.20.20.13 | Manual DB cluster node |

Private IPs are statically defined inside locals.tf as the swarm_private_ips and db_private_ips maps. The server list is derived from these maps with for_each.
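
The two maps could be written as in the sketch below. The map names and addresses are taken from this document; the name of the environment local is an assumption.

```hcl
# locals.tf (sketch)
locals {
  environment = "prod" # assumption: the actual local may be named differently

  # Server name => static private IP in the app/Swarm subnet
  swarm_private_ips = {
    "iklim-app-01" = "10.20.10.11"
    "iklim-app-02" = "10.20.10.12"
    "iklim-app-03" = "10.20.10.13"
  }

  # Server name => static private IP in the DB subnet
  db_private_ips = {
    "iklim-db-01" = "10.20.20.11"
    "iklim-db-02" = "10.20.20.12"
    "iklim-db-03" = "10.20.20.13"
  }
}
```

With this layout, a resource that sets for_each = local.swarm_private_ips sees the server name as each.key and its private IP as each.value, so adding a node is a one-line change to the map.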

| Server | Role | Server Type | CPU | RAM | SSD | Monthly |
| --- | --- | --- | --- | --- | --- | --- |
| iklim-app-01 | Swarm manager + app worker + runner | cpx42 | 8 AMD | 16 GB | 320 GB | $29.99 |
| iklim-app-02 | Swarm manager + app worker + runner | cpx42 | 8 AMD | 16 GB | 320 GB | $29.99 |
| iklim-app-03 | Swarm manager + app worker + runner | cpx42 | 8 AMD | 16 GB | 320 GB | $29.99 |
| iklim-db-01 | DB cluster node | cpx32 | 4 AMD | 8 GB | 160 GB | $16.49 |
| iklim-db-02 | DB cluster node | cpx32 | 4 AMD | 8 GB | 160 GB | $16.49 |
| iklim-db-03 | DB cluster node | cpx32 | 4 AMD | 8 GB | 160 GB | $16.49 |
| Total | 6 servers | | 36 vCPU | 72 GB | 1,440 GB | $139.44 |

Placement Group Decision

Prod uses two separate spread placement groups:

iklim-prod-app-spread: iklim-app-01/02/03
iklim-prod-db-spread:  iklim-db-01/02/03

The goal is for the three Swarm quorum nodes to run on different physical hosts from each other, and likewise for the three DB nodes.

Notes:

  • Hetzner does not provide direct cabinet selection.
  • A spread placement group targets different physical hosts.
  • Disaster recovery across different locations/regions is outside the scope of this phase.
  • Multi-location DR must be designed separately later when scale grows.
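
In placement.tf the two groups might be declared as in this minimal sketch. The resource labels app and db are assumptions; the group names are the ones used in this document.

```hcl
# placement.tf (sketch)

resource "hcloud_placement_group" "app" {
  name = "iklim-prod-app-spread"
  type = "spread" # each member is placed on a different physical host
}

resource "hcloud_placement_group" "db" {
  name = "iklim-prod-db-spread"
  type = "spread"
}
```

Each server would then reference its group through placement_group_id, for example hcloud_placement_group.app.id for the app nodes.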

Floating IP

An IPv4 floating IP named iklim-prod-app-fip is created and assigned to iklim-app-01. The DNS A record is pointed to this IP. If failover is needed, the floating IP can be moved to another app node.
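
A sketch of floating_ip.tf under these assumptions: the resource label app is hypothetical, the app servers are defined elsewhere as hcloud_server.swarm keyed by server name, and the floating IP's home location follows the location variable.

```hcl
# floating_ip.tf (sketch)

resource "hcloud_floating_ip" "app" {
  name          = "iklim-prod-app-fip"
  type          = "ipv4"
  home_location = var.location
}

# Keeping the assignment in its own resource means a failover only
# changes this block, not the floating IP itself.
resource "hcloud_floating_ip_assignment" "app" {
  floating_ip_id = hcloud_floating_ip.app.id
  server_id      = hcloud_server.swarm["iklim-app-01"].id
}
```

Moving the IP during a failover would then amount to pointing server_id at another app node and applying.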

Public Firewall

Public ingress:

| Port | Source | Target |
| --- | --- | --- |
| 22/tcp | admin_allowed_cidrs | All prod nodes |
| 80/tcp | 0.0.0.0/0, ::/0 | iklim-app-* through Floating IP |
| 443/tcp | 0.0.0.0/0, ::/0 | iklim-app-* through Floating IP |

The following ports will not be opened publicly in prod:

  • 8200/tcp Vault
  • 5432/tcp PostgreSQL
  • 27017/tcp MongoDB
  • 6379/tcp Redis
  • 5672/tcp, 15672/tcp, 61613/tcp, 15674/tcp RabbitMQ
  • 2377/tcp, 7946/tcp, 7946/udp, 4789/udp Docker Swarm
  • 9180/tcp APISIX Admin API
  • 9090/tcp Prometheus
  • 3000/tcp Grafana
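
The public ingress rules for the app nodes could be expressed as in the sketch below. The firewall name iklim-prod-app-fw and the resource label app are assumptions, and var.admin_allowed_cidrs is expected to be declared in variables.tf. Because a Hetzner firewall drops inbound traffic that no rule allows, the ports listed above stay closed simply by not appearing here.

```hcl
# firewall.tf (sketch, public ingress only)

resource "hcloud_firewall" "app" {
  name = "iklim-prod-app-fw" # hypothetical name

  # SSH only from the admin networks
  rule {
    description = "SSH from admin CIDRs"
    direction   = "in"
    protocol    = "tcp"
    port        = "22"
    source_ips  = var.admin_allowed_cidrs
  }

  # HTTP and HTTPS from anywhere (IPv4 and IPv6)
  dynamic "rule" {
    for_each = toset(["80", "443"])

    content {
      description = "Public web ingress"
      direction   = "in"
      protocol    = "tcp"
      port        = rule.value
      source_ips  = ["0.0.0.0/0", "::/0"]
    }
  }
}
```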

Private Firewall

App (swarm) Firewall — Private Ingress

Source from app subnet (10.20.10.0/24):

| Port | Service | Access method |
| --- | --- | --- |
| 2377/tcp | Docker Swarm control plane | From app subnet |
| 7946/tcp,udp | Docker Swarm node discovery | From app subnet |
| 4789/udp | Docker Swarm VXLAN overlay | From app subnet |
| 8200/tcp | Vault | Docker overlay / private network |
| 6379/tcp | Redis | From app subnet |
| 5672/tcp | RabbitMQ AMQP | From app subnet |
| 61613/tcp | RabbitMQ STOMP | From app subnet |
| 15674/tcp | RabbitMQ Web STOMP | From app subnet |
| 15672/tcp | RabbitMQ Management | Behind SWAG 443, IP restricted |
| 9000/tcp | APISIX Dashboard | Behind SWAG 443, IP restricted |
| 9180/tcp | APISIX Admin API | Only Dashboard accesses it from Docker overlay |
| 9090/tcp | Prometheus | Behind SWAG 443, IP restricted |
| 3000/tcp | Grafana | Behind SWAG 443, IP restricted |

Source from DB subnet (10.20.20.0/24), required because the iklim-db-* nodes join the Swarm as workers:

| Port | Service | Source |
| --- | --- | --- |
| 2377/tcp | Docker Swarm control plane | 10.20.20.0/24 |
| 7946/tcp,udp | Docker Swarm node discovery | 10.20.20.0/24 |
| 4789/udp | Docker Swarm VXLAN overlay | 10.20.20.0/24 |

DB Firewall — Private Ingress

Admin access:

| Port | Service | Source |
| --- | --- | --- |
| 22/tcp | SSH | admin_allowed_cidrs |

Source from app subnet (10.20.10.0/24):

| Port | Service | Note |
| --- | --- | --- |
| 5432/tcp | PostgreSQL (Patroni primary) | App subnet access |
| 27017/tcp | MongoDB replica set endpoint | App subnet access |
| 2379/tcp | etcd client (Patroni + APISIX) | App subnet access |
| 2377/tcp | Docker Swarm control plane | From app subnet |
| 7946/tcp,udp | Docker Swarm node discovery | From app subnet |
| 4789/udp | Docker Swarm VXLAN overlay | From app subnet |

Mutual access inside the DB subnet (10.20.20.0/24):

| Port | Service | Note |
| --- | --- | --- |
| 5432/tcp | PostgreSQL Patroni replication | Between DB nodes |
| 27017/tcp | MongoDB replica set internal | Between DB nodes |
| 2379/tcp | etcd client | Patroni -> etcd access |
| 2380/tcp | etcd peer | etcd cluster internal |
| 8008/tcp | Patroni REST API | Patroni leader election and health check |

The IP restriction noted for the services behind SWAG is enforced in the SWAG nginx configuration, not in the Hetzner firewall.

Outputs

After terraform apply, the following values can be read with terraform output:

| Output | Description |
| --- | --- |
| ansible_inventory_yaml | Ansible inventory YAML, written to ansible/inventory/generated/prod.yml |
| prod_private_ips | Private IP map of all nodes, with swarm and db subkeys |
| prod_public_ips | Public IPv4 map of all nodes |
| prod_floating_ip | Floating IP address for the Swarm entry point; DNS A record points to this IP |
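
One way these outputs could be defined in outputs.tf is sketched here. The output names are the ones from the table; the resource addresses hcloud_floating_ip.app and hcloud_server.db, as well as the inventory layout (groups swarm and db with ansible_host set to the public IPv4), are assumptions made for the example.

```hcl
# outputs.tf (sketch)

output "prod_floating_ip" {
  description = "Floating IP for the Swarm entry point"
  value       = hcloud_floating_ip.app.ip_address
}

output "prod_private_ips" {
  value = {
    swarm = local.swarm_private_ips
    db    = local.db_private_ips
  }
}

output "prod_public_ips" {
  value = merge(
    { for name, s in hcloud_server.swarm : name => s.ipv4_address },
    { for name, s in hcloud_server.db : name => s.ipv4_address }
  )
}

# Assumption: a two-group inventory; the real layout may carry more variables
output "ansible_inventory_yaml" {
  value = yamlencode({
    all = {
      children = {
        swarm = {
          hosts = { for name, s in hcloud_server.swarm : name => { ansible_host = s.ipv4_address } }
        }
        db = {
          hosts = { for name, s in hcloud_server.db : name => { ansible_host = s.ipv4_address } }
        }
      }
    }
  })
}
```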

To extract the Ansible inventory:

terraform output -raw ansible_inventory_yaml > \
  ../../ansible/inventory/generated/prod.yml

Lifecycle and Resize Policy

server_type Change (Resize)

Changing server_type does not trigger Terraform destroy+create. The hcloud provider supports this natively: it stops the server, calls the Hetzner Resize API, and starts it again. Update the value in terraform.tfvars and run terraform apply.

There is downtime, because the server stops and starts, but disk, installed software, and Docker volumes are preserved. No ignore_changes or manual step is required.

Which Changes Force Server Recreation?

| Changed field | Behavior | Note |
| --- | --- | --- |
| server_type | In-place resize (provider native) | terraform apply is enough |
| hcloud_server_network | Only attachment is updated | Because a separate resource is used |
| hcloud_firewall_attachment | Only attachment is updated | Because a separate resource is used |
| placement_group_id | Hetzner API does not allow changing it -> destroy+create | Do not change |
| image | Disk image changes -> destroy+create | Do not change |
| location | Cannot be moved to another datacenter -> destroy+create | Do not change |

Network and Firewall Attachment Separation

The network block and firewall_ids are not embedded inside hcloud_server. Instead, separate resources are defined:

  • hcloud_server_network — private IP assignment, for each node with for_each
  • hcloud_firewall_attachment — firewall relationship, using the server list derived with for_each
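
The separation might look like the following sketch for the app nodes. The resource labels (swarm, app) and the references to hcloud_network_subnet.app and hcloud_firewall.app are assumptions about how the other files name things.

```hcl
# servers.tf / firewall.tf (sketch)

# Static private IP per node; each.key is the server name, each.value its IP
resource "hcloud_server_network" "swarm" {
  for_each = local.swarm_private_ips

  server_id = hcloud_server.swarm[each.key].id
  subnet_id = hcloud_network_subnet.app.id
  ip        = each.value
}

# One attachment binds the app firewall to every app server
resource "hcloud_firewall_attachment" "app" {
  firewall_id = hcloud_firewall.app.id
  server_ids  = [for s in hcloud_server.swarm : s.id]
}
```

Because neither relationship is declared inside hcloud_server, changing a private IP or a firewall binding only touches these small resources and leaves the server resource itself out of the plan.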

prevent_destroy Protection

Each server gets lifecycle { prevent_destroy = true }. To intentionally delete a server, temporarily remove the lifecycle block first.
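
Put together with the for_each pattern, an app server definition could look like this sketch. hcloud_ssh_key.admin and hcloud_placement_group.app are assumed resource addresses, not confirmed names from the real configuration.

```hcl
# servers.tf (sketch)

resource "hcloud_server" "swarm" {
  for_each = local.swarm_private_ips # one server per map key

  name               = each.key
  server_type        = var.server_type_swarm
  image              = var.image
  location           = var.location
  ssh_keys           = [hcloud_ssh_key.admin.id]
  placement_group_id = hcloud_placement_group.app.id

  # Any plan that would destroy this server fails until the block is removed
  lifecycle {
    prevent_destroy = true
  }
}
```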

How to Run

Preparation

1. Create tfvars once:

cd Environment_Infrastructure/terraform/hetzner/prod
cp terraform.tfvars.example terraform.tfvars
# Fill terraform.tfvars with real values
# (hcloud_token, admin_allowed_cidrs, etc.)

terraform.tfvars is not committed; it is protected with .gitignore.

2. Install the provider once:

terraform init

First Apply

# Show what will be created; do not make changes
terraform plan

# Approve and create
terraform apply

After apply, 6 servers, 2 firewalls, 1 floating IP, and network resources are visible in Hetzner.

Get Ansible Inventory

terraform output -raw ansible_inventory_yaml > \
  ../../ansible/inventory/generated/prod.yml

Gitea Variable: PROD_FLOATING_IP

The deploy pipeline needs this variable to manage DNS records automatically. It is set once after terraform apply:

# -raw prints the bare address without surrounding quotes
terraform output -raw prod_floating_ip

Add the resulting IP address in Gitea -> project settings -> Variables with the name PROD_FLOATING_IP. The pipeline reads it with vars.PROD_FLOATING_IP and updates GoDaddy A records idempotently.

Resize (Change Server Type)

Change the server_type_swarm or server_type_db value inside terraform.tfvars, then apply:

terraform apply

The server is stopped, the Hetzner Resize API is called, and the server is started again. Disk and Docker volumes are preserved. There is downtime.

Server Deletion (Forced)

Because prevent_destroy = true exists, normal terraform destroy fails. First, temporarily remove the lifecycle block inside servers.tf:

# lifecycle {
#   prevent_destroy = true
# }

Then:

# Single quotes keep the shell from stripping the inner double quotes
terraform destroy -target='hcloud_server.swarm["iklim-app-01"]'

After completing the operation, add the lifecycle block back.

State Management

Local state (terraform.tfstate) is used for now. The state file is not committed to the repo. As soon as more than one person on the team runs Terraform against prod, remote state in Hetzner Object Storage or HCP Terraform must be used.

Acceptance Criteria

  • terraform plan runs successfully using only the prod Hetzner Project token.
  • 6 servers are created: iklim-app-01/02/03, iklim-db-01/02/03.
  • Swarm nodes are inside the iklim-prod-app-spread placement group.
  • DB nodes are inside the iklim-prod-db-spread placement group.
  • Public firewall allows only 22, 80, and 443 ingress.
  • Private firewall is compatible with 01-private-network-port-matrisi.md.
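  • DB cluster-internal ports (2380/tcp etcd peer, 8008/tcp Patroni REST API) are accessible only from the DB subnet; 5432/tcp, 27017/tcp, and 2379/tcp are accessible only from the app and DB subnets.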
  • Floating IP is created and assigned to iklim-app-01.
  • Terraform state and secret tfvars are not committed.
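
Some of these criteria could additionally be encoded in the configuration as Terraform check blocks (available since Terraform 1.5), so that plan and apply report a warning when they stop holding. The sketch below covers only the server count and assumes the servers are defined as hcloud_server.swarm and hcloud_server.db; the check name is hypothetical.

```hcl
# checks.tf (sketch, hypothetical file)

check "prod_server_inventory" {
  assert {
    condition     = length(hcloud_server.swarm) == 3 && length(hcloud_server.db) == 3
    error_message = "Expected 3 app servers and 3 DB servers in prod."
  }
}
```

A failed check does not block the apply; it only surfaces a warning, so the manual review of the list above is still needed.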