Environment_Infrastructure/setup/00-general-roadmap.md
Murat ÖZDEMİR 67dc2986dd docs(infra): restructure and update infrastructure setup documentation
- Anglicized setup and facts markdown file names for better consistency.

- Updated 01-swarm-init-multinode.md to highlight Ansible automation of Swarm initialization and labeling.

- Overhauled 03-infra-stack-changes.md to describe the single monolithic file strategy and reflect current Redis, RabbitMQ, and etcd cluster configurations.

- Fixed minor overrides and typos in Patroni templates and Ansible bootstrap documents.

- Restructured README and roadmap mapping to align with the renamed setup documents.
2026-06-15 16:42:18 +03:00


00 - General Roadmap

This file is the main context for agents that will set up the test/prod infrastructure on Hetzner Cloud with Terraform and Ansible in the Environment_Infrastructure repo. Each phase file is written to be self-contained; this document remains the overall decision record.

Goal

The Iklim.co infrastructure will be set up on two separate Hetzner Cloud Projects:

  • test Hetzner Cloud Project
  • prod Hetzner Cloud Project

This separation is mandatory. It isolates API tokens, networks, firewalls, placement groups, servers, costs, and accidental deletion risks per environment.

Terraform and Ansible Responsibility Boundary

Terraform creates only IaaS resources:

  • Hetzner Cloud server
  • Private network and subnet
  • Firewall
  • SSH key
  • Placement group
  • Optional volume, floating IP, load balancer, or DNS record
  • Ansible inventory output

Ansible prepares the created Linux machines:

  • Linux base packages
  • Security hardening
  • Docker Engine installation
  • Docker Swarm init/join
  • Gitea Actions act_runner systemd installation
  • Shared directories and deploy prerequisites

Docker, Swarm, runner, or application deployment will not be done inside Terraform. Hetzner Cloud resources will not be created inside Ansible.
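The "Ansible inventory output" item is the hand-off point between the two tools. A minimal sketch of what such a generated inventory could look like for the test environment follows. The node names come from the test topology in this document; the group names (swarm_managers, db_nodes), the addresses (taken from the documentation range 192.0.2.0/24), and the ansible_user value are assumptions for illustration, not values taken from the repo.

```yaml
# Hypothetical inventory rendered by Terraform for the test environment.
# Group names, addresses, and ansible_user are illustrative assumptions.
all:
  vars:
    ansible_user: root
  children:
    swarm_managers:
      hosts:
        iklim-app-01:
          ansible_host: 192.0.2.10
    db_nodes:
      hosts:
        iklim-db-01:
          ansible_host: 192.0.2.20
```

Keeping the inventory as the only shared artifact means Ansible never needs a Hetzner API token, and Terraform never needs SSH access to the machines.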

Environment Topologies

Test

Minimum topology for the test environment:

| Node | Role | Note |
| --- | --- | --- |
| iklim-app-01 | Swarm manager + app worker + Gitea runner | CI/CD test deploy runs through this node |
| iklim-db-01 | DB node / Swarm worker | DB host prerequisites are prepared by Ansible; DB services are deployed as Swarm services by the environment stack/pipeline |

For the test DB node, Terraform/Ansible covers only the OS, Docker, Swarm worker join, config directories, and WireGuard preparation. PostgreSQL/MongoDB runtime services are not installed directly on the OS; they run as Docker Swarm services.

Prod

HA topology for the prod environment:

| Node group | Count | Role |
| --- | --- | --- |
| iklim-app-* | 3 | Each one is a Swarm manager + app worker |
| iklim-db-* | 3 | DB cluster nodes |

Prod DB host prerequisites are prepared by Terraform/Ansible. Runtime DB services are part of the current prod Swarm stack: etcd, Patroni/PostgreSQL, and the MongoDB replica set are deployed by the prod root pipeline through docker-stack-infra_db-prod.yml.
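Running DB services as Swarm services on dedicated DB nodes usually relies on placement constraints. The fragment below is a minimal sketch of that idea, not an excerpt from docker-stack-infra_db-prod.yml: the service name, the image tag, and the node label role=db are assumptions.

```yaml
# Hypothetical fragment of a Swarm stack file.
# Service name, image tag, and the node label "role=db" are assumptions.
services:
  mongodb:
    image: mongo:7
    deploy:
      replicas: 1
      placement:
        constraints:
          # Schedule this service only on nodes labeled as DB nodes.
          - node.labels.role == db
```

A label like this would be set on a node with `docker node update --label-add role=db iklim-db-01`. For a replica set with one member per node, pinning each service with a `node.hostname == ...` constraint is a common alternative.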

Public Port Policy

Ports open to the public internet are normally only:

  • 22/tcp SSH, only from admin IP/CIDR sources
  • 80/tcp HTTP
  • 443/tcp HTTPS

Test has one explicit exception: 51820/udp is opened on the DB node for the WireGuard VPN, which authenticates peers cryptographically. Prod currently does not expose 51820/udp in Terraform.

Vault (8200/tcp) will not be opened to the public internet. Vault must be reachable only from the private network or Docker overlay.

Current prod stack behavior is aligned with this policy: docker-stack-infra_db-prod.yml publishes public traffic through SWAG on 80/443. Vault is deployed separately by vault-bootstrap.sh using docker-stack-vault.yml; it is not publicly exposed.
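On the host side, the public port policy could be expressed as Ansible tasks along the following lines, assuming firewalld is the host firewall. This is a sketch under stated assumptions: the zone name and the admin CIDR (a documentation range) are placeholders, and the Hetzner firewall created by Terraform would need matching rules.

```yaml
# Hypothetical Ansible tasks; zone name and admin CIDR are assumptions.
- name: Allow HTTP and HTTPS from anywhere
  ansible.posix.firewalld:
    zone: public
    service: "{{ item }}"
    permanent: true
    immediate: true
    state: enabled
  loop:
    - http
    - https

- name: Allow SSH only from the admin CIDR
  ansible.posix.firewalld:
    zone: public
    rich_rule: 'rule family="ipv4" source address="{{ admin_cidr }}" port port="22" protocol="tcp" accept'
    permanent: true
    immediate: true
    state: enabled
  vars:
    admin_cidr: 203.0.113.0/24

# Runs last so the admin rule above is already active when the
# unrestricted ssh service is removed from the zone.
- name: Remove the unrestricted ssh service from the zone
  ansible.posix.firewalld:
    zone: public
    service: ssh
    permanent: true
    immediate: true
    state: disabled
```

No rule for 8200/tcp appears here, which is the point: Vault stays unreachable from the public zone by omission.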

Private Network Policy

The detailed matrix of ports that must be opened inside the private network is in 01-private-network-port-matrix.md. Agents must treat that file as the source when writing Terraform Hetzner firewall rules and Ansible firewalld rules.

Gitea Actions Runner Decision

act_runner will not run as a Docker container, and the Docker socket will not be mounted into a container.

Preferred installation:

  • act_runner is installed as a Linux systemd service.
  • A separate gitea-runner user is created for the runner.
  • CI/CD jobs can create containers when needed; for this, the runner host needs Docker CLI/daemon access.
  • Because Docker group membership grants permissions close to root level, only trusted Gitea repos/jobs should use these runner labels.
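The preferred installation above can be sketched as two Ansible tasks. The gitea-runner user name comes from this document; the unit name act_runner and the existence of a matching unit file already placed on the host are assumptions.

```yaml
# Hypothetical Ansible tasks; the unit name "act_runner" and an already
# installed unit file for it are assumptions.
- name: Create a dedicated user for the runner
  ansible.builtin.user:
    name: gitea-runner
    system: true
    # Docker group membership is close to root-level access;
    # see the trust note in the list above.
    groups: docker
    append: true

- name: Enable and start the runner service
  ansible.builtin.systemd:
    name: act_runner
    enabled: true
    state: started
    daemon_reload: true
```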

For prod HA, act_runner will be installed on all 3 Swarm manager nodes rather than on a single machine. This allows pipelines to continue when one manager/runner is lost. Runner labels must be both shared and node-specific:

  • Shared: prod-runner
  • Node specific: iklim-app-01, iklim-app-02, iklim-app-03

For test, a single runner is enough:

  • Shared: test-runner
  • Node specific: iklim-app-01
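In act_runner's config file, labels are listed under runner.labels. The excerpt below sketches the prod labels for one node. The `:host` suffix (jobs run directly on the host rather than in a job container) is an assumption about the intended execution mode, and the capacity value is illustrative.

```yaml
# Hypothetical excerpt of an act_runner config.yaml on prod node iklim-app-01.
# The ":host" suffix and the capacity value are assumptions.
runner:
  capacity: 1
  labels:
    - "prod-runner:host"   # shared label, matched by normal deploy jobs
    - "iklim-app-01:host"  # node-specific label, for jobs pinned to this node
```

A job with `runs-on: prod-runner` can then be picked up by any of the three prod runners, while `runs-on: iklim-app-01` targets one node.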

Deploy Serialization Decision

Because of the 3-runner HA model in prod, multiple deploy jobs can run at the same time. Gitea Actions concurrency is used to prevent concurrent deploys; a StorageBox-based lock mechanism is not required.

concurrency:
  group: prod-deploy
  cancel-in-progress: false

With cancel-in-progress: false, a new run in the same group is queued by Gitea until the previous one finishes; it appears as "queued" in the UI and is not shown as an error. All prod deploy workflows, including infrastructure and microservices, must use the same group: prod-deploy value so infra deploy and microservice deploy cannot overlap.
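Put together, a prod deploy workflow could look like the sketch below. Only the concurrency block, the prod-runner label, and the stack file name come from this document; the workflow file name, the trigger branch, the stack name iklim, and the location of the stack file in the repo root are assumptions.

```yaml
# Hypothetical .gitea/workflows/deploy-prod.yml
# Trigger branch, stack name, and stack file location are assumptions.
name: deploy-prod
on:
  push:
    branches:
      - main
concurrency:
  group: prod-deploy          # same value in every prod deploy workflow
  cancel-in-progress: false   # queue new runs instead of cancelling
jobs:
  deploy:
    runs-on: prod-runner      # any of the three prod runners
    steps:
      - uses: actions/checkout@v4
      - name: Deploy the stack
        run: docker stack deploy --compose-file docker-stack-infra_db-prod.yml iklim
```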

Hetzner Physical Host Separation

Hetzner Cloud does not allow direct cabinet selection. A placement group is used to keep servers off the same physical host: a placement group of type spread aims to place the cloud servers in the group on different physical hosts.

Constraints:

  • A spread placement group reduces the impact of a single physical host failure.
  • It does not guarantee protection against a wider failure inside the same datacenter or location.
  • For location-level disaster recovery, a different location/region distribution must be designed later.
  • According to Hetzner documentation, a spread placement group can hold at most 10 servers.

At least two placement groups are recommended for prod:

  • iklim-prod-app-spread: 3 Swarm manager/app nodes
  • iklim-prod-db-spread: 3 DB nodes

Optional for test:

  • iklim-test-spread: iklim-app-01 and iklim-db-01

Sources: