Environment_Infrastructure/setup/00-general-roadmap.md
Murat ÖZDEMİR 67dc2986dd docs(infra): restructure and update infrastructure setup documentation
- Anglicized setup and facts markdown file names for better consistency.

- Updated 01-swarm-init-multinode.md to highlight Ansible automation of Swarm initialization and labeling.

- Overhauled 03-infra-stack-changes.md to describe the single monolithic file strategy and reflect current Redis, RabbitMQ, and etcd cluster configurations.

- Fixed minor overrides and typos in Patroni templates and Ansible bootstrap documents.

- Restructured README and roadmap mapping to align with the renamed setup documents.
2026-06-15 16:42:18 +03:00


# 00 - General Roadmap
This file is the main context for agents that set up the test/prod infrastructure on Hetzner Cloud with Terraform and Ansible in the `Environment_Infrastructure` repo. Each phase file is written to be self-contained; this document remains the overall decision record.
## Goal
The Iklim.co infrastructure will be set up on two separate Hetzner Cloud Projects:
- `test` Hetzner Cloud Project
- `prod` Hetzner Cloud Project

This separation is mandatory: API tokens, networks, firewalls, placement groups, servers, costs, and the risk of accidental deletion are all isolated per environment.
## Terraform and Ansible Responsibility Boundary
Terraform creates only IaaS resources:
- Hetzner Cloud server
- Private network and subnet
- Firewall
- SSH key
- Placement group
- Optional volume, floating IP, load balancer, or DNS record
- Ansible inventory output

Ansible prepares the created Linux machines:
- Linux base packages
- Security hardening
- Docker Engine installation
- Docker Swarm init/join
- Gitea Actions `act_runner` systemd installation
- Shared directories and deploy prerequisites

Terraform must not install or configure Docker, Swarm, the runner, or application deployments. Ansible must not create Hetzner Cloud resources.
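
As an illustration of the Ansible side of this boundary, the Swarm init/join step could be sketched roughly as follows. This is a minimal sketch, not the playbook used in the repo: the inventory group names `swarm_managers` and `swarm_workers` and the `private_ip` host variable are assumptions, and the `community.docker` collection is assumed to be installed.

```yaml
# Sketch only: group names and the private_ip variable are hypothetical.
- name: Initialize the Swarm on the first manager
  hosts: swarm_managers[0]
  become: true
  tasks:
    - name: Init the Swarm and keep the result for the join tokens
      community.docker.docker_swarm:
        state: present
        advertise_addr: "{{ private_ip }}"
      register: swarm_init

- name: Join worker nodes to the Swarm
  hosts: swarm_workers
  become: true
  tasks:
    - name: Join the existing Swarm as a worker
      community.docker.docker_swarm:
        state: join
        advertise_addr: "{{ private_ip }}"
        # Worker token taken from the result registered on the first manager
        join_token: "{{ hostvars[groups['swarm_managers'][0]].swarm_init.swarm_facts.JoinTokens.Worker }}"
        remote_addrs:
          # 2377/tcp is the Swarm cluster management port
          - "{{ hostvars[groups['swarm_managers'][0]].private_ip }}:2377"
```

Nothing in this sketch creates a Hetzner resource; it only consumes hosts that Terraform has already provisioned and exposed through the inventory.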
## Environment Topologies
### Test
Minimum topology for the test environment:
| Node | Role | Note |
| --- | --- | --- |
| `iklim-app-01` | Swarm manager + app worker + Gitea runner | CI/CD test deploy runs through this node |
| `iklim-db-01` | DB node / Swarm worker | DB host prerequisites are prepared by Ansible; DB services are deployed as Swarm services by the environment stack/pipeline |

For the test DB node, Terraform/Ansible prepare the OS, Docker, Swarm worker membership, config directories, and WireGuard. PostgreSQL/MongoDB runtime services are not installed directly on the OS; they run as Docker Swarm services.
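
For this two-node topology, the Ansible inventory output produced by Terraform might look roughly like the sketch below. The group names, the `private_ip` variable, and all addresses (documentation-range placeholders) are assumptions, not values taken from the repo.

```yaml
# Hypothetical inventory for the test environment; addresses are placeholders.
all:
  children:
    swarm_managers:
      hosts:
        iklim-app-01:
          ansible_host: 192.0.2.10   # placeholder public address
          private_ip: 10.0.1.10      # placeholder private network address
    swarm_workers:
      hosts:
        iklim-db-01:
          ansible_host: 192.0.2.11   # placeholder public address
          private_ip: 10.0.1.11      # placeholder private network address
```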
### Prod
HA topology for the prod environment:
| Node group | Count | Role |
| --- | ---: | --- |
| `iklim-app-*` | 3 | Each one is a Swarm manager + app worker |
| `iklim-db-*` | 3 | DB cluster nodes |

Prod DB host prerequisites are prepared by Terraform/Ansible. The runtime DB services are part of the current prod Swarm stack: etcd, Patroni/PostgreSQL, and the MongoDB replica set are deployed by the prod root pipeline through `docker-stack-infra_db-prod.yml`.
## Public Port Policy
Normally, only the following ports are open to the public internet:
- `22/tcp` SSH, only from admin IP/CIDR sources
- `80/tcp` HTTP
- `443/tcp` HTTPS

Test has one explicit exception: `51820/udp` is opened on the DB node for the WireGuard VPN, which authenticates peers cryptographically. Prod currently does not expose `51820/udp` in Terraform.

Vault (`8200/tcp`) will not be opened to the public internet. Vault must be reachable only from the private network or the Docker overlay network.

Current prod stack behavior is aligned with this policy: `docker-stack-infra_db-prod.yml` publishes public traffic through SWAG on 80/443. Vault is deployed separately by `vault-bootstrap.sh` using `docker-stack-vault.yml`; it is not publicly exposed.
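
On the host side, the public port policy could be expressed as Ansible `firewalld` tasks along the following lines. This is a sketch under assumptions: the `public` zone name and the `admin_cidrs` variable (a list of IPv4 CIDR strings) are hypothetical, and the `ansible.posix` collection is assumed to be installed. The Terraform-managed Hetzner firewall would need matching rules.

```yaml
# Sketch only: zone name and admin_cidrs are hypothetical.
- name: Allow public HTTP and HTTPS
  ansible.posix.firewalld:
    zone: public
    port: "{{ item }}"
    permanent: true
    immediate: true
    state: enabled
  loop:
    - 80/tcp
    - 443/tcp

- name: Allow SSH only from admin IP/CIDR sources
  ansible.posix.firewalld:
    zone: public
    rich_rule: 'rule family="ipv4" source address="{{ item }}" port port="22" protocol="tcp" accept'
    permanent: true
    immediate: true
    state: enabled
  loop: "{{ admin_cidrs }}"

- name: Drop the generic ssh service so only the admin rules apply
  ansible.posix.firewalld:
    zone: public
    service: ssh
    permanent: true
    immediate: true
    state: disabled
```

The admin rich rules are added before the generic `ssh` service is removed, so an existing admin session is not cut off mid-run. `8200/tcp` never appears in the public zone, which matches the Vault rule above.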
## Private Network Policy
The detailed matrix of ports that must be opened inside the private network is in `01-private-network-port-matrix.md`. Agents must treat that file as the source of truth when writing Terraform Hetzner firewall rules and Ansible `firewalld` rules.
## Gitea Actions Runner Decision
`act_runner` will not run as a Docker container, and the Docker socket will not be mounted into a container.
Preferred installation:
- `act_runner` is installed as a Linux systemd service.
- A separate `gitea-runner` user is created for the runner.
- CI/CD jobs can create containers when needed; for this, the runner host needs Docker CLI/daemon access.
- Because Docker group membership grants permissions close to root level, only trusted Gitea repos/jobs should use these runner labels.
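
The installation described above could be sketched as Ansible tasks like these. The binary path, config path, home directory, and unit name are assumptions chosen for illustration; only the `gitea-runner` user name and the systemd approach come from this document. The `docker` group is assumed to exist already from the Docker Engine installation.

```yaml
# Sketch only: paths and the unit name are hypothetical.
- name: Create the dedicated runner user with Docker access
  ansible.builtin.user:
    name: gitea-runner
    system: true
    home: /var/lib/gitea-runner
    create_home: true
    groups: docker      # grants near-root access; trusted jobs only
    append: true

- name: Install the act_runner systemd unit
  ansible.builtin.copy:
    dest: /etc/systemd/system/act_runner.service
    mode: "0644"
    content: |
      [Unit]
      Description=Gitea Actions runner
      After=docker.service network-online.target
      Wants=network-online.target

      [Service]
      User=gitea-runner
      WorkingDirectory=/var/lib/gitea-runner
      ExecStart=/usr/local/bin/act_runner daemon --config /etc/act_runner/config.yaml
      Restart=always

      [Install]
      WantedBy=multi-user.target

- name: Enable and start act_runner
  ansible.builtin.systemd:
    name: act_runner
    daemon_reload: true
    enabled: true
    state: started
```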

For prod HA, `act_runner` is installed on all 3 Swarm manager nodes rather than on a single machine, so pipelines can continue when one manager/runner is lost. Each runner must carry both a shared label and a node-specific label:
- Shared: `prod-runner`
- Node specific: `iklim-app-01`, `iklim-app-02`, `iklim-app-03`

For test, a single runner is enough:
- Shared: `test-runner`
- Node specific: `iklim-app-01`
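
In the `act_runner` configuration file, the labels for the first prod node might be declared as in the sketch below. The `:host` suffix is an assumption that fits the decision to run jobs directly on the host rather than inside a runner container; the other two prod nodes would swap in their own node-specific label.

```yaml
# Sketch of the labels section in act_runner's config.yaml on iklim-app-01 (prod).
runner:
  labels:
    - "prod-runner:host"    # shared label: any prod runner may pick up the job
    - "iklim-app-01:host"   # node-specific label: pins a job to this node
```

A workflow then targets either label through `runs-on`, for example `runs-on: prod-runner` for a job that any manager may run.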
## Deploy Serialization Decision
Because of the 3-runner HA model in prod, multiple deploy jobs can run at the same time. Gitea Actions `concurrency` is used to prevent concurrent deploys; a StorageBox-based lock mechanism is not required.
```yaml
concurrency:
  group: prod-deploy
  cancel-in-progress: false
```
With `cancel-in-progress: false`, a new run in the same group is queued by Gitea until the previous one finishes; it appears as "queued" in the UI and is not shown as an error. All prod deploy workflows, including infrastructure and microservices, must use the same `group: prod-deploy` value so infra deploy and microservice deploy cannot overlap.
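
Placed in a complete workflow, the setting might look like the following sketch. The file path, trigger, job name, and deploy step are placeholders rather than the contents of the actual pipelines; only the `concurrency` block and the `prod-runner` label come from this document.

```yaml
# Hypothetical file: .gitea/workflows/deploy-prod.yml
name: deploy-prod
on:
  push:
    branches:
      - main              # placeholder trigger
concurrency:
  group: prod-deploy      # identical in every prod deploy workflow
  cancel-in-progress: false
jobs:
  deploy:
    runs-on: prod-runner  # shared prod runner label
    steps:
      - name: Deploy
        run: echo "deploy steps go here"   # placeholder for the real deploy commands
```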
## Hetzner Physical Host Separation
Hetzner Cloud does not allow direct cabinet selection. To keep servers off the same physical host, a `Placement Group` is used: a placement group of type `spread` aims to place the cloud servers in the group on different physical hosts.

Constraints:
- A spread placement group reduces the impact of a single physical host failure.
- It does not guarantee protection against a wider failure inside the same datacenter or location.
- For location-level disaster recovery, a different location/region distribution must be designed later.
- According to Hetzner documentation, there is a maximum limit of 10 servers per spread placement group.

At least two placement groups are recommended for prod:
- `iklim-prod-app-spread`: 3 Swarm manager/app nodes
- `iklim-prod-db-spread`: 3 DB nodes

Optional for test:
- `iklim-test-spread`: `iklim-app-01` and `iklim-db-01`

Sources:
- Hetzner Terraform provider: https://registry.terraform.io/providers/hetznercloud/hcloud/latest
- Hetzner Networks: https://docs.hetzner.com/cloud/networks/overview/
- Hetzner Firewalls: https://docs.hetzner.com/cloud/firewalls/overview
- Hetzner Placement Groups: https://docs.hetzner.com/cloud/placement-groups/overview
- Docker Swarm overlay ports: https://docs.docker.com/engine/network/drivers/overlay/
- Gitea act_runner: https://docs.gitea.com/usage/actions/act-runner