Environment_Infrastructure/setup/07-prod-ansible-bootstrap.md
Murat ÖZDEMİR 27f4f83f73 docs(prod): resolve cross-layer inconsistencies and complete prod env implementation
Ansible roles:
- act_runner/defaults: set act_runner_name to inventory_hostname (was
  hardcoded to iklim-test-app); added vault_gitea_runner_token to vault.yml
- prod/group_vars/all: restructured from flat files to all/ directory;
  added act_runner_labels override (prod-runner,ubuntu-24.04,hostname);
  added storagebox_managed_directories; added swarm_manager_ip and other
  prod-specific vars
- prod/roles/db_stack: prod-specific db_node tasks using StorageBox paths
  (/mnt/storagebox/db/...) instead of local paths
- docker/tasks: split firewalld loop into all-nodes (Swarm ports) and
  app-only (80/443) tasks
- swarm/tasks: added --advertise-addr private_ip to join commands for
  correct multi-homed node advertisement
- hardening/tasks: corrected firewalld drop zone configuration
- node_dirs/tasks: added /opt/iklimco/vault/data for Vault Raft volume
- db_stack/tasks/app_node: updated stale comment (removed pg-proxy reference)
- db_stack/templates: removed pg-proxy and mongo-proxy service blocks
- test/host_vars/iklim-app-01: added act_runner_name override to preserve
  existing test runner registration

Roadmap and setup docs:
- roadmap/03-infra-stack-changes: added replicas:0 for etcd/postgresql/
  mongodb/pg-proxy/mongo-proxy in prod overlay; updated placement table;
  fixed grafana/data mkdir (auto-created by Ansible); translated Turkish
  note to English
- roadmap/08-deploy-pipeline-update: updated stale "remains idle" note
  for standalone etcd (now disabled with replicas:0)
- roadmap/01-swarm-init-multinode: consistency fixes
- setup/06: added Outputs section and etcd firewall port documentation
- setup/07: removed prometheus/data from StorageBox acceptance criteria;
  replaced manual StorageBox mkdir section with Ansible auto-creation note;
  updated prod README section with full bootstrap instructions and vault docs;
  added act_runner_labels prod policy
- setup/08: extensive rewrite — aligned with Patroni etcd overlay DNS,
  corrected hcloud_firewall.app reference, updated all StorageBox paths
  from /prod/db/ to /db/
- setup/09: removed prometheus/data from acceptance criteria; updated
  runner label policy (removed docker/swarm-manager labels); added
  acceptance criterion for disabled services absent from docker service ls

Terraform:
- prod/firewall.tf: added missing DB subnet mutual rules (etcd, Patroni)
- prod/outputs.tf: added prod_floating_ip and prod_private_ips outputs
- prod/servers.tf: aligned placement group and naming
- prod/variables.tf: corrected variable descriptions
- prod/terraform.tfvars.example: updated defaults
- terraform/hetzner/README.md: new comprehensive README covering both
  test and prod environments with firewall tables and inventory instructions

ansible/README.md: expanded prod section with inventory groups, bootstrap
  run order, runner label policy, and vault variable documentation
2026-05-18 19:17:56 +03:00

07 - Prod Ansible Bootstrap

This phase prepares the prod machines created by Terraform: base Linux configuration, security hardening, Docker, and Swarm. This playbook does not install DB cluster software; the DB nodes nevertheless join Swarm as workers.

Ansible Installation

Ansible must be installed on the control machine, meaning your own computer. No agent is installed on target servers; SSH access is enough.

Installation by Operating System

  • Ubuntu / Debian:

    sudo apt update  
    sudo apt install -y pipx python3-venv  
    
    pipx ensurepath  
    export PATH="$HOME/.local/bin:$PATH"  
    
    pipx install --include-deps ansible
    pipx install ansible-lint
    
  • Fedora / Rocky Linux / RHEL:

    sudo dnf install -y pipx python3-virtualenv
    
    pipx ensurepath
    export PATH="$HOME/.local/bin:$PATH"
    
    pipx install --include-deps ansible
    pipx install ansible-lint
    
  • macOS (Homebrew):

    brew install ansible
    
  • With pipx, on any platform where pipx is already installed:

    pipx install --include-deps ansible
    pipx install ansible-lint
    

Additional Python Dependencies

passlib is required on the control machine for the password_hash filter:

pipx inject ansible passlib

If you installed with pip: pip install passlib

Verify the Installation

Whichever method you used to install it, use the following commands to verify that the installation succeeded:

# Check the Ansible version and configuration paths
ansible --version

# Check which location the Ansible binary is running from
which -a ansible
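
As a complement to the checks above, the following is a minimal sketch of a helper that reports which command-line tools are on PATH. The function name check_tools is hypothetical and not part of the repository; note that the Homebrew method above installs ansible only, so ansible-lint may legitimately be reported as missing on macOS.

```shell
#!/bin/sh
# Hypothetical helper: report whether each named tool is on PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call on the control machine:
#   check_tools ansible ansible-playbook ansible-galaxy ansible-lint
```

The helper only checks that the binaries can be found; ansible --version remains the way to confirm which configuration paths are in use.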

Running Ansible Commands

All commands must be run from the ansible/prod/ directory. ansible.cfg automatically defines the inventory and roles_path.

0. Install Required Collections Once During Initial Setup

ansible-galaxy collection install -r ../requirements.yml

1. Connection Test (Ping)

ansible all -m ping

2. Run the Bootstrap Playbook

ansible-playbook prod-bootstrap.yml --ask-vault-pass

Note: The --ask-vault-pass parameter prompts for the Ansible Vault password, which decrypts the secrets in group_vars/all/vault.yml, such as the StorageBox password.

3. Run Only a Specific Role (Tags)

ansible-playbook prod-bootstrap.yml --tags "hardening" --ask-vault-pass

Target Machines

Host          Role
iklim-app-01  Swarm manager + app worker
iklim-app-02  Swarm manager + app worker
iklim-app-03  Swarm manager + app worker
iklim-db-01   Manual DB cluster node
iklim-db-02   Manual DB cluster node
iklim-db-03   Manual DB cluster node

ansible/
  prod/
    ansible.cfg
    inventory/
      generated/
        prod.yml
    group_vars/
      all/
        vars.yml
        vault.yml
    prod-bootstrap.yml
  roles/
    base/
    hardening/
    docker/
    swarm/
    node_dirs/
    storagebox/
    storagebox_ssh_key/
    act_runner/
    db_stack/

Base Role

Applied to all prod nodes:

  • Package cache update
  • epel-release — installed first as a separate task; fail2ban, davfs2, htop, and btop depend on this repo
  • base packages, after epel-release is active:
    • curl
    • wget
    • git
    • jq
    • tar
    • unzip
    • bash-completion
    • gettext — required for envsubst in CI/CD deploy pipelines
    • tree
    • ca-certificates
    • fail2ban
    • chrony
    • python3
    • python3-pip
    • python3-passlib — for the password_hash filter (EPEL)
    • htop — interactive process monitoring (EPEL)
    • btop — resource monitor with graphical interface (EPEL)
  • timezone: Europe/Istanbul
  • hostname setup
  • keyboard layout: trq (Turkish Q)
  • chrony/NTP active

Security Hardening Role

Applied to all prod nodes:

  • SSH password auth is disabled.
  • Root SSH login is disabled.
  • Only SSH key auth remains.
  • PermitEmptyPasswords no
  • MaxAuthTries 3
  • fail2ban is enabled.
  • Automatic security updates are enabled with dnf-automatic.
  • The iklim system user is created and added to the wheel group; the password is read from vault.
  • firewalld default: incoming deny (drop zone), outgoing allow.
  • The SSH rule is first written as a rich rule to the drop zone, then the default zone is set to drop.
  • SSH is opened only from the admin CIDR.
  • DB ports are not opened publicly.

The Hetzner Cloud Firewall is considered the actual perimeter. firewalld is the second defense layer on the host.
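
The ordering in the list above (SSH rule first, default zone second) matters because switching the default zone to drop without an SSH rule would lock out new sessions. The sketch below prints, without running them, the firewall-cmd calls that correspond to that ordering. It is an illustration only: the hardening role's actual tasks may differ, the function name print_ssh_lockdown is hypothetical, and 203.0.113.0/24 is a documentation placeholder for the admin CIDR.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the firewalld commands in a safe order.
# Nothing is executed; the admin CIDR is passed as the first argument.
print_ssh_lockdown() {
    admin_cidr=$1
    # 1. Allow SSH from the admin CIDR in the drop zone (permanent config).
    printf '%s\n' "firewall-cmd --permanent --zone=drop --add-rich-rule='rule family=\"ipv4\" source address=\"$admin_cidr\" service name=\"ssh\" accept'"
    # 2. Load the permanent rule into the runtime configuration.
    printf '%s\n' "firewall-cmd --reload"
    # 3. Only then make drop the default zone.
    printf '%s\n' "firewall-cmd --set-default-zone=drop"
}

# Placeholder CIDR from the documentation range, not the real admin network.
print_ssh_lockdown 203.0.113.0/24
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; on a real host the same sequence would be applied by the role, not by hand.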

Docker Role

Required on all prod nodes, both app and db. Because DB nodes join the network as Swarm Workers, Docker Engine must be installed on every machine.

Packages to install:

  • docker-ce
  • docker-ce-cli
  • containerd.io
  • docker-buildx-plugin
  • docker-compose-plugin

Docker is installed from the official Docker dnf repository (https://download.docker.com/linux/rhel/docker-ce.repo).

Swarm Role

Prod Swarm will be set up with 3 managers:

  1. docker swarm init on iklim-app-01 (Advertise/data path addr: 10.20.10.11)
  2. iklim-app-02 and iklim-app-03 join as managers.
  3. iklim-db-01/02/03 join as workers.
  4. Overlay network is created: iklimco-net
  5. Node labels:
    • iklim-app-* -> type=service
    • iklim-db-* -> role=db, db-index=01/02/03, for Patroni node coordination
  6. All nodes remain AVAILABILITY=Active.

The db-index labels are added through iklim-app-01 in a separate play inside prod-bootstrap.yml, not by the swarm role.
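
The label scheme above can be expressed as a small mapping from hostname to a docker node update command. The sketch below only prints the commands; it assumes the hostnames follow the iklim-app-NN / iklim-db-NN pattern from the Target Machines table, and the function name label_cmd is hypothetical. The playbook itself may apply the labels differently.

```shell
#!/bin/sh
# Hypothetical helper: print the label command for one node, derived
# from its hostname. Nothing is executed against Docker.
label_cmd() {
    host=$1
    case "$host" in
        iklim-app-*)
            echo "docker node update --label-add type=service $host"
            ;;
        iklim-db-*)
            idx=${host##*-}    # iklim-db-02 -> 02
            echo "docker node update --label-add role=db --label-add db-index=$idx $host"
            ;;
        *)
            echo "unknown host: $host" >&2
            return 1
            ;;
    esac
}

for h in iklim-app-01 iklim-db-02; do
    label_cmd "$h"
done
```

The printed commands would have to be run on a manager node, which matches the note that the db-index labels are applied through iklim-app-01.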

Node Directory Role

On all iklim-app-* nodes:

/opt/iklimco/ssl
/opt/iklimco/init
/opt/iklimco/stacks
/opt/iklimco/vault/data

/opt/iklimco/vault/data is the host path volume of the Vault Raft node; it must be created separately on every app node. Swarm does not manage this directory as an overlay volume; if it is missing, the Vault container will not start.

On DB nodes:

/opt/iklimco/db
/opt/iklimco/backup

StorageBox DAVFS Mount Role

Applied to every node, all iklim-app-* and iklim-db-*.

Prod Sub-Account

Parameter     Variable                Value
Main account  storagebox_account      u469968
Sub-account   storagebox_user         u469968-sub5
WebDAV URL    storagebox_url          https://u469968-sub5.your-storagebox.de/
Mount point   storagebox_mount_point  /mnt/storagebox

StorageBox SSH Key Role

Applied to every node. The /root/.ssh/id_ed25519_storagebox ed25519 key pair is generated on the server. Uploading the generated public key to the StorageBox main account (SSH authorized_keys) is a separate manual step:

# For each node:
cat /root/.ssh/id_ed25519_storagebox.pub | \
  ssh -p 23 STORAGEBOX_USER@STORAGEBOX_USER.your-storagebox.de \
  "cat >> .ssh/authorized_keys"

Act Runner Role

Applied to iklim-app-* nodes. Gitea Act Runner is installed on each app node and started as a systemd service. In prod, the runner runs on 3 app nodes; the deploy pipeline can be triggered on any of these runners.

DB Stack Role

Applied to iklim-db-* nodes. On each DB node, it creates /opt/iklimco/db and /opt/iklimco/backup directories, as well as a local reference directory for MongoDB. The actual production configuration, including node-specific mongod.conf, replica set auth key, and Patroni configurations, is set up on StorageBox at /mnt/storagebox/db/mongodb-0X/config/ and /mnt/storagebox/db/postgresql-0X/config/ in the 08-prod-db-cluster-kurulum.md step. etcd data is stored on local Docker named volumes (not StorageBox).

/opt/iklimco/stacks/.env

Password variables required by the DB cluster stacks are stored in the /opt/iklimco/stacks/.env file. This file is stored on StorageBox as prod/secrets/iklim.co/.env.stacks. Before the first deploy, it is fetched on iklim-app-01 with the following command:

scp -P 23 STORAGEBOX_USER@STORAGEBOX_USER.your-storagebox.de:prod/secrets/iklim.co/.env.stacks \
  /opt/iklimco/stacks/.env
chmod 600 /opt/iklimco/stacks/.env
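
After fetching the file, it can be worth confirming that it is non-empty and that the mode really is 600. The following is a minimal sketch of such a check; the function name env_file_ok is hypothetical and the check is not part of the documented procedure.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file exists, is non-empty,
# and has mode bits exactly 600 (rw-------).
env_file_ok() {
    f=$1
    if [ ! -s "$f" ]; then
        echo "missing or empty: $f" >&2
        return 1
    fi
    # find -perm 600 matches only an exact mode of 600.
    if [ -z "$(find "$f" -prune -perm 600)" ]; then
        echo "mode is not 600: $f" >&2
        return 1
    fi
}

# Example call on iklim-app-01:
#   env_file_ok /opt/iklimco/stacks/.env && echo "env file looks fine"
```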

StorageBox Directory Structure

The storagebox Ansible role automatically creates the following directories during bootstrap through storagebox_managed_directories (group_vars/all/vars.yml). No manual step is required:

  • /mnt/storagebox/ssl -> SWAG_CERT_DIR
  • /mnt/storagebox/swag/config -> SWAG_CONFIG_DIR
  • /mnt/storagebox/swag/site-confs -> SWAG_SITE_CONFS_DIR
  • /mnt/storagebox/grafana/data -> GRAFANA_DATA_DIR
  • /mnt/storagebox/precipitation/images

Because StorageBox is mounted at /mnt/storagebox on all app nodes, the directories are created only once and are shared by all nodes. Prometheus uses a local Docker named volume, not StorageBox.
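
To confirm after bootstrap that the managed directories exist, a check along the following lines can be run on any node where StorageBox is mounted. This is a sketch, not part of the role: the function name missing_storagebox_dirs is hypothetical, and the directory list is copied from this section by hand rather than read from storagebox_managed_directories.

```shell
#!/bin/sh
# Hypothetical helper: print every expected directory that is missing
# under the given mount root (default /mnt/storagebox).
# Returns non-zero if anything is missing.
missing_storagebox_dirs() {
    root=${1:-/mnt/storagebox}
    rc=0
    for d in ssl swag/config swag/site-confs grafana/data precipitation/images; do
        if [ ! -d "$root/$d" ]; then
            echo "$root/$d"
            rc=1
        fi
    done
    return "$rc"
}

# Example call:
#   missing_storagebox_dirs /mnt/storagebox || echo "some directories are missing"
```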

Swarm Setup Verification

After bootstrap, check the Swarm status with the following commands:

# 6 nodes: 3 managers (Leader/Reachable), 3 workers (Ready)
docker node ls

# App node label
docker node inspect iklim-app-01 --format '{{.Spec.Labels}}'
# Expected: map[type:service]

# DB node label
docker node inspect iklim-db-01 --format '{{.Spec.Labels}}'
# Expected: map[db-index:01 role:db]

# swarm-init.sh idempotency — do not attempt init again in an already active Swarm
grep -n "swarm init\|swarm join" init/swarm-init.sh
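
The node count can also be checked without reading the table by eye. The sketch below summarizes the output of docker node ls when it is formatted as hostname|manager-status; the function name summarize_nodes is hypothetical. It relies on the ManagerStatus column being empty for worker nodes.

```shell
#!/bin/sh
# Hypothetical helper: read "hostname|manager-status" lines on stdin and
# count healthy managers, workers, and anything else (e.g. Unreachable).
summarize_nodes() {
    awk -F'|' '
        NF == 0                             { next }              # skip blank lines
        $2 == "Leader" || $2 == "Reachable" { managers++; next }
        $2 == ""                            { workers++; next }   # workers have no manager status
                                            { other++ }
        END { printf "managers=%d workers=%d other=%d\n", managers, workers, other }
    '
}

# Example call on a manager node:
#   docker node ls --format '{{.Hostname}}|{{.ManagerStatus}}' | summarize_nodes
```

For the prod layout described in this document, a healthy cluster should report managers=3 workers=3 other=0.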

Acceptance Criteria

  • ansible all -m ping succeeds.
  • 3 Swarm manager nodes appear as Leader/Reachable in docker node ls.
  • 3 DB nodes appear as Workers in docker node ls.
  • Manager quorum holds: with 3 managers, the loss of 1 manager is tolerated.
  • The iklimco-net overlay network exists.
  • Node labels (type=service, role=db, db-index=01/02/03) are verified with inspect.
  • swarm-init.sh does not attempt init again in an active Swarm; it is idempotent.
  • /mnt/storagebox is mounted on every node.
  • The /opt/iklimco/vault/data directory exists on every app node.
  • The ssl, swag/config, swag/site-confs, grafana/data, and precipitation/images directories exist on StorageBox.
  • The Gitea Act Runner service is running on every app node.
  • /opt/iklimco/db and /opt/iklimco/backup directories exist on DB nodes. Node-specific mongod.conf and other DB configurations are created on StorageBox (/mnt/storagebox/db/...) in the 08-prod-db-cluster-kurulum.md step.
  • Public firewall allows only 22, 80, and 443 ingress.
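
The quorum criterion above follows from majority arithmetic: a Swarm with N managers needs floor(N/2)+1 of them available and therefore tolerates floor((N-1)/2) losses. The sketch below works this out for a few manager counts; the function names are hypothetical.

```shell
#!/bin/sh
# Majority needed among N managers.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Number of managers that can be lost while keeping a majority.
fault_tolerance() {
    echo $(( ($1 - 1) / 2 ))
}

for n in 1 3 5; do
    echo "managers=$n quorum=$(quorum_size "$n") tolerated_losses=$(fault_tolerance "$n")"
done
```

For the 3-manager prod setup this gives a quorum of 2 and 1 tolerated loss, which is the figure stated in the criteria.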