
# 01 — Docker Swarm Init (Prod — Multi-Node)
## Context
- **Repo:** `iklim.co` root
- **Environment:** prod
- **Topology:**
  - 3 × service nodes — all act as **Swarm managers AND app workers** (Raft quorum: 1 can fail)
  - 3 × DB nodes — **NOT part of Docker Swarm** (separate DB cluster, out of scope)
- All 6 nodes are in the same private network.
- Pipeline trigger: push to `prod-env` branch → Gitea runner on `prod-runner` (first service node).
- Swarm has 3 nodes total; all are manager-eligible and carry workloads (no dedicated worker-only nodes).
## Node labeling plan
| Node | Role | Swarm role | Labels |
|------|------|------------|--------|
| service-1 | API services, SWAG, Vault | Manager + Worker | `type=service` |
| service-2 | API services replicas | Manager + Worker | `type=service` |
| service-3 | API services replicas | Manager + Worker | `type=service` |
> DB nodes (`db-1/2/3`) are **not part of Docker Swarm**. They run as a separate cluster
> and are provisioned independently. No Swarm join or label step applies to them.
## Step 1 — Init Swarm on service-1 (the prod-runner node)
```bash
MANAGER_IP=$(hostname -I | awk '{print $1}')
# Exact string comparison: a plain 'grep -q active' would also match "inactive".
if [ "$(docker info --format '{{.Swarm.LocalNodeState}}')" != "active" ]; then
  docker swarm init --advertise-addr "$MANAGER_IP"
  echo "✅ Swarm initialized on $MANAGER_IP"
else
  echo "Swarm already active"
fi
```
## Step 2 — Get manager join token
```bash
docker swarm join-token manager # for service-2, service-3
```
Save this token — needed on service-2 and service-3.
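For scripting, the token can also be captured non-interactively. A sketch: the `-q` flag prints only the token itself, and real Swarm join tokens always start with the `SWMTKN-` prefix.

```shell
# Capture the manager join token without the human-readable wrapper text.
# Requires an initialized swarm on this node:
#   MANAGER_TOKEN=$(docker swarm join-token -q manager)
# A quick sanity check on whatever was captured:
valid_swarm_token() {
  case "$1" in
    SWMTKN-*) return 0 ;;  # swarm join tokens always carry this prefix
    *)        return 1 ;;
  esac
}
```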
## Step 3 — Join service-2 and service-3 as managers
SSH into service-2 and service-3, then run:
```bash
docker swarm join --token <MANAGER_TOKEN> <service-1-ip>:2377
```
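If you prefer to drive the joins from service-1 over SSH instead of logging in to each node, the join command line can be composed like this (a sketch; `MANAGER_TOKEN` and `SERVICE1_IP` are assumed to be set beforehand):

```shell
# Compose the join command for a given token and manager IP.
# Port 2377 is Swarm's cluster-management port.
build_join_cmd() {
  printf 'docker swarm join --token %s %s:2377' "$1" "$2"
}
# Hypothetical usage from service-1:
#   for host in service-2 service-3; do
#     ssh "$host" "$(build_join_cmd "$MANAGER_TOKEN" "$SERVICE1_IP")"
#   done
```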
## Step 4 — Label all Swarm nodes
On service-1, after service-2 and service-3 have joined:
```bash
for node in service-1 service-2 service-3; do
docker node update --label-add type=service "$node"
done
```
> Replace `service-1`, etc. with actual node hostnames shown in `docker node ls`.
> DB nodes are not in Swarm — no join or label step for them.
## Step 5 — Verify
```bash
docker node ls
```
Expected: 3 nodes; exactly one shows `MANAGER STATUS` = `Leader`, the other two `Reachable`.
All 3 nodes remain in `AVAILABILITY=Active` (not drained) so they also carry workloads.
```bash
docker node inspect service-1 --format '{{.Spec.Labels}}'
```
Expected: `map[type:service]`.
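The manager-status check can be scripted too. A sketch that consumes `docker node ls` output:

```shell
# Exit 0 only if exactly three nodes report Leader or Reachable.
# Pipe in: docker node ls --format '{{.Hostname}} {{.ManagerStatus}}'
check_quorum() {
  awk '$2 == "Leader" || $2 == "Reachable" { ok++ } END { exit (ok == 3 ? 0 : 1) }'
}
```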
## Step 6 — Confirm `init/swarm-init.sh` multi-node awareness
The script is idempotent (skips init if already active). Verify:
```bash
grep -n "swarm init\|swarm join" init/swarm-init.sh
```
The prod pipeline runs on service-1 only. service-2/3 are joined via Ansible (`swarm` role),
not via the Gitea pipeline.
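For reference, the idempotency guard such a script typically contains looks like this (a sketch, not the actual contents of `init/swarm-init.sh`):

```shell
# Decide whether 'docker swarm init' is needed, given the local swarm state.
# The state string comes from: docker info --format '{{.Swarm.LocalNodeState}}'
swarm_needs_init() {
  [ "$1" != "active" ]
}
# Hypothetical usage:
#   if swarm_needs_init "$(docker info --format '{{.Swarm.LocalNodeState}}')"; then
#     docker swarm init --advertise-addr "$MANAGER_IP"
#   fi
```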
## Placement constraints used in `docker-stack-infra.yml`
| Constraint | Resolves to |
|------------|-------------|
| `node.role == manager` | service-1, service-2, service-3 |
| `node.labels.type == service` | service-1, service-2, service-3 |
SWAG, Vault, cert-reloader: pinned to `node.role == manager`.
Microservices: no constraint (distributed across all 3 service nodes by Swarm scheduler).
> `node.labels.type == db` constraint is **not used** — DB nodes are not in Swarm.
> PostgreSQL and MongoDB run outside Swarm as a separately managed cluster.
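As an illustration only (service names and image tags below are placeholders, not the actual contents of `docker-stack-infra.yml`), such placement constraints might look like:

```yaml
version: "3.8"
services:
  swag:
    image: lscr.io/linuxserver/swag:latest    # placeholder tag
    deploy:
      placement:
        constraints:
          - node.role == manager              # resolves to service-1/2/3
  api:
    image: registry.example.com/api:latest    # placeholder image
    deploy:
      replicas: 3
      # no placement constraint: the scheduler spreads replicas
      # across all three service nodes
```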