- Vault: Wrap resp.json() in a try/except block so that an HTML error page (e.g. 502/503) no longer raises an unhandled JSONDecodeError. This keeps the entire agent from crashing and missing heartbeats.
- Uptime Kuma DNS: Explicitly set dns_resolve_server to 1.1.1.1 in the Python API payload to prevent the Uptime Kuma backend from crashing on null properties.
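A minimal sketch of the Vault JSON guard described above, not the agent's actual code. StubResponse and safe_json are hypothetical names, and the "sealed" field is only an illustrative payload. The sketch relies on the decode error being a ValueError subclass, which holds for json.JSONDecodeError and for the exception raised by requests' Response.json().

```python
import json


class StubResponse:
    """Hypothetical stand-in for requests.Response, used only in this sketch."""

    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text

    def json(self):
        # json.JSONDecodeError is a subclass of ValueError.
        return json.loads(self.text)


def safe_json(resp):
    """Return the decoded body, or None when the body is not valid JSON."""
    try:
        return resp.json()
    except ValueError:
        # HTML error pages (e.g. from a 502/503) end up here instead of
        # propagating and taking the whole agent down.
        return None


ok = safe_json(StubResponse(200, '{"sealed": false}'))
bad = safe_json(StubResponse(502, "<html><body>Bad Gateway</body></html>"))
```

Returning None lets the caller report the check as failed and still send the next heartbeat.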
Update all hardcoded push monitor names in check files to match the new Title Case (space-separated) format in monitors.yml. The uk_tokens.yml keys are derived from monitor names, so the push() calls must match them exactly.
- config.py: Replace exists()+open() with try/except open() to avoid TOCTOU race on SSHFS mounts where stat can succeed but open can fail with FileNotFoundError.
- uptime_kuma.py: Rename msg key to push_msg in logger extra dicts. Python's LogRecord reserves the msg attribute; passing it in extra raises KeyError ("Attempt to overwrite 'msg' in LogRecord"). The except block silently swallowed that exception, so successful pushes were reported as errors.
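The logging collision can be reproduced with the standard library alone. This is a sketch: the logger name and the try_log helper are hypothetical. In CPython the exception comes from Logger.makeRecord and is a KeyError.

```python
import logging

logger = logging.getLogger("uptime_kuma_sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)             # the record is only built if the level is enabled
logger.addHandler(logging.NullHandler())  # discard output in this sketch
logger.propagate = False


def try_log(extra):
    """Return the name of the exception raised while logging, or None."""
    try:
        logger.info("push ok", extra=extra)
    except Exception as exc:
        return type(exc).__name__
    return None


collision = try_log({"msg": "up"})     # clashes with the LogRecord.msg attribute
renamed = try_log({"push_msg": "up"})  # no clash after the rename
```

Because the exception is raised inside the logging call itself, a broad except around a push-then-log sequence would catch it after the push had already succeeded.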
Deploy workflows:
- Integrate health-agent build (test) and image promotion (prod) into monitoring stack workflows
- Add storagebox download of health-agent runtime (.env.monitoring.health-agent-runtime → health-agent/.env) and setup (.env.monitoring.health-agent-setup → health-agent/.env.setup) env files
- Add "Run Uptime Kuma Setup" step: runs setup_uptime_kuma.py inside the built image only when uk_tokens.yml is missing, writes tokens to HEALTH_AGENT_CONFIG_GENERATED_DIR (/mnt/storagebox/monitoring/uk_generated)
- Add health-agent/** and health-agent/deploy/prod.env path triggers to the test and prod workflows, respectively
- Add HARBOR_CI_TOKEN login and HARBOR_PULL_TOKEN login before stack deploy in both workflows
- Source health-agent/.env before docker stack deploy to expose HEALTH_AGENT_CONFIG_GENERATED_DIR
Dockerfile:
- Copy config/ and scripts/ into image so setup_uptime_kuma.py can run inside the container
setup_uptime_kuma.py:
- Load .env and .env.setup automatically via python-dotenv (no manual export needed)
- Write uk_tokens.yml to config/generated/ (aligned with container volume mount)
Health checks:
- PATRONI_HOSTS and VAULT_HOSTS are now configurable via env vars (comma-separated host:port); no code change needed when node count changes
- REDIS_SENTINEL_HOSTS now correctly parses host:port format; default updated to redis-sentinel:26379
- Fix NameError in check_patroni_cluster() caused by leftover node variable after loop refactor
- Remove verify_ssl=False from Vault check; vault.iklim.co has a valid certificate
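A sketch of how a comma-separated host:port variable might be parsed; the agent's real parser may differ. parse_hosts is a hypothetical helper, the bare-hostname fallback to a default port is an assumption, and IPv6 literals are not handled.

```python
import os


def parse_hosts(raw, default_port):
    """Split 'host:port,host:port' into a list of (host, port) tuples."""
    hosts = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas and stray whitespace
        host, sep, port = item.rpartition(":")
        if sep:
            hosts.append((host, int(port)))
        else:
            # Assumption: a bare hostname falls back to the default port.
            hosts.append((item, default_port))
    return hosts


# Default mirrors the one named in this change: redis-sentinel:26379.
sentinels = parse_hosts(
    os.environ.get("REDIS_SENTINEL_HOSTS", "redis-sentinel:26379"), 26379
)
```

The same helper could serve PATRONI_HOSTS and VAULT_HOSTS, so changing the node count only requires editing the env var.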
Ops:
- Add ops/build-and-push-health-agent.sh for manually bypassing the CI pipeline
- Add health-agent/deploy/prod.env template for prod image promotion manifest
Project structure:
- Move .env.example and .env.setup.example to health-agent/env-example/ (root .gitignore excludes health-agent/.env*)
- Add root .gitignore: excludes uk_tokens.yml, __pycache__, .venv, and env files
- Remove health-agent/.gitignore (superseded by root .gitignore)