11 Commits

Author SHA1 Message Date
8a056a381b fix(monitoring): prevent Vault crash and DNS null error
- Vault: Wrap resp.json() in a try-except block to prevent JSONDecodeError when hitting an HTML error page (e.g. 502/503). This prevents the entire agent from crashing and missing heartbeats.
- Uptime Kuma DNS: Explicitly set dns_resolve_server to 1.1.1.1 in Python API payload to prevent Uptime Kuma backend from crashing on null properties.
2026-06-26 23:23:02 +03:00
b49ca276f0 fix(monitoring): support existing monitor updates and vault nodes
- setup_uptime_kuma: Use api.edit_monitor to update existing monitors with new configuration instead of skipping them.
- setup_uptime_kuma: Add port and accepted_statuscodes to DNS monitors to prevent NodeJS null reading errors in Kuma.
- http.py: Parse VAULT_HOSTS environment variable for Vault cluster nodes instead of hardcoding 'vault'.
2026-06-26 23:07:37 +03:00
b73ae4e5fb revert(health-agent): revert ping monitors back to PING type 2026-06-26 21:55:42 +03:00
94e6b57c52 fix(health-agent): check all 3 patroni node configs on storagebox; switch ping monitors to TCP port 22 (ICMP blocked from Docker) 2026-06-26 21:54:49 +03:00
2827b227d5 fix(health-agent): fix notification param name and type — notificationIDList expects a list of IDs not a dict 2026-06-26 21:23:45 +03:00
e4acd0e57b fix(health-agent): skip uk_tokens.yml write when tokens dict is empty to prevent setup skip loop 2026-06-26 21:10:10 +03:00
8b10653ff4 fix(health-agent): fix ping maxretries param and status page group lookup
Fix ping monitor creation error ('max_retries' is not a valid uptime-kuma-api param; correct name is 'maxretries'). Fix status pages never linking groups: re-fetching get_monitors() after add_monitor() races with WebSocket delivery so newly created groups are missing; use group_map populated in Section 1 directly instead.
2026-06-26 21:07:11 +03:00
bc8b3d0934 refactor: convert all monitor names to Title Case and update health-agent digest 2026-06-26 20:47:31 +03:00
0ef4f0b6f8 refactor: rename iklimco-monitoring stack to monitoring 2026-06-26 19:24:01 +03:00
58d5c24f41 feat(health-agent): add CI/CD pipeline, Uptime Kuma setup, and runtime configuration
Some checks failed
Deploy Environment Monitoring to Production Environment / deploy (push) Failing after 10s
Deploy workflows:
- Integrate health-agent build (test) and image promotion (prod) into monitoring stack workflows
- Add storagebox download of health-agent runtime (.env.monitoring.health-agent-runtime → health-agent/.env) and setup (.env.monitoring.health-agent-setup → health-agent/.env.setup) env files
- Add "Run Uptime Kuma Setup" step: runs setup_uptime_kuma.py inside the built image only when uk_tokens.yml is missing, writes tokens to HEALTH_AGENT_CONFIG_GENERATED_DIR (/mnt/storagebox/monitoring/uk_generated)
- Add health-agent/** and health-agent/deploy/prod.env path triggers to test and prod workflows respectively
- Add HARBOR_CI_TOKEN login and HARBOR_PULL_TOKEN login before stack deploy in both workflows
- Source health-agent/.env before docker stack deploy to expose HEALTH_AGENT_CONFIG_GENERATED_DIR

Dockerfile:
- Copy config/ and scripts/ into image so setup_uptime_kuma.py can run inside the container

setup_uptime_kuma.py:
- Load .env and .env.setup automatically via python-dotenv (no manual export needed)
- Write uk_tokens.yml to config/generated/ (aligned with container volume mount)

Health checks:
- PATRONI_HOSTS and VAULT_HOSTS are now configurable via env vars (comma-separated host:port); no code change needed when node count changes
- REDIS_SENTINEL_HOSTS now correctly parses host:port format; default updated to redis-sentinel:26379
- Fix NameError in check_patroni_cluster() caused by leftover node variable after loop refactor
- Remove verify_ssl=False from Vault check; vault.iklim.co has a valid certificate

Ops:
- Add ops/build-and-push-health-agent.sh for manual bypass of CI pipeline
- Add health-agent/deploy/prod.env template for prod image promotion manifest

Project structure:
- Move .env.example and .env.setup.example to health-agent/env-example/ (root .gitignore excludes health-agent/.env*)
- Add root .gitignore: excludes uk_tokens.yml, __pycache__, .venv, and env files
- Remove health-agent/.gitignore (superseded by root .gitignore)
2026-06-26 18:45:17 +03:00
f742bfdd11 feat(health-agent): add monitors.yml with env-aware node IP mapping from Ansible inventory 2026-06-25 18:59:14 +03:00