45 Commits

Author SHA1 Message Date
30fe75d383 health-agent redeploy with new image
All checks were successful
Deploy Environment Monitoring to Production Environment / deploy (push) Successful in 20s
2026-06-26 23:56:44 +03:00
c290882492 fix(monitoring): add missing conditions array to DNS monitors
Uptime Kuma 1.23+ evaluates monitor.conditions.length internally.
While HTTP monitors seem to bypass this check safely when conditions is null,
DNS monitors crash the Node.js backend with 'Cannot read properties of null (reading length)'
unless conditions is explicitly initialized as an empty array.
2026-06-26 23:54:20 +03:00
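
Read together with the earlier DNS fixes in this log (a7ecfc4b2d, 8a056a381b, b49ca276f0), the working DNS monitor payload appears to set every field the backend dereferences. A minimal sketch, assuming plain dictionary keys: the helper name `dns_monitor_payload`, the port value 53, and the accepted status range are illustrative assumptions, not values taken from the repository.

```python
def dns_monitor_payload(name: str, hostname: str) -> dict:
    """Build a DNS monitor payload with no null fields (hypothetical helper)."""
    return {
        "type": "dns",
        "name": name,
        "hostname": hostname,
        "url": "https://",                    # a7ecfc4b2d: match the UI default
        "dns_resolve_server": "1.1.1.1",      # 8a056a381b: explicit resolver
        "port": 53,                           # b49ca276f0 sets a port; 53 is an assumption
        "accepted_statuscodes": ["200-299"],  # b49ca276f0 sets this; value is an assumption
        "conditions": [],                     # c290882492: empty list instead of null
    }


payload = dns_monitor_payload("Example DNS", "example.com")

# Any field left as None is a candidate for the "reading length" crash.
missing = [key for key, value in payload.items() if value is None]
print(missing)
```

Checking for `None` values before sending the payload is one cheap way to catch the next field the backend expects to be non-null.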
6f3bf6cef1 health-agent redeploy with new image
All checks were successful
Deploy Environment Monitoring to Production Environment / deploy (push) Successful in 18s
2026-06-26 23:48:58 +03:00
a7ecfc4b2d fix(monitoring): add missing url property to DNS monitors
The Node.js backend of Uptime Kuma 2.4.0 seems to crash on DNS monitors with 'Cannot read properties of null (reading length)' if the 'url' field is not explicitly set, because the API defaults it to null instead of 'https://' like the UI does.
2026-06-26 23:46:08 +03:00
c1cda0b38a health-agent redeploy with new image
All checks were successful
Deploy Environment Monitoring to Production Environment / deploy (push) Successful in 20s
2026-06-26 23:31:39 +03:00
8a056a381b fix(monitoring): prevent Vault crash and DNS null error
- Vault: Wrap resp.json() in a try-except block to prevent JSONDecodeError when hitting an HTML error page (e.g. 502/503). This prevents the entire agent from crashing and missing heartbeats.
- Uptime Kuma DNS: Explicitly set dns_resolve_server to 1.1.1.1 in Python API payload to prevent Uptime Kuma backend from crashing on null properties.
2026-06-26 23:23:02 +03:00
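
The Vault fix in 8a056a381b can be sketched as follows. This is an illustration under assumptions, not the code in http.py: `parse_health_body` and `FakeResponse` are hypothetical names, and the `sealed` field is sample data. It relies only on the fact that `json.JSONDecodeError` is a subclass of `ValueError`.

```python
import json


def parse_health_body(resp) -> dict:
    """Decode a JSON body, returning an error marker instead of raising.

    `resp` is any object exposing `.status_code` and `.json()`.
    """
    try:
        return {"ok": True, "status": resp.status_code, "body": resp.json()}
    except ValueError:
        # An HTML error page (e.g. from a proxy returning 502/503) lands here.
        return {"ok": False, "status": resp.status_code, "body": None}


class FakeResponse:
    """Stand-in for an HTTP response, used only for this illustration."""

    def __init__(self, status_code: int, text: str):
        self.status_code = status_code
        self.text = text

    def json(self):
        return json.loads(self.text)


html_502 = FakeResponse(502, "<html><body>Bad Gateway</body></html>")
healthy = FakeResponse(200, '{"sealed": false}')

print(parse_health_body(html_502)["ok"])
print(parse_health_body(healthy)["body"])
```

Returning a marker rather than raising lets the caller report the node as down and still send the remaining heartbeats.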
475eb762b9 health-agent redeploy with new image
All checks were successful
Deploy Environment Monitoring to Production Environment / deploy (push) Successful in 17s
2026-06-26 23:13:38 +03:00
b49ca276f0 fix(monitoring): support existing monitor updates and vault nodes
- setup_uptime_kuma: Use api.edit_monitor to update existing monitors with new configuration instead of skipping them.
- setup_uptime_kuma: Add port and accepted_statuscodes to DNS monitors to prevent NodeJS null reading errors in Kuma.
- http.py: Parse VAULT_HOSTS environment variable for Vault cluster nodes instead of hardcoding 'vault'.
2026-06-26 23:07:37 +03:00
2a482ce4df health-agent redeploy with new image
All checks were successful
Deploy Environment Monitoring to Production Environment / deploy (push) Successful in 16s
2026-06-26 22:53:35 +03:00
969c4a2301 fix(monitoring): resolve health-agent bugs and flapping monitors
- Vault flapping: Fix resp evaluation on HTTP 429
- Storagebox block: Move mount check to a daemon thread
- Push monitors: Increase interval to 75s and restore 60s sleep
- Redis Sentinel: Fix authentication in sentinel_kwargs
- Ext Https Api: Update URL to /health
2026-06-26 22:51:15 +03:00
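
One way to keep a stalled Storagebox mount from blocking the main loop, as the "Storagebox block" item in 969c4a2301 describes, is to run the probe in a daemon thread and give up after a timeout. The sketch below is an assumption about the approach, not the agent's actual code; `check_with_timeout` and `is_mounted` are hypothetical names.

```python
import os
import threading
import time


def check_with_timeout(probe, timeout: float):
    """Run probe() in a daemon thread; return its result, or None on timeout.

    A probe that raises also yields None, because nothing is appended.
    """
    result = []

    def worker():
        result.append(probe())

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)  # wait at most `timeout` seconds, then move on
    return result[0] if result else None


def is_mounted(path: str, timeout: float = 5.0):
    """True/False from os.path.ismount, or None if the check did not finish."""
    return check_with_timeout(lambda: os.path.ismount(path), timeout)


# A probe that hangs, standing in for a stalled network mount.
stalled = check_with_timeout(lambda: time.sleep(2) or True, timeout=0.1)
print(stalled)
```

Because the thread is a daemon, a probe that never returns does not prevent the process from exiting. The three-way result (True, False, None) lets the caller distinguish "not mounted" from "could not tell".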
b73ae4e5fb revert(health-agent): revert ping monitors back to PING type 2026-06-26 21:55:42 +03:00
94e6b57c52 fix(health-agent): check all 3 patroni node configs on storagebox; switch ping monitors to TCP port 22 (ICMP blocked from Docker) 2026-06-26 21:54:49 +03:00
fa7ed41063 fix(health-agent): reload uk_tokens.yml on every push call instead of caching at startup 2026-06-26 21:35:44 +03:00
0551b01c64 health-agent redeploy with new image
All checks were successful
Deploy Environment Monitoring to Production Environment / deploy (push) Successful in 16s
2026-06-26 21:27:14 +03:00
2827b227d5 fix(health-agent): fix notification param name and type — notificationIDList expects a list of IDs not a dict 2026-06-26 21:23:45 +03:00
a5fc058978 health-agent redeploy with new image
Some checks failed
Deploy Environment Monitoring to Production Environment / deploy (push) Failing after 15s
2026-06-26 21:15:06 +03:00
e4acd0e57b fix(health-agent): skip uk_tokens.yml write when tokens dict is empty to prevent setup skip loop 2026-06-26 21:10:10 +03:00
8b10653ff4 fix(health-agent): fix ping maxretries param and status page group lookup
- Fix ping monitor creation error: 'max_retries' is not a valid uptime-kuma-api param; the correct name is 'maxretries'.
- Fix status pages never linking groups: re-fetching get_monitors() after add_monitor() races with WebSocket delivery, so newly created groups are missing. Use the group_map populated in Section 1 directly instead.
2026-06-26 21:07:11 +03:00
95dd439a34 health-agent redeploy with new image
All checks were successful
Deploy Environment Monitoring to Production Environment / deploy (push) Successful in 36s
2026-06-26 20:53:59 +03:00
3c2e872bf4 refactor(health-agent): rename monitor keys to Title Case With Space
Update all hardcoded push monitor names in check files to match the new Title Case With Space format in monitors.yml. The uk_tokens.yml keys are derived from monitor names so the push() calls must match exactly.
2026-06-26 20:52:35 +03:00
bc8b3d0934 refactor: convert all monitor names to Title Case and update health-agent digest 2026-06-26 20:47:31 +03:00
d51c073556 fix(health-agent): fix uk_tokens.yml load race and LogRecord msg conflict
- config.py: Replace exists()+open() with try/except open() to avoid TOCTOU race on SSHFS mounts where stat can succeed but open can fail with FileNotFoundError.
- uptime_kuma.py: Rename msg key to push_msg in logger extra dicts. Python's LogRecord reserves the msg attribute; passing it in extra raises KeyError ("Attempt to overwrite 'msg' in LogRecord"), which was being silently swallowed by the except block, so successful pushes were reported as errors.
2026-06-26 20:37:42 +03:00
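
Both fixes in d51c073556 can be reproduced in a few lines. The sketch below uses hypothetical helper names (`read_tokens_text`, `log_push`) and a demo logger. In CPython the exception raised for a reserved key in `extra` is `KeyError`.

```python
import logging

logger = logging.getLogger("health-agent-demo")
logger.addHandler(logging.NullHandler())
logger.propagate = False


def read_tokens_text(path: str) -> str:
    """Open directly instead of exists()+open(); a missing file yields ''."""
    try:
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    except FileNotFoundError:
        return ""


def log_push(extra: dict) -> str:
    """Log a push result; report whether the logging call itself failed."""
    try:
        logger.warning("push sent", extra=extra)
        return "logged"
    except KeyError as exc:
        # Raised by Logger.makeRecord when `extra` collides with a LogRecord field.
        return f"rejected: {exc}"


print(read_tokens_text("/nonexistent/uk_tokens.yml") == "")
print(log_push({"msg": "OK"}))
print(log_push({"push_msg": "OK"}))
```

A broad `except Exception` around the push-and-log sequence would catch that `KeyError` too, which is how a successful push could end up reported as a failure.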
8d5fe55b14 health-agent redeploy with new image
All checks were successful
Deploy Environment Monitoring to Production Environment / deploy (push) Successful in 31s
2026-06-26 20:06:48 +03:00
9fbc74d498 fix(workflow): use -s flag to trigger Uptime Kuma setup on empty uk_tokens.yml
All checks were successful
Deploy Environment Monitoring to Production Environment / deploy (push) Successful in 27s
The previous ! -f check skipped setup when uk_tokens.yml existed but was empty (0 bytes). Switching to ! -s triggers setup whenever the file is missing or empty.
2026-06-26 19:39:48 +03:00
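
The workflow itself uses the shell tests `! -f` and `! -s`. For readers more at home in the agent's language, the same "missing or empty" condition can be sketched in Python. `needs_setup` is a hypothetical name and the file content is sample data.

```python
import os
import tempfile


def needs_setup(path: str) -> bool:
    """True when the file is missing or empty, like `[ ! -s path ]` in shell."""
    try:
        return os.path.getsize(path) == 0
    except OSError:
        # Missing (or unreadable) file: treat as "setup required".
        return True


with tempfile.TemporaryDirectory() as tmp:
    tokens = os.path.join(tmp, "uk_tokens.yml")

    missing = needs_setup(tokens)          # file does not exist yet
    open(tokens, "w").close()
    empty = needs_setup(tokens)            # exists, 0 bytes: `! -f` would skip setup here
    with open(tokens, "w") as fh:
        fh.write("demo: token\n")
    populated = needs_setup(tokens)        # non-empty: setup can be skipped

print(missing, empty, populated)
```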
0ef4f0b6f8 refactor: rename iklimco-monitoring stack to monitoring 2026-06-26 19:24:01 +03:00
344ab4ac13 ci(workflow): remove redundant paths-ignore filter, gitignore already excludes those paths
All checks were successful
Deploy Environment Monitoring to Production Environment / deploy (push) Successful in 36s
2026-06-26 18:56:13 +03:00
656968823b ci(workflow): replace paths filter with paths-ignore to trigger on any change except .venv and __pycache__ 2026-06-26 18:55:41 +03:00
e812a3b454 Merge branch 'main' into prod-env 2026-06-26 18:51:44 +03:00
07b8db8de2 fix(common-functions): add no-op to empty refresh_calculated_env_vars to fix bash syntax error 2026-06-26 18:51:19 +03:00
8347b7e25d fix(common-functions): add no-op to empty refresh_calculated_env_vars to fix bash syntax error 2026-06-26 18:50:31 +03:00
58d5c24f41 feat(health-agent): add CI/CD pipeline, Uptime Kuma setup, and runtime configuration
Some checks failed
Deploy Environment Monitoring to Production Environment / deploy (push) Failing after 10s
Deploy workflows:
- Integrate health-agent build (test) and image promotion (prod) into monitoring stack workflows
- Add storagebox download of health-agent runtime (.env.monitoring.health-agent-runtime → health-agent/.env) and setup (.env.monitoring.health-agent-setup → health-agent/.env.setup) env files
- Add "Run Uptime Kuma Setup" step: runs setup_uptime_kuma.py inside the built image only when uk_tokens.yml is missing, writes tokens to HEALTH_AGENT_CONFIG_GENERATED_DIR (/mnt/storagebox/monitoring/uk_generated)
- Add health-agent/** and health-agent/deploy/prod.env path triggers to test and prod workflows respectively
- Add HARBOR_CI_TOKEN login and HARBOR_PULL_TOKEN login before stack deploy in both workflows
- Source health-agent/.env before docker stack deploy to expose HEALTH_AGENT_CONFIG_GENERATED_DIR

Dockerfile:
- Copy config/ and scripts/ into image so setup_uptime_kuma.py can run inside the container

setup_uptime_kuma.py:
- Load .env and .env.setup automatically via python-dotenv (no manual export needed)
- Write uk_tokens.yml to config/generated/ (aligned with container volume mount)

Health checks:
- PATRONI_HOSTS and VAULT_HOSTS are now configurable via env vars (comma-separated host:port); no code change needed when node count changes
- REDIS_SENTINEL_HOSTS now correctly parses host:port format; default updated to redis-sentinel:26379
- Fix NameError in check_patroni_cluster() caused by leftover node variable after loop refactor
- Remove verify_ssl=False from Vault check; vault.iklim.co has a valid certificate

Ops:
- Add ops/build-and-push-health-agent.sh for manual bypass of CI pipeline
- Add health-agent/deploy/prod.env template for prod image promotion manifest

Project structure:
- Move .env.example and .env.setup.example to health-agent/env-example/ (root .gitignore excludes health-agent/.env*)
- Add root .gitignore: excludes uk_tokens.yml, __pycache__, .venv, and env files
- Remove health-agent/.gitignore (superseded by root .gitignore)
2026-06-26 18:45:17 +03:00
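
The comma-separated host:port format described under "Health checks" in 58d5c24f41 could be parsed along these lines. This is a sketch under assumptions: `parse_hosts` is a hypothetical name, the Patroni host names and port are illustrative, and IPv6 literals are not handled. Only the `redis-sentinel:26379` default comes from the commit message.

```python
import os


def parse_hosts(value: str, default_port: int) -> list[tuple[str, int]]:
    """Parse 'host:port,host:port' into (host, port) pairs.

    Entries without a port fall back to `default_port`; blanks are skipped.
    """
    hosts = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, sep, port = item.rpartition(":")
        if sep:
            hosts.append((host, int(port)))
        else:
            hosts.append((item, default_port))
    return hosts


sentinels = parse_hosts(
    os.environ.get("REDIS_SENTINEL_HOSTS", "redis-sentinel:26379"), 26379
)
# Illustrative value only; the real PATRONI_HOSTS setting is not shown in the log.
patroni = parse_hosts("patroni-1:8008, patroni-2:8008,patroni-3", 8008)
print(patroni)
```

With this shape, adding or removing a node is a change to the environment variable only, which matches the stated goal of needing no code change when the node count changes.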
062d3ff90d docs(health-agent): document --once and --dry-run flags in README 2026-06-26 16:47:53 +03:00
7ab186b961 feat(health-agent): add --once and --dry-run flags to main.py 2026-06-26 16:43:21 +03:00
c49616ac10 refactor(workflow): use source_env_file and require_env_file from common-functions-base.sh 2026-06-26 14:01:46 +03:00
6fc9ff45aa feat(workflow): add common-functions-base.sh and replace echo with log_message 2026-06-26 13:59:18 +03:00
28d726d2d8 fix(health-agent): correct Uptime Kuma URLs in example env files 2026-06-25 20:55:58 +03:00
07a364b2bc fix(health-agent): correct UK_URL placeholder in .env.setup.example 2026-06-25 20:54:28 +03:00
208f4768b9 chore(health-agent): switch to uptime-kuma-api-v2, fix .env.setup.example credentials 2026-06-25 20:50:23 +03:00
21965d4183 fix(workflow): remove unnecessary concurrency block from test monitoring workflow 2026-06-25 19:22:19 +03:00
72a91072fb feat(health-agent): add README, workflows, and translate monitors.yml to English
- Add health-agent README with architecture, config, and deployment docs
- Add deploy-monitoring-test.yml workflow (mirrors prod, test-runner, test storagebox paths)
- Add health-agent service to docker-stack-monitoring.yml
- Add .env.example with all runtime variables and .gitignore for generated files
- Add config/generated/.gitkeep to track empty generated directory
- Translate all Turkish group names and status page titles in monitors.yml to English
- Remove users.yml.example (Dozzle was removed in previous commit)
2026-06-25 19:20:25 +03:00
f742bfdd11 feat(health-agent): add monitors.yml with env-aware node IP mapping from Ansible inventory 2026-06-25 18:59:14 +03:00
a2e8997711 fix(workflow): correct file paths for standalone repo context
paths filter and stack/swag references used Environment_Monitoring/ prefix
which only makes sense in the main repo context. Since this workflow runs
inside the Environment_Monitoring repo itself, all paths are relative to
the repo root.
2026-06-25 17:19:55 +03:00
735d957dfa feat(monitoring): replace Dozzle with full observability stack
Replace the single-purpose Dozzle log viewer with a comprehensive monitoring
stack covering metrics, container telemetry, and persistent log aggregation.

Stack changes (docker-stack-service.yml -> docker-stack-monitoring.yml):
- remove Dozzle service and dozzle_users Docker secret
- add Portainer CE + portainer-agent (Swarm management UI)
- add node-exporter (global) — host CPU, memory, disk, network metrics
- add cAdvisor (global) — per-container resource usage metrics
- add Loki (replicated, service node) — persistent log storage, 31-day retention
- add Promtail (global) — Docker service discovery; ships logs with service,
  stack, container, and project labels; sends to Loki
- rename stack to iklimco-monitoring; add loki-vl persistent volume

Workflow (.gitea/workflows/deploy-prod.yml -> deploy-monitoring-prod.yml):
- rename file and add paths filter (Environment_Monitoring/**)
- remove Dozzle secret creation and auth handling
- add IMAGE_LOKI / IMAGE_PROMTAIL; clean up legacy dozzle_users Docker secret
- update SWAG step to loop swag/site-confs/*.conf.tpl (portainer only)
- remove DOZZLE_SUBDOMAIN; remove dozzle DNS record; keep portainer DNS
- replace "Wait for Dozzle" with "Wait for Loki"

SWAG:
- remove swag/dozzle.conf.tpl (Dozzle no longer in stack)
- add swag/site-confs/portainer.conf.tpl (moved from main repo template dir;
  monitoring stack manages its own SWAG configs independently)
- remove init/apisix-dozzle.sh (superseded by SWAG reverse proxy)

README:
- rewrite in Turkish; document Portainer, node-exporter, cAdvisor, Loki, Promtail
- add Grafana log viewing guide: datasource setup, label filter table, LogQL
  examples, metric-log correlation workflow, adding log panels to dashboards

Requires IMAGE_LOKI and IMAGE_PROMTAIL to be defined in .env and
corresponding custom images (build/loki/, build/promtail/) pushed to Harbor.
2026-06-24 21:21:02 +03:00
94dc1d2fe3 add docker-stack-service.yml, init scripts, and configuration files 2026-06-18 19:19:12 +03:00
446e761eb2 first commit 2026-06-18 19:18:31 +03:00