ops(cmo): self-healing stack — auto-restart, healthchecks, pm2 tunnel, heartbeat
Why
On 2026-04-22 cmoagent-db and cmoagent-app exited with code 255 and stayed dead for 5 days. cmoagent-watcher and cmoagent-notifier crashlooped 289+ times on socket.gaierror (could not resolve "db") and nobody noticed because Slack daily briefs / trends digests just went silent. Cloudflare tunnel "cmoagent" was also stopped, leaving https://cmo.laparedita.com offline.
What
Three independent layers so the same outage cannot happen silently again:
-
docker-compose.yml—restart: unless-stoppedondbandapp(was missing). HTTP healthcheck onapp, file-mtime healthchecks onwatcher/notifierthat flip the containerunhealthyif the loop hasn't iterated within 3× the interval. -
scripts/cmoagent-tunnel.pm2.config.js— Cloudflare named tunnel registered as a pm2 process so it auto-restarts and survives machine reboots (pm2 save && pm2 startup). -
scripts/cmoagent-heartbeat.sh— cron-friendly probe that verifies portal reachable + all 4 containers running + notifier state file fresh. Posts to Slack on failure, silent on success.
Verification
- All 4 containers report
(healthy)afterdocker compose up -d. -
pm2 status cmoagent-tunnel→ online;https://cmo.laparedita.com/→ 401 (portal up + auth). -
bash scripts/cmoagent-heartbeat.sh→[heartbeat] healthy. - Full pytest suite green: 347 passed.
Test plan
-
After merge, run pm2 startup(one-time) so tunnel survives reboot. -
Add the heartbeat cron line documented in README. -
Kill cmoagent-db-1manually to confirm app/watcher/notifier reconnect (and that the heartbeat catches it if longer than 30 min).