Catch AI job failures before clients do.
AgentWatchdog monitors cron-backed agents, stale reports, Slack delivery failures, oversized local models, and service drift. It is built for operators who already know the pain of learning about failures after the damage is done.
Scout job timed out twice, report never refreshed, analyst chain is now blocked.
Heavy model still resident in RAM, swap is filling, next cron run is at risk.
Fresh report written, Slack send confirmed, runtime stayed within budget.
What it catches
The first version is intentionally opinionated. It focuses on the failures that actually hurt small operator-run agent systems.
Job health
Flags failed, paused, or repeatedly erroring agents before they quietly drop out of sight.
- Consecutive error counts
- Last-run status
- Paused jobs that matter
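As a rough illustration of the three checks above, here is a minimal Python sketch. The `JobState` fields, the `critical` flag, and the `max_errors` threshold are assumptions made for the example, not AgentWatchdog's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobState:
    # Hypothetical shape of what a scheduler might report for one job.
    name: str
    last_status: str          # e.g. "ok", "error", "timeout"
    consecutive_errors: int
    paused: bool
    critical: bool = False    # assumed marker for "paused jobs that matter"


def job_flags(job: JobState, max_errors: int = 2) -> List[str]:
    """Return human-readable reasons this job needs attention."""
    flags = []
    if job.last_status != "ok":
        flags.append(f"last run {job.last_status}")
    if job.consecutive_errors >= max_errors:
        flags.append(f"{job.consecutive_errors} consecutive errors")
    # A paused job is only worth flagging when someone depends on it.
    if job.paused and job.critical:
        flags.append("paused but marked critical")
    return flags
```

A job that timed out twice in a row would come back with both a last-run flag and a consecutive-error flag, while a paused job nobody depends on stays quiet.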
Report freshness
Watches output files directly so you notice stale reporting even when the scheduler looks technically alive.
- Daily and weekly freshness rules
- Missing report detection
- Blocked downstream chains
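A freshness rule can be as small as comparing a file's modification time against an allowed age. The sketch below assumes two cadences with some grace time built in; the `MAX_AGE_SECONDS` values and the `report_status` name are illustrative, not the product's defaults.

```python
import os
import time
from typing import Optional

# Assumed limits: a daily report gets 26 hours, a weekly report gets 8 days.
MAX_AGE_SECONDS = {
    "daily": 26 * 3600,
    "weekly": 8 * 24 * 3600,
}


def report_status(path: str, cadence: str, now: Optional[float] = None) -> str:
    """Classify a report file as "missing", "stale", or "fresh"."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except FileNotFoundError:
        return "missing"
    age = now - mtime
    return "stale" if age > MAX_AGE_SECONDS[cadence] else "fresh"
```

Because the check reads the file itself, it still reports "stale" when the scheduler claims the job ran but nothing was written.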
Host and model pressure
Surfaces the hidden system-level issues that explain why “nothing changed” still ends in broken runs.
- RAM and swap pressure
- Loaded Ollama models
- Service state drift
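The memory and model checks can be sketched with plain parsing. This example assumes a Linux host where `/proc/meminfo` is available, and takes the loaded models as a list of `(name, size_bytes)` pairs gathered however you prefer (for instance from the output of `ollama ps`). The thresholds and function names are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def pressure_flags(
    meminfo: Dict[str, int],
    loaded_models: List[Tuple[str, int]],
    min_available_ratio: float = 0.10,      # assumed threshold
    max_swap_used_ratio: float = 0.50,      # assumed threshold
    model_budget_bytes: int = 8 * 1024**3,  # assumed per-model budget
) -> List[str]:
    """Return reasons the host looks risky for the next scheduled run."""
    flags = []
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    if available / total < min_available_ratio:
        flags.append(f"RAM available {available / total:.0%} is below {min_available_ratio:.0%}")
    swap_total = meminfo.get("SwapTotal", 0)
    if swap_total:  # skip hosts with no swap configured
        used = (swap_total - meminfo.get("SwapFree", 0)) / swap_total
        if used > max_swap_used_ratio:
            flags.append(f"swap {used:.0%} used")
    for name, size in loaded_models:
        if size > model_budget_bytes:
            flags.append(f"model {name} resident at {size / 1024**3:.1f} GiB")
    return flags
```

On a real host you would feed it `parse_meminfo(open("/proc/meminfo").read())`; keeping the parsing separate from the thresholds makes the rules easy to tune per machine.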
How it fits
AgentWatchdog is not trying to replace your orchestration stack. It sits above it and tells you when reality no longer matches the plan.
Inputs
- Cron metadata from the jobs you already run.
- Report mtimes from the files your team actually reads.
- Ops inbox and approvals from human-in-the-loop workflows.
- Service and model state from the host itself.
Outputs
- Slack-ready summary for daily operator review.
- Immediate alert surface for critical drift and broken chains.
- Dashboard snapshot for at-a-glance health.
- Suggested next actions instead of vague “something failed.”
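To make the idea of a Slack-ready summary with suggested next actions concrete, here is a small Python sketch. The finding fields (`severity`, `what`, `action`) and the message wording are assumptions for illustration. The only Slack-specific detail relied on is that incoming webhooks accept a JSON body with a `text` field.

```python
import json
from typing import Dict, List

# Lower number sorts first, so critical findings lead the message.
SEVERITY_ORDER = {"critical": 0, "warning": 1}


def build_summary(findings: List[Dict[str, str]]) -> str:
    """Render findings as one plain-text message, most severe first."""
    if not findings:
        return "AgentWatchdog: all checks passed."
    lines = [f"AgentWatchdog: {len(findings)} issue(s) need attention"]
    ordered = sorted(findings, key=lambda f: SEVERITY_ORDER.get(f["severity"], 2))
    for f in ordered:
        # Pair every problem with a concrete next step.
        lines.append(f"- [{f['severity']}] {f['what']} | next: {f['action']}")
    return "\n".join(lines)


def slack_payload(findings: List[Dict[str, str]]) -> str:
    """Serialize the summary as a JSON body for a Slack incoming webhook."""
    return json.dumps({"text": build_summary(findings)})
```

Sending is left out on purpose: posting the payload to your own webhook URL is a single HTTP request, and keeping the formatting separate lets the same summary feed a dashboard as well.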
Start as a service-assisted install, not another ignored dashboard.
The first useful version is simple: install it, tune the thresholds to your actual jobs, push alerts into Slack, and dogfood it until the noise is gone. That is enough to save real time and catch real failures.