Breaking Good
Reliability for agent systems

Catch AI job failures before clients do.

AgentWatchdog monitors cron-backed agents, stale reports, Slack delivery failures, oversized local models, and service drift. It is built for operators who already know the pain of learning about failures after the damage is done.

Built from real incidents. Cron jobs, stale reports, broken Slack sends, and runaway local models on the Framework.
Slack-first. Alerts and summaries show up where decisions already happen.
Operator-friendly. Clear severity, likely cause, and suggested next action.
Critical: Income report stale for 52 hours

Scout job timed out twice, report never refreshed, analyst chain is now blocked.

Warning: Ollama memory pressure rising

Heavy model still resident in RAM, swap is filling, next cron run is at risk.

Healthy: Daily research digest delivered

Fresh report written, Slack send confirmed, runtime stayed within budget.

What it catches

The first version is intentionally opinionated. It focuses on the failures that actually hurt small operator-run agent systems.

Job health

Flags failed, paused, or repeatedly erroring agents before they silently age out of your awareness.

  • Consecutive error counts
  • Last-run status
  • Paused jobs that matter
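A minimal sketch of how these job-health signals might roll up into a single severity. The function name, field names, and thresholds here are illustrative assumptions, not AgentWatchdog's actual configuration:

```python
# Hypothetical severity roll-up for one cron-backed agent job.
# Thresholds are illustrative; tune them to your actual jobs.

def job_severity(consecutive_errors: int, last_run_ok: bool,
                 paused: bool, paused_matters: bool = True) -> str:
    """Map raw job-health signals to a coarse severity label."""
    if paused and paused_matters:
        return "warning"      # a paused job that matters still needs eyes
    if consecutive_errors >= 3:
        return "critical"     # repeated failures: assume the chain is blocked
    if consecutive_errors >= 1 or not last_run_ok:
        return "warning"
    return "healthy"
```

The point of the roll-up is that "paused" and "erroring" both surface, instead of quietly aging out of view.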

Report freshness

Watches output files directly so you notice stale reporting even when the scheduler looks technically alive.

  • Daily and weekly freshness rules
  • Missing report detection
  • Blocked downstream chains

Host and model pressure

Surfaces the hidden system-level issues that explain why “nothing changed” still ends in broken runs.

  • RAM and swap pressure
  • Loaded Ollama models
  • Service state drift
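The pressure signals can feed the same kind of roll-up. This sketch assumes the raw percentages come from whatever host probe you already have (for example `/proc/meminfo` or `ollama ps`); the thresholds are illustrative:

```python
# Hypothetical memory-pressure check: flag when RAM plus swap usage
# threatens the next cron run. Thresholds are illustrative assumptions.

def memory_pressure(ram_used_pct: float, swap_used_pct: float) -> str:
    """Map host memory signals to a coarse severity label."""
    if ram_used_pct >= 95 or swap_used_pct >= 80:
        return "critical"     # next run will likely OOM or thrash
    if ram_used_pct >= 85 or swap_used_pct >= 40:
        return "warning"      # heavy model still resident, swap filling
    return "healthy"
```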

How it fits

AgentWatchdog is not trying to replace your orchestration stack. It sits above it and tells you when reality no longer matches the plan.

Input signals
  • Cron metadata from the jobs you already run.
  • Report mtimes from the files your team actually reads.
  • Ops inbox and approvals from human-in-the-loop workflows.
  • Service and model state from the host itself.
Output
  • Slack-ready summary for daily operator review.
  • Immediate alert surface for critical drift and broken chains.
  • Dashboard snapshot for at-a-glance health.
  • Suggested next actions instead of vague “something failed.”
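A Slack-ready summary can be one line per finding, severity first and next action last. The field names and message layout below are assumptions for illustration; the `{"text": ...}` payload shape is Slack's standard incoming-webhook format:

```python
# Hypothetical Slack summary builder: one line per finding, with a
# suggested next action instead of a vague "something failed".
import json

def slack_summary(findings: list[dict]) -> str:
    """Render findings as a Slack incoming-webhook JSON payload."""
    lines = [
        f"*{f['severity'].upper()}* {f['title']} -> next: {f['next_action']}"
        for f in findings
    ]
    return json.dumps({"text": "\n".join(lines)})
```

Posting that string to a webhook URL puts the daily review where decisions already happen.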
Beta shape

It starts as a service-assisted install, not another ignored dashboard.

The first useful version is simple: install it, tune the thresholds to your actual jobs, push alerts into Slack, and dogfood it until the noise is gone. That is enough to save real time and catch real failures.