The Problem: "n8n Runs – Until It Doesn't"
In practice, n8n rarely fails because of the workflow idea; it fails in day-to-day operations:
- errors pass through unnoticed,
- retries create duplicate actions,
- an API limit crashes the process,
- after an update, "something is different".
Goal: A monitoring set that's small enough to actually operate – and strong enough to show outages early.
1) The 6 Alerts That (Almost) Always Make Sense
Rule: Every alert needs an owner + response time (SLA) + standard action.
- Workflow error rate increases (e.g., >2% of runs)
- Single workflow fails repeatedly (e.g., 3× within 30 min)
- Run duration exceeds normal range (e.g., p95 > X seconds)
- Queue/concurrency backs up (runs "hang")
- External API limits/timeouts (429/5xx spikes)
- Data integrity (e.g., "0 records processed" when records were expected)
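The first two alert rules can be sketched in a few lines. This is a minimal illustration, not an n8n API: the `Run` record and its field names are hypothetical, and it assumes you already export execution results (workflow name, finish time, success flag) from n8n into your own store. The thresholds are the example values from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Run:
    """Hypothetical record of one workflow execution (not an n8n type)."""
    workflow: str          # workflow name or ID
    finished_at: datetime  # when the run ended
    ok: bool               # True if the run succeeded


def error_rate(runs):
    """Share of failed runs, 0.0 for an empty list."""
    if not runs:
        return 0.0
    return sum(not r.ok for r in runs) / len(runs)


def repeated_failures(runs, now, window=timedelta(minutes=30), limit=3):
    """Workflows with at least `limit` failed runs inside the window."""
    counts = {}
    for r in runs:
        if not r.ok and now - window <= r.finished_at <= now:
            counts[r.workflow] = counts.get(r.workflow, 0) + 1
    return sorted(w for w, n in counts.items() if n >= limit)


def should_alert_on_error_rate(runs, threshold=0.02):
    """Alert when more than 2% of runs failed (example threshold)."""
    return error_rate(runs) > threshold
```

Both functions only decide whether to alert. Who is notified and what the standard action is belong to the owner and SLA defined by the rule above.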
2) Metrics You Should Actually Measure (Copy/Paste)
| Metric | Why | Typical Threshold |
|---|---|---|
| Success rate per workflow | reveals drift and dependency problems | <98% = investigate |
| p95 runtime | performance regression | +50% vs. baseline |
| Retry rate | precursor to outages | increasing = investigate cause |
| Dead-letter/error path count | shows systemic errors | >0 per day = check |
| 429/Rate limit errors | API health | >5% of requests |
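As a rough sketch of how two of these thresholds could be evaluated, the following computes a nearest-rank p95 and compares success rate and runtime against the table's example values. The function names and the idea of passing in a stored `baseline_p95_s` are assumptions for illustration; how you collect durations from n8n is left open.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def check_workflow(durations_s, successes, total, baseline_p95_s):
    """Return a list of findings for one workflow (empty list = healthy)."""
    findings = []
    # Success rate below 98% means: investigate.
    if total and successes / total < 0.98:
        findings.append("success rate below 98%")
    # p95 runtime more than 50% above the baseline hints at a regression.
    if durations_s and p95(durations_s) > 1.5 * baseline_p95_s:
        findings.append("p95 runtime more than 50% above baseline")
    return findings
```

Nearest-rank is chosen here because it always returns a value that was actually measured. Interpolating percentile methods give slightly different numbers on small samples.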
3) Runbook Minimum (So Not Every Issue Escalates)
For each critical workflow, 8 lines often suffice:
- Purpose (1 sentence)
- Input/trigger
- Output (what is written where)
- Owner + backup
- Most common errors (2–3)
- Standard action (retry/stop/manual)
- Data checks (e.g., "number of records")
- Link to documentation
Without this, monitoring is just "noise".
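One way to keep runbooks honest is to store the eight lines as structured data and flag empty entries. This is only a sketch: the `Runbook` class and its field names are made up for illustration and simply mirror the list above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Runbook:
    """Hypothetical 8-line runbook for one critical workflow."""
    purpose: str
    trigger: str
    output: str
    owner_and_backup: str
    common_errors: list = field(default_factory=list)
    standard_action: str = ""
    data_checks: str = ""
    docs_link: str = ""


def missing_fields(runbook):
    """Names of runbook lines that are still empty."""
    return [f.name for f in fields(runbook) if not getattr(runbook, f.name)]
```

A non-empty result from `missing_fields` could block a workflow from being marked as critical until the runbook is complete.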
4) Typical Anti-Patterns
- Too many alerts → nobody responds.
- No data check → workflow "runs" but produces garbage.
- Retries without idempotency → duplicate emails/tickets.
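The last anti-pattern is usually addressed with an idempotency key: the same input must not trigger the same side effect twice. The sketch below is illustrative only. `idempotency_key` and `OnceOnly` are hypothetical helpers, and the in-memory set stands in for whatever persistent store (database table, key-value store) you would use in practice.

```python
import hashlib
import json


def idempotency_key(workflow, payload):
    """Stable key for one logical action: same workflow + same payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{workflow}:{canonical}".encode("utf-8")).hexdigest()


class OnceOnly:
    """Runs an action at most once per key (in-memory, for illustration)."""

    def __init__(self):
        self._seen = set()

    def run(self, key, action):
        if key in self._seen:
            return False  # already done, so a retry becomes a no-op
        action()
        # Mark as done only after success, so a failed action can be retried.
        self._seen.add(key)
        return True
```

For example, a retry of the same ticket payload is skipped:

```python
guard = OnceOnly()
key = idempotency_key("ticket-create", {"customer": 42, "subject": "Invoice"})
guard.run(key, lambda: print("ticket created"))  # runs the action
guard.run(key, lambda: print("ticket created"))  # skipped
```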
KPI Block (Operations)
- MTTA (Mean Time To Acknowledge): How quickly is an error seen?
- MTTR (Mean Time To Repair): How quickly is it fixed?
- Error rate per workflow (trend, not snapshot)
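MTTA and MTTR are simple averages once each incident has three timestamps. The following is a minimal sketch under assumptions: the `Incident` record is hypothetical, and MTTR is measured here from the moment the error was raised (some teams measure from acknowledgement instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record with the three timestamps needed."""
    raised_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def _mean(deltas):
    deltas = list(deltas)
    if not deltas:
        raise ValueError("no incidents to average")
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents):
    """Mean time from error raised to acknowledged."""
    return _mean(i.acknowledged_at - i.raised_at for i in incidents)


def mttr(incidents):
    """Mean time from error raised to resolved (assumed definition)."""
    return _mean(i.resolved_at - i.raised_at for i in incidents)
```

Tracking both values per month, rather than per incident, matches the "trend, not snapshot" idea from the list above.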
Next Step
If you operate n8n in law firms, monitoring is not a nice-to-have but a prerequisite.