Why This Checklist?
The most common mistake is confusing "the workflow runs" with "the workflow is operational". A workflow that works in testing is still far from production-ready.
Go-live means that errors are noticed, understood, and fixed without relying on gut feeling, manual checks, or client complaints as an early warning system.
The Problem with "It Works"
A workflow works in testing. Great. But what happens when:
- The API does not respond?
- A required field is empty?
- The same record arrives twice?
- The owner is on vacation?
- Nobody notices that nothing has run for 3 days?
Without preparation, each of these scenarios becomes a fire alarm.
The 10 Points Before Go-Live
1. Owner + Backup Assigned
Not "the team", but a specific person. Plus a backup who knows how it works.
Check question: If an alert comes at 10 PM – who do we call?
2. Goal & KPI Defined
Every workflow has exactly one success KPI. Not five, one.
Examples:
- Lead intake: Time to first response <4h
- Document workflow: Error rate <2%
- Reporting: Punctuality rate >95%
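A KPI like "error rate <2%" is only useful if it is computed the same way every time. A minimal sketch of such a check, assuming the workflow can report its failed and total run counts (the function name and the treatment of "no runs" are illustrative assumptions):

```python
def error_rate_ok(failed: int, total: int, threshold: float = 0.02) -> bool:
    """Return True when the error rate is strictly below the threshold.

    Assumption: zero runs counts as "not met", because no data is not
    evidence of success.
    """
    if total == 0:
        return False
    return failed / total < threshold


# Example: 3 failures out of 200 runs is 1.5%, which is under the 2% target.
document_workflow_ok = error_rate_ok(failed=3, total=200)
```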
3. Trigger Monitoring Set Up
Would you notice if nothing ran for 24 hours? Most workflows have an expected rhythm. If the actual rhythm deviates from it, an alert must fire.
Example: Lead intake normally receives 3-10 inquiries per day. If no inquiries arrive for a whole day, something is probably broken.
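The silence check can be sketched in a few lines. This assumes the timestamp of the last successful run is stored somewhere the monitor can read. The function name and the 24-hour default are illustrative, not a prescribed implementation:

```python
from datetime import datetime, timedelta, timezone


def is_silent_too_long(
    last_run: datetime,
    now: datetime,
    max_silence: timedelta = timedelta(hours=24),
) -> bool:
    """Return True when the workflow has not run within the allowed window."""
    return now - last_run > max_silence


# Example: the last lead arrived 30 hours ago, so an alert should fire.
now = datetime(2024, 5, 2, 18, 0, tzinfo=timezone.utc)
last_lead = now - timedelta(hours=30)
should_alert = is_silent_too_long(last_lead, now)
```

A scheduled job would call this check periodically and notify the owner when it returns True.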
4. Error Path Defined
What happens on errors?
- Retry: How often, at what intervals?
- Dead-letter queue: Where do failed records go?
- Notification: Who learns about it?
- Manual processing: How does that work?
Rule: Never just "lose" data.
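One possible shape for such an error path is retry with exponential backoff, followed by a dead-letter list. This is a sketch under assumptions: `process_with_retry`, the in-memory `dead_letters` list, and the default of 3 attempts are hypothetical. A real setup would use a persistent queue and send a notification.

```python
import time
from typing import Any, Callable


def process_with_retry(
    record: Any,
    handler: Callable[[Any], None],
    dead_letters: list,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the handler up to max_attempts times, then dead-letter the record."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            handler(record)
            return True
        except Exception as exc:  # real code should catch only transient errors
            last_error = exc
            if attempt < max_attempts:
                # Exponential backoff: 1s, 2s, 4s, ... between attempts.
                sleep(base_delay * 2 ** (attempt - 1))
    # The record is kept together with its error, never silently dropped.
    dead_letters.append({"record": record, "error": repr(last_error)})
    return False
```

Passing `sleep` as a parameter keeps the function testable without real waiting.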
5. Logging Set Up
Relevant IDs and references are logged – but no sensitive content.
Logged:
- External IDs (lead ID, client number)
- Timestamps
- Status transitions
- Errors with context
Not logged:
- Personal data
- Passwords, API keys
- Complete documents
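One way to enforce this split is an allowlist: only explicitly approved fields ever reach the log. A minimal sketch using Python's standard `logging` module; the field names in `LOGGABLE_FIELDS` are assumptions chosen for illustration:

```python
import logging

# Hypothetical allowlist: anything not named here is dropped before logging.
LOGGABLE_FIELDS = {"lead_id", "client_number", "status", "timestamp"}


def safe_context(record: dict) -> dict:
    """Keep only allowlisted fields, so personal data and secrets stay out."""
    return {key: value for key, value in record.items() if key in LOGGABLE_FIELDS}


def log_transition(logger: logging.Logger, record: dict, old: str, new: str) -> None:
    """Log a status transition together with the filtered context."""
    logger.info("status %s -> %s context=%s", old, new, safe_context(record))


logging.basicConfig(level=logging.INFO)
log_transition(
    logging.getLogger("lead_intake"),
    {"lead_id": "L-1001", "email": "jane@example.com", "api_key": "secret"},
    "new",
    "contacted",
)
```

An allowlist fails safe: a newly added sensitive field is excluded by default, whereas a denylist would log it until someone remembers to add it.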
6. Secrets/Keys: Rotation Planned
Who renews API keys when they expire? When do they expire?
Documented:
- List of all secrets with expiration date
- Owner for each secret
- Rotation process (how long does it take?)
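The documented list can double as input for an automated reminder. A sketch, assuming each secret is recorded with its owner, expiration date, and the lead time its rotation needs (the `Secret` class and all sample values are hypothetical):

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Secret:
    name: str
    owner: str
    expires_on: date
    rotation_lead_days: int  # how long the rotation process takes


def due_for_rotation(secrets: list, today: date) -> list:
    """Return the secrets whose rotation must start now to finish before expiry."""
    return [
        s for s in secrets
        if s.expires_on - today <= timedelta(days=s.rotation_lead_days)
    ]


inventory = [
    Secret("crm_api_key", "alice", date(2024, 6, 10), rotation_lead_days=14),
    Secret("mail_token", "bob", date(2024, 12, 1), rotation_lead_days=14),
]
due = due_for_rotation(inventory, today=date(2024, 6, 1))
```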
7. Data Validation at Entry
Catch bad data early, not mid-workflow.
Minimum:
- Required fields present?
- Format correct (email, phone)?
- Expected values (status from known list)?
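The three minimum checks can be sketched as a single function that returns a list of problems. The field names, the status list, and the deliberately simple email pattern are assumptions for illustration, not a complete validator:

```python
import re

# Hypothetical schema for a lead record.
REQUIRED_FIELDS = ("lead_id", "email", "status")
KNOWN_STATUSES = {"new", "contacted", "closed"}
# Simple shape check only; it does not prove the address exists.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record is valid."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not str(record.get(field) or "").strip():
            errors.append(f"missing required field: {field}")
    email = record.get("email")
    if email and not EMAIL_RE.match(str(email)):
        errors.append("invalid email format")
    status = record.get("status")
    if status and status not in KNOWN_STATUSES:
        errors.append(f"unknown status: {status}")
    return errors
```

Returning all errors at once, rather than stopping at the first, makes manual correction faster.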
8. Idempotency Ensured
If the same record arrives twice (happens more often than you think), it must not be processed twice.
Check question: What happens if a webhook fires twice?
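A common way to handle this is to derive a stable key from each record and skip keys that were already seen. The sketch below assumes the payload is JSON-serializable and uses an in-memory set purely for illustration. A production setup would need a persistent store, and would prefer an event ID from the sender if one exists.

```python
import hashlib
import json


def idempotency_key(record: dict) -> str:
    """Derive a stable key: the same content always yields the same hash."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Deduplicator:
    """Hypothetical in-memory duplicate filter."""

    def __init__(self) -> None:
        self._seen = set()

    def first_time(self, record: dict) -> bool:
        """Return True only the first time a given record is seen."""
        key = idempotency_key(record)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


dedup = Deduplicator()
webhook_payload = {"lead_id": "L-1001", "status": "new"}
first = dedup.first_time(webhook_payload)   # processed
second = dedup.first_time(webhook_payload)  # skipped as duplicate
```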
9. Runbook Documented
One page containing everything important:
- What does the workflow do?
- How is it triggered?
- What are the outputs?
- Top 3 errors + solutions
- How does the fallback work?
10. Test Cases Completed
At least 5 real variants, not just the happy path:
- Normal case (everything correct)
- Missing data (required field empty)
- Invalid data (wrong format)
- Duplicate (same record again)
- Outage (API not reachable)
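The five variants can be written down as a table of cases and run against the workflow's entry point. In the sketch below, `handle_record` is a hypothetical stand-in for the real workflow, and the outcome labels are assumptions. Only the table-driven pattern is the point.

```python
def handle_record(record: dict, seen_ids: set, api_up: bool = True) -> str:
    """Hypothetical workflow entry point returning an outcome label."""
    if not record.get("lead_id"):
        return "rejected: missing lead_id"
    if "@" not in record.get("email", ""):
        return "rejected: invalid email"
    if record["lead_id"] in seen_ids:
        return "skipped: duplicate"
    if not api_up:
        return "dead-letter: API unreachable"
    seen_ids.add(record["lead_id"])
    return "processed"


# (case name, record, already seen IDs, API reachable?, expected outcome)
CASES = [
    ("normal", {"lead_id": "L-1", "email": "a@example.com"}, set(), True, "processed"),
    ("missing", {"lead_id": "", "email": "a@example.com"}, set(), True, "rejected: missing lead_id"),
    ("invalid", {"lead_id": "L-2", "email": "nope"}, set(), True, "rejected: invalid email"),
    ("duplicate", {"lead_id": "L-1", "email": "a@example.com"}, {"L-1"}, True, "skipped: duplicate"),
    ("outage", {"lead_id": "L-3", "email": "c@example.com"}, set(), False, "dead-letter: API unreachable"),
]


def run_cases() -> dict:
    """Run every case and report, per case name, whether the outcome matched."""
    return {
        name: handle_record(record, seen, api_up) == expected
        for name, record, seen, api_up, expected in CASES
    }
```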
Checklist to Tick Off
| # | Point | Status |
|---|---|---|
| 1 | Owner + backup | ☐ |
| 2 | KPI defined | ☐ |
| 3 | Trigger monitoring | ☐ |
| 4 | Error path (retry, dead-letter) | ☐ |
| 5 | Logging (no sensitive data) | ☐ |
| 6 | Secrets rotation planned | ☐ |
| 7 | Data validation at entry | ☐ |
| 8 | Idempotency verified | ☐ |
| 9 | Runbook present | ☐ |
| 10 | 5 test cases passed | ☐ |
Rule: All 10 points must be green before go-live. No exceptions.
Mini Runbook Template
Workflow: [Name]
Owner: [Name + Contact]
Backup: [Name + Contact]
Trigger: [What starts the workflow?]
Expected frequency: [e.g., 5-20x/day]
Outputs:
- [Where does the data go?]
- [Which systems are updated?]
Error cases:
1) API not reachable
→ Retry 3x, then dead-letter + alert
2) Required field missing
→ Validation error, manual queue
3) Duplicate detected
→ Skip, log, no alert
Fallback (manual):
[How to continue working if the workflow fails completely?]
Last updated: [Date]
Next Step
Take your most important workflow and go through the 10 points. Where are the gaps?