Pipeline Outage Triage
A nightly ETL job failed — find why and unblock reporting.
The situation
The orders_fact rebuild failed overnight. Six downstream dashboards are stale and the Sales team is asking why the morning numbers are wrong.
Context
- The job failed during the dedupe step.
- A schema change shipped to source_orders two days ago.
- There is a manual rerun path but it takes 90 minutes.
Your objectives
- Diagnose the failure without losing data.
- Restore the pipeline before 11 AM.
- Write a short postmortem with prevention steps.
Phases
- Read the failure: Inspect logs and the failing step.
- Patch and rerun: Apply a safe fix and validate.
- Postmortem: Short writeup with prevention.
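One way the "Patch and rerun" phase could look is sketched below. The brief does not state the root cause, so this assumes the dedupe step is hitting status values it does not recognise. The names `dedupe_orders`, `run`, `KNOWN_STATUSES`, and the `order_status` and `updated_at` fields are all hypothetical. Unrecognised rows are set aside rather than dropped, in line with the "without losing data" objective, and `dry_run=True` reports counts without writing anything.

```python
from dataclasses import dataclass, field

# Hypothetical status values; the real enum lives in source_orders.
KNOWN_STATUSES = frozenset({"open", "shipped", "cancelled"})


@dataclass
class DedupeResult:
    kept: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)


def dedupe_orders(rows, known_statuses=KNOWN_STATUSES):
    """Keep the latest row per order_id; set aside rows with unknown statuses."""
    latest = {}
    result = DedupeResult()
    for row in rows:
        # Guard: never crash on (or silently drop) an unrecognised status.
        if row.get("order_status") not in known_statuses:
            result.quarantined.append(row)
            continue
        current = latest.get(row["order_id"])
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row["order_id"]] = row
    result.kept = list(latest.values())
    return result


def run(rows, write, dry_run=True):
    """Dedupe and report counts; call write() only when dry_run is False."""
    result = dedupe_orders(rows)
    if not dry_run:
        write(result.kept)
    return {
        "input": len(rows),
        "kept": len(result.kept),
        "quarantined": len(result.quarantined),
    }
```

Running with `dry_run=True` first lets the counts be reviewed before the 90-minute rerun is committed to.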
Tasks
- Read the dedupe step failure log (Today, 08:30 AM)
- Diff the source_orders schema change (Today, 09:00 AM)
- Write a guarded patch + dry-run (Today, 09:45 AM)
- Rerun pipeline and validate counts (Today, 11:00 AM)
- Draft postmortem with prevention (Today, 04:00 PM)
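For the schema diff task, a small comparison helper may be enough. The sketch below assumes the before and after schemas are available as column-to-type mappings. The function `diff_schema` and the example columns and types are hypothetical, and treating `order_status_v2` as a new column is an assumption, since the brief only says it shipped.

```python
def diff_schema(old: dict[str, str], new: dict[str, str]) -> dict[str, list]:
    """Report columns added, removed, or retyped between two schema snapshots."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        # (column, old_type, new_type) for columns whose type changed
        "changed": sorted((c, old[c], new[c]) for c in shared if old[c] != new[c]),
    }


# Hypothetical snapshots of source_orders before and after the change.
old_schema = {"order_id": "bigint", "order_status": "varchar"}
new_schema = {
    "order_id": "bigint",
    "order_status": "varchar",
    "order_status_v2": "varchar",  # assumed to be a new column
}

report = diff_schema(old_schema, new_schema)
print(report)
```

Any entry under "added" or "changed" is a candidate for what the dedupe step could not handle.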
Inbox for this scenario
Sameer Kapoor · Sales Ops
08:05 AM · Morning dashboards are wrong
Pipeline is showing yesterday's numbers. Reps are pushing back on their targets — when will this be fixed?
Devika Varma · Data Platform Lead
07:30 AM · Failure in orders_fact rebuild
Job failed at the dedupe step around 02:14 AM. I've paused the downstream DAGs — over to you.
Ravi Bose · Source Systems
Yesterday · FYI: schema change on source_orders
We shipped order_status_v2 two days ago. Old enum values are still valid for a deprecation window.
Success criteria
- Pipeline succeeds on the next run.
- No double-counted orders in downstream tables.
- Postmortem reviewed by the data platform lead.
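The "no double-counted orders" criterion can be checked mechanically after the rerun. The sketch below assumes rows are available as dictionaries keyed by a unique `order_id`. That key name and the helpers `find_double_counted` and `counts_match` are hypothetical, and the real check would more likely run as a query against the warehouse.

```python
from collections import Counter


def find_double_counted(rows, key="order_id"):
    """Return {key_value: occurrences} for every key that appears more than once."""
    counts = Counter(row[key] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}


def counts_match(source_rows, fact_rows, key="order_id"):
    """True when the fact table has exactly one row per distinct source key."""
    source_keys = {row[key] for row in source_rows}
    fact_keys = [row[key] for row in fact_rows]
    return len(fact_keys) == len(source_keys) and set(fact_keys) == source_keys
```

An empty result from `find_double_counted` together with `counts_match` returning `True` is reasonable evidence that the rebuilt table neither duplicated nor lost orders.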
Stakeholders
- Devika Varma · Data Platform Lead (neutral)
- Sameer Kapoor · Sales Ops (tense)
Deliverables
- Fix PR (pending): Minimal, reviewed patch to the pipeline.
- Postmortem (pending): What happened, why, and how to prevent it.
Competencies assessed
- Debugging (weight 40%)
- Risk Awareness (weight 30%)
- Written Communication (weight 30%)