SQL Developer
Advanced ~3h

Pipeline Outage Triage

A nightly ETL job failed. Find out why and unblock reporting.

The situation

The orders_fact rebuild failed overnight. Six downstream dashboards are stale and the Sales team is asking why the morning numbers are wrong.

Context

  • The job failed during the dedupe step.
  • A schema change shipped to source_orders two days ago.
  • There is a manual rerun path but it takes 90 minutes.
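
One way to connect the first two context points is to profile the status values in source_orders against the set the pipeline was written for. The sketch below is only an illustration: the column name `order_status`, the `KNOWN_STATUSES` set, and the sample rows are all assumptions, and an in-memory SQLite database stands in for the real warehouse.

```python
import sqlite3

# Hypothetical: the status values the pipeline was originally written against.
KNOWN_STATUSES = {"pending", "shipped", "cancelled"}


def unexpected_statuses(conn):
    """Return {status: row_count} for status values outside KNOWN_STATUSES."""
    rows = conn.execute(
        "SELECT order_status, COUNT(*) FROM source_orders GROUP BY order_status"
    ).fetchall()
    return {status: n for status, n in rows if status not in KNOWN_STATUSES}


# Made-up sample data standing in for the real source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (order_id INTEGER, order_status TEXT)")
conn.executemany(
    "INSERT INTO source_orders VALUES (?, ?)",
    [(1, "pending"), (2, "shipped"), (3, "SHIPPED_V2"), (4, "SHIPPED_V2"), (5, None)],
)

# Any value reported here (including NULL) is one the old logic may not handle.
print(unexpected_statuses(conn))
```

A read-only check like this touches no data, so it is safe to run before deciding on a fix.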

Your objectives

  • Diagnose the failure without losing data.
  • Restore the pipeline before 11 AM.
  • Write a short postmortem with prevention steps.

Phases

  1. Read the failure

    Inspect logs and the failing step.

  2. Patch and rerun

    Apply a safe fix and validate row counts.

  3. Postmortem

    Write a short postmortem with prevention steps.

Tasks

  • Read the dedupe step failure log
    Today, 08:30 AM
  • Diff the source_orders schema change
    Today, 09:00 AM
  • Write a guarded patch + dry-run
    Today, 09:45 AM
  • Rerun pipeline and validate counts
    Today, 11:00 AM
  • Draft postmortem with prevention
    Today, 04:00 PM
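
The "guarded patch + dry-run" task could be approached by running the dedupe inside a transaction, recording the row counts, and rolling back so nothing is changed. This is a minimal sketch under stated assumptions: the table `orders_staging`, the keep-the-latest-row rule, and the sample rows are hypothetical, and the real job's dedupe logic may differ.

```python
import sqlite3

# Hypothetical dedupe rule: keep the most recently inserted row per order_id.
DEDUPE_SQL = """
DELETE FROM orders_staging
WHERE rowid NOT IN (
    SELECT MAX(rowid) FROM orders_staging GROUP BY order_id
)
"""


def dry_run_dedupe(conn):
    """Run the dedupe in a transaction, return (before, after), then roll back."""
    before = conn.execute("SELECT COUNT(*) FROM orders_staging").fetchone()[0]
    conn.execute("BEGIN")
    try:
        conn.execute(DEDUPE_SQL)
        after = conn.execute("SELECT COUNT(*) FROM orders_staging").fetchone()[0]
    finally:
        # Always undo the delete; a dry run must not change the table.
        conn.execute("ROLLBACK")
    return before, after


# isolation_level=None puts sqlite3 in autocommit mode so BEGIN/ROLLBACK are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE orders_staging (order_id INTEGER, loaded_at TEXT)")
conn.executemany(
    "INSERT INTO orders_staging VALUES (?, ?)",
    [(1, "day-1"), (1, "day-2"), (2, "day-1"), (3, "day-1")],  # made-up rows
)

before, after = dry_run_dedupe(conn)
print(f"rows before: {before}, rows after dedupe: {after}")
```

Comparing the two counts shows how many rows the patch would remove before it is applied for real.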

Inbox for this scenario


Sameer Kapoor · Sales Ops

08:05 AM

Morning dashboards are wrong

Pipeline is showing yesterday's numbers. Reps are pushing back on their targets — when will this be fixed?

Urgent

Devika Varma · Data Platform Lead

07:30 AM

Failure in orders_fact rebuild

Job failed at the dedupe step around 02:14 AM. I've paused the downstream DAGs — over to you.

High

Ravi Bose · Source Systems

Yesterday

FYI — schema change on source_orders

We shipped order_status_v2 two days ago. Old enum values are still valid for a deprecation window.

FYI

Success criteria

  • Pipeline succeeds on the next run.
  • No double-counted orders in downstream tables.
  • Postmortem reviewed by the data platform lead.
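
The second criterion can be checked with a grouped count after the rerun. The sketch below assumes, hypothetically, that orders_fact holds exactly one row per `order_id` and that source_orders shares that key; neither is confirmed by the brief, and the sample rows are made up.

```python
import sqlite3


def double_counted_orders(conn):
    """Return (order_id, row_count) for every order appearing more than once."""
    return conn.execute(
        "SELECT order_id, COUNT(*) FROM orders_fact "
        "GROUP BY order_id HAVING COUNT(*) > 1 ORDER BY order_id"
    ).fetchall()


def counts_match(conn):
    """Compare distinct source orders with the number of fact rows."""
    source = conn.execute(
        "SELECT COUNT(DISTINCT order_id) FROM source_orders"
    ).fetchone()[0]
    fact = conn.execute("SELECT COUNT(*) FROM orders_fact").fetchone()[0]
    return source == fact


# Made-up sample data: the source holds a duplicate that the rebuild removed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (order_id INTEGER)")
conn.execute("CREATE TABLE orders_fact (order_id INTEGER)")
conn.executemany("INSERT INTO source_orders VALUES (?)", [(1,), (2,), (2,), (3,)])
conn.executemany("INSERT INTO orders_fact VALUES (?)", [(1,), (2,), (3,)])

print(double_counted_orders(conn), counts_match(conn))
```

An empty list from the first check, together with matching counts, is a reasonable signal to resume the paused downstream DAGs.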

Stakeholders

  • Devika Varma

    Data Platform Lead

    neutral

  • Sameer Kapoor

    Sales Ops

    tense

Deliverables

  • Fix PR

    pending

    Minimal, reviewed patch to the pipeline.

  • Postmortem

    pending

    What happened, why, and how to prevent it.

Competencies assessed

  • Debugging · Weight 40%
  • Risk Awareness · Weight 30%
  • Written Communication · Weight 30%

Tools

  • SQL Workbench
  • Schema Browser
  • Pipeline Console