Skip to content

Worked Example: Repairing a Deceptive Pipeline

Page Maps

graph LR
  family["Reproducible Research"]
  program["Deep Dive DVC"]
  section["Truthful Pipelines Declared Dependencies"]
  page["Worked Example: Repairing a Deceptive Pipeline"]
  capstone["Capstone evidence"]

  family --> program --> section --> page
  page -.applies in.-> capstone
flowchart LR
  orient["Orient on the page map"] --> read["Read the main claim and examples"]
  read --> inspect["Inspect the related code, proof, or capstone surface"]
  inspect --> verify["Run or review the verification path"]
  verify --> apply["Apply the idea back to the module and capstone"]

This example shows how Module 04 fits together when a pipeline looks reasonable but the graph is not telling the full truth.

The goal is not to make the YAML longer. The goal is to make the result explainable.

The situation

A team has a small incident escalation pipeline:

stages:
  prepare:
    cmd: python -m incident_escalation_capstone.prepare
    deps:
      - data/raw/service_incidents.csv
    outs:
      - data/prepared/incidents.parquet
  fit:
    cmd: python -m incident_escalation_capstone.fit
    deps:
      - data/prepared/incidents.parquet
    outs:
      - models/escalation-model.json
  evaluate:
    cmd: python -m incident_escalation_capstone.evaluate
    deps:
      - models/escalation-model.json
    outs:
      - reports/evaluation.json

The graph looks tidy: prepare, fit, evaluate.

But you notice something uncomfortable. The evaluation report changes between two manual runs even though DVC does not see a reason to rerun evaluation.

That is the Module 04 alarm bell:

A tidy graph can still be deceptive.

Step 1: Read the command, not the stage name

The weak review says:

The stage is named evaluate, so it probably evaluates the model.

The stronger review asks:

What does the command actually read?

You inspect evaluate.py and find these reads:

  • models/escalation-model.json
  • data/prepared/incidents.parquet
  • data/reference/escalation_policy.csv
  • params.yaml key evaluate.threshold

Only one of those is declared. The stage can skip after a data or policy change because DVC only knows about the model dependency.

Step 2: Separate dependency from parameter

You do not dump everything into deps.

The prepared data and policy CSV are file reads, so they belong in deps.

The threshold is a reviewed control value, so it belongs in params.

The repaired evaluation stage becomes:

stages:
  evaluate:
    cmd: python -m incident_escalation_capstone.evaluate
    deps:
      - models/escalation-model.json
      - data/prepared/incidents.parquet
      - data/reference/escalation_policy.csv
      - src/incident_escalation_capstone/evaluate.py
    params:
      - evaluate.threshold
    outs:
      - reports/evaluation.json

Now a reviewer can explain what should make evaluation stale.

Step 3: Check the producer-consumer edge

You now ask whether data/prepared/incidents.parquet is owned correctly.

The prepare stage already lists it in outs, and evaluate now lists it in deps. That creates a real edge:

flowchart LR
  raw["raw incidents"] --> prepare["prepare"]
  prepare --> prepared["prepared incidents"]
  prepared --> fit["fit"]
  prepared --> evaluate["evaluate"]
  policy["escalation policy"] --> evaluate
  fit --> model["model"]
  model --> evaluate

The graph now says: evaluation depends on the model, prepared incidents, and policy.

That is a better causal story than "evaluation depends on the model."

Step 4: Predict rerun behavior

Before running anything, you write predictions:

  • if evaluate.threshold changes, only evaluation should rerun
  • if data/reference/escalation_policy.csv changes, evaluation should rerun
  • if data/raw/service_incidents.csv changes and preparation changes its output, fitting and evaluation should rerun downstream
  • if fit.model_family changes, fitting should rerun and evaluation should rerun because the model output changed

This prediction is the learning moment. It turns dvc repro from a button into a check against your mental graph.

Step 5: Compare against lock evidence

After running dvc repro, you inspect dvc.lock.

The useful question is not "did a lock file change?"

The useful questions are:

  • does evaluate now record the policy CSV dependency?
  • does it record the threshold value?
  • does it record the prepared dataset dependency?
  • does the output hash update when the report really changes?

If yes, the repair is now visible in recorded evidence.

Step 6: Narrow broad declarations only after safety

During review, someone proposes:

deps:
  - data/

That would make policy changes visible, but it would also make the stage rerun for unrelated files. The team rejects that as too broad because they now know the real read surface.

The final declaration is narrower and safer:

deps:
  - models/escalation-model.json
  - data/prepared/incidents.parquet
  - data/reference/escalation_policy.csv
  - src/incident_escalation_capstone/evaluate.py
params:
  - evaluate.threshold

This is the best Module 04 outcome: not maximum YAML, but honest YAML.

The review note you would want

Evaluation was previously deceptive because the command read prepared incidents, a policy CSV, and a threshold value that were not declared on the stage. That created stale output risk: DVC could skip evaluation after a meaningful input or control change. The repair adds the missing file reads to deps, adds the threshold to params, and keeps the output ownership unchanged. The graph now explains reruns from declared state, and dvc.lock can record the evidence needed for review.

That note is stronger than "fixed dvc.yaml" because it names the risk and the repair.

Why this is a mastery example

This one story exercises the whole module:

  • Core 1: the stage contract was read as a promise, not a label
  • Core 2: dependencies and parameters were placed differently
  • Core 3: rerun predictions were checked against lock evidence
  • Core 4: stale-output risk was prioritized before false rerun cleanup
  • Core 5: the producer-consumer edge for the prepared data was made explicit

The pipeline became more trustworthy because the declared graph became closer to the real work.