Worked Example: Repairing a Deceptive Pipeline¶
Page Maps¶
graph LR
family["Reproducible Research"]
program["Deep Dive DVC"]
section["Truthful Pipelines Declared Dependencies"]
page["Worked Example: Repairing a Deceptive Pipeline"]
capstone["Capstone evidence"]
family --> program --> section --> page
page -.applies in.-> capstone
flowchart LR
orient["Orient on the page map"] --> read["Read the main claim and examples"]
read --> inspect["Inspect the related code, proof, or capstone surface"]
inspect --> verify["Run or review the verification path"]
verify --> apply["Apply the idea back to the module and capstone"]
This example shows how Module 04 fits together when a pipeline looks reasonable but the graph is not telling the full truth.
The goal is not to make the YAML longer. The goal is to make the result explainable.
The situation¶
A team has a small incident escalation pipeline:
stages:
prepare:
cmd: python -m incident_escalation_capstone.prepare
deps:
- data/raw/service_incidents.csv
outs:
- data/prepared/incidents.parquet
fit:
cmd: python -m incident_escalation_capstone.fit
deps:
- data/prepared/incidents.parquet
outs:
- models/escalation-model.json
evaluate:
cmd: python -m incident_escalation_capstone.evaluate
deps:
- models/escalation-model.json
outs:
- reports/evaluation.json
The graph looks tidy: prepare, fit, evaluate.
But you notice something uncomfortable. The evaluation report changes between two manual runs even though DVC does not see a reason to rerun evaluation.
That is the Module 04 alarm bell:
A tidy graph can still be deceptive.
Step 1: Read the command, not the stage name¶
The weak review says:
The stage is named
evaluate, so it probably evaluates the model.
The stronger review asks:
What does the command actually read?
You inspect evaluate.py and find these reads:
models/escalation-model.jsondata/prepared/incidents.parquetdata/reference/escalation_policy.csvparams.yamlkeyevaluate.threshold
Only one of those is declared. The stage can skip after a data or policy change because DVC only knows about the model dependency.
Step 2: Separate dependency from parameter¶
You do not dump everything into deps.
The prepared data and policy CSV are file reads, so they belong in deps.
The threshold is a reviewed control value, so it belongs in params.
The repaired evaluation stage becomes:
stages:
evaluate:
cmd: python -m incident_escalation_capstone.evaluate
deps:
- models/escalation-model.json
- data/prepared/incidents.parquet
- data/reference/escalation_policy.csv
- src/incident_escalation_capstone/evaluate.py
params:
- evaluate.threshold
outs:
- reports/evaluation.json
Now a reviewer can explain what should make evaluation stale.
Step 3: Check the producer-consumer edge¶
You now ask whether data/prepared/incidents.parquet is owned correctly.
The prepare stage already lists it in outs, and evaluate now lists it in deps.
That creates a real edge:
flowchart LR
raw["raw incidents"] --> prepare["prepare"]
prepare --> prepared["prepared incidents"]
prepared --> fit["fit"]
prepared --> evaluate["evaluate"]
policy["escalation policy"] --> evaluate
fit --> model["model"]
model --> evaluate
The graph now says: evaluation depends on the model, prepared incidents, and policy.
That is a better causal story than "evaluation depends on the model."
Step 4: Predict rerun behavior¶
Before running anything, you write predictions:
- if
evaluate.thresholdchanges, only evaluation should rerun - if
data/reference/escalation_policy.csvchanges, evaluation should rerun - if
data/raw/service_incidents.csvchanges and preparation changes its output, fitting and evaluation should rerun downstream - if
fit.model_familychanges, fitting should rerun and evaluation should rerun because the model output changed
This prediction is the learning moment. It turns dvc repro from a button into a check
against your mental graph.
Step 5: Compare against lock evidence¶
After running dvc repro, you inspect dvc.lock.
The useful question is not "did a lock file change?"
The useful questions are:
- does
evaluatenow record the policy CSV dependency? - does it record the threshold value?
- does it record the prepared dataset dependency?
- does the output hash update when the report really changes?
If yes, the repair is now visible in recorded evidence.
Step 6: Narrow broad declarations only after safety¶
During review, someone proposes:
That would make policy changes visible, but it would also make the stage rerun for unrelated files. The team rejects that as too broad because they now know the real read surface.
The final declaration is narrower and safer:
deps:
- models/escalation-model.json
- data/prepared/incidents.parquet
- data/reference/escalation_policy.csv
- src/incident_escalation_capstone/evaluate.py
params:
- evaluate.threshold
This is the best Module 04 outcome: not maximum YAML, but honest YAML.
The review note you would want¶
Evaluation was previously deceptive because the command read prepared incidents, a policy CSV, and a threshold value that were not declared on the stage. That created stale output risk: DVC could skip evaluation after a meaningful input or control change. The repair adds the missing file reads to
deps, adds the threshold toparams, and keeps the output ownership unchanged. The graph now explains reruns from declared state, anddvc.lockcan record the evidence needed for review.
That note is stronger than "fixed dvc.yaml" because it names the risk and the repair.
Why this is a mastery example¶
This one story exercises the whole module:
- Core 1: the stage contract was read as a promise, not a label
- Core 2: dependencies and parameters were placed differently
- Core 3: rerun predictions were checked against lock evidence
- Core 4: stale-output risk was prioritized before false rerun cleanup
- Core 5: the producer-consumer edge for the prepared data was made explicit
The pipeline became more trustworthy because the declared graph became closer to the real work.