Module 04: Truthful Pipelines and Declared Dependencies¶
Page Maps¶
graph LR
family["Reproducible Research"]
program["Deep Dive DVC"]
section["Truthful Pipelines Declared Dependencies"]
page["Module 04: Truthful Pipelines and Declared Dependencies"]
capstone["Capstone evidence"]
family --> program --> section --> page
page -.applies in.-> capstone
flowchart LR
orient["Orient on the page map"] --> read["Read the main claim and examples"]
read --> inspect["Inspect the related code, proof, or capstone surface"]
inspect --> verify["Run or review the verification path"]
verify --> apply["Apply the idea back to the module and capstone"]
Module 04 turns DVC from a storage story into an execution story.
By now, you already know two important facts:
- data identity should come from content, not path names
- runtime can influence a result even when code and data look unchanged
The next question is sharper:
can the pipeline itself tell the truth about why a result exists?
A DVC pipeline is only reviewable when its declared graph matches the real work. If a script reads a file that is not listed, writes an output that is not owned, or uses a control value that is hidden inside code, DVC can still run. The danger is that it may run for reasons the team cannot explain, or skip work when the result is already stale.
This module is about replacing "the script usually works" with a declared execution graph that a reviewer can inspect, challenge, and reproduce.
The capstone corroboration surface for this module is the set of files that show declared
execution truth: capstone/dvc.yaml, capstone/dvc.lock, capstone/params.yaml,
capstone/docs/stage-contract-guide.md, capstone/docs/experiment-guide.md, and
the make -C capstone verify route.
Why this module exists¶
Many pipelines fail quietly because the graph is incomplete.
Common examples look ordinary at first:
- a preparation script reads a lookup table that is not in
deps - a model script uses a threshold literal instead of a value in
params.yaml - an evaluation command writes a report but only the metric file is declared as an output
- a shared intermediate directory is reused by two commands without a clear owner
- a stage reruns so often that the team stops investigating why
These are not cosmetic problems. They decide whether dvc repro has enough declared
information to make a correct decision.
The point of Module 04 is not to memorize every DVC field. The point is to learn the review habit:
If this result changed, could I explain the change from declared dependencies, parameters, outputs, command text, and recorded lock evidence?
If the answer is no, the pipeline is not yet truthful enough.
Study route¶
flowchart LR
overview["Overview"] --> core1["Core 1: stage contract"]
core1 --> core2["Core 2: dependency, parameter, output boundaries"]
core2 --> core3["Core 3: staleness and lock evidence"]
core3 --> core4["Core 4: false reruns and stale outputs"]
core4 --> core5["Core 5: refactoring and shared outputs"]
core5 --> example["Worked example"]
example --> practice["Exercises and answers"]
practice --> glossary["Glossary"]
Read the module in that order the first time.
If the problem is already partly clear, use this shortcut:
- open Core 1 when the main confusion is "what is a DVC stage promising?"
- open Core 2 when the main confusion is "what belongs in
deps,params, orouts?" - open Core 3 when the main confusion is "why did
dvc reprorerun or skip?" - open Core 4 when the main confusion is "which failure is merely annoying and which is dangerous?"
- open Core 5 when the main confusion is "how do I change pipeline shape without losing provenance?"
Module map¶
| Page | Purpose |
|---|---|
| Overview | explains the module promise and study route |
| Stage Contracts and Declared Truth | teaches the promise each DVC stage makes |
| Dependency, Parameter, and Output Boundaries | teaches how to place inputs, controls, and outputs in the graph |
| DVC Repro, Staleness, and Lock Evidence | teaches how dvc repro decides and how dvc.lock records evidence |
| False Reruns and Stale Outputs | teaches how to distinguish wasteful reruns from dangerous stale results |
| Safe Pipeline Refactoring and Shared Outputs | teaches graph change, shared intermediates, and multi-output care |
| Worked Example: Repairing a Deceptive Pipeline | walks through one realistic pipeline review and repair |
| Exercises | gives five mastery exercises |
| Exercise Answers | explains model answers and review logic |
| Glossary | keeps the module vocabulary stable |
What should be clear by the end¶
By the end of this module, you should be able to explain:
- what makes a DVC stage truthful rather than merely convenient
- how to decide whether a file belongs in
deps,outs, or neither - how parameter declaration turns control values into reviewable change
- how
dvc reprouses declared state and lock evidence to decide what reruns - why stale outputs are more dangerous than extra reruns
- how to refactor a pipeline without breaking the provenance story
Commands to keep close¶
These commands form the evidence loop for Module 04:
Use make -C capstone verify and make -C capstone walkthrough from the capstone root
when you want the course-provided review route. Use dvc dag, dvc status, and
dvc repro inside a DVC workspace when you want to inspect the declared graph and rerun
decision directly.
Capstone route¶
Use the capstone after you can describe a single stage in plain language.
Best corroboration surfaces for this module:
capstone/dvc.yamlcapstone/dvc.lockcapstone/params.yamlcapstone/docs/stage-contract-guide.mdcapstone/docs/experiment-guide.mdcapstone/docs/publish-contract.md
Useful proof route:
The point of that route is not to trust the graph because it exists. It is to practice asking whether each declared edge matches the real read, write, and control behavior of the pipeline.