DVC Repro, Staleness, and Lock Evidence¶
dvc repro is easiest to misunderstand when you treat it like a smarter shell script.
It is better to treat it as a graph reviewer: it compares the declared current state with the recorded previous state, then reruns the stages whose declaration no longer matches the evidence.
This is why Module 04 keeps pairing dvc.yaml and dvc.lock. One file declares the
pipeline. The other records what happened when that declaration was last executed.
The two files answer different questions¶
dvc.yaml answers:
- which stages exist
- which command each stage runs
- which files, directories, and parameters each stage declares
- which outputs each stage owns
dvc.lock answers:
- which dependency versions were recorded
- which parameter values were used
- which outputs were produced
- which command text was associated with the recorded run
Reading only dvc.yaml tells you the intended contract. Reading only dvc.lock tells
you recorded evidence without enough design context. Reading them together gives you a
reviewable execution story.
flowchart LR
yaml["dvc.yaml: declared graph"] --> repro["dvc repro comparison"]
lock["dvc.lock: recorded evidence"] --> repro
workspace["workspace state"] --> repro
repro --> skip["skip unchanged stage"]
repro --> run["rerun stale stage"]
run --> newlock["updated lock evidence"]
What staleness means¶
A stage is stale when its current declared state no longer matches the recorded lock evidence.
Common reasons include:
- a declared dependency changed content
- a declared parameter value changed
- the command text changed
- a declared output is missing
- an upstream stage changed and the downstream stage depends on its output
Notice the repeated word: declared.
DVC is comparing what it knows. If a command secretly reads an undeclared file, a change to that file does not automatically make the stage stale. That is not a mysterious DVC failure. It is a graph truth failure.
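The reasons above can be sketched as a small comparison function. This is a minimal illustration, not DVC's implementation: the function name `stale_reasons`, the dictionary shapes, the short placeholder hashes, and the file `config/undeclared.yaml` are all assumptions invented for the example.

```python
def stale_reasons(declared, recorded, workspace):
    """Return the reasons a stage looks stale (illustrative, not DVC's real logic).

    declared:  {"cmd": str, "deps": [path], "params": {name: value}, "outs": [path]}
    recorded:  {"cmd": str, "deps": {path: hash}, "params": {name: value}}
    workspace: {path: current hash} for files that exist right now
    """
    reasons = []
    if declared["cmd"] != recorded["cmd"]:
        reasons.append("command text changed")
    for path in declared["deps"]:
        # Only declared dependencies are ever inspected.
        if workspace.get(path) != recorded["deps"].get(path):
            reasons.append(f"dependency changed: {path}")
    for name, value in declared["params"].items():
        if recorded["params"].get(name) != value:
            reasons.append(f"parameter changed: {name}")
    for path in declared["outs"]:
        if path not in workspace:
            reasons.append(f"output missing: {path}")
    return reasons


declared = {
    "cmd": "python -m incident_escalation_capstone.fit",
    "deps": ["data/prepared/incidents.parquet"],
    "params": {"fit.model_family": "logistic_regression"},
    "outs": ["models/escalation-model.json"],
}
recorded = {
    "cmd": "python -m incident_escalation_capstone.fit",
    "deps": {"data/prepared/incidents.parquet": "aaa"},  # placeholder hash
    "params": {"fit.model_family": "logistic_regression"},
}
workspace = {
    "data/prepared/incidents.parquet": "aaa",
    "models/escalation-model.json": "bbb",
    # Hypothetical file the command reads without declaring it.
    "config/undeclared.yaml": "edited-many-times",
}

before = stale_reasons(declared, recorded, workspace)
workspace["data/prepared/incidents.parquet"] = "ccc"  # declared dependency changes
after = stale_reasons(declared, recorded, workspace)

print(before)  # []
print(after)   # ['dependency changed: data/prepared/incidents.parquet']
```

The undeclared file never appears in the result, however often it changes, because the comparison only walks what the stage declares.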
A small rerun story¶
Imagine this graph:
stages:
  prepare:
    cmd: python -m incident_escalation_capstone.prepare
    deps:
      - data/raw/service_incidents.csv
    params:
      - prepare.minimum_severity
    outs:
      - data/prepared/incidents.parquet
  fit:
    cmd: python -m incident_escalation_capstone.fit
    deps:
      - data/prepared/incidents.parquet
    params:
      - fit.model_family
    outs:
      - models/escalation-model.json
If prepare.minimum_severity changes, prepare becomes stale. If prepare produces a
new data/prepared/incidents.parquet, then fit becomes stale because its dependency
changed.
That downstream rerun is not noise. It is the graph doing the honest thing.
If fit.model_family changes, fit becomes stale, but prepare does not need to rerun.
The change belongs to model fitting, not preparation.
This is the kind of prediction you should make before running dvc repro.
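One way to make that prediction explicit is to walk the declared graph by hand before running anything. The sketch below hard-codes the two stages from the example. `predict_reruns` and the `STAGES` layout are hypothetical helpers, not part of DVC. It assumes that every stage that reruns produces changed outputs, which is the pessimistic prediction.

```python
# The two stages from the example graph, written as plain dictionaries.
STAGES = {
    "prepare": {
        "deps": ["data/raw/service_incidents.csv"],
        "params": ["prepare.minimum_severity"],
        "outs": ["data/prepared/incidents.parquet"],
    },
    "fit": {
        "deps": ["data/prepared/incidents.parquet"],
        "params": ["fit.model_family"],
        "outs": ["models/escalation-model.json"],
    },
}


def predict_reruns(stages, changed):
    """Predict which stages rerun, given changed dependency paths or parameter names.

    Assumption: a stage that reruns always produces changed outputs, so its
    downstream stages become stale too.
    """
    changed = set(changed)
    reruns = []
    progressed = True
    while progressed:  # repeat until no new stage is marked stale
        progressed = False
        for name, stage in stages.items():
            if name in reruns:
                continue
            inputs = set(stage["deps"]) | set(stage["params"])
            if inputs & changed:
                reruns.append(name)
                changed |= set(stage["outs"])  # outputs count as changed inputs downstream
                progressed = True
    return reruns


print(predict_reruns(STAGES, {"prepare.minimum_severity"}))  # ['prepare', 'fit']
print(predict_reruns(STAGES, {"fit.model_family"}))          # ['fit']
```

In practice a rerun stage can produce byte-identical output, in which case the downstream stage may not need to rerun. The sketch deliberately ignores that case to keep the prediction simple.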
What the lock file records¶
A simplified lock excerpt might look like this:
stages:
  fit:
    cmd: python -m incident_escalation_capstone.fit
    deps:
    - path: data/prepared/incidents.parquet
      hash: md5
      md5: 7d793037a0760186574b0282f2f435e7
    params:
      params.yaml:
        fit.model_family: logistic_regression
    outs:
    - path: models/escalation-model.json
      hash: md5
      md5: 0c4d6a2f1eb4c6e4f2e2a2f2b913f5c8
The exact lock schema can vary by DVC version and output type. The lesson is more stable than the formatting:
- dependencies have recorded content identity
- parameters have recorded values
- outputs have recorded content identity
- the command is part of the recorded stage evidence
Treating dvc.lock as disposable generated clutter breaks the review chain. Without it,
the repository keeps a declaration but loses the recorded result of that declaration.
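To see what "recorded content identity" means in practice, the sketch below hashes a file and compares it with a lock-style entry. It assumes a single-file entry with a plain `md5` field, as in the simplified excerpt. The helpers `file_md5` and `matches_lock_entry` are hypothetical names, and the demo uses a throwaway file rather than a real pipeline output.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file's bytes, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_lock_entry(entry, root="."):
    """Compare a lock-style entry {"path": ..., "md5": ...} with the file on disk."""
    target = Path(root) / entry["path"]
    return target.is_file() and file_md5(target) == entry["md5"]


# Demo: record a hash, then change the file and compare again.
with tempfile.TemporaryDirectory() as root:
    path = Path(root) / "model.json"
    path.write_bytes(b'{"family": "logistic_regression"}')
    entry = {"path": "model.json", "md5": file_md5(path)}  # the "recorded" evidence

    unchanged = matches_lock_entry(entry, root)
    path.write_bytes(b'{"family": "gradient_boosting"}')
    changed = matches_lock_entry(entry, root)

print(unchanged, changed)  # True False
```

Real DVC hashing has cases this sketch ignores, for example directory outputs, so treat it as an illustration of the idea rather than a replacement for DVC's own checks.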
Why prediction matters¶
Before running dvc repro, make a prediction:
- which stage should rerun
- which stage should stay unchanged
- which downstream stages should rerun because their inputs changed
- which lock evidence should update afterward
Then compare the prediction with the actual result.
If a stage reruns unexpectedly, the graph may be broader than you thought. If a stage does not rerun when it should, the graph may be missing a real influence. Both outcomes are useful evidence.
The point is not to memorize DVC output text. The point is to make your mental graph meet the declared graph.
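The comparison itself can be as simple as two set differences. In this sketch, `compare_prediction` is a hypothetical helper, and the `actual` list is assumed to be typed in by hand after watching which stages ran.

```python
def compare_prediction(predicted, actual):
    """Split stage names into confirmed, unexpected, and missing reruns."""
    predicted, actual = set(predicted), set(actual)
    return {
        # Reran without being predicted: the graph may be broader than you thought.
        "unexpected_reruns": sorted(actual - predicted),
        # Predicted but did not rerun: the graph may be missing a real influence.
        "missing_reruns": sorted(predicted - actual),
        "confirmed": sorted(predicted & actual),
    }


result = compare_prediction(predicted=["fit"], actual=["prepare", "fit"])
print(result)
# {'unexpected_reruns': ['prepare'], 'missing_reruns': [], 'confirmed': ['fit']}
```

An empty `unexpected_reruns` and `missing_reruns` means the mental graph and the declared graph agreed for this change.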
Review checkpoint¶
You understand this core when you can answer:
- what dvc.yaml declares
- what dvc.lock records
- why a declared dependency or parameter change makes a stage stale
- why undeclared influence is invisible to dvc repro
- why downstream reruns can be evidence of a correct graph
- why lock evidence should be committed with the declaration it records
Once that is clear, dvc repro stops feeling mystical. It becomes a consistency check
between declaration, workspace state, and recorded evidence.