Module 04: Truthful Pipelines and Declared Dependencies¶
Module Position¶
```mermaid
flowchart TD
    family["Reproducible Research"] --> program["Deep Dive DVC"]
    program --> module["Module 04: Truthful Pipelines and Declared Dependencies"]
    module --> lessons["Lesson pages and worked examples"]
    module --> checkpoints["Exercises and closing criteria"]
    module --> capstone["Related capstone evidence"]
```
```mermaid
flowchart TD
    purpose["Start with the module purpose and main questions"] --> lesson_map["Use the lesson map to choose reading order"]
    lesson_map --> study["Read the lessons and examples with one review question in mind"]
    study --> proof["Test the idea with exercises and capstone checkpoints"]
    proof --> close["Move on only when the closing criteria feel concrete"]
```
Read the first diagram as a placement map: this page sits between the course promise, the lesson pages listed below, and the capstone surfaces that pressure-test the module. Read the second diagram as the study route for this page: both diagrams point you toward the Lesson map, Exercises, and Closing criteria rather than acting as decoration.
From scripts that “usually work” to execution graphs that can be reasoned about
Purpose of this Module¶
This module turns state literacy into execution literacy. Once identity and environment boundaries are clear, the next question is whether the pipeline tells the truth about what actually influences each stage.
Use this module to learn what a truthful DVC graph looks like: declared dependencies, declared outputs, declared parameters, and rerun behavior that follows those declarations instead of private assumptions. If that graph is not truthful, later metrics and promotion rules will be defending the wrong state story.
At a Glance¶
| Focus | Learner question | Capstone timing |
|---|---|---|
| truthful stages | "Why did this stage rerun, or why did it not?" | use the capstone heavily after the state model is already clear |
| dependency declaration | "Which inputs are strong enough to belong in the graph?" | compare dvc.yaml and dvc.lock carefully |
| operational trust | "When does dvc repro become predictable instead of mystical?" | inspect stage boundaries, not just stage names |
Learning outcomes¶
- explain what makes a DVC stage truthful rather than merely convenient
- identify which reads, writes, params, and outputs belong in the declared graph
- use `dvc.yaml` and `dvc.lock` together as evidence of declared state transitions
Verification route¶
- Inspect `capstone/dvc.yaml`, `capstone/dvc.lock`, and `capstone/params.yaml` together instead of reading any one file in isolation.
- Run `make PROGRAM=reproducible-research/deep-dive-dvc capstone-verify-report` once you can already predict why the pipeline should rerun or stay stable.
- Explain one stage in terms of its declared dependencies, parameters, outputs, and recorded lock evidence before moving on.
Why this module matters in the course¶
This is where the course turns state identity into executable truth. Once a team can name state correctly, the next question is whether its pipeline graph tells the truth about how that state changes.
That matters because `dvc repro` is not magic. It can only make correct decisions from the dependencies, parameters, commands, and outputs that the repository declares. If the graph lies, DVC will behave consistently and still give the wrong operational result.
Questions this module should answer¶
By the end of the module, you should be able to answer:
- What makes a stage truthful rather than merely convenient?
- Which inputs are strong enough to belong in `deps` or `params`?
- What kinds of hidden reads or writes make a pipeline deceptive?
- Why is `dvc.lock` evidence of a graph execution rather than just a generated file?
Those answers are the bridge between "the repository runs" and "the repository is reviewable."
This module should make pipeline behavior more explainable, not merely more automated.
What to inspect in the capstone¶
Keep the capstone open while reading this module and inspect:
- `dvc.yaml` as the declared graph
- `params.yaml` as the control surface that should trigger meaningful change
- `dvc.lock` as the recorded consequence of the declared graph
- `publish/v1/` as the stable output boundary that downstream consumers should trust
Ask a hard question while you inspect them: if one declared edge disappeared, which wrong result would become possible without an obvious crash?
4.1 The Core Issue: Implicit Dependencies in Scripts¶
ML pipelines typically originate as a handful of sequential scripts run in order.
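The original listing is not reproduced here; as a minimal, hypothetical stand-in (the file paths and the `BATCH_SIZE` constant are invented for illustration), such a script might look like:

```python
# preprocess.py -- a hypothetical "usually works" script with implicit dependencies
import os

RAW_PATH = "data/raw.csv"   # input path hard-coded inside the script: undeclared
BATCH_SIZE = 32             # parameter buried in code, invisible to any tool

def preprocess() -> None:
    with open(RAW_PATH) as f:                     # a read nobody registered
        rows = f.readlines()
    os.makedirs("data", exist_ok=True)
    with open("data/processed.csv", "w") as out:  # a write nobody registered
        out.writelines(rows[:BATCH_SIZE])
```

The script runs fine, but every one of its inputs, outputs, and parameters is invisible from the outside.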
These scripts work well enough in isolation, but they lie by omission: dependencies go unstated, inputs are read without being declared, outputs are written without being registered, and parameters hide inside code or global variables. The script knows what it needs; the surrounding system does not, which makes the pipeline opaque and unreliable.
DVC pipelines fix this by making dependencies explicit and inspectable, turning ad-hoc runs into structured, auditable executions.
Illustration:
```mermaid
graph LR
    script["Script Execution"]
    omissions["Implicit Dependencies<br/>Undeclared Inputs<br/>Unregistered Outputs<br/>Hidden Parameters"]
    dvc["DVC Pipeline"]
    declarations["Explicit Declarations<br/>Inspectable Graph<br/>Auditable Execution"]
    script --> omissions
    dvc --> declarations
```
4.2 Formal Definition of a DVC Stage¶
A DVC stage is a functional declaration: given these specified inputs, this command yields these designated outputs. Formally, it comprises:
- deps: Files or directories consumed during execution.
- params: Configuration values extracted from tracked sources (e.g., `params.yaml`).
- cmd: The command invoked for processing.
- outs: Files or directories produced.
Nothing else may influence the outcome; any undeclared influence is a violation that makes the pipeline deceptive.
Example Stage Definition (from `dvc.yaml`):

```yaml
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - data/raw.csv
    params:
      - batch_size        # parameter key only; its value lives in params.yaml
    outs:
      - data/processed.csv
```
This structure enforces transparency, ensuring all elements are traceable.
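The `batch_size` key in the stage's `params` list refers to a value tracked in `params.yaml`; a matching file (the value here is illustrative) might look like:

```yaml
# params.yaml -- the tracked control surface for the stage above
batch_size: 32
```

Editing `batch_size` here is then a declared change: `dvc repro` can see it and rerun the stage.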
4.3 The Directed Acyclic Graph as the Execution Contract¶
Interconnected stages form a DAG in DVC: nodes represent stages, edges denote declared dependencies, and direction indicates causality. This graph is more than a visualization; it is the binding execution contract that governs what DVC runs, skips, and marks stale.
If the DAG is wrong, DVC still behaves consistently, and consistently produces the wrong result. That is why accuracy is imperative.
Illustration:
```mermaid
graph LR
    preprocess["Preprocess"]
    train["Train"]
    evaluate["Evaluate"]
    preprocess -- processed.csv --> train
    train -- model.pkl --> evaluate
```
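The same graph can be written down as declared stages. The sketch below mirrors the diagram's node and edge names; script and file names are illustrative:

```yaml
stages:
  preprocess:
    cmd: python preprocess.py
    outs:
      - data/processed.csv      # edge: preprocess -> train
  train:
    cmd: python train.py
    deps:
      - data/processed.csv
    outs:
      - model.pkl               # edge: train -> evaluate
  evaluate:
    cmd: python evaluate.py
    deps:
      - model.pkl
    outs:
      - metrics.json
```

Each edge in the diagram exists only because an output of one stage is declared as a dependency of the next.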
4.4 Categories of Pipeline Failures¶
Failures come in two primary forms, each with distinct causes and consequences.
False Positives (Excessive Rebuilds)¶
Stages rerun even though nothing relevant changed. Common causes include over-broad dependency declarations, coarse input specifications (e.g., depending on a whole directory), and parameter links a stage does not actually use. The cost is wasted compute, slower iteration, and frustration, but these failures are benign: results remain correct.
False Negatives (Stale Outputs)¶
Stages fail to rerun even though relevant inputs changed, because of omitted dependencies, undeclared file accesses, hidden parameters, or environmental leaks. The consequences are severe: wrong results, undetected data corruption, and conclusions built on stale outputs. DVC's design deliberately favors false positives to avert these catastrophic lapses.
Comparative Table:
| Failure Type | Causes | Costs |
|---|---|---|
| False Positives | Over-declaration, coarse inputs | Inefficiency, delays |
| False Negatives | Omissions, leaks | Errors, corruption |
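As a hypothetical sketch of a false negative in the making: suppose `preprocess.py` also reads a `config.json` at run time, but the stage declares only the CSV. Edits to `config.json` will then never trigger a rerun:

```yaml
stages:
  preprocess:
    cmd: python preprocess.py    # internally also reads config.json
    deps:
      - data/raw.csv             # the only declared input
      # config.json is missing here -- a hidden input and a false-negative risk
    outs:
      - data/processed.csv
```

Nothing crashes; the pipeline simply stops telling the truth about what influences its output.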
4.5 Mechanics of dvc repro Decision-Making¶
`dvc repro` leaves no room for ambiguity; it follows a strict protocol:
- Traverse the DAG in topological order.
- For each stage: hash the declared dependencies, read the tracked parameters, and compare both against the state recorded in `dvc.lock`.
- If anything diverges, mark the stage stale and queue it for execution.
- Propagate staleness to downstream stages.
There is no speculation in this process: a stage reruns only because a declared input changed, and it is skipped only because its declarations did not change.
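The protocol above can be sketched in a few lines of Python. This is a simplified model, not DVC's implementation: real DVC hashes file contents and reads `dvc.lock`, while here both sides are plain dicts mapping each stage to a single combined hash of its declared deps and params:

```python
def topo_order(dag):
    """Simple DFS topological sort over the declared graph (parents first)."""
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for up in dag.get(node, []):
            visit(up)
        order.append(node)
    for node in dag:
        visit(node)
    return order

def stale_stages(dag, current_hashes, lock_hashes):
    """Return the set of stages that must rerun.

    dag: stage -> list of upstream stages it depends on
    current_hashes / lock_hashes: stage -> hash of its declared deps+params
    A stage is stale if its own declared inputs changed, or if any
    upstream stage is stale (staleness cascades downstream).
    """
    stale = set()
    for stage in topo_order(dag):
        changed = current_hashes.get(stage) != lock_hashes.get(stage)
        upstream_stale = any(up in stale for up in dag.get(stage, []))
        if changed or upstream_stale:
            stale.add(stage)
    return stale
```

Note how staleness has exactly two sources: a stage's own declared inputs diverging from the lock, or an upstream stage already being stale.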
Example Command Execution (Illustrative output):
```shell
$ dvc repro
Stage 'preprocess' didn't change, skipping
Stage 'train' changed, reproducing...
Running command: python train.py
Stage 'evaluate' is downstream of changed stages, reproducing...
```
4.6 Significance of dvc.lock as Evidentiary Artifact¶
`dvc.lock` is not ancillary metadata; it is a verifiable record of exactly which dependency hashes, parameter values, and output artifacts a run used and produced. It answers the question "what actually ran, and with which inputs?" Deleting or ignoring it destroys the ability to reason about history; versioning `dvc.yaml` without `dvc.lock` records intentions, not what actually happened.
Sample `dvc.lock` Excerpt:

```yaml
stages:
  train:
    cmd: python train.py
    deps:
      - path: data/processed.csv
        md5: abcdef1234567890
    params:
      params.yaml:
        learning_rate: 0.01
    outs:
      - path: model.pkl
        md5: 0987654321fedcba
```
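Treating the lock as evidence can be as direct as re-hashing a workspace file and comparing it to the recorded value. A sketch of that check (DVC performs this internally; the file name used below is illustrative):

```python
import hashlib

def md5_of(path: str) -> str:
    """Hash file contents, chunk by chunk, as an md5 hex digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_lock(path: str, recorded_md5: str) -> bool:
    """True if the workspace file still matches the hash recorded in the lock."""
    return md5_of(path) == recorded_md5
```

If the check fails, the workspace has drifted from the last recorded execution, and the lock tells you exactly which artifact no longer matches.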
4.7 Handling Shared Intermediates and Multi-Output Configurations¶
Real pipelines are not linear: shared intermediates, multi-output stages, and fan-in/fan-out topologies are the norm, not the exception. A truthful DAG declares every intermediate explicitly, avoids concealed temporary files, and never repurposes an artifact that was not declared. A file that merely happens to be present is a defect signal.
4.8 Safe Pipeline Refactoring with Preserved Provenance¶
Truthful DAGs make refactoring safe: you can rename stages, restructure directories, or split and merge steps, provided dependencies remain declared, outputs still align, and hashes verify. Path changes do not fracture identity; undeclared dependencies do. This is DVC's content-centric model enabling robust evolution.
4.9 Failure Modes and Interpretations¶
| Symptom | Interpretation |
|---|---|
| Unexpected stage rerun | Declared input modification |
| Omitted rerun despite necessity | Dependency omission |
| Downstream staleness | Proper propagation |
| Deletion-induced breakage | Overreliance on workspace |
Treat these as diagnostic signals, not setbacks.
4.10 Predictive Exercise¶
Prior to execution:
1. Modify a single file or parameter.
2. Document anticipated reruns and omissions.
3. Invoke `dvc repro`.
4. Contrast predictions with observations.
Disparities point to inaccuracies in the DAG or gaps in your mental model, and both are instructive.
Guidance: use a minimal pipeline so mistakes stay cheap, and write down each discrepancy to refine your understanding.
4.11 Core Conceptual Framework¶
Pipelines are more than scripts; they are executable assertions about causality.
If you cannot articulate why a stage ran, the system has failed.
Module 04: Invariants Checklist¶
Confirm:
- Every influence on every stage is declared.
- False negatives have been eliminated.
- False positives are understood and tolerated.
- `dvc.lock` is authoritative.
- Pipeline behavior is predictable.
Resolve any remaining uncertainty by fixing the DAG before moving on.
Transition to Module 05¶
You can now identify data, govern environments, and execute truthful pipelines, but a profound challenge persists: what do the results signify? Metrics fluctuate, parameters shift, and visualizations mislead. Module 05 introduces semantic contracts for fair comparisons over time.
Directory glossary¶
Use the Glossary when you want the recurring language in this module kept stable while you move between lessons, exercises, and capstone checkpoints.