Worked Example: Auditing a Fragile ML Workflow
This example shows what Module 01 looks like when you apply it to a workflow that seems fine at first glance.
The point is not to shame the workflow. The point is to learn how to describe its risks precisely before reaching for DVC.
The starting repository
A small team has this repository:
- train.py
- evaluate.py
- run_all.sh
- config.yaml
- data/train.csv
- metrics.csv
- README.md
The team says:
everything important is in Git, and we can rerun the project if we need to.
That statement is common. It is also exactly what Module 01 is meant to test.
Step 1: Ask whether this is repeatable or reproducible
The original author can still run the workflow end to end (for example through run_all.sh) and gets a metric close to the previous one.
That is a useful signal, but it does not settle the stronger team question:
- could another teammate recover the same result next month
- from the repository alone
- without asking the author what was done by habit
At this point the workflow has shown local repeatability, not team-grade reproducibility.
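The phrase "close to the previous one" hides a decision about how close is close enough. A minimal sketch of making that tolerance explicit, assuming the metric is a single float; the function name `is_repeatable` and the default tolerance are hypothetical, not part of the team's repository:

```python
import math


def is_repeatable(previous_metric: float, current_metric: float,
                  rel_tol: float = 1e-3) -> bool:
    """Return True when two runs agree within a stated relative tolerance.

    This only checks local repeatability: the same person, on the same
    machine, getting a similar number. It says nothing about whether a
    teammate could recover the result from the repository alone.
    """
    return math.isclose(previous_metric, current_metric, rel_tol=rel_tol)
```

Passing this check is evidence for repeatability only. The three team questions above still have to be answered separately.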
Step 2: Inventory the hidden state
Once the team looks more carefully, they uncover:
- train.csv was cleaned manually from a larger raw export
- one threshold is overridden from shell history during some runs
- the author's notebook was used once to generate a feature file that still sits in the working directory
- the Conda environment was updated twice since the last release
- metrics.csv is trusted in meetings, but nobody can tie it to exact data identity
This is the moment the workflow starts looking less "simple" and more honestly described.
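One way to start tying a metric to exact data identity is to record a content hash of the input file together with the control values that were actually used. The sketch below assumes a single input file and a single threshold; the function names and the dictionary layout are hypothetical and only illustrate the idea:

```python
import hashlib
import platform
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_run_state(data_path: Path, threshold: float) -> dict:
    """Collect state that the inventory above found to be implicit."""
    return {
        "data_file": str(data_path),
        # Identifies the exact bytes that were used, not just the file name.
        "data_sha256": sha256_of_file(data_path),
        # The value actually used in this run, including any manual override.
        "threshold": threshold,
        "python_version": platform.python_version(),
    }
```

Writing this dictionary next to metrics.csv (for instance with `json.dump`) would not explain how train.csv was cleaned, but it would at least let a reviewer say which file and which threshold a given metric came from.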
Step 3: Ask what Git is preserving and what it is not
Git is preserving:
- code history
- the visible config file
- the README
Git is not directly preserving:
- the exact identity of the cleaned dataset
- the relationship between raw data, the feature file, and the metric artifact
- the shell-level threshold override
- the durable recovery path for derived results
That is not Git failing. That is Git being asked to carry a larger story than it owns.
Step 4: Ask what DVC would help with
At this point you can make a stronger statement:
the problem is not merely that we need version control. The problem is that data identity, derived artifacts, and the path from inputs to outputs are still too implicit.
That is where DVC starts to make sense.
DVC would help make explicit:
- which data artifact is actually being referenced
- which stages produce which outputs
- which derived artifacts can be recovered later
But even here, Module 01 insists on discipline:
DVC still would not fix the scientific meaning of the threshold or the quality of the manual cleaning decision by itself.
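To see what "which stages produce which outputs" means in practice, here is a small plain-Python model of stages with declared dependencies and outputs. It is not DVC's API; DVC records comparable information in a `dvc.yaml` file. The stage names and the files `data/raw_export.csv` and `model.pkl` are hypothetical additions for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    deps: tuple  # files the stage reads
    outs: tuple  # files the stage writes


# Hypothetical stage layout for the example repository.
STAGES = (
    Stage("clean", deps=("data/raw_export.csv",), outs=("data/train.csv",)),
    Stage("train", deps=("train.py", "config.yaml", "data/train.csv"),
          outs=("model.pkl",)),
    Stage("evaluate", deps=("evaluate.py", "model.pkl"), outs=("metrics.csv",)),
)


def trace_lineage(target: str, stages) -> list:
    """Return every file that `target` depends on, directly or indirectly."""
    producers = {out: stage for stage in stages for out in stage.outs}
    seen = set()
    pending = [target]
    while pending:
        current = pending.pop()
        stage = producers.get(current)
        if stage is None:
            continue  # a source input: no declared stage produces it
        for dep in stage.deps:
            if dep not in seen:
                seen.add(dep)
                pending.append(dep)
    return sorted(seen)
```

With stages declared this way, `trace_lineage("metrics.csv", STAGES)` lists the raw export, the cleaned data, the config, and both scripts. That is the path from inputs to outputs that the team currently carries in memory. It still says nothing about whether the threshold or the cleaning decision is scientifically sound.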
Step 5: Write the first honest inventory
The team rewrites its self-description like this:
- source input: cleaned train.csv, but raw-to-clean lineage is weak
- control inputs: config.yaml plus one threshold sometimes overridden manually
- execution assumptions: one shared Conda environment and one manually produced feature file
- trusted output: metrics.csv
- weak points: unclear data identity, hidden manual preprocessing, weak artifact recovery
This is a much stronger starting point than:
everything important is in the repo.
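The same inventory can be kept as a small structured record so that it is reviewed alongside the code rather than remembered. This is only a sketch: the class name `WorkflowInventory` and its field names are hypothetical and mirror the categories of the team's written inventory:

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowInventory:
    source_inputs: list
    control_inputs: list
    execution_assumptions: list
    trusted_outputs: list
    weak_points: list = field(default_factory=list)

    def summary(self) -> list:
        """Render the inventory as plain text lines for a review note."""
        return [
            "source input: " + "; ".join(self.source_inputs),
            "control inputs: " + "; ".join(self.control_inputs),
            "execution assumptions: " + "; ".join(self.execution_assumptions),
            "trusted output: " + "; ".join(self.trusted_outputs),
            "weak points: " + "; ".join(self.weak_points),
        ]


inventory = WorkflowInventory(
    source_inputs=["cleaned train.csv (raw-to-clean lineage is weak)"],
    control_inputs=["config.yaml", "one threshold sometimes overridden manually"],
    execution_assumptions=["one shared Conda environment",
                           "one manually produced feature file"],
    trusted_outputs=["metrics.csv"],
    weak_points=["unclear data identity", "hidden manual preprocessing",
                 "weak artifact recovery"],
)
```

The value is not in the class itself but in the habit: every weak point is named explicitly, so later work can be judged by which entries it removes from the list.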
What this example teaches
This workflow is not unusual.
It has many strengths:
- the team writes code down
- the team has a runnable script
- the team stores outputs they care about
But it still has the exact failure shape Module 01 is trying to expose:
- success depends on more than the recorded repository
- the trusted result is not fully defended by explicit state
- the team has local repeatability but weak transferability
The review note you would want
The current workflow is locally runnable but not yet reproducible in a team-grade sense. Git is preserving code and visible config, but the exact identity of the cleaned dataset, the manual preprocessing story, and the control surface for thresholds remain weak. The team's trusted artifact is metrics.csv, but the repository cannot yet explain it without social memory. This is the right moment to make data identity, stage boundaries, and recoverable artifacts explicit rather than continuing to rely on convention.
That note is calm, specific, and ready for later DVC modules.
Why this is a mastery example
This one small story exercises the whole module:
- Core 1: it separates repeatability from reproducibility
- Core 2: it names hidden state directly
- Core 3: it respects Git while naming its limits
- Core 4: it gives DVC a clear, bounded role
- Core 5: it ends with an honest workflow inventory instead of a vague tool wish