Skip to content

Worked Example: Auditing a Fragile ML Workflow

Page Maps

graph LR
  family["Reproducible Research"]
  program["Deep Dive DVC"]
  section["Reproducibility Failures Real Teams"]
  page["Worked Example: Auditing a Fragile ML Workflow"]
  capstone["Capstone evidence"]

  family --> program --> section --> page
  page -.applies in.-> capstone
flowchart LR
  orient["Orient on the page map"] --> read["Read the main claim and examples"]
  read --> inspect["Inspect the related code, proof, or capstone surface"]
  inspect --> verify["Run or review the verification path"]
  verify --> apply["Apply the idea back to the module and capstone"]

This example shows what Module 01 looks like when you apply it to a workflow that seems fine at first glance.

The point is not to shame the workflow. The point is to learn how to describe its risks precisely before reaching for DVC.

The starting repository

A small team has this repository:

  • train.py
  • evaluate.py
  • run_all.sh
  • config.yaml
  • data/train.csv
  • metrics.csv
  • README.md

The team says:

everything important is in Git, and we can rerun the project if we need to.

That statement is common. It is also exactly what Module 01 is meant to test.

Step 1: Ask whether this is repeatable or reproducible

The original author can still run:

bash run_all.sh

and gets a metric close to the previous one.

That is a useful signal, but it does not settle the stronger team question:

  • could another teammate recover the same result next month
  • from the repository alone
  • without asking the author what was done by habit

At this point the workflow has shown local repeatability, not team-grade reproducibility.

Step 2: Inventory the hidden state

Once the team looks more carefully, they uncover:

  • train.csv was cleaned manually from a larger raw export
  • one threshold is overridden from shell history during some runs
  • the author's notebook was used once to generate a feature file that still sits in the working directory
  • the Conda environment was updated twice since the last release
  • metrics.csv is trusted in meetings, but nobody can tie it to exact data identity

This is the moment the workflow starts looking less "simple" and more honestly described.

Step 3: Ask what Git is preserving and what it is not

Git is preserving:

  • code history
  • the visible config file
  • the README

Git is not directly preserving:

  • the exact identity of the cleaned dataset
  • the relationship between raw data, the feature file, and the metric artifact
  • the shell-level threshold override
  • the durable recovery path for derived results

That is not Git failing. That is Git being asked to carry a larger story than it owns.

Step 4: Ask what DVC would help with

At this point you can make a stronger statement:

the problem is not merely that we need version control. The problem is that data identity, derived artifacts, and the path from inputs to outputs are still too implicit.

That is where DVC starts to make sense.

DVC would help make explicit:

  • which data artifact is actually being referenced
  • which stages produce which outputs
  • which derived artifacts can be recovered later

But even here, Module 01 insists on discipline:

DVC still would not fix the scientific meaning of the threshold or the quality of the manual cleaning decision by itself.

Step 5: Write the first honest inventory

The team rewrites its self-description like this:

  • source input: cleaned train.csv, but raw-to-clean lineage is weak
  • control inputs: config.yaml plus one threshold sometimes overridden manually
  • execution assumptions: one shared Conda environment and one manually produced feature file
  • trusted output: metrics.csv
  • weak points: unclear data identity, hidden manual preprocessing, weak artifact recovery

This is a much stronger starting point than:

everything important is in the repo.

What this example teaches

This workflow is not unusual.

It has many strengths:

  • the team writes code down
  • the team has a runnable script
  • the team stores outputs they care about

But it still has the exact failure shape Module 01 is trying to expose:

  • success depends on more than the recorded repository
  • the trusted result is not fully defended by explicit state
  • the team has local repeatability but weak transferability

The review note you would want

The current workflow is locally runnable but not yet reproducible in a team-grade sense. Git is preserving code and visible config, but the exact identity of the cleaned dataset, the manual preprocessing story, and the control surface for thresholds remain weak. The team's trusted artifact is metrics.csv, but the repository cannot yet explain it without social memory. This is the right moment to make data identity, stage boundaries, and recoverable artifacts explicit rather than continuing to rely on convention.

That note is calm, specific, and ready for later DVC modules.

Why this is a mastery example

This one small story exercises the whole module:

  • Core 1: it separates repeatability from reproducibility
  • Core 2: it names hidden state directly
  • Core 3: it respects Git while naming its limits
  • Core 4: it gives DVC a clear, bounded role
  • Core 5: it ends with an honest workflow inventory instead of a vague tool wish