Why Git and Scripts Are Not Enough¶

Page Maps¶

graph LR
  family["Reproducible Research"]
  program["Deep Dive DVC"]
  section["Reproducibility Failures Real Teams"]
  page["Why Git and Scripts Are Not Enough"]
  capstone["Capstone evidence"]

  family --> program --> section --> page
  page -.applies in.-> capstone

flowchart LR
  orient["Orient on the page map"] --> read["Read the main claim and examples"]
  read --> inspect["Inspect the related code, proof, or capstone surface"]
  inspect --> verify["Run or review the verification path"]
  verify --> apply["Apply the idea back to the module and capstone"]

Module 01 should not leave you with the impression that Git is weak or useless.

Git is excellent at what it owns.

The problem is that teams often ask Git and a few scripts to carry a reproducibility story that is larger than source control alone.

What Git does well¶

Git is strong at:

versioning text
showing code history
supporting branching and review
preserving deliberate edits to source files and docs

Those are major strengths. Any honest DVC course should say that clearly.

Where the gap opens¶

Reproducibility for data and ML work depends on more than source text:

the exact data identity
the relationship between stages
the parameters that shaped a run
the environment and execution context
the resulting artifacts someone later wants to recover

Git can record some of that indirectly, but it does not model the full problem by itself.

A practical contrast¶

Workflow question	Git answers well	Git alone struggles with
who changed the code	yes	not the main issue
which data bytes produced this result	not directly	this is a core gap
how outputs depend on inputs and stages	only by convention	weak without extra structure
how to recover large or derived artifacts cleanly	not Git's core job	teams improvise here
how to compare experimental runs beyond source diff	partly	often incomplete

This is why "we put it in Git" is not the same claim as "we can defend the result."

Scripts are useful, but they often hide structure¶

Ad hoc scripts can absolutely be part of a good workflow.

The problem appears when the scripts become the only place where important structure lives:

which files are inputs
which outputs are derived
which parameters were chosen
which order the steps are meant to run in

At that point, a teammate can read the code and still not recover the whole operational story without asking the author.

A small example¶

Imagine a repo with:

train.py
evaluate.py
run_all.sh
README.md

That may be enough to get work done.

But now ask stronger questions:

which exact dataset version was used
which artifacts are intermediate versus trusted outputs
whether the shell script assumes files already exist
whether metrics from last month can be recovered exactly

Git history plus shell commands can help, but they do not answer all of those on their own.

Why teams reach for memory instead of structure¶

Git-plus-script workflows often survive for longer than they should because the team has informal compensations:

one person remembers the right command
one folder name is treated as if it were an identity
one server still has the "real" copy
one notebook output is accepted as evidence

That is not reproducibility. It is social caching.

What this means for DVC¶

DVC is not here to replace Git.

It is here to complement Git by making parts of the workflow story more explicit:

data and artifact identity
stage relationships
experiment and metric context
recovery paths for non-source artifacts

That only makes sense once the Git boundary is understood honestly.

The wrong conclusion to avoid¶

Do not leave this page thinking:

Git failed us, so we need a different tool for everything.

The better conclusion is:

Git is solving source history well, but source history is only part of the reproducibility problem.

That is a much more stable foundation for the rest of the course.

Keep this standard¶

When evaluating a fragile workflow, ask two separate questions:

what is Git already preserving well
what important parts of the result story still have no clear owner

That second question is where DVC enters.