Why Git and Scripts Are Not Enough¶
Page Maps¶
graph LR
family["Reproducible Research"]
program["Deep Dive DVC"]
section["Reproducibility Failures Real Teams"]
page["Why Git and Scripts Are Not Enough"]
capstone["Capstone evidence"]
family --> program --> section --> page
page -.applies in.-> capstone
flowchart LR
orient["Orient on the page map"] --> read["Read the main claim and examples"]
read --> inspect["Inspect the related code, proof, or capstone surface"]
inspect --> verify["Run or review the verification path"]
verify --> apply["Apply the idea back to the module and capstone"]
Module 01 should not leave you with the impression that Git is weak or useless.
Git is excellent at what it owns.
The problem is that teams often ask Git and a few scripts to carry a reproducibility story that is larger than source control alone.
What Git does well¶
Git is strong at:
- versioning text
- showing code history
- supporting branching and review
- preserving deliberate edits to source files and docs
Those are major strengths. Any honest DVC course should say that clearly.
Where the gap opens¶
Reproducibility for data and ML work depends on more than source text:
- the exact data identity
- the relationship between stages
- the parameters that shaped a run
- the environment and execution context
- the resulting artifacts someone later wants to recover
Git can record some of that indirectly, but it does not model the full problem by itself.
A practical contrast¶
| Workflow question | Git answers well | Git alone struggles with |
|---|---|---|
| who changed the code | yes | not the main issue |
| which data bytes produced this result | not directly | this is a core gap |
| how outputs depend on inputs and stages | only by convention | weak without extra structure |
| how to recover large or derived artifacts cleanly | not Git's core job | teams improvise here |
| how to compare experimental runs beyond source diff | partly | often incomplete |
This is why "we put it in Git" is not the same claim as "we can defend the result."
Scripts are useful, but they often hide structure¶
Ad hoc scripts can absolutely be part of a good workflow.
The problem appears when the scripts become the only place where important structure lives:
- which files are inputs
- which outputs are derived
- which parameters were chosen
- which order the steps are meant to run in
At that point, a teammate can read the code and still not recover the whole operational story without asking the author.
A small example¶
Imagine a repo with:
train.pyevaluate.pyrun_all.shREADME.md
That may be enough to get work done.
But now ask stronger questions:
- which exact dataset version was used
- which artifacts are intermediate versus trusted outputs
- whether the shell script assumes files already exist
- whether metrics from last month can be recovered exactly
Git history plus shell commands can help, but they do not answer all of those on their own.
Why teams reach for memory instead of structure¶
Git-plus-script workflows often survive for longer than they should because the team has informal compensations:
- one person remembers the right command
- one folder name is treated as if it were an identity
- one server still has the "real" copy
- one notebook output is accepted as evidence
That is not reproducibility. It is social caching.
What this means for DVC¶
DVC is not here to replace Git.
It is here to complement Git by making parts of the workflow story more explicit:
- data and artifact identity
- stage relationships
- experiment and metric context
- recovery paths for non-source artifacts
That only makes sense once the Git boundary is understood honestly.
The wrong conclusion to avoid¶
Do not leave this page thinking:
Git failed us, so we need a different tool for everything.
The better conclusion is:
Git is solving source history well, but source history is only part of the reproducibility problem.
That is a much more stable foundation for the rest of the course.
Keep this standard¶
When evaluating a fragile workflow, ask two separate questions:
- what is Git already preserving well
- what important parts of the result story still have no clear owner
That second question is where DVC enters.