Exercise Answers
Page Maps
graph LR
family["Reproducible Research"]
program["Deep Dive DVC"]
section["Reproducibility Failures Real Teams"]
page["Exercise Answers"]
capstone["Capstone evidence"]
family --> program --> section --> page
page -.applies in.-> capstone
flowchart LR
orient["Orient on the page map"] --> read["Read the main claim and examples"]
read --> inspect["Inspect the related code, proof, or capstone surface"]
inspect --> verify["Run or review the verification path"]
verify --> apply["Apply the idea back to the module and capstone"]
These answers are model explanations, not the only acceptable wording.
What matters is whether the reasoning makes the workflow story more honest.
Answer 1: Separate repeatability from reproducibility
What the claim does show:
- one person achieved a local rerun with a matching result
What it does not show yet:
- whether another person could recover the same result later
- whether the same data identity, environment, and parameters were actually preserved
- whether the repository explains the run without relying on memory
Stronger next question:
- what would another teammate need in order to recover this result from the repository and declared artifacts alone?
The main lesson is that local rerun success is evidence, but it is not the whole proof.
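The distinction can be sketched in code. The following is a minimal illustration, not part of the module's tooling: `RunClaim`, `missing_evidence`, and `classify` are hypothetical names, and the three evidence fields simply mirror the bullets above (data identity, environment, parameters).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunClaim:
    """What a rerun claim actually records (all field names are hypothetical)."""
    result_matched: bool
    data_sha256: Optional[str] = None       # identity of the input data
    environment_lock: Optional[str] = None  # e.g. path to a lock file
    params: Optional[dict] = None           # parameters used for the run


def missing_evidence(claim: RunClaim) -> List[str]:
    """List the kinds of evidence the claim does not record."""
    missing = []
    if claim.data_sha256 is None:
        missing.append("data identity")
    if claim.environment_lock is None:
        missing.append("environment")
    if claim.params is None:
        missing.append("parameters")
    return missing


def classify(claim: RunClaim) -> str:
    """A matching local rerun is evidence, but only part of the proof."""
    if not claim.result_matched:
        return "no evidence"
    if missing_evidence(claim):
        return "repeatable only"
    return "candidate for reproducibility"
```

A claim that records only a matching result classifies as "repeatable only". Even a fully recorded claim is labeled a candidate, because another person still has to recover the result from those records.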
Answer 2: Find the hidden state
Strong answers should include items like:
- the exact identity of train.csv
- the manual cleaning step that produced train.csv
- the threshold sometimes passed via CLI
- the environment used by the original author
- any dependency versions in that environment
- the command history or shell practice used to launch the run
- the exact relationship between the dataset and results/metrics.csv
The important point is that the real workflow already includes these influences whether or not the repository has described them well.
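One way to make some of these influences visible is to write them down at run time. The sketch below is an assumption-laden illustration rather than the module's method: `file_sha256` and `snapshot_hidden_state` are hypothetical helpers, SHA-256 is just one reasonable choice of content identity, and dependency versions are not captured here (a lock file would be the usual place for those).

```python
import hashlib
import platform


def file_sha256(path):
    """Return a content hash so the data file has an exact identity."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_hidden_state(data_path, threshold, command):
    """Record inputs that would otherwise live only in someone's memory."""
    return {
        "data": {"path": str(data_path), "sha256": file_sha256(data_path)},
        "params": {"threshold": threshold},
        "environment": {
            "python": platform.python_version(),
            "platform": platform.platform(),
        },
        "command": command,  # the launch command as actually typed
    }
```

Calling `snapshot_hidden_state("train.csv", threshold=0.5, command="python train.py --threshold 0.5")` would return a dictionary that could be saved next to results/metrics.csv. The script name and threshold value in that call are invented for illustration. The manual cleaning step is not captured by a snapshot like this and still has to be described or scripted separately.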
Answer 3: Name Git's real boundary
A strong review note would say:
Git is preserving source code, visible configs, and textual instructions well. The gap is that the full result story still includes data identity, derived artifacts, and execution assumptions that source history alone does not make explicit. That matters because a team can version code carefully and still fail to recover or defend the result later.
The main lesson is to respect Git's strengths while refusing to overload its authority.
Answer 4: Draw the DVC boundary honestly
What DVC is likely to help with:
- tracked data and artifact identity
- visible pipeline relationships
- recorded stage outputs and recoverable workflow state
What DVC does not settle by itself:
- whether the data is scientifically valid
- whether the workflow is fully deterministic everywhere
- whether the team's published outputs are interpreted responsibly
- whether undocumented manual behavior has been removed
Why the distinction matters:
- it keeps tool expectations honest and makes later workflow design more precise
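The gap between tracked identity and scientific validity can be shown with a small sketch. This is not DVC's implementation: `content_id` uses SHA-256 as a stand-in for whatever content hash a tracking tool records, and `passes_domain_check` encodes an invented rule (no negative `age` values) purely for illustration.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Stand-in for a tracked artifact's content identity (SHA-256 here)."""
    return hashlib.sha256(data).hexdigest()


def identity_preserved(recorded_id: str, data: bytes) -> bool:
    """True when the bytes are exactly the ones that were recorded."""
    return content_id(data) == recorded_id


def passes_domain_check(csv_text: str) -> bool:
    """Hypothetical validity rule: every 'age' value must be non-negative."""
    lines = csv_text.strip().splitlines()
    age_index = lines[0].split(",").index("age")
    return all(float(row.split(",")[age_index]) >= 0 for row in lines[1:])
```

For a file containing a row with `age` of -5, `identity_preserved` can return True while `passes_domain_check` returns False. The tool has faithfully preserved data that is still wrong, which is the boundary this answer describes.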
Answer 5: Write your first workflow inventory
There is no single correct inventory, but strong answers will clearly name:
- where the source data comes from
- what parameters or defaults shape behavior
- what environment or machine assumptions still matter
- which outputs people actually trust
- which parts of the story still depend on memory, luck, or undeclared state
The main lesson is that the inventory should describe the present workflow truth, not an imagined future one.
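An inventory can be as simple as a list of records that say where each influence is declared, if anywhere. The sketch below assumes a hypothetical `InventoryItem` structure. The example entries reuse names from Answer 2, and the `declared_in` values (including `README.md`) are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InventoryItem:
    category: str               # e.g. "source data", "parameter", "output"
    name: str
    declared_in: Optional[str]  # file that records it, or None if undeclared


def undeclared(items: List[InventoryItem]) -> List[str]:
    """Return the parts of the workflow that still depend on memory or luck."""
    return [
        f"{item.category}: {item.name}"
        for item in items
        if item.declared_in is None
    ]


# Illustrative inventory describing the present workflow, not a planned one.
EXAMPLE_INVENTORY = [
    InventoryItem("source data", "train.csv (after manual cleaning)", None),
    InventoryItem("parameter", "threshold passed via CLI", None),
    InventoryItem("environment", "original author's dependency versions", None),
    InventoryItem("output", "results/metrics.csv", "README.md"),
]
```

Running `undeclared(EXAMPLE_INVENTORY)` lists the three entries with no declared home. That list is the honest starting point for later workflow design.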
Self-check
If your answers consistently explain:
- why team-grade reproducibility is harder than a local rerun
- which hidden inputs are influencing the workflow
- what Git is and is not preserving
- what DVC can improve without pretending to own everything
then you are using Module 01 correctly.