Module 06: Experiments, Baselines, and Controlled Change

Module Position

flowchart TD
  family["Reproducible Research"] --> program["Deep Dive DVC"]
  program --> module["Module 06: Experiments, Baselines, and Controlled Change"]
  module --> lessons["Lesson pages and worked examples"]
  module --> checkpoints["Exercises and closing criteria"]
  module --> capstone["Related capstone evidence"]

flowchart TD
  purpose["Start with the module purpose and main questions"] --> lesson_map["Use the lesson map to choose reading order"]
  lesson_map --> study["Read the lessons and examples with one review question in mind"]
  study --> proof["Test the idea with exercises and capstone checkpoints"]
  proof --> close["Move on only when the closing criteria feel concrete"]

Read the first diagram as a placement map: this page sits between the course promise, the lesson pages listed below, and the capstone surfaces that pressure-test the module. Read the second diagram as the study route for this page, so the diagrams point you toward the Lesson map, Exercises, and Closing criteria instead of acting like decoration.

Exploration as a first-class, auditable, and reversible process


Purpose of this Module

This module is where the course stops treating reproducibility as only preservation and starts treating it as controlled change. A trustworthy baseline is valuable only if a team can explore without damaging the history that made the baseline trustworthy.

Use this module to answer one practical question: how do you let experiments vary the system while keeping lineage, comparability, and reversal intact? If the answer is still "we tried a few things locally," the repository is not ready for promotion or review.

At a Glance

| Focus | Learner question | Capstone timing |
| --- | --- | --- |
| comparable exploration | "What makes an experiment different but still comparable?" | use the capstone after baseline state already feels stable |
| reversible change | "How do I explore without corrupting the main state story?" | inspect params, metrics, and publish state together |
| declared variation | "Which experiment changes belong in the control surface?" | avoid treating local tweaks as legitimate lineage |

Learning outcomes

  • explain what makes an experiment comparable, reversible, and promotable instead of merely different
  • distinguish controlled experiment variation from baseline-changing boundary work
  • defend why local folklore is not an acceptable experiment lineage story

Verification route

  • Run make PROGRAM=reproducible-research/deep-dive-dvc capstone-experiment-review to inspect baseline params, baseline metrics, and comparable experiment records together.
  • Compare one candidate experiment against the baseline and decide whether it is reversible, comparable, and ready for promotion review.
  • Use the invariants checklist at the end of the module as the exit bar before moving into collaboration and CI policy.

Why this module matters in the course

This is where many teams destroy the discipline they built in earlier modules. Once the baseline becomes trustworthy, the urge to explore returns: new thresholds, new features, different preprocessing, maybe a changed split strategy. That pressure is normal.

The pedagogical point of this module is that experimentation is not an exception to reproducibility. It is one of the places where reproducibility is most likely to fail.

Questions this module should answer

By the end of the module, you should be able to answer:

  • What makes an experiment comparable to the baseline instead of merely different?
  • Which changes belong inside controlled experiment runs, and which require a new baseline?
  • Why is "I tried a few things locally" not an acceptable lineage story?
  • How do experiments stay reversible without cluttering or corrupting main history?

If those answers are still weak, later promotion and governance rules will feel arbitrary.

This module should make experimentation feel more disciplined, not less flexible.

What to inspect in the capstone

Keep the capstone open while reading this module and inspect:

  • params.yaml as the declared experiment surface
  • metrics/metrics.json as the comparison output
  • publish/v1/params.yaml as the promoted parameter contract
  • the recovery and verification targets as a reminder that experiments do not exempt the repository from proof

The capstone is intentionally small, but it should still let you answer the question: "What changed, where was it declared, and why is this run still comparable?"
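
As a minimal sketch of the "what changed, where was it declared" question, the snippet below compares two parameter sets and reports only the keys that differ. It assumes the contents of params.yaml have already been loaded into plain dictionaries; the helper names and the `epochs` key are hypothetical and not part of the capstone.

```python
from typing import Any


def flatten(params: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested params into dotted keys, e.g. {"train": {"lr": 1}} -> {"train.lr": 1}."""
    flat: dict[str, Any] = {}
    for key, value in params.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{dotted}."))
        else:
            flat[dotted] = value
    return flat


def changed_params(
    baseline: dict[str, Any], candidate: dict[str, Any]
) -> dict[str, tuple[Any, Any]]:
    """Return {dotted_key: (baseline_value, candidate_value)} for every differing key."""
    base, cand = flatten(baseline), flatten(candidate)
    missing = object()  # sentinel so a key absent on one side still counts as a change
    diff: dict[str, tuple[Any, Any]] = {}
    for key in sorted(base.keys() | cand.keys()):
        old, new = base.get(key, missing), cand.get(key, missing)
        if old != new:
            diff[key] = (None if old is missing else old, None if new is missing else new)
    return diff


# Hypothetical values standing in for the baseline and one candidate run.
baseline = {"train": {"learning_rate": 0.01, "epochs": 10}}
candidate = {"train": {"learning_rate": 0.005, "epochs": 10}}
print(changed_params(baseline, candidate))
# {'train.learning_rate': (0.01, 0.005)}
```

A run whose diff is empty, or whose diff contains keys nobody intended to vary, is a prompt to ask whether the change was really declared in the control surface.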


6.1 The Fundamental Conflict: Exploration Versus Provenance

Exploration and provenance exert opposing forces: the former demands agility, iterative flexibility, and liberty, while the latter necessitates constancy, traceability, and rigor.

Many machine learning (ML) teams implicitly favor exploration through informal methods: trials in notebooks, duplicated scripts, renamed directories, or plain recollection. These methods produce results, but at the expense of lineage traceability.

DVC experiments formalize exploration so that the reproducibility contract established for the baseline continues to hold while inputs vary.

Illustration:

graph LR
  exploration["Exploration<br/>Speed<br/>Iteration<br/>Freedom"]
  provenance["Provenance<br/>Stability<br/>Traceability<br/>Discipline"]
  balance["DVC Experiments<br/>Formalized<br/>Contract-Preserving"]
  loss["Potential Lineage Loss"]
  constraints["Exploration Constraints"]
  exploration --> loss
  provenance --> constraints
  exploration --> balance
  provenance --> balance

6.2 Formal Conception of an Experiment

An experiment does not equate to a Git branch, notebook, or ephemeral script.

Within DVC, it denotes a provisional pipeline execution with altered inputs, recorded without modifying the primary history.

Essential attributes include:

  • Identical pipeline architecture.
  • Variations in parameters, data, or code.
  • Segregation from the principal branch.
  • Comprehensive comparability and reproducibility.

Violations of these criteria reclassify the activity as technical debt, not a legitimate experiment.
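
The attributes above can be read as a checklist. The following is a rough sketch of such a check, not a DVC API: the `ExperimentRecord` type, its fields, and the stage names are all hypothetical and exist only to make the four criteria concrete.

```python
from dataclasses import dataclass

# The three kinds of variation the definition allows.
ALLOWED_CHANGE_KINDS = frozenset({"params", "data", "code"})


@dataclass(frozen=True)
class ExperimentRecord:
    name: str
    stages: tuple[str, ...]        # pipeline stage names used by the run
    change_kinds: frozenset[str]   # which kinds of input were varied
    committed_to_main: bool        # True if the run was committed straight to main
    metrics_recorded: bool         # True if metrics were captured for comparison


def violations(exp: ExperimentRecord, baseline_stages: tuple[str, ...]) -> list[str]:
    """List every attribute of a legitimate experiment that the run fails to satisfy."""
    problems = []
    if exp.stages != baseline_stages:
        problems.append("pipeline structure differs from the baseline")
    if not exp.change_kinds or not exp.change_kinds <= ALLOWED_CHANGE_KINDS:
        problems.append("changes are missing or of an undeclared kind")
    if exp.committed_to_main:
        problems.append("run is not isolated from the main branch")
    if not exp.metrics_recorded:
        problems.append("no metrics recorded, so the run is not comparable")
    return problems


# Hypothetical stage names and runs, for illustration only.
baseline_stages = ("prepare", "train", "evaluate")
good = ExperimentRecord("lr-0.005", baseline_stages, frozenset({"params"}), False, True)
bad = ExperimentRecord("notebook-run", ("train",), frozenset({"manual edits"}), True, False)

print(violations(good, baseline_stages))      # []
print(len(violations(bad, baseline_stages)))  # 4
```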


6.3 Limitations of Git Branches in Isolation

Git branches effectively sequester code, maintain historical records, and enable concurrent development. However, they fall short in:

  • Cleanly isolating data versions.
  • Facilitating metric comparisons.
  • Averting inadvertent promotions.
  • Documenting execution origins.

Exclusive dependence on Git branches for ML trials precipitates branch proliferation, contextual erosion, and queries such as "Which branch yielded the optimal outcome?"

DVC experiments augment Git, operating at a higher abstraction level.


6.4 Conceptual Framework of DVC Experiments

DVC introduces a dual-axis execution paradigm:

  • Mainline History: Governed by Git commits.
  • Experimental Domain: Transient executions.

Running an experiment invokes the pipeline, records its inputs, outputs, and metrics, and leaves the current commit untouched. Recorded experiments can then be listed, diffed, and compared.

This structure supports rapid cycles, secure evaluations, and intentional elevations, precluding accidental integrations.

Example Workflow:

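$ dvc exp run --set-param train.learning_rate=0.005  # Run the pipeline with one overridden parameter
$ dvc exp list  # List recorded experiments
$ dvc exp diff  # Show param and metric changes relative to the last commit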
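
When several parameter variations are planned, it can help to generate the run commands from one declared mapping instead of typing them by hand. The sketch below only builds the argument list for the `dvc exp run --set-param` invocation shown above and does not execute anything; the helper name `exp_run_command` is hypothetical.

```python
import shlex


def exp_run_command(overrides: dict[str, object]) -> list[str]:
    """Build the argument list for one `dvc exp run` call with parameter overrides."""
    cmd = ["dvc", "exp", "run"]
    for key, value in overrides.items():
        # One --set-param flag per overridden parameter, in declaration order.
        cmd += ["--set-param", f"{key}={value}"]
    return cmd


cmd = exp_run_command({"train.learning_rate": 0.005})
print(shlex.join(cmd))
# dvc exp run --set-param train.learning_rate=0.005
```

Inside a repository with DVC installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`. Keeping the overrides in a dictionary means the same mapping can also be written to the experiment log, so the declared variation and the executed variation cannot drift apart.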


6.5 Assurances of Isolation and Their Boundaries

DVC experiments provide:

  • Detachment from Git chronology.
  • Autonomous parameter configurations.
  • Separately recorded metrics for each run.

They exclude:

  • Safeguards against environmental variances.
  • Protections from semantic errors.
  • Assurances of sound experimental methodology.

Thus, experiments uphold mechanical fidelity while deferring to human discernment.
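
Because environmental variance falls outside what experiments isolate, one option is to record a small environment fingerprint next to each run. This is an illustrative sketch using only the Python standard library, not something DVC does for you; the function names and the choice of fields are assumptions, and a real project would likely include dependency versions as well.

```python
import hashlib
import json
import platform


def environment_snapshot() -> dict[str, str]:
    """Collect a few interpreter and platform facts that commonly differ between machines."""
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "machine": platform.machine(),
    }


def fingerprint(snapshot: dict[str, str]) -> str:
    """Hash the snapshot deterministically so two runs can be compared by one string."""
    payload = json.dumps(snapshot, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


snapshot = environment_snapshot()
print(snapshot)
print(fingerprint(snapshot))
```

Two experiments with different fingerprints are not automatically incomparable, but the difference becomes a recorded fact to weigh during evaluation instead of something discovered later.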


6.6 The Imperative Experiment Lifecycle

Each experiment must culminate in one of three resolutions: promotion, archival, or discard. An experiment left without one of these resolutions goes stale.

Creation Phase

  • Modify inputs (parameters, data, code).
  • Execute the experiment.
  • Capture resultant artifacts.

Evaluation Phase

  • Contrast metrics.
  • Examine differences.
  • Validate semantic coherence.

Decision Phase

  • Intentionally promote, archive, or explicitly discard.

Undecided experiments accrue as liabilities.

Illustration:

graph LR
  creation["Creation<br/>Input Changes<br/>Run<br/>Record"]
  evaluation["Evaluation<br/>Compare<br/>Inspect<br/>Assess"]
  decision["Decision<br/>Promote<br/>Archive<br/>Discard"]
  creation --> evaluation --> decision
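
The lifecycle can be sketched as a small state machine in which only the three resolutions are terminal. The names below are hypothetical, and the strict rule that a run must be evaluated before any decision is an assumption drawn from the creation, evaluation, decision ordering above.

```python
from enum import Enum


class Status(Enum):
    CREATED = "created"
    EVALUATED = "evaluated"
    PROMOTED = "promoted"
    ARCHIVED = "archived"
    DISCARDED = "discarded"


# The three resolutions; nothing may follow them.
TERMINAL = {Status.PROMOTED, Status.ARCHIVED, Status.DISCARDED}

# Creation must be followed by evaluation, and evaluation by a decision.
ALLOWED = {
    Status.CREATED: {Status.EVALUATED},
    Status.EVALUATED: TERMINAL,
}


def advance(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the lifecycle does not permit the move."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


def unresolved(statuses: dict[str, Status]) -> list[str]:
    """Names of experiments that have not reached a resolution yet."""
    return sorted(name for name, status in statuses.items() if status not in TERMINAL)


# Hypothetical experiment names.
runs = {"exp-a": Status.PROMOTED, "exp-b": Status.EVALUATED, "exp-c": Status.CREATED}
print(unresolved(runs))  # ['exp-b', 'exp-c']
```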

6.7 Promotion as a Deliberate Governance Procedure

Promotion represents the most precarious juncture, conferring authority, historical integration, and downstream dependencies upon the experiment.

It mandates:

  • Reproducibility confirmation.
  • Metric substantiation.
  • Peer scrutiny where pertinent.

Unverified promotions entrench suboptimal results.

Example Promotion Command (with verification):

$ dvc exp show  # Review metrics
$ dvc exp apply <exp-id>  # Promote to workspace
$ git add .  # Stage the applied changes
$ git commit -m "Promote verified experiment"
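
The three requirements for promotion can be expressed as an explicit gate that is checked before running `dvc exp apply`. This is a sketch under assumed names, not a DVC feature: `PromotionEvidence` and its fields are hypothetical, and how each flag gets set is left to the team's own process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionEvidence:
    reproduced: bool          # a re-run confirmed the result
    metrics_verified: bool    # metrics were checked against the baseline
    review_required: bool     # whether this change needs peer scrutiny
    reviewed: bool = False    # whether that scrutiny happened


def promotion_blockers(evidence: PromotionEvidence) -> list[str]:
    """Return the reasons promotion must wait; an empty list means the gate is open."""
    blockers = []
    if not evidence.reproduced:
        blockers.append("reproducibility not confirmed")
    if not evidence.metrics_verified:
        blockers.append("metrics not substantiated")
    if evidence.review_required and not evidence.reviewed:
        blockers.append("peer review outstanding")
    return blockers


ready = PromotionEvidence(reproduced=True, metrics_verified=True, review_required=False)
pending = PromotionEvidence(reproduced=True, metrics_verified=False, review_required=True)

print(promotion_blockers(ready))    # []
print(promotion_blockers(pending))  # ['metrics not substantiated', 'peer review outstanding']
```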


6.8 Defining a Reproducible Experiment

A reproducible experiment permits clean-machine re-execution, yields equivalent metrics, features declared inputs, and adheres to pipeline invariants.

Non-reproducible instances warrant rejection from promotion, with precedence given to pipeline or environmental rectification. This stringent criterion sustains systemic coherence.
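
"Equivalent metrics" needs a concrete meaning before it can be enforced. One possible reading, sketched below, is that a clean re-run reports the same metric names and that each value agrees within a tolerance. The function name, the tolerance defaults, and the sample values are assumptions for illustration, not capstone results.

```python
import math


def metrics_equivalent(
    original: dict[str, float],
    rerun: dict[str, float],
    rel_tol: float = 1e-6,
    abs_tol: float = 1e-9,
) -> bool:
    """True if both runs report the same metric names with values within tolerance."""
    if original.keys() != rerun.keys():
        return False  # a missing or extra metric already breaks comparability
    return all(
        math.isclose(original[name], rerun[name], rel_tol=rel_tol, abs_tol=abs_tol)
        for name in original
    )


# Hypothetical metric values.
print(metrics_equivalent({"accuracy": 0.912}, {"accuracy": 0.9120000001}))  # True
print(metrics_equivalent({"accuracy": 0.912}, {"accuracy": 0.90}))          # False
```

The tolerance is itself a declared choice: a pipeline with a fixed seed may justify exact equality, while one with nondeterministic operations may need a looser bound that the team agrees on in advance.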


6.9 Prevalent Anti-Patterns

Persistent Experiments

Undecided lingering experiments devolve into ersatz branches, obfuscating history.

Tacit Promotion

Continuing to use an experiment's results without formally promoting it erodes provenance.

Selective Metric Emphasis

Prioritizing singular metrics absent semantic oversight engenders spurious assurance.

Notebook-Centric Inquiry

Work that lives only in a notebook and cannot be re-run does not count as an experiment. For lineage purposes, it never happened.


6.10 Failure Modes and Analyses

| Symptom | Interpretation |
| --- | --- |
| Experiment Overabundance | Decision-making laxity |
| Irreproducible Optimal Run | Environmental/dependency infiltration |
| Incomparable Metrics | Semantic divergence |
| Mainline Contamination | Unverified promotion |

These signify governance deficiencies, not flaws in the tooling.


6.11 Applied Exercise

Execute systematically:

  1. Conduct three experiments: one superior, one inferior, one equivocal.
  2. For each: Document inputs, compare metrics, determine disposition.
  3. Promote solely one.

Inability to articulate promotion rationale in documentation signals unpreparedness for automation.

Guidance: Leverage dvc exp run with varied parameters; record in a structured log for traceability.
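
One possible shape for the structured log is a JSON Lines record per experiment that refuses entries without a disposition and a written rationale. The sketch below is an assumption about what such a log could look like; the `LogEntry` type, the experiment names, and the metric values are hypothetical placeholders, not results.

```python
import json
from dataclasses import asdict, dataclass

DISPOSITIONS = ("promote", "archive", "discard")


@dataclass(frozen=True)
class LogEntry:
    experiment: str
    inputs: dict
    metrics: dict
    disposition: str
    rationale: str

    def __post_init__(self) -> None:
        # Reject entries that leave the decision or its justification implicit.
        if self.disposition not in DISPOSITIONS:
            raise ValueError(f"unknown disposition: {self.disposition}")
        if not self.rationale.strip():
            raise ValueError("every decision needs a written rationale")


def to_jsonl(entries: list[LogEntry]) -> str:
    """Serialize entries as one JSON object per line."""
    return "\n".join(json.dumps(asdict(entry), sort_keys=True) for entry in entries)


def promoted(entries: list[LogEntry]) -> list[str]:
    """Names of the experiments marked for promotion."""
    return [entry.experiment for entry in entries if entry.disposition == "promote"]


# Placeholder values for the three exercise runs.
log = [
    LogEntry("exp-better", {"train.learning_rate": 0.005}, {"accuracy": 0.93},
             "promote", "Improves on baseline and reproduces on re-run."),
    LogEntry("exp-worse", {"train.learning_rate": 0.1}, {"accuracy": 0.71},
             "discard", "Clearly below baseline."),
    LogEntry("exp-unclear", {"train.learning_rate": 0.008}, {"accuracy": 0.90},
             "archive", "Within noise of baseline; keep for reference."),
]

print(to_jsonl(log))
assert len(promoted(log)) == 1  # the exercise allows exactly one promotion
```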


6.12 Essential Conceptual Paradigm

Experiments are ephemeral; history is inviolable.

Treating every run as equally significant undermines the reliability of the history.


Module 06: Invariants Checklist

Confirm:

  • Experiments segregated from primary history.
  • Defined resolutions for all experiments.
  • Deliberate, verifiable promotions.
  • Rejection of irreproducible experiments.
  • Preservation of provenance amid exploration.

If any of these still feels negotiable, resolve that before advancing.


Transition to Module 07

Individual workflows are now sound. However, systemic failures arise from how people work together: forgotten data pushes, force-pushed branches, and experiments run directly on the main branch.

Module 07 addresses this reality: Reproducibility constitutes a social challenge mitigated through technical mechanisms.

Directory glossary

Use Glossary when you want the recurring language in this module kept stable while you move between lessons, exercises, and capstone checkpoints.