Skip to content

Baseline Authority and Experiment Intent

Page Maps

graph LR
  family["Reproducible Research"]
  program["Deep Dive DVC"]
  section["Experiments Baselines Controlled Change"]
  page["Baseline Authority and Experiment Intent"]
  capstone["Capstone evidence"]

  family --> program --> section --> page
  page -.applies in.-> capstone
flowchart LR
  orient["Orient on the page map"] --> read["Read the main claim and examples"]
  read --> inspect["Inspect the related code, proof, or capstone surface"]
  inspect --> verify["Run or review the verification path"]
  verify --> apply["Apply the idea back to the module and capstone"]

An experiment is only meaningful when it has something stable to compare against.

That stable reference is the baseline.

In this module, a baseline is not just "the current files." It is the recorded state that the team is willing to use as the comparison anchor:

  • declared data identity
  • declared pipeline graph
  • declared parameters
  • recorded metrics
  • known environment assumptions
  • published or reviewable evidence

If the baseline is fuzzy, experiments become storytelling.

What the baseline protects

A good baseline protects the question:

Compared with what?

Without that answer, a candidate result can look better for the wrong reason.

Example:

baseline:
  model_family: logistic_regression
  evaluate.threshold: 0.65
  positive_class_f1_at_fixed_threshold: 0.81

candidate:
  model_family: logistic_regression
  evaluate.threshold: 0.50
  positive_class_f1_at_fixed_threshold: 0.84

The candidate may be valuable, but the baseline immediately clarifies that the threshold changed. The review should discuss a threshold-control change, not pretend the model improved under identical controls.

Intent comes before execution

Before running a candidate, write the intent in one sentence.

Weak:

Try some stuff and see if the metric improves.

Stronger:

Lower the escalation threshold from 0.65 to 0.50 to test whether recall improves enough to justify a precision tradeoff.

That sentence tells a reviewer:

  • what changed
  • why the change exists
  • which metric movement would matter
  • which tradeoff needs review

The intent does not need to be long. It needs to be specific enough that the experiment does not become a pile of unrelated changes.

Baseline evidence should be inspectable

The baseline should not live only in memory.

Useful evidence includes:

  • params.yaml for current controls
  • metrics/metrics.json for current results
  • dvc.lock for recorded execution state
  • publish/v1/params.yaml and publish/v1/metrics.json for promoted state
  • review notes or guides that describe release meaning
flowchart LR
  params["params"] --> baseline["baseline"]
  metrics["metrics"] --> baseline
  lock["lock evidence"] --> baseline
  publish["published evidence"] --> baseline
  baseline --> experiment["candidate comparison"]

The point is not to create paperwork. The point is to keep comparison from depending on who remembers the run.

Baseline authority can expire

A baseline is not permanent just because it exists.

It may stop being a fair comparison anchor when:

  • the evaluation population changes
  • the metric definition changes
  • the pipeline graph changes meaningfully
  • the environment strategy changes
  • the release goal changes
  • a data correction invalidates prior results

When that happens, do not hide the break inside an experiment. Name it as baseline boundary work. The team may need a new baseline before candidate comparisons are fair again.

A baseline is not a prison

Baseline discipline does not mean "never change anything important."

It means important changes should be described honestly.

Changing a threshold can be a valid experiment. Changing the evaluation population may be valid too, but it changes the comparison claim. Replacing the metric definition may be necessary, but it should not be smuggled into a candidate run and compared as if nothing else moved.

The baseline gives exploration a stable reference point. It does not forbid learning.

Review checkpoint

You understand this core when you can explain:

  • what state the baseline represents
  • which files make the baseline inspectable
  • what one candidate run intends to test
  • why baseline authority can expire
  • when a change should be treated as baseline boundary work rather than an ordinary experiment

Controlled change starts with a baseline that means something.