Baseline Authority and Experiment Intent¶
Page Maps¶
graph LR
family["Reproducible Research"]
program["Deep Dive DVC"]
section["Experiments Baselines Controlled Change"]
page["Baseline Authority and Experiment Intent"]
capstone["Capstone evidence"]
family --> program --> section --> page
page -.applies in.-> capstone
flowchart LR
orient["Orient on the page map"] --> read["Read the main claim and examples"]
read --> inspect["Inspect the related code, proof, or capstone surface"]
inspect --> verify["Run or review the verification path"]
verify --> apply["Apply the idea back to the module and capstone"]
An experiment is only meaningful when it has something stable to compare against.
That stable reference is the baseline.
In this module, a baseline is not just "the current files." It is the recorded state that the team is willing to use as the comparison anchor:
- declared data identity
- declared pipeline graph
- declared parameters
- recorded metrics
- known environment assumptions
- published or reviewable evidence
If the baseline is fuzzy, experiments become storytelling.
What the baseline protects¶
A good baseline protects the question:
Compared with what?
Without that answer, a candidate result can look better for the wrong reason.
Example:
baseline:
model_family: logistic_regression
evaluate.threshold: 0.65
positive_class_f1_at_fixed_threshold: 0.81
candidate:
model_family: logistic_regression
evaluate.threshold: 0.50
positive_class_f1_at_fixed_threshold: 0.84
The candidate may be valuable, but the baseline immediately clarifies that the threshold changed. The review should discuss a threshold-control change, not pretend the model improved under identical controls.
Intent comes before execution¶
Before running a candidate, write the intent in one sentence.
Weak:
Try some stuff and see if the metric improves.
Stronger:
Lower the escalation threshold from
0.65to0.50to test whether recall improves enough to justify a precision tradeoff.
That sentence tells a reviewer:
- what changed
- why the change exists
- which metric movement would matter
- which tradeoff needs review
The intent does not need to be long. It needs to be specific enough that the experiment does not become a pile of unrelated changes.
Baseline evidence should be inspectable¶
The baseline should not live only in memory.
Useful evidence includes:
params.yamlfor current controlsmetrics/metrics.jsonfor current resultsdvc.lockfor recorded execution statepublish/v1/params.yamlandpublish/v1/metrics.jsonfor promoted state- review notes or guides that describe release meaning
flowchart LR
params["params"] --> baseline["baseline"]
metrics["metrics"] --> baseline
lock["lock evidence"] --> baseline
publish["published evidence"] --> baseline
baseline --> experiment["candidate comparison"]
The point is not to create paperwork. The point is to keep comparison from depending on who remembers the run.
Baseline authority can expire¶
A baseline is not permanent just because it exists.
It may stop being a fair comparison anchor when:
- the evaluation population changes
- the metric definition changes
- the pipeline graph changes meaningfully
- the environment strategy changes
- the release goal changes
- a data correction invalidates prior results
When that happens, do not hide the break inside an experiment. Name it as baseline boundary work. The team may need a new baseline before candidate comparisons are fair again.
A baseline is not a prison¶
Baseline discipline does not mean "never change anything important."
It means important changes should be described honestly.
Changing a threshold can be a valid experiment. Changing the evaluation population may be valid too, but it changes the comparison claim. Replacing the metric definition may be necessary, but it should not be smuggled into a candidate run and compared as if nothing else moved.
The baseline gives exploration a stable reference point. It does not forbid learning.
Review checkpoint¶
You understand this core when you can explain:
- what state the baseline represents
- which files make the baseline inspectable
- what one candidate run intends to test
- why baseline authority can expire
- when a change should be treated as baseline boundary work rather than an ordinary experiment
Controlled change starts with a baseline that means something.