Comparing Experiments and Selecting Candidates

Experiment comparison is not a leaderboard ritual.

The candidate with the highest single metric is not automatically the candidate that should move forward. A good comparison asks whether the candidate is comparable, what tradeoff it makes, and whether the result supports the intent that created it.

Start with comparability

Before ranking candidates, ask whether they can be compared.

Useful checks:

  • same baseline or an explicitly named baseline change
  • same evaluation population
  • same metric definitions
  • declared parameter changes
  • no hidden data or environment drift
  • no unrelated pipeline change mixed into the candidate

If these checks fail, the right next step is not to pick a winner. It is to repair the comparison story.
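The checks above can be sketched as a small gate that runs before any ranking. This is a minimal illustration, not a DVC feature: the `RunRecord` class, its field names, and the fingerprint strings are all hypothetical stand-ins for whatever your project records about a run.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """Hypothetical summary of one run; every field name is illustrative."""
    name: str
    baseline: str             # name of the baseline this run was derived from
    eval_data_hash: str       # fingerprint of the evaluation population
    metric_definitions: str   # version tag of the metric code
    env_hash: str             # fingerprint of the environment
    declared_params: dict = field(default_factory=dict)
    actual_params: dict = field(default_factory=dict)


def comparability_problems(base: RunRecord, cand: RunRecord) -> list:
    """Return the reasons the two runs cannot be compared (empty if none)."""
    problems = []
    if cand.baseline != base.name:
        problems.append("candidate derives from a different baseline")
    if cand.eval_data_hash != base.eval_data_hash:
        problems.append("evaluation population differs")
    if cand.metric_definitions != base.metric_definitions:
        problems.append("metric definitions differ")
    if cand.env_hash != base.env_hash:
        problems.append("environment drift")
    # Every parameter that actually changed must have been declared.
    changed = {k for k, v in cand.actual_params.items()
               if v != base.actual_params.get(k)}
    undeclared = changed - set(cand.declared_params)
    if undeclared:
        problems.append(f"undeclared parameter changes: {sorted(undeclared)}")
    return problems


base = RunRecord("baseline", "baseline", "data-v1", "metrics-v1", "env-v1",
                 actual_params={"threshold": 0.65, "seed": 7})
clean = RunRecord("lower-threshold-for-recall", "baseline", "data-v1",
                  "metrics-v1", "env-v1",
                  declared_params={"threshold": 0.50},
                  actual_params={"threshold": 0.50, "seed": 7})
drifted = RunRecord("mystery-run", "baseline", "data-v2", "metrics-v1", "env-v1",
                    declared_params={"threshold": 0.50},
                    actual_params={"threshold": 0.50, "seed": 11})

clean_problems = comparability_problems(base, clean)
drifted_problems = comparability_problems(base, drifted)
```

An empty list means ranking may proceed. A non-empty list is the repair agenda for the comparison, not a reason to pick a winner anyway.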

Compare the whole review surface

A candidate can improve one metric and make another worse.

Example:

baseline:
  threshold: 0.65
  f1: 0.81
  precision: 0.78
  recall: 0.84

candidate:
  threshold: 0.50
  f1: 0.84
  precision: 0.75
  recall: 0.95

This is not simply "better." It is a threshold tradeoff.

If the release goal is to reduce missed escalations, the candidate may be promising. If the release goal is to avoid false alarms, it may be unacceptable. The metric values cannot decide on their own; they only mean something when read against the review objective.
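The same reading can be sketched in code: compute the per-metric movement, then test it against a stated objective. The metric values come from the example above. The `supports_objective` helper and the 0.05 tolerance are assumptions made for illustration; a real review would set its own limits.

```python
baseline = {"threshold": 0.65, "f1": 0.81, "precision": 0.78, "recall": 0.84}
candidate = {"threshold": 0.50, "f1": 0.84, "precision": 0.75, "recall": 0.95}


def metric_deltas(base, cand, metrics=("f1", "precision", "recall")):
    """Candidate minus baseline for each metric, rounded to hide float noise."""
    return {m: round(cand[m] - base[m], 4) for m in metrics}


def supports_objective(deltas, must_improve, max_drop):
    """Hypothetical rule: the target metric must rise and no guarded
    metric may fall by more than its allowed limit."""
    if deltas[must_improve] <= 0:
        return False
    return all(deltas[m] >= -limit for m, limit in max_drop.items())


deltas = metric_deltas(baseline, candidate)

# Objective A: reduce missed escalations (recall), tolerate a small precision loss.
fits_recall_goal = supports_objective(deltas, "recall", {"precision": 0.05})

# Objective B: avoid false alarms (precision), tolerate a small recall loss.
fits_precision_goal = supports_objective(deltas, "precision", {"recall": 0.05})
```

The same deltas pass one objective and fail the other, which is the point: the numbers do not change, only the question asked of them.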

Use candidate tables carefully

Candidate tables help when they keep the changed control and the tradeoff visible.

candidate                         threshold    f1     precision    recall    review note
baseline                          0.65         0.81   0.78         0.84      current release
lower-threshold-for-recall        0.50         0.84   0.75         0.95      recall gain, precision cost
stricter-threshold-for-precision  0.75         0.77   0.86         0.68      precision gain, recall cost

This table is useful because it shows the control that moved and the tradeoff, not only a ranked metric.
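One way to keep tables in this shape is to generate them from records that must carry the control and the review note. The sketch below is an assumption about how that could look, not an existing tool. The row values are copied from the table above, and `render_review_table` is a hypothetical helper that refuses incomplete rows and never re-sorts by a metric.

```python
rows = [
    {"candidate": "baseline", "threshold": 0.65, "f1": 0.81,
     "precision": 0.78, "recall": 0.84, "note": "current release"},
    {"candidate": "lower-threshold-for-recall", "threshold": 0.50, "f1": 0.84,
     "precision": 0.75, "recall": 0.95, "note": "recall gain, precision cost"},
    {"candidate": "stricter-threshold-for-precision", "threshold": 0.75, "f1": 0.77,
     "precision": 0.86, "recall": 0.68, "note": "precision gain, recall cost"},
]

REQUIRED = ("candidate", "threshold", "f1", "precision", "recall", "note")


def render_review_table(rows):
    """Return table lines in the order given; reject rows that hide meaning."""
    for row in rows:
        missing = [k for k in REQUIRED if row.get(k) in (None, "")]
        if missing:
            raise ValueError(f"{row.get('candidate', '?')}: missing {missing}")
    width = max(len(r["candidate"]) for r in rows)
    lines = [f"{'candidate':<{width}}  threshold  f1    precision  recall  review note"]
    for r in rows:
        lines.append(
            f"{r['candidate']:<{width}}  {r['threshold']:<9.2f}  {r['f1']:<4.2f}  "
            f"{r['precision']:<9.2f}  {r['recall']:<6.2f}  {r['note']}"
        )
    return lines


table_lines = render_review_table(rows)
print("\n".join(table_lines))
```

A row with only a name and an F1 value raises an error here, so a bare leaderboard cannot be produced by accident.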

Weak table:

candidate    f1
a            0.84
b            0.81
c            0.77

That table invites a winner without explaining what changed.

Selection is a decision, not a discovery

The review should distinguish:

  • observed metric movement
  • parameter or data changes that explain the movement
  • known tradeoffs
  • release objective
  • reason to keep, discard, or promote the candidate

A strong candidate note might say:

Keep lower-threshold-for-recall for promotion review because it improves recall from 0.84 to 0.95 on the same evaluation population, with an observed precision drop from 0.78 to 0.75. This matches the current release objective only if the precision cost remains acceptable.

That is a decision argument. It is stronger than "best F1."
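The five distinctions listed above can be made hard to skip by recording the note as structured fields. This is a sketch under assumptions: the `SelectionNote` class and its field names are hypothetical, and the example values restate the candidate note from this section.

```python
from dataclasses import dataclass

ALLOWED_DECISIONS = {"keep", "discard", "promote"}


@dataclass
class SelectionNote:
    """Hypothetical record; one field per item the review should distinguish."""
    candidate: str
    observed: str      # observed metric movement
    explanation: str   # parameter or data change that explains it
    tradeoff: str      # known tradeoffs
    objective: str     # release objective
    decision: str      # keep, discard, or promote
    reason: str        # why the decision follows

    def __post_init__(self):
        if self.decision not in ALLOWED_DECISIONS:
            raise ValueError(f"decision must be one of {sorted(ALLOWED_DECISIONS)}")
        empty = [k for k, v in vars(self).items() if not v.strip()]
        if empty:
            raise ValueError(f"empty fields: {empty}")

    def render(self) -> str:
        return "\n".join(f"{k}: {v}" for k, v in vars(self).items())


note = SelectionNote(
    candidate="lower-threshold-for-recall",
    observed="recall 0.84 -> 0.95, precision 0.78 -> 0.75, same evaluation population",
    explanation="threshold lowered from 0.65 to 0.50",
    tradeoff="recall gain, precision cost",
    objective="reduce missed escalations",
    decision="keep",
    reason="matches the objective only if the precision cost remains acceptable",
)
note_text = note.render()
```

A note that says only "best F1" cannot be constructed this way, because it has nothing to put in the explanation, tradeoff, or objective fields.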

Treat inconclusive runs honestly

Not every candidate needs to be promoted or fully explained.

Some runs are inconclusive:

  • metric movement is within expected noise
  • tradeoff does not match the release objective
  • comparability evidence is incomplete
  • output changed but the reason is unclear
  • candidate combined too many changes to interpret

Inconclusive is a valid outcome. The bad outcome is pretending uncertainty is a win.
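The first item in that list, movement within expected noise, can be checked with a simple band around repeated baseline runs. The sketch below is one assumed way to do it: the repeated-run values are invented for illustration, the two-standard-deviation band is an arbitrary choice, and the labels assume a metric where higher is better.

```python
from statistics import mean, stdev


def movement_status(baseline_runs, candidate_value, k=2.0):
    """Label a candidate metric against the spread of repeated baseline runs.

    Anything within k sample standard deviations of the baseline mean is
    reported as inconclusive rather than as a gain or a loss.
    """
    center = mean(baseline_runs)
    band = k * stdev(baseline_runs)
    delta = candidate_value - center
    if abs(delta) <= band:
        return "inconclusive"
    return "improved" if delta > 0 else "regressed"


# Invented F1 values from rerunning the baseline with different seeds.
baseline_f1_runs = [0.80, 0.81, 0.82, 0.81, 0.80]

small_move = movement_status(baseline_f1_runs, 0.82)
clear_gain = movement_status(baseline_f1_runs, 0.84)
clear_loss = movement_status(baseline_f1_runs, 0.78)
```

The check only addresses noise. A candidate can clear the band and still be inconclusive for the other reasons listed, such as incomplete comparability evidence.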

Review checkpoint

You understand this core when you can:

  • check comparability before ranking candidates
  • compare metrics with parameters and review intent
  • explain tradeoffs instead of naming only the highest metric
  • identify inconclusive candidates
  • write a selection note that another reviewer can challenge

Candidate selection is where experiments become engineering judgment instead of metric shopping.