Skip to content

Worked Example: Promoting a Controlled Threshold Experiment

Page Maps

graph LR
  family["Reproducible Research"]
  program["Deep Dive DVC"]
  section["Experiments Baselines Controlled Change"]
  page["Worked Example: Promoting a Controlled Threshold Experiment"]
  capstone["Capstone evidence"]

  family --> program --> section --> page
  page -.applies in.-> capstone
flowchart LR
  orient["Orient on the page map"] --> read["Read the main claim and examples"]
  read --> inspect["Inspect the related code, proof, or capstone surface"]
  inspect --> verify["Run or review the verification path"]
  verify --> apply["Apply the idea back to the module and capstone"]

This example shows how Module 06 fits together when a candidate run looks promising but still needs a promotion decision.

The goal is not to promote the best-looking number. The goal is to decide whether the candidate should become part of history.

The situation

The current baseline says:

evaluate:
  threshold: 0.65

and reports:

{
  "incident_escalation": {
    "positive_class_f1_at_fixed_threshold": 0.81,
    "precision_at_fixed_threshold": 0.78,
    "recall_at_fixed_threshold": 0.84,
    "evaluation_population_size": 420
  }
}

The release objective is to reduce missed escalations, even if precision drops slightly.

You write the candidate intent before running it:

Lower evaluate.threshold to 0.50 to test whether recall improves enough to justify a precision tradeoff.

That sentence keeps the candidate narrow.

Step 1: Run a declared candidate

You run a candidate by changing the declared control:

dvc exp run --set-param evaluate.threshold=0.50

The important part is not the exact command syntax. The important part is that the threshold change is visible in the parameter surface.

This is better than editing a Python literal and hoping everyone remembers what changed.

Step 2: Compare against the baseline

The candidate reports:

{
  "incident_escalation": {
    "positive_class_f1_at_fixed_threshold": 0.84,
    "precision_at_fixed_threshold": 0.75,
    "recall_at_fixed_threshold": 0.95,
    "evaluation_population_size": 420
  }
}

You compare the whole review surface:

baseline threshold: 0.65
candidate threshold: 0.50

f1:        0.81 -> 0.84
precision: 0.78 -> 0.75
recall:    0.84 -> 0.95
population size: 420 -> 420

This is not a generic model improvement. It is a threshold tradeoff: recall improves, precision decreases, and the population size appears stable.

Step 3: Check scope

You confirm:

  • no model family change
  • no evaluation population change
  • no metric key change
  • no unrelated workspace edits
  • only the threshold control moved

That gives the candidate a coherent story.

If you had also changed feature filtering and metric definition, the candidate would not be a clean threshold experiment anymore. It might still be useful, but it would need a different review claim.

Step 4: Decide whether promotion is justified

The release objective favors recall. The candidate improves recall from 0.84 to 0.95. Precision drops from 0.78 to 0.75.

You write a promotion argument:

Promote the lower-threshold candidate because it substantially reduces missed escalations under the same metric schema and apparent population size. The precision drop is acceptable for this release objective.

That is a decision, not just a metric observation.

Step 5: Apply and inspect before committing

You apply the candidate:

dvc exp apply <candidate-id>
git diff
git status

You check that the workspace contains only intended changes:

  • params.yaml threshold change
  • updated metric evidence
  • any expected lock or output evidence

If unrelated local files appear, you stop and clean up before committing.

Applying is not promotion by itself. Promotion happens when the reviewed state is committed with a clear rationale.

The review note you would want

Promote the threshold candidate that changes evaluate.threshold from 0.65 to 0.50. Compared with the baseline, fixed-threshold F1 moves from 0.81 to 0.84, recall moves from 0.84 to 0.95, and precision moves from 0.78 to 0.75 on the same reported evaluation population size. This is a recall-oriented policy tradeoff, not a pure model improvement claim. Promotion is justified only because the current release objective prioritizes reducing missed escalations.

That note is strong because it names the control change, metric movement, tradeoff, and release basis.

Why this is a mastery example

This one story exercises the whole module:

  • Core 1: the baseline anchored the comparison
  • Core 2: the candidate stayed scoped to one review question
  • Core 3: DVC experiment mechanics preserved a candidate record
  • Core 4: selection considered tradeoffs, not only best F1
  • Core 5: promotion happened only after applying and inspecting the workspace

The candidate became part of history because the review argument was defensible, not because the number looked better.