Worked Example: Promoting a Controlled Threshold Experiment¶
Page Maps¶
graph LR
family["Reproducible Research"]
program["Deep Dive DVC"]
section["Experiments Baselines Controlled Change"]
page["Worked Example: Promoting a Controlled Threshold Experiment"]
capstone["Capstone evidence"]
family --> program --> section --> page
page -.applies in.-> capstone
flowchart LR
orient["Orient on the page map"] --> read["Read the main claim and examples"]
read --> inspect["Inspect the related code, proof, or capstone surface"]
inspect --> verify["Run or review the verification path"]
verify --> apply["Apply the idea back to the module and capstone"]
This example shows how Module 06 fits together when a candidate run looks promising but still needs a promotion decision.
The goal is not to promote the best-looking number. The goal is to decide whether the candidate should become part of history.
The situation¶
The current baseline says:
and reports:
{
"incident_escalation": {
"positive_class_f1_at_fixed_threshold": 0.81,
"precision_at_fixed_threshold": 0.78,
"recall_at_fixed_threshold": 0.84,
"evaluation_population_size": 420
}
}
The release objective is to reduce missed escalations, even if precision drops slightly.
You write the candidate intent before running it:
Lower
evaluate.thresholdto0.50to test whether recall improves enough to justify a precision tradeoff.
That sentence keeps the candidate narrow.
Step 1: Run a declared candidate¶
You run a candidate by changing the declared control:
The important part is not the exact command syntax. The important part is that the threshold change is visible in the parameter surface.
This is better than editing a Python literal and hoping everyone remembers what changed.
Step 2: Compare against the baseline¶
The candidate reports:
{
"incident_escalation": {
"positive_class_f1_at_fixed_threshold": 0.84,
"precision_at_fixed_threshold": 0.75,
"recall_at_fixed_threshold": 0.95,
"evaluation_population_size": 420
}
}
You compare the whole review surface:
baseline threshold: 0.65
candidate threshold: 0.50
f1: 0.81 -> 0.84
precision: 0.78 -> 0.75
recall: 0.84 -> 0.95
population size: 420 -> 420
This is not a generic model improvement. It is a threshold tradeoff: recall improves, precision decreases, and the population size appears stable.
Step 3: Check scope¶
You confirm:
- no model family change
- no evaluation population change
- no metric key change
- no unrelated workspace edits
- only the threshold control moved
That gives the candidate a coherent story.
If you had also changed feature filtering and metric definition, the candidate would not be a clean threshold experiment anymore. It might still be useful, but it would need a different review claim.
Step 4: Decide whether promotion is justified¶
The release objective favors recall. The candidate improves recall from 0.84 to 0.95.
Precision drops from 0.78 to 0.75.
You write a promotion argument:
Promote the lower-threshold candidate because it substantially reduces missed escalations under the same metric schema and apparent population size. The precision drop is acceptable for this release objective.
That is a decision, not just a metric observation.
Step 5: Apply and inspect before committing¶
You apply the candidate:
You check that the workspace contains only intended changes:
params.yamlthreshold change- updated metric evidence
- any expected lock or output evidence
If unrelated local files appear, you stop and clean up before committing.
Applying is not promotion by itself. Promotion happens when the reviewed state is committed with a clear rationale.
The review note you would want¶
Promote the threshold candidate that changes
evaluate.thresholdfrom0.65to0.50. Compared with the baseline, fixed-threshold F1 moves from0.81to0.84, recall moves from0.84to0.95, and precision moves from0.78to0.75on the same reported evaluation population size. This is a recall-oriented policy tradeoff, not a pure model improvement claim. Promotion is justified only because the current release objective prioritizes reducing missed escalations.
That note is strong because it names the control change, metric movement, tradeoff, and release basis.
Why this is a mastery example¶
This one story exercises the whole module:
- Core 1: the baseline anchored the comparison
- Core 2: the candidate stayed scoped to one review question
- Core 3: DVC experiment mechanics preserved a candidate record
- Core 4: selection considered tradeoffs, not only best F1
- Core 5: promotion happened only after applying and inspecting the workspace
The candidate became part of history because the review argument was defensible, not because the number looked better.