Prediction Review Guide
Guide Maps
graph LR
    evaluate["evaluate.py"] --> predictions["predictions.csv"]
    predictions --> queue["review-queue"]
    predictions --> threshold["threshold-review"]
    queue --> release["Release review"]
    threshold --> release

flowchart LR
    question["Which incidents deserve closer review?"] --> predictions["Read the prediction review surfaces"]
    predictions --> compare["Compare misclassifications and borderline rows"]
    compare --> release["Return to the release decision with concrete records"]
Use this guide when aggregate metrics look acceptable but you still need to know which
records deserve human attention. The goal is to make `predictions.csv`, `review-queue`,
and `threshold-review` work together as one honest review surface.
Review layers
| Surface | Best question |
|---|---|
| `predictions.csv` | what happened on each promoted eval row |
| `make review-queue` | which false positives and false negatives most need immediate inspection |
| `make threshold-review` | which promoted predictions are closest to the current decision line |
| `report.md` | which rows are worth mentioning in the human release summary |
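
The two derived surfaces can be read as two different sort orders over the same prediction rows. The sketch below is illustrative only and is not the project's implementation: the column names (`incident_id`, `label`, `score`), the 0.5 threshold, and the helper names `load_rows`, `review_queue`, and `threshold_review` are all assumptions, and the real `predictions.csv` schema may differ.

```python
import csv
import io

# Hypothetical sample; the real predictions.csv columns may differ.
SAMPLE = """incident_id,label,score
a1,1,0.92
a2,0,0.81
a3,1,0.12
a4,0,0.47
a5,1,0.55
"""

THRESHOLD = 0.5  # assumed decision line, not taken from the project


def load_rows(text):
    """Parse CSV text into dicts with a typed label and score."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({
            "incident_id": raw["incident_id"],
            "label": int(raw["label"]),
            "score": float(raw["score"]),
        })
    return rows


def review_queue(rows, threshold=THRESHOLD):
    """Return misclassified rows, most confidently wrong first."""
    mistakes = []
    for row in rows:
        predicted = 1 if row["score"] >= threshold else 0
        if predicted != row["label"]:
            kind = "false_positive" if predicted == 1 else "false_negative"
            mistakes.append({**row, "kind": kind})
    # Rows far from the threshold were wrong with high confidence.
    return sorted(mistakes, key=lambda r: abs(r["score"] - threshold), reverse=True)


def threshold_review(rows, threshold=THRESHOLD, limit=3):
    """Return the rows whose scores sit closest to the decision line."""
    return sorted(rows, key=lambda r: abs(r["score"] - threshold))[:limit]


rows = load_rows(SAMPLE)
queue = review_queue(rows)
borderline = threshold_review(rows, limit=2)

for row in queue:
    print(row["incident_id"], row["kind"], row["score"])
```

Under these assumptions the queue surfaces known mistakes regardless of how close they were to the threshold, while the borderline list includes correct predictions too, which is why the two surfaces answer different questions.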
Review rules
- use `review-queue` when you need known mistakes first
- use `threshold-review` when the decision line itself is under pressure
- return to raw `predictions.csv` when one team or incident pattern needs closer inspection
- do not treat aggregate metrics as a substitute for record-level review when the release question is operational
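
Returning to the raw rows for one team might look like the sketch below. It assumes a hypothetical `team` column alongside `label` and `score`, and a 0.5 threshold; none of these names are confirmed by the project, and the helpers `is_mistake`, `mistakes_by_team`, and `rows_for_team` exist only for illustration.

```python
from collections import Counter

# Hypothetical rows; a real predictions.csv may use different column names.
ROWS = [
    {"incident_id": "a1", "team": "payments", "label": 1, "score": 0.92},
    {"incident_id": "a2", "team": "payments", "label": 0, "score": 0.81},
    {"incident_id": "a3", "team": "search", "label": 1, "score": 0.12},
    {"incident_id": "a4", "team": "payments", "label": 1, "score": 0.30},
    {"incident_id": "a5", "team": "search", "label": 0, "score": 0.20},
]

THRESHOLD = 0.5  # assumed decision line


def is_mistake(row, threshold=THRESHOLD):
    """True when the thresholded score disagrees with the label."""
    predicted = 1 if row["score"] >= threshold else 0
    return predicted != row["label"]


def mistakes_by_team(rows, threshold=THRESHOLD):
    """Count misclassified rows per team to spot a concentrated pattern."""
    return Counter(row["team"] for row in rows if is_mistake(row, threshold))


def rows_for_team(rows, team):
    """Return every raw row for one team, correct or not, for inspection."""
    return [row for row in rows if row["team"] == team]


counts = mistakes_by_team(ROWS)
payments_rows = rows_for_team(ROWS, "payments")

for team, count in counts.most_common():
    print(team, count)
```

The per-team count only points at where to look; the release question is still answered by reading the individual records that `rows_for_team` returns.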
Best companion guides
- read CONTROL_SURFACE_GUIDE.md when the next question is whether threshold changes are still comparable
- read RELEASE_REVIEW_GUIDE.md when the next question is whether record-level evidence changes downstream trust
- read MODEL_GUIDE.md when the next question is whether a record pattern points back to the promoted scoring behavior