Runbooks, Escalation, and Operational Review¶
A workflow becomes easier to maintain when incident handling stops depending on memory.
That is what a runbook is for.
A good runbook does not try to explain the whole repository. It gives the next maintainer the shortest reliable route through a stressful question.
What a Module 09 runbook should answer¶
At minimum, the runbook should tell a maintainer:
- how to confirm the symptom without mutating the repository
- which command gives the narrowest honest answer first
- where the matching logs, benchmarks, and summaries live
- how to decide whether the issue is semantic, operational, or publish-related
- when to escalate instead of tuning locally
If any of those are missing, the team will fill the gap with folklore.
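One way to keep those five answers from drifting into folklore is to hold them in a small structure that can report its own gaps. The sketch below is a hypothetical illustration, not part of the module: the `Runbook` class and its field names are assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Runbook:
    """Hypothetical container; each field mirrors one question from the list above."""

    symptom_check: str = ""         # how to confirm the symptom without mutating the repository
    first_command: str = ""         # the narrowest honest command to run first
    evidence_locations: list[str] = field(default_factory=list)  # logs, benchmarks, summaries
    classification_guide: str = ""  # semantic vs operational vs publish-related
    escalation_rule: str = ""       # when to escalate instead of tuning locally

    def missing_answers(self) -> list[str]:
        """Return the names of the fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# A draft that only answers the first two questions.
draft = Runbook(
    symptom_check="dry-run only, no real execution",
    first_command="snakemake -n -p",
)
print(draft.missing_answers())
```

A draft that still reports missing answers is a useful review signal: each empty field is a place where the team would otherwise improvise.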
A simple runbook shape¶
flowchart LR
preflight["preflight question"] --> command["first command"]
command --> evidence["matching evidence surface"]
evidence --> classify["classify the issue"]
classify --> repair["local repair"]
classify --> escalate["documented escalation route"]
That shape is enough for most teams.
The five sections worth keeping¶
1. Symptom check¶
Name the smallest command that confirms the problem:
snakemake -n -p
snakemake --summary
make -C capstone wf-dryrun
This section should prevent unnecessary real runs.
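The symptom check can be wrapped so that printing the commands is the default and running them is an explicit choice. This is a minimal sketch under assumptions: the helper names `symptom_check_commands` and `run_symptom_check` and the `execute` flag are hypothetical, and the command strings are the ones listed in this section.

```python
import shlex
import subprocess

# Commands named in this section, in the order they would be tried.
SYMPTOM_COMMANDS = [
    "snakemake -n -p",
    "snakemake --summary",
    "make -C capstone wf-dryrun",
]


def symptom_check_commands() -> list[list[str]]:
    """Split each command string into an argv list suitable for subprocess."""
    return [shlex.split(cmd) for cmd in SYMPTOM_COMMANDS]


def run_symptom_check(execute: bool = False) -> list[int]:
    """Print each command; run it only when execute=True. Returns the exit codes."""
    codes = []
    for argv in symptom_check_commands():
        print("$", shlex.join(argv))
        if execute:
            # check=False: a failing dry run is evidence, not a reason to abort.
            codes.append(subprocess.run(argv, check=False).returncode)
    return codes


run_symptom_check()  # prints the commands without running anything
```

Calling `run_symptom_check(execute=True)` would run the three commands in order; it assumes `snakemake` and `make` are on the path and that the working directory is the repository root.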
2. Evidence route¶
Say where to look next:
- rule-local logs for one failing or slow target
- benchmark files for the suspicious rule family
- provenance or profile evidence when context differs
- published verification evidence when trust in outputs is the real question
3. Decision boundary¶
The runbook should help a maintainer decide whether the issue belongs to:
- workflow semantics
- operating context or profile policy
- storage and staging behavior
- tool implementation
- publish-boundary verification
This is the difference between a calm repair and an aimless investigation.
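The decision boundary can be made explicit by naming the five classes and writing down which observation points to which class. The sketch below is illustrative only: the observation tags, their priority order, and the fallback to tool implementation are all assumptions, not rules taken from the module.

```python
from enum import Enum


class IncidentClass(Enum):
    """The five classes listed above."""

    WORKFLOW_SEMANTICS = "workflow semantics"
    OPERATING_CONTEXT = "operating context or profile policy"
    STORAGE_STAGING = "storage and staging behavior"
    TOOL_IMPLEMENTATION = "tool implementation"
    PUBLISH_VERIFICATION = "publish-boundary verification"


def classify(observations: set[str]) -> IncidentClass:
    """Map hypothetical observation tags to one incident class.

    The tags and the order in which they are checked are assumptions
    made for this example.
    """
    if "dry_run_plan_changed" in observations:
        return IncidentClass.WORKFLOW_SEMANTICS
    if "published_outputs_fail_verification" in observations:
        return IncidentClass.PUBLISH_VERIFICATION
    if "differs_between_profiles" in observations:
        return IncidentClass.OPERATING_CONTEXT
    if "slow_or_missing_staged_files" in observations:
        return IncidentClass.STORAGE_STAGING
    # Assumed fallback: nothing above matched, so suspect the tool itself.
    return IncidentClass.TOOL_IMPLEMENTATION


print(classify({"differs_between_profiles"}).value)
```

Writing the order down matters more than the particular order chosen here: it forces the team to agree on which signal wins when two are present.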
4. Escalation triggers¶
Escalate when:
- the proposed fix changes workflow meaning
- the issue appears only in one operating context and profile review is required
- the published contract may no longer be trustworthy
- the same incident keeps returning and should become an executable check
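The four triggers above can be checked mechanically once an incident is recorded. The following is a hedged sketch: the `Incident` fields and the `repeat_threshold` default of 2 are hypothetical choices for illustration, and a real team would set its own threshold for "keeps returning".

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; one field per escalation trigger."""

    fix_changes_workflow_meaning: bool = False
    only_in_one_operating_context: bool = False
    published_contract_in_doubt: bool = False
    times_seen: int = 1


def escalation_reasons(incident: Incident, repeat_threshold: int = 2) -> list[str]:
    """Return every trigger that fires; an empty list means local repair is acceptable."""
    reasons = []
    if incident.fix_changes_workflow_meaning:
        reasons.append("proposed fix changes workflow meaning")
    if incident.only_in_one_operating_context:
        reasons.append("appears in one operating context; profile review required")
    if incident.published_contract_in_doubt:
        reasons.append("published contract may no longer be trustworthy")
    if incident.times_seen >= repeat_threshold:
        reasons.append("recurring incident; should become an executable check")
    return reasons


print(escalation_reasons(Incident(only_in_one_operating_context=True, times_seen=3)))
```

Returning the reasons rather than a bare yes or no gives the escalation note its first sentence.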
5. Proof route¶
End the runbook with the commands that prove the repair honestly:
make -C capstone evidence-summary
make -C capstone tour
make -C capstone verify-report
make -C capstone profile-audit
Not every repair needs all four commands. They are the module's reliable escalation surfaces.
Turning recurring incidents into reviewable operations¶
One of the best runbook improvements is converting a repeated manual check into a stable artifact or command.
Examples:
- a repeated "which profiles differ?" question becomes make profile-audit
- a repeated "what evidence exists from the last real run?" question becomes make evidence-summary
- a repeated "can I review the whole execution route in one place?" question becomes make tour
This is how operations become calmer without becoming opaque.
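Spotting which manual checks deserve promotion can itself be a small, reviewable step. The sketch below assumes a hypothetical list of questions collected from incident notes and a hypothetical `min_repeats` threshold of 3; neither comes from the module.

```python
from collections import Counter


def promotion_candidates(questions: list[str], min_repeats: int = 3) -> list[str]:
    """Return questions asked at least min_repeats times, normalized and sorted."""
    counts = Counter(q.strip().lower() for q in questions)
    return sorted(q for q, n in counts.items() if n >= min_repeats)


# Hypothetical questions gathered from past incident notes.
asked = [
    "Which profiles differ?",
    "which profiles differ?",
    "What evidence exists from the last real run?",
    "Which profiles differ? ",
]
print(promotion_candidates(asked))
```

Each candidate that comes back is a question worth answering once with a stable command or artifact instead of repeatedly by hand.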
What escalation should look like¶
Escalation is not failure. It is boundary recognition.
Escalation is healthy when a note says:
The evidence points away from rule-local runtime and toward profile or publish drift, so I am moving this from local tuning into profile review and verification.
That sentence protects the workflow from the wrong kind of "quick fix."
A short runbook example¶
Here is a minimal pattern:
- Confirm with snakemake -n -p or make -C capstone wf-dryrun.
- Use snakemake --summary to check whether the workflow state matches the report.
- Inspect the matching rule log and benchmark only for the affected target family.
- If context differs, inspect provenance or run make -C capstone profile-audit.
- If published trust is in question, run make -C capstone verify-report.
- Record the incident class before proposing a repair.
That is short enough to use and strong enough to guide review.
Common runbook mistakes¶
- listing every command in the repository instead of the first honest one
- mixing semantic repair steps with profile-only operating advice
- treating scratch or temporary outputs as if they were trusted contract surfaces
- ending with "investigate further" instead of a real escalation route
The runbook exists to remove ambiguity, not to preserve it.
Keep this standard¶
By the end of this module, a teammate should be able to answer:
- what do I run first?
- what do I inspect second?
- when do I stop patching and escalate?
If the runbook does not answer those three questions, it is still notes, not operations.