Worked Example: Investigating a Slow and Noisy Workflow
This example shows how Module 09 fits together when a workflow starts feeling wrong under real pressure.
The point is not to memorize this exact story. The point is to see a calm route from symptom to explanation.
The situation
A maintainer reports:
"CI is much slower than last week, `make tour` feels noisy, and the publish summary was rebuilt for every sample even though we only changed one helper script."
That report contains three different concerns:
- slower execution
- noisier evidence
- surprising rebuild scope
A weak response would jump straight to retries, threads, or job grouping.
Instead, start with the Module 09 ladder.
Step 1: Name the symptom
The incident note becomes:
`make tour` now takes roughly twice as long in CI. The publish summary and report were rebuilt for all samples. The maintainer also reports much noisier execution logs than the last reviewed run.
That is already better than "the workflow got weird."
Step 2: Check planned work before real execution
Run:
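As a hedged sketch of that check, the read-only commands can be assembled as below. It assumes a Snakemake 8.x command line; older releases spell the change listing as `--list-input-changes`, `--list-code-changes`, and `--list-params-changes`. The helper name `evidence_commands` is hypothetical, and the course's own command listing may differ.

```python
import shlex


def evidence_commands():
    """Return read-only Snakemake invocations, ordered from broad to specific.

    Assumption: Snakemake 8.x flag spellings. None of these commands
    executes a job; they only report what Snakemake believes.
    """
    return [
        # Plan only: list the jobs that would run, without executing anything.
        ["snakemake", "--dry-run"],
        # Status table for each output file.
        ["snakemake", "--summary"],
        # Outputs whose inputs, code, or params changed since they were created.
        ["snakemake", "--list-changes", "input"],
        ["snakemake", "--list-changes", "code"],
        ["snakemake", "--list-changes", "params"],
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed or pasted into a shell.
    for cmd in evidence_commands():
        print(shlex.join(cmd))
```

Keeping the commands as data makes the order of inspection reviewable before anything is run.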
What you learn:
- dry-run plans far more per-sample jobs than expected
- `--summary` confirms that summary-oriented outputs are stale because many upstream sample outputs are now considered missing or changed
- `--list-changes` points mostly to input and code changes rather than parameter drift
That is the first big clue.
The workflow is not only slow. It is planning more work.
Step 3: Separate cost classes
At this point, do not say "the tools got slower."
Nothing yet suggests that.
The likely dominant cost classes are now:
- planning and discovery, because the workflow scope widened
- scheduler overhead, because the widened scope creates many more short jobs
Tool runtime is still only a possibility, not the current lead explanation.
Step 4: Open the narrowest evidence surfaces
Now inspect:
- one benchmark file from the previously suspicious rule family
- one log file from a sample that should not have changed
- the discovery artifact that lists which samples were found
The benchmarks show that rule runtime per sample is almost unchanged.
The logs show many jobs running on sample names that look wrong:
- expected names such as `sampleA`
- unexpected names such as `sampleA.fastq.gz.md5`
That is the second big clue.
The slowdown did not come mainly from the tools. It came from discovery widening and creating extra tiny jobs.
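The benchmark comparison in this step can be sketched as follows. It assumes Snakemake's tab-separated benchmark layout with a wall-clock `s` column. The timing values are invented for illustration, and the helper names `mean_seconds` and `relative_change` are hypothetical.

```python
import csv
import io
import statistics


def mean_seconds(benchmark_text):
    """Mean wall-clock seconds from the `s` column of a benchmark TSV."""
    rows = csv.DictReader(io.StringIO(benchmark_text), delimiter="\t")
    return statistics.mean(float(row["s"]) for row in rows)


def relative_change(before_text, after_text):
    """Fractional runtime change between two benchmark files for one sample."""
    before = mean_seconds(before_text)
    after = mean_seconds(after_text)
    return (after - before) / before


# Illustrative contents only: one valid sample, last reviewed run vs. now.
BEFORE = "s\th:m:s\tmax_rss\n41.2\t0:00:41\t512.3\n"
AFTER = "s\th:m:s\tmax_rss\n42.0\t0:00:42\t515.0\n"

if __name__ == "__main__":
    change = relative_change(BEFORE, AFTER)
    # A small change for a valid sample argues against a tool-level regression.
    print(f"runtime change for the valid sample: {change:+.1%}")
```

A per-sample change this small points the investigation back at how many jobs ran, not at how long each one took.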
Step 5: Find the boundary that moved
A recent helper edit changed sample discovery from a tight file pattern to a broader glob.
The repository now treats checksum sidecars as if they were real samples.
That causes three visible effects:
- dry-run plans many extra jobs
- scheduler overhead rises because most of those jobs are tiny
- logs become noisy because rule-local messages now mention invalid sample identities
This is a strong example of why performance and observability are connected.
The performance symptom came from a workflow-semantics mistake, and the noisy evidence came from the same mistake.
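A minimal sketch of how such a boundary can move is shown below. The helper's real code and patterns are not shown here, so the file names, the `*.fastq.gz*` glob, and both function names are hypothetical stand-ins chosen to reproduce the `sampleA.fastq.gz.md5` identity seen in the logs.

```python
import fnmatch
import re

SUFFIX = ".fastq.gz"
# Tight pattern: the whole file name must be '<sample>.fastq.gz'.
# Assumption: sample ids contain only letters, digits, '_' and '-'.
TIGHT = re.compile(r"(?P<sample>[A-Za-z0-9_-]+)\.fastq\.gz")

FILES = [
    "sampleA.fastq.gz",
    "sampleA.fastq.gz.md5",
    "sampleB.fastq.gz",
    "sampleB.fastq.gz.md5",
]


def strip_suffix(name):
    """Drop '.fastq.gz' only when the name ends with it."""
    return name[: -len(SUFFIX)] if name.endswith(SUFFIX) else name


def discover_broad(filenames):
    """Widened discovery: the trailing '*' also admits checksum sidecars."""
    hits = fnmatch.filter(filenames, "*.fastq.gz*")
    return sorted(strip_suffix(name) for name in hits)


def discover_tight(filenames):
    """Reviewed discovery: only full matches of the tight pattern count."""
    matches = (TIGHT.fullmatch(name) for name in filenames)
    return sorted(m.group("sample") for m in matches if m)


if __name__ == "__main__":
    print("broad:", discover_broad(FILES))
    print("tight:", discover_tight(FILES))
```

With the broad glob, every sidecar becomes a "sample" whose name still carries the file extension, which is the kind of identity that then shows up in job plans and logs.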
Step 6: Repair the right thing
The honest repair is not:
- add more cores
- raise retries
- delete noisy logs
- group the jobs and hope the problem becomes less visible
The honest repair is:
- restore a reviewed sample-discovery rule
- make the discovered sample list easy to inspect
- keep the log and benchmark surfaces tied to valid sample identities
Only after that repair should you reconsider whether any remaining speed issue is still worth tuning.
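One way to keep the discovered sample list inspectable is to validate it and write it out as a small artifact, sketched below. The validity rule (no dots in a sample id), the JSON layout, and the function names are assumptions for illustration, not the repository's actual convention.

```python
import json
import re
from pathlib import Path

# Assumption: valid sample ids use only letters, digits, '_' and '-'.
VALID_SAMPLE = re.compile(r"[A-Za-z0-9_-]+")


def validate_samples(samples):
    """Return the sorted unique samples, or raise if any identity is invalid."""
    bad = sorted(s for s in samples if not VALID_SAMPLE.fullmatch(s))
    if bad:
        raise ValueError(f"invalid sample identities: {bad}")
    return sorted(set(samples))


def write_discovery_artifact(samples, path):
    """Write the validated sample list to a JSON file that reviewers can diff."""
    clean = validate_samples(samples)
    payload = {"count": len(clean), "samples": clean}
    Path(path).write_text(json.dumps(payload, indent=2) + "\n")
    return clean


if __name__ == "__main__":
    write_discovery_artifact(["sampleA", "sampleB"], "discovered_samples.json")
    print(Path("discovered_samples.json").read_text())
```

Failing loudly on an invalid identity turns a silent widening of scope into an error at discovery time, before any job is planned.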
Step 7: Prove the repair honestly
Use the same route again:
What you want to see:
- dry-run target count falls back to the expected range
- `--summary` stops showing unnecessary rebuild scope
- evidence becomes quieter because it now reflects real sample identities
- the run time improves without any semantic shortcuts
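The first of these checks can be automated with a small guard, sketched here. It assumes the dry-run output contains a job stats table ending in a `total` row, which can vary between Snakemake versions. The sample text, rule names, and expected range are hypothetical.

```python
import re


def planned_job_total(dry_run_output):
    """Return the integer on the 'total' row of the job stats table, or None."""
    match = re.search(r"^total\s+(\d+)", dry_run_output, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def within_expected(total, low, high):
    """True when a total was found and falls inside the reviewed range."""
    return total is not None and low <= total <= high


# Hypothetical excerpt of dry-run output; the layout is an assumption.
SAMPLE_OUTPUT = "\n".join(
    [
        "Job stats:",
        "job          count",
        "---------  -------",
        "all              1",
        "align            2",
        "summarize        1",
        "total            4",
    ]
)

if __name__ == "__main__":
    total = planned_job_total(SAMPLE_OUTPUT)
    print("planned jobs:", total, "ok:", within_expected(total, 3, 6))
```

Treating a missing `total` row as a failure keeps the guard from passing silently if the output format changes.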
What this example teaches
This incident matters because it is easy to misread.
A rushed maintainer could easily conclude:
- CI needs more resources
- Snakemake scheduling is inefficient
- the logs are too verbose
Those claims all point away from the root problem.
The real issue was a widened discovery boundary that created false work and false noise.
The review note you would want in the pull request
The slowdown was not primarily tool runtime. Dry-run and summary evidence showed that the workflow had started planning extra jobs after sample discovery widened to include checksum sidecars. Benchmarks for valid samples stayed close to their previous timings, which argues against a tool-level regression. The repair restores a reviewed discovery pattern and keeps the discovered sample list inspectable. Speed improved because the extra work disappeared, not because the workflow was taught to skip truth.
That is the standard this module is aiming for.
Why this is a mastery example
This one story exercises all five cores:
- Core 1: cost classes were separated before tuning
- Core 2: the right evidence surfaces were chosen in order
- Core 3: the incident ladder prevented random edits
- Core 4: the repair preserved workflow semantics
- Core 5: the route is short enough to become a runbook entry