Workflow Cost Models and Timing Surfaces¶
Page Maps¶
graph LR
family["Reproducible Research"]
program["Deep Dive Snakemake"]
section["Performance Observability Incident Response"]
page["Workflow Cost Models and Timing Surfaces"]
capstone["Capstone evidence"]
family --> program --> section --> page
page -.applies in.-> capstone
flowchart LR
orient["Orient on the page map"] --> read["Read the main claim and examples"]
read --> inspect["Inspect the related code, proof, or capstone surface"]
inspect --> verify["Run or review the verification path"]
verify --> apply["Apply the idea back to the module and capstone"]
When a team says "the workflow is slow," they usually mean at least four different things.
That vagueness is the first problem to fix.
Snakemake spends time in layers, and those layers do not point to the same repair:
- workflow planning and discovery
- scheduler and job-launch overhead
- filesystem or staging latency
- real tool runtime
If you merge them into one complaint, every later decision gets worse.
The four cost classes¶
| Cost class | What it means | Typical symptoms | Best first evidence |
|---|---|---|---|
| planning cost | time spent reading config, expanding targets, resolving wildcards, and building the DAG | dry-run feels slow before any job starts | snakemake -n -p, target count, discovery artifacts |
| scheduler cost | time spent launching, tracking, and finishing many jobs | thousands of short jobs make the run feel busy but not productive | job count, per-rule runtime, executor logs |
| storage drag | time lost to staging, filesystem latency, or slow visibility of outputs | jobs finish but downstream rules wait, rerun, or see files late | rule logs, storage-specific timings, profile settings |
| tool runtime | time spent inside scripts or external tools | one or two rules dominate elapsed time | benchmark: files, tool logs, resource usage |
The job is not to memorize these names. The job is to stop reaching for the wrong fix.
A practical mental model¶
flowchart LR
targets["target request"] --> plan["planning and DAG construction"]
plan --> launch["job launch and scheduling"]
launch --> storage["staging and file visibility"]
storage --> tool["tool execution"]
tool --> publish["declared outputs and published evidence"]
This is not a strict clock trace. It is a teaching model.
A slow run can touch more than one layer, but you should still ask which layer dominates.
How to ask the first honest question¶
Start with these questions in order:
- Is the workflow planning more work than expected?
- Are there too many tiny jobs for the chosen executor and storage context?
- Are jobs waiting on files rather than on computation?
- Is one rule or tool genuinely expensive?
That order matters because it keeps you from blaming a tool for a workflow-shape problem or blaming Snakemake for a storage problem.
A small example¶
Imagine a workflow with 800 samples and three light preprocessing rules per sample.
Each rule takes about one second of real tool time.
If the executor needs roughly the same amount of time to launch and finalize each job, the run may feel slow even though no single tool is expensive. That is a scheduler-shape problem, not a "buy a faster aligner" problem.
Now change the story:
- dry-run is already slow
- the discovered sample list doubled after a helper edit
- benchmarks for the tools look normal
That is not primarily scheduler cost. It is a planning and discovery problem.
The right fix is to repair discovery or target expansion, not to tune threads.
Timing surfaces you can trust¶
Use more than one surface, but keep each surface narrow.
| Surface | What it helps you decide |
|---|---|
snakemake -n -p |
whether the planned work matches your mental model |
snakemake --summary |
which outputs exist, are pending, or were rebuilt |
benchmark: files |
which rules actually consume time once launched |
| per-rule logs | whether time was spent computing, waiting, or failing |
make -C capstone evidence-summary |
whether logs, benchmarks, provenance, and published paths still agree |
No single artifact explains the whole run. That is normal.
Common misreadings¶
Mistaking planner cost for tool cost¶
If dry-run is already surprisingly slow, launching the real run earlier will not explain the problem. It only adds more moving parts.
Mistaking scheduler cost for parallel speedup opportunity¶
Many tiny jobs do not automatically justify more cores. In some contexts, more concurrency makes the scheduler and filesystem work harder while the tools stay tiny.
Mistaking storage drag for nondeterminism¶
Late file visibility, staging delays, or slow shared storage can look random. Before you call a run flaky, ask whether files are arriving where the workflow expects them on the timeline the executor and storage actually provide.
Mistaking one noisy log for the whole cost story¶
A loud log is not the same thing as an expensive rule. Volume and runtime are different signals.
What a good first note looks like¶
Before proposing a fix, write a note no longer than five lines:
- which cost class looks dominant
- which artifact suggests that
- what you have ruled out already
- what narrower measurement you will collect next
Example:
The current slowdown looks scheduler-dominated rather than tool-dominated. Dry-run plans 2,400 short jobs, while existing benchmark files still show sub-second rule runtimes. I have not seen evidence of slower tool behavior yet. Next I want the per-rule job count and one representative benchmark from the busiest rule family.
That note is already more useful than "workflow feels slow."
Keep this standard¶
Do not approve performance work until the review names the cost class first.
If the diagnosis starts with threads, retries, grouping, or profile edits before the cost class is named, the workflow is already at risk of being tuned blindly.