Module 04: Scaling Workflows and Interface Boundaries¶
Module Position¶
```mermaid
flowchart TD
  family["Reproducible Research"] --> program["Deep Dive Snakemake"]
  program --> module["Module 04: Scaling Workflows and Interface Boundaries"]
  module --> lessons["Lesson pages and worked examples"]
  module --> checkpoints["Exercises and closing criteria"]
  module --> capstone["Related capstone evidence"]
```

```mermaid
flowchart TD
  purpose["Start with the module purpose and main questions"] --> lesson_map["Use the lesson map to choose reading order"]
  lesson_map --> study["Read the lessons and examples with one review question in mind"]
  study --> proof["Test the idea with exercises and capstone checkpoints"]
  proof --> close["Move on only when the closing criteria feel concrete"]
```
Read the first diagram as a placement map: this page sits between the course promise, the lesson pages listed below, and the capstone surfaces that pressure-test the module. Read the second diagram as the study route for this page; together, the diagrams point you toward the Lesson map, Exercises, and Closing criteria rather than acting as decoration.
Module 04 turns one working workflow into a repository another engineer can still review. Interfaces, file APIs, module boundaries, and CI gates matter because they keep growth from turning into hidden coupling.
Capstone exists here as corroboration. The module should already make interface boundaries meaningful before you inspect the repository layout and proof surfaces in the reference workflow.
Scaling Workflows and Interface Boundaries¶
Version & scope contract
- Target: Snakemake 9.14.x (CLI reporting, profiles, modules, schema validation).
- This module is about scaling correctness, not tuning a specific cluster (Module 03) and not checkpoints/dynamic DAGs (Module 02).
- Verify: `snakemake --version` should report 9.14.x before you rely on the commands below.
Why this module matters¶
Teams often say a workflow has “outgrown itself” when the real problem is that nobody defined its interfaces. Files, schemas, modules, CI gates, and resource claims become implicit, so every change starts to feel dangerous.
This module is about restoring that safety by making boundaries explicit enough that workflows can scale in size, team count, and execution context without becoming fragile.
Reading path¶
- Start with module boundaries and file APIs.
- Read CI and drift control after the interface story is clear.
- Read resource semantics after you know what is being protected.
- Finish with workflow-as-product thinking so scaling is understood as a contract problem.
Capstone connection¶
The capstone’s FILE_API.md, workflow module layout, and confirmation targets are the concrete
expression of this module. If you want to see what “executor-proof semantics” means in one small
repository, this is the module that explains that shape.
At a Glance¶
| Focus | Learner question | Capstone timing |
|---|---|---|
| modular boundaries | "Where does one workflow responsibility end and another begin?" | inspect the capstone once interface boundaries already feel meaningful |
| file APIs | "Which paths are public contracts and which are internal detail?" | compare FILE_API.md with rule files intentionally |
| CI gates | "What should scale review protect before the workflow gets bigger?" | use confirmation targets as the operational side of the interface story |
Orientation: scaling fails at boundaries¶
When a workflow “doesn’t scale”, it’s usually not CPU. It’s:
- hidden coupling across files,
- implicit file contracts,
- nondeterministic outputs,
- or resources that were never proved to the executor.
Unified scaling model¶
Total scaling failure ≈ hidden coupling + implicit contracts + entropy + unproven resources + upgrade drift
```mermaid
flowchart LR
  P[Provider module] -->|File API: paths + formats| C[Consumer workflow]
  C --> V["Schema validate (fail fast)"]
  C --> G[CI gates: lint + dryrun + tests + diffs]
  C --> D[Drift evidence: list-changes + summary]
  C --> R[Resource evidence: logs show threads/mem]
  V --> S[Scale without fear]
  G --> S
  D --> S
  R --> S
```
Runnable lab (single repo, two assemblies)¶
You will have two assemblies in one repo:

- Module assembly (real boundaries): `Snakefile`
- Single-file sanity (fast baseline): `Snakefile.reference`
Golden layout¶
```mermaid
graph TD
  lab["lab/"]
  lab --> snakefile["Snakefile"]
  lab --> reference["Snakefile.reference"]
  lab --> config["config/"]
  lab --> data["data/"]
  lab --> modules["modules/"]
  lab --> workflow["workflow/"]
  lab --> profiles["profiles/"]
  lab --> ci["ci/"]
  config --> configYaml["config.yaml"]
  config --> schemaYaml["schema.yaml"]
  config --> schemas["schemas/"]
  schemas --> refSchema["ref.schema.yaml"]
  schemas --> brokenSchema["broken.schema.yaml"]
  data --> a["A.txt"]
  data --> b["B.txt"]
  modules --> provider["provider/"]
  provider --> providerSnakefile["Snakefile"]
  workflow --> contracts["contracts/"]
  workflow --> rules["rules/"]
  contracts --> fileApi["FILE_API.md"]
  rules --> consumer["consumer.smk"]
  rules --> entropy["entropy.smk"]
  rules --> resources["resources.smk"]
  profiles --> local["local/"]
  local --> profileConfig["config.yaml"]
  ci --> gate["gate.sh"]
```
profiles/local/config.yaml¶
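The original listing is not shown here; a minimal sketch of what a local profile like this typically contains (profile keys map one-to-one to CLI flags, e.g. `cores: 2` is `--cores 2`):

```yaml
# Minimal local execution profile (a sketch, not the lab's exact file).
cores: 2
printshellcmds: true
```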
config/config.yaml¶
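The original listing is not shown, but the schema below and the paths used throughout this lab pin the baseline config down to something like:

```yaml
# Reconstructed baseline config: the only two keys the strict schema allows.
results_prefix: results/v1
samples: [A, B]
```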
config/schema.yaml¶
```yaml
type: object
required: [results_prefix, samples]
properties:
  results_prefix: {type: string, minLength: 1}
  samples:
    type: array
    minItems: 1
    items: {type: string, pattern: "^[A-Za-z0-9._-]+$"}
additionalProperties: false
```
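To see exactly what this schema rejects, here is a hand-rolled Python approximation of the same checks (illustrative only; the workflow itself relies on `snakemake.utils.validate`, which applies the schema as JSON Schema):

```python
import re

ALLOWED_KEYS = {"results_prefix", "samples"}

def config_errors(cfg: dict) -> list:
    # Mirrors config/schema.yaml: strict keys, required fields, sample pattern.
    errors = [f"unexpected property {k!r}" for k in cfg if k not in ALLOWED_KEYS]
    if not isinstance(cfg.get("results_prefix"), str) or not cfg.get("results_prefix"):
        errors.append("results_prefix must be a non-empty string")
    samples = cfg.get("samples")
    if not isinstance(samples, list) or not samples:
        errors.append("samples must be a non-empty array")
    else:
        errors += [f"bad sample name {s!r}" for s in samples
                   if not isinstance(s, str) or not re.fullmatch(r"[A-Za-z0-9._-]+", s)]
    return errors

# The lab's baseline passes; a typo'd key produces errors instead of silence.
assert config_errors({"results_prefix": "results/v1", "samples": ["A", "B"]}) == []
assert config_errors({"results_prefix": "results/v1", "samplez": ["A"]})
```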
workflow/contracts/FILE_API.md¶
```markdown
# File API (v1)

Provider outputs:
- path: results/v1/provider/{sample}.upper.txt
- semantics: uppercase of data/{sample}.txt

Consumer outputs:
- path: results/v1/consumer/all.upper.txt
- semantics: concatenation of provider outputs in config.samples order

Breaking changes:
- Any change to output paths, patterns, or semantics => bump results_prefix to results/v2
```
data/A.txt, data/B.txt¶
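The lesson only requires that these files exist and hold lowercase text; a minimal pair, assuming any placeholder content is acceptable:

```shell
# Create minimal lowercase inputs (contents are placeholders, not the
# lesson's original data).
mkdir -p data
printf 'alpha\n' > data/A.txt
printf 'beta\n' > data/B.txt
```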
Commissioning sequence (module assembly)¶
```shell
mkdir -p .proof
snakemake --profile profiles/local --lint
snakemake --profile profiles/local -n
snakemake --profile profiles/local --list-rules
snakemake --profile profiles/local --rulegraph mermaid-js > .proof/rulegraph.mmd
snakemake --profile profiles/local --cores 2
snakemake --profile profiles/local --summary > .proof/summary.txt
snakemake --profile profiles/local --list-changes code > .proof/list-changes.code.txt || true
```
Expected invariants you can verify immediately:

- `results/v1/consumer/all.upper.txt` exists and contains the uppercase contents of `data/A.txt` and `data/B.txt` in `config.samples` order.
- `.proof/summary.txt` starts with a tab-separated header row naming the summary columns.
Core 1 — Modularity that scales: include vs module and real boundaries¶
Learning objectives¶
You will be able to:
- Reproduce a silent correctness bug caused by `include:` namespace leakage.
- Replace it with a `module` boundary + explicit `use rule` imports.
- Prove the boundary using `--list-rules` and a stable file API.
Definition¶
`include:` merges Snakefiles into one namespace (globals can collide). `module` loads a workflow into its own namespace; `use rule ... from ...` imports explicitly.
Semantics¶
```mermaid
flowchart TD
  I[include: shared namespace] --> Leak[globals collide]
  M[module namespace] --> Use[use rule imports only what you depend on]
  Use --> API[depend on file API, not provider globals]
```
Failure signatures¶
- Dry-run target set changes with “no obvious reason”.
- Consumer changes provider behavior without touching provider code.
Minimal repro (leak via include)¶
modules/provider/Snakefile¶
```python
SAMPLES = ["A", "B"]

rule provider_make:
    input: "data/{sample}.txt"
    output: "results/v1/provider/{sample}.upper.txt"
    shell: "tr '[:lower:]' '[:upper:]' < {input} > {output}"
```
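The provider rule's shell line is plain `tr`; you can sanity-check the transform on its own:

```shell
# tr maps each lowercase character class member to its uppercase counterpart.
printf 'alpha\n' | tr '[:lower:]' '[:upper:]'
# prints: ALPHA
```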
workflow/rules/consumer.smk (bug: overwrites provider global)¶
```python
SAMPLES = ["A"]  # accidental narrowing

rule consumer_all:
    input:
        expand("results/v1/provider/{sample}.upper.txt", sample=SAMPLES)
    output:
        "results/v1/consumer/all.upper.txt"
    shell:
        "cat {input} > {output}"
```
Top-level Snakefile (bad assembly)¶
```python
include: "modules/provider/Snakefile"
include: "workflow/rules/consumer.smk"

rule all:
    input: "results/v1/consumer/all.upper.txt"
```
Run: `snakemake --profile profiles/local -n`
Expected planning evidence (verbatim file set):

- Planned input includes `results/v1/provider/A.upper.txt`
- Planned input does not include `results/v1/provider/B.upper.txt`
Fix pattern (module boundary + config-derived list)¶
workflow/rules/consumer.smk (fixed: no globals)¶
```python
rule consumer_all:
    input:
        expand(f"{config['results_prefix']}/provider/{{sample}}.upper.txt", sample=config["samples"])
    output:
        f"{config['results_prefix']}/consumer/all.upper.txt"
    shell:
        "cat {input} > {output}"
```
Top-level Snakefile (good assembly)¶
```python
from snakemake.utils import validate

configfile: "config/config.yaml"
validate(config, "config/schema.yaml")

module provider:
    snakefile: "modules/provider/Snakefile"

use rule provider_make from provider as provider_*

include: "workflow/rules/consumer.smk"

rule all:
    input: f"{config['results_prefix']}/consumer/all.upper.txt"
```
Run: `snakemake --profile profiles/local -n`
Expected planning evidence (verbatim file set):

Planned inputs include both:

- `results/v1/provider/A.upper.txt`
- `results/v1/provider/B.upper.txt`
Proof hook¶
Provide:
- `snakemake --profile profiles/local --list-rules` output (before/after).
- Two dry-runs showing the target set changed exactly as described.
Core 2 — Interface contracts: naming, schemas, versioned outputs, compatibility¶
Learning objectives¶
You will be able to:
- Fail fast on bad config via schema validation.
- Classify changes as breaking/non-breaking mechanically.
- Force breaking changes to be explicit via a `results_prefix` bump.
Definition¶
A contract is path + format + semantics, written down and enforced.
Semantics¶
- The config schema makes typos impossible to ignore.
- The file API doc makes upgrades reviewable.
Failure signatures¶
- “It ran, but outputs are wrong” (schema drift or semantic drift).
- Typos accepted (missing strict schema).
Minimal repro (schema failure must be immediate)¶
Break config/config.yaml by renaming the required `samples` key:
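One concrete rendering of the broken file, consistent with the validation errors this repro expects:

```yaml
results_prefix: results/v1
samplez: [A, B]  # typo: `samples` is now missing and `samplez` is unexpected
```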
Run: `snakemake --profile profiles/local -n`
Expected failure evidence (verbatim fragment):

A validation error mentioning:

- missing required property `samples`
- unexpected property `samplez`
Fix pattern¶
- Keep `additionalProperties: false`.
- Put breaking changes behind `results/v2/...`.
Proof hook¶
Provide:
- The schema error excerpt.
- The fixed config plus a successful `-n` dry-run.
Core 3 — Determinism and drift control: CI as the correctness boundary¶
Learning objectives¶
You will be able to:
- Demonstrate nondeterminism with a stable diff.
- Enforce a CI gate that catches entropy and drift.
- Produce drift artifacts (`--summary`, `--list-changes`) as PR evidence.
Definition¶
Determinism means:
- stable plan for stable inputs,
- stable outputs for stable inputs,
- stable provenance signals.
Semantics¶
```mermaid
flowchart LR
  A[Run] --> B[Artifacts: outputs + summary + list-changes]
  C[Change code/params] --> D[list-changes flags impacted outputs]
  E[Entropy in outputs] --> F[diff fails => CI fails]
```
Failure signatures¶
- “CI is flaky” (time/RNG/unordered globs in outputs).
- Drift report is empty when it shouldn’t be (metadata dropped or bypassed).
Minimal repro (prove entropy)¶
workflow/rules/entropy.smk¶
```python
rule entropy_bad:
    output: f"{config['results_prefix']}/entropy.txt"
    shell:
        # Redirect on the heredoc command line, so the timestamp lands in the
        # output file instead of stdout.
        "python - > {output} << 'PY'\n"
        "import time\n"
        "print(time.time())\n"
        "PY\n"
```
Run twice and diff:
```shell
snakemake --profile profiles/local -F results/v1/entropy.txt
cp results/v1/entropy.txt /tmp/e1.txt
snakemake --profile profiles/local -F results/v1/entropy.txt
diff -u /tmp/e1.txt results/v1/entropy.txt && echo OK || echo NONDETERMINISTIC
```
Expected output: the final line is `NONDETERMINISTIC` (the two timestamps differ, so `diff` fails).
Fix pattern¶
- Entropy goes to logs, not semantic outputs.
- If randomness is required, seed is config and becomes provenance.
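If randomness is genuinely required, the seed-as-provenance idea can be sketched in plain Python (the `seed` config key is an assumption; the strict schema above would need an optional `seed` property before a workflow could carry it):

```python
import random

def sample_with_seed(seed: int, n: int = 3) -> list:
    # A dedicated Random instance keeps the stream isolated and reproducible.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Same seed, same artifact: the output is now diffable in CI.
assert sample_with_seed(42) == sample_with_seed(42)
assert sample_with_seed(42) != sample_with_seed(43)
```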
CI gate (minimal, enforceable)¶
ci/gate.sh¶
```shell
#!/usr/bin/env bash
set -euo pipefail
mkdir -p .proof
snakemake --profile profiles/local --lint
snakemake --profile profiles/local -n
snakemake --profile profiles/local --cores 2
snakemake --profile profiles/local --summary > .proof/summary.txt
snakemake --profile profiles/local --list-changes code > .proof/list-changes.code.txt || true
```
Proof hook¶
Provide:
- the failing diff (`NONDETERMINISTIC`) and the fixed diff (`OK`),
- `.proof/summary.txt` and `.proof/list-changes.code.txt`.
Core 4 — Resource semantics with evidence: prove what the workflow asked for¶
Learning objectives¶
You will be able to:
- Resolve dynamic resources deterministically from explicit evidence (input size).
- Prove resolved `threads` and `mem_mb` using log artifacts.
- Detect oversubscription failures before cluster execution.
Definition¶
Resource correctness is not “I wrote resources:”. It is:
- the workflow resolved concrete values,
- evidence was produced,
- and the executor could have enforced them.
Semantics (portable evidence)¶
```mermaid
flowchart TD
  I[input size] --> R[resolve mem_mb]
  R --> L[write evidence log]
  L --> A[audit in CI / review]
```
Failure signatures¶
- “Resources ignored” (no evidence exists; you’re guessing).
- Oversubscription (threads > available cores) causes stalls/rejections.
Minimal repro (resource evidence logs)¶
workflow/rules/resources.smk¶
```python
def mem_mb_from_input(wc, input):
    # deterministic; scale with size for demonstration
    return max(200, int(2 * input.size_mb) + 200)

rule resource_probe:
    input: "data/{sample}.txt"
    output: f"{config['results_prefix']}/resources/{{sample}}.done.txt"
    log: "logs/resources/{sample}.txt"
    threads: 2
    resources:
        mem_mb=mem_mb_from_input
    shell:
        r"""
        printf "sample={wildcards.sample}\nthreads={threads}\nmem_mb={resources.mem_mb}\ninput={input}\n" > {log}
        echo OK > {output}
        """
```
Make B large:
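The page does not prescribe how; one way, assuming `data/B.txt` may be overwritten and that a few megabytes counts as "large" for this demo:

```shell
# Overwrite data/B.txt with ~5 MB of lowercase text so its resolved mem_mb
# must exceed data/A.txt's.
mkdir -p data
python3 -c "print('b' * 5_000_000)" > data/B.txt
```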
Run:
```shell
snakemake --profile profiles/local results/v1/resources/A.done.txt results/v1/resources/B.done.txt
echo "=== A ==="; cat logs/resources/A.txt
echo "=== B ==="; cat logs/resources/B.txt
```
Expected log evidence (verbatim in both logs): a `threads=2` line, plus `sample=`, `mem_mb=`, and `input=` lines with per-sample values.

And B's `mem_mb` must be greater than A's.
Fix pattern¶
- Defaults belong in profile; rule-level resources are exceptions.
- Any “special” rule must emit an evidence log of resolved resources.
Proof hook¶
Provide:
- `logs/resources/A.txt` and `logs/resources/B.txt`,
- and one sentence: “B mem_mb > A mem_mb (evidence above).”
Core 5 — Workflow as a product: distribution, pinning, upgrade paths, team practice¶
Learning objectives¶
You will be able to:
- Make a breaking change that is mechanically explicit (v1 → v2).
- Prove a non-breaking refactor does not perturb consumers.
- Encode team review rules that prevent silent breakage.
Definition¶
A workflow is a versioned product:
- stable file API,
- pinned dependencies,
- explicit upgrade paths.
Semantics (breaking change demo, isolated and concrete)¶
Step 1: baseline (v1)¶
Confirm:
```shell
snakemake --profile profiles/local --cores 2
snakemake --profile profiles/local --summary | grep -E "results/v1/provider|results/v1/consumer" | head
```
Step 2: introduce a breaking change (format semantics)¶
Change provider to output lowercase instead of uppercase (breaking semantics) but keep the path the same (this is the mistake).
Edit modules/provider/Snakefile:
```python
rule provider_make:
    input: "data/{sample}.txt"
    output: "results/v1/provider/{sample}.upper.txt"
    shell: "cat {input} > {output}"  # now wrong semantics for v1
```
Run:
Expected evidence: `results/v1/consumer/all.upper.txt` still exists at the same path, but its contents are now lowercase.
This proves: semantic breaking change silently shipped under v1 path.
Step 3: correct governance (bump to v2)¶
Fix by bumping `results_prefix` in config/config.yaml from `results/v1` to `results/v2`:
Update workflow/contracts/FILE_API.md to “File API (v2)” and describe the new semantics.
Run:
Expected evidence (verbatim path change):

- outputs now exist under `results/v2/...`
- v1 outputs remain intact unless explicitly cleaned
Fix pattern (team checklist)¶
A PR is not reviewable without:
- `snakemake --lint`
- `snakemake -n`
- `.proof/summary.txt`
- `.proof/list-changes.*.txt`
- a FILE_API.md diff if anything about outputs changed
Proof hook¶
Provide:
- the `head` output showing the silent semantic break under v1,
- the config + contract bump to v2,
- and a directory listing proving v2 outputs exist.
Debugging playbook: scaling boundary failures¶
| Symptom | Command | Evidence | Likely cause | First fix |
|---|---|---|---|---|
| Targets shrink/expand | `snakemake -n` | planned file set differs | include leakage | module boundary + config-derived lists |
| Config typos accepted | `snakemake -n` | no validation error | missing strict schema | `validate(config, schema)` |
| CI flaky | `diff` | NONDETERMINISTIC | entropy in outputs | move entropy to logs; seed via config |
| Unsure what changed | `--list-changes code` | impacted outputs listed | drift | attach drift artifact; rerun targeted |
| Resource claims untrusted | `cat logs/resources/*.txt` | threads/mem logged | unproved resources | evidence logs per rule |
| Module import surprises | `--list-rules` | unexpected rules present | wildcard import | import only required rules |
Snakefile (module assembly, final)¶
```python
from snakemake.utils import validate

configfile: "config/config.yaml"
validate(config, "config/schema.yaml")

include: "workflow/rules/consumer.smk"
include: "workflow/rules/entropy.smk"
include: "workflow/rules/resources.smk"

module provider:
    snakefile: "modules/provider/Snakefile"

use rule provider_make from provider as provider_*

rule all:
    input:
        f"{config['results_prefix']}/consumer/all.upper.txt",
        f"{config['results_prefix']}/entropy.txt",
        f"{config['results_prefix']}/resources/A.done.txt",
        f"{config['results_prefix']}/resources/B.done.txt"
```
Snakefile.reference (single-file sanity, final)¶
```python
from snakemake.utils import validate

configfile: "config/config.yaml"
validate(config, "config/schema.yaml")

SAMPLES = config["samples"]
P = config["results_prefix"]

rule all:
    input:
        f"{P}/consumer/all.upper.txt",
        f"{P}/entropy.txt",
        f"{P}/resources/A.done.txt",
        f"{P}/resources/B.done.txt"

rule provider_make:
    input: "data/{sample}.txt"
    output: f"{P}/provider/{{sample}}.upper.txt"
    shell: "tr '[:lower:]' '[:upper:]' < {input} > {output}"

rule consumer_all:
    input:
        expand(f"{P}/provider/{{sample}}.upper.txt", sample=SAMPLES)
    output:
        f"{P}/consumer/all.upper.txt"
    shell:
        "cat {input} > {output}"

rule entropy_bad:
    output: f"{P}/entropy.txt"
    shell:
        # Redirect on the heredoc command line, so the timestamp reaches the file.
        "python - > {output} << 'PY'\n"
        "import time\n"
        "print(time.time())\n"
        "PY\n"

def mem_mb_from_input(wc, input):
    return max(200, int(2 * input.size_mb) + 200)

rule resource_probe:
    input: "data/{sample}.txt"
    output: f"{P}/resources/{{sample}}.done.txt"
    log: "logs/resources/{sample}.txt"
    threads: 2
    resources:
        mem_mb=mem_mb_from_input
    shell:
        r"""
        printf "sample={wildcards.sample}\nthreads={threads}\nmem_mb={resources.mem_mb}\ninput={input}\n" > {log}
        echo OK > {output}
        """
```
Run sanity: `snakemake -s Snakefile.reference --profile profiles/local --cores 2`
Closing recap¶
Scaling Snakemake is boundary engineering:
- Modules enforce explicit dependencies; includes invite silent coupling.
- Schemas + file API docs turn correctness into something you can prove.
- CI gates kill entropy early.
- Resources must produce evidence artifacts; otherwise they are folklore.
- Breaking changes must be path/version changes, not “refactors”.
Directory glossary¶
Use the Glossary when you want this module's recurring language kept stable as you move between lessons, exercises, and capstone checkpoints.