Module 03: Production Operations and Policy Boundaries¶
Module Position¶
flowchart TD
family["Reproducible Research"] --> program["Deep Dive Snakemake"]
program --> module["Module 03: Production Operations and Policy Boundaries"]
module --> lessons["Lesson pages and worked examples"]
module --> checkpoints["Exercises and closing criteria"]
module --> capstone["Related capstone evidence"]
flowchart TD
purpose["Start with the module purpose and main questions"] --> lesson_map["Use the lesson map to choose reading order"]
lesson_map --> study["Read the lessons and examples with one review question in mind"]
study --> proof["Test the idea with exercises and capstone checkpoints"]
proof --> close["Move on only when the closing criteria feel concrete"]
Read the first diagram as a placement map: this page sits between the course promise, the lesson pages listed below, and the capstone surfaces that pressure-test the module. Read the second diagram as the study route for this page: both diagrams point you toward the Lesson map, Exercises, and Closing criteria rather than acting as decoration.
Module 03 is where workflow semantics meet operational policy. Profiles, retries, executor choices, staging, and governance only help if they preserve the workflow's meaning instead of quietly mutating it.
Capstone exists here as corroboration. The module should already make the policy-versus-semantics split clear before you compare it with the reference profiles and confirmation routes.
Production Snakemake: HPC/Cloud Execution, Error Handling, Data Locality, Governance¶
Version & scope contract
Target: Snakemake 9.14.x (this module relies on profile files like
config.yaml, the plugin-catalog executor/storage plugins, and the current unit-test generator behavior). Verify:
snakemake --version
snakemake --help | sed -n '1,40p'
- In scope: profiles as policy, executor/storage plugins, retries + incomplete semantics, staging/data locality, CI testing, governance/drift.
- Out of scope: authoring fundamentals (Module 01), checkpoints/wildcard expansion theory (Module 02).
Why this module matters¶
Production failures often get misdiagnosed as “Snakemake problems” when the real issue is a missing boundary:
- workflow semantics and executor policy are mixed together
- retries exist without a failure contract
- staging and shared filesystem assumptions are implicit
- CI checks prove too little to be trusted
This module teaches how to encode operations as explicit policy and proof instead of tribal command history.
Reading path¶
- Start with the policy-plus-proofs framing.
- Read profiles and executors before retries and incomplete semantics.
- Read staging and data locality before CI and governance.
- Treat the production lab as the concrete thread that ties the module together.
Capstone connection¶
The capstone’s profiles, confirm target, artifact verification, and workflow gates are direct embodiments of this module. If you want to know why the capstone is opinionated about proof artifacts and clean-room runs, this module is the reason.
At a Glance¶
| Focus | Learner question | Capstone timing |
|---|---|---|
| profiles as policy | "Which settings should change execution context without changing meaning?" | inspect the capstone after the policy-versus-semantics split feels clear |
| retries and failure policy | "What should be retried, and what should fail fast?" | compare profiles and proof targets together |
| production proof | "What makes a workflow trustworthy under CI or scheduler pressure?" | use confirm and related targets as evidence surfaces |
Orientation: production is “policy + plugins + proofs”¶
Production Snakemake means you stop relying on “tribal CLI invocations” and you make
execution reproducible by encoding policy in a profile, capabilities in plugins,
and correctness via proof artifacts (logs, change reports, tests). In this course,
the capstone exposes policy through profile-local config.yaml files, and each CLI flag
can be represented as a YAML key. (Snakemake)
Unified cost model¶
Total pain ≈ scheduler friction + FS latency + staging mistakes + poison artifacts + provenance loss
| What hurts | What you see | Dominant cause | First fix |
|---|---|---|---|
| Scheduler friction | too many tiny jobs | DAG granularity + submit overhead | group/merge jobs; cap submit rates |
| FS latency | “output missing” after job finished | shared FS lag | raise --latency-wait (Snakemake) |
| Staging mistakes | outputs “disappear” / land in wrong place | wrong prefixes / shared-fs-usage lies | make shared-fs-usage explicit + stage to scratch (Snakemake) |
| Poison artifacts | partial outputs break downstream | non-atomic writes + failure | atomic publish + strict incomplete policy (Snakemake) |
| Provenance loss | change reports empty | --drop-metadata | never drop metadata in prod (Snakemake) |
Minimal production lab (runnable baseline)¶
This module uses a tiny workflow that exercises: profiles, executor plugin wiring, retries, incomplete outputs, staging knobs, unit-test generation, and drift reporting.
Golden layout (pre-run)¶
graph TD
lab["lab/"]
lab --> snakefile["Snakefile"]
lab --> profiles["profiles/"]
lab --> scripts["scripts/"]
lab --> results["results/"]
profiles --> local["local/"]
profiles --> slurm["slurm/"]
local --> localConfig["config.yaml"]
slurm --> slurmConfig["config.yaml"]
scripts --> flaky["flaky_once.py"]
scripts --> poison["poison.py"]
scripts --> atomic["atomic_writer.py"]
Golden “commissioning” command sequence¶
snakemake --profile profiles/local -n
snakemake --profile profiles/local --cores 2
snakemake --profile profiles/local --retries 1 results/flaky_once.txt
snakemake --profile profiles/local --generate-unit-tests
pytest .tests/unit/
snakemake --profile profiles/local --list-changes code
--generate-unit-tests and .tests/unit + pytest invocation are official behavior. (Snakemake)
--list-changes is the official drift report for changed input|code|params. (Snakemake)
Core 1 — Execution backends via profiles (cluster-first by construction)¶
Learning objectives¶
You will be able to:
- Encode execution policy in a version-controlled profile and prove it’s applied.
- Switch local ↔ SLURM without editing workflow code.
- Predict and fix “profile not applied” failures using evidence.
Definition¶
A profile is a directory containing a config.yaml that records execution policy.
Each CLI flag --foo-bar becomes YAML key foo-bar:; profiles can also include
auxiliary files. (Snakemake)
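As a concrete instance of the flag-to-key mapping, a run like `snakemake --cores 4 --latency-wait 30 --printshellcmds` could be captured in a profile `config.yaml` (illustrative values, not a file from this module):

```yaml
# Each CLI flag --foo-bar becomes the YAML key foo-bar:
cores: 4
latency-wait: 30
printshellcmds: true
```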
Semantics¶
- Profiles are policy. Workflow code describes the DAG; profile describes how/where it runs. (Snakemake)
- The SLURM executor is a plugin; it can be set via profile with
executor: slurm. (Snakemake)
flowchart LR
A[Snakefile] --> B[Compile DAG]
C[Profile] --> D[Executor choice]
D --> E[local jobs]
D --> F[slurm jobs]
B --> D
Failure signatures¶
- Runs locally despite “cluster intent” → wrong profile path or wrong filename (config.yaml missing).
- Unknown executor → SLURM plugin not installed on the submission host.
- Logs missing → SLURM plugin defaults delete successful logs unless configured. (Snakemake)
Minimal repro (complete)¶
1) Two profiles¶
profiles/local/config.yaml
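The local profile's contents are not reproduced here; a minimal sketch, assuming only conservative local policy (all values illustrative, the capstone's actual file may differ):

```yaml
# profiles/local/config.yaml — illustrative local policy (assumed values)
cores: 2
printshellcmds: true
latency-wait: 5
```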
profiles/slurm/config.yaml
executor: slurm
jobs: 50
printshellcmds: true
latency-wait: 30
slurm-logdir: logs/slurm
slurm-keep-successful-logs: true
- latency-wait waits for outputs after job completion, to tolerate FS latency. (Snakemake)
- The SLURM plugin settings --slurm-logdir and --slurm-keep-successful-logs are documented; successful logs are deleted by default unless retention is enabled. (Snakemake)
2) Prove the profile is applied¶
snakemake --profile profiles/local -n --print-compilation > .proof/local.compile.txt
snakemake --profile profiles/slurm -n --print-compilation > .proof/slurm.compile.txt
--print-compilation is an official CLI flag for printing the workflow’s Python representation. (Snakemake)
Expected evidence (stable invariants):
- Both outputs contain Building DAG of jobs...
- The SLURM compilation output shows the executor configured as slurm (search within the file for slurm).
Fix pattern¶
- Put everything operational into the profile: executor, job caps, log retention, latency wait.
- Treat ad-hoc CLI flags as incident response only; if a setting matters, it belongs in a version-controlled profile file (config.yaml). (Snakemake)
Proof hook¶
Attach:
- .proof/slurm.compile.txt containing “Building DAG” and at least one occurrence of slurm.
- The exact profile file content you used.
Core 2 — Robustness: atomicity, retries, incomplete semantics¶
Learning objectives¶
You will be able to:
- Create failure modes that produce poison outputs, then eliminate them.
- Use --retries, --keep-incomplete, and --rerun-incomplete correctly.
- Explain why --drop-metadata destroys governance tools, and refuse it in production.
Definition¶
Robustness is enforcing a strict output contract:
- outputs are either complete and correct, or absent / marked incomplete and rerunnable.
Key CLI:
- --retries restarts failing jobs. (Snakemake)
- --keep-incomplete keeps failed-job partial outputs. (Snakemake)
- --rerun-incomplete reruns jobs whose outputs are recognized as incomplete. (Snakemake)
Semantics¶
- --retries N restarts a job up to N times; the attempt counter exists to scale resources across retries. (Snakemake)
- --keep-incomplete is for forensics; it keeps poison outputs around (dangerous unless paired with strict reruns). (Snakemake)
- --drop-metadata makes provenance-based tools like --list-changes empty or incomplete—this is explicitly documented. (Snakemake)
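The documented use of the attempt counter is scaling resources across retries. A rule fragment might look like this (a sketch with hypothetical rule and tool names; the `lambda wildcards, attempt: ...` resource callable is the documented pattern):

```python
# Snakefile fragment — "align" and "run_alignment" are illustrative names
rule align:
    output:
        "results/aligned.bam"
    resources:
        # attempt is 1 on the first try, 2 after the first --retries restart, ...
        mem_mb=lambda wildcards, attempt: 2000 * attempt
    shell:
        "run_alignment --mem {resources.mem_mb} > {output}"
```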
flowchart TD
A[Job starts] --> B[Writes temp output]
B -->|success| C[Atomic rename -> final output]
B -->|fail| D[Temp stays / marked incomplete]
D -->|--rerun-incomplete| A
Failure signatures¶
- “Downstream consumed garbage” → non-atomic writer produced plausible partial output.
- “Works after rerun” → transient failure; you lacked retries.
- “Drift reports show nothing” → metadata was dropped. (Snakemake)
Minimal repro (complete)¶
Repro A — flaky once + retries¶
scripts/flaky_once.py
import os, sys
from pathlib import Path

# NOTE: Snakemake does not export SNAKEMAKE_ATTEMPT itself; the invoking rule
# must pass the attempt counter through the environment for this to vary.
attempt = int(os.environ.get("SNAKEMAKE_ATTEMPT", "1"))
out = Path(sys.argv[1])
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(f"attempt={attempt}\n")
if attempt == 1:
    print("Failing on attempt 1 (intentional).", file=sys.stderr)
    sys.exit(42)
print("Succeeded on attempt >=2.", file=sys.stderr)
Run:
snakemake --profile profiles/local results/flaky_once.txt || true
snakemake --profile profiles/local --retries 1 results/flaky_once.txt
cat results/flaky_once.txt
Expected output (content of results/flaky_once.txt after the retried run): attempt=2
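The retry contract can be reasoned about outside Snakemake. This is a plain-Python sketch of the semantics (not Snakemake's implementation): the first attempt plus up to N restarts, with a 1-based attempt counter handed to the job, mirroring how attempt-aware resource callables see it.

```python
def run_with_retries(job, retries: int):
    """Mimic --retries N: the first attempt plus up to N restarts."""
    for attempt in range(1, retries + 2):
        try:
            return job(attempt)
        except RuntimeError:
            if attempt == retries + 1:
                raise  # retries exhausted: fail for real

def flaky(attempt: int) -> str:
    # Mirrors scripts/flaky_once.py: fail once, then succeed.
    if attempt == 1:
        raise RuntimeError("transient failure (attempt 1)")
    return f"attempt={attempt}"

print(run_with_retries(flaky, retries=1))  # attempt=2
```

With retries=0 the same transient failure becomes a hard failure, which is exactly the fail-fast half of the failure contract.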
Repro B — poison output + incomplete discipline¶
scripts/poison.py
import sys
from pathlib import Path
out = Path(sys.argv[1])
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text("PARTIAL\n")
print("Wrote PARTIAL then crashing.", file=sys.stderr)
sys.exit(13)
Run:
snakemake --profile profiles/local results/poison.txt || true
test -e results/poison.txt && echo "UNSAFE: poison remained" || echo "OK: removed"
snakemake --profile profiles/local --keep-incomplete results/poison.txt || true
printf "poison file content:\n"; cat results/poison.txt
Expected output (fragments):
- After the default failure: OK: removed — Snakemake deletes the partial output of the failed job.
- With --keep-incomplete: cat results/poison.txt prints PARTIAL.
--keep-incomplete behavior is explicitly defined. (Snakemake)
Repro C — atomic writer (the fix)¶
scripts/atomic_writer.py
import sys
from pathlib import Path
final = Path(sys.argv[1])
tmp = final.with_suffix(final.suffix + ".tmp")
final.parent.mkdir(parents=True, exist_ok=True)
tmp.write_text("COMPLETE\n")
tmp.replace(final) # atomic rename on same filesystem
Rule uses the atomic writer:
- If the job fails before replace(), the final output never appears.
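The publish-or-nothing property is easy to exercise in isolation. A minimal sketch of the same pattern, with a simulated crash (standard library only):

```python
import tempfile
from pathlib import Path

def atomic_write(final: Path, data: str, crash: bool = False) -> None:
    # Stage into a temp file beside the target, then publish via rename.
    tmp = final.with_suffix(final.suffix + ".tmp")
    tmp.write_text(data)
    if crash:
        raise RuntimeError("simulated crash before publish")
    tmp.replace(final)  # atomic on POSIX when tmp and final share a filesystem

workdir = Path(tempfile.mkdtemp())
target = workdir / "out.txt"
try:
    atomic_write(target, "COMPLETE\n", crash=True)
except RuntimeError:
    pass
print(target.exists())             # False: no poison artifact was published
atomic_write(target, "COMPLETE\n")
print(target.read_text().strip())  # COMPLETE
```

The design choice is that the final path only ever holds a complete file; a crash leaves at most a `.tmp` sibling that downstream rules never match.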
Fix pattern¶
- Never write final outputs “in place” unless the write is atomic by construction.
- Use --keep-incomplete only during triage; otherwise you risk poisoning future DAG runs.
- Hard rule: do not use --drop-metadata in production because it invalidates --list-changes and provenance reports. (Snakemake)
Proof hook¶
Submit:
- cat results/flaky_once.txt showing attempt=2.
- Evidence that the poison file contains PARTIAL only when run with --keep-incomplete. (Snakemake)
Core 3 — Data locality and staging: storage plugins + explicit prefixes¶
Learning objectives¶
You will be able to:
- Configure staging to local scratch via --default-storage-provider, --local-storage-prefix, --remote-job-local-storage-prefix, and --shared-fs-usage.
- Demonstrate staging with filesystem evidence, and demonstrate a staging failure that proves misconfiguration.
- Encode staging in the profile instead of relying on per-run CLI.
Definition¶
Snakemake can map inputs/outputs to storage providers implemented as plugins. (Snakemake)
The fs storage plugin uses rsync to read/write from a locally mounted filesystem and is specifically motivated by avoiding harmful parallel IO patterns on NFS. (Snakemake)
Semantics¶
The fs plugin documentation gives a canonical staging configuration:
- --default-storage-provider fs
- --local-storage-prefix /local/work/$USER
- --shared-fs-usage persistence software-deployment sources source-cache
…and shows how to set remote-job-local-storage-prefix for job-specific scratch. (Snakemake)
It also explicitly notes you still need a non-remote local storage prefix because some jobs may execute without remote submission. (Snakemake)
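Once the CLI pattern works, the same policy belongs in the profile. A sketch, assuming the documented multi-value flags map to YAML lists (paths illustrative):

```yaml
# Staging policy in the profile — illustrative paths
default-storage-provider: fs
local-storage-prefix: /local/work/$USER
remote-job-local-storage-prefix: /tmp/$USER
shared-fs-usage:
  - persistence
  - software-deployment
  - sources
  - source-cache
```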
flowchart LR
N["Shared FS (NFS/Lustre)"] -->|stage in| S[Scratch prefix]
S --> J[Job executes]
J -->|stage out| N
Failure signatures¶
- Scratch directory stays empty → storage plugin not active (missing plugin install or flags/profile).
- rsync / permission error → scratch prefix not writable (most common real incident).
- Outputs appear locally but not on shared FS → shared-fs-usage / prefix mismatch.
Minimal repro (complete)¶
Repro A — staging success with explicit scratch evidence¶
Install plugin (once, on the submission host):
Installation is documented in the plugin catalog. (Snakemake)
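The plugin is a separate package; assuming the catalog's `snakemake-storage-plugin-<name>` naming convention, the install is:

```shell
# Install into the same environment that runs snakemake on the submission host
pip install snakemake-storage-plugin-fs
```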
Run with a visible scratch prefix:
rm -rf .scratch .snakemake/storage results/staged_demo.txt
snakemake --profile profiles/local -F results/staged_demo.txt \
--default-storage-provider fs \
--shared-fs-usage persistence software-deployment sources source-cache \
--local-storage-prefix .scratch/$USER
This exact pattern is recommended by the fs plugin docs (with a scratch path). (Snakemake)
Inspect:
Expected output (example, verbatim shape):
(Exact paths vary, but the invariant is: non-empty file list under .scratch/$USER.)
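To make the invariant concrete, here is a self-contained simulation of a healthy scratch tree (the directory and file names below are fabricated for illustration; in a real run they are produced by the staging step, not by hand):

```shell
# Simulate a staged tree, then run the same inspection used as proof evidence
mkdir -p ".scratch/$USER/results"
printf 'staged\n' > ".scratch/$USER/results/staged_demo.txt"
find ".scratch/$USER" -type f | head -n 10
```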
Repro B — staging failure (misconfigured scratch prefix)¶
Force a non-writable scratch prefix:
snakemake --profile profiles/local -F results/staged_demo.txt \
--default-storage-provider fs \
--shared-fs-usage persistence software-deployment sources source-cache \
--local-storage-prefix /root/forbidden_scratch
Expected failure (verbatim fragment):
- A permission error writing into /root/forbidden_scratch (either from Snakemake or rsync).
Fix pattern¶
- Treat staging configuration as policy: move it into the profile once it works.
- Encode both:
  - local-storage-prefix (for local jobs)
  - remote-job-local-storage-prefix (for cluster jobs)
  because Snakemake may execute some jobs without remote submission. (Snakemake)
Proof hook¶
Provide:
- Output of find .scratch/$USER -type f | head -n 10
- Your profile snippet (or CLI) showing default-storage-provider: fs and local-storage-prefix: ... (Snakemake)
Core 4 — Testing and CI/CD: generate unit tests, then gate¶
Learning objectives¶
You will be able to:
- Generate unit tests with --generate-unit-tests.
- Run pytest and interpret failures as workflow regressions (not “pytest problems”).
- Keep unit tests small and deterministic.
Definition¶
Snakemake can generate unit tests from a successful run by copying representative job inputs into .tests/unit and producing pytest tests. (Snakemake)
Semantics¶
- Generate: snakemake --generate-unit-tests (Snakemake)
- Run: pytest .tests/unit/ (Snakemake)
- Each test file is .tests/unit/test_<rulename>.py and compares outputs to the “known-good” results; the default comparison is byte-by-byte via cmp/zcmp/bzcmp/xzcmp. (Snakemake)
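A stand-in for that comparison step (the generated tests shell out to cmp and friends; Python's filecmp is used here only to illustrate the byte-for-byte contract):

```python
import filecmp, tempfile
from pathlib import Path

d = Path(tempfile.mkdtemp())
(d / "expected.txt").write_text("result=42\n")   # golden fixture
(d / "observed.txt").write_text("result=42\n")   # fresh rule output
(d / "drifted.txt").write_text("result=43\n")    # regressed output

# shallow=False forces content comparison, not just stat() metadata
print(filecmp.cmp(d / "expected.txt", d / "observed.txt", shallow=False))  # True
print(filecmp.cmp(d / "expected.txt", d / "drifted.txt", shallow=False))   # False
```

Byte-for-byte comparison is strict by design: any nondeterminism (timestamps, seeds) shows up as a test failure, which is why the module insists on small deterministic fixtures.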
flowchart TD
A[Successful run] --> B[--generate-unit-tests]
B --> C[.tests/unit + fixtures]
C --> D[pytest gate]
D -->|fail| E[workflow regression]
D -->|pass| F[ship]
Failure signatures¶
- “skipped job” warning during generation → representative job inputs not present. (Snakemake)
- pytest fails after legitimate change → you changed a contract; update golden outputs intentionally (and bump version).
- pytest flaky → workflow nondeterminism (random seeds, timestamps, unstable discovery).
Minimal repro (complete)¶
Run once:
snakemake --profile profiles/local --cores 2
Generate tests:
snakemake --profile profiles/local --generate-unit-tests
Inspect one generated test file (this is the verbatim evidence):
sed -n '1,80p' .tests/unit/test_<rulename>.py
Run pytest:
pytest .tests/unit/
Expected pytest tail (verbatim shape):
Fix pattern¶
- Generate tests only from a small dummy dataset; the docs explicitly warn against generating tests from big data. (Snakemake)
- CI gates (minimum):
  - snakemake --lint (Snakemake)
  - pytest .tests/unit/ (Snakemake)
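One way to wire both gates into a pipeline (an illustrative GitHub Actions fragment; the runner image, action versions, and install step are assumptions, not part of this module):

```yaml
# .github/workflows/gates.yml — illustrative
name: workflow-gates
on: [push, pull_request]
jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install snakemake pytest
      - run: snakemake --lint
      - run: pytest .tests/unit/
```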
Proof hook¶
Provide:
- First 30–80 lines of one generated .tests/unit/test_<rulename>.py file (via sed).
- The pytest summary lines showing collection and pass/fail.
Core 5 — Maintainability and governance: drift reports, contracts, versioning¶
Learning objectives¶
You will be able to:
- Detect drift with --list-changes and explain what changed.
- Prove that dropping metadata breaks governance tools (and refuse it).
- Adopt a review checklist that prevents interface breakage.
Definition¶
Governance means: stable interfaces + explicit change control + auditable provenance.
Snakemake provides drift tools:
- --list-changes {input,code,params} lists output files whose specified items changed since creation. (Snakemake)
- --drop-metadata makes provenance-based reports (including --list-changes) empty or incomplete. (Snakemake)
Semantics¶
- --list-changes code is your “what did we invalidate?” query after editing scripts/rules. (Snakemake)
- If metadata is dropped, governance fails by definition. This is not a “maybe”; it is stated explicitly. (Snakemake)
flowchart LR
A[Run] --> B[Metadata tracked]
B --> C[--list-changes]
D[--drop-metadata] --> E[Reports empty/incomplete]
E --> F[Governance failure]
Failure signatures¶
- “Why did this rerun?” cannot be answered → metadata missing.
- “We changed code but nothing is flagged” → --drop-metadata was used, or outputs were recreated without tracking.
- Downstream consumers break → contracts were implicit, not versioned.
Minimal repro (complete)¶
- Converge: snakemake --profile profiles/local --cores 2
- Edit a script (e.g., append a harmless comment to scripts/atomic_writer.py).
- Ask Snakemake to enumerate invalidated outputs: snakemake --profile profiles/local --list-changes code
Expected behavior: at least one output is listed as impacted by code drift (exact formatting varies). (Snakemake)
- Demonstrate governance failure explicitly:
snakemake --profile profiles/local --drop-metadata --cores 2
snakemake --profile profiles/local --list-changes code
Expected behavior: the second --list-changes becomes empty or incomplete specifically because metadata was dropped (this is the documented effect). (Snakemake)
Fix pattern¶
Adopt three hard artifacts:
- workflow/CONTRACT.md: file naming + formats + schema expectations.
- workflow/VERSION: semantic version (bump on contract changes).
- workflow/REVIEW.md: checklist requiring:
  - snakemake --lint (Snakemake)
  - snakemake -n --summary --reason
  - snakemake --list-changes code|params|input evidence (Snakemake)
  - “No --drop-metadata” attestation (Snakemake)
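A skeleton for the checklist file (illustrative wording; adapt the items to your repo):

```markdown
# workflow/REVIEW.md — merge checklist (illustrative)
- [ ] `snakemake --lint` clean
- [ ] `snakemake -n --summary --reason` reviewed
- [ ] `snakemake --list-changes code|params|input` evidence attached
- [ ] Attestation: no `--drop-metadata` anywhere in profiles or CI
- [ ] `workflow/VERSION` bumped if `workflow/CONTRACT.md` changed
```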
Proof hook¶
Provide:
- The exact output of snakemake --list-changes code before and after --drop-metadata.
- Your workflow/VERSION and a short note: “contract changed? yes/no”.
Appendix — Consolidated reference Snakefile (single-file, end-to-end)¶
Snakefile
rule all:
    input:
        "results/staged_demo.txt",
        "results/flaky_once.txt",
        "results/atomic_ok.txt",

rule staged_demo:
    output:
        "results/staged_demo.txt"
    shell:
        "printf 'staged_demo=ok\\n' > {output}"

rule flaky_once:
    output:
        "results/flaky_once.txt"
    resources:
        # Illustrative wiring: Snakemake does not export SNAKEMAKE_ATTEMPT on
        # its own, so carry the attempt counter via an attempt-aware resource.
        attempt_nr=lambda wildcards, attempt: attempt
    shell:
        "SNAKEMAKE_ATTEMPT={resources.attempt_nr} python scripts/flaky_once.py {output}"

rule poison:
    output:
        "results/poison.txt"
    shell:
        "python scripts/poison.py {output}"

rule atomic_ok:
    output:
        "results/atomic_ok.txt"
    shell:
        "python scripts/atomic_writer.py {output}"
Closing recap¶
If you want production-grade Snakemake, stop optimizing rules first. Instead:
- Profiles are policy surfaces; they must fully encode how the DAG is executed through reviewable config.yaml files. (Snakemake)
- Robustness is atomic outputs + strict incomplete semantics + retries; poison artifacts are a correctness bug, not an inconvenience. (Snakemake)
- Data locality is explicit: staging to scratch must be configured and proven with filesystem evidence; the fs plugin gives canonical patterns. (Snakemake)
- CI is real only when it runs workflow-derived tests (--generate-unit-tests + pytest) and gates merges. (Snakemake)
- Governance requires metadata and drift reports; --drop-metadata is operational malpractice in production because it breaks those tools by design. (Snakemake)
Directory glossary¶
Use the glossary when you want this module's recurring terminology kept stable as you move between lessons, exercises, and capstone checkpoints.