Module 04: Scaling Workflows and Interface Boundaries¶

Module Position¶

flowchart TD
  family["Reproducible Research"] --> program["Deep Dive Snakemake"]
  program --> module["Module 04: Scaling Workflows and Interface Boundaries"]
  module --> lessons["Lesson pages and worked examples"]
  module --> checkpoints["Exercises and closing criteria"]
  module --> capstone["Related capstone evidence"]

flowchart TD
  purpose["Start with the module purpose and main questions"] --> lesson_map["Use the lesson map to choose reading order"]
  lesson_map --> study["Read the lessons and examples with one review question in mind"]
  study --> proof["Test the idea with exercises and capstone checkpoints"]
  proof --> close["Move on only when the closing criteria feel concrete"]

Read the first diagram as a placement map: this page sits between the course promise, the lesson pages listed below, and the capstone surfaces that pressure-test the module. Read the second diagram as the study route for this page, so the diagrams point you toward the Lesson map, Exercises, and Closing criteria instead of acting like decoration.

Module 04 turns one working workflow into a repository another engineer can still review. Interfaces, file APIs, module boundaries, and CI gates matter because they keep growth from turning into hidden coupling.

Capstone exists here as corroboration. The module should already make interface boundaries meaningful before you inspect the repository layout and proof surfaces in the reference workflow.

Scaling Workflows and Interface Boundaries¶

Version & scope contract

Target: Snakemake 9.14.x (CLI reporting, profiles, modules, schema validation).

This module is about scaling correctness, not tuning a specific cluster (Module 03) and not checkpoints/dynamic DAGs (Module 02).

Verify:
snakemake --version
snakemake -h | sed -n '1,40p'

Why this module matters¶

Teams often say a workflow has “outgrown itself” when the real problem is that nobody defined its interfaces. Files, schemas, modules, CI gates, and resource claims become implicit, so every change starts to feel dangerous.

This module is about restoring that safety by making boundaries explicit enough that workflows can scale in size, team count, and execution context without becoming fragile.

Reading path¶

Start with module boundaries and file APIs.
Read CI and drift control after the interface story is clear.
Read resource semantics after you know what is being protected.
Finish with workflow-as-product thinking so scaling is understood as a contract problem.

Capstone connection¶

The capstone’s FILE_API.md, workflow module layout, and confirmation targets are the concrete expression of this module. If you want to see what “executor-proof semantics” means in one small repository, this is the module that explains that shape.

At a Glance¶

Focus	Learner question	Capstone timing
modular boundaries	"Where does one workflow responsibility end and another begin?"	inspect the capstone once interface boundaries already feel meaningful
file APIs	"Which paths are public contracts and which are internal detail?"	compare `FILE_API.md` with rule files intentionally
CI gates	"What should scale review protect before the workflow gets bigger?"	use confirmation targets as the operational side of the interface story

Orientation: scaling fails at boundaries¶

When a workflow “doesn’t scale”, it’s usually not CPU. It’s:

hidden coupling across files,
implicit file contracts,
nondeterministic outputs,
or resources that were never proved to the executor.

Unified scaling model¶

Total scaling failure ≈ hidden coupling + implicit contracts + entropy + unproven resources + upgrade drift

flowchart LR
  P[Provider module] -->|File API: paths + formats| C[Consumer workflow]
  C --> V["Schema validate (fail fast)"]
  C --> G[CI gates: lint + dryrun + tests + diffs]
  C --> D[Drift evidence: list-changes + summary]
  C --> R[Resource evidence: logs show threads/mem]
  V --> S[Scale without fear]
  G --> S
  D --> S
  R --> S

Runnable lab (single repo, two assemblies)¶

You will have:

Module assembly (real boundaries): Snakefile
Single-file sanity (fast baseline): Snakefile.reference

Golden layout¶

graph TD
  lab["lab/"]
  lab --> snakefile["Snakefile"]
  lab --> reference["Snakefile.reference"]
  lab --> config["config/"]
  lab --> data["data/"]
  lab --> modules["modules/"]
  lab --> workflow["workflow/"]
  lab --> profiles["profiles/"]
  lab --> ci["ci/"]
  config --> configYaml["config.yaml"]
  config --> schemaYaml["schema.yaml"]
  config --> schemas["schemas/"]
  schemas --> refSchema["ref.schema.yaml"]
  schemas --> brokenSchema["broken.schema.yaml"]
  data --> a["A.txt"]
  data --> b["B.txt"]
  modules --> provider["provider/"]
  provider --> providerSnakefile["Snakefile"]
  workflow --> contracts["contracts/"]
  workflow --> rules["rules/"]
  contracts --> fileApi["FILE_API.md"]
  rules --> consumer["consumer.smk"]
  rules --> entropy["entropy.smk"]
  rules --> resources["resources.smk"]
  profiles --> local["local/"]
  local --> profileConfig["config.yaml"]
  ci --> gate["gate.sh"]

`profiles/local/config.yaml`¶

executor: local
cores: 2
printshellcmds: true
latency-wait: 5

`config/config.yaml`¶

results_prefix: "results/v1"
samples: ["A", "B"]

`config/schema.yaml`¶

type: object
required: [results_prefix, samples]
properties:
  results_prefix: {type: string, minLength: 1}
  samples:
    type: array
    minItems: 1
    items: {type: string, pattern: "^[A-Za-z0-9._-]+$"}
additionalProperties: false

`workflow/contracts/FILE_API.md`¶

# File API (v1)

Provider outputs:
- path: results/v1/provider/{sample}.upper.txt
- semantics: uppercase of data/{sample}.txt

Consumer outputs:
- path: results/v1/consumer/all.upper.txt
- semantics: concatenation of provider outputs in config.samples order

Breaking changes:
- Any change to output paths, patterns, or semantics => bump results_prefix to results/v2

`data/A.txt`, `data/B.txt`¶

printf "hello a\n" > data/A.txt
printf "hello b\n" > data/B.txt

Commissioning sequence (module assembly)¶

snakemake --profile profiles/local --lint
snakemake --profile profiles/local -n
snakemake --profile profiles/local --list-rules
snakemake --profile profiles/local --rulegraph mermaid-js > .proof/rulegraph.mmd
snakemake --profile profiles/local --cores 2
snakemake --profile profiles/local --summary > .proof/summary.txt
snakemake --profile profiles/local --list-changes code > .proof/list-changes.code.txt || true

Expected invariants you can verify immediately:

results/v1/consumer/all.upper.txt exists and contains:

HELLO A
HELLO B

* .proof/summary.txt contains a header with the columns:

filename  modification time  rule version  status  plan

Core 1 — Modularity that scales: `include` vs `module` and real boundaries¶

Learning objectives¶

You will be able to:

Reproduce a silent correctness bug caused by include: namespace leakage.
Replace it with a module boundary + explicit use rule imports.
Prove the boundary using --list-rules and a stable file API.

Definition¶

include: merges Snakefiles into one namespace (globals can collide).
module loads a workflow into its own namespace; use rule ... from ... imports explicitly.

Semantics¶

flowchart TD
  I[include: shared namespace] --> Leak[globals collide]
  M[module namespace] --> Use[use rule imports only what you depend on]
  Use --> API[depend on file API, not provider globals]

Failure signatures¶

Dry-run target set changes with “no obvious reason”.
Consumer changes provider behavior without touching provider code.

Minimal repro (leak via `include`)¶

`modules/provider/Snakefile`¶

SAMPLES = ["A", "B"]

rule provider_make:
    input: "data/{sample}.txt"
    output: "results/v1/provider/{sample}.upper.txt"
    shell: "tr '[:lower:]' '[:upper:]' < {input} > {output}"

`workflow/rules/consumer.smk` (bug: overwrites provider global)¶

SAMPLES = ["A"]  # accidental narrowing

rule consumer_all:
    input:
        expand("results/v1/provider/{sample}.upper.txt", sample=SAMPLES)
    output:
        "results/v1/consumer/all.upper.txt"
    shell:
        "cat {input} > {output}"

Top-level `Snakefile` (bad assembly)¶

include: "modules/provider/Snakefile"
include: "workflow/rules/consumer.smk"

rule all:
    input: "results/v1/consumer/all.upper.txt"

Run:

snakemake --profile profiles/local -n

Expected planning evidence (verbatim file set):

Planned input includes results/v1/provider/A.upper.txt
Does not include results/v1/provider/B.upper.txt

Fix pattern (module boundary + config-derived list)¶

`workflow/rules/consumer.smk` (fixed: no globals)¶

rule consumer_all:
    input:
        expand(f"{config['results_prefix']}/provider/{{sample}}.upper.txt", sample=config["samples"])
    output:
        f"{config['results_prefix']}/consumer/all.upper.txt"
    shell:
        "cat {input} > {output}"

Top-level `Snakefile` (good assembly)¶

from snakemake.utils import validate

configfile: "config/config.yaml"
validate(config, "config/schema.yaml")

module provider:
    snakefile: "modules/provider/Snakefile"

use rule provider_make from provider as provider_*

include: "workflow/rules/consumer.smk"

rule all:
    input: f"{config['results_prefix']}/consumer/all.upper.txt"

Run:

snakemake --profile profiles/local -n

Expected planning evidence (verbatim file set):

Planned inputs include both:
results/v1/provider/A.upper.txt
results/v1/provider/B.upper.txt

Proof hook¶

Provide:

snakemake --profile profiles/local --list-rules output (before/after).
Two dry-runs showing the target set changed exactly as described.

Core 2 — Interface contracts: naming, schemas, versioned outputs, compatibility¶

Learning objectives¶

You will be able to:

Fail fast on bad config via schema validation.
Classify changes as breaking/non-breaking mechanically.
Force breaking changes to be explicit via results_prefix bump.

Definition¶

A contract is path + format + semantics, written down and enforced.

Semantics¶

The config schema makes typos impossible to ignore.
The file API doc makes upgrades reviewable.

Failure signatures¶

“It ran, but outputs are wrong” (schema drift or semantic drift).
Typos accepted (missing strict schema).

Minimal repro (schema failure must be immediate)¶

Break config/config.yaml:

results_prefix: "results/v1"
samplez: ["A", "B"]

Run:

snakemake --profile profiles/local -n

Expected failure evidence (verbatim fragment):

A validation error mentioning:
missing required property samples
unexpected property samplez

Fix pattern¶

Keep additionalProperties: false.
Put breaking changes behind results/v2/....

Proof hook¶

Provide:

The schema error excerpt.
The fixed config + a successful -n.

Core 3 — Determinism and drift control: CI as the correctness boundary¶

Learning objectives¶

You will be able to:

Demonstrate nondeterminism with a stable diff.
Enforce a CI gate that catches entropy and drift.
Produce drift artifacts (--summary, --list-changes) as PR evidence.

Definition¶

Determinism means:

stable plan for stable inputs,
stable outputs for stable inputs,
stable provenance signals.

Semantics¶

flowchart LR
  A[Run] --> B[Artifacts: outputs + summary + list-changes]
  C[Change code/params] --> D[list-changes flags impacted outputs]
  E[Entropy in outputs] --> F[diff fails => CI fails]

Failure signatures¶

“CI is flaky” (time/RNG/unordered globs in outputs).
Drift report is empty when it shouldn’t be (metadata dropped or bypassed).

Minimal repro (prove entropy)¶

`workflow/rules/entropy.smk`¶

rule entropy_bad:
    output: f"{config['results_prefix']}/entropy.txt"
    shell:
        "python - << 'PY'\n"
        "import time\n"
        "print(time.time())\n"
        "PY\n"
        "> {output}"

Run twice and diff:

snakemake --profile profiles/local -F results/v1/entropy.txt
cp results/v1/entropy.txt /tmp/e1.txt
snakemake --profile profiles/local -F results/v1/entropy.txt
diff -u /tmp/e1.txt results/v1/entropy.txt && echo OK || echo NONDETERMINISTIC

Expected output (verbatim):

NONDETERMINISTIC

Fix pattern¶

Entropy goes to logs, not semantic outputs.
If randomness is required, seed is config and becomes provenance.

CI gate (minimal, enforceable)¶

`ci/gate.sh`¶

#!/usr/bin/env bash
set -euo pipefail

snakemake --profile profiles/local --lint
snakemake --profile profiles/local -n
snakemake --profile profiles/local --cores 2
snakemake --profile profiles/local --summary > .proof/summary.txt
snakemake --profile profiles/local --list-changes code > .proof/list-changes.code.txt || true

Proof hook¶

Provide:

the failing diff (NONDETERMINISTIC) and the fixed diff (OK),
.proof/summary.txt and .proof/list-changes.code.txt.

Core 4 — Resource semantics with evidence: prove what the workflow asked for¶

Learning objectives¶

You will be able to:

Resolve dynamic resources deterministically from explicit evidence (input size).
Prove resolved threads and mem_mb using log artifacts.
Detect oversubscription failures before cluster execution.

Definition¶

Resource correctness is not “I wrote resources:”. It is:

the workflow resolved concrete values,
evidence was produced,
and the executor could have enforced them.

Semantics (portable evidence)¶

flowchart TD
  I[input size] --> R[resolve mem_mb]
  R --> L[write evidence log]
  L --> A[audit in CI / review]

Failure signatures¶

“Resources ignored” (no evidence exists; you’re guessing).
Oversubscription (threads > available cores) causes stalls/rejections.

Minimal repro (resource evidence logs)¶

`workflow/rules/resources.smk`¶

def mem_mb_from_input(wc, input):
    # deterministic; scale with size for demonstration
    return max(200, int(2 * input.size_mb) + 200)

rule resource_probe:
    input: "data/{sample}.txt"
    output: f"{config['results_prefix']}/resources/{{sample}}.done.txt"
    log: "logs/resources/{sample}.txt"
    threads: 2
    resources:
        mem_mb=mem_mb_from_input
    shell:
        r"""
        printf "sample={wildcards.sample}\nthreads={threads}\nmem_mb={resources.mem_mb}\ninput={input}\n" > {log}
        echo OK > {output}
        """

Make B large:

python - << 'PY'
with open("data/B.txt","w") as f:
    f.write("x" * 5_000_000 + "\n")
PY

Run:

snakemake --profile profiles/local results/v1/resources/A.done.txt results/v1/resources/B.done.txt
echo "=== A ==="; cat logs/resources/A.txt
echo "=== B ==="; cat logs/resources/B.txt

Expected log evidence (verbatim lines present in both):

sample=...
threads=2
mem_mb=...
input=data/...

And B’s mem_mb must be > A’s.

Fix pattern¶

Defaults belong in profile; rule-level resources are exceptions.
Any “special” rule must emit an evidence log of resolved resources.

Proof hook¶

Provide:

logs/resources/A.txt and logs/resources/B.txt,
and one sentence: “B mem_mb > A mem_mb (evidence above).”

Core 5 — Workflow as a product: distribution, pinning, upgrade paths, team practice¶

Learning objectives¶

You will be able to:

Make a breaking change that is mechanically explicit (v1 → v2).
Prove a non-breaking refactor does not perturb consumers.
Encode team review rules that prevent silent breakage.

Definition¶

A workflow is a versioned product:

stable file API,
pinned dependencies,
explicit upgrade paths.

Semantics (breaking change demo, isolated and concrete)¶

Step 1: baseline (v1)¶

Confirm:

snakemake --profile profiles/local --cores 2
snakemake --profile profiles/local --summary | grep -E "results/v1/provider|results/v1/consumer" | head

Step 2: introduce a breaking change (format semantics)¶

Change provider to output lowercase instead of uppercase (breaking semantics) but keep the path the same (this is the mistake).

Edit modules/provider/Snakefile:

rule provider_make:
    input: "data/{sample}.txt"
    output: "results/v1/provider/{sample}.upper.txt"
    shell: "cat {input} > {output}"  # now wrong semantics for v1

Run:

snakemake --profile profiles/local -F --cores 2
head -n 2 results/v1/consumer/all.upper.txt

Expected evidence (verbatim):

hello a
hello b

This proves: semantic breaking change silently shipped under v1 path.

Step 3: correct governance (bump to v2)¶

Fix by bumping results_prefix in config/config.yaml:

results_prefix: "results/v2"
samples: ["A", "B"]

Update workflow/contracts/FILE_API.md to “File API (v2)” and describe the new semantics. Run:

snakemake --profile profiles/local --cores 2
ls -1 results/v2/provider | head

Expected evidence (verbatim path change):

outputs now exist under results/v2/...
v1 remains intact unless explicitly cleaned

Fix pattern (team checklist)¶

A PR is not reviewable without:

snakemake --lint
snakemake -n
.proof/summary.txt
.proof/list-changes.*.txt
FILE_API.md diff if anything about outputs changed

Proof hook¶

Provide:

the head output showing silent semantic break under v1,
the config + contract bump to v2,
and a directory listing proving v2 outputs exist.

Debugging playbook: scaling boundary failures¶

Symptom	Command	Evidence	Likely cause	First fix
Targets shrink/expand	`snakemake -n`	planned file set differs	include leakage	module boundary + config-derived lists
Config typos accepted	`snakemake -n`	no validation error	missing strict schema	`validate(config, schema)`
CI flaky	`diff`	NONDETERMINISTIC	entropy in outputs	move entropy to logs; seed via config
Unsure what changed	`--list-changes code`	impacted outputs listed	drift	attach drift artifact; rerun targeted
Resource claims untrusted	`cat logs/resources/*.txt`	threads/mem logged	unproved resources	evidence logs per rule
Module import surprises	`--list-rules`	unexpected rules present	wildcard import	import only required rules

`Snakefile` (module assembly, final)¶

from snakemake.utils import validate

configfile: "config/config.yaml"
validate(config, "config/schema.yaml")

include: "workflow/rules/consumer.smk"
include: "workflow/rules/entropy.smk"
include: "workflow/rules/resources.smk"

module provider:
    snakefile: "modules/provider/Snakefile"

use rule provider_make from provider as provider_*

rule all:
    input:
        f"{config['results_prefix']}/consumer/all.upper.txt",
        f"{config['results_prefix']}/entropy.txt",
        f"{config['results_prefix']}/resources/A.done.txt",
        f"{config['results_prefix']}/resources/B.done.txt"

`Snakefile.reference` (single-file sanity, final)¶

from snakemake.utils import validate

configfile: "config/config.yaml"
validate(config, "config/schema.yaml")

SAMPLES = config["samples"]
P = config["results_prefix"]

rule all:
    input:
        f"{P}/consumer/all.upper.txt",
        f"{P}/entropy.txt",
        f"{P}/resources/A.done.txt",
        f"{P}/resources/B.done.txt"

rule provider_make:
    input: "data/{sample}.txt"
    output: f"{P}/provider/{{sample}}.upper.txt"
    shell: "tr '[:lower:]' '[:upper:]' < {input} > {output}"

rule consumer_all:
    input:
        expand(f"{P}/provider/{{sample}}.upper.txt", sample=SAMPLES)
    output:
        f"{P}/consumer/all.upper.txt"
    shell:
        "cat {input} > {output}"

rule entropy_bad:
    output: f"{P}/entropy.txt"
    shell:
        "python - << 'PY'\n"
        "import time\n"
        "print(time.time())\n"
        "PY\n"
        "> {output}"

def mem_mb_from_input(wc, input):
    return max(200, int(2 * input.size_mb) + 200)

rule resource_probe:
    input: "data/{sample}.txt"
    output: f"{P}/resources/{{sample}}.done.txt"
    log: "logs/resources/{sample}.txt"
    threads: 2
    resources:
        mem_mb=mem_mb_from_input
    shell:
        r"""
        printf "sample={wildcards.sample}\nthreads={threads}\nmem_mb={resources.mem_mb}\ninput={input}\n" > {log}
        echo OK > {output}
        """

Run sanity:

snakemake --profile profiles/local -s Snakefile.reference --cores 2

Closing recap¶

Scaling Snakemake is boundary engineering:

Modules enforce explicit dependencies; includes invite silent coupling.
Schemas + file API docs turn correctness into something you can prove.
CI gates kill entropy early.
Resources must produce evidence artifacts; otherwise they are folklore.
Breaking changes must be path/version changes, not “refactors”.

Directory glossary¶

Use Glossary when you want the recurring language in this module kept stable while you move between lessons, exercises, and capstone checkpoints.

Module 04: Scaling Workflows and Interface Boundaries¶

Module Position¶

Scaling Workflows and Interface Boundaries¶

Why this module matters¶

Reading path¶

Capstone connection¶

At a Glance¶

Orientation: scaling fails at boundaries¶

Unified scaling model¶

Runnable lab (single repo, two assemblies)¶

Golden layout¶

profiles/local/config.yaml¶

config/config.yaml¶

config/schema.yaml¶

workflow/contracts/FILE_API.md¶

data/A.txt, data/B.txt¶

Commissioning sequence (module assembly)¶

Core 1 — Modularity that scales: include vs module and real boundaries¶

Learning objectives¶

Definition¶

Semantics¶

Failure signatures¶

Minimal repro (leak via include)¶

modules/provider/Snakefile¶

workflow/rules/consumer.smk (bug: overwrites provider global)¶

Top-level Snakefile (bad assembly)¶

Fix pattern (module boundary + config-derived list)¶

workflow/rules/consumer.smk (fixed: no globals)¶

Top-level Snakefile (good assembly)¶

Proof hook¶

Core 2 — Interface contracts: naming, schemas, versioned outputs, compatibility¶

Learning objectives¶

Definition¶

Semantics¶

Failure signatures¶

Minimal repro (schema failure must be immediate)¶

Fix pattern¶

Proof hook¶

Core 3 — Determinism and drift control: CI as the correctness boundary¶

Learning objectives¶

Definition¶

Semantics¶

Failure signatures¶

Minimal repro (prove entropy)¶

workflow/rules/entropy.smk¶

Fix pattern¶

CI gate (minimal, enforceable)¶

ci/gate.sh¶

Proof hook¶

Core 4 — Resource semantics with evidence: prove what the workflow asked for¶

Learning objectives¶

Definition¶

Semantics (portable evidence)¶

Failure signatures¶

Minimal repro (resource evidence logs)¶

workflow/rules/resources.smk¶

Fix pattern¶

Proof hook¶

Core 5 — Workflow as a product: distribution, pinning, upgrade paths, team practice¶

Learning objectives¶

Definition¶

Semantics (breaking change demo, isolated and concrete)¶

Step 1: baseline (v1)¶

Step 2: introduce a breaking change (format semantics)¶

Step 3: correct governance (bump to v2)¶

Fix pattern (team checklist)¶

Proof hook¶

Debugging playbook: scaling boundary failures¶

Snakefile (module assembly, final)¶

Snakefile.reference (single-file sanity, final)¶

Closing recap¶

Directory glossary¶

`profiles/local/config.yaml`¶

`config/config.yaml`¶

`config/schema.yaml`¶

`workflow/contracts/FILE_API.md`¶

`data/A.txt`, `data/B.txt`¶

Core 1 — Modularity that scales: `include` vs `module` and real boundaries¶

Minimal repro (leak via `include`)¶

`modules/provider/Snakefile`¶

`workflow/rules/consumer.smk` (bug: overwrites provider global)¶

Top-level `Snakefile` (bad assembly)¶

`workflow/rules/consumer.smk` (fixed: no globals)¶

Top-level `Snakefile` (good assembly)¶

`workflow/rules/entropy.smk`¶

`ci/gate.sh`¶

`workflow/rules/resources.smk`¶

`Snakefile` (module assembly, final)¶

`Snakefile.reference` (single-file sanity, final)¶