
Module 02: Dynamic DAGs, Discovery, and Integrity

Module Position

flowchart TD
  family["Reproducible Research"] --> program["Deep Dive Snakemake"]
  program --> module["Module 02: Dynamic DAGs, Discovery, and Integrity"]
  module --> lessons["Lesson pages and worked examples"]
  module --> checkpoints["Exercises and closing criteria"]
  module --> capstone["Related capstone evidence"]

flowchart TD
  purpose["Start with the module purpose and main questions"] --> lesson_map["Use the lesson map to choose reading order"]
  lesson_map --> study["Read the lessons and examples with one review question in mind"]
  study --> proof["Test the idea with exercises and capstone checkpoints"]
  proof --> close["Move on only when the closing criteria feel concrete"]

Read the first diagram as a placement map: this page sits between the course promise, the lesson pages listed below, and the capstone surfaces that pressure-test the module. Read the second diagram as the study route for this page: it sends you to the lesson map, the exercises, and the closing criteria, rather than serving as decoration.

Module 02 turns dynamic behavior into an explicit contract. Checkpoints, discovery artifacts, and provenance surfaces are useful only when the DAG stays reviewable and the discovered set stays stable enough to trust.

Capstone exists here as corroboration. The module should already make deterministic discovery understandable before you inspect the capstone checkpoint and publish flow.

Version & scope contract

  • Target: Snakemake 9.14.x semantics (mid-December 2025 docs). Verify your runtime:

snakemake --version
snakemake -h | grep -i -A 3 'deployment'   # software deployment + conda + apptainer flags
  • Scope: advanced DAG construction, dynamic DAGs (checkpoints), integrity/provenance, env/container discipline, and performance patterns without assuming a cluster. Cluster-first execution and executor plugins are Module 03.
  • Hard constraint: deterministic targets, deterministic discovery, atomic outputs, reproducible software stacks. If you violate any of these, Snakemake will still run — you will just stop trusting your results.


Why this module matters

Dynamic behavior is where many workflows become impressive demos and unreliable systems. Checkpoints, metadata-driven expansion, and environment management can all be correct, but they can also hide moving targets, unstable discovery, and irreproducible plans.

This module is about turning "the DAG depends on data" from a hand-wave into a disciplined contract.

Reading path

  1. Start with the predictive model and wildcard discipline.
  2. Read checkpoints only after the target-list story feels clear.
  3. Read integrity and environment sections before performance patterns.
  4. Treat the appendices as proof aids, not as optional filler.

Capstone connection

The capstone’s discovery checkpoint, provenance artifacts, and versioned publish flow all depend on this module’s rules. If you want to know why discovery is recorded explicitly and why the workflow is opinionated about reproducibility evidence, this module provides that justification.

At a Glance

Focus | Learner question | Capstone timing
deterministic discovery | "How can the DAG depend on data without becoming a moving target?" | use the capstone heavily once the explicit discovered-set idea is clear
checkpoints | "What is a checkpoint allowed to discover, and what must it never hide?" | inspect the capstone after you can explain the checkpoint contract in words
integrity evidence | "Which artifacts keep dynamic behavior reviewable?" | compare discovered-set and provenance files deliberately

Table of Contents

  • 0. Orientation
  • Core 1 — Wildcard Mastery: Metadata-Driven Expansion Without Explosions
  • Core 2 — Checkpoints: Dynamic DAGs Done Safely (and When They’re a Smell)
  • Core 3 — Data Integrity and Provenance as First-Class Outputs
  • Core 4 — Environments and Containers: Reproducibility Without Slowness
  • Core 5 — Performance Patterns: DAG Shape, Scheduler Load, and I/O
  • Appendix A — Minimal Lab Setup
  • Appendix B — Debugging Playbook: What You See → What It Means → First Fix
  • Appendix C — Exercises
  • Appendix D — Reference Workflow (Complete, Runnable Baseline)

0. Orientation

0.1 The predictive model for “advanced Snakemake pain”

If Module 01 taught you “the DAG is a function of files”, Module 02 teaches you what breaks when the DAG is not predictable.

A practical cost model:

Pain term | What you feel | Root cause | First fix
DAG explosion | thousands of unintended jobs | expand() cartesian product, uncontrolled wildcards | constrain + validate + build explicit target lists
Dynamic nondeterminism | reruns that “shouldn’t happen” | checkpoint outputs differ across runs | make discovery deterministic + record discovered set
Poison artifacts | “Nothing to be done” but results are wrong | stale outputs that still satisfy patterns | strict contracts + provenance + --summary/--list-changes
Env churn | workflow is “slow before it starts” | too many unique environments, repeated solves | reuse envs + pin + pre-create
Scheduler overhead | cluster/FS melts on small jobs | too-fine task granularity | batch/group/scatter-gather intentionally

0.2 A single mental picture for Module 02

flowchart TD
  A[config + metadata] --> B[deterministic target list]
  B --> C[DAG construction]
  C -->|static| D[rules]
  C -->|data-dependent| E[checkpoint]
  E --> F[discovered set recorded]
  F --> C
  D --> G[atomic outputs + provenance]
  G --> H[summary / report / drift checks]

Invariant: If a run’s “discovered set” is not recorded as an explicit artifact, you do not have a reproducible dynamic DAG.


Core 1 — Wildcard Mastery: Metadata-Driven Expansion Without Explosions

Learning objectives

You will be able to:

  • predict when expand() produces a cartesian product (and prevent it),
  • build a validated, explicit target list from a sample sheet,
  • use wildcard constraints to prevent ambiguous matching,
  • prove that your DAG size equals your metadata size (no hidden multiplication).

1.1 Definition

Metadata-driven expansion means: you compute the exact list of targets from structured metadata (sample sheet), validate it, and only then hand it to Snakemake (typically via rule all / rule targets).

This is the opposite of “let wildcards float freely and hope”.

1.2 Semantics: why expand() bites

By default, expand() uses a cartesian product of wildcard value lists. The docs explicitly note you can replace that combinator (e.g., with zip) when you intend element-wise pairing instead of all-against-all. (snakemake.readthedocs.io)

Minimal repro: accidental cartesian product

Snakefile

SAMPLES = ["s1", "s2"]
READS   = ["R1", "R2"]

rule all:
    input:
        expand("work/{sample}.{read}.fq", sample=SAMPLES, read=READS)

Expected: you get 4 targets:

work/s1.R1.fq
work/s1.R2.fq
work/s2.R1.fq
work/s2.R2.fq

That was correct here — but the same mechanism silently creates nonsense when lists are meant to be paired (e.g., sample ↔ library, tumor ↔ normal).

Fix pattern: pair with zip

SAMPLES = ["s1", "s2"]
LIBS    = ["libA", "libB"]  # paired with SAMPLES

rule all:
    input:
        expand("work/{sample}.{lib}.ok", zip, sample=SAMPLES, lib=LIBS)

Expected: only these two targets:

work/s1.libA.ok
work/s2.libB.ok

1.3 The professional pattern: “targets are data”

You want a single function that:

  1. reads metadata,
  2. validates it,
  3. returns explicit targets.

Minimal, runnable sample sheet pattern

config/samples.tsv

sample  read1   read2
s1  data/reads/s1_R1.txt    data/reads/s1_R2.txt
s2  data/reads/s2_R1.txt    data/reads/s2_R2.txt

Snakefile snippet

import csv
from pathlib import Path

SAMPLES_TSV = Path("config/samples.tsv")

def load_samples(tsv: Path):
    if not tsv.exists():
        raise ValueError(f"Missing sample sheet: {tsv}")
    rows = []
    with tsv.open() as fh:
        rdr = csv.DictReader(fh, delimiter="\t")
        required = {"sample", "read1", "read2"}
        if set(rdr.fieldnames or []) != required:
            raise ValueError(f"Expected columns {required}, got {rdr.fieldnames}")
        for r in rdr:
            rows.append(r)

    samples = [r["sample"] for r in rows]
    if len(samples) != len(set(samples)):
        raise ValueError("Duplicate sample IDs in samples.tsv")

    # Optional: enforce safe wildcard domain (prevents regex surprises later)
    for s in samples:
        if not s.replace("_", "").isalnum():
            raise ValueError(f"Unsafe sample id (use [A-Za-z0-9_]+): {s}")

    return rows

ROWS = load_samples(SAMPLES_TSV)
SAMPLE_IDS = [r["sample"] for r in ROWS]

def targets():
    return [f"results/qc/{s}.ok" for s in SAMPLE_IDS]

rule all:
    input:
        targets()

1.4 Failure signatures

  • Symptom: “Why do I have N×M jobs?”

  • Evidence: snakemake -n prints job counts far above sample count.

  • Symptom: wildcard matches files you didn’t intend

  • Evidence: AmbiguousRuleException or a rule fires for wrong filenames.
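
For the wrong-filename symptom, a workflow-level wildcard_constraints block is the cheapest guard. A minimal sketch (the regex is an assumption; match it to your actual ID scheme):

# Global constraint: {sample} may never contain dots, slashes, or other
# separator characters, so patterns stop matching unintended filenames.
wildcard_constraints:
    sample=r"[A-Za-z0-9_]+"

With this in place, a pattern like results/qc/{sample}.ok can no longer match results/qc/s1.extra.ok.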

1.5 Proof hook

Run:

snakemake -n

Expected invariant: job counts scale linearly with sample rows (not multiplicatively).
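
A hedged way to check that invariant without counting by eye (awk skips the sample-sheet header row):

snakemake -n
awk 'NR > 1' config/samples.tsv | wc -l   # sample rows; compare against the per-sample job count in the dry-run table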


Core 2 — Checkpoints: Dynamic DAGs Done Safely (and When They’re a Smell)

Learning objectives

You will be able to:

  • explain the two-phase model: “build DAG → run checkpoint → re-evaluate DAG”,
  • implement a checkpoint that discovers an unknown set deterministically,
  • demonstrate a “moving target” anti-pattern and repair it,
  • prove that the discovered set is stable across repeated runs.

2.1 Definition

A checkpoint is a rule that allows Snakemake to re-evaluate part of the DAG after some data exists. This is for cases where the downstream targets cannot be known at parse time. (snakemake.readthedocs.io)

2.2 Semantics: the two-phase execution model

  • Phase 1: Snakemake builds a partial DAG that includes the checkpoint output.
  • Phase 2: Once the checkpoint finishes, input functions that access checkpoints.<name>.get(...) are re-evaluated, and the downstream DAG becomes concrete. (snakemake.readthedocs.io)

Critical contract: the checkpoint output should be declared with directory(...) when it represents “a set of files whose names are only known after execution.” (snakemake.readthedocs.io)

2.3 Minimal repro: deterministic discovery (correct pattern)

We will “discover” chunk IDs from a file, then process each chunk.

Create this scratch input inside the example repository:

data/items.txt

A
B
C

Snakefile

from pathlib import Path
import json

checkpoint discover_chunks:
    input:
        "data/items.txt"
    output:
        directory("work/discovered")
    run:
        outdir = Path(output[0])
        outdir.mkdir(parents=True, exist_ok=True)

        # Deterministic discovery: sorted unique IDs from the file
        ids = sorted({line.strip() for line in Path(input[0]).read_text().splitlines() if line.strip()})
        (outdir / "chunks.json").write_text(json.dumps({"chunks": ids}, indent=2) + "\n")

rule process_chunk:
    input:
        "work/discovered/chunks.json"
    output:
        "work/chunks/{chunk}.done"
    wildcard_constraints:
        chunk=r"[A-Za-z0-9_]+"
    run:
        import json
        from pathlib import Path
        ids = json.loads(Path(input[0]).read_text())["chunks"]
        if wildcards.chunk not in ids:
            raise ValueError(f"Unknown chunk {wildcards.chunk}; discovered={ids}")
        Path(output[0]).parent.mkdir(parents=True, exist_ok=True)
        Path(output[0]).write_text(f"{wildcards.chunk}\n")

def chunk_targets(wildcards):
    # This is the canonical checkpoint access pattern.
    ck = checkpoints.discover_chunks.get()
    chunks_json = Path(ck.output[0]) / "chunks.json"
    import json
    ids = json.loads(chunks_json.read_text())["chunks"]
    return expand("work/chunks/{chunk}.done", chunk=ids)

rule gather:
    input:
        chunk_targets
    output:
        "results/chunks.manifest"
    run:
        from pathlib import Path
        Path(output[0]).parent.mkdir(parents=True, exist_ok=True)
        Path(output[0]).write_text("".join(Path(f).read_text() for f in input))

Run

snakemake -j 1 results/chunks.manifest

Expected filesystem

work/discovered/chunks.json
work/chunks/A.done
work/chunks/B.done
work/chunks/C.done
results/chunks.manifest

Expected results/chunks.manifest

A
B
C

2.4 Minimal repro: “moving target” checkpoint (anti-pattern)

Broken checkpoint: emits random chunk IDs each run.

import random
import string
from pathlib import Path
import json

checkpoint discover_chunks:
    output:
        directory("work/discovered")
    run:
        outdir = Path(output[0])
        outdir.mkdir(parents=True, exist_ok=True)

        # NONDETERMINISTIC: changes across runs even with identical inputs.
        ids = ["".join(random.choice(string.ascii_uppercase) for _ in range(4)) for _ in range(3)]
        (outdir / "chunks.json").write_text(json.dumps({"chunks": ids}, indent=2) + "\n")

Failure signatures

  • Symptom: repeated runs create different downstream targets.
  • Evidence: git diff work/discovered/chunks.json changes each run; outputs accumulate; provenance becomes meaningless.

Fix pattern

  • Discovery must be a deterministic function of checkpoint inputs.
  • The discovered set must be recorded (e.g., chunks.json) and treated as a contract.

2.5 Proof hook

Run twice:

snakemake -j 1 results/chunks.manifest
snakemake -j 1 results/chunks.manifest

Expected invariant: second run is “Nothing to be done” and work/discovered/chunks.json is unchanged.
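
A hedged way to capture that invariant as evidence rather than eyeballing it (sha256sum on Linux; shasum -a 256 on macOS):

snakemake -j 1 results/chunks.manifest
sha256sum work/discovered/chunks.json > chunks.sha
snakemake -j 1 results/chunks.manifest
sha256sum -c chunks.sha   # prints OK only if the discovered set is byte-identical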


Core 3 — Data Integrity and Provenance as First-Class Outputs

Learning objectives

You will be able to:

  • treat provenance artifacts as outputs (not “nice-to-have logs”),
  • use --summary / --detailed-summary to detect stale/poison artifacts,
  • use --list-changes and rule version: to force evidence-based reruns,
  • generate an HTML report as a reproducible audit artifact.

3.1 Definition

Integrity means: outputs correspond to specific inputs + code + parameters + software.

Snakemake supports this via metadata tracking and CLI introspection (--summary, --detailed-summary, change listing). (snakemake.readthedocs.io)

3.2 The evidence tools (with expected output structure)

--summary (what exists, what will run, why)

Docs state the summary columns include: filename, modification time, rule version, status, plan. (snakemake.readthedocs.io)

Run:

snakemake --summary

Expected header structure (columns):

filename  modification time  rule version  status  plan

--detailed-summary (adds input + shell command)

Docs state it adds: input file(s), shell command columns. (snakemake.readthedocs.io)

Run:

snakemake --detailed-summary

Expected header structure (columns):

filename  modification time  rule version  input file(s)  shell command  status  plan

--list-changes (drift detection)

The modern interface is --list-changes {input,code,params} (migration docs call out the redesign). (snakemake.readthedocs.io)

Run:

snakemake --list-changes code
snakemake --list-changes params
snakemake --list-changes input

Expected output: a list of output files that are considered stale under that drift type.
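
A common follow-up pattern is to feed that listing back into --forcerun; a sketch (keep -n first so you review the plan before rerunning):

snakemake -n -R $(snakemake --list-changes code)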

3.3 Minimal repro: rule versioning + code drift

rule build:
    input: "data/items.txt"
    output: "results/build.txt"
    version: "1"
    shell: "cat {input} > {output}"

  1. Run once.
  2. Change version: "1" to version: "2".
  3. Run snakemake --summary.

Expected evidence: the status / plan reflect that results/build.txt is outdated due to version/implementation change. (snakemake.readthedocs.io)

3.4 Report as an audit artifact

--report generates a self-contained HTML report (or a zip archive for larger reports). (snakemake.readthedocs.io)

Run:

snakemake -j 1 --report results/report.zip

Expected: a results/report.zip archive containing a self-contained HTML report you can open offline.

3.5 Proof hook

Your workflow is “auditable” only if you can answer, with artifacts:

  • What ran? (logs, benchmark, report)
  • With what code/version? (version:, metadata, repo state)
  • With what inputs/params? (snapshotted config + sample sheet)
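
A minimal sketch of treating one of those answers as an output (rule name and paths are illustrative, not the capstone layout): snapshot the sample sheet next to the results it produced.

# Hedged sketch: copy the inputs-of-record into results/ so "with what
# inputs?" has an artifact answer that ships with the results.
rule snapshot_inputs_of_record:
    input:
        "config/samples.tsv"
    output:
        "results/provenance/samples.tsv"
    shell:
        "cp {input} {output}"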

Core 4 — Environments and Containers: Reproducibility Without Slowness

Learning objectives

You will be able to:

  • run per-rule conda envs correctly (and understand which flags are required),
  • eliminate env churn via reuse + pin files + pre-creation,
  • reason about containers vs conda as a reproducibility/performance tradeoff,
  • prove that your software stack is stable across machines.

4.1 The flag reality (don’t guess)

From the CLI docs: the deployment method is selected with --software-deployment-method (alias --sdm), whose values include conda and apptainer; conda behavior is tuned separately (e.g., --conda-prefix, --conda-create-envs-only).

Operational implication: you don’t “turn on conda” with one flag. You choose a deployment method (--sdm conda) and enable the conda: directive on each rule.

4.2 Minimal repro: one env reused across many rules

workflow/envs/python.yaml

channels:
  - conda-forge
dependencies:
  - python=3.11

Snakefile

rule step1:
    input: "data/items.txt"
    output: "work/step1.txt"
    conda: "workflow/envs/python.yaml"
    shell: "python -c \"open('{output}', 'w').write(open('{input}').read())\""

rule step2:
    input: "work/step1.txt"
    output: "results/final.txt"
    conda: "workflow/envs/python.yaml"
    shell: "python -c \"open('{output}', 'w').write(open('{input}').read().lower())\""

Pre-create envs

snakemake --sdm conda --conda-create-envs-only

Then run normally:

snakemake --sdm conda -j 1 results/final.txt

Expected evidence: the second invocation does not re-solve environments (it reuses cached envs under the conda prefix). (Exact timing varies by machine.)
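
Rough corroboration (timing varies by machine):

time snakemake --sdm conda --conda-create-envs-only   # first run: solves + creates
time snakemake --sdm conda --conda-create-envs-only   # second run: should find cached envs and return quickly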

4.3 Pin files: freezing conda to exact builds

Snakemake supports <platform>.pin.txt alongside env YAML to freeze environments to explicit specs. (snakemake.readthedocs.io)

Example:

workflow/envs/python.yaml
workflow/envs/python.linux-64.pin.txt

Interpretation: this is “container-like reproducibility” without building an image.
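
The pin files need not be written by hand; the Snakemake docs point to snakedeploy for generating them (assumes snakedeploy is installed):

snakedeploy pin-conda-envs workflow/envs/python.yaml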

4.4 Containers (Apptainer/Singularity) realities

  • --sdm apptainer (formerly --use-apptainer, aka --use-singularity) enables container directives. (snakemake.readthedocs.io)
  • If the apptainer/singularity binary is missing, Snakemake fails fast (a common HPC module issue). (GitHub)

Rule of thumb: use containers when you need maximal reproducibility across heterogeneous nodes; use conda when you need fast iteration and minimal overhead — but pin aggressively either way.
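
What the container side of that tradeoff looks like on a rule; a minimal sketch (image and tag are illustrative; pin an exact tag or digest in practice), run with --sdm apptainer:

rule step1_containerized:
    input: "data/items.txt"
    output: "work/step1.txt"
    # Illustrative image; any registry URL the container runtime can pull works.
    container: "docker://python:3.11-slim"
    shell: "python -c \"open('{output}', 'w').write(open('{input}').read())\""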

4.5 Proof hook

You have “reproducible software deployment” if:

  • a cold run can be made deterministic (pin files or pinned container tags),
  • a warm run does not re-create environments,
  • --report contains provenance that matches the deployed software method. (snakemake.readthedocs.io)

Core 5 — Performance Patterns: DAG Shape, Scheduler Load, and I/O

Learning objectives

You will be able to:

  • recognize “too many tiny jobs” as a scheduler problem (not a compute problem),
  • apply scatter/gather and batching intentionally,
  • understand job grouping and where it actually matters,
  • reduce filesystem pressure by changing DAG shape (not by “more threads”).

5.1 The dominant performance killer: overhead

In real pipelines, you often pay more for:

  • process launch + conda activation,
  • filesystem metadata ops,
  • scheduler submission latency,

than for the compute itself.

5.2 Minimal repro: tiny-job pathology

SAMPLES = [f"s{i}" for i in range(200)]

# rule all must come first: a rule with wildcards cannot be the default target.
rule all:
    input: expand("work/tiny/{s}.txt", s=SAMPLES)

rule tiny:
    output: "work/tiny/{s}.txt"
    wildcard_constraints: s=r"s[0-9]+"
    shell: "echo {wildcards.s} > {output}"

Expected symptom: snakemake -n plans 200 tiny jobs plus the all job (201 total).

Fix pattern A: batch inside a rule (manual batching)

Write one rule that processes a batch list (e.g., 20 samples per job). This cuts the job count by ~20×, at the cost of coarser parallelism; a minimal sketch follows.
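
A minimal sketch of this pattern (names and batch size are illustrative):

SAMPLES = [f"s{i}" for i in range(200)]
BATCH_SIZE = 20
# Deterministic batch assignment: slice the ordered sample list into fixed chunks.
BATCHES = {f"batch{k}": SAMPLES[k * BATCH_SIZE:(k + 1) * BATCH_SIZE]
           for k in range(len(SAMPLES) // BATCH_SIZE)}

rule all:
    input: expand("work/batches/{b}.txt", b=BATCHES)

rule batch:
    output: "work/batches/{b}.txt"
    wildcard_constraints: b=r"batch[0-9]+"
    params: members=lambda wc: BATCHES[wc.b]
    shell: "printf '%s\\n' {params.members} > {output}"

snakemake -n should now plan 10 batch jobs plus all, instead of 201 jobs.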

Fix pattern B: job grouping (cluster/cloud payoff)

Snakemake supports grouping jobs so they are submitted together as “group jobs” in cluster/cloud execution. Docs: grouping partitions the job graph into groups; ignored locally. (snakemake.readthedocs.io)

Important truth: you cannot “see the benefit” of grouping in local mode because it is intentionally ignored. The proof requires a non-local executor (Module 03).
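
The directive itself is one line on the tiny rule from 5.2; a hedged sketch (the group name is illustrative, and again this only changes submission behavior on non-local executors):

rule tiny:
    output: "work/tiny/{s}.txt"
    wildcard_constraints: s=r"s[0-9]+"
    group: "tiny_batch"   # jobs sharing a group are submitted together on cluster/cloud
    shell: "echo {wildcards.s} > {output}"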

5.3 Scatter/gather done right

Scatter:

  • split a large input into deterministic shards (often via checkpoint if shard count is data-dependent),
  • process shards independently,
  • gather into final outputs.

This is the safe use-case for checkpoints: you trade a single large job for a stable, reproducible shard set.
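
A minimal fixed-shard sketch (the shard count is static here, so no checkpoint is needed; count and paths are illustrative):

N_SHARDS = 4

rule scatter:
    input: "data/items.txt"
    output: expand("work/shards/{i}.txt", i=range(N_SHARDS))
    run:
        from pathlib import Path
        # Deterministic sharding: sort first, then assign round-robin.
        items = sorted(l.strip() for l in open(input[0]) if l.strip())
        for i, out in enumerate(output):
            Path(out).parent.mkdir(parents=True, exist_ok=True)
            Path(out).write_text("\n".join(items[i::N_SHARDS]) + "\n")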

5.4 Proof hook

Your performance changes are real only if you can show:

  • fewer jobs in the planned DAG (snakemake -n job counts),
  • fewer filesystem outputs (or fewer tiny intermediates),
  • under cluster mode: fewer submissions (group jobs), with unchanged final results.

Appendix A — Minimal Lab Setup

Create this structure (exact):

graph TD
  lab["lab/"]
  lab --> snakefile["Snakefile"]
  lab --> config["config/"]
  lab --> data["data/"]
  lab --> workflow["workflow/"]
  config --> samples["samples.tsv"]
  data --> items["items.txt"]
  data --> reads["reads/"]
  reads --> s1r1["s1_R1.txt"]
  reads --> s1r2["s1_R2.txt"]
  reads --> s2r1["s2_R1.txt"]
  reads --> s2r2["s2_R2.txt"]
  workflow --> envs["envs/"]
  envs --> py["python.yaml"]

Populate:

  • data/items.txt as the scratch discovery input from Core 2
  • config/samples.tsv as in Core 1
  • workflow/envs/python.yaml as in Core 4

Appendix B — Debugging Playbook: What You See → What It Means → First Fix

What you see | Run this | Expected evidence | Likely cause | First fix
DAG is huge | snakemake -n | job counts ≫ sample rows | cartesian expand(), free wildcards | explicit target list + zip + validation
“Nothing to do” but you distrust outputs | snakemake --summary | status/plan show “up-to-date” | poison artifact still matches contract | tighten contracts + version: + --list-changes
Output should rerun after code change | snakemake --list-changes code | file listed (or not) | rule body not tracked / metadata dropped | stop using --drop-metadata; rerun with -R $(...)
Checkpoint downstream missing | snakemake -n (rerun reasons print by default) | checkpoint dependency shown | wrong .get() usage or nondeterministic discovery | use canonical checkpoints.x.get(...).output + record discovered set
Conda slow every time | snakemake --sdm conda --list-conda-envs | many envs | env fragmentation | reuse env files; pin; precreate

CLI evidence tools (--summary, --detailed-summary, --list-changes, --report) are documented in Snakemake’s CLI docs. (snakemake.readthedocs.io)


Appendix C — Exercises

Each exercise requires:

  1. the command(s) you ran,
  2. the evidence artifact(s) produced (file contents or CLI output),
  3. a 5–10 line explanation: symptom → violated contract → fix.

Exercise 1 — Prove you avoided a cartesian explosion

  • Modify samples.tsv to include 10 samples.
  • Build targets from metadata.
  • Proof: snakemake -n shows job count linear in sample count.

Exercise 2 — Break a checkpoint on purpose, then repair it

  • Implement the “moving target” checkpoint.
  • Show that discovered set changes across runs.
  • Repair to deterministic discovery.
  • Proof: chunks.json identical across two runs.

Exercise 3 — Demonstrate drift detection

  • Add version: "1" to a rule producing a result.
  • Run once.
  • Change to version: "2".
  • Proof: snakemake --summary indicates the result is scheduled due to version/implementation drift (columns as documented). (snakemake.readthedocs.io)

Exercise 4 — Eliminate env churn

  • Add conda: to two rules with the same env file.
  • Run --conda-create-envs-only, then run the workflow.
  • Proof: second run does not recreate envs; --list-conda-envs shows a single env (or a small stable set).

Exercise 5 — Performance reasoning (no cluster required)

  • Create a “200 tiny jobs” repro (Core 5).
  • Replace with manual batching (20 per job).
  • Proof: snakemake -n job counts drop by ~20× (200 tiny jobs collapse into 10 batch jobs).

Appendix D — Reference Workflow (Complete, Runnable Baseline)

If you want one copy-paste file that exercises Module 02 patterns (metadata targets + checkpoint discovery + provenance hooks), use this single Snakefile:

# Snakefile — Module 02 baseline

import csv
import json
from pathlib import Path

# -----------------------
# Metadata → targets (Core 1)
# -----------------------
SAMPLES_TSV = Path("config/samples.tsv")

def load_samples(tsv: Path):
    rows = []
    with tsv.open() as fh:
        rdr = csv.DictReader(fh, delimiter="\t")
        required = {"sample", "read1", "read2"}
        if set(rdr.fieldnames or []) != required:
            raise ValueError(f"Expected columns {required}, got {rdr.fieldnames}")
        for r in rdr:
            rows.append(r)

    ids = [r["sample"] for r in rows]
    if len(ids) != len(set(ids)):
        raise ValueError("Duplicate sample IDs in samples.tsv")
    for s in ids:
        if not s.replace("_", "").isalnum():
            raise ValueError(f"Unsafe sample id (use [A-Za-z0-9_]+): {s}")
    return rows

ROWS = load_samples(SAMPLES_TSV)
SAMPLE_IDS = [r["sample"] for r in ROWS]

rule all:
    input:
        "results/chunks.manifest",
        expand("results/qc/{sample}.ok", sample=SAMPLE_IDS)

rule qc:
    input:
        r1=lambda wc: next(r["read1"] for r in ROWS if r["sample"] == wc.sample),
        r2=lambda wc: next(r["read2"] for r in ROWS if r["sample"] == wc.sample),
    output:
        "results/qc/{sample}.ok"
    wildcard_constraints:
        sample=r"[A-Za-z0-9_]+"
    version: "1"
    run:
        Path(output[0]).parent.mkdir(parents=True, exist_ok=True)
        # Minimal “QC”: prove both reads exist and write a stable marker.
        for f in input:
            if not Path(f).exists():
                raise ValueError(f"Missing input: {f}")
        Path(output[0]).write_text(f"{wildcards.sample}\tOK\n")

# -----------------------
# Deterministic checkpoint discovery (Core 2)
# -----------------------
checkpoint discover_chunks:
    input:
        "data/items.txt"
    output:
        directory("work/discovered")
    run:
        outdir = Path(output[0])
        outdir.mkdir(parents=True, exist_ok=True)
        ids = sorted({line.strip() for line in Path(input[0]).read_text().splitlines() if line.strip()})
        (outdir / "chunks.json").write_text(json.dumps({"chunks": ids}, indent=2) + "\n")

def chunk_targets(_):
    ck = checkpoints.discover_chunks.get()
    chunks_json = Path(ck.output[0]) / "chunks.json"
    ids = json.loads(chunks_json.read_text())["chunks"]
    return expand("work/chunks/{chunk}.done", chunk=ids)

rule process_chunk:
    input:
        "work/discovered/chunks.json"
    output:
        "work/chunks/{chunk}.done"
    wildcard_constraints:
        chunk=r"[A-Za-z0-9_]+"
    version: "1"
    run:
        ids = json.loads(Path(input[0]).read_text())["chunks"]
        if wildcards.chunk not in ids:
            raise ValueError(f"Unknown chunk {wildcards.chunk}; discovered={ids}")
        Path(output[0]).parent.mkdir(parents=True, exist_ok=True)
        Path(output[0]).write_text(f"{wildcards.chunk}\n")

rule gather:
    input:
        chunk_targets
    output:
        "results/chunks.manifest"
    version: "1"
    run:
        Path(output[0]).parent.mkdir(parents=True, exist_ok=True)
        Path(output[0]).write_text("".join(Path(f).read_text() for f in input))

Verified CLI / semantics references (for this module)

All CLI and semantics claims above cite the Snakemake documentation (snakemake.readthedocs.io): the CLI reference for --summary, --detailed-summary, --list-changes, --report, and --software-deployment-method, and the snakefiles documentation for expand(), checkpoints, conda:, container:, and group:.

Directory glossary

Use Glossary when you want the recurring language in this module kept stable while you move between lessons, exercises, and capstone checkpoints.