Module 02: Dynamic DAGs, Discovery, and Integrity¶
Module 01 taught you how Snakemake plans work from explicit file contracts. Module 02 asks the harder question:
what happens when the set of jobs is not fully known until the workflow starts looking at data?
This module is about making that situation reviewable instead of magical.
Dynamic behavior is not the enemy. Hidden discovery is.
What this module is for¶
By the end of Module 02, you should be able to explain five things in plain language:
- how a changing sample set becomes an explicit target list instead of ambient filesystem luck
- how wildcard domains stay narrow enough to prevent accidental fanout
- what a checkpoint is allowed to discover and what it must never hide
- which artifacts make dynamic behavior durable enough for review and downstream trust
- when environment design and job granularity help the workflow, and when they quietly make it worse
Study route¶
flowchart TD
start["Overview"] --> core1["Deterministic Target Lists and Sample Discovery"]
core1 --> core2["Wildcard Domains and Fanout Control"]
core2 --> core3["Checkpoints and Reviewed DAG Changes"]
core3 --> core4["Provenance, Manifests, and Publish Boundaries"]
core4 --> core5["Software Stacks and Scheduler Cost"]
core5 --> example["Worked Example: Making Checkpoint Discovery Reviewable"]
example --> practice["Exercises"]
practice --> answers["Exercise Answers"]
answers --> glossary["Glossary"]
Read the module in that order the first time through. When you return later, jump to the page that matches the failure or design pressure in front of you.
The ten files in this module¶
- Overview (index.md)
- Deterministic Target Lists and Sample Discovery
- Wildcard Domains and Fanout Control
- Checkpoints and Reviewed DAG Changes
- Provenance, Manifests, and Publish Boundaries
- Software Stacks and Scheduler Cost
- Worked Example: Making Checkpoint Discovery Reviewable
- Exercises
- Exercise Answers
- Glossary
How to use the file set¶
| If you need to... | Start here |
|---|---|
| stop discovery from drifting with directory noise or unordered scans | Deterministic Target Lists and Sample Discovery |
| prevent wildcards and expand() from creating nonsense work | Wildcard Domains and Fanout Control |
| decide whether a checkpoint is the right tool or a design smell | Checkpoints and Reviewed DAG Changes |
| make dynamic behavior visible to reviewers and downstream consumers | Provenance, Manifests, and Publish Boundaries |
| keep environments and job granularity from turning correctness into slowness | Software Stacks and Scheduler Cost |
| see the module as one repaired workflow rather than five isolated rules | Worked Example: Making Checkpoint Discovery Reviewable |
| test your own understanding | Exercises |
| compare your reasoning against a reference answer | Exercise Answers |
| stabilize the module vocabulary | Glossary |
The running question¶
Carry this question through every page:
if discovery changes the DAG, what exact artifact lets another person review that change later?
Good Module 02 answers usually mention one or more of these:
- a concrete discovered-set file
- a validated target list
- a wildcard boundary that limits what can be claimed
- a checkpoint output that is durable enough to reread
- a publish or provenance artifact that preserves the run story
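To make "a wildcard boundary that limits what can be claimed" concrete, here is a minimal Python sketch rather than a Snakefile. It relies on one fact about Snakemake, that an unconstrained wildcard matches the regex `.+`, and on two assumptions made only for illustration: the `results/{sample}.txt` layout and the sample-ID alphabet.

```python
import re

# Snakemake's default wildcard regex is ".+", which also matches slashes and dots.
UNCONSTRAINED = re.compile(r"results/(?P<sample>.+)\.txt")

# Hypothetical constraint: sample IDs are letters, digits, and underscores only.
CONSTRAINED = re.compile(r"results/(?P<sample>[A-Za-z0-9_]+)\.txt")


def claimed_sample(pattern, path):
    """Return the sample value the pattern would claim for path, or None."""
    match = pattern.fullmatch(path)
    return match.group("sample") if match else None


if __name__ == "__main__":
    for path in ["results/S1.txt", "results/tmp/S1.backup.txt"]:
        print(path)
        print("  unconstrained:", claimed_sample(UNCONSTRAINED, path))
        print("  constrained:  ", claimed_sample(CONSTRAINED, path))
```

The unconstrained pattern claims `tmp/S1.backup` as a sample for the second path, while the constrained pattern refuses it. In a Snakefile the same boundary would normally be declared with `wildcard_constraints`; the regex shown here is an assumption, not the module's actual constraint.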
The running example¶
This module keeps returning to one simple workflow shape:
- raw sequencing files appear in data/raw/
- the workflow discovers which samples exist
- per-sample jobs fan out from that discovery
- the discovered set is recorded before downstream work trusts it
- a publish surface carries that discovery forward for review
That shape is small enough to teach, but realistic enough to reveal the real design mistakes people make with Snakemake.
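The discovery and recording steps of that shape can be sketched in plain Python. This is an illustration under stated assumptions, not the module's implementation: the `<sample>.fastq.gz` naming convention, the JSON manifest layout, the `results/{sample}.summary.txt` target pattern, and all function names are hypothetical.

```python
import json
import re
from pathlib import Path

# Assumed naming convention (not taken from the module): <sample>.fastq.gz
SAMPLE_FILE = re.compile(r"(?P<sample>[A-Za-z0-9_]+)\.fastq\.gz")


def discover_samples(raw_dir):
    """Return sample IDs found in raw_dir, sorted so scan order never matters."""
    samples = set()
    for entry in Path(raw_dir).iterdir():
        match = SAMPLE_FILE.fullmatch(entry.name)
        # Ignore directories and any file outside the naming convention.
        if entry.is_file() and match:
            samples.add(match.group("sample"))
    return sorted(samples)


def record_discovery(samples, manifest_path):
    """Write the discovered set to a file that can be reread and reviewed later."""
    manifest_path = Path(manifest_path)
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    manifest_path.write_text(json.dumps({"samples": samples}, indent=2) + "\n")


def targets_from_manifest(manifest_path):
    """Build per-sample targets from the recorded set, not from a fresh scan."""
    samples = json.loads(Path(manifest_path).read_text())["samples"]
    return [f"results/{sample}.summary.txt" for sample in samples]
```

The design choice worth noticing is that `targets_from_manifest` never touches `data/raw/`. Downstream fanout depends only on the recorded file, so a reviewer can read exactly the set the workflow trusted.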
Commands to keep close¶
These commands form the evidence loop for Module 02:
snakemake -n
snakemake --summary
snakemake --dag | dot -Tpdf > dag.pdf
snakemake --lint
snakemake --list-changes params input code
They answer different questions, in the same order as the commands above:
- what would run
- which files Snakemake believes it owns
- how the current jobs relate
- which design smells are already visible
- which tracked changes justify reruns
Learning outcomes¶
By the end of this module, you should be able to:
- materialize a deterministic target list from changing data
- constrain wildcard domains so the DAG matches the real problem
- use checkpoints to reveal dynamic structure instead of hiding it
- explain how discovery, provenance, and publish artifacts fit together
- recognize when too many environments or too many tiny jobs are degrading the workflow
Exit standard¶
Do not move on until all of these are true:
- you can explain one discovery route without saying "Snakemake just figures it out"
- you can show where the discovered set is recorded and why that location matters
- you can describe one checkpoint that is justified and one that is really a smell
- you can say which artifacts are internal execution state and which are safe to publish
- you can explain one performance repair without weakening workflow truth
When those become ordinary, Module 02 has done its job.