
Module 02: Dynamic DAGs, Discovery, and Integrity


Module 01 taught you how Snakemake plans work from explicit file contracts. Module 02 asks the harder question:

what happens when the set of jobs is not fully known until the workflow starts looking at data?

This module is about making that situation reviewable instead of magical.

Dynamic behavior is not the enemy. Hidden discovery is.

What this module is for

By the end of Module 02, you should be able to explain five things in plain language:

  • how a changing sample set becomes an explicit target list instead of ambient filesystem luck
  • how wildcard domains stay narrow enough to prevent accidental fanout
  • what a checkpoint is allowed to discover and what it must never hide
  • which artifacts make dynamic behavior durable enough for review and downstream trust
  • when environment design and job granularity help the workflow, and when they quietly make it worse
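The second point, narrow wildcard domains, can be previewed outside Snakemake. The sketch below uses plain Python regular expressions as a stand-in for wildcard matching. It relies on one documented fact: an unconstrained Snakemake wildcard matches the regex `.+`. The path layout `results/{sample}.bam` and the alphanumeric naming rule are assumptions made for illustration only.

```python
import re

# An unconstrained wildcard behaves like ".+", which also matches "/" and ".".
LOOSE = re.compile(r"results/(?P<sample>.+)\.bam")

# Hypothetical narrowed domain: sample names are alphanumeric only.
NARROW = re.compile(r"results/(?P<sample>[A-Za-z0-9]+)\.bam")


def claimed_sample(pattern, path):
    """Return the sample name the pattern would claim for path, or None."""
    match = pattern.fullmatch(path)
    return match.group("sample") if match else None


# The loose pattern claims a nested, oddly named file as a "sample".
loose_claim = claimed_sample(LOOSE, "results/tmp/A.sorted.bam")
# The narrow pattern refuses it and only accepts the intended shape.
narrow_claim = claimed_sample(NARROW, "results/tmp/A.sorted.bam")
intended = claimed_sample(NARROW, "results/A.bam")
```

In a real Snakefile the equivalent narrowing would normally be expressed with `wildcard_constraints`. The point of the sketch is only that a narrow domain turns an accidental match into no match at all.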

Study route

flowchart TD
  start["Overview"] --> core1["Deterministic Target Lists and Sample Discovery"]
  core1 --> core2["Wildcard Domains and Fanout Control"]
  core2 --> core3["Checkpoints and Reviewed DAG Changes"]
  core3 --> core4["Provenance, Manifests, and Publish Boundaries"]
  core4 --> core5["Software Stacks and Scheduler Cost"]
  core5 --> example["Worked Example: Making Checkpoint Discovery Reviewable"]
  example --> practice["Exercises"]
  practice --> answers["Exercise Answers"]
  answers --> glossary["Glossary"]

Read the module in that order the first time through. When you return later, jump to the page that matches the failure or design pressure in front of you.

The ten files in this module

  1. Overview (index.md)
  2. Deterministic Target Lists and Sample Discovery
  3. Wildcard Domains and Fanout Control
  4. Checkpoints and Reviewed DAG Changes
  5. Provenance, Manifests, and Publish Boundaries
  6. Software Stacks and Scheduler Cost
  7. Worked Example: Making Checkpoint Discovery Reviewable
  8. Exercises
  9. Exercise Answers
  10. Glossary

How to use the file set

| If you need to... | Start here |
| --- | --- |
| stop discovery from drifting with directory noise or unordered scans | Deterministic Target Lists and Sample Discovery |
| prevent wildcards and expand() from creating nonsense work | Wildcard Domains and Fanout Control |
| decide whether a checkpoint is the right tool or a design smell | Checkpoints and Reviewed DAG Changes |
| make dynamic behavior visible to reviewers and downstream consumers | Provenance, Manifests, and Publish Boundaries |
| keep environments and job granularity from turning correctness into slowness | Software Stacks and Scheduler Cost |
| see the module as one repaired workflow rather than five isolated rules | Worked Example: Making Checkpoint Discovery Reviewable |
| test your own understanding | Exercises |
| compare your reasoning against a reference answer | Exercise Answers |
| stabilize the module vocabulary | Glossary |

The running question

Carry this question through every page:

if discovery changes the DAG, what exact artifact lets another person review that change later?

Good Module 02 answers usually mention one or more of these:

  • a concrete discovered-set file
  • a validated target list
  • a wildcard boundary that limits what can be claimed
  • a checkpoint output that is durable enough to reread
  • a publish or provenance artifact that preserves the run story
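As one possible reading of "a validated target list", the sketch below checks a discovered sample set before turning it into concrete output paths. The function name `validated_targets`, the alphanumeric naming rule, and the `results/{sample}/summary.txt` layout are all hypothetical and are not taken from the module's own code.

```python
import re

# Hypothetical naming rule for sample IDs.
SAMPLE_NAME = re.compile(r"[A-Za-z0-9]+")


def validated_targets(samples, pattern="results/{sample}/summary.txt"):
    """Validate sample names, then return target paths in sorted order.

    Raises ValueError for an empty set, an invalid name, or a duplicate,
    so a bad discovery fails loudly instead of silently reshaping the DAG.
    """
    if not samples:
        raise ValueError("discovered set is empty")
    seen = set()
    for name in samples:
        if not SAMPLE_NAME.fullmatch(name):
            raise ValueError(f"invalid sample name: {name!r}")
        if name in seen:
            raise ValueError(f"duplicate sample name: {name!r}")
        seen.add(name)
    # Sorting makes the target list independent of discovery order.
    return [pattern.format(sample=name) for name in sorted(samples)]


targets = validated_targets(["B", "A"])
```

A list built this way could be passed to a target rule, and because it is derived from an explicit input, a reviewer can recompute it from the recorded discovered set.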

The running example

This module keeps returning to one simple workflow shape:

  • raw sequencing files appear in data/raw/
  • the workflow discovers which samples exist
  • per-sample jobs fan out from that discovery
  • the discovered set is recorded before downstream work trusts it
  • a publish surface carries that discovery forward for review

That shape is small enough to teach, but realistic enough to expose the design mistakes people actually make with Snakemake.
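The discovery and recording steps of that shape can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the module's reference implementation: the `.fastq.gz` suffix, the alphanumeric sample names, and the helper names `discover_samples` and `record_samples` are all hypothetical.

```python
import re
from pathlib import Path

# Hypothetical naming rule: <alphanumeric sample>.fastq.gz and nothing else.
SAMPLE_FILE = re.compile(r"(?P<sample>[A-Za-z0-9]+)\.fastq\.gz")


def discover_samples(raw_dir):
    """Return sample names found directly in raw_dir, in sorted order.

    Files that do not match the naming rule (editor backups, partial
    downloads, OS metadata) are ignored rather than guessed at.
    """
    names = set()
    for path in Path(raw_dir).iterdir():
        match = SAMPLE_FILE.fullmatch(path.name)
        if path.is_file() and match:
            names.add(match.group("sample"))
    # Directory iteration order is not guaranteed, so sort explicitly.
    return sorted(names)


def record_samples(samples, out_path):
    """Write one sample name per line so the discovered set can be reviewed."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("".join(f"{name}\n" for name in samples))
```

Calling `record_samples(discover_samples("data/raw"), "results/discovered_samples.txt")` would leave a file that downstream rules and reviewers can both read; the output path here is an assumed location.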

Commands to keep close

These commands form the evidence loop for Module 02:

snakemake -n
snakemake --summary
snakemake --dag | dot -Tpdf > dag.pdf
snakemake --lint
snakemake --list-changes params input code

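They answer different questions, in the same order as the commands above:

  • snakemake -n: what would run
  • snakemake --summary: which files Snakemake believes it owns
  • snakemake --dag: how the current jobs relate
  • snakemake --lint: which design smells are already visible
  • snakemake --list-changes: which tracked changes justify reruns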

Learning outcomes

By the end of this module, you should be able to:

  • materialize a deterministic target list from changing data
  • constrain wildcard domains so the DAG matches the real problem
  • use checkpoints to reveal dynamic structure instead of hiding it
  • explain how discovery, provenance, and publish artifacts fit together
  • recognize when too many environments or too many tiny jobs are degrading the workflow

Exit standard

Do not move on until all of these are true:

  • you can explain one discovery route without saying "Snakemake just figures it out"
  • you can show where the discovered set is recorded and why that location matters
  • you can describe one checkpoint that is justified and one that is really a smell
  • you can say which artifacts are internal execution state and which are safe to publish
  • you can explain one performance repair without weakening workflow truth

When those become ordinary, Module 02 has done its job.