Module 01: First Principles and the File-DAG Contract

Module 01 is where the whole Snakemake program either becomes clear or stays mystical.

This module is not about memorizing rule syntax. It is about learning the contract Snakemake actually enforces:

- targets declare intent
- files define dependency truth
- rules publish outputs that other rules are allowed to trust
- reruns only make sense when the workflow tells the truth about its inputs, outputs, and tracked changes

If that model is shaky, every later feature in the course becomes harder to reason about.
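
The contract can be illustrated with a toy planner in plain Python. This is a minimal sketch, not Snakemake's implementation: the `RULES` table, the file names, and the `plan` function are all hypothetical, and the sketch ignores wildcards, timestamps, and tracked changes. It only shows that a requested target is resolved backwards through file names into an ordered list of jobs, and that a rule no target needs never enters the plan.

```python
# Toy planner (hypothetical names throughout): each rule is a file contract
# stating which files it reads and which single file it publishes.
RULES = {
    "count": {"input": ["data/raw.txt"], "output": "results/counts.txt"},
    "report": {"input": ["results/counts.txt"], "output": "results/report.txt"},
    "unused": {"input": ["data/raw.txt"], "output": "results/extra.txt"},
}


def plan(target, existing, rules=RULES):
    """Return the rule names, in execution order, needed to build `target`."""
    if target in existing:
        return []  # file is already there; this sketch ignores staleness
    for name, rule in rules.items():
        if rule["output"] == target:
            jobs = []
            for path in rule["input"]:
                # Each input is itself a target that must be resolved first.
                for job in plan(path, existing, rules):
                    if job not in jobs:
                        jobs.append(job)
            return jobs + [name]
    raise FileNotFoundError(f"no rule produces {target!r} and it does not exist")


jobs = plan("results/report.txt", existing={"data/raw.txt"})
print(jobs)  # ['count', 'report']
```

The `unused` rule is absent from the result because no requested file depends on its output, which is the same reason a rule can be missing from a real DAG.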

What this module is for

By the end of Module 01, you should be able to explain five things clearly:

- how a rule acts as a file contract rather than a vague workflow step
- how targets become a DAG and then concrete jobs
- why convergence fails when hidden inputs or unstable parameters leak in
- how wildcard binding can stay precise or become ambiguous
- how config, profiles, logs, and atomic publication keep a small workflow trustworthy
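
Wildcard ambiguity can be seen without running Snakemake. The sketch below enumerates every way an output pattern can bind to a target path, treating each unconstrained wildcard as the regular expression `.+` (the default the Snakemake documentation describes). The `bindings` helper and the file names are hypothetical, and the enumeration is for illustration only: Snakemake itself performs a single regex match rather than listing alternatives. The sketch also assumes each wildcard name appears once in the pattern.

```python
import re


def bindings(pattern, path, constraints=None):
    """Return every wildcard binding under which `pattern` matches `path`."""
    constraints = constraints or {}
    # re.split with a capture group alternates literals and wildcard names:
    # 'results/{sample}_{rep}.txt' -> ['results/', 'sample', '_', 'rep', '.txt']
    tokens = re.split(r"\{(\w+)\}", pattern)
    return list(_match(tokens, path, constraints, {}))


def _match(tokens, path, constraints, bound):
    literal = tokens[0]
    if not path.startswith(literal):
        return
    rest = path[len(literal):]
    if len(tokens) == 1:
        # Last literal: the path must be fully consumed.
        if rest == "":
            yield dict(bound)
        return
    name = tokens[1]
    regex = re.compile(constraints.get(name, ".+"))
    # Try every non-empty prefix as the value of this wildcard.
    for end in range(1, len(rest) + 1):
        value = rest[:end]
        if regex.fullmatch(value):
            yield from _match(tokens[2:], rest[end:], constraints, {**bound, name: value})


pattern = "results/{sample}_{rep}.txt"
target = "results/a_b_1.txt"

print(bindings(pattern, target))
# [{'sample': 'a', 'rep': 'b_1'}, {'sample': 'a_b', 'rep': '1'}]

print(bindings(pattern, target, constraints={"rep": r"\d+"}))
# [{'sample': 'a_b', 'rep': '1'}]
```

Two bindings for one path is the ambiguity; one binding after the constraint is the repair. In a Snakefile the same restriction is usually written with `wildcard_constraints` or inline as `{rep,\d+}`.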

Study route

flowchart TD
  start["Overview"] --> core1["File Contracts, Targets, and Job Planning"]
  core1 --> core2["Convergence, Rerun Causes, and Hidden Inputs"]
  core2 --> core3["Wildcards, Binding, Ambiguity, and Constraints"]
  core3 --> core4["Config as Data, Profiles as Policy"]
  core4 --> core5["Atomic Publication, Logs, and Failure Evidence"]
  core5 --> example["Worked Example: Repairing a Lying First Workflow"]
  example --> practice["Exercises"]
  practice --> answers["Exercise Answers"]
  answers --> glossary["Glossary"]

Read the module in that order the first time. Later, jump directly to the page that matches the failure or design question you are facing.

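The ten files in this module

The module's ten files correspond to the ten pages in the study route above: the Overview, the five core pages, the worked example, the exercises, the exercise answers, and the glossary.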

How to use the file set

If you need to...	Start here
understand why a rule runs, does not run, or is absent from the DAG	File Contracts, Targets, and Job Planning
explain why a workflow reruns forever or fails to rerun when it should	Convergence, Rerun Causes, and Hidden Inputs
make wildcard patterns precise enough to avoid accidental matches	Wildcards, Binding, Ambiguity, and Constraints
keep semantic inputs separate from execution policy	Config as Data, Profiles as Policy
stop partial outputs and improve failure evidence	Atomic Publication, Logs, and Failure Evidence
see the whole module as one repaired beginner workflow	Worked Example: Repairing a Lying First Workflow
test your own understanding	Exercises
compare your reasoning against a reference	Exercise Answers
stabilize the module vocabulary	Glossary

The running question

Carry this question through every page:

what exact file contract or tracked change explains why Snakemake builds, skips, or reruns this output?

Good Module 01 answers usually mention one or more of these:

- a concrete target path
- the rule output pattern that matches it
- the input or parameter that justifies the job
- the evidence route that confirms the explanation, such as dry-run, summary, DAG, or logs
- the publication rule that makes a final output trustworthy
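
The running question can be made concrete with a small decision function. This is a sketch under simplifying assumptions, not Snakemake's logic: the `rerun_reasons` name and the integer timestamps are hypothetical, and only three causes are modeled (missing output, a declared input newer than the output, changed parameters). Real Snakemake tracks further changes, such as code and software environment.

```python
def rerun_reasons(output_mtime, input_mtimes, params, recorded_params):
    """List why a job would rerun; an empty list means the output is up to date.

    output_mtime: modification time of the output, or None if it is missing.
    input_mtimes: {declared input path: modification time}.
    params / recorded_params: current parameters vs. those stored at the last run.
    """
    if output_mtime is None:
        return ["output missing"]
    reasons = [
        f"input newer than output: {path}"
        for path, mtime in sorted(input_mtimes.items())
        if mtime > output_mtime
    ]
    if params != recorded_params:
        reasons.append("params changed")
    return reasons


# Converged: declared inputs are older than the output and params are stable.
print(rerun_reasons(200, {"data/raw.txt": 100}, {"min_count": 5}, {"min_count": 5}))
# []

# Unstable parameter: a timestamp baked into params differs on every invocation,
# so the job reruns forever.
print(rerun_reasons(200, {"data/raw.txt": 100}, {"stamp": 1002}, {"stamp": 1001}))
# ['params changed']

# Hidden input: the job also reads lookup.tsv (changed at t=300) but never
# declares it, so nothing is reported and the stale output is skipped.
declared = {"data/raw.txt": 100}
print(rerun_reasons(200, declared, {"min_count": 5}, {"min_count": 5}))
# []
```

The third case returns the same empty list as the first even though the result is stale. The decision can only be as truthful as the declared inputs it is given.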

Commands to keep close

These commands form the evidence loop for Module 01:

snakemake -n
snakemake --summary
snakemake --dag | dot -Tpdf > dag.pdf
snakemake --rulegraph | dot -Tpdf > rulegraph.pdf
snakemake --lint

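They answer different questions, in the same order as the commands above:

- snakemake -n: what would run
- snakemake --summary: which rule owns which files
- snakemake --dag: how jobs depend on each other
- snakemake --rulegraph: how rules relate structurally
- snakemake --lint: which design smells already exist

The two graph commands pipe into dot, so they assume Graphviz is installed.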

Learning outcomes

By the end of this module, you should be able to:

- explain rules as file contracts and predict the resulting jobs
- prove convergence and diagnose rerun causes
- use wildcards precisely and recognize ambiguity early
- validate config early and keep profiles out of semantic workflow meaning
- publish outputs atomically and leave behind usable failure evidence
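
Atomic publication, the last outcome above, follows a common pattern: write to a temporary file next to the final path, then rename it into place. The sketch below is one way to do that in a script a rule might call. The `publish_atomically` name and the paths are hypothetical, and the approach assumes the temporary file and the final path are on the same filesystem, where `os.replace` performs an atomic rename on POSIX systems.

```python
import os
import tempfile
from pathlib import Path


def publish_atomically(final_path, text):
    """Write `text` so that `final_path` is either absent, old, or complete."""
    final_path = Path(final_path)
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # The temporary file lives in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=final_path.parent, prefix=final_path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, final_path)  # readers never see a half-written file
    except BaseException:
        # On any failure, remove the partial file instead of publishing it.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final_path


with tempfile.TemporaryDirectory() as workdir:
    published = publish_atomically(Path(workdir) / "results" / "report.txt", "count\t3\n")
    print(published.read_text(encoding="utf-8"), end="")
    print(sorted(p.name for p in published.parent.iterdir()))
```

If the write fails partway, the final path is never touched, so downstream rules cannot mistake a truncated file for a trustworthy output.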

Exit standard

Do not move on until all of these are true:

- you can explain why a small workflow does or does not run a rule
- you can make a tiny workflow converge after a clean run
- you can show one ambiguous wildcard design and repair it
- you can distinguish config data from execution policy clearly
- you can explain why a final output is trustworthy or why it is poison

When those feel ordinary, Module 01 has done its job.