Skip to content

Module 01: First Principles and the File-DAG Contract

Page Maps

graph LR
  family["Reproducible Research"]
  program["Deep Dive Snakemake"]
  section["File Contracts Workflow Graph Truth"]
  page["Module 01: First Principles and the File-DAG Contract"]
  capstone["Capstone evidence"]

  family --> program --> section --> page
  page -.applies in.-> capstone
flowchart LR
  orient["Orient on the page map"] --> read["Read the main claim and examples"]
  read --> inspect["Inspect the related code, proof, or capstone surface"]
  inspect --> verify["Run or review the verification path"]
  verify --> apply["Apply the idea back to the module and capstone"]

Module 01 is where the whole Snakemake program either becomes clear or stays mystical.

This module is not about memorizing rule syntax. It is about learning the contract Snakemake actually enforces:

  • targets declare intent
  • files define dependency truth
  • rules publish outputs that other rules are allowed to trust
  • reruns only make sense when the workflow tells the truth about its inputs, outputs, and tracked changes

If that model is shaky, every later feature in the course becomes harder to reason about.

What this module is for

By the end of Module 01, you should be able to explain five things clearly:

  • how a rule acts as a file contract rather than a vague workflow step
  • how targets become a DAG and then concrete jobs
  • why convergence fails when hidden inputs or unstable parameters leak in
  • how wildcard binding can stay precise or become ambiguous
  • how config, profiles, logs, and atomic publication keep a small workflow trustworthy

Study route

flowchart TD
  start["Overview"] --> core1["File Contracts, Targets, and Job Planning"]
  core1 --> core2["Convergence, Rerun Causes, and Hidden Inputs"]
  core2 --> core3["Wildcards, Binding, Ambiguity, and Constraints"]
  core3 --> core4["Config as Data, Profiles as Policy"]
  core4 --> core5["Atomic Publication, Logs, and Failure Evidence"]
  core5 --> example["Worked Example: Repairing a Lying First Workflow"]
  example --> practice["Exercises"]
  practice --> answers["Exercise Answers"]
  answers --> glossary["Glossary"]

Read the module in that order the first time. Later, jump directly to the page that matches the failure or design question you are facing.

The ten files in this module

  1. Overview (index.md)
  2. File Contracts, Targets, and Job Planning
  3. Convergence, Rerun Causes, and Hidden Inputs
  4. Wildcards, Binding, Ambiguity, and Constraints
  5. Config as Data, Profiles as Policy
  6. Atomic Publication, Logs, and Failure Evidence
  7. Worked Example: Repairing a Lying First Workflow
  8. Exercises
  9. Exercise Answers
  10. Glossary

How to use the file set

If you need to... Start here
understand why a rule runs, does not run, or is absent from the DAG File Contracts, Targets, and Job Planning
explain why a workflow reruns forever or fails to rerun when it should Convergence, Rerun Causes, and Hidden Inputs
make wildcard patterns precise enough to avoid accidental matches Wildcards, Binding, Ambiguity, and Constraints
keep semantic inputs separate from execution policy Config as Data, Profiles as Policy
stop partial outputs and improve failure evidence Atomic Publication, Logs, and Failure Evidence
see the whole module as one repaired beginner workflow Worked Example: Repairing a Lying First Workflow
test your own understanding Exercises
compare your reasoning against a reference Exercise Answers
stabilize the module vocabulary Glossary

The running question

Carry this question through every page:

what exact file contract or tracked change explains why Snakemake builds, skips, or reruns this output?

Good Module 01 answers usually mention one or more of these:

  • a concrete target path
  • the rule output pattern that matches it
  • the input or parameter that justifies the job
  • the evidence route that confirms the explanation, such as dry-run, summary, DAG, or logs
  • the publication rule that makes a final output trustworthy

Commands to keep close

These commands form the evidence loop for Module 01:

snakemake -n
snakemake --summary
snakemake --dag | dot -Tpdf > dag.pdf
snakemake --rulegraph | dot -Tpdf > rulegraph.pdf
snakemake --lint

They answer different questions:

  • what would run
  • who owns which files
  • how jobs depend on each other
  • how rules relate structurally
  • which design smells already exist

Learning outcomes

By the end of this module, you should be able to:

  • explain rules as file contracts and predict the resulting jobs
  • prove convergence and diagnose rerun causes
  • use wildcards precisely and recognize ambiguity early
  • validate config early and keep profiles out of semantic workflow meaning
  • publish outputs atomically and leave behind usable failure evidence

Exit standard

Do not move on until all of these are true:

  • you can explain why a small workflow does or does not run a rule
  • you can make a tiny workflow converge after a clean run
  • you can show one ambiguous wildcard design and repair it
  • you can distinguish config data from execution policy clearly
  • you can explain why a final output is trustworthy or why it is poison

When those feel ordinary, Module 01 has done its job.