Workflow Modularization

Guide Fit

flowchart TD
  family["Reproducible Research"] --> program["Deep Dive Snakemake"]
  program --> pressure["A concrete learner or reviewer question"]
  pressure --> guide["Workflow Modularization"]
  guide --> next["Modules, capstone, and reference surfaces"]

flowchart TD
  question["Name the exact question you need answered"] --> skim["Skim only the sections that match that pressure"]
  skim --> crosscheck["Open the linked module, proof surface, or capstone route"]
  crosscheck --> next_move["Leave with one next decision, page, or command"]

Read the first diagram as a timing map: this guide answers one named pressure and is not meant for wandering through the whole course-book. Read the second diagram as the guide loop: arrive with a concrete question, read only the sections that match it, then leave with one smaller, more honest next move.

Use this page when a workflow is growing and the main question is not "can we split it?" but "which split keeps the workflow legible?"

The Levels

| If the real need is... | Prefer this level | What it should own | What it must not hide |
| --- | --- | --- | --- |
| one small workflow with obvious rule relationships | a single Snakefile | the visible workflow graph | architecture complexity for its own sake |
| grouping coherent rule families inside one repository | include: files under workflow/rules/ | rule families with shared file-contract concerns | cross-cutting defaults that only exist in helper files |
| reusing a workflow bundle with a clear boundary | workflow/modules/ | a named workflow boundary with explicit inputs and outputs | the real DAG shape or consumer-facing file contracts |
| moving non-trivial implementation out of rule bodies | workflow/scripts/ or src/ package code | computation and reusable program logic | silent workflow semantics that are no longer visible from the rules |
| changing run context without changing meaning | profiles/ | execution policy, retries, resources, and executor settings | analytical meaning or published output contracts |
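One way to make the levels concrete is to ask which level owns a given file in the repository. The sketch below is a minimal illustration, assuming the directory names from the table are used literally. The function name `level_of` and the level labels are hypothetical shorthand for this example, not part of Snakemake.

```python
from pathlib import PurePosixPath

# (path prefix, level) pairs taken from the table above.
# The labels are this sketch's own shorthand, not Snakemake terminology.
LEVELS = [
    ("workflow/rules", "include: rule family"),
    ("workflow/modules", "workflow module"),
    ("workflow/scripts", "implementation code"),
    ("src", "implementation code"),
    ("profiles", "execution policy"),
]


def level_of(path):
    """Return the level that owns a repository-relative path."""
    parts = PurePosixPath(path).parts
    # Check directory prefixes first, so a module's own Snakefile
    # is attributed to the module rather than to the main workflow.
    for prefix, level in LEVELS:
        prefix_parts = PurePosixPath(prefix).parts
        if parts[: len(prefix_parts)] == prefix_parts:
            return level
    if parts and parts[-1] == "Snakefile":
        return "main Snakefile"
    return "unassigned"


if __name__ == "__main__":
    for example in ["workflow/Snakefile", "workflow/rules/qc.smk", "profiles/slurm/config.yaml"]:
        print(example, "->", level_of(example))
```

A file that comes back as "unassigned" is a prompt for review: either it belongs to one of the levels and sits in the wrong place, or it carries behavior that no level has agreed to own.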

Fast Decision Rules

  • Stay in one Snakefile while the workflow graph is still easier to review than the split.
  • Use include: when the split mirrors rule ownership that a reviewer can name in one sentence.
  • Use workflow/modules/ only when the module has a stable interface and does not make the main graph harder to explain.
  • Move logic into workflow/scripts/ or src/ when the code is real program logic, not merely shell glue.
  • Keep profiles/ for operating policy only; if a profile change would alter the workflow meaning, the boundary is wrong.
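The rules above can be read as a short checklist. The following sketch encodes them as a function, under two assumptions that the rules themselves do not state: a module is preferred over include: when both would qualify, and the fallback when no split can be justified is to hold the split. The function name `recommend` and its parameter names are hypothetical.

```python
def recommend(
    graph_easier_than_split=False,      # the single-file graph is still easier to review
    ownership_in_one_sentence=False,    # a reviewer can name the rule ownership briefly
    stable_interface=False,             # the candidate module has a stable interface
    module_obscures_graph=False,        # the module would make the main graph harder to explain
    real_program_logic=False,           # rule bodies hold program logic, not just shell glue
    profile_changes_meaning=False,      # a profile setting would alter workflow meaning
):
    """Return a list of recommendations that follow the fast decision rules."""
    advice = []

    # Structural choice: exactly one of these applies.
    if graph_easier_than_split:
        advice.append("stay in one Snakefile")
    elif stable_interface and not module_obscures_graph:
        advice.append("use workflow/modules/")
    elif ownership_in_one_sentence:
        advice.append("split with include: under workflow/rules/")
    else:
        # Assumption of this sketch: no justified split means no split yet.
        advice.append("hold the split until ownership can be named")

    # Independent checks: these can apply alongside any structural choice.
    if real_program_logic:
        advice.append("move logic into workflow/scripts/ or src/")
    if profile_changes_meaning:
        advice.append("fix the boundary: move that setting out of profiles/")
    return advice


if __name__ == "__main__":
    print(recommend(ownership_in_one_sentence=True, real_program_logic=True))
```

The structural question and the two independent checks are kept separate on purpose: moving code into scripts or tightening a profile does not depend on how the rules themselves are split.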

Anti-Patterns

  • Splitting files only because one file became long, while leaving ownership more confusing than before.
  • Creating a "common" workflow module that everybody imports and nobody can review confidently.
  • Hiding path conventions or wildcard assumptions in helper code that the rule surface never names.
  • Treating profile settings as harmless when they actually change published behavior or scientific meaning.
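The last anti-pattern can be checked mechanically to a first approximation. The sketch below audits a profile that has already been loaded into a dictionary and sorts its keys into operating policy, settings that can change workflow meaning, and keys nobody has classified yet. The key sets are illustrative and deliberately incomplete; they assume profile keys mirror Snakemake's long command-line option names, and `audit_profile` is a hypothetical helper, not a Snakemake API.

```python
# Illustrative, incomplete key sets. Adjust them to the options that
# your Snakemake version accepts and that your team has reviewed.
POLICY_KEYS = {
    "jobs", "cores", "retries", "latency-wait",
    "default-resources", "executor", "keep-going", "rerun-incomplete",
}
# Keys that inject or replace workflow configuration, and so can change results.
MEANING_KEYS = {"config", "configfile"}


def audit_profile(profile):
    """Sort the keys of a loaded profile dict into three review buckets."""
    report = {"policy": [], "changes_meaning": [], "unreviewed": []}
    for key in sorted(profile):
        if key in MEANING_KEYS:
            report["changes_meaning"].append(key)
        elif key in POLICY_KEYS:
            report["policy"].append(key)
        else:
            # Unknown keys are surfaced rather than assumed harmless.
            report["unreviewed"].append(key)
    return report


if __name__ == "__main__":
    example = {"jobs": 8, "retries": 2, "config": ["threshold=0.9"]}
    print(audit_profile(example))
```

Anything that lands in the "changes_meaning" bucket is a sign that the setting belongs in the workflow's own configuration, where reviewers of the rules can see it.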

Best Companion Surfaces