Module 07: Workflow Architecture and File APIs
Module Position
flowchart TD
family["Reproducible Research"] --> program["Deep Dive Snakemake"]
program --> module["Module 07: Workflow Architecture and File APIs"]
module --> lessons["Lesson pages and worked examples"]
module --> checkpoints["Exercises and closing criteria"]
module --> capstone["Related capstone evidence"]
flowchart TD
purpose["Start with the module purpose and main questions"] --> lesson_map["Use the lesson map to choose reading order"]
lesson_map --> study["Read the lessons and examples with one review question in mind"]
study --> proof["Test the idea with exercises and capstone checkpoints"]
proof --> close["Move on only when the closing criteria feel concrete"]
Read the first diagram as a placement map: this page sits between the course promise, the lesson pages listed below, and the capstone surfaces that pressure-test the module. Read the second diagram as the study route for this page, so the diagrams point you toward the Lesson map, Exercises, and Closing criteria instead of acting like decoration.
Workflows become hard to maintain long before they become biologically or computationally large. The usual cause is not scale by itself. It is hidden architecture: rules copied across files, modules introduced without a boundary story, and consumers depending on paths that were never documented as an interface.
This module is about designing a Snakemake repository that stays legible as it grows: where rules live, how helper modules are arranged, which files are public contracts, and how reuse can reduce duplication without burying the workflow model.
Capstone exists here as corroboration. The module should already make repository boundaries understandable before you inspect the reference workflow layout.
Before You Begin
This module works best after Modules 01-06, especially the parts on file contracts, publish surfaces, and production boundaries.
Use this module if you need to learn how to:
- split a Snakemake repository into modules without hiding the real workflow shape
- define file APIs and repository boundaries that other teams can review
- reuse rules and helper code without turning the workflow into an internal framework
Proof loop for this module:
Capstone corroboration:
- inspect `capstone/Snakefile`
- inspect `capstone/workflow/rules/`
- inspect Publish Review Guide
- inspect Capstone Walkthrough
At a Glance
| Focus | Learner question | Capstone timing |
|---|---|---|
| repository boundaries | "Where should a maintainer look first to understand the workflow shape?" | inspect the capstone after the routing-versus-implementation distinction is clear |
| file APIs | "Which paths are a public promise and which are only implementation detail?" | compare the workflow tree with FILE_API.md deliberately |
| reuse without hiding meaning | "How do we reduce duplication without turning the workflow into a private framework?" | use the capstone once module boundaries already feel reviewable |
1) Table of Contents
- Table of Contents
- Learning Outcomes
- How to Use This Module
- Core 1 — Repository Layers and Rule Placement
- Core 2 — Modules, Includes, and Namespaced Reuse
- Core 3 — File APIs as Team Contracts
- Core 4 — Shared Helpers Without Hidden Coupling
- Core 5 — Architecture Review Before a Workflow Rots
- Capstone Sidebar
- Exercises
- Closing Criteria
2) Learning Outcomes
By the end of this module, you can:
- organize a Snakemake repository so a newcomer can locate the main workflow boundaries quickly
- use includes or modules without losing track of the visible rule graph
- define a file API that distinguishes stable outputs from internal implementation detail
- reuse helpers and shared code while keeping contracts explicit
- review workflow architecture for hidden coupling before it becomes expensive to change
3) How to Use This Module
Build or refactor a repository with these layers:
Then answer four questions by inspection:
- where is the workflow entrypoint?
- where do rules live?
- which outputs are public?
- where does reusable code belong if it is not itself a rule?
If those answers require oral tradition, the architecture is already too opaque.
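Part of that inspection can be scripted. The sketch below is only an illustration: it assumes a hypothetical layout (`Snakefile`, `workflow/rules/`, `FILE_API.md`, `src/`) modelled on the names used in the capstone sidebar, and it reports which layers a newcomer could locate without asking anyone. The layer names and the helper functions are assumptions, not part of the course material.

```python
from pathlib import Path

# Hypothetical layer layout; adjust the names to match your own repository.
EXPECTED_LAYERS = {
    "entrypoint": "Snakefile",
    "rules": "workflow/rules",
    "file_api": "FILE_API.md",
    "helpers": "src",
}


def inspect_layers(repo_root):
    """Return {layer_name: bool} for whether each expected layer exists."""
    root = Path(repo_root)
    return {name: (root / rel).exists() for name, rel in EXPECTED_LAYERS.items()}


def missing_layers(repo_root):
    """List, in sorted order, the layers that cannot be found by inspection."""
    found = inspect_layers(repo_root)
    return sorted(name for name, present in found.items() if not present)
```

A non-empty result from `missing_layers(".")` is a prompt for a conversation, not proof of bad architecture: a repository may keep the same layers under different names.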
4) Core 1 — Repository Layers and Rule Placement
A maintainable Snakemake repository usually has distinct homes for:
- workflow entry logic
- rule definitions
- helper scripts and packages
- config and schemas
- profiles and executor policy
- published file contracts
The top-level Snakefile should read as a routing surface that points to where things live, not as the only place where the workflow's logic can be found.
Good architecture lets a reviewer answer:
- which files construct the workflow
- which files implement computation
- which files define published outputs
5) Core 2 — Modules, Includes, and Namespaced Reuse
Breaking a workflow across files is only helpful if the split mirrors responsibility.
Useful patterns:
- `include:` files grouped by coherent rule families
- module boundaries that correspond to stable domains or interfaces
- rule names and output paths that stay understandable after the split
Risky patterns:
- including files purely because one Snakefile became too long
- spreading one workflow concern across many files with no ownership boundary
- hiding critical defaults in a helper file that most reviewers never open
The question is never “can I split this?” It is “does the split make the contract easier to inspect?”
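One way to test whether a split is still inspectable is to list which file owns which rule. The following is a rough sketch, not a Snakemake feature: it uses regular expressions to find `include:` statements with literal string paths and `rule` or `checkpoint` definitions in plain text, so it will miss includes built from variables and rules imported through `module` statements. All function names here are hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as: include: "workflow/rules/qc.smk"
INCLUDE_RE = re.compile(r'^\s*include:\s*["\']([^"\']+)["\']', re.MULTILINE)
# Matches lines such as: rule fastqc:   or   checkpoint split_samples:
RULE_RE = re.compile(r'^\s*(?:rule|checkpoint)\s+(\w+)\s*:', re.MULTILINE)


def list_includes(snakefile_text):
    """Return the literal include paths found in one Snakefile's text."""
    return INCLUDE_RE.findall(snakefile_text)


def rules_by_file(sources):
    """Map each rule name to the files that define it.

    `sources` is a dict of {filename: file_text}.
    """
    owners = defaultdict(list)
    for filename, text in sources.items():
        for name in RULE_RE.findall(text):
            owners[name].append(filename)
    return dict(owners)


def duplicated_rules(sources):
    """Return only the rule names defined in more than one file."""
    return {name: files for name, files in rules_by_file(sources).items()
            if len(files) > 1}
```

If the ownership map surprises the people who maintain the workflow, the split is probably tracking file length rather than responsibility.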
6) Core 3 — File APIs as Team Contracts
When a repository grows, downstream trust depends on explicit file APIs.
A file API should answer:
- which paths are stable
- what each published file means
- which formats are authoritative
- which directories are internal only
This is the architectural equivalent of a typed interface. Without it:
- notebooks read unstable intermediate files
- tests bind to implementation detail
- refactors become dangerous because consumers were never named
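The typed-interface analogy can be made concrete with a small data structure. The sketch below is a minimal illustration under stated assumptions: the paths, formats, and internal directory names are invented for the example, and `classify` is a hypothetical helper rather than anything Snakemake provides. It records what each published file means and lets a reviewer check whether a consumer's path is part of the contract.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class PublishedFile:
    pattern: str  # stable path pattern that consumers may rely on
    meaning: str  # what the file represents
    fmt: str      # authoritative format


# Hypothetical contract; these entries are illustrative only.
FILE_API = [
    PublishedFile("results/summary.tsv", "one row per sample", "tsv"),
    PublishedFile("results/reports/*.html", "per-sample QC report", "html"),
]
# Hypothetical directories declared as internal only.
INTERNAL_DIRS = ("work/", "tmp/")


def classify(path):
    """Return 'public', 'internal', or 'undeclared' for a consumer path."""
    if any(fnmatchcase(path, entry.pattern) for entry in FILE_API):
        return "public"
    if path.startswith(INTERNAL_DIRS):
        return "internal"
    return "undeclared"
```

The `undeclared` case matters as much as the other two: a path that is neither promised nor marked internal is exactly where a notebook or test can start binding to implementation detail without anyone noticing.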
7) Core 4 — Shared Helpers Without Hidden Coupling
Shared code becomes dangerous when it reads undeclared files, relies on global config shape that rules do not document, or mutates behavior through import-time side effects.
Healthy helper boundaries:
- pass paths and parameters explicitly
- keep pure transformation logic testable outside Snakemake
- let the rule own the file contract while the helper owns the computation
- keep rule names, file paths, and helper names aligned enough that review is still possible
Unhealthy coupling:
- helper code that silently reads sibling directories
- `config` keys assumed by helpers but not validated anywhere
- “common” modules that nobody understands but everyone is afraid to touch
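A healthy helper boundary might look like the sketch below. It is an assumption-laden example: the column names `sample` and `reads`, the `min_reads` parameter, and both function names are hypothetical. The pure function holds the computation and can be tested without Snakemake, while the thin wrapper receives every path and parameter as an argument so the calling rule keeps ownership of the file contract.

```python
import csv


def filter_samples(rows, min_reads):
    """Pure logic: keep rows with at least `min_reads` reads. No file access."""
    return [row for row in rows if int(row["reads"]) >= min_reads]


def run(input_path, output_path, min_reads):
    """Thin I/O wrapper: all paths and parameters arrive explicitly."""
    with open(input_path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    kept = filter_samples(rows, min_reads)
    with open(output_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=["sample", "reads"],  # hypothetical column names
            delimiter="\t",
            extrasaction="ignore",  # drop columns outside the declared output
        )
        writer.writeheader()
        writer.writerows(kept)


# In a Snakemake `script:` file, the rule's declared values could be passed
# through like this (assuming the rule defines a `min_reads` param):
#     run(snakemake.input[0], snakemake.output[0], snakemake.params.min_reads)
```

Because `run` never looks up a path or config key on its own, a reviewer can read the rule and know every file the helper touches.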
8) Core 5 — Architecture Review Before a Workflow Rots
Review a growing workflow with these questions:
- can a new contributor find the rule entrypoint quickly?
- can a downstream consumer tell what is public versus internal?
- can you remove or refactor one rule family without surprising unrelated parts?
- does the repository layout reinforce the mental model or fight it?
- are file contracts and module boundaries documented where people will actually read them?
Architecture rot often starts as convenience:
- “we will document the path later”
- “put it in common for now”
- “this extra include is temporary”
If you do not stop those shortcuts early, the workflow stops teaching its own shape.
9) Capstone Sidebar
Use the capstone to inspect:
- `Snakefile` as the workflow entrypoint
- `workflow/rules/` as the split between rule families
- `FILE_API.md` as the stable downstream contract
- `src/capstone/` as helper code that stays outside rule files
10) Exercises
- Refactor one single-file workflow into rule-family files without changing the visible contract.
- Write a short `FILE_API.md` for one workflow and mark which directories are internal only.
- Move one piece of reusable logic into `src/` and make its inputs explicit in the calling rule.
- Review a workflow repository and list the three strongest signs of hidden architectural coupling.
11) Closing Criteria
You pass this module only if you can demonstrate:
- a repository layout that mirrors workflow responsibilities clearly
- rule splits or modules that improve inspectability rather than hiding it
- a file API that names the stable downstream surface
- reusable helpers that do not smuggle undeclared dependencies across the repository
Directory glossary
Use Glossary when you want the recurring language in this module kept stable while you move between lessons, exercises, and capstone checkpoints.