Worked Example: Reading a Snakemake Repository Like an Architect

Page Maps

```mermaid
graph LR
  family["Reproducible Research"]
  program["Deep Dive Snakemake"]
  section["Workflow Architecture File APIs"]
  page["Worked Example: Reading a Snakemake Repository Like an Architect"]
  capstone["Capstone evidence"]

  family --> program --> section --> page
  page -.applies in.-> capstone
```

```mermaid
flowchart LR
  orient["Orient on the page map"] --> read["Read the main claim and examples"]
  read --> inspect["Inspect the related code, proof, or capstone surface"]
  inspect --> verify["Run or review the verification path"]
  verify --> apply["Apply the idea back to the module and capstone"]
```

This worked example ties the module together.

The goal is not to inspect every file in the repository. The goal is to show how a reviewer reads a growing workflow without getting lost in helper code or folder noise.

Starting situation

Imagine you inherit the capstone and need to answer three questions quickly:

- Where is the workflow assembled?
- Where do the main rule families live?
- Where are path contracts and public promises documented?

If you cannot answer those questions in a stable order, the repository architecture is already too opaque.

Step 1: start at the entrypoint

The top-level Snakefile is the right first file because it visibly:

- loads and validates config
- sets directory defaults
- includes rule-family files
- defines the main target

That is exactly what a good entrypoint should do.

A reviewer learns the repository shape immediately:

- workflow/rules/common.smk
- workflow/rules/preprocess.smk
- workflow/rules/summarize_report.smk
- workflow/rules/publish.smk

This is better than starting in helper code, because the entrypoint shows how the DAG is assembled before any implementation detail appears.
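One way to make that assembly visible without opening every file is to list the entrypoint's `include:` directives. The sketch below is a minimal illustration, not the capstone's own tooling: `list_includes` is a hypothetical helper, the sample Snakefile text is invented (including the config path and the `rule all` target), and the regex assumes includes are written as plain quoted strings rather than computed paths.

```python
import re

# Matches simple literal includes such as: include: "workflow/rules/common.smk"
# Assumption: includes are plain quoted strings, not computed paths.
INCLUDE_RE = re.compile(r'^[ \t]*include:[ \t]*["\']([^"\']+)["\']', re.MULTILINE)


def list_includes(snakefile_text: str) -> list[str]:
    """Return included file paths in the order they appear."""
    return INCLUDE_RE.findall(snakefile_text)


# Illustrative entrypoint text. The include paths come from this page;
# the config path and the target path are hypothetical.
SAMPLE_SNAKEFILE = '''
configfile: "config/config.yaml"

include: "workflow/rules/common.smk"
include: "workflow/rules/preprocess.smk"
include: "workflow/rules/summarize_report.smk"
include: "workflow/rules/publish.smk"

rule all:
    input:
        "publish/v1/manifest.json"
'''

if __name__ == "__main__":
    for path in list_includes(SAMPLE_SNAKEFILE):
        print(path)
```

Commented-out includes are skipped because the pattern only accepts whitespace before `include:`. A repository that builds include paths dynamically would need a different approach, and that in itself is worth noting in a review.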

Step 2: inspect rule families before helpers

Once the entrypoint is clear, the next useful move is to inspect the named rule families.

This gives a workflow story:

- preprocessing builds internal per-sample surfaces
- summarize and report promote selected artifacts
- publish defines the public bundle and its integrity surface

That sequence teaches architecture and workflow meaning at the same time.
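A quick way to get that story on one screen is to list the rule names each family file defines. The following is a rough sketch under stated assumptions: `rule_names` and `family_story` are hypothetical helpers, the file contents and rule names are invented for illustration, and the regex only recognizes `rule name:` and `checkpoint name:` headers that sit on their own line.

```python
import re

# Matches rule and checkpoint headers such as "rule trim_reads:".
# Assumption: headers sit on their own line, optionally indented.
RULE_RE = re.compile(r"^[ \t]*(?:rule|checkpoint)[ \t]+(\w+)[ \t]*:", re.MULTILINE)


def rule_names(smk_text: str) -> list[str]:
    """Return rule and checkpoint names in file order."""
    return RULE_RE.findall(smk_text)


def family_story(families: dict[str, str]) -> dict[str, list[str]]:
    """Map each rule-family file to the rule names it defines."""
    return {path: rule_names(text) for path, text in families.items()}


# Hypothetical file contents; the rule names and output paths are invented.
SAMPLE_FAMILIES = {
    "workflow/rules/preprocess.smk": (
        "rule trim_reads:\n"
        "    output: 'results/{sample}/trimmed.fastq'\n"
        "\n"
        "rule profile_kmers:\n"
        "    output: 'results/{sample}/kmers.tsv'\n"
    ),
    "workflow/rules/publish.smk": (
        "rule publish_bundle:\n"
        "    output: 'publish/v1/manifest.json'\n"
    ),
}

if __name__ == "__main__":
    for path, names in family_story(SAMPLE_FAMILIES).items():
        print(f"{path}: {', '.join(names)}")
```

If the names in each file do not read as one coherent family, that is usually a sign the file boundary and the workflow meaning have drifted apart.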

Step 3: inspect reusable workflow surfaces separately

The repository also contains workflow/modules/.

That tells a reviewer something different:

- these are reusable workflow bundles
- they are not the same kind of surface as the locally assembled rule families

For example, workflow/modules/qc_module/Snakefile and workflow/modules/screen_module/Snakefile show reusable, rule-template-style boundaries, while workflow/scripts/provenance.py remains workflow-adjacent step logic: code that a workflow step runs, not a bundle meant for reuse.

That is a useful architectural distinction.

Step 4: inspect package code only after workflow boundaries are visible

The capstone also has src/capstone/.

That is the right place to expect reusable implementation code such as:

- trim_fastq.py
- kmer_profile.py
- screen_panel.py

By the time a reviewer reaches this layer, they should already know which rule or module surface owns the orchestration boundary.

That prevents package code from becoming the first and only story of the repository.
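To check that each package module really does have a visible owner, a reviewer could search the workflow files for references to it. The sketch below is one possible approach and rests on several assumptions: `find_owners` is a hypothetical helper, the workflow file contents are invented, and the `python -m capstone.<module>` invocation style is a guess at how rules might call the package, not something this page confirms.

```python
import re


def find_owners(
    package_modules: list[str], workflow_files: dict[str, str]
) -> dict[str, list[str]]:
    """Map each package module to the workflow files that mention it by name."""
    owners: dict[str, list[str]] = {}
    for module in package_modules:
        stem = module.removesuffix(".py")
        # Word boundaries avoid matching longer names such as "trim_fastq_extra".
        pattern = re.compile(rf"\b{re.escape(stem)}\b")
        owners[module] = sorted(
            path for path, text in workflow_files.items() if pattern.search(text)
        )
    return owners


# Hypothetical workflow file contents; the shell commands are invented.
SAMPLE_WORKFLOW_FILES = {
    "workflow/rules/preprocess.smk": (
        'shell: "python -m capstone.trim_fastq {input} {output}"'
    ),
    "workflow/modules/screen_module/Snakefile": (
        'shell: "python -m capstone.screen_panel {input} {output}"'
    ),
    "workflow/rules/publish.smk": "rule publish_bundle:\n    input: 'results/summary.tsv'\n",
}

if __name__ == "__main__":
    modules = ["trim_fastq.py", "kmer_profile.py", "screen_panel.py"]
    for module, paths in find_owners(modules, SAMPLE_WORKFLOW_FILES).items():
        print(f"{module}: {paths or 'no visible owner'}")
```

A module with no visible owner is not necessarily dead code, but it is a prompt to ask which orchestration surface is supposed to call it.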

Step 5: confirm path contracts

Architecture review is incomplete until the path contracts are visible too.

The capstone keeps those in:

- workflow/CONTRACT.md
- workflow/contracts/FILE_API.md
- capstone/docs/file-api.md

Those docs answer questions the directory tree alone cannot:

- which workflow paths are stable
- which published paths are stable
- what kinds of changes count as contract changes

This is where architecture becomes reviewable rather than intuitive.
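A first, purely mechanical check is that the contract documents exist where the review expects them. The sketch below assumes the three paths listed above are resolved against a root directory the caller chooses; `missing_contract_docs` is a hypothetical helper, and it says nothing about whether the documents' contents are accurate.

```python
from pathlib import Path

# Contract documents named on this page.
# Assumption: all three are resolved relative to one caller-chosen root.
CONTRACT_DOCS = (
    "workflow/CONTRACT.md",
    "workflow/contracts/FILE_API.md",
    "capstone/docs/file-api.md",
)


def missing_contract_docs(repo_root, expected=CONTRACT_DOCS) -> list[str]:
    """Return the expected contract documents that are not present as files."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_contract_docs(".")
    if missing:
        print("Missing contract docs:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected contract docs are present.")
```

Presence is only the floor. The real review question is whether the stable paths those documents promise match what the publish rules actually produce.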

One review route

```mermaid
flowchart LR
  snakefile["Snakefile"] --> rules["workflow/rules/"]
  rules --> modules["workflow/modules/"]
  rules --> scripts["workflow/scripts/"]
  modules --> package["src/capstone/"]
  rules --> contracts["workflow/CONTRACT.md + FILE_API.md"]
  contracts --> publish["publish/v1/"]
```

The point of this route is to keep visible ownership ahead of implementation detail.

A useful contrast

Weak reading order:

1. open src/capstone/ first
2. browse helpers until the workflow shape slowly appears

Strong reading order:

1. start with Snakefile
2. inspect named rule families
3. inspect modules and scripts by role
4. inspect package code once orchestration is already legible
5. confirm path contracts in the contract docs

The second route produces architectural understanding faster and with less guesswork.
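The strong reading order can be expressed as a small sort key, which makes the route easy to apply to any file listing. This is a sketch only: `reading_rank` and `reading_order` are hypothetical helpers, the rank table is one interpretation of the route described on this page, and anything outside the known locations sorts last.

```python
# Reading-order ranks for the strong route; lower ranks are read first.
# Entries ending in "/" match by directory prefix; others match exactly.
READING_ORDER = (
    ("Snakefile", 0),
    ("workflow/rules/", 1),
    ("workflow/modules/", 2),
    ("workflow/scripts/", 2),
    ("src/capstone/", 3),
    ("workflow/CONTRACT.md", 4),
    ("workflow/contracts/", 4),
    ("capstone/docs/", 4),
)
UNRANKED = 5  # paths outside the known locations are read last


def reading_rank(path: str) -> int:
    """Return the reading-order rank for a repository-relative path."""
    for location, rank in READING_ORDER:
        if location.endswith("/"):
            if path.startswith(location):
                return rank
        elif path == location:
            return rank
    return UNRANKED


def reading_order(paths: list[str]) -> list[str]:
    """Sort paths by rank, then alphabetically within a rank."""
    return sorted(paths, key=lambda p: (reading_rank(p), p))


if __name__ == "__main__":
    listing = [
        "src/capstone/trim_fastq.py",
        "workflow/contracts/FILE_API.md",
        "workflow/rules/publish.smk",
        "Snakefile",
        "workflow/modules/qc_module/Snakefile",
        "workflow/scripts/provenance.py",
        "workflow/rules/common.smk",
    ]
    for path in reading_order(listing):
        print(path)
```

Note that only the top-level `Snakefile` gets rank 0. A module's own `Snakefile` under `workflow/modules/` is ranked with the modules, which matches the distinction drawn in Step 3.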

What this example teaches

If you can explain this example well, you understand the module:

- why entrypoint clarity matters
- why rule families and modules solve different architecture questions
- why helper code should not become the first repository story
- why file APIs and contract docs belong to architecture review