Rule Families, Includes, and Named Ownership

Page Maps

graph LR
  family["Reproducible Research"]
  program["Deep Dive Snakemake"]
  section["Scaling Workflows Interface Boundaries"]
  page["Rule Families, Includes, and Named Ownership"]
  capstone["Capstone evidence"]

  family --> program --> section --> page
  page -.applies in.-> capstone

flowchart LR
  orient["Orient on the page map"] --> read["Read the main claim and examples"]
  read --> inspect["Inspect the related code, proof, or capstone surface"]
  inspect --> verify["Run or review the verification path"]
  verify --> apply["Apply the idea back to the module and capstone"]

The first scaling decision is usually simpler than teams make it sound:

does this workflow need more boundaries, or does it only need clearer ownership?

Many repositories split too early and end up with more files but less clarity. This page is about the first healthy split: rule families with named ownership.

A larger file is not automatically a modularity problem

When one Snakefile grows, teams often reach for file splits because the file feels long.

Length can matter. It is not the real test.

The real questions are:

  • can a reviewer still explain which rules belong together?
  • can they name where one workflow concern begins and ends?
  • is the visible DAG easier to read after the split?

If the answer to any of these is no, the split is probably cosmetic rather than architectural.

include: is for coherent rule families

include: is a good tool when:

  • the repository still owns one visible workflow graph
  • a set of rules clearly belongs to one concern
  • the split improves human reading order

Typical examples:

  • discovery and preprocessing rules
  • summarization and reporting rules
  • publish or verification rules

That is why the capstone keeps rule families under workflow/rules/ while still using one top-level Snakefile as the visible orchestration surface.
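As a minimal sketch of that layout, a top-level Snakefile might look like the following. The four rule-file paths come from this page; the config path and the final target are hypothetical placeholders, not the capstone's actual contents.

```snakemake
# Snakefile (top level): the visible orchestration surface.

# Hypothetical config location, shown only for illustration.
configfile: "config/config.yaml"

# The default target states what the workflow ultimately delivers.
# The path below is an assumed example, not a capstone file.
rule all:
    input:
        "publish/summary/report.tsv"

# One include per rule family, listed in reading order.
# Paths are resolved relative to this Snakefile.
include: "workflow/rules/common.smk"
include: "workflow/rules/preprocess.smk"
include: "workflow/rules/summarize_report.smk"
include: "workflow/rules/publish.smk"
```

All included rules still join one DAG, so a reviewer can read this file top to bottom and know which concern each family owns before opening any of them.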

The ownership test

Before creating a new rule file, try to say its job in one sentence.

Good sentence:

workflow/rules/publish.smk owns promotion of reviewed internal results into the public publish boundary.

Weak sentence:

this file has the extra stuff that did not fit anywhere else.

If the sentence is weak, the split is weak too.
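One way to keep the sentence honest is to write it at the top of the rule file itself. The sketch below assumes a hypothetical rule name and hypothetical paths; only the file name and the ownership sentence come from this page.

```snakemake
# workflow/rules/publish.smk
# Owns: promotion of reviewed internal results into the public publish boundary.

# Hypothetical rule: copies one reviewed internal result into the public area.
rule publish_summary:
    input:
        "results/summary/report.tsv"    # assumed internal, reviewed output
    output:
        "publish/summary/report.tsv"    # assumed consumer-facing location
    shell:
        "cp {input} {output}"
```

If a rule added to this file later cannot be justified by the header sentence, that is a sign it belongs to a different family.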

One healthy split

flowchart TD
  snakefile["Snakefile"] --> common["workflow/rules/common.smk"]
  snakefile --> preprocess["workflow/rules/preprocess.smk"]
  snakefile --> summarize["workflow/rules/summarize_report.smk"]
  snakefile --> publish["workflow/rules/publish.smk"]

This split is healthy not because it uses more files, but because each file has named ownership and the top-level workflow story remains visible.

What include: must not hide

An include-based split becomes harmful when it hides things such as:

  • cross-cutting defaults that nobody can locate quickly
  • wildcard assumptions that only exist in helper files
  • path conventions that the top-level workflow never names
  • a consumer-facing contract that now depends on private internal trivia

The rule is simple:

if a reviewer has to open random helper files before they can explain the workflow, the split is already weakening the repository.
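A sketch of what staying visible can look like: shared defaults, path conventions, and wildcard assumptions are named in the top-level Snakefile before any include. The variable names, the config path, and the `sample` constraint are illustrative assumptions, not taken from the capstone.

```snakemake
# Snakefile (top level): shared assumptions are named before any include.

# Cross-cutting defaults live in one locatable place (hypothetical path).
configfile: "config/config.yaml"

# Path conventions are named here rather than invented inside a helper file.
RESULTS_DIR = "results"    # assumed internal, reviewable outputs
PUBLISH_DIR = "publish"    # assumed consumer-facing boundary

# Wildcard assumptions are declared once and apply to every rule family.
wildcard_constraints:
    sample=r"[A-Za-z0-9_]+"

# Included files share this namespace, so they can use the names above.
include: "workflow/rules/preprocess.smk"
include: "workflow/rules/publish.smk"
```

Because included files execute in the same namespace as the Snakefile, rule families can rely on these names while the reviewer only needs the top-level file to learn what they are.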

A weak first split

Weak shape:

workflow/rules/
  helpers.smk
  more_helpers.smk
  misc.smk

This fails because:

  • ownership is unclear
  • review order is unclear
  • the names do not tell the reader which concern each file owns

It creates smaller files without creating better boundaries.

A stronger split

Stronger shape:

workflow/rules/
  common.smk
  preprocess.smk
  summarize_report.smk
  publish.smk

This works better because:

  • the names correspond to real workflow concerns
  • the reader can predict what each file should contain
  • the top-level Snakefile still tells the orchestration story

That is named ownership.

When to stop splitting

A repository can over-split just as easily as it can under-split.

Stop when:

  • each rule family has one clear concern
  • the top-level orchestration still reads as one visible graph
  • opening another file would add more indirection than clarity

The goal is not to maximize file count. The goal is to make the workflow explainable.

Common failure modes

| Failure mode | What it looks like | Better repair |
|---|---|---|
| split by length only | files are shorter but ownership is still vague | split by named workflow concern |
| helper files absorb real workflow meaning | reviewers cannot explain rules from the visible graph | keep the contract visible from Snakefile and rule-family files |
| file names are generic | reading order becomes guesswork | name files by the workflow concern they own |
| one concern is scattered across many files | changes require repository archaeology | regroup rules under one owning family |
| include: is used for reusable sub-workflows with stable interfaces | boundaries stay too soft | consider a real module boundary instead |
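For the last row, a real module boundary could be sketched with Snakemake's `module` statement, available in Snakemake 6.0 and later. The module name `qc`, its Snakefile path, and the `config["qc"]` key are hypothetical; this illustrates the shape of the repair, not the capstone's design.

```snakemake
# Hypothetical reusable sub-workflow kept behind its own Snakefile.
module qc:
    snakefile:
        "modules/qc/workflow/Snakefile"
    config:
        config["qc"]    # the sub-workflow sees only its own config section

# Import its rules under a prefix so their origin stays visible in the DAG.
use rule * from qc as qc_*
```

Unlike `include:`, the module receives an explicit config and its rules are renamed on import, which gives the sub-workflow a firmer interface than a shared namespace does.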

The explanation a reviewer trusts

Strong explanation:

the repository still owns one visible workflow graph, but workflow/rules/preprocess.smk groups discovery and per-sample processing, while workflow/rules/publish.smk owns the public promotion boundary, so each file has named workflow ownership.

Weak explanation:

we split the Snakefile because it was getting long.

The first explanation gives a boundary. The second gives a symptom.

End-of-page checkpoint

Before leaving this page, you should be able to:

  • explain when include: is the right first scaling tool
  • describe one good rule-family boundary in one sentence
  • name one sign that a split created indirection instead of clarity
  • explain why file count is a weak proxy for modularity