Rule Logic, Scripts, and Software Ownership

Page Maps

graph LR
  family["Reproducible Research"]
  program["Deep Dive Snakemake"]
  section["Software Boundaries Reproducible Rules"]
  page["Rule Logic, Scripts, and Software Ownership"]
  capstone["Capstone evidence"]

  family --> program --> section --> page
  page -.applies in.-> capstone

flowchart LR
  orient["Orient on the page map"] --> read["Read the main claim and examples"]
  read --> inspect["Inspect the related code, proof, or capstone surface"]
  inspect --> verify["Run or review the verification path"]
  verify --> apply["Apply the idea back to the module and capstone"]

The first software-boundary decision is usually the most common one:

Should this logic stay in the rule, or should it move into software?

If that question is answered badly, the repository often gets worse in two ways at once:

rules become hard to review
helper code becomes hard to trust

This page is about drawing that line well.

Rules should own orchestration, not hidden programs

A Snakemake rule is strongest when it makes these things explicit:

declared inputs and outputs
parameters that affect file meaning
resource claims
the execution boundary for one step

That is orchestration.

Once a rule starts to contain large amounts of parsing, data transformation, or report generation logic, it is on its way to becoming the least testable program in the repository.

That is the warning sign.

What belongs comfortably in a rule

Rules are a good home for:

short shell commands whose meaning is still obvious
small glue logic that keeps the file contract readable
explicit parameter passing into an external command or script

The important part is that a reviewer can still explain:

what files this rule reads
what files it writes
what change would make it rerun

If that explanation becomes cloudy, the boundary is already weakening.

What usually belongs in a script or package

Move logic out of the rule when it becomes:

non-trivial data transformation
reusable domain logic
logic that deserves direct tests outside Snakemake
code whose readability would improve if treated like a normal program

That is why the capstone keeps reusable processing code under src/capstone/ and leaves workflow-adjacent metadata generation in workflow/scripts/.

The rule should still own the file contract even when the implementation moves.
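
As a minimal sketch of what "direct tests outside Snakemake" can look like, assume a hypothetical function `summarize_records` that lives somewhere under src/capstone/. The function name and the `status` field are illustrative and are not taken from the capstone. Because the function takes plain Python data rather than file paths, it can be checked without a workflow run:

```python
from collections import Counter


def summarize_records(records):
    """Count records per status (hypothetical field name, for illustration only)."""
    counts = Counter(r.get("status", "unknown") for r in records)
    return {"n_records": len(records), "by_status": dict(sorted(counts.items()))}


def test_summarize_records():
    # A plain function call: no Snakefile, no DAG, no files on disk.
    records = [{"status": "ok"}, {"status": "ok"}, {"status": "failed"}, {}]
    summary = summarize_records(records)
    assert summary["n_records"] == 4
    assert summary["by_status"] == {"failed": 1, "ok": 2, "unknown": 1}


test_summarize_records()
```

The rule that calls this code would still declare the input and output files; the test only covers the transformation.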

One useful split

flowchart LR
  rule["Snakemake rule"] --> contract["declared files and params"]
  contract --> script["workflow/scripts or src package"]
  script --> output["produced artifact"]

This picture matters because the rule is not replaced. It remains the place where the file contract stays visible.

The script or package owns the implementation. It does not define workflow meaning on its own.

A weak first draft

Weak shape:

rule summarize:
    input:
        "results/raw.json"
    output:
        "publish/v1/summary.json"
    run:
        import json
        with open(input[0]) as fh:
            data = json.load(fh)
        # many lines of transformation and formatting logic here
        ...

This may work, but it creates two problems:

the rule body becomes the hidden implementation layer
the logic is harder to test outside a workflow run

The repository now has software; it is just pretending it does not.

A stronger rewrite

Stronger shape:

rule summarize:
    input:
        "results/raw.json"
    output:
        "publish/v1/summary.json"
    script:
        "workflow/scripts/summarize.py"

Or, when the logic is reusable:

rule summarize:
    input:
        "results/raw.json"
    output:
        "publish/v1/summary.json"
    shell:
        "python -m capstone.summarize --input {input} --output {output}"

This improves the repository only if the rule still tells the file story clearly and the software boundary is explicit.
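
For the `python -m capstone.summarize` form, the module needs a command-line entry point whose flags match the rule's shell line. The sketch below assumes a hypothetical `src/capstone/summarize.py`. Only the `--input` and `--output` flag names come from the rule above; the `summarize` body is a placeholder, not the capstone's logic:

```python
import argparse
import json
from pathlib import Path


def summarize(data):
    """Placeholder transformation (assumption, not the capstone's logic)."""
    return {"n_records": len(data)}


def build_parser():
    # The flag names mirror the shell line in the rule, so the rule and the
    # program describe the same file contract.
    parser = argparse.ArgumentParser(prog="capstone.summarize")
    parser.add_argument("--input", required=True, type=Path)
    parser.add_argument("--output", required=True, type=Path)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    data = json.loads(args.input.read_text())
    args.output.parent.mkdir(parents=True, exist_ok=True)
    args.output.write_text(json.dumps(summarize(data), indent=2))
    return 0


# When saved as a module, a closing guard of the form
#     if __name__ == "__main__": raise SystemExit(main())
# is what lets `python -m capstone.summarize` call main().
```

Accepting `argv` as a parameter means a test can call `main(["--input", ..., "--output", ...])` directly, without spawning a subprocess or running Snakemake.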

`script:` and package code solve different problems

`script:` is a good fit when:

the code is workflow-adjacent
the logic is meaningful but still closely tied to one orchestration step
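
Here is a hedged sketch of what a workflow-adjacent script such as `workflow/scripts/summarize.py` could contain. The file's real contents are not shown on this page, so the `summarize` function and its output field are assumptions. What is standard Snakemake behaviour is that a Python file run through `script:` receives an object named `snakemake` whose `input` and `output` attributes mirror the rule's declarations:

```python
import json
from pathlib import Path


def summarize(data):
    """Hypothetical transformation; the real logic would replace this body."""
    return {"n_records": len(data)}


def main(input_path, output_path):
    # Read only the file the rule declared as input.
    data = json.loads(Path(input_path).read_text())
    # Write only the file the rule declared as output.
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(summarize(data), indent=2))


# Snakemake injects a global named `snakemake` when it runs this file via
# `script:`. The guard lets the same file be imported by tests, where that
# global does not exist.
if "snakemake" in globals():
    main(snakemake.input[0], snakemake.output[0])
```

The script takes its paths from the rule rather than naming them itself, so the rule stays the single place where the file contract is written down.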

Package code under src/ is a better fit when:

the logic is reusable across steps
it deserves direct tests and imports
it is real implementation code, not only glue

The difference is not prestige. It is ownership.

Common failure modes

| Failure mode | What it looks like | Better repair |
| --- | --- | --- |
| giant `run:` block | rule files become the least reviewed programs in the repo | move non-trivial logic into script or package code |
| script hides undeclared file reads | the rule contract looks smaller than the real behavior | keep all meaningful file dependencies visible in the rule |
| package code changes workflow meaning silently | helpers become a second hidden workflow | keep the rule as the visible contract boundary |
| shell fragments turn into mini applications | debugging and testing stay trapped inside Snakemake runs | promote real program logic into software surfaces |
| everything is moved to helper code reflexively | rules stop explaining the workflow | leave simple orchestration in the rule where it belongs |
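
The "undeclared file reads" failure is easy to miss in review, so here is a small sketch of the repair. The function name `annotate`, the lookup file, and the `id` and `label` fields are all hypothetical. The point is only that every path the code opens arrives as an argument, so the rule has to name that file in order to call the function at all:

```python
import json
from pathlib import Path

# Weak shape (shown as a comment): a path the rule never declared.
#     lookup = json.loads(Path("config/lookup.json").read_text())


def annotate(raw_path, lookup_path, output_path):
    """Attach a label to each record using a lookup table passed in explicitly."""
    records = json.loads(Path(raw_path).read_text())
    lookup = json.loads(Path(lookup_path).read_text())
    annotated = [
        {**record, "label": lookup.get(record["id"], "unlabelled")}
        for record in records
    ]
    Path(output_path).write_text(json.dumps(annotated, indent=2))
    return annotated
```

With this shape, the lookup file appears under the rule's `input:`, so a reviewer reading the rule sees the full set of files the step depends on.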

The explanation a reviewer trusts

Strong explanation:

this rule still owns the file contract, but the transformation logic moved into workflow/scripts/ because it became real program logic; the repository now keeps the workflow story visible in the rule and the implementation testable in code.

Weak explanation:

we moved it to a script because the rule looked messy.

The first explanation gives an ownership reason. The second gives only a cleanliness reaction.

End-of-page checkpoint

Before leaving this page, you should be able to:

explain one case where logic should stay in a rule
explain one case where logic should move into a script
explain one case where logic belongs in package code under src/
describe why moving code out of a rule does not remove the rule’s ownership of the file contract