Worked Example: Moving a Workflow Across the Software Boundary Safely
This worked example ties the module together.
The goal is not to show every possible Snakemake feature. The goal is to show how a workflow step becomes more trustworthy when software ownership, runtime, and provenance are made explicit together.
Starting situation
Imagine a publication step that needs to write a provenance artifact for released results.
A rushed first draft might keep everything inline:
rule build_provenance:
    input:
        "publish/v1/results.tsv"
    output:
        "publish/v1/provenance.json"
    run:
        import datetime
        import json
        import platform
        import subprocess
        import sys
        ...
This works until the step becomes real software.
Now the repository has several problems:
- the rule hides implementation detail in a large run: block
- runtime assumptions are implicit
- the publication artifact has no clearly owned software boundary
- reviewers cannot easily separate file contract from implementation
Better target
The capstone points toward a stronger design:
- the rule owns the file contract
- workflow/scripts/provenance.py owns step-local implementation
- workflow/envs/python.yaml declares the step runtime
- environment.yaml serves repository-level authoring and workflow execution setup
- Dockerfile offers a stronger machine-portability surface
That is already a meaningful boundary stack.
Step 1: keep the rule readable
Conceptually, the rule should look more like this:
rule build_provenance:
    input:
        results="publish/v1/results.tsv"
    output:
        json="publish/v1/provenance.json"
    conda:
        "workflow/envs/python.yaml"
    script:
        "workflow/scripts/provenance.py"
This rule tells a reviewer four useful things immediately:
- which artifact is being produced
- which input gives it context
- where the implementation lives
- which runtime boundary the step relies on
That is a much stronger contract than a long run: block.
Step 2: keep step-local code in the right place
The capstone's workflow/scripts/provenance.py is a good example of code that belongs in
workflow/scripts/ rather than in a reusable package:
- it depends on the injected snakemake object
- it is tightly coupled to one workflow step
- its job is tied to publication metadata rather than broad domain reuse
If multiple steps later need shared formatting or metadata helpers, that reusable portion
can graduate into src/capstone/.
That is the important judgment:
- step-local behavior stays near the workflow
- reusable behavior moves into package code
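The split between step-local and reusable behavior can be sketched roughly as follows. This is not the capstone's actual script: the function names build_record and run_step are hypothetical, and the record contents are deliberately minimal. The only Snakemake behavior assumed is that a script run through the script: directive receives a global named snakemake exposing input, output, and config.

```python
"""Hypothetical sketch of a step-local script such as workflow/scripts/provenance.py."""
import json
from pathlib import Path


def build_record(results_path, config):
    """Reusable part: a pure function with no Snakemake dependency.

    If several steps needed this logic, it is the piece that could
    move into package code.
    """
    return {"results": str(results_path), "config": dict(config)}


def run_step(smk):
    """Step-local part: reads paths from the Snakemake-injected object."""
    record = build_record(smk.input.results, smk.config)
    out = Path(smk.output.json)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(record, indent=2, sort_keys=True))
    return record


# Snakemake defines a global named `snakemake` when it executes a script
# through the `script:` directive; outside Snakemake the name is absent,
# so importing this file elsewhere does nothing.
if "snakemake" in globals():
    run_step(snakemake)  # noqa: F821
```

Keeping the adapter thin means the only code tied to the workflow is the part that touches the injected object. Everything else can be tested without Snakemake.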
Step 3: declare the runtime where it matters
workflow/envs/python.yaml currently declares a small Python runtime.
That file is not useful because it is long. It is useful because it says the provenance step should not depend on ambient host Python.
At the repository level, environment.yaml serves a different purpose:
- it gives contributors and workflow runners a predictable baseline
- it pins the Snakemake family used for the project
Dockerfile addresses a broader boundary still: it packages that environment into a more
portable machine-level contract.
Step 4: make provenance part of the output story
The current provenance script records several important pieces of software evidence:
- timestamp
- Python version and executable
- platform
- Snakemake version
- git commit
- workflow config
That design matters because publication outputs should be defensible after the run is finished.
When a reviewer asks, "what software context produced this artifact?", the repository can answer with a file instead of a memory.
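A minimal sketch of how the fields listed above could be gathered with the Python standard library. The function names and the dictionary layout are assumptions made for illustration and are not taken from the capstone script. The git commit and Snakemake version fall back to None when they cannot be determined.

```python
import datetime
import platform
import subprocess
import sys
from importlib import metadata


def _git_commit():
    """Return the current commit hash, or None outside a usable git checkout."""
    try:
        done = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return done.stdout.strip()


def _snakemake_version():
    """Return the installed Snakemake version, or None if it is not installed."""
    try:
        return metadata.version("snakemake")
    except metadata.PackageNotFoundError:
        return None


def collect_provenance(config):
    """Assemble the software context for one run as a JSON-serializable dict."""
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "python": {
            "version": platform.python_version(),
            "executable": sys.executable,
        },
        "platform": platform.platform(),
        "snakemake_version": _snakemake_version(),
        "git_commit": _git_commit(),
        "config": dict(config),
    }
```

Recording None rather than raising keeps the publication step from failing on a missing tool, while still making the absence visible in the artifact. Whether that trade-off is acceptable depends on how strict the release process needs to be.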
Step 5: think through change scenarios
Now imagine three different edits:
- workflow/scripts/provenance.py changes how metadata is serialized.
- workflow/envs/python.yaml adds another runtime dependency.
- Dockerfile changes because the execution surface is moving to a stricter container path.
These edits do not all mean the same thing, but they all affect software trust.
The repository should now be able to reason clearly:
- the rule contract is still visible
- the step implementation changed in a named place
- the runtime boundary changed in a named place
- provenance artifacts can help distinguish earlier outputs from rebuilt outputs
That is what a healthy software boundary feels like in practice.
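One way provenance artifacts can separate earlier outputs from rebuilt ones is a field-by-field comparison of two records. The helper below is a hypothetical sketch: the name diff_provenance, the choice to ignore the timestamp by default, and the sample values are all assumptions for illustration.

```python
def diff_provenance(old, new, ignore=("timestamp",)):
    """Return {field: (old_value, new_value)} for top-level fields that differ.

    Fields named in `ignore` are skipped; the timestamp is ignored by
    default because it changes on every run.
    """
    changed = {}
    for key in sorted(set(old) | set(new)):
        if key in ignore:
            continue
        if old.get(key) != new.get(key):
            changed[key] = (old.get(key), new.get(key))
    return changed


# Hypothetical records from an earlier run and a rebuild.
earlier = {"timestamp": "t1", "git_commit": "aaa111", "platform": "Linux"}
rebuilt = {"timestamp": "t2", "git_commit": "bbb222", "platform": "Linux"}
print(diff_provenance(earlier, rebuilt))
# {'git_commit': ('aaa111', 'bbb222')}
```

An empty result suggests the recorded software context was the same for both runs. A non-empty one names which boundary moved.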
What this example teaches
flowchart LR
rule["Visible rule contract"] --> script["Step-local implementation"]
rule --> env["Rule runtime contract"]
env --> container["Optional machine boundary"]
script --> prov["Published provenance artifact"]
container --> prov
The point is not complexity for its own sake.
The point is that each layer now has one clear job:
- rule: file meaning
- script: implementation
- environment: step runtime
- container: machine portability when needed
- provenance artifact: evidence after execution
Review summary
If you can explain this example well, you understand the module:
- why the rule remains visible even when implementation moves out
- why runtime declarations belong next to execution boundaries
- why not all code belongs in the same directory
- why provenance is part of publication trust, not optional decoration