Worked Example: Moving a Workflow Across the Software Boundary Safely

This worked example ties the module together.

The goal is not to show every possible Snakemake feature. The goal is to show how a workflow step becomes more trustworthy when software ownership, runtime, and provenance are made explicit together.

Starting situation

Imagine a publication step that needs to write a provenance artifact for released results.

A rushed first draft might keep everything inline:

rule build_provenance:
    input:
        "publish/v1/results.tsv"
    output:
        "publish/v1/provenance.json"
    run:
        import datetime
        import json
        import platform
        import subprocess
        import sys
        ...

This works until the step becomes real software.

Now the repository has several problems:

the rule hides implementation detail in a large run: block
runtime assumptions are implicit
the publication artifact has no clearly owned software boundary
reviewers cannot easily separate file contract from implementation

Better target

The capstone points toward a stronger design:

the rule owns the file contract
workflow/scripts/provenance.py owns step-local implementation
workflow/envs/python.yaml declares the step runtime
environment.yaml serves repository-level authoring and workflow execution setup
Dockerfile offers a stronger machine-portability surface

That is already a meaningful boundary stack.

Step 1: keep the rule readable

Conceptually, the rule should look more like this:

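rule build_provenance:
    input:
        results="publish/v1/results.tsv"
    output:
        json="publish/v1/provenance.json"
    # Snakemake resolves conda: and script: paths relative to the Snakefile
    # that defines the rule, so adjust these if that file lives in workflow/.
    conda:
        "workflow/envs/python.yaml"
    script:
        "workflow/scripts/provenance.py"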

This rule tells a reviewer four useful things immediately:

which artifact is being produced
which input gives it context
where the implementation lives
which runtime boundary the step relies on

That is a much stronger contract than a long run: block.

Step 2: keep step-local code in the right place

The capstone's workflow/scripts/provenance.py is a good example of code that belongs in workflow/scripts/ rather than in a reusable package:

it depends on the injected snakemake object
it is tightly coupled to one workflow step
its job is tied to publication metadata rather than broad domain reuse

If multiple steps later need shared formatting or metadata helpers, that reusable portion can graduate into src/capstone/.

That is the important judgment:

step-local behavior stays near the workflow
reusable behavior moves into package code
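
As a rough illustration of what "step-local" means here, the following is a minimal sketch in the style of workflow/scripts/provenance.py, not the capstone's actual script. The function name write_provenance, the config key "release", and the SimpleNamespace stand-in are assumptions made for illustration; only the injected snakemake object and the named input and output come from the rule above.

```python
"""Sketch of a step-local script that leans on the injected snakemake object."""
import json
import tempfile
from pathlib import Path
from types import SimpleNamespace


def write_provenance(smk):
    """Write a small JSON record using only what the rule declared."""
    record = {
        "results": str(smk.input.results),  # named input from the rule
        "config": dict(smk.config),         # workflow config as seen by the step
    }
    out_path = Path(smk.output.json)        # named output from the rule
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return out_path


# Snakemake injects a global named `snakemake` when it runs a script: file.
# Outside Snakemake that name does not exist, so this hypothetical stand-in
# (illustration only) lets the sketch run on its own.
if "snakemake" not in globals():
    tmp_dir = Path(tempfile.mkdtemp())
    snakemake = SimpleNamespace(
        input=SimpleNamespace(results="publish/v1/results.tsv"),
        output=SimpleNamespace(json=str(tmp_dir / "provenance.json")),
        config={"release": "v1"},  # assumed config key, not from the capstone
    )

written = write_provenance(snakemake)
```

The dependency on the snakemake global is exactly what makes this code a poor fit for a reusable package: it cannot be imported and called without that object or a stand-in for it.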

Step 3: declare the runtime where it matters

workflow/envs/python.yaml currently declares a small Python runtime:

name: capstone-python
channels:
  - conda-forge
dependencies:
  - python=3.11

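The value of that file is not its length. It is useful because it states that the provenance step should not depend on the ambient host Python. Note that Snakemake only builds and activates a rule's conda environment when conda deployment is requested, for example with --use-conda (or --software-deployment-method conda in Snakemake 8 and later); without it, the step still runs on whatever Python is ambient.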

At the repository level, environment.yaml serves a different purpose:

it gives contributors and workflow runners a predictable baseline
it pins the Snakemake version family used for the project

Dockerfile addresses a broader boundary still: it packages that environment into a more portable machine-level contract.
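
To make the "ambient host Python" concern concrete, here is a small sketch that checks whether an interpreter version would satisfy the python=3.11 pin from workflow/envs/python.yaml. The helper name satisfies_pin is hypothetical; conda does this resolution itself, so the function only illustrates what the pin means and is not something the capstone is known to contain.

```python
import sys


def satisfies_pin(pin, version_info):
    """Return True if version_info matches a conda-style pin such as 'python=3.11'."""
    name, _, wanted = pin.partition("=")
    if name.strip() != "python" or not wanted:
        raise ValueError(f"not a python pin: {pin!r}")
    wanted_parts = tuple(int(part) for part in wanted.split("."))
    # A single '=' in a conda spec is a prefix match: 'python=3.11' accepts
    # any 3.11.x, so only the pinned leading components are compared.
    return tuple(version_info[: len(wanted_parts)]) == wanted_parts


# Check the interpreter that is running this sketch against the declared pin.
print(satisfies_pin("python=3.11", sys.version_info))
```

If this prints False on a contributor's machine, that machine's ambient Python is not the runtime the rule declares, which is the gap the per-rule environment is meant to close.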

Step 4: make provenance part of the output story

The current provenance script records several important pieces of software evidence:

timestamp
Python version and executable
platform
Snakemake version
git commit
workflow config

That design matters because publication outputs should be defensible after the run is finished.

When a reviewer asks, "what software context produced this artifact?", the repository can answer with a file instead of a memory.
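
The evidence listed above can be gathered with the standard library alone. The following is a minimal sketch under stated assumptions, not the capstone's actual script: the function names, the JSON field names, and the example config are all hypothetical, and the fallbacks to None are a design choice made here so the step degrades visibly rather than failing when git or Snakemake metadata is unavailable.

```python
import datetime
import json
import platform
import subprocess
import sys
from importlib import metadata


def git_commit():
    """Return the current commit hash, or None if git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def snakemake_version():
    """Return the installed Snakemake version, or None if it is not installed."""
    try:
        return metadata.version("snakemake")
    except metadata.PackageNotFoundError:
        return None


def collect_provenance(config):
    """Assemble the software evidence for one run as a JSON-serializable dict."""
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "python": {
            "version": platform.python_version(),
            "executable": sys.executable,
        },
        "platform": platform.platform(),
        "snakemake_version": snakemake_version(),
        "git_commit": git_commit(),
        "config": config,
    }


# Example config is made up for illustration.
record = collect_provenance({"release": "v1"})
print(json.dumps(record, indent=2, sort_keys=True))
```

Recording None explicitly, rather than omitting the field, keeps the artifact honest: a reviewer can see that the commit was not captured instead of wondering whether the field was ever collected.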

Step 5: think through change scenarios

Now imagine three different edits:

workflow/scripts/provenance.py changes how metadata is serialized.
workflow/envs/python.yaml adds another runtime dependency.
Dockerfile changes because the execution surface is moving to a stricter container path.

These edits do not all mean the same thing, but they all affect software trust.

The repository should now be able to reason clearly:

the rule contract is still visible
the step implementation changed in a named place
the runtime boundary changed in a named place
provenance artifacts can help distinguish earlier outputs from rebuilt outputs

That is what a healthy software boundary feels like in practice.
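
One way provenance artifacts can distinguish earlier outputs from rebuilt ones is a field-by-field comparison of two records. This is a hedged sketch: the helper changed_fields, the field names, and both example records are invented for illustration and do not come from the capstone.

```python
def changed_fields(old, new, ignore=("timestamp",)):
    """Return {field: (old_value, new_value)} for top-level fields that differ."""
    keys = (set(old) | set(new)) - set(ignore)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(keys)
        if old.get(key) != new.get(key)
    }


# Made-up records standing in for two provenance.json files.
earlier = {
    "timestamp": "2024-01-01T00:00:00+00:00",
    "git_commit": "aaa111",
    "python": "3.11.8",
    "config": {"release": "v1"},
}
rebuilt = {
    "timestamp": "2024-02-01T00:00:00+00:00",
    "git_commit": "bbb222",
    "python": "3.11.8",
    "config": {"release": "v1"},
}

print(changed_fields(earlier, rebuilt))
# prints {'git_commit': ('aaa111', 'bbb222')}
```

The timestamp is ignored by default because it differs on every run and would otherwise hide the changes that matter for trust, such as a new commit or a different Python version.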

What this example teaches

flowchart LR
  rule["Visible rule contract"] --> script["Step-local implementation"]
  rule --> env["Rule runtime contract"]
  env --> container["Optional machine boundary"]
  script --> prov["Published provenance artifact"]
  container --> prov

The point is not complexity for its own sake.

The point is that each layer now has one clear job:

rule: file meaning
script: implementation
environment: step runtime
container: machine portability when needed
provenance artifact: evidence after execution

Review summary

If you can explain this example well, you understand the module:

why the rule remains visible even when implementation moves out
why runtime declarations belong next to execution boundaries
why not all code belongs in the same directory
why provenance is part of publication trust, not optional decoration