Worked Example: Hardening a Workflow for Production Use¶

Page Maps¶

graph LR
  family["Reproducible Research"]
  program["Deep Dive Snakemake"]
  section["Production Operations Policy Boundaries"]
  page["Worked Example: Hardening a Workflow for Production Use"]
  capstone["Capstone evidence"]

  family --> program --> section --> page
  page -.applies in.-> capstone

flowchart LR
  orient["Orient on the page map"] --> read["Read the main claim and examples"]
  read --> inspect["Inspect the related code, proof, or capstone surface"]
  inspect --> verify["Run or review the verification path"]
  verify --> apply["Apply the idea back to the module and capstone"]

This file ties the whole module together around one realistic problem:

a workflow runs locally, but the team now wants to run it in CI and review it like a production repository.

The question is not "how do we add more flags?" The question is how to make the workflow operable without blurring its meaning.

The starting situation¶

Assume the repository already has:

truthful rule contracts
explicit dynamic discovery from Module 02
a stable publish boundary

What it does not yet have is a clean production story.

The maintainers currently do this:

one person runs snakemake -p --rerun-incomplete
another person adds extra flags in CI
partial failures are handled ad hoc
no one can explain which settings are policy and which would change workflow meaning

That is a good Module 03 starting point because it is common and fixable.

Step 1: move stable run policy into profiles¶

The first repair is not a scheduler change. It is naming the operating contexts.

So introduce:

profiles/
  local/config.yaml
  ci/config.yaml
  slurm/config.yaml

Now the repository can answer:

what local runs normally do
what CI runs normally do
what scheduler-oriented policy looks like

This is Core 1 becoming concrete. The team stops relying on private shell habits.

Step 2: give failures a small honest policy¶

The team then reviews how failed runs behave.

Weak current behavior:

retry until it passes
inspect the terminal if something looks wrong
trust whatever files are left behind

The repair is to state a failure policy:

transient failure may be retried
incomplete outputs are rerun deliberately
logs remain available for review
semantic errors fail fast instead of hiding under retries

This is Core 2: recovery becomes a contract, not a mood.

Step 3: make storage assumptions explicit¶

Next, the team notices that local runs and CI runs do not feel the same:

CI sometimes needs more patience for output visibility
a scheduler-oriented run may use different scratch or staging assumptions

The weak reaction would be to patch the workflow code for each context.

The stronger reaction is:

keep the final output and publish paths semantically stable
place latency and staging assumptions in operating policy
review scratch behavior as context, not as workflow meaning

This is Core 3: data locality changes how the workflow runs, not what the workflow means.

Step 4: connect operation to proof routes¶

At this point the repository has profiles and a failure story, but the team still needs a review route.

So define a proof ladder:

dry-run for planning
make profile-audit for policy comparison
make verify for executed workflow confidence
make confirm for the strongest clean-room repository proof

This is Core 4: production trust grows by named routes, not by one giant ritual command.

Step 5: make policy review survivable for the next maintainer¶

The last repair is governance.

The repository should make it obvious that:

profile changes are reviewed first as policy diffs
semantic workflow changes belong elsewhere
proof-route changes should explain which claim got stronger or weaker

This is Core 5: the repository teaches its own review order instead of leaving it to memory.

The repaired production story¶

By the end of these repairs, the workflow can be described cleanly:

workflow meaning stays in rule code, config, and publish contracts
profiles encode operating context only
failures leave rerunnable or reviewable state instead of ambiguous poison
storage and latency assumptions are explicit
proof routes scale from dry-run to clean-room confirmation

That is what it means to harden a workflow for production use without turning it into a different workflow.

Review questions for the repaired design¶

When you inspect a repository shaped like this, ask:

which profile settings are context only
how does the repository distinguish retryable from fail-fast conditions
where are staging and latency assumptions recorded
which proof route answers the current operational question
how would the next maintainer review a profile diff without guesswork

If those answers are visible, the module’s production story has landed.