Failure Policy, Retries, and Incomplete Outputs

Production operation is not only about getting a run to finish.

It is also about deciding what the workflow should do when something goes wrong:

retry
stop
keep evidence
rerun from incomplete state

Those decisions form a failure policy. If that policy is vague, the repository will recover inconsistently and leave ambiguous state behind.

The sentence to keep

When a job fails, ask:

should the next action be to retry, to rerun, or to refuse to trust the output?

That question is the center of this page.

Retries are for transient failure, not semantic uncertainty

A retry is justified when the same declared job is likely to succeed on a second attempt without changing its meaning.

Typical examples:

network or mirror hiccups during a download step
scheduler or infrastructure instability
temporary shared-filesystem timing issues

A retry is not a fix for:

wrong inputs
bad parameters
nondeterministic rule logic
hidden semantic state

If the job meaning is wrong, retrying only repeats the wrong job.
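
The distinction can be made concrete with a small retry wrapper. This is a minimal sketch, not Snakemake's own retry mechanism: the exception classes `TransientError` and `ContractError` and the function `run_with_retries` are hypothetical names introduced only for illustration. The point it shows is that only the transient class is ever attempted again.

```python
import time


class TransientError(Exception):
    """Hypothetical marker: the failure is expected to clear on its own."""


class ContractError(Exception):
    """Hypothetical marker: wrong inputs, bad parameters, or broken rule logic."""


def run_with_retries(job, max_attempts=3, delay_seconds=0.0):
    """Run job(); retry only TransientError, never anything else."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget spent: surface the original failure
            time.sleep(delay_seconds)
        # ContractError and any other exception propagate immediately,
        # so a wrong job is never repeated.
```

A job that raises `ContractError` runs exactly once, which keeps the original failure boundary visible instead of burying it under repeated attempts.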

Incomplete outputs are part of the contract

Production workflows must be honest about partially written outputs.

The repository should answer:

what happens if a job fails halfway through
whether any final-looking file remains behind
how the next run recognizes that state

Incomplete-output handling is therefore part of output trust, not merely cleanup: a half-written file that looks final can be consumed downstream as if it were complete.
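
One way a job can avoid leaving a final-looking file behind is to write to a temporary path and rename it into place only on success. The sketch below assumes a single output file on a local filesystem; `publish_atomically` and the `.partial` suffix are hypothetical names chosen for illustration, not part of Snakemake.

```python
import os
import tempfile


def publish_atomically(final_path, write_body):
    """Write via a temporary file in the same directory, then rename into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".partial")
    try:
        with os.fdopen(fd, "w") as handle:
            write_body(handle)
            handle.flush()
            os.fsync(handle.fileno())
        # The rename is atomic when both paths are on the same filesystem.
        os.replace(tmp_path, final_path)
    except BaseException:
        # On failure, remove the partial file so nothing final-looking remains.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern, the final path either holds a complete file or does not exist, so the next run never has to guess whether an existing output was fully written.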

Keep the recovery story small and explicit

A healthy recovery story usually looks like this:

a job fails
logs preserve evidence
partial or incomplete outputs are not trusted as final
the next run reruns the affected work deliberately

That is a much better teaching model than "rerun until it works."
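
The recovery story above can be sketched with an explicit marker file. Snakemake tracks incomplete outputs itself and can rebuild them with `--rerun-incomplete`; the code below is not how Snakemake implements that, only an illustration of the idea. The `.incomplete` suffix and the functions `marker_path`, `run_job`, and `needs_rerun` are assumptions made for this example.

```python
import os


def marker_path(output_path):
    """Hypothetical naming convention for the incomplete marker."""
    return output_path + ".incomplete"


def run_job(output_path, produce):
    """Mark the output incomplete, run the job, clear the marker only on success."""
    marker = marker_path(output_path)
    with open(marker, "w") as handle:
        handle.write("started\n")
    produce(output_path)  # may raise and leave a partial file behind
    os.remove(marker)     # reached only if produce() succeeded


def needs_rerun(output_path):
    """An output is trusted only if it exists and carries no incomplete marker."""
    return os.path.exists(marker_path(output_path)) or not os.path.exists(output_path)
```

If `produce` crashes halfway, the partial file may still exist, but the marker survives with it, so `needs_rerun` returns `True` and the next run rebuilds the output deliberately.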

A weak first response

Weak production habit:

enable retries everywhere
keep partial outputs casually
assume later jobs will sort it out

This feels resilient. It is often the opposite:

poison artifacts survive longer
later failures become harder to interpret
maintainers lose the original failure boundary

The repository becomes noisier instead of safer.

A stronger failure-policy split

Use three categories:

1. Retryable failure

The job can be attempted again because the contract is still the same and the failure is likely transient.

2. Rerunnable incomplete state

The job produced incomplete state that should be recognized and rebuilt, not trusted.

3. Fail-fast contract error

The job or configuration is wrong in a way that no retry should hide.

This split gives the workflow one of the most important operational qualities: honest recovery.
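
The three categories can be written down as a small decision function. This is a sketch under stated assumptions: the `Policy` enum, the `classify` function, and its three boolean inputs are hypothetical, and the precedence order (contract errors first, incomplete state second, transient failure last) is one reasonable choice rather than a fixed rule.

```python
from enum import Enum


class Policy(Enum):
    RETRY = "retry"
    RERUN_INCOMPLETE = "rerun incomplete work"
    FAIL_FAST = "fail fast"


def classify(transient, left_partial_output, contract_violation):
    """Map observed facts about a failure to one of the three categories."""
    if contract_violation:
        # Wrong inputs or configuration: no retry should hide this.
        return Policy.FAIL_FAST
    if left_partial_output:
        # Incomplete state must be rebuilt, not trusted.
        return Policy.RERUN_INCOMPLETE
    if transient:
        # Same contract, likely temporary cause: a retry is justified.
        return Policy.RETRY
    # Unknown cause: stop and inspect rather than retry blindly.
    return Policy.FAIL_FAST
```

Defaulting to `Policy.FAIL_FAST` when nothing is known about the failure keeps retries reserved for cases where someone has actually judged the cause to be transient.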

Logs belong in the same discussion

Failure policy without logs is weak because the repository cannot explain what happened.

Per-job logs are especially important in production because they answer:

which exact job failed
what command it ran
whether the error looks transient or semantic

Logs do not replace recovery policy. They make it reviewable.
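
A per-job log that answers those three questions can be very small. The sketch below uses only the Python standard library; `run_logged`, the log layout, and the one-file-per-job naming are assumptions for illustration and do not describe Snakemake's `log:` directive.

```python
import subprocess
from pathlib import Path


def run_logged(job_name, command, log_dir):
    """Run command and write one log file per job: command, exit code, output."""
    log_path = Path(log_dir) / f"{job_name}.log"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(command, capture_output=True, text=True)
    log_path.write_text(
        f"job: {job_name}\n"
        f"command: {command!r}\n"
        f"exit code: {result.returncode}\n"
        f"--- stdout ---\n{result.stdout}"
        f"--- stderr ---\n{result.stderr}"
    )
    return result.returncode, log_path
```

Because the log is written whether or not the command succeeds, a reviewer can open the file for the failing job and judge from the captured stderr whether the error looks transient or semantic.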

One simple decision table

Situation	Better response
filesystem lag delayed visible outputs even though the job completed	tune the latency wait (for example Snakemake's --latency-wait) or the rerun policy; leave the workflow's meaning unchanged
external infrastructure failed briefly	allow retry
wrong sample or bad config key caused the failure	fail fast and fix inputs
job left behind a partial final-looking output	treat it as incomplete and rerun deliberately
repeated retries still produce different outputs or errors	stop and inspect the rule contract

This is the kind of operational table a human team can actually use.

What "keep evidence" should mean

Keeping evidence does not mean keeping every broken file forever.

It means:

preserve logs
make incomplete state recognizable
avoid promoting partial outputs into trusted boundaries

That is how the next maintainer can tell whether the workflow should retry, rerun, or be repaired.

Common failure modes

Failure mode	What it looks like	Better repair
retries enabled indiscriminately	wrong jobs get repeated instead of fixed	reserve retries for transient failure classes
partial outputs remain trusted	downstream steps read poison artifacts	publish atomically and rerun incomplete work deliberately
logs are global or missing	nobody can locate the failing job clearly	keep per-job logs or equivalently narrow evidence
incomplete handling is inconsistent across contexts	local recovery differs from CI without explanation	make the policy explicit in profiles and repository docs
retry is used to mask nondeterminism	failures seem random and irreproducible	repair the underlying rule contract first

The explanation a reviewer trusts

Strong explanation:

this rule may be retried for transient infrastructure errors, but incomplete outputs are never trusted as final; failed jobs keep per-job logs, and the next run reruns incomplete work instead of silently continuing from poison artifacts.

Weak explanation:

if it flakes, we just retry and usually it settles down.

The strong version gives an operational contract. The weak version gives a coping habit.

End-of-page checkpoint

Before leaving this page, you should be able to:

name one failure that deserves retry and one that does not
explain why incomplete-output handling is part of output trust
describe how logs support recovery decisions
explain why retries cannot repair semantic workflow defects