Retries, Latency, and Failure Discipline¶

Page Maps¶

graph LR
  family["Reproducible Research"]
  program["Deep Dive Snakemake"]
  section["Operating Contexts Execution Policy"]
  page["Retries, Latency, and Failure Discipline"]
  capstone["Capstone evidence"]

  family --> program --> section --> page
  page -.applies in.-> capstone

flowchart LR
  orient["Orient on the page map"] --> read["Read the main claim and examples"]
  read --> inspect["Inspect the related code, proof, or capstone surface"]
  inspect --> verify["Run or review the verification path"]
  verify --> apply["Apply the idea back to the module and capstone"]

Not every failure means the workflow is wrong.

But not every repeated failure is transient either.

This page is about the difference.

Retries, latency waits, and incomplete-output handling are useful operating tools only when they support understanding instead of hiding defects.

Operational help should not become a correctness crutch¶

Settings such as:

retry counts
latency waits
rerun-incomplete behavior
failed-log visibility

can improve robustness and reviewability.

They become risky when they are used to avoid harder questions:

why is this rule unstable?
why are outputs incomplete?
why does visibility lag matter here?
why does one environment need more “help” than another?

If those questions disappear behind higher retry counts, the policy boundary is masking a real defect.

Failures need categories¶

A strong operating posture separates at least three kinds of failure:

likely transient operational failures
deterministic workflow or environment failures
incomplete or partially written outputs that need explicit cleanup behavior

Those categories do not deserve the same response.

Retries may help the first.

They usually do not fix the second.

The third requires honest output-discipline rather than wishful reruns.

The capstone profiles hint at this boundary¶

The capstone profiles include operational settings such as:

rerun-incomplete: true
latency-wait
visible shell commands and failed logs

These are useful precisely because they stay visible as policy.

They do not pretend to explain why a failure happened. They only define how the system should respond once the failure exists.

One useful contrast¶

flowchart TD
  failure["job failure"] --> classify["transient, deterministic, or incomplete output?"]
  classify --> retry["retry policy"]
  classify --> repair["rule or environment repair"]
  classify --> cleanup["cleanup and rerun discipline"]

This matters because one policy knob should not answer all three branches.

A weak failure posture¶

Weak shape:

retries increase whenever failures become annoying
latency waits are raised without examining storage or visibility assumptions
incomplete outputs remain after failure and are treated as normal clutter

This makes operating policy look helpful while understanding gets worse.

A stronger failure posture¶

Stronger shape:

use retries only when transient failure is plausible
use latency waits as explicit filesystem-policy decisions, not as superstition
keep failed logs visible enough that maintainers can inspect root causes
treat incomplete outputs as contract and cleanup questions, not as background noise

Now the workflow fails in ways that remain understandable.

A practical test¶

Ask these questions when a failure-related setting changes:

What class of failure is this setting meant to address?
Would this change help diagnose the issue or only postpone it?
Could the same setting be hiding a deterministic defect in the workflow or runtime?

If the second answer is “only postpone it,” the policy is probably doing the wrong job.

Common failure modes¶

Failure mode	What goes wrong	Better repair
retries increase with no failure classification	true defects stay alive longer	define which failures are plausibly transient first
latency waits become folklore	storage problems stay unnamed	treat latency as an explicit filesystem and visibility assumption
incomplete outputs are left ambiguous	reruns and review become murky	use explicit incomplete-output policy and cleanup expectations
failed logs are hidden to reduce noise	diagnosis gets slower and more anecdotal	keep failure evidence accessible during review
teams celebrate resilience without understanding failure cause	policy success masks semantic risk	pair policy changes with root-cause review

The explanation a reviewer trusts¶

Strong explanation:

this retry and latency policy exists because the operating context may introduce genuine scheduling or visibility delays, but we still inspect failed logs and treat incomplete outputs as evidence, not as harmless leftovers.

Weak explanation:

we raised retries so the workflow would stop failing.

The strong explanation names the failure model. The weak explanation only suppresses the symptom.

End-of-page checkpoint¶

Before leaving this page, you should be able to:

explain when retries are justified and when they are suspicious
explain why latency waits should be tied to real visibility assumptions
describe why incomplete outputs are a policy and contract concern
explain how failure evidence supports operating review