Failure Recovery

Failure recovery for a DAG run should preserve the failing evidence first, then restore a runnable state with the failure clearly attributed to a cause.

Visual Summary

flowchart TD
    fail[detect run failure] --> capture[capture status and artifacts]
    capture --> classify[classify root cause scope]
    classify --> remediate[apply targeted remediation]
    remediate --> replay[replay and diff verification]

Recovery Sequence

  1. record run status and retain failing artifact directory
  2. classify failure as graph, input, runtime, environment, or backend issue
  3. remediate one scope at a time and rerun
  4. replay the recovered run to verify deterministic behavior
  5. diff against last known good run before promotion
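Steps 1 and 2 can be sketched in Python as follows. This is a minimal illustration, not part of the bijux tooling: `FailureScope` and `preserve_evidence` are hypothetical names, and copying the run directory is only one possible way to "retain" it.

```python
import shutil
from enum import Enum
from pathlib import Path


class FailureScope(Enum):
    """The five scopes named in step 2 (hypothetical enum, for illustration)."""
    GRAPH = "graph"
    INPUT = "input"
    RUNTIME = "runtime"
    ENVIRONMENT = "environment"
    BACKEND = "backend"


def preserve_evidence(failed_run: Path, evidence_root: Path) -> Path:
    """Copy the failing run directory under evidence_root.

    Assumption: evidence is retained as a plain directory copy. The helper
    refuses to overwrite an existing copy, so earlier evidence is never
    replaced in place.
    """
    target = evidence_root / failed_run.name
    if target.exists():
        raise FileExistsError(f"evidence already retained at {target}")
    # copytree creates missing parent directories and fails if target exists.
    shutil.copytree(failed_run, target)
    return target
```

Calling `preserve_evidence(Path("runs/failed-20260406-01"), Path("evidence"))` before any remediation keeps an untouched copy to compare against later. Recording one `FailureScope` per remediation attempt makes it easier to follow step 3 and change only one scope at a time.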

Diagnostic Commands

bijux dag status ./runs/failed-20260406-01
bijux dag inspect ./runs/failed-20260406-01
bijux dag replay ./runs/failed-20260406-01 --out ./runs/replay-failed
bijux dag diff ./runs/good-20260405-77 ./runs/recovered-20260406-02 --mode semantic --explain
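For repeated use, the four commands above could be wrapped in a small Python script. The sketch below uses only the subcommands and flags shown above. The function names are hypothetical, and it assumes that `bijux` is on `PATH` and exits with a non-zero status when a command fails.

```python
import subprocess
from pathlib import Path
from typing import List


def diagnostic_commands(
    failed: Path, good: Path, recovered: Path, replay_out: Path
) -> List[List[str]]:
    """Build the argv lists for the diagnostic commands, in order."""
    return [
        ["bijux", "dag", "status", str(failed)],
        ["bijux", "dag", "inspect", str(failed)],
        ["bijux", "dag", "replay", str(failed), "--out", str(replay_out)],
        ["bijux", "dag", "diff", str(good), str(recovered),
         "--mode", "semantic", "--explain"],
    ]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in order and stop at the first failure.

    Assumption: a failed bijux command returns a non-zero exit code, which
    check=True turns into subprocess.CalledProcessError.
    """
    for argv in commands:
        subprocess.run(argv, check=True)
```

Keeping command construction separate from execution lets the argument lists be logged or reviewed before anything touches the run directories.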

Code Anchors

  • crates/bijux-dag-app/src/routes/status_routes.rs
  • crates/bijux-dag-app/src/routes/inspect_routes.rs
  • crates/bijux-dag-runtime/src/replay/

Recovery Boundaries

  • never replace failing evidence in-place
  • never classify unknown mismatch as success
  • never skip replay or diff after high-impact remediation
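These boundaries can be read as a gate that a recovered run must pass before promotion. The following Python sketch is one possible encoding. `DiffOutcome`, `RecoveryRecord`, and `may_promote` are hypothetical names, and treating an explained mismatch as promotable is an assumption rather than something the boundaries state.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class DiffOutcome(Enum):
    MATCH = "match"
    EXPLAINED_MISMATCH = "explained_mismatch"
    UNKNOWN_MISMATCH = "unknown_mismatch"


@dataclass(frozen=True)
class RecoveryRecord:
    evidence_retained: bool               # failing artifacts kept, not overwritten
    replay_done: bool                     # recovered run was replayed
    diff_outcome: Optional[DiffOutcome]   # None means no diff was run
    high_impact: bool                     # remediation judged high-impact


def may_promote(record: RecoveryRecord) -> Tuple[bool, str]:
    """Return (allowed, reason) according to the three boundaries."""
    if not record.evidence_retained:
        return False, "failing evidence was not retained"
    if record.high_impact and (not record.replay_done or record.diff_outcome is None):
        return False, "replay and diff are required after high-impact remediation"
    if record.diff_outcome is DiffOutcome.UNKNOWN_MISMATCH:
        return False, "unknown mismatch cannot be classified as success"
    return True, "ok"
```

For example, `may_promote(RecoveryRecord(True, True, DiffOutcome.UNKNOWN_MISMATCH, False))` is rejected even though replay and diff both ran, because the mismatch has no known cause. A stricter gate could require replay and diff for every remediation, not only high-impact ones.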

Next Reads