Failure Recovery
Failure recovery for a DAG run should preserve the failing evidence first, then restore a runnable state with the failure clearly attributed to a root cause scope.
Visual Summary
flowchart TD
fail[detect run failure] --> capture[capture status and artifacts]
capture --> classify[classify root cause scope]
classify --> remediate[apply targeted remediation]
remediate --> replay[replay and diff verification]
Recovery Sequence
- record run status and retain failing artifact directory
- classify failure as graph, input, runtime, environment, or backend issue
- remediate one scope at a time and rerun
- replay the recovered run to verify that it behaves deterministically
- diff against last known good run before promotion
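The sequence above is strictly ordered, and classification selects exactly one scope. A minimal sketch of that ordering follows, assuming a hypothetical in-memory tracker. The names `Scope`, `STEPS`, and `Recovery` are illustrative only and are not part of bijux dag.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Scope(Enum):
    """Failure scopes named in the recovery sequence."""
    GRAPH = "graph"
    INPUT = "input"
    RUNTIME = "runtime"
    ENVIRONMENT = "environment"
    BACKEND = "backend"


# Step identifiers are assumptions; they mirror the five bullets in order.
STEPS = ("record", "classify", "remediate", "replay", "diff")


@dataclass
class Recovery:
    """Hypothetical tracker for one failed run; not a bijux dag API."""
    failed_run: str
    scope: Optional[Scope] = None
    done: List[str] = field(default_factory=list)

    def advance(self, step: str, scope: Optional[Scope] = None) -> None:
        """Accept a step only if it is the next one in the sequence."""
        expected = STEPS[len(self.done)] if len(self.done) < len(STEPS) else None
        if step != expected:
            raise ValueError(f"expected step {expected!r}, got {step!r}")
        if step == "classify":
            if scope is None:
                raise ValueError("classify requires exactly one scope")
            self.scope = scope
        self.done.append(step)

    @property
    def promotable(self) -> bool:
        """True only after every step, including replay and diff, has run."""
        return tuple(self.done) == STEPS
```

Rejecting out-of-order steps makes it impossible to reach promotion without having replayed and diffed the recovered run.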
Diagnostic Commands
bijux dag status ./runs/failed-20260406-01
bijux dag inspect ./runs/failed-20260406-01
bijux dag replay ./runs/failed-20260406-01 --out ./runs/replay-failed
bijux dag diff ./runs/good-20260405-77 ./runs/recovered-20260406-02 --mode semantic --explain
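When the run directories change from one incident to the next, it can help to generate these invocations rather than edit them by hand. The sketch below builds the same four commands as argument lists. It uses only the subcommands and flags shown above; the function name `diagnostic_commands` and its parameters are hypothetical.

```python
import shlex


def diagnostic_commands(failed, replay_out, good, recovered):
    """Return the four diagnostic invocations as argv lists, in the order listed above."""
    return [
        ["bijux", "dag", "status", failed],
        ["bijux", "dag", "inspect", failed],
        ["bijux", "dag", "replay", failed, "--out", replay_out],
        ["bijux", "dag", "diff", good, recovered, "--mode", "semantic", "--explain"],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in diagnostic_commands(
        "./runs/failed-20260406-01",
        "./runs/replay-failed",
        "./runs/good-20260405-77",
        "./runs/recovered-20260406-02",
    ):
        print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` to execute it. How bijux dag reports failures through exit codes is not covered here, so treat that as something to verify before relying on `check=True`.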
Code Anchors
- crates/bijux-dag-app/src/routes/status_routes.rs
- crates/bijux-dag-app/src/routes/inspect_routes.rs
- crates/bijux-dag-runtime/src/replay/
Recovery Boundaries
- never replace failing evidence in place
- never classify an unknown mismatch as success
- never skip replay or diff after high-impact remediation
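These boundaries can be expressed as a check that runs before promotion. The following is a rough sketch under stated assumptions: `RecoveryReport`, `DiffOutcome`, and `boundary_violations` are hypothetical names, and the two diff outcomes are a simplification rather than the actual output categories of `bijux dag diff`.

```python
import os
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class DiffOutcome(Enum):
    """Simplified, assumed diff results; not the real bijux dag categories."""
    MATCH = "match"
    UNKNOWN_MISMATCH = "unknown_mismatch"


@dataclass(frozen=True)
class RecoveryReport:
    """Hypothetical summary of one recovery attempt."""
    failed_dir: str
    recovered_dir: str
    replayed: bool
    diff: Optional[DiffOutcome]  # None means no diff was run
    high_impact: bool


def boundary_violations(report: RecoveryReport) -> List[str]:
    """Return one message per violated boundary; an empty list means none."""
    violations = []
    # Writing the recovered run into the failed directory overwrites evidence.
    if os.path.normpath(report.recovered_dir) == os.path.normpath(report.failed_dir):
        violations.append("failing evidence replaced in place")
    if report.diff is DiffOutcome.UNKNOWN_MISMATCH:
        violations.append("unknown mismatch classified as success")
    if report.high_impact and (not report.replayed or report.diff is None):
        violations.append("replay or diff skipped after high-impact remediation")
    return violations
```

Returning every violation, rather than stopping at the first, keeps the attribution complete when several boundaries are crossed at once.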