Module 08: Recovery, Scale, and Incident Survival

Module 08 treats time as part of the reproducibility system.

A repository can be healthy today and still become unrecoverable later. Storage grows, remotes move, credentials rotate, CI images drift, maintainers leave, and old states become harder to interpret. None of that is exceptional. It is normal system life.

This module is about long-lived trust:

- which states must remain recoverable
- which states can expire
- when cleanup is safe
- how remote migration preserves evidence
- how CI and maintainer handoffs survive time
- how incidents are handled without improvising the state story

The central question is:

If local cache, local memory, or one storage provider disappeared, which results could still be restored and defended?

If the answer is unclear, the repository may be tidy today but not durable.

The capstone corroboration surface for this module is the set of files and commands that make recovery and long-lived state visible: `capstone/dvc.lock`, `capstone/publish/v1/manifest.json`, `capstone/docs/recovery-guide.md`, `capstone/docs/stage-contract-guide.md`, `capstone/docs/release-review-guide.md`, and the `make -C capstone recovery-review` route.
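One way to picture "restored and defended" is a check that compares files on disk against recorded digests. The sketch below is only an illustration: the manifest schema it reads (a `files` list of `path` and `sha256` entries, with paths relative to the manifest's directory) is a hypothetical assumption, not the documented format of `capstone/publish/v1/manifest.json`, and the function names are invented for this example.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest_path: Path) -> dict[str, str]:
    """Classify each manifest entry as 'ok', 'missing', or 'mismatch'.

    Assumed (hypothetical) schema:
        {"files": [{"path": "...", "sha256": "..."}]}
    with every path relative to the manifest's own directory.
    """
    root = manifest_path.parent
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    report: dict[str, str] = {}
    for entry in manifest["files"]:
        target = root / entry["path"]
        if not target.is_file():
            # The recorded file cannot be found at all.
            report[entry["path"]] = "missing"
        elif sha256_of(target) != entry["sha256"]:
            # The file exists but no longer matches the recorded digest.
            report[entry["path"]] = "mismatch"
        else:
            report[entry["path"]] = "ok"
    return report
```

A recovery review could run a check like this after a fresh `dvc pull` and treat any entry that is not `ok` as a question to answer before trusting the restored state.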

Why this module exists

Reproducibility is not a one-time setup achievement. It has to keep holding as the system ages.

Long-lived workflows fail through ordinary pressure:

- old artifacts are deleted because storage feels expensive
- nobody knows which releases must remain restorable
- `dvc gc` is run without understanding what it can remove
- a remote migration copies recent objects but misses older release evidence
- CI images update and results drift
- the only person who knew the recovery route leaves the team

These are design problems. The point of Module 08 is to make them discussable before they become incidents.
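The migration failure in that list, where recent objects are copied but older release evidence is missed, is easy to discuss with a small inventory comparison. The following is a minimal sketch under loud assumptions: both remotes are reachable as plain directories (for example, local or mounted storage), and an object is identified by its path relative to the remote root. The function names are hypothetical, and nothing here calls DVC itself.

```python
from pathlib import Path


def object_keys(remote_root: Path) -> set[str]:
    """Collect the relative POSIX path of every file under a remote directory."""
    return {
        path.relative_to(remote_root).as_posix()
        for path in remote_root.rglob("*")
        if path.is_file()
    }


def missing_after_migration(old_root: Path, new_root: Path) -> list[str]:
    """Return objects present in the old remote but absent from the new one."""
    return sorted(object_keys(old_root) - object_keys(new_root))
```

An empty result is necessary but not sufficient evidence: it shows that every object was copied by name, not that each copy is intact or that every retained release still resolves to those objects.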

Study route

flowchart LR
  overview["Overview"] --> core1["Core 1: durability boundaries"]
  core1 --> core2["Core 2: retention policy"]
  core2 --> core3["Core 3: cleanup and cache safety"]
  core3 --> core4["Core 4: migration and CI drift"]
  core4 --> core5["Core 5: incident response and handoff"]
  core5 --> example["Worked example"]
  example --> practice["Exercises and answers"]
  practice --> glossary["Glossary"]

Read the module in that order the first time.

If the problem is already partly clear, use this shortcut:

- open Core 1 when the main confusion is "what must survive local loss?"
- open Core 2 when the main confusion is "which history should we keep, and for how long?"
- open Core 3 when the main confusion is "when is cleanup safe?"
- open Core 4 when the main confusion is "how do remotes or CI change over time?"
- open Core 5 when the main confusion is "how should we respond when recovery is needed?"

Module map

| Page | Purpose |
| --- | --- |
| Overview | explains the module promise and study route |
| Durability Boundaries and Recovery Goals | teaches what must survive local cache loss and maintainer turnover |
| Retention Policy and History Value | teaches how to decide which historical states deserve durable recovery |
| Garbage Collection and Cache Safety | teaches cleanup discipline around `dvc gc` and cache removal |
| Remote Migration and CI Drift | teaches remote transitions and CI drift as long-lived system risks |
| Incident Response and Maintainer Handoffs | teaches incident response and knowledge continuity |
| Worked Example: Restoring after Local Cache Loss | walks through one realistic recovery check |
| Exercises | gives five mastery exercises |
| Exercise Answers | explains model answers and review logic |
| Glossary | keeps the module vocabulary stable |

What should be clear by the end

By the end of this module, you should be able to explain:

- what state must survive local cache loss
- how retention policy differs across release, audit, operational, and exploratory states
- why cleanup can be destructive when references and remotes are misunderstood
- how remote migration can preserve or break historical continuity
- why CI drift and maintainer turnover belong in the recovery model
- how an incident response route protects evidence before repair
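The retention distinction in the list above can be written down as data rather than left as tribal knowledge. This is a sketch only: the four state kinds come from the text, but the retention windows, the `State` type, and the `may_expire` helper are hypothetical placeholders that a real team would replace with its own policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical retention windows; None means "never expires".
RETENTION: dict[str, Optional[timedelta]] = {
    "release": None,
    "audit": timedelta(days=365 * 7),
    "operational": timedelta(days=90),
    "exploratory": timedelta(days=30),
}


@dataclass(frozen=True)
class State:
    name: str
    kind: str  # one of the RETENTION keys
    created: date


def may_expire(state: State, today: date) -> bool:
    """Return True only when the state's retention window has fully elapsed."""
    window = RETENTION[state.kind]
    if window is None:
        # Kinds without a window are kept indefinitely.
        return False
    return today - state.created > window
```

Keeping the policy in one reviewable place means a cleanup proposal can be checked against it instead of against someone's memory of which states matter.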

Commands to keep close

These commands form the evidence loop for Module 08:

make -C capstone recovery-review
make -C capstone state-summary
make -C capstone release-audit
dvc pull
dvc status
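dvc gc --workspace --dry

Use the make routes for the course-provided capstone review. `dvc gc` refuses to run without a scope flag such as `--workspace`, and `--dry` only prints what would be removed. Treat that output as a plan, not as permission to delete anything.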
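The difference between a plan and a permission can be modeled with plain sets of object hashes. The sketch below is a conceptual model, not DVC's garbage-collection algorithm: the `plan_cleanup` function and its three inputs are hypothetical, and it assumes the hashes in the cache, the hashes needed by retained states, and the hashes confirmed on the remote have already been gathered by some other means.

```python
def plan_cleanup(
    cache: set[str], retained: set[str], remote: set[str]
) -> dict[str, list[str]]:
    """Sort object hashes into cleanup categories.

    cache:    hashes currently in the local cache
    retained: hashes referenced by states the retention policy keeps
    remote:   hashes confirmed present on the durable remote
    """
    return {
        # Cached but not needed by any retained state: candidates for removal.
        "deletable": sorted(cache - retained),
        # Needed by a retained state and only present locally: push before cleanup.
        "push_first": sorted((cache & retained) - remote),
        # Needed by a retained state but found nowhere: already an incident.
        "unrecoverable": sorted(retained - cache - remote),
    }
```

In this model, cleanup is only safe to discuss once `push_first` and `unrecoverable` are both empty; a non-empty `deletable` list on its own says nothing about safety.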

Capstone route

Use the capstone after you can name what needs to survive.

Best corroboration surfaces for this module:

- `capstone/dvc.lock`
- `capstone/publish/v1/manifest.json`
- `capstone/publish/v1/metrics.json`
- `capstone/publish/v1/params.yaml`
- `capstone/docs/recovery-guide.md`
- `capstone/docs/stage-contract-guide.md`
- `capstone/docs/publish-contract.md`

Useful proof route:

make -C capstone state-summary
make -C capstone recovery-review
make -C capstone release-audit

The point of that route is not to admire a clean repository. It is to ask whether the state that matters can still be found, restored, checked, and explained after ordinary time pressure.