Module 08: Recovery, Scale, and Incident Survival

Page Maps

graph LR
  family["Reproducible Research"]
  program["Deep Dive DVC"]
  section["Recovery Scale Incident Survival"]
  page["Module 08: Recovery, Scale, and Incident Survival"]
  capstone["Capstone evidence"]

  family --> program --> section --> page
  page -.applies in.-> capstone

flowchart LR
  orient["Orient on the page map"] --> read["Read the main claim and examples"]
  read --> inspect["Inspect the related code, proof, or capstone surface"]
  inspect --> verify["Run or review the verification path"]
  verify --> apply["Apply the idea back to the module and capstone"]

Module 08 treats time as part of the reproducibility system.

A repository can be healthy today and still become unrecoverable later. Storage grows, remotes move, credentials rotate, CI images drift, maintainers leave, and old states become harder to interpret. None of that is exceptional. It is normal system life.

This module is about long-lived trust:

  • which states must remain recoverable
  • which states can expire
  • when cleanup is safe
  • how remote migration preserves evidence
  • how CI and maintainer handoffs survive time
  • how incidents are handled without improvising the state story

The central question is:

If local cache, local memory, or one storage provider disappeared, which results could still be restored and defended?

If the answer is unclear, the repository may be tidy today but not durable.
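One way to make that question concrete is to compare the object hashes a state references against what a remote actually holds. The sketch below is a minimal illustration, assuming the hashes have already been collected (for example from the entries recorded in capstone/dvc.lock and from a listing of the remote). The names `RecoveryReport` and `recovery_report`, and the short hash strings, are hypothetical and not part of DVC or the capstone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryReport:
    restorable: frozenset  # referenced hashes the remote still holds
    missing: frozenset     # referenced hashes that only the lost local cache held

    @property
    def fully_restorable(self) -> bool:
        return not self.missing


def recovery_report(referenced, remote_objects):
    """Split the hashes a state references by whether the remote holds them."""
    referenced = frozenset(referenced)
    remote_objects = frozenset(remote_objects)
    return RecoveryReport(
        restorable=referenced & remote_objects,
        missing=referenced - remote_objects,
    )


# Hypothetical hashes standing in for entries recorded in a lock file.
release_v1 = {"a1f3", "b772", "c90e"}
remote = {"a1f3", "b772"}  # "c90e" was never pushed

report = recovery_report(release_v1, remote)
print(sorted(report.missing))   # ['c90e']
print(report.fully_restorable)  # False
```

A non-empty `missing` set means the state looked healthy only because the local cache happened to be present.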

The capstone corroboration surface for this module is the set of files and commands that make recovery and long-lived state visible: capstone/dvc.lock, capstone/publish/v1/manifest.json, capstone/docs/recovery-guide.md, capstone/docs/stage-contract-guide.md, capstone/docs/release-review-guide.md, and the make -C capstone recovery-review route.

Why this module exists

Reproducibility is not a one-time setup achievement. It has to be maintained as the system ages.

Long-lived workflows fail through ordinary pressure:

  • old artifacts are deleted because storage feels expensive
  • nobody knows which releases must remain restorable
  • dvc gc is run without understanding what it can remove
  • a remote migration copies recent objects but misses older release evidence
  • CI images update and results drift
  • the only person who knew the recovery route leaves the team

These are design problems. The point of Module 08 is to make them discussable before they become incidents.

Study route

flowchart LR
  overview["Overview"] --> core1["Core 1: durability boundaries"]
  core1 --> core2["Core 2: retention policy"]
  core2 --> core3["Core 3: cleanup and cache safety"]
  core3 --> core4["Core 4: migration and CI drift"]
  core4 --> core5["Core 5: incident response and handoff"]
  core5 --> example["Worked example"]
  example --> practice["Exercises and answers"]
  practice --> glossary["Glossary"]

Read the module in that order the first time.

If the problem is already partly clear, use this shortcut:

  • open Core 1 when the main confusion is "what must survive local loss?"
  • open Core 2 when the main confusion is "which history should we keep, and for how long?"
  • open Core 3 when the main confusion is "when is cleanup safe?"
  • open Core 4 when the main confusion is "how do remotes or CI change over time?"
  • open Core 5 when the main confusion is "how should we respond when recovery is needed?"

Module map

| Page | Purpose |
| --- | --- |
| Overview | explains the module promise and study route |
| Durability Boundaries and Recovery Goals | teaches what must survive local cache loss and maintainer turnover |
| Retention Policy and History Value | teaches how to decide which historical states deserve durable recovery |
| Garbage Collection and Cache Safety | teaches cleanup discipline around dvc gc and cache removal |
| Remote Migration and CI Drift | teaches remote transitions and CI drift as long-lived system risks |
| Incident Response and Maintainer Handoffs | teaches incident response and knowledge continuity |
| Worked Example: Restoring after Local Cache Loss | walks through one realistic recovery check |
| Exercises | gives five mastery exercises |
| Exercise Answers | explains model answers and review logic |
| Glossary | keeps the module vocabulary stable |

What should be clear by the end

By the end of this module, you should be able to explain:

  • what state must survive local cache loss
  • how retention policy differs across release, audit, operational, and exploratory states
  • why cleanup can be destructive when references and remotes are misunderstood
  • how remote migration can preserve or break historical continuity
  • why CI drift and maintainer turnover belong in the recovery model
  • how an incident response route protects evidence before repair
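The difference between retention classes is easier to discuss once it is written down as data. The following is a minimal sketch, assuming a policy keyed by the four classes named above. The durations, the `RetentionRule` type, and the `may_expire` helper are illustrative assumptions, not values or names taken from the course or from DVC.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class RetentionRule:
    keep_for: Optional[timedelta]  # None means "keep indefinitely"
    must_be_in_remote: bool        # whether a durable remote copy is required


# Hypothetical policy: the four classes come from the module text,
# but every duration here is an illustrative assumption.
POLICY = {
    "release": RetentionRule(keep_for=None, must_be_in_remote=True),
    "audit": RetentionRule(keep_for=timedelta(days=7 * 365), must_be_in_remote=True),
    "operational": RetentionRule(keep_for=timedelta(days=90), must_be_in_remote=True),
    "exploratory": RetentionRule(keep_for=timedelta(days=30), must_be_in_remote=False),
}


def may_expire(state_class: str, created: date, today: date) -> bool:
    """Return True only when the policy allows this state to stop being recoverable."""
    rule = POLICY[state_class]
    if rule.keep_for is None:
        return False
    return today - created > rule.keep_for


today = date(2025, 1, 1)
print(may_expire("release", date(2015, 1, 1), today))       # False
print(may_expire("exploratory", date(2024, 10, 1), today))  # True
```

Keeping the policy as an explicit table means a cleanup decision can point at a rule instead of at someone's memory.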

Commands to keep close

These commands form the evidence loop for Module 08:

make -C capstone recovery-review
make -C capstone state-summary
make -C capstone release-audit
dvc pull
dvc status
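dvc gc --workspace --dry

Use the make routes for the course-provided capstone review. Treat the dvc gc dry run as a planning command, not permission to delete anything. Current DVC releases spell the dry-run flag --dry and require a scope flag such as --workspace, so check dvc gc --help for the installed version.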
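To show what "planning, not permission" can mean in practice, the sketch below reviews a planned cleanup against the states that must stay recoverable. It assumes the list of objects a dry run would remove has already been collected by hand or by a script; how to gather that list depends on the installed DVC version and is not shown. The function name `review_gc_plan` and all hashes are hypothetical.

```python
def review_gc_plan(planned_removals, protected, confirmed_in_remote):
    """Flag protected objects that a cleanup would remove with no remote copy left."""
    planned = set(planned_removals)
    # A protected object may leave the local cache only if the remote holds it.
    at_risk = (planned & set(protected)) - set(confirmed_in_remote)
    return {"safe_to_proceed": not at_risk, "at_risk": sorted(at_risk)}


plan = review_gc_plan(
    planned_removals={"d4", "e5", "f6"},  # what the dry run reported
    protected={"e5", "f6"},               # hashes referenced by retained states
    confirmed_in_remote={"f6"},           # hashes verified to exist in the remote
)
print(plan)  # {'safe_to_proceed': False, 'at_risk': ['e5']}
```

Here "e5" is both protected and absent from the remote, so the cleanup should wait until it is pushed or the retention decision is revisited.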

Capstone route

Use the capstone after you can name what needs to survive.

Best corroboration surfaces for this module:

  • capstone/dvc.lock
  • capstone/publish/v1/manifest.json
  • capstone/publish/v1/metrics.json
  • capstone/publish/v1/params.yaml
  • capstone/docs/recovery-guide.md
  • capstone/docs/stage-contract-guide.md
  • capstone/docs/publish-contract.md

Useful proof route:

make -C capstone state-summary
make -C capstone recovery-review
make -C capstone release-audit

The point of that route is not to admire a clean repository. It is to ask whether the state that matters can still be found, restored, checked, and explained after ordinary time pressure.
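The proof route can also be scripted so that it always runs in the same order and stops at the first failure. This is a minimal sketch, assuming the three make targets listed above exist and are run from the repository root. The `run_route` helper and the stub runner are hypothetical, and the stub is there only so the example runs without make or DVC installed.

```python
import subprocess
from types import SimpleNamespace

PROOF_ROUTE = [
    ["make", "-C", "capstone", "state-summary"],
    ["make", "-C", "capstone", "recovery-review"],
    ["make", "-C", "capstone", "release-audit"],
]


def run_route(commands, runner=subprocess.run):
    """Run each command in order and stop after the first non-zero exit code."""
    results = []
    for argv in commands:
        completed = runner(argv)
        results.append((" ".join(argv), completed.returncode))
        if completed.returncode != 0:
            break
    return results


def stub_runner(argv):
    # Stand-in for subprocess.run: pretend recovery-review fails.
    code = 1 if "recovery-review" in argv else 0
    return SimpleNamespace(returncode=code)


for command, code in run_route(PROOF_ROUTE, runner=stub_runner):
    print(f"{'ok' if code == 0 else 'FAILED'}  {command}")
```

With the stub, the loop reports state-summary as ok, recovery-review as FAILED, and never reaches release-audit. Calling `run_route(PROOF_ROUTE)` without the `runner` argument would invoke the real make targets.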