Exercises

Use these exercises to practice pipeline judgment, not only DVC vocabulary.

The strongest answers will explain the declared graph, the risk of hidden (undeclared) influences, and the evidence you would expect to see in dvc.lock.

Exercise 1: Read a stage contract

You see this stage:

stages:
  prepare:
    cmd: python -m incident_escalation_capstone.prepare
    deps:
      - data/raw/service_incidents.csv
    outs:
      - data/prepared/incidents.parquet

Write a short explanation of:

  • what the stage promises
  • what changes should make it stale
  • what you still need to inspect before trusting the declaration
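
A contract can be listed mechanically before it is judged. The sketch below is a minimal illustration in Python, assuming the stage has already been loaded from dvc.yaml into a plain dict. The helper name summarize_contract is hypothetical and is not part of DVC. It only lists what the stage declares, so that empty fields stand out as questions to ask of the code behind the command.

```python
# The stage from the exercise, written as plain Python data.
# Nothing here calls DVC; this is only a reading aid.
stage = {
    "cmd": "python -m incident_escalation_capstone.prepare",
    "deps": ["data/raw/service_incidents.csv"],
    "outs": ["data/prepared/incidents.parquet"],
}


def summarize_contract(stage: dict) -> dict:
    """Return the declared fields of a stage, with empty lists for missing ones."""
    return {
        "cmd": stage.get("cmd", ""),
        "deps": list(stage.get("deps", [])),
        "params": list(stage.get("params", [])),
        "outs": list(stage.get("outs", [])),
    }


summary = summarize_contract(stage)
for field, value in summary.items():
    print(f"{field}: {value}")
```

A field that prints as an empty list is not proof that the command has no such influence. It only shows that none was declared.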

Exercise 2: Place each influence

An evaluation command reads:

  • models/escalation-model.json
  • data/prepared/incidents.parquet
  • data/reference/escalation_policy.csv
  • a threshold value from params.yaml
  • a temporary log file written during the run

Decide which items belong in deps, which belong in params, which belong in outs, and which should probably stay outside the output contract.

Explain your reasoning.

Exercise 3: Predict reruns

Use this graph:

flowchart LR
  prepare["prepare"] --> prepared["prepared incidents"]
  prepared --> fit["fit"]
  fit --> model["model"]
  model --> evaluate["evaluate"]
  prepared --> evaluate

Predict what should rerun when:

  • the raw incidents file changes and prepare produces a new prepared output
  • fit.model_family changes
  • evaluate.threshold changes
  • an unrelated README changes

Explain each prediction in terms of declared dependencies and parameters.
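
To check your predictions after writing them down, the graph can be modelled as data. The following is a rough Python sketch, not DVC's own algorithm. It assumes that fit.model_family is declared on fit and evaluate.threshold on evaluate, and it invents the names "raw incidents" and "evaluation report" for the ends of the graph that the diagram does not show. The function stages_to_rerun is hypothetical.

```python
from collections import deque

# Hypothetical encoding of the exercise graph. Each stage lists the
# artifacts and parameters it declares as inputs, and what it produces.
STAGES = {
    "prepare": {
        "inputs": {"raw incidents"},
        "outputs": {"prepared incidents"},
    },
    "fit": {
        "inputs": {"prepared incidents", "fit.model_family"},
        "outputs": {"model"},
    },
    "evaluate": {
        "inputs": {"prepared incidents", "model", "evaluate.threshold"},
        "outputs": {"evaluation report"},
    },
}


def stages_to_rerun(changed: str, stages: dict = STAGES) -> list:
    """Stages reachable from a changed input through declared inputs only.

    Simplification: every rerun is assumed to change its outputs.
    """
    pending = deque([changed])
    seen = {changed}
    rerun = []
    while pending:
        item = pending.popleft()
        for name, spec in stages.items():
            if item in spec["inputs"] and name not in rerun:
                rerun.append(name)
                for out in spec["outputs"]:
                    if out not in seen:
                        seen.add(out)
                        pending.append(out)
    return rerun


for change in ["raw incidents", "fit.model_family", "evaluate.threshold", "README.md"]:
    print(change, "->", stages_to_rerun(change))
```

The sketch overestimates in one respect: DVC compares content hashes, so a stage that reruns and produces byte-identical output does not force its downstream stages to rerun.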

Exercise 4: Diagnose stale output risk

You review this report:

"I changed data/reference/escalation_policy.csv, ran dvc repro, and evaluation did not rerun. But the evaluation command uses that file."

Write a review response that explains:

  • the likely graph problem
  • why this is more dangerous than an extra rerun
  • the concrete declaration repair
  • what lock evidence should show after the repair and rerun
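
One plausible way to ground such a review is to compare what the command reads with what the stage declares. The Python sketch below assumes the set of reads was gathered by inspecting the evaluation source, and it borrows the other two paths from Exercise 2. The declared deps list and both helper functions are hypothetical. The lock entry uses the deps and path keys that dvc.lock records and leaves out the hash and size fields.

```python
# Files the evaluation command reads, gathered by reading its source.
# This set is an assumption for illustration.
observed_reads = {
    "models/escalation-model.json",
    "data/prepared/incidents.parquet",
    "data/reference/escalation_policy.csv",
}

# Hypothetical deps of the evaluate stage before the repair.
declared_deps = {
    "models/escalation-model.json",
    "data/prepared/incidents.parquet",
}


def undeclared_reads(observed: set, declared: set) -> list:
    """Paths the command reads that the stage never declares."""
    return sorted(set(observed) - set(declared))


def lock_records(lock_stage: dict, path: str) -> bool:
    """True if a dvc.lock stage entry lists `path` among its deps."""
    return any(dep.get("path") == path for dep in lock_stage.get("deps", []))


# Shape of the lock entry expected after the repair and a rerun.
# The real file also stores a hash and size per dep, omitted here.
lock_after_repair = {"deps": [{"path": p} for p in sorted(observed_reads)]}

missing = undeclared_reads(observed_reads, declared_deps)
print("undeclared:", missing)
print("recorded after repair:",
      all(lock_records(lock_after_repair, p) for p in missing))
```

A non-empty result from undeclared_reads is the hidden influence. After the repair, the same path should appear in the stage's lock entry with its own recorded hash.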

Exercise 5: Refactor a mixed stage

A stage currently fits a model and evaluates it in one command:

stages:
  train_and_evaluate:
    cmd: python -m incident_escalation_capstone.train_and_evaluate
    deps:
      - data/prepared/incidents.parquet
    params:
      - fit.model_family
      - evaluate.threshold
    outs:
      - models/escalation-model.json
      - reports/evaluation.json

Propose either:

  • a split into clearer stages, if you think the model is a meaningful intermediate
  • a justification for keeping it together, if you think the combined boundary is more honest

Your answer should explain how the refactor affects rerun behavior and review clarity.

Mastery check

You have a strong grasp of this module if your answers consistently keep five ideas visible:

  • a stage contract is a reviewable promise, not only a command
  • file reads and control values belong in different declaration fields (deps and params)
  • dvc repro reacts only to declared influences
  • stale outputs are a correctness risk, while false reruns are usually visible waste
  • graph refactoring is safe when it preserves the provenance story