Exercise Answers

These answers are model explanations, not the only acceptable wording.

What matters is whether the reasoning keeps the declared graph, real command behavior, and lock evidence connected.

Answer 1: Read a stage contract

The stage promises:

run python -m incident_escalation_capstone.prepare
read data/raw/service_incidents.csv
produce data/prepared/incidents.parquet

The stage should become stale when:

the raw incidents file changes
the command text changes
the declared output is missing
the hashes recorded in dvc.lock no longer match the current state of the declared dependencies, parameters, or outputs

What still needs inspection:

whether the command reads any other files, such as reference tables or schemas
whether it uses control values that belong in params.yaml
whether it writes additional artifacts that should be declared
whether implementation files should be listed as dependencies under this course's chosen convention

The main lesson is that the YAML is a claim. Reviewers still need to verify that the claim matches the real read and write behavior.
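The verification step can be sketched as a set comparison between what the stage declares and what the command was observed to do. This is a minimal illustration, not part of DVC: `contract_gaps` is a hypothetical helper, the observed lists are assumed to come from code review or tracing, and `data/reference/schema.json` is an invented path used only to show a gap.

```python
# Hypothetical review helper: compare a stage's declared contract with the
# reads and writes a reviewer actually observed. DVC does not produce the
# "observed" lists; they are assumed to come from code review or tracing.

def contract_gaps(declared_deps, declared_outs, observed_reads, observed_writes):
    """Return paths the command touches that the stage does not declare."""
    return {
        "undeclared_reads": sorted(set(observed_reads) - set(declared_deps)),
        "undeclared_writes": sorted(set(observed_writes) - set(declared_outs)),
    }


gaps = contract_gaps(
    declared_deps=["data/raw/service_incidents.csv"],
    declared_outs=["data/prepared/incidents.parquet"],
    # Assumed review finding, for illustration only (invented path).
    observed_reads=["data/raw/service_incidents.csv", "data/reference/schema.json"],
    observed_writes=["data/prepared/incidents.parquet"],
)
print(gaps)
```

Any non-empty list is a place where the claim in the YAML and the real behavior disagree.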

Answer 2: Place each influence

Strong placement:

deps:
  - models/escalation-model.json
  - data/prepared/incidents.parquet
  - data/reference/escalation_policy.csv
params:
  - evaluate.threshold
outs:
  - reports/evaluation.json

The model, prepared data, and policy CSV are file reads, so they belong in deps.

The threshold is a reviewed control value, so it belongs in params.

The temporary log should usually stay outside the output contract unless it is a reviewed artifact that downstream readers rely on. If it is only debugging residue, declaring it as an output makes the stage noisier without improving the provenance story.

Answer 3: Predict reruns

If the raw incidents file changes and prepare produces a new prepared output:

prepare should rerun because its declared input changed
fit should rerun because it depends on the prepared output
evaluate should rerun if it depends on either the prepared output or the model output

If fit.model_family changes:

fit should rerun because its declared parameter changed
evaluate should rerun if the model output changes and evaluation depends on that model
prepare should not rerun because fit.model_family is not part of its declared state

If evaluate.threshold changes:

evaluate should rerun
prepare and fit should not rerun unless their declared state also changed

If an unrelated README changes:

no DVC stage should rerun unless the README is declared as a dependency somewhere

The main lesson is to predict from declared edges, not from a vague feeling that "the pipeline changed."
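The same predictions can be reproduced with a small model of the declared graph. This is a simplified sketch, not how DVC is implemented: the stage declarations are taken from Answers 1 and 5, `predict_reruns` is a hypothetical helper, and it assumes every rerun produces changed outputs (DVC compares output hashes, so a rerun that yields identical outputs would not propagate downstream).

```python
# Declared state per stage, taken from the stage contracts in these answers.
STAGES = {
    "prepare": {
        "deps": ["data/raw/service_incidents.csv"],
        "params": [],
        "outs": ["data/prepared/incidents.parquet"],
    },
    "fit": {
        "deps": ["data/prepared/incidents.parquet"],
        "params": ["fit.model_family"],
        "outs": ["models/escalation-model.json"],
    },
    "evaluate": {
        "deps": ["data/prepared/incidents.parquet", "models/escalation-model.json"],
        "params": ["evaluate.threshold"],
        "outs": ["reports/evaluation.json"],
    },
}


def predict_reruns(stages, changed):
    """Return the stages that rerun when the `changed` paths or params change.

    Assumption: a stage that reruns always changes its outputs.
    """
    dirty = set(changed)
    rerun = set()
    progressed = True
    while progressed:
        progressed = False
        for name, stage in stages.items():
            if name in rerun:
                continue
            # A stage is stale if any declared dep or param is dirty.
            if dirty & set(stage["deps"] + stage["params"]):
                rerun.add(name)
                dirty |= set(stage["outs"])  # its outputs become dirty inputs
                progressed = True
    return sorted(rerun)


for change in (
    "data/raw/service_incidents.csv",
    "fit.model_family",
    "evaluate.threshold",
    "README.md",
):
    print(change, "->", predict_reruns(STAGES, [change]))
```

The README case returns an empty list because no stage declares it, which is the point of predicting from declared edges.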

Answer 4: Diagnose stale output risk

Strong review response:

The evaluation stage likely has a missing dependency. If the command reads data/reference/escalation_policy.csv, that path should appear in the stage's deps. This is more dangerous than an extra rerun because DVC can skip evaluation even after a meaningful input changes, leaving a stale report that looks current. Add the policy CSV to deps, rerun the stage, and confirm that dvc.lock records both the policy dependency and the refreshed evaluation output.

Concrete repair:

deps:
  - models/escalation-model.json
  - data/prepared/incidents.parquet
  - data/reference/escalation_policy.csv

The exact list may include the evaluation implementation file too, depending on the course repository convention.
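The final confirmation step can be sketched as a lookup in the parsed lock file. This assumes dvc.lock has already been loaded into a dictionary (for example with a YAML parser, which is not shown) and that each stage lists its dependencies as entries with a `path` field. `lock_records_dep` is a hypothetical helper, and the sample dictionaries are stand-ins, not real lock contents.

```python
def lock_records_dep(lock, stage_name, path):
    """Return True if the parsed dvc.lock lists `path` as a dep of the stage."""
    stage = lock.get("stages", {}).get(stage_name, {})
    return any(entry.get("path") == path for entry in stage.get("deps", []))


# Stand-ins for a parsed dvc.lock before and after the repair. Real entries
# also carry hash and size fields, omitted here rather than invented.
lock_before = {"stages": {"evaluate": {"deps": [
    {"path": "models/escalation-model.json"},
    {"path": "data/prepared/incidents.parquet"},
]}}}
lock_after = {"stages": {"evaluate": {"deps": [
    {"path": "models/escalation-model.json"},
    {"path": "data/prepared/incidents.parquet"},
    {"path": "data/reference/escalation_policy.csv"},
]}}}

policy = "data/reference/escalation_policy.csv"
print(lock_records_dep(lock_before, "evaluate", policy))
print(lock_records_dep(lock_after, "evaluate", policy))
```

A check like this only shows that the dependency is recorded. It does not show that the report was regenerated, so the output evidence still needs its own review.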

Answer 5: Refactor a mixed stage

A strong split answer:

stages:
  fit:
    cmd: python -m incident_escalation_capstone.fit
    deps:
      - data/prepared/incidents.parquet
    params:
      - fit.model_family
    outs:
      - models/escalation-model.json
  evaluate:
    cmd: python -m incident_escalation_capstone.evaluate
    deps:
      - data/prepared/incidents.parquet
      - models/escalation-model.json
    params:
      - evaluate.threshold
    outs:
      - reports/evaluation.json

Why this is stronger when the model is a meaningful intermediate:

a model control change reruns fitting and then evaluation
an evaluation threshold change reruns only evaluation
the model artifact has a clear owner
review can separate "why did the model change?" from "why did the report change?"

A defensible keep-together answer is possible only if the model has no independent review or reuse value and the combined command truly owns both outputs as one cohesive result. In that case, you should still explain why both outputs share the same inputs and controls.
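The cost of keeping the stages together can be illustrated by asking which stage a single parameter change invalidates directly. This is a minimal sketch: `fit_and_evaluate` is a hypothetical name for the combined stage, `directly_triggered` is an illustrative helper, and downstream propagation is deliberately not modeled.

```python
# Params each stage declares. "fit_and_evaluate" is a hypothetical name for
# the combined stage; the split names follow the answer above.
COMBINED = {"fit_and_evaluate": ["fit.model_family", "evaluate.threshold"]}
SPLIT = {"fit": ["fit.model_family"], "evaluate": ["evaluate.threshold"]}


def directly_triggered(stage_params, changed_param):
    """Return stages whose declared params include the changed key."""
    return sorted(
        name for name, params in stage_params.items() if changed_param in params
    )


# A threshold change invalidates the whole combined stage, so the model is
# refit even though no fitting control changed.
print(directly_triggered(COMBINED, "evaluate.threshold"))
# With the split graph, the same change invalidates only evaluation.
print(directly_triggered(SPLIT, "evaluate.threshold"))
```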

Self-check

If your answers consistently explain:

what the stage declaration promises
where each real influence belongs
how rerun prediction follows declared state
why stale output risk should be repaired before convenience cleanup
how graph shape affects provenance clarity

then you are using Module 04 correctly.