Safe Pipeline Refactoring and Shared Outputs

Real DVC pipelines rarely stay as a perfect straight line.

They grow:

one preparation stage feeds several analyses
one model output feeds evaluation and publishing
a single command produces a metric file and a plot
an overloaded stage needs to be split into clearer boundaries
directories move as the repository becomes more organized

These changes are normal. The risk is not change itself. The risk is changing graph shape without preserving the provenance story.

Shared intermediates need an owner

A shared intermediate is an artifact produced once and consumed by more than one later stage.

Example:

flowchart LR
  prepare["prepare"] --> features["data/prepared/features.parquet"]
  features --> fit["fit"]
  features --> inspect["inspect"]
  fit --> evaluate["evaluate"]

The important question is not "can two later stages read the same file?" They can.

The important question is:

Which stage owns the shared artifact, and do all consumers declare it?

The producer should list the artifact in outs:

stages:
  prepare:
    cmd: python -m incident_escalation_capstone.prepare
    deps:
      - data/raw/service_incidents.csv
    outs:
      - data/prepared/features.parquet

Each consumer should list it in deps:

stages:
  fit:
    cmd: python -m incident_escalation_capstone.fit
    deps:
      - data/prepared/features.parquet
    outs:
      - models/escalation-model.json
  inspect:
    cmd: python -m incident_escalation_capstone.inspect
    deps:
      - data/prepared/features.parquet
    outs:
      - reports/inspection.json

That gives DVC a clear fan-out story: prepare owns features.parquet, and a change to that file invalidates both fit and inspect.
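The ownership question can also be checked mechanically. The sketch below is a minimal illustration, not part of DVC: it assumes the two fragments above have been combined into one nested dictionary (the shape a YAML parser would return for the `stages` key) and that every `deps` and `outs` entry is a plain path string. The helper names `owners_and_consumers` and `shared_intermediates` are hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory copy of the dvc.yaml fragments above.
# Assumption: every deps/outs entry is a plain string path.
STAGES = {
    "prepare": {
        "deps": ["data/raw/service_incidents.csv"],
        "outs": ["data/prepared/features.parquet"],
    },
    "fit": {
        "deps": ["data/prepared/features.parquet"],
        "outs": ["models/escalation-model.json"],
    },
    "inspect": {
        "deps": ["data/prepared/features.parquet"],
        "outs": ["reports/inspection.json"],
    },
}


def owners_and_consumers(stages):
    """Map each declared output to (producing stages, consuming stages)."""
    owners = defaultdict(list)
    consumers = defaultdict(list)
    for name, spec in stages.items():
        for path in spec.get("outs", []):
            owners[path].append(name)
        for path in spec.get("deps", []):
            consumers[path].append(name)
    return {path: (owners[path], consumers.get(path, [])) for path in owners}


def shared_intermediates(stages):
    """Return outputs read by more than one stage, with their single owner."""
    report = {}
    for path, (producers, readers) in owners_and_consumers(stages).items():
        if len(producers) != 1:
            raise ValueError(f"{path} has {len(producers)} producers, expected 1")
        if len(readers) > 1:
            report[path] = {"owner": producers[0], "consumers": sorted(readers)}
    return report


print(shared_intermediates(STAGES))
```

A consumer that reads the file without declaring it will not appear in the `consumers` list, so a shorter list than expected is a hint to review the stage's real read surface.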

Multi-output stages can be honest

A stage can own more than one output when the command truly produces a cohesive set.

Example:

stages:
  evaluate:
    cmd: python -m incident_escalation_capstone.evaluate
    deps:
      - models/escalation-model.json
      - data/prepared/features.parquet
    params:
      - evaluate.threshold
    outs:
      - reports/evaluation.json
      - reports/error-slices.csv
      - reports/calibration.svg

This is reasonable if those artifacts are produced together from the same evaluation run.

It becomes weak if one output is used for release, one is a temporary debugging file, and one belongs to a different command. Multi-output is not a problem by itself. Mixed ownership is the problem.

Ask:

do these outputs share the same inputs and controls?
should they be refreshed together?
would a reviewer understand why one command owns all of them?
does any downstream stage depend on only one of them?

If the answers are clear, a multi-output stage is fine. If the answers are unclear, split the stage.
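The last question in the list, whether a downstream stage depends on only one of the outputs, is the easiest to answer from the declarations alone. The following sketch assumes the evaluate stage above is held as a dictionary and adds an invented `publish` stage purely for illustration. The helpers `output_usage` and `partial_consumers` are hypothetical names, not DVC functions.

```python
# Hypothetical pipeline: the evaluate stage above plus an invented consumer.
STAGES = {
    "evaluate": {
        "deps": ["models/escalation-model.json", "data/prepared/features.parquet"],
        "params": ["evaluate.threshold"],
        "outs": [
            "reports/evaluation.json",
            "reports/error-slices.csv",
            "reports/calibration.svg",
        ],
    },
    "publish": {  # invented for illustration only
        "deps": ["reports/evaluation.json"],
        "outs": ["release/summary.md"],
    },
}


def output_usage(stages, producer):
    """For each output of `producer`, list downstream stages that declare it."""
    usage = {path: [] for path in stages[producer].get("outs", [])}
    for name, spec in stages.items():
        if name == producer:
            continue
        for path in spec.get("deps", []):
            if path in usage:
                usage[path].append(name)
    return usage


def partial_consumers(stages, producer):
    """Return downstream stages that read some, but not all, producer outputs."""
    usage = output_usage(stages, producer)
    all_outs = set(usage)
    readers = {}
    for path, names in usage.items():
        for name in names:
            readers.setdefault(name, set()).add(path)
    return {name: sorted(paths) for name, paths in readers.items() if paths != all_outs}


print(partial_consumers(STAGES, "evaluate"))
```

A partial consumer is not automatically a problem. It is a prompt to ask whether the outputs it ignores really belong to the same stage.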

Refactoring without losing truth

Pipeline refactoring should preserve a readable before-and-after story.

Common safe moves include:

renaming a stage while preserving its command, dependencies, parameters, and outputs
moving an output path while keeping the producing stage and consumer dependencies clear
splitting one large stage into two stages when an intermediate artifact has real meaning
merging two stages when the intermediate has no independent review value
narrowing a broad dependency after confirming the real read surface

The safe refactor question is:

After this change, can I still explain which declared input caused which declared output?

If not, the change may be making the pipeline prettier while making it less truthful.
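One way to make the safe refactor question concrete is to compare, for each output, the set of external inputs and parameters that transitively feed it before and after the change. The sketch below is an assumption-laden illustration: pipelines are plain dictionaries, the `train` stage name and the `BROKEN` variant are invented, and `roots` and `provenance_drift` are hypothetical helpers. It does not handle moved output paths, which would need an explicit old-to-new mapping.

```python
def roots(stages, path):
    """Return external inputs and params that transitively feed `path`."""
    producer = next(
        (spec for spec in stages.values() if path in spec.get("outs", [])), None
    )
    if producer is None:
        return {path}  # no stage produces it, so treat it as an external input
    found = {f"param:{name}" for name in producer.get("params", [])}
    for dep in producer.get("deps", []):
        found |= roots(stages, dep)
    return found


def provenance_drift(before, after, outputs):
    """Return outputs whose transitive inputs differ between two pipelines."""
    drift = {}
    for path in outputs:
        old, new = roots(before, path), roots(after, path)
        if old != new:
            drift[path] = {"lost": sorted(old - new), "gained": sorted(new - old)}
    return drift


PREPARE = {
    "deps": ["data/raw/service_incidents.csv"],
    "outs": ["data/prepared/features.parquet"],
}
FIT = {
    "deps": ["data/prepared/features.parquet"],
    "params": ["fit.model_family"],
    "outs": ["models/escalation-model.json"],
}

BEFORE = {"prepare": PREPARE, "fit": FIT}
# Safe move: the stage is renamed (hypothetical new name), declarations unchanged.
RENAMED = {"prepare": PREPARE, "train": FIT}
# Unsafe move: the dependency on the prepared features was dropped.
BROKEN = {
    "prepare": PREPARE,
    "fit": {"params": ["fit.model_family"], "outs": ["models/escalation-model.json"]},
}

MODEL = "models/escalation-model.json"
print(provenance_drift(BEFORE, RENAMED, [MODEL]))
print(provenance_drift(BEFORE, BROKEN, [MODEL]))
```

An empty result only says that the declared story is unchanged. It cannot tell whether the declarations match what the commands actually read.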

A split example

Before:

stages:
  train_and_evaluate:
    cmd: python -m incident_escalation_capstone.train_and_evaluate
    deps:
      - data/prepared/features.parquet
    params:
      - fit.model_family
      - evaluate.threshold
    outs:
      - models/escalation-model.json
      - reports/evaluation.json

This may be convenient, but it hides two different claims in one stage: fitting a model and evaluating it.

After:

stages:
  fit:
    cmd: python -m incident_escalation_capstone.fit
    deps:
      - data/prepared/features.parquet
    params:
      - fit.model_family
    outs:
      - models/escalation-model.json
  evaluate:
    cmd: python -m incident_escalation_capstone.evaluate
    deps:
      - models/escalation-model.json
      - data/prepared/features.parquet
    params:
      - evaluate.threshold
    outs:
      - reports/evaluation.json

Now a change to evaluate.threshold reruns only evaluate, without pretending that model fitting changed. A change to fit.model_family reruns fit, and the new model then propagates to evaluate. The graph becomes more specific and more teachable.
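The difference between the two shapes can be simulated from the declarations. This sketch is a simplified worst-case model, not DVC's own algorithm: it assumes that every rerun stage changes all of its outputs, whereas DVC compares recorded hashes and may skip a downstream stage whose inputs turn out to be identical. The function `stages_to_rerun` is a hypothetical helper.

```python
COMBINED = {
    "train_and_evaluate": {
        "deps": ["data/prepared/features.parquet"],
        "params": ["fit.model_family", "evaluate.threshold"],
        "outs": ["models/escalation-model.json", "reports/evaluation.json"],
    },
}

SPLIT = {
    "fit": {
        "deps": ["data/prepared/features.parquet"],
        "params": ["fit.model_family"],
        "outs": ["models/escalation-model.json"],
    },
    "evaluate": {
        "deps": ["models/escalation-model.json", "data/prepared/features.parquet"],
        "params": ["evaluate.threshold"],
        "outs": ["reports/evaluation.json"],
    },
}


def stages_to_rerun(stages, changed):
    """Return stages invalidated when `changed` (a path or param name) changes.

    Worst-case assumption: a rerun stage changes every output it declares.
    """
    dirty = {changed}
    rerun = []
    progress = True
    while progress:
        progress = False
        for name, spec in stages.items():
            if name in rerun:
                continue
            inputs = set(spec.get("deps", [])) | set(spec.get("params", []))
            if inputs & dirty:
                rerun.append(name)
                dirty |= set(spec.get("outs", []))
                progress = True
    return rerun


print(stages_to_rerun(COMBINED, "evaluate.threshold"))
print(stages_to_rerun(SPLIT, "evaluate.threshold"))
print(stages_to_rerun(SPLIT, "fit.model_family"))
```

In the combined shape, a threshold change invalidates the stage that also writes the model. In the split shape, the same change touches only evaluate.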

A merge example

Splitting is not always better.

If one stage writes tmp/normalized.csv and the next immediately reads it, and nobody reviews or reuses that intermediate, the split may add ceremony without adding truth. In that case, a single stage with one meaningful declared output may be clearer.

The question is not "more stages or fewer stages?" The question is "which boundaries make the causal story visible?"
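Merge candidates can be shortlisted by looking for intermediates that exactly one stage reads. The sketch below assumes a hypothetical `normalize` stage that writes tmp/normalized.csv for prepare, and the helper name `single_consumer_intermediates` is invented. It only lists candidates. Whether an intermediate has review or reuse value remains a human judgment.

```python
# Hypothetical pipeline: an invented normalize stage feeding prepare.
STAGES = {
    "normalize": {
        "deps": ["data/raw/service_incidents.csv"],
        "outs": ["tmp/normalized.csv"],
    },
    "prepare": {
        "deps": ["tmp/normalized.csv"],
        "outs": ["data/prepared/features.parquet"],
    },
    "fit": {
        "deps": ["data/prepared/features.parquet"],
        "outs": ["models/escalation-model.json"],
    },
    "inspect": {
        "deps": ["data/prepared/features.parquet"],
        "outs": ["reports/inspection.json"],
    },
}


def single_consumer_intermediates(stages):
    """Return (path, producer, consumer) for outputs read by exactly one stage."""
    consumers = {}
    for name, spec in stages.items():
        for path in spec.get("deps", []):
            consumers.setdefault(path, []).append(name)
    candidates = []
    for name, spec in stages.items():
        for path in spec.get("outs", []):
            readers = consumers.get(path, [])
            if len(readers) == 1:
                candidates.append((path, name, readers[0]))
    return candidates


print(single_consumer_intermediates(STAGES))
```

The shared features file is not listed because two stages read it, and final outputs are not listed because no stage reads them.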

Review checkpoint

You understand this core when you can:

name the producer that owns a shared intermediate
confirm every consumer declares the shared intermediate as a dependency
decide when multiple outputs belong to one stage
split an overloaded stage when the intermediate has review value
merge stages when the split only exposes throwaway scratch
preserve lock evidence (the state recorded in dvc.lock) and review clarity while changing graph shape

Pipeline structure is not sacred. Provenance is. Refactor the graph when it makes the truth easier to see.