Generator Pipelines and Atomic Publication

Generated outputs become much riskier once the generator stops being one clean command and starts becoming a pipeline:

generate intermediate data
transform it
validate it
write several final outputs
maybe update a manifest too

At that point the most important question is no longer "did the script run?" It is:

when are downstream targets allowed to trust the result?

This page teaches that answer.

The sentence to keep

For a generator pipeline, ask:

where is the publication boundary, and what must be true before any downstream target can treat the outputs as complete?

That is the heart of pipeline design.

Pipelines fail differently from simple generators

A simple single-output rule often fails in obvious ways: the file is there or it is not.

Pipelines fail more subtly:

one stage succeeds, another fails
partial outputs remain on disk
one output is fresh, another is stale
validation should have rejected the result, but publication happened too early

This is why pipelines need an explicit publication contract rather than a casual "the script writes files" understanding.

Publication should happen after success, not during hope

Suppose a generator pipeline does:

1. render a header
2. render a JSON schema
3. validate the pair
4. move them into the trusted output directory

That means the publication event is step 4, not step 1.

If the build lets downstream targets see the header before validation has completed, the graph is already trusting partial work.

That is the core lesson: generation and publication are not always the same moment.

Temporary paths are part of safe publication

One healthy pattern is:

build/api.h build/api.json &: schema/api.yml scripts/gen_api.py | build/
    python3 scripts/gen_api.py --out-dir build/tmp
    python3 scripts/validate_api.py build/tmp/api.h build/tmp/api.json
    mv build/tmp/api.h build/api.h
    mv build/tmp/api.json build/api.json

The idea is not "always use tmp because it looks professional." The idea is:

incomplete work stays outside the trusted output paths
validation happens before final publication
downstream targets see the files only once the pipeline has succeeded

That is a much stronger contract than writing directly into the final paths throughout the pipeline.

Atomic publication is about trust

When the module says "atomic publication," the point is not a guarantee of kernel-level atomicity on every filesystem. The point is:

do not let downstream work observe a half-published result and mistake it for truth.

Often that means:

write to a temporary file
validate the full result
move the file into place

For directory-level or multi-output publication, it may mean staging several outputs and then moving or touching the final boundary only after all of them are ready.
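
The three steps above can be sketched in Python, the language of the generator scripts the recipes on this page call. This is a minimal illustration under stated assumptions, not the contents of any script in the module: the function name `publish_file` and the `validate` callback are hypothetical. It relies on `os.replace`, which renames atomically on POSIX systems when source and destination are on the same filesystem.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def publish_file(final_path: Path, text: str, validate: Callable[[str], bool]) -> bool:
    """Write text to a temp file, validate it, then rename it onto final_path.

    Returns True if published. On rejection, final_path is left untouched.
    (Hypothetical helper for illustration only.)
    """
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(final_path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
        # Validate what is actually on disk, not the in-memory string.
        if not validate(Path(tmp_name).read_text()):
            return False
        os.replace(tmp_name, final_path)  # the publication event
        return True
    finally:
        # After a successful replace the temp name no longer exists.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
```

A consumer that reads `final_path` sees either the previous content or the fully validated new content, never a half-written file. Note that `mkstemp` creates the file with owner-only permissions, so a real generator may need to adjust the mode before publishing.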

A useful single-file example

Single-file generation can still benefit from publication discipline:

build/include/config.h: schema/config.yml scripts/gen_config.py | build/include/
    @python3 scripts/gen_config.py schema/config.yml > $@.tmp
    @python3 scripts/validate_header.py $@.tmp
    @mv $@.tmp $@

This rule is easier to trust because:

an invalid header never becomes the published header
the published path changes only after validation
the consumer edge still points at one clean output path

That same pattern scales to larger pipelines.

Multi-output publication needs one clear finishing point

For coupled outputs, the question becomes:

which step marks the set as complete?

One answer is grouped targets with staged files:

api.h api.json &: schema/api.yml scripts/gen_api.py | build/
    @python3 scripts/gen_api.py --out-dir build/tmp
    @python3 scripts/validate_api.py build/tmp/api.h build/tmp/api.json
    @mv build/tmp/api.h api.h
    @mv build/tmp/api.json api.json

Another answer is a stamp or manifest that is touched or published only after both final outputs are in place.

The important thing is that the build names the finishing point instead of letting publication leak across multiple partial steps.
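
The manifest option can be pictured with a short Python sketch. Everything named here is an assumption made for illustration: the file name `manifest.json`, the helpers `publish_set` and `set_is_complete`, and the choice of SHA-256 hashes as the completeness check. The point it shows is only the ordering: the manifest is written last, so its presence and contents are the finishing point.

```python
import hashlib
import json
import os
from pathlib import Path
from typing import Dict

MANIFEST = "manifest.json"  # hypothetical boundary file name


def _digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def publish_set(out_dir: Path, outputs: Dict[str, bytes]) -> None:
    """Write every output first, then publish the manifest as the last step."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, data in outputs.items():
        tmp = out_dir / (name + ".tmp")
        tmp.write_bytes(data)
        os.replace(tmp, out_dir / name)
    manifest = {name: _digest(data) for name, data in outputs.items()}
    tmp = out_dir / (MANIFEST + ".tmp")
    tmp.write_text(json.dumps(manifest, sort_keys=True))
    os.replace(tmp, out_dir / MANIFEST)  # the finishing point


def set_is_complete(out_dir: Path) -> bool:
    """True only if the manifest exists and every listed file matches its hash."""
    manifest_path = out_dir / MANIFEST
    if not manifest_path.exists():
        return False
    manifest = json.loads(manifest_path.read_text())
    for name, expected in manifest.items():
        path = out_dir / name
        if not path.exists() or _digest(path.read_bytes()) != expected:
            return False
    return True
```

In Make terms, downstream targets would depend on the manifest (or stamp) rather than on the individual outputs. If a run dies between the first and last `os.replace`, the old manifest no longer matches the files, so the mixed state is detectable instead of silently trusted.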

Cleanup on failure matters

Pipelines that fail mid-run need one more discipline:

remove temporary artifacts that are not trustworthy
do not leave behind final outputs that were only partially updated

That usually means the recipe should fail before moving temporary outputs into their final paths, or explicitly remove partial temp state on the way out.

The standard here is not cosmetic tidiness. It is preventing the next build step from treating garbage as truth.
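
As a rough sketch of that discipline for a multi-output generator, the Python below stages everything in a throwaway directory and publishes only if generation and validation both finish. The name `run_pipeline` and the `generate` and `validate` callbacks are hypothetical stand-ins for scripts like the ones the recipes call. `tempfile.TemporaryDirectory` removes the staging area on exit, including when an exception is raised.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def run_pipeline(
    out_dir: Path,
    generate: Callable[[Path], None],
    validate: Callable[[Path], bool],
) -> bool:
    """Generate into a staging directory and publish only if nothing failed.

    Any exception or failed validation discards the staging directory and
    leaves the files already in out_dir exactly as they were.
    (Hypothetical helper for illustration only.)
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # Staging lives inside out_dir so the final renames stay on one filesystem.
    with tempfile.TemporaryDirectory(dir=str(out_dir), prefix=".stage-") as stage_name:
        stage = Path(stage_name)
        generate(stage)  # may raise part-way through
        if not validate(stage):
            return False
        for staged in sorted(stage.iterdir()):
            os.replace(staged, out_dir / staged.name)
    return True
```

The final loop is still a sequence of renames, not one atomic step for the whole set. When consumers need set-level trust, they should depend on a single boundary such as a stamp or manifest published after the loop.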

A simple pattern with explicit cleanup

build/report.json: data/input.csv scripts/gen_report.py | build/
    @python3 scripts/gen_report.py data/input.csv > $@.tmp || { rm -f $@.tmp; exit 1; }
    @python3 scripts/check_report.py $@.tmp || { rm -f $@.tmp; exit 1; }
    @mv $@.tmp $@ || { rm -f $@.tmp; exit 1; }

In a real shell recipe you may want clearer trap handling, but the design point is constant:

invalid or incomplete content should die in temporary space
the final target path should remain the trustworthy boundary

Pipelines create stage boundaries on purpose

Some pipelines legitimately need multiple trusted stages:

raw generated output
normalized generated output
packaged generated bundle

That is fine, but each stage must still answer:

what file or boundary represents completion here
who consumes this stage
what validates it before the next stage trusts it

In other words, a pipeline may have several boundaries, but each one still needs the same honest publication logic.
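
One way to make those three questions explicit is to write each stage down as data. The sketch below is an illustration only: the `Stage` dataclass, its field names, and `run_stages` are hypothetical and not part of the module. Each stage builds a candidate next to its boundary file, validates it, and renames it into place only on success. The run stops at the first stage whose output is not trusted.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class Stage:
    name: str
    boundary: Path                    # file that represents completion of this stage
    build: Callable[[Path], None]     # writes a candidate to the given temp path
    validate: Callable[[Path], bool]  # gate before the next stage may trust it


def run_stages(stages: List[Stage]) -> List[str]:
    """Run stages in order and return the names of the stages that were published."""
    completed = []
    for stage in stages:
        candidate = stage.boundary.with_name(stage.boundary.name + ".tmp")
        stage.build(candidate)
        if not candidate.exists() or not stage.validate(candidate):
            candidate.unlink(missing_ok=True)  # rejected work dies in temp space
            break
        os.replace(candidate, stage.boundary)  # this stage's publication event
        completed.append(stage.name)
    return completed
```

Later stages read only the published `boundary` of earlier stages, never a `.tmp` candidate, which keeps temporary space and trusted space separate at every level of the pipeline.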

Failure signatures worth recognizing

"The generated file exists, but it was only half-written when the consumer saw it"

That means publication happened too early.

"Validation failed, but the final output path still changed"

That means the final path stopped being a trustworthy boundary.

"The pipeline leaves a mix of old and new outputs after failure"

That means coupled publication is being modeled too loosely.

"Temporary outputs keep leaking into later stages"

That usually means temporary space and trusted space were not separated clearly enough.

A review question that improves pipeline design

Take one generator pipeline and ask:

which step first creates intermediate content
which step validates the content
which step publishes the trusted outputs
what happens on failure before publication
which files or boundary nodes downstream targets are allowed to depend on

If those answers are weak, the pipeline contract is weak too.

What to practice from this page

Choose one multi-stage generator in the capstone or your own build and write down:

the temporary paths
the validation step
the publication step
the cleanup behavior on failure
the exact output path or boundary file consumers should trust

If you can explain those without hand-waving, the pipeline has a real publication contract.

End-of-page checkpoint

Before leaving this lesson, make sure you can explain:

why pipeline generation and publication are not automatically the same moment
why temporary paths help keep partial work out of trusted output paths
what atomic publication means in practical build terms
why validation should happen before downstream trust
how to tell whether a pipeline leaves behind untrustworthy partial outputs