Generator Pipelines and Atomic Publication¶
Page Maps¶
graph LR
family["Reproducible Research"]
program["Deep Dive Make"]
section["Generated Files Multi Output Pipeline Boundaries"]
page["Generator Pipelines and Atomic Publication"]
capstone["Capstone evidence"]
family --> program --> section --> page
page -.applies in.-> capstone
flowchart LR
orient["Orient on the page map"] --> read["Read the main claim and examples"]
read --> inspect["Inspect the related code, proof, or capstone surface"]
inspect --> verify["Run or review the verification path"]
verify --> apply["Apply the idea back to the module and capstone"]
Generated outputs become much riskier once the generator stops being one clean command and starts becoming a pipeline:
- generate intermediate data
- transform it
- validate it
- write several final outputs
- maybe update a manifest too
At that point the most important question is no longer "did the script run?" It is:
when are downstream targets allowed to trust the result?
This page teaches that answer.
The sentence to keep¶
For a generator pipeline, ask:
where is the publication boundary, and what must be true before any downstream target can treat the outputs as complete?
That is the heart of pipeline design.
Pipelines fail differently from simple generators¶
A simple single-output rule often fails in obvious ways: the file is there or it is not.
Pipelines fail more subtly:
- one stage succeeds, another fails
- partial outputs remain on disk
- one output is fresh, another is stale
- validation should have rejected the result, but publication happened too early
This is why pipelines need an explicit publication contract rather than a casual "the script writes files" understanding.
Publication should happen after success, not during hope¶
Suppose a generator pipeline does:
- render a header
- render a JSON schema
- validate the pair
- move them into the trusted output directory
That means the publication event is step 4, not step 1.
If the build lets downstream targets see the header before validation completed, the graph is already trusting partial work.
That is the core lesson: generation and publication are not always the same moment.
Temporary paths are part of safe publication¶
One healthy pattern is:
build/api.h build/api.json &: schema/api.yml scripts/gen_api.py | build/
python3 scripts/gen_api.py --out-dir build/tmp
python3 scripts/validate_api.py build/tmp/api.h build/tmp/api.json
mv build/tmp/api.h build/api.h
mv build/tmp/api.json build/api.json
The idea is not "always use tmp because it looks professional." The idea is:
- incomplete work stays outside the trusted output paths
- validation happens before final publication
- downstream targets only see the files once the pipeline succeeded
That is a much stronger contract than writing directly into the final paths throughout the pipeline.
Atomic publication is about trust¶
When the module says "atomic publication," the important meaning is not kernel-level technical perfection in every environment. The important meaning is:
do not let downstream work observe a half-published result and mistake it for truth.
Often that means:
- write to a temporary file
- validate the full result
- move the file into place
For directory-level or multi-output publication, it may mean staging several outputs and then moving or touching the final boundary only after all of them are ready.
A useful single-file example¶
Single-file generation can still benefit from publication discipline:
build/include/config.h: schema/config.yml scripts/gen_config.py | build/include/
@python3 scripts/gen_config.py schema/config.yml > $@.tmp
@python3 scripts/validate_header.py $@.tmp
@mv $@.tmp $@
This rule is easier to trust because:
- an invalid header never becomes the published header
- the published path changes only after validation
- the consumer edge still points at one clean output path
That same pattern scales to larger pipelines.
Multi-output publication needs one clear finishing point¶
For coupled outputs, the question becomes:
which step marks the set as complete?
One answer is grouped targets with staged files:
api.h api.json &: schema/api.yml scripts/gen_api.py | build/
@python3 scripts/gen_api.py --out-dir build/tmp
@python3 scripts/validate_api.py build/tmp/api.h build/tmp/api.json
@mv build/tmp/api.h api.h
@mv build/tmp/api.json api.json
Another answer is a stamp or manifest that is touched or published only after both final outputs are in place.
The important thing is that the build names the finishing point instead of letting publication leak across multiple partial steps.
Cleanup on failure matters¶
Pipelines that fail mid-run need one more discipline:
- remove temporary artifacts that are not trustworthy
- do not leave behind final outputs that were only partially updated
That usually means the recipe should fail before moving temporary outputs into their final paths, or explicitly remove partial temp state on the way out.
The standard here is not cosmetic tidiness. It is preventing the next build step from treating garbage as truth.
A simple pattern with explicit cleanup¶
build/report.json: data/input.csv scripts/gen_report.py | build/
@python3 scripts/gen_report.py data/input.csv > $@.tmp
@python3 scripts/check_report.py $@.tmp
@mv $@.tmp $@ || { rm -f $@.tmp; exit 1; }
In a real shell recipe you may want clearer trap handling, but the design point is constant:
- invalid or incomplete content should die in temporary space
- the final target path should remain the trustworthy boundary
Pipelines create stage boundaries on purpose¶
Some pipelines legitimately need multiple trusted stages:
- raw generated output
- normalized generated output
- packaged generated bundle
That is fine, but each stage must still answer:
- what file or boundary represents completion here
- who consumes this stage
- what validates it before the next stage trusts it
In other words, a pipeline may have several boundaries, but each one still needs the same honest publication logic.
Failure signatures worth recognizing¶
"The generated file exists, but it was only half-written when the consumer saw it"¶
That means publication happened too early.
"Validation failed, but the final output path still changed"¶
That means the final path stopped being a trustworthy boundary.
"The pipeline leaves a mix of old and new outputs after failure"¶
That means coupled publication is being modeled too loosely.
"Temporary outputs keep leaking into later stages"¶
That usually means temporary space and trusted space were not separated clearly enough.
A review question that improves pipeline design¶
Take one generator pipeline and ask:
- which step first creates intermediate content
- which step validates the content
- which step publishes the trusted outputs
- what happens on failure before publication
- which files or boundary nodes downstream targets are allowed to depend on
If those answers are weak, the pipeline contract is weak too.
What to practice from this page¶
Choose one multi-stage generator in the capstone or your own build and write down:
- the temporary paths
- the validation step
- the publication step
- the cleanup behavior on failure
- the exact output path or boundary file consumers should trust
If you can explain those without hand-waving, the pipeline has a real publication contract.
End-of-page checkpoint¶
Before leaving this lesson, make sure you can explain:
- why pipeline generation and publication are not automatically the same moment
- why temporary paths help keep partial work out of trusted output paths
- what atomic publication means in practical build terms
- why validation should happen before downstream trust
- how to tell whether a pipeline leaves behind untrustworthy partial outputs