Worked Example: Repairing a Broken Generator Pipeline¶
Page Maps¶
graph LR
family["Reproducible Research"]
program["Deep Dive Make"]
section["Generated Files Multi Output Pipeline Boundaries"]
page["Worked Example: Repairing a Broken Generator Pipeline"]
capstone["Capstone evidence"]
family --> program --> section --> page
page -.applies in.-> capstone
flowchart LR
orient["Orient on the page map"] --> read["Read the main claim and examples"]
read --> inspect["Inspect the related code, proof, or capstone surface"]
inspect --> verify["Run or review the verification path"]
verify --> apply["Apply the idea back to the module and capstone"]
The five core lessons in Module 06 are easiest to trust when they all show up inside one generator incident that feels real.
This example starts with a build that "mostly works":
- it generates files locally
- it sometimes duplicates work under
-j - it leaves confusing manifests behind
- and when it fails, the team is no longer sure which files are trustworthy
That is the exact moment where code generation stops feeling like a convenience and starts feeling like a correctness problem.
The incident¶
Assume you inherit a small pipeline that produces:
build/include/api.hbuild/api.jsonbuild/api.manifest
from:
schema/api.ymlscripts/gen_api.py- the build mode
MODE
The team reports four symptoms:
- a schema edit sometimes leaves the header stale
make -j2 alloccasionally prints the generator log twice- the manifest changes every run
- a failed generation sometimes leaves one final output updated and another stale
That is enough to begin. No guessing yet.
The starting build sketch¶
The inherited Makefile looks like this:
MODE ?= release
api.h api.json: schema/api.yml scripts/gen_api.py
@python3 scripts/gen_api.py schema/api.yml
api.manifest:
@date > $@
main.o: src/main.c scripts/gen_api.py
$(CC) -Ibuild/include -c $< -o $@
all: api.h api.json api.manifest main.o
Every line here is plausible. That is why the example is useful.
Step 1: identify the first lie¶
Look at the consumer edge:
This tells Make that the compile step cares directly about the generator script. In reality the compile step cares about the published header.
That is the first repair:
This is Core 1 in action:
- the generated header is a real graph target
- the object file depends on what it actually reads
- producer internals and consumer content are separated again
Step 2: explain the duplicate execution¶
The multi-output rule is:
That is the classic loose model of one coupled generation event.
Under -j, the team sees:
The repair is not to blame parallelism. The repair is to name the single publication unit.
One honest repair is a stamp:
API_GEN_STAMP := build/api.stamp
$(API_GEN_STAMP): schema/api.yml scripts/gen_api.py | build/
@python3 scripts/gen_api.py schema/api.yml --out-dir build/tmp
@mv build/tmp/api.h build/include/api.h
@mv build/tmp/api.json build/api.json
@touch $@
build/include/api.h build/api.json: $(API_GEN_STAMP)
This is Core 2:
- one event owns both outputs
- the graph now has one completion point
- duplicate execution is no longer left to chance
Step 3: repair the manifest boundary¶
The old manifest rule is:
That file does not represent build meaning. It represents clock noise.
A healthier manifest might record the facts that actually define the generated set:
build/api.manifest: schema/api.yml | build/
@printf 'schema=schema/api.yml\nmode=%s\n' '$(MODE)' > $@.tmp
@cmp -s $@.tmp $@ 2>/dev/null || mv $@.tmp $@
@rm -f $@.tmp
Now the file has a real role:
- it describes the generator boundary
- it changes only when the boundary meaning changes
- it can participate honestly in convergence
This is Core 3:
- manifests should represent a boundary fact
- they should converge
- they should not replace direct content edges where those edges are still needed
Step 4: fix early publication¶
The original generator wrote directly into final paths. That makes partial failure dangerous.
Suppose the pipeline now becomes:
$(API_GEN_STAMP): schema/api.yml scripts/gen_api.py | build/ build/include/
@python3 scripts/gen_api.py schema/api.yml --out-dir build/tmp
@python3 scripts/validate_api.py build/tmp/api.h build/tmp/api.json
@mv build/tmp/api.h build/include/api.h
@mv build/tmp/api.json build/api.json
@touch $@
This is much stronger because:
- validation happens before final publication
- temporary work stays outside trusted output paths
- the stamp is touched only after the full pipeline succeeded
This is Core 4:
- publication happens after success
- downstream trust begins at a named boundary
- partial outputs stop pretending to be finished work
Step 5: run the failure-mode loop¶
Now take the original four symptoms and classify them:
- stale header after schema edit likely class: missing semantic input or wrong consumer edge
- duplicate generator log under
-jlikely class: dishonest multi-output publication unit - manifest changes every run likely class: unstable boundary file
- one output updated after failure likely class: early publication bug
This is why Core 5 exists. The classifications stop the repair from turning into random shell edits.
The repaired sketch¶
After the hardening pass, the build is closer to this:
MODE ?= release
API_GEN_STAMP := build/api.stamp
build/:
mkdir -p $@
build/include/:
mkdir -p $@
build/api.manifest: schema/api.yml | build/
@printf 'schema=schema/api.yml\nmode=%s\n' '$(MODE)' > $@.tmp
@cmp -s $@.tmp $@ 2>/dev/null || mv $@.tmp $@
@rm -f $@.tmp
$(API_GEN_STAMP): schema/api.yml scripts/gen_api.py build/api.manifest | build/ build/include/
@python3 scripts/gen_api.py schema/api.yml --out-dir build/tmp
@python3 scripts/validate_api.py build/tmp/api.h build/tmp/api.json
@mv build/tmp/api.h build/include/api.h
@mv build/tmp/api.json build/api.json
@touch $@
build/include/api.h build/api.json: $(API_GEN_STAMP)
main.o: src/main.c build/include/api.h
$(CC) -Ibuild/include -c $< -o $@
all: build/include/api.h build/api.json build/api.manifest main.o
This is not fancy. It is simply much more truthful.
What each core contributed¶
flowchart TD
symptom["Generator symptoms"] --> files["Core 1: generated file is a real target"]
files --> multi["Core 2: coupled outputs need one publication event"]
multi --> boundary["Core 3: manifest names a real boundary fact"]
boundary --> publish["Core 4: trust begins after full publication"]
publish --> repair["Core 5: classify the failure and repair the graph"]
repair --> stable["Convergent generator pipeline"]
This is why the module is organized into five cores and then one worked example. The example is where the module becomes operational.
What you should say at the end¶
A strong summary sounds like this:
The pipeline was broken in four different ways: the consumer edge skipped the generated header, the coupled outputs lacked one clear publication event, the manifest recorded unstable noise, and final outputs were published before validation completed. We repaired the graph by restoring direct consumer edges, introducing one generation boundary, making the manifest convergent, and publishing only after the full pipeline succeeded.
That summary is much stronger than "the generator was flaky."
What to practice after this example¶
Take one real generator incident and retell it in the same order:
- state the symptoms precisely
- identify the first graph lie
- name the publication unit
- decide whether any stamp or manifest is justified
- state where trust begins
- rerun convergence and a parallel check
If you can do that cleanly, Module 06 has started to change how you think about generation.