Worked Example: Investigating a Slow and Noisy Build¶
The five core lessons in Module 09 make the most sense when they appear in one incident that feels familiar:
- the build still works
- but it suddenly feels slower
- the trace output is painful to use
- and the team is not sure whether the problem is in Make, the tools, or the observability surface itself
This example starts from exactly that situation.
The incident¶
Assume you inherit a repository with these complaints:
- a normal make all route now feels slow compared with last month
- make --trace all produces far more output than the team can comfortably inspect
- one maintainer says the problem is parse-time shelling out
- another says the compiler got slower
- a third person wants to "optimize" by suppressing rebuilds and removing diagnostics
That is a realistic operational moment. The goal is to sort it out without guessing.
The starting habits¶
The repository currently has patterns like:
SRCS := $(shell find src -name '*.c' | sort)
TOOLS := $(shell command -v python3)
.PHONY: trace-count
trace-count:
	@make --trace all 2>&1 | wc -l
And the team has fallen into these debugging habits:
- rerun the build several times
- sometimes clean first
- sometimes add prints
- sometimes force -j1
That is enough to start the refactor.
Step 1: split the cost question¶
The first repair is not to the Makefile. It is to the diagnosis.
Run:
/usr/bin/time -p make -n all >/dev/null
/usr/bin/time -p make all >/dev/null
make --trace all > build/trace.log
wc -l build/trace.log
Suppose the results are:
That immediately suggests:
- parse/evaluation cost is a large part of the problem
- recipe cost exists but does not dominate
- evidence-surface cost is high too
This is Core 1:
- the complaint "slow build" has been split into real cost layers
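The split can be reduced to one comparable number. The following is a minimal sketch, assuming the output of `/usr/bin/time -p` (which is written to stderr) has been captured. The helper names `real_seconds` and `parse_share` are hypothetical, and treating the dry-run time as a stand-in for parse/evaluation cost is an approximation, since `make -n` also walks the dependency graph.

```shell
#!/bin/sh
# Hypothetical helpers; neither name exists in the repository.

# real_seconds: read `/usr/bin/time -p` output on stdin, print the "real" value.
real_seconds() {
    awk '$1 == "real" { print $2 }'
}

# parse_share DRY_RUN_SECONDS FULL_SECONDS
# Print the dry-run time as a whole-number percentage of the full run.
# Assumption: the dry run approximates parse/evaluation cost.
parse_share() {
    awk -v dry="$1" -v full="$2" 'BEGIN {
        if (full <= 0) { print "n/a"; exit }
        printf "%.0f\n", 100 * dry / full
    }'
}

# Possible usage (runs the real build, so it is left commented out):
#   dry=$( { /usr/bin/time -p make -n all >/dev/null; } 2>&1 | real_seconds )
#   full=$( { /usr/bin/time -p make all >/dev/null; } 2>&1 | real_seconds )
#   echo "dry run is $(parse_share "$dry" "$full")% of the full run"
```

A high percentage points at parse/evaluation work; a low one points at the recipes and the tools they call.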
Step 2: inspect the observability surface¶
The huge trace count is a problem in its own right, but do not "fix" it by hiding everything.
Instead ask:
- which output is high-value evidence
- which output is routine noise
- whether a named diagnostic route would be better than letting every route spill verbose data
Maybe the repository currently has scattered debug echoes inside several recipes. The repair could be:
- remove the always-on prints
- keep --trace for causality
- keep a bounded trace-count target
- add one discovery audit target for resolved source lists
This is Core 2:
- observability becomes intentional
- evidence remains available
- semantic outputs stay clean
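One way to separate high-value evidence from routine noise is to classify the lines of a captured trace. The sketch below is hypothetical (`trace_summary` is not part of the repository) and assumes causal lines contain the phrase "update target", which is how recent GNU Make versions word their `--trace` messages; older versions word some of them differently.

```shell
#!/bin/sh
# trace_summary TRACE_LOG
# Hypothetical helper: count causal trace lines versus everything else.
# Assumption: causal lines contain "update target" (recent GNU Make wording).
trace_summary() {
    awk '
        /update target/ { causal++; next }   # why a target is being rebuilt
        { other++ }                          # echoed commands, debug prints, tool output
        END { printf "causal=%d other=%d\n", causal, other }
    ' "$1"
}

# Possible usage:
#   trace_summary build/trace.log
```

If the "other" count dwarfs the causal count, the always-on prints are the likely source of the noise rather than `--trace` itself.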
Step 3: follow the incident ladder¶
The next question is whether the slow behavior comes from the graph model or from the environment.
A calm triage ladder now looks like:
- confirm the symptom with timings
- preview with make -n all
- explain one route with make --trace all
- inspect the evaluated world with make -p > build/make.dump
- classify the likely boundary
This keeps the team from jumping straight to edits such as:
- removing prerequisites
- disabling diagnostics
- forcing serial mode
This is Core 3.
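The preview, explain, and inspect rungs of the ladder could be wrapped so they always run in the same order and leave their evidence behind. This is a sketch only: `triage_ladder` and the `MAKE_CMD` override are hypothetical names, and the sketch adds `-n` to the trace and dump steps so that collecting evidence does not change the build tree, which differs slightly from running `make --trace all` and `make -p` directly.

```shell
#!/bin/sh
# triage_ladder [OUT_DIR]
# Hypothetical wrapper: run the ladder in a fixed order and keep each artifact.
# MAKE_CMD is an assumed override hook for pointing at a different make binary.
triage_ladder() {
    make_cmd=${MAKE_CMD:-make}
    out=${1:-build}
    mkdir -p "$out" || return 1

    # Preview: what would run, without running recipes.
    "$make_cmd" -n all > "$out/dry-run.log" 2>&1 || echo "dry run failed" >&2

    # Explain: why each target is considered out of date.
    "$make_cmd" --trace -n all > "$out/trace.log" 2>&1 || echo "trace failed" >&2

    # Inspect: dump the evaluated database; -n keeps recipes from executing.
    "$make_cmd" -p -n all > "$out/make.dump" 2>&1 || echo "dump failed" >&2

    # Classify: left to the responder; report how much evidence was collected.
    wc -l "$out/dry-run.log" "$out/trace.log" "$out/make.dump"
}

# Possible usage:
#   triage_ladder build
```

The timing rung stays a separate, deliberate step, because timing a run that is also writing logs would blur the measurement.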
Step 4: identify a truth-preserving optimization¶
The parse-time cost clue points to the discovery line from the starting Makefile:
SRCS := $(shell find src -name '*.c' | sort)
Maybe this is repeated across multiple files.
A truth-preserving optimization is to move source listing into one stable script or manifest boundary instead of shelling out repeatedly during parse:
build/discovery.manifest: scripts/list_sources.py src/ | build/
	@python3 scripts/list_sources.py src > $@.tmp
	@cmp -s $@.tmp $@ 2>/dev/null || mv $@.tmp $@
	@rm -f $@.tmp
Now the build can:
- reduce repeated parse-time shell work
- keep discovery explicit
- preserve the truth boundary
This is Core 4:
- the tuning removes waste
- it does not hide inputs or suppress necessary rebuilds
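The replace-only-when-changed idea behind the manifest can be tried outside Make. The sketch below is a hypothetical shell stand-in, not the contents of `scripts/list_sources.py`: the function name `write_manifest`, the `*.c` pattern, and the `.tmp` suffix are assumptions made for illustration.

```shell
#!/bin/sh
# write_manifest SRC_DIR MANIFEST
# Hypothetical stand-in for the discovery step: list *.c files in a stable
# order, then replace MANIFEST only when the list actually changed.
write_manifest() {
    src=$1
    manifest=$2
    tmp="$manifest.tmp"

    # LC_ALL=C pins the sort order so the manifest does not vary by locale.
    find "$src" -name '*.c' | LC_ALL=C sort > "$tmp" || return 1

    if cmp -s "$tmp" "$manifest" 2>/dev/null; then
        rm -f "$tmp"            # same content: keep the old file and its mtime
        echo unchanged
    else
        mv "$tmp" "$manifest"   # new or different content: publish it
        echo updated
    fi
}

# Possible usage:
#   write_manifest src build/discovery.manifest
```

Leaving the timestamp alone when nothing changed is what keeps downstream targets from rebuilding on every discovery run, while a genuinely new source file still propagates.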
Step 5: write the runbook the team actually needed¶
At this point the team has a much better understanding, but the real operational win is to capture it.
A useful runbook note might now say:
- measure make -n all and make all
- capture --trace into build/trace.log
- compare trace line count against the expected band
- if parse cost dominates, inspect discovery and repeated shell work
- do not disable diagnostics or drop inputs as a first performance response
This is Core 5:
- the knowledge leaves the maintainer's head
- the next responder inherits a stable first move
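The "expected band" item in the runbook could be made mechanical. This is a minimal sketch under assumptions: `check_trace_band` is a hypothetical name, and the band limits are arguments because the source does not state concrete values.

```shell
#!/bin/sh
# check_trace_band TRACE_LOG MIN_LINES MAX_LINES
# Hypothetical runbook check: is the trace size inside the band the team expects?
# Returns 0 inside the band, 1 outside it, 2 if the log cannot be read.
check_trace_band() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }

    # tr strips the padding some wc implementations add around the count.
    lines=$(wc -l < "$1" | tr -d ' ')

    if [ "$lines" -lt "$2" ] || [ "$lines" -gt "$3" ]; then
        echo "trace lines=$lines outside band [$2, $3]"
        return 1
    fi
    echo "trace lines=$lines within band [$2, $3]"
}

# Possible usage (the limits here are placeholders, not recommendations):
#   check_trace_band build/trace.log 50 500
```

A result outside the band is a prompt to look at what changed in the evidence surface, not a reason to silence it.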
The repaired sketch¶
The repository is now closer to this operational model:
build/discovery.manifest: scripts/list_sources.py src/ | build/
	@python3 scripts/list_sources.py src > $@.tmp
	@cmp -s $@.tmp $@ 2>/dev/null || mv $@.tmp $@
	@rm -f $@.tmp

.PHONY: trace-count
trace-count:
	@make --trace -n all 2>&1 | wc -l

.PHONY: discovery-audit
discovery-audit: build/discovery.manifest
	@cat build/discovery.manifest
And the team has a clearer incident sequence:
- measure
- trace
- inspect
- classify
- then change
That is a much healthier system than one where everyone reaches for a different trick.
What each core contributed¶
flowchart TD
symptom["Slow and noisy build complaint"] --> measure["Core 1: separate cost layers"]
measure --> observe["Core 2: keep evidence useful"]
observe --> triage["Core 3: fixed incident ladder"]
triage --> tune["Core 4: remove waste without lying"]
tune --> runbook["Core 5: transfer the response path"]
runbook --> result["Operationally understandable build"]
This is why the module is organized as five cores and then one worked example. The example is where the operational advice becomes a reusable incident story.
What you should say at the end¶
A strong summary sounds like this:
The build felt slow, but measurement showed the main cost was parse and evaluation rather than recipe execution. The observability surface was also too noisy to use well. We replaced repeated parse-time shell work with a truthful manifest boundary, kept a bounded set of diagnostic routes, and wrote a runbook that teaches the next responder how to measure, trace, inspect, and classify before editing the build.
That is much stronger than "we optimized the Makefiles."
What to practice after this example¶
Take one real build complaint and retell it in the same order:
- split the cost or symptom into layers
- inspect the current evidence surface
- follow a fixed triage ladder
- identify one truth-preserving optimization
- capture the response path in a short runbook
If you can do that cleanly, Module 09 has started to change how you respond under build pressure.