Module 09: Performance, Observability, and Incident Response

Module Position

flowchart TD
  family["Reproducible Research"] --> program["Deep Dive Make"]
  program --> module["Module 09: Performance, Observability, and Incident Response"]
  module --> lessons["Lesson pages and worked examples"]
  module --> checkpoints["Exercises and closing criteria"]
  module --> capstone["Related capstone evidence"]

flowchart TD
  purpose["Start with the module purpose and main questions"] --> lesson_map["Use the lesson map to choose reading order"]
  lesson_map --> study["Read the lessons and examples with one review question in mind"]
  study --> proof["Test the idea with exercises and capstone checkpoints"]
  proof --> close["Move on only when the closing criteria feel concrete"]

Read the first diagram as a placement map: this page sits between the course promise, the lesson pages listed below, and the capstone surfaces that pressure-test the module. Read the second diagram as the study route for this page: it points you toward the lesson map, the Exercises, and the Closing Criteria rather than serving as decoration.

By this point the build is correct, layered, and publishable. Module 09 deals with the moment it becomes slow, noisy, or operationally brittle. The point is not premature optimization. The point is to keep a trustworthy build understandable when time pressure hits.

You are not optimizing Makefiles for sport. You are protecting engineering feedback loops.

Capstone exists here as corroboration. The local measurement and incident drills should already tell a coherent story before you inspect the reference build guardrails.

Before You Begin

This module works best after Modules 02-08, when the build is already truthful and the real question is how to keep it understandable under time pressure.

Use this module if you need to learn how to:

tell parse cost from recipe cost and diagnostic noise
build an incident ladder another engineer can follow
tune the system without hiding correctness defects

At a glance

Focus	Learner question	Capstone timing
measurement	"Am I paying parse cost, recipe cost, or observability cost?"	inspect capstone after you can measure one local build clearly
incident triage	"What is the next diagnostic move under pressure?"	use capstone selftests and repros once the ladder is familiar
safe tuning	"Did I remove waste or just hide evidence?"	compare with capstone guardrails after local experiments

Proof loop for this module:

make trace-count
make --trace -n all
make -p > build/make.dump

Capstone corroboration:

run make PROGRAM=reproducible-research/deep-dive-make capstone-incident-audit
run make PROGRAM=reproducible-research/deep-dive-make capstone-discovery-audit
inspect capstone/tests/run.sh for measurement guardrails

This module is successful when the learner can separate symptoms from causes before changing the build.

1) Table of Contents

Table of Contents
Learning Outcomes
How to Use This Module
Core 1 — Measuring Parse Time, Recipe Time, and Trace Volume
Core 2 — Observability for Build Behavior
Core 3 — Incident Triage for Slow or Flaky Builds
Core 4 — Tuning Without Hiding Truth
Core 5 — Building an Operational Runbook
Capstone Sidebar
Exercises
Closing Criteria

2) Learning Outcomes

By the end of this module, you can:

distinguish parse-time cost from recipe cost and graph-shape cost
add observability that helps incident response without changing build semantics
diagnose flaky or slow builds with a repeatable triage ladder
remove wasteful shell-outs, unstable discovery, and churny graph generation
write an operational runbook another engineer can use under pressure

3) How to Use This Module

Take a working build and instrument it with:

one timing measurement for parse or dry-run work
one trace-count or log-volume signal
one reproducible “slow build” scenario
one incident note showing how you isolated the cause

The purpose is to separate symptoms from causes before you optimize anything.

4) Core 1 — Measuring Parse Time, Recipe Time, and Trace Volume

Three different costs are often mixed together:

parse-time work
recipe execution
debug signal volume

Measure them separately. A build that spends its time in $(shell find ...) has a different problem from a build whose compiler step is expensive. A build that emits too much trace may still be correct but operationally unusable during incidents.

Healthy performance work begins with a cost model, not a hunch.
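
One way to keep the three costs apart is to measure each with its own command. The sketch below is a hedged illustration, not the capstone's implementation: `elapsed_ms` and `build_cost_report` are hypothetical helper names, the default target `all` is an assumption, and `date +%s%N` assumes GNU coreutils. A dry run approximates parse and graph-walk cost, the trace line count approximates debug signal volume, and the real run adds recipe cost on top.

```shell
# elapsed_ms CMD [ARGS...]: run a command, discard its output, and print
# the wall-clock time in milliseconds. Assumes GNU date (%N = nanoseconds).
elapsed_ms() {
  local start end
  start=$(date +%s%N)
  "$@" >/dev/null 2>&1 || true   # a failing build is still worth timing
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

# build_cost_report [TARGET]: print one line per cost so they are never mixed.
# The helper names and the default target are hypothetical.
build_cost_report() {
  local target="${1:-all}"
  # Dry run: make parses the Makefile and walks the graph but skips most recipes.
  echo "dry_run_ms=$(elapsed_ms make -n "$target")"
  # Trace volume: how many lines an engineer must read during an incident.
  echo "trace_lines=$(make --trace -n "$target" 2>&1 | wc -l | tr -d ' ')"
  # Real run: dry-run cost plus recipe execution.
  echo "real_run_ms=$(elapsed_ms make "$target")"
}
```

Calling `build_cost_report all` once on a clean tree and once on a built tree gives a rough baseline. Because `make -n` still evaluates `$(shell ...)` calls that run during parsing, a slow dry run points at parse cost rather than recipe cost.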

5) Core 2 — Observability for Build Behavior

Observability in a Make-based system should answer:

what ran
why it ran
which inputs changed
where time went

Good observability surfaces:

--trace
-p
stable manifests
bounded diagnostic targets such as trace-count or discovery-audit

Bad observability surfaces:

ad hoc echo statements scattered through recipes
unstable timestamps mixed into semantic outputs
debug targets that mutate the real build state
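
A bounded diagnostic such as trace-count can be sketched as a filter over trace output that never touches build state. This is a minimal illustration under stated assumptions: `count_trace_updates` and `check_trace_budget` are hypothetical names, and matching on the phrase `update target` is an assumption about the wording of GNU Make's `--trace` output, so adjust the pattern for the version you run.

```shell
# count_trace_updates: read `make --trace` output on stdin and print how many
# lines announce a target update. The "update target" marker is an assumption
# about GNU Make's trace wording.
count_trace_updates() {
  grep -c "update target" || true   # grep exits 1 on zero matches; still print 0
}

# check_trace_budget BUDGET: succeed only if the update count read from stdin
# stays within BUDGET. Reads text only, so it cannot mutate outputs.
check_trace_budget() {
  local budget="$1" count
  count=$(count_trace_updates)
  echo "trace-count: ${count} (budget ${budget})"
  [ "$count" -le "$budget" ]
}
```

A typical call would be `make --trace -n all 2>&1 | check_trace_budget 50`, where the budget of 50 is an arbitrary example. Running against a dry run keeps the diagnostic from building anything.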

6) Core 3 — Incident Triage for Slow or Flaky Builds

Use a fixed triage ladder:

confirm the symptom
reproduce with the same target and inputs
preview with -n
explain with --trace
inspect the evaluated world with -p
isolate whether the defect is graph truth, shell behavior, or environmental drift
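
The middle rungs of the ladder above (preview, explain, inspect) can be captured as files before anyone edits the Makefile. The following is a sketch only: `collect_evidence`, the numbered file names, and the default directory `build/incident` are hypothetical choices, not part of the capstone.

```shell
# collect_evidence [TARGET] [OUT_DIR]: save the output of each diagnostic rung
# so the incident can be reviewed later. All three commands are dry runs.
collect_evidence() {
  local target="${1:-all}" out_dir="${2:-build/incident}"
  mkdir -p "$out_dir"
  # Rung: preview what would run.
  make -n "$target" > "$out_dir/01-dry-run.txt" 2>&1
  echo "dry-run exit status: $?" > "$out_dir/00-status.txt"
  # Rung: explain why each target would run.
  make --trace -n "$target" > "$out_dir/02-trace.txt" 2>&1
  # Rung: inspect the evaluated variables and rules.
  make -pn "$target" > "$out_dir/03-database.txt" 2>&1
  echo "evidence written to $out_dir"
}
```

Keeping the rungs in separate numbered files makes it easier for a second engineer to follow the same order and to attach the directory to an incident note.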

Most “Make is flaky” incidents are really one of these:

unstable discovery
missing or dishonest prerequisites
shared output paths
parse-time shelling out on every invocation
a build helper that silently changed behavior
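
Unstable discovery and parse-time drift often show up as differences between two consecutive evaluations of the same Makefile. A hedged way to check is to dump the evaluated database twice and compare. In this sketch `normalized_dump` and `dump_is_stable` are hypothetical names, and filtering lines that contain `Make data base` assumes GNU Make's dated header and footer comments in `-p` output.

```shell
# normalized_dump [MAKE_ARGS...]: print make's evaluated database without
# running recipes, dropping the dated header/footer lines (assumed wording).
normalized_dump() {
  make -pn "$@" 2>/dev/null | grep -v 'Make data base'
}

# dump_is_stable [MAKE_ARGS...]: succeed if two back-to-back evaluations
# produce the same database. A failure suggests parse-time drift.
dump_is_stable() {
  local first second
  first=$(normalized_dump "$@")
  second=$(normalized_dump "$@")
  [ "$first" = "$second" ]
}
```

If `dump_is_stable all` fails, diffing the two dumps usually narrows the search to the variable whose value changed. Some differences may be legitimate, so treat a failure as a pointer, not a verdict.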

7) Core 4 — Tuning Without Hiding Truth

Allowed optimizations:

cache expensive discovery behind truthful manifests
move repeated shell work into generator scripts with explicit inputs
reduce trace volume while preserving diagnostic targets
simplify graph generation when expansion churn becomes the real bottleneck

Forbidden optimizations:

fake ordering edges that mask a race instead of declaring the real dependency
skipping rebuilds by hiding inputs
mutable temp files shared across targets
removing diagnostics because they reveal a real issue

Fast wrong builds are still wrong.
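
The first allowed optimization, caching discovery behind a truthful manifest, can be sketched as a generator script with explicit inputs. This is an illustration under assumptions: `update_manifest` is a hypothetical name, the `*.c` pattern stands in for whatever the build actually discovers, and the manifest path is chosen by the caller. The manifest is rewritten only when its content changes, so its timestamp changes exactly when discovery changes.

```shell
# update_manifest SRC_DIR MANIFEST: write a sorted list of discovered sources.
# The *.c pattern is an assumed example of what a build might discover.
update_manifest() {
  local src_dir="$1" manifest="$2" tmp
  tmp="${manifest}.tmp.$$"
  # LC_ALL=C gives byte-order sorting that does not depend on the user's locale.
  find "$src_dir" -type f -name '*.c' | LC_ALL=C sort > "$tmp"
  if [ -f "$manifest" ] && cmp -s "$tmp" "$manifest"; then
    rm -f "$tmp"            # unchanged: keep the old timestamp
  else
    mv "$tmp" "$manifest"   # changed: replace in one rename
  fi
}
```

A rule can then depend on the manifest file, which turns a discovery change into an ordinary prerequisite change instead of a parse-time side effect.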

8) Core 5 — Building an Operational Runbook

A mature build should ship with a runbook that answers:

how to verify convergence
how to compare serial and parallel outputs
how to inspect discovery and variable provenance
how to collect evidence before editing the Makefile
when to escalate from repair to migration

If this knowledge lives only in one maintainer's head, the build is not operationally healthy no matter how elegant the Makefile looks.
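
Two of the runbook questions above, convergence and serial-versus-parallel comparison, lend themselves to small scripted checks. The sketch below is hedged: `is_converged` and `tree_digest` are hypothetical names, `sha256sum` and `sort -z` assume GNU coreutils, and the output directory is whatever your build uses.

```shell
# is_converged [TARGET]: succeed if make reports nothing left to do.
# `make -q` (question mode) exits 0 only when the target is up to date.
is_converged() {
  make -q "${1:-all}"
}

# tree_digest DIR: print one hash summarizing every file under DIR.
# Paths are relative and byte-sorted so two directories can be compared.
tree_digest() {
  (
    cd "$1" &&
    find . -type f -print0 | LC_ALL=C sort -z |
      xargs -0 sha256sum | sha256sum | cut -d' ' -f1
  )
}
```

A runbook entry might then read: build serially, record `tree_digest` of the output directory, clean, build with `-j`, and compare the two digests. A mismatch is evidence to collect, not yet a diagnosis.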

9) Capstone Sidebar

Use the capstone to inspect:

trace-count, discovery-audit, and selftest surfaces
the balance between proof artifacts and human-readable diagnostics
performance guardrails in tests and comments
repros that turn incident patterns into repeatable learning

10) Exercises

Measure one expensive parse-time habit and replace it with a truthful manifest or script boundary.
Add one bounded observability target that helps explain rebuilds without mutating outputs.
Write a short incident runbook for a flaky -j failure and prove it on a repro.
Reduce trace or shell churn without changing graph semantics.

11) Closing Criteria

You pass this module only if you can demonstrate:

a repeatable measurement of build cost
at least one observability surface that helps incident response
a documented triage ladder for slow or flaky builds
one optimization that preserves truth while improving feedback time

Directory glossary

Use Glossary when you want the recurring language in this module kept stable while you move between lessons, exercises, and capstone checkpoints.
