File APIs, Schemas, and Public Contracts¶
Page Maps¶
graph LR
family["Reproducible Research"]
program["Deep Dive Snakemake"]
section["Scaling Workflows Interface Boundaries"]
page["File APIs, Schemas, and Public Contracts"]
capstone["Capstone evidence"]
family --> program --> section --> page
page -.applies in.-> capstone
flowchart LR
orient["Orient on the page map"] --> read["Read the main claim and examples"]
read --> inspect["Inspect the related code, proof, or capstone surface"]
inspect --> verify["Run or review the verification path"]
verify --> apply["Apply the idea back to the module and capstone"]
Repository growth becomes dangerous when people cannot answer one basic question:
which files are part of the contract, and which ones are only internal coordination state?
This page is about making that answer explicit.
A file API is a human-facing contract¶
A file API says:
- which paths another consumer may rely on
- what those files mean
- which changes count as interface breaks
That is why FILE_API.md matters in the capstone. It is not decorative documentation. It
is the human description of the stable file boundary.
Not every output is public¶
Workflows create many useful files:
- intermediate processing outputs
- discovered-set artifacts
- logs
- benchmarks
- published summaries
Those files do not all have the same status.
A healthy repository distinguishes at least two classes:
- internal execution state
- public or downstream-facing contract files
If everything under results/ is treated as public, every rename or layout change inside
the workflow becomes a potential downstream break, and the repository becomes harder to
change safely.
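The two-class distinction can be made checkable. The sketch below assumes, for illustration only, that `publish/v1` is the single declared public root and that everything else is internal execution state; `PUBLIC_ROOTS`, `is_public`, and `split_outputs` are hypothetical names, not part of the capstone.

```python
from pathlib import PurePosixPath

# Assumption: the only declared public root is publish/v1.
PUBLIC_ROOTS = (PurePosixPath("publish/v1"),)


def is_public(path: str) -> bool:
    """Return True if the path is a declared public root or sits under one."""
    p = PurePosixPath(path)
    return any(p == root or root in p.parents for root in PUBLIC_ROOTS)


def split_outputs(paths):
    """Partition workflow outputs into (public, internal) lists."""
    public = [p for p in paths if is_public(p)]
    internal = [p for p in paths if not is_public(p)]
    return public, internal


# Illustrative output paths; the file names are made up for this example.
outputs = [
    "results/sample1/trimmed.fastq",
    "logs/trim/sample1.log",
    "publish/v1/summary.json",
]
public, internal = split_outputs(outputs)
```

Matching on path components rather than string prefixes means a sibling directory such as `publish/v10` is not mistaken for part of `publish/v1`.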
What belongs in a file API¶
A useful file API usually answers:
- the stable path or path family
- the semantics of that output
- whether ordering, schema, or naming rules matter
- what kind of change would require versioning or explicit review
This keeps the discussion concrete.
Weak file API note:
reports live in publish.
Stronger file API note:
publish/v1/summary.json is the stable machine-readable run summary; changing its keys or meaning requires explicit interface review.
The difference is contract precision.
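One way to see the difference is to treat each file API note as a small record and ask whether it names a concrete path, a meaning, and at least one break condition. This is only a sketch: the `FileApiEntry` class, its fields, and the crude "contains a slash" test for path precision are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileApiEntry:
    """One documented public path. All field names are illustrative."""

    path: str                      # stable path or path family
    meaning: str                   # what the file means to a consumer
    breaking_changes: tuple = ()   # changes that require interface review

    def is_precise(self) -> bool:
        # Crude heuristic: a path more specific than a bare directory name,
        # a stated meaning, and at least one named break condition.
        return "/" in self.path and bool(self.meaning) and bool(self.breaking_changes)

    def render(self) -> str:
        rules = "; ".join(self.breaking_changes) or "unspecified"
        return f"{self.path}: {self.meaning} (interface review required for: {rules})"


weak = FileApiEntry(path="publish", meaning="reports live here")
strong = FileApiEntry(
    path="publish/v1/summary.json",
    meaning="stable machine-readable run summary",
    breaking_changes=("changing its keys", "changing its meaning"),
)
```

Here `weak.is_precise()` is false because the entry names neither a specific path nor any break condition, while `strong.is_precise()` is true.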
Schemas protect interface trust¶
Schemas matter because they move interface failure earlier.
Without validation:
- the workflow may emit a structurally wrong config or artifact
- the problem appears later as vague runtime breakage
- the interface boundary becomes harder to review
With validation:
- a malformed boundary fails close to the source
- the repository can explain which interface was violated
That is not bureaucracy. It is scaling discipline.
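A minimal standard-library sketch of "failing close to the source" follows. The key names in `SUMMARY_SCHEMA` are hypothetical, since this page does not define the contents of the summary file, and a real workflow would more likely use a JSON Schema file (for config, Snakemake provides `snakemake.utils.validate`). The point is only that the error names the violated interface.

```python
# Hypothetical schema: key name -> expected Python type.
SUMMARY_SCHEMA = {"run_id": str, "n_samples": int, "samples": list}


class InterfaceViolation(ValueError):
    """Raised when a document does not match its declared interface."""


def validate_summary(doc, schema=SUMMARY_SCHEMA, interface="publish/v1/summary.json"):
    """Check required keys and types; raise an error that names the interface."""
    problems = []
    for key, expected in schema.items():
        if key not in doc:
            problems.append(f"missing key '{key}'")
        elif not isinstance(doc[key], expected):
            problems.append(
                f"key '{key}' should be {expected.__name__}, "
                f"got {type(doc[key]).__name__}"
            )
    if problems:
        raise InterfaceViolation(f"{interface}: " + "; ".join(problems))
    return doc
```

Calling this in the step that writes the file, rather than in whichever tool reads it later, is what moves the failure earlier.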
One healthy interface stack¶
flowchart TD
config["config.yaml"] --> schema["schema validation"]
workflow["workflow outputs"] --> fileapi["FILE_API.md"]
fileapi --> publish["publish boundary"]
schema --> trust["reviewable interface"]
publish --> trust
This model matters because human trust comes from both:
- machine-checked structure
- human-readable contract meaning
You usually need both at scale.
Public paths should be smaller than repository state¶
A common scaling mistake is to let the public contract grow accidentally:
- one notebook starts reading results/
- another tool relies on a helper TSV
- a teammate assumes logs are part of the downstream interface
This is how internal state becomes accidental API.
The repair is to keep the public boundary intentionally smaller and documented.
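Accidental dependencies of this kind can sometimes be caught mechanically. The sketch below scans a consumer's source text for quoted path literals that point into internal directories. The prefixes `logs/` and `benchmarks/` are assumed directory names, and the quoted-string heuristic will miss paths that are built dynamically.

```python
import re

# Assumption: these directories hold internal state, not public contract.
INTERNAL_PREFIXES = ("results/", "logs/", "benchmarks/")

# Matches simple single- or double-quoted string literals.
_QUOTED = re.compile(r"""["']([^"']+)["']""")


def internal_references(source: str) -> list:
    """Return quoted literals in source that start with an internal prefix."""
    hits = []
    for match in _QUOTED.finditer(source):
        literal = match.group(1)
        if literal.startswith(INTERNAL_PREFIXES):
            hits.append(literal)
    return hits


# Illustrative consumer code; read_table and load_json are made-up calls.
notebook_cell = (
    'helper = read_table("results/helper.tsv")\n'
    'summary = load_json("publish/v1/summary.json")\n'
)
flagged = internal_references(notebook_cell)
```

In this example only the `results/helper.tsv` read is flagged; the read from `publish/v1/` stays within the documented boundary.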
Common failure modes¶
| Failure mode | What it looks like | Better repair |
|---|---|---|
| internal results/ paths are treated as downstream contract | every refactor feels dangerous | define a smaller documented public boundary |
| FILE_API.md is vague | reviewers cannot tell what is stable | document paths, semantics, and break conditions clearly |
| schemas exist only for config, not key external artifacts | output boundaries fail late | validate the interface surfaces that matter most |
| logs or benchmarks become accidental API | diagnostics become harder to evolve | keep evidence separate from the public contract |
| versioned publish paths change casually | downstream trust drifts silently | require explicit interface review or version bumps |
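The last row can be backed by a small check that compares two versions of a published contract. The sketch assumes a contract is described as a mapping from key name to type name, and it adopts one possible policy: removed or retyped keys are breaking, added keys are not. Both the representation and the policy are illustrative assumptions, not the capstone's rules.

```python
def diff_contract(old: dict, new: dict) -> dict:
    """Compare two key -> type-name mappings and flag breaking changes."""
    removed = sorted(k for k in old if k not in new)
    retyped = sorted(k for k in old if k in new and old[k] != new[k])
    added = sorted(k for k in new if k not in old)
    return {
        "removed": removed,
        "retyped": retyped,
        "added": added,
        # Assumed policy: additions alone do not break existing consumers.
        "breaking": bool(removed or retyped),
    }


# Hypothetical before/after descriptions of a published summary.
v1 = {"run_id": "str", "n_samples": "int"}
candidate = {"run_id": "str", "n_samples": "str", "tool_version": "str"}
report = diff_contract(v1, candidate)
```

A result with `breaking` set to true is the signal to stop and either open an interface review or publish under a new versioned path.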
The explanation a reviewer trusts¶
Strong explanation:
the workflow keeps internal coordination state under results/, but the stable downstream contract lives under publish/v1/ and is described in FILE_API.md; schema validation protects config and key structured interfaces so repository growth does not turn internal files into accidental public API.
Weak explanation:
the important files are the ones we usually look at after the run.
The strong version defines a contract. The weak version defines a habit.
End-of-page checkpoint¶
Before leaving this page, you should be able to:
- distinguish internal workflow state from public file contracts
- describe what a good file API must say explicitly
- explain why schemas and validation support scaling rather than merely formalize it
- name one way accidental public APIs emerge in growing repositories