Manifests, Checksums, and Bundle Integrity¶
Page Maps¶
graph LR
family["Reproducible Research"]
program["Deep Dive Snakemake"]
section["Publishing Downstream Contracts"]
page["Manifests, Checksums, and Bundle Integrity"]
capstone["Capstone evidence"]
family --> program --> section --> page
page -.applies in.-> capstone
flowchart LR
orient["Orient on the page map"] --> read["Read the main claim and examples"]
read --> inspect["Inspect the related code, proof, or capstone surface"]
inspect --> verify["Run or review the verification path"]
verify --> apply["Apply the idea back to the module and capstone"]
Publishing is not finished when files are copied into publish/v1/.
The bundle also needs to defend itself.
That means a reviewer or downstream user should be able to answer questions like:
- what files belong in this bundle?
- how do I know whether the bundle is complete?
- how can I detect whether a file changed?
- where do I look for machine-checkable evidence versus human explanation?
This page is about the evidence surfaces that make a publish bundle reviewable.
A bundle should explain itself¶
The capstone publish bundle includes:
manifest.jsonprovenance.jsonsummary.jsonsummary.tsvreport/index.htmldiscovered_samples.json
Those files do not all do the same job.
One of the most common workflow mistakes is assuming that one file can answer every question.
It usually cannot.
What a manifest is good at¶
A manifest is strongest when it answers:
- which files are part of the published bundle
- how those files are named relative to the bundle root
- how to validate their identity
In the capstone, manifest.json records published paths and SHA-256 digests.
That is valuable because it makes the bundle inventory explicit instead of implied.
What checksums are good at¶
Checksums answer a narrower but essential question:
is this file the same file I expect?
They help detect:
- accidental corruption
- incomplete transfers
- silent replacement of one artifact with another
Checksums are not a substitute for meaning. They are evidence of identity.
What a manifest does not prove by itself¶
A manifest does not fully answer:
- what a file means
- whether the right files were published in the first place
- what software context produced those files
- whether the contract should have been versioned differently
That is why manifests need companion surfaces such as:
- a file API or contract note
- provenance
- review judgment about the publish boundary
One useful separation¶
flowchart LR
manifest["manifest.json"] --> inventory["which files belong here"]
checksums["sha256 entries"] --> identity["whether each file matches"]
provenance["provenance.json"] --> context["what software context produced the bundle"]
file_api["file API or contract doc"] --> meaning["what each published file means"]
This separation matters because reviewers often ask the right question but check the wrong artifact.
A weak integrity story¶
Weak shape:
- the bundle has files but no explicit inventory
- a report exists, so maintainers assume the bundle is self-explanatory
- checksums are absent, so transfers and comparisons rely on trust alone
This leaves a reviewer with no compact way to validate the bundle.
A stronger integrity story¶
Stronger shape:
- the manifest names the published files
- checksum entries let others validate identity
- provenance records software context
- contract documentation explains how to interpret the artifacts
Now the bundle can answer more than one kind of question without forcing one file to do everything.
A practical review loop¶
When reviewing a publish bundle, ask:
- Does the manifest enumerate the expected public artifacts?
- Do the checksum entries cover the files the bundle promises?
- Is provenance present for software-context questions?
- Is there a document or guide that explains artifact meaning and stability?
If the answer to only the first question is yes, the integrity story is still incomplete.
Common failure modes¶
| Failure mode | Why it hurts | Better repair |
|---|---|---|
| manifest lists files but no digests | bundle completeness is visible but file identity is weak | include checksums for published artifacts |
| report is treated as the integrity surface | bundle validation becomes manual and human-only | keep machine-checkable inventory and digests separate |
| provenance is omitted because manifest exists | software context becomes untraceable | pair manifest integrity with provenance evidence |
| manifest includes internal or unstable files | public boundary becomes noisy and brittle | inventory only the intentional publish contract |
| contract meaning lives only in author memory | downstream users cannot interpret the bundle confidently | document the role of each public artifact explicitly |
The explanation a reviewer trusts¶
Strong explanation:
manifest.jsontells us which files belong inpublish/v1/and gives us digests to verify identity, whileprovenance.jsonexplains the producing software context and the file API explains what each artifact means.
Weak explanation:
the manifest is there, so the bundle is self-documenting.
The strong explanation respects the limits of each artifact. The weak explanation assigns too much responsibility to one file.
End-of-page checkpoint¶
Before leaving this page, you should be able to:
- explain what a manifest proves and what it does not prove
- explain why checksums are about identity rather than meaning
- describe why provenance is a separate integrity surface
- explain why a publish bundle still needs a file API or contract note