Manifests, Checksums, and Bundle Integrity¶

Page Maps¶

graph LR
  family["Reproducible Research"]
  program["Deep Dive Snakemake"]
  section["Publishing Downstream Contracts"]
  page["Manifests, Checksums, and Bundle Integrity"]
  capstone["Capstone evidence"]

  family --> program --> section --> page
  page -.applies in.-> capstone

flowchart LR
  orient["Orient on the page map"] --> read["Read the main claim and examples"]
  read --> inspect["Inspect the related code, proof, or capstone surface"]
  inspect --> verify["Run or review the verification path"]
  verify --> apply["Apply the idea back to the module and capstone"]

Publishing is not finished when files are copied into publish/v1/.

The bundle also needs to defend itself.

That means a reviewer or downstream user should be able to answer questions like:

what files belong in this bundle?
how do I know whether the bundle is complete?
how can I detect whether a file changed?
where do I look for machine-checkable evidence versus human explanation?

This page is about the evidence surfaces that make a publish bundle reviewable.

A bundle should explain itself¶

The capstone publish bundle includes:

manifest.json
provenance.json
summary.json
summary.tsv
report/index.html
discovered_samples.json

Those files do not all do the same job.

One of the most common workflow mistakes is assuming that one file can answer every question.

It usually cannot.

What a manifest is good at¶

A manifest is strongest when it answers:

which files are part of the published bundle
how those files are named relative to the bundle root
how to validate their identity

In the capstone, manifest.json records published paths and SHA-256 digests.

That is valuable because it makes the bundle inventory explicit instead of implied.

What checksums are good at¶

Checksums answer a narrower but essential question:

is this file the same file I expect?

They help detect:

accidental corruption
incomplete transfers
silent replacement of one artifact with another

Checksums are not a substitute for meaning. They are evidence of identity.

What a manifest does not prove by itself¶

A manifest does not fully answer:

what a file means
whether the right files were published in the first place
what software context produced those files
whether the contract should have been versioned differently

That is why manifests need companion surfaces such as:

a file API or contract note
provenance
review judgment about the publish boundary

One useful separation¶

flowchart LR
  manifest["manifest.json"] --> inventory["which files belong here"]
  checksums["sha256 entries"] --> identity["whether each file matches"]
  provenance["provenance.json"] --> context["what software context produced the bundle"]
  file_api["file API or contract doc"] --> meaning["what each published file means"]

This separation matters because reviewers often ask the right question but check the wrong artifact.

A weak integrity story¶

Weak shape:

the bundle has files but no explicit inventory
a report exists, so maintainers assume the bundle is self-explanatory
checksums are absent, so transfers and comparisons rely on trust alone

This leaves a reviewer with no compact way to validate the bundle.

A stronger integrity story¶

Stronger shape:

the manifest names the published files
checksum entries let others validate identity
provenance records software context
contract documentation explains how to interpret the artifacts

Now the bundle can answer more than one kind of question without forcing one file to do everything.

A practical review loop¶

When reviewing a publish bundle, ask:

Does the manifest enumerate the expected public artifacts?
Do the checksum entries cover the files the bundle promises?
Is provenance present for software-context questions?
Is there a document or guide that explains artifact meaning and stability?

If the answer to only the first question is yes, the integrity story is still incomplete.

Common failure modes¶

Failure mode	Why it hurts	Better repair
manifest lists files but no digests	bundle completeness is visible but file identity is weak	include checksums for published artifacts
report is treated as the integrity surface	bundle validation becomes manual and human-only	keep machine-checkable inventory and digests separate
provenance is omitted because manifest exists	software context becomes untraceable	pair manifest integrity with provenance evidence
manifest includes internal or unstable files	public boundary becomes noisy and brittle	inventory only the intentional publish contract
contract meaning lives only in author memory	downstream users cannot interpret the bundle confidently	document the role of each public artifact explicitly

The explanation a reviewer trusts¶

Strong explanation:

manifest.json tells us which files belong in publish/v1/ and gives us digests to verify identity, while provenance.json explains the producing software context and the file API explains what each artifact means.

Weak explanation:

the manifest is there, so the bundle is self-documenting.

The strong explanation respects the limits of each artifact. The weak explanation assigns too much responsibility to one file.

End-of-page checkpoint¶

Before leaving this page, you should be able to:

explain what a manifest proves and what it does not prove
explain why checksums are about identity rather than meaning
describe why provenance is a separate integrity surface
explain why a publish bundle still needs a file API or contract note