Provenance, Manifests, and Publish Boundaries¶
Page Maps¶
graph LR
family["Reproducible Research"]
program["Deep Dive Snakemake"]
section["Dynamic Dags Discovery Integrity"]
page["Provenance, Manifests, and Publish Boundaries"]
capstone["Capstone evidence"]
family --> program --> section --> page
page -.applies in.-> capstone
flowchart LR
orient["Orient on the page map"] --> read["Read the main claim and examples"]
read --> inspect["Inspect the related code, proof, or capstone surface"]
inspect --> verify["Run or review the verification path"]
verify --> apply["Apply the idea back to the module and capstone"]
Discovery is only half the problem.
Once the workflow has learned a changing set of samples, another question appears:
where does that fact live after the run, and which later consumer is allowed to trust it?
This page answers that question.
Internal execution state is not automatically a public contract¶
Many workflows create useful files during execution:
- discovered sample registries
- per-sample summaries
- logs
- benchmarks
- intermediate reports
Those files may be valuable without all being safe to publish as downstream truth.
Module 02 wants a clean separation:
- internal workflow state helps the run remain inspectable
- published outputs define what downstream users may rely on
If you blur those together, review becomes harder.
The three artifact roles¶
1. Discovery artifacts¶
These answer:
- what did the workflow discover
- from which declared surface
- in what normalized order
Typical example:
results/discovered_samples.json
2. Provenance artifacts¶
These answer:
- what configuration actually ran
- which software or workflow version produced the outputs
- which runtime identity matters for review
Typical example:
publish/v1/provenance.json
3. Publish artifacts¶
These answer:
- which outputs are now part of the downstream contract
- where they live
- how a reviewer can verify the published bundle
Typical examples:
publish/v1/discovered_samples.jsonpublish/v1/manifest.jsonpublish/v1/report/index.html
The names can vary. The separation of roles should not.
Why the discovered set deserves special treatment¶
The discovered sample set is not just one more intermediate file.
In a dynamic workflow it often explains:
- why certain jobs existed at all
- why some downstream outputs are present
- why the publish surface contains the sample families it does
If that artifact disappears or stays private forever, the run story becomes harder to reconstruct.
That is why the capstone carries discovery into the publish boundary instead of treating it as disposable setup state.
A healthy boundary¶
flowchart LR
raw["data/raw"] --> discovery["results/discovered_samples.json"]
discovery --> results["results/{sample}/..."]
results --> publish["publish/v1/"]
discovery --> publish
publish --> manifest["manifest.json"]
publish --> provenance["provenance.json"]
This design does something subtle but important:
- dynamic discovery remains visible inside the run
- the same discovery fact is preserved at the public boundary
- manifest and provenance make the publish surface reviewable as a set
What belongs in a discovery artifact¶
A weak registry usually lists only names.
A stronger registry records enough context for later review, for example:
{
"schema_version": 1,
"source": "data/raw/*.fastq.gz",
"samples": {
"sampleA": {
"reads": {
"R1": "data/raw/sampleA_R1.fastq.gz",
"R2": "data/raw/sampleA_R2.fastq.gz"
}
}
}
}
The exact schema is up to the workflow. The lesson is that the discovered set should be specific enough to answer later review questions without forcing a second raw-data scan.
What belongs in provenance¶
Provenance should capture resolved run facts, not vague aspirations.
Typical examples:
- the materialized config used for the run
- the Snakemake version
- the workflow commit or repository state
- the profile or execution context that matters for interpretation
This page is not asking for maximal metadata. It is asking for enough evidence to explain what run produced the published boundary.
Why manifests matter¶
A manifest gives the publish boundary a durable inventory.
That usually means:
- which files belong to the boundary
- where they are relative to the publish root
- an integrity surface such as checksums or hashes
Without a manifest, a publish directory is often just a bag of files someone hopes is complete.
With a manifest, the boundary becomes reviewable as a declared set.
One useful publication pattern¶
Use the workflow in two stages:
- produce internal execution artifacts under
results/ - promote selected artifacts into
publish/v1/
That pattern makes review easier because it answers two different questions cleanly:
- what did the workflow need to run correctly
- what may a downstream consumer trust later
Those questions overlap, but they are not identical.
Common integrity mistakes¶
| Mistake | Why it hurts | Better repair |
|---|---|---|
| discovery artifact is overwritten or discarded | later review cannot explain the DAG | keep the registry as a durable run artifact |
| publish boundary omits discovery | downstream users cannot tell why sample outputs exist | promote the discovered set into the publish bundle |
| provenance captures only a timestamp | run identity stays vague | record resolved config and software identity too |
| publish directory has no inventory | reviewers cannot tell what belongs there | add a manifest with explicit paths and checksums |
| internal and public files share the same location with no distinction | ownership and review scope blur together | keep results/ and publish/ responsibilities separate |
The explanation a reviewer trusts¶
Strong explanation:
the workflow records discovery in
results/discovered_samples.json, publishes a reviewed copy atpublish/v1/discovered_samples.json, records run identity inpublish/v1/provenance.json, and inventories the bundle inpublish/v1/manifest.json, so a reviewer can explain both why the DAG changed and what the public contract now contains.
Weak explanation:
we keep a few JSON files around for debugging.
The strong version names ownership and trust. The weak version treats integrity as optional decoration.
End-of-page checkpoint¶
Before leaving this page, you should be able to:
- distinguish discovery artifacts from provenance artifacts from publish artifacts
- explain why a discovered-set file may belong both inside the run and at the publish boundary
- describe what a manifest adds that a directory listing does not
- explain why internal execution state should not automatically become the public contract