Deterministic Target Lists and Sample Discovery¶
Page Maps¶
graph LR
family["Reproducible Research"]
program["Deep Dive Snakemake"]
section["Dynamic Dags Discovery Integrity"]
page["Deterministic Target Lists and Sample Discovery"]
capstone["Capstone evidence"]
family --> program --> section --> page
page -.applies in.-> capstone
flowchart LR
orient["Orient on the page map"] --> read["Read the main claim and examples"]
read --> inspect["Inspect the related code, proof, or capstone surface"]
inspect --> verify["Run or review the verification path"]
verify --> apply["Apply the idea back to the module and capstone"]
Dynamic workflows begin with a simple promise:
if the set of samples changes, the workflow can explain exactly how it learned that fact.
That promise sounds obvious. Many real workflows still break it by discovering work in ways that are unstable, implicit, or impossible to review later.
Discovery is not permission to be vague¶
People often say "Snakemake discovers samples for me" when they really mean one of three different things:
- the sample list is written in config
- a script scans a directory and turns filenames into sample identities
- a checkpoint produces a discovered-set file that later rules consume
Those are not the same design.
The healthy question is not "can Snakemake discover this?" The healthy question is:
what exact input surface decides which targets exist?
If that surface is unclear, the workflow may still run, but the run story is weak.
The stable model¶
Module 02 uses this model throughout:
- choose one declared discovery surface
- normalize it into a deterministic sample list
- validate the result
- materialize it as an artifact when later rules need to trust it
- fan out from that list, not from fresh filesystem guesses
That is how dynamic behavior stays reviewable.
Three discovery surfaces, from simplest to strongest¶
1. Config or sample sheet¶
This is the easiest surface to teach:
- you write the sample list deliberately
- version control captures the change
rule allor a helper function expands from known data
This is a strong default when the workflow does not need to discover new inputs at run time.
2. Deterministic scan at parse time¶
Sometimes the workflow must inspect a directory to know what exists. That can still be honest if the scan is narrow and stable:
- the input directory is declared
- matching rules are explicit
- the result is sorted
- filename parsing is validated
This is fine for a small workflow when the discovered set is not yet an important downstream contract.
3. Checkpoint-produced discovery artifact¶
This is the strongest version:
- discovery happens in one owned step
- the discovered set is written to a durable file
- downstream fanout reads that file
- later review can inspect the exact set that drove the DAG
That is the design the rest of this module is trying to earn.
A bad first draft¶
This shape looks convenient:
from pathlib import Path
SAMPLES = [path.stem for path in Path("data/raw").glob("*.fastq.gz")]
rule all:
input:
expand("results/{sample}/qc.json", sample=SAMPLES)
Why this is weak:
Path.glob()order is not the lesson you want readers to trust- the filename parsing is wrong because
.stemdrops only the final suffix - there is no validation for duplicate logical sample IDs
- nothing records the discovered set for later review
The workflow may appear to work while teaching the wrong habit.
A better pattern for small workflows¶
If parse-time discovery is enough, make it explicit and deterministic:
from pathlib import Path
RAW_DIR = Path("data/raw")
def discover_samples():
fastqs = sorted(RAW_DIR.glob("*.fastq.gz"))
if not fastqs:
raise ValueError(f"No FASTQ files found in {RAW_DIR}")
samples = []
for path in fastqs:
name = path.name.removesuffix(".fastq.gz")
if not name.replace("_", "").isalnum():
raise ValueError(f"Unsafe sample name: {name}")
samples.append(name)
if len(samples) != len(set(samples)):
raise ValueError("Duplicate sample IDs after filename normalization")
return samples
SAMPLES = discover_samples()
rule all:
input:
expand("results/{sample}/qc.json", sample=SAMPLES)
This still uses discovery, but it stops pretending the filesystem is self-explanatory.
When discovery must become an artifact¶
Once discovery drives a dynamic DAG or a publish boundary, a Python list at parse time is usually not enough. Another engineer should be able to inspect the exact discovered set from the run itself.
That is when the workflow should produce something like:
results/discovered_samples.jsonresults/discovered_samples.tsvresults/discovery/registry.json
The filename is less important than the contract:
- it is the owned output of discovery
- it is durable enough to reread later
- downstream rules fan out from it
A simple discovered-set artifact¶
rule discover_samples:
input:
directory("data/raw")
output:
"results/discovered_samples.json"
run:
import json
from pathlib import Path
raw_dir = Path(input[0])
samples = sorted(
path.name.removesuffix(".fastq.gz")
for path in raw_dir.glob("*.fastq.gz")
)
Path(output[0]).parent.mkdir(parents=True, exist_ok=True)
Path(output[0]).write_text(
json.dumps({"samples": samples}, indent=2) + "\n",
encoding="utf-8",
)
This rule does two useful things for you:
- it names discovery as a real step with a real output
- it gives later review one file that answers "what did this run discover?"
What makes discovery deterministic¶
Discovery is deterministic when the same declared inputs produce the same discovered set.
That usually requires four habits:
- sort every scan result
- normalize names exactly once
- reject invalid or duplicate identities early
- avoid ambient inputs that the rule never admits
If discovery depends on hidden machine state, current time, or whichever files happened to be copied first, it is not a trustworthy basis for DAG construction.
Common failure modes¶
| Failure mode | What it looks like | Better repair |
|---|---|---|
| unordered scan | sample order changes between runs or machines | sort explicitly before writing or expanding |
| weak filename parsing | sampleA.fastq and sampleA.fastq.gz collapse badly |
normalize with one clear suffix rule |
| duplicate logical IDs | one sample silently shadows another | validate uniqueness after normalization |
| re-scanning everywhere | each rule performs its own fresh discovery | introduce one owned discovery artifact |
| hidden discovery surface | rules depend on files nobody declared | make the discovery input boundary explicit |
The sentence a reviewer wants to hear¶
Strong explanation:
the workflow discovers samples from
data/raw/*.fastq.gz, sorts the filenames, normalizes them into sample IDs once, writes the result toresults/discovered_samples.json, and fans out downstream work from that file.
Weak explanation:
Snakemake looks around and finds the samples.
The difference is not style. It is whether the workflow meaning is actually reviewable.
End-of-page checkpoint¶
Before leaving this page, you should be able to:
- name the discovery surface for one workflow without hand-waving
- explain when a Python list is enough and when a discovered-set artifact is needed
- describe two validation checks that belong in discovery
- explain why downstream fanout should trust one owned artifact instead of repeated scans