Checkpoints and Reviewed DAG Changes¶

Page Maps¶

graph LR
  family["Reproducible Research"]
  program["Deep Dive Snakemake"]
  section["Dynamic Dags Discovery Integrity"]
  page["Checkpoints and Reviewed DAG Changes"]
  capstone["Capstone evidence"]

  family --> program --> section --> page
  page -.applies in.-> capstone

flowchart LR
  orient["Orient on the page map"] --> read["Read the main claim and examples"]
  read --> inspect["Inspect the related code, proof, or capstone surface"]
  inspect --> verify["Run or review the verification path"]
  verify --> apply["Apply the idea back to the module and capstone"]

Checkpoints are where many Snakemake courses become mystical.

This module takes the opposite approach:

a checkpoint is acceptable only when it makes a data-dependent change to the DAG more explicit than the alternatives.

If it hides the real target story, it is a bad checkpoint even if the code runs.

What a checkpoint actually does¶

A normal rule lets Snakemake build the job graph up front.

A checkpoint says:

build the graph as far as this step
run the checkpoint
inspect its declared output
reevaluate downstream job construction using that new fact

That is all. A checkpoint is not "dynamic magic." It is controlled graph reevaluation.

When a checkpoint is justified¶

Use a checkpoint when downstream targets genuinely cannot be known until some declared input has been inspected or produced.

Good examples:

discovering which samples are present in a newly staged raw-data directory
splitting one input into a data-dependent set of shards
enumerating outputs from a tool whose count depends on validated input content

In all three cases, the checkpoint exists because the workflow learns a real new fact.

When a checkpoint is a smell¶

Do not use a checkpoint for problems that are really about weak workflow design:

you could have read the sample list from config instead
the target list was already known but nobody modeled it cleanly
multiple rules keep rescanning the filesystem and disagreeing about what exists
the rule wants to hide side effects that should have been explicit outputs

Those are not checkpoint problems. They are modeling problems.

The checkpoint contract¶

A healthy checkpoint has four properties:

one clear discovery question
one owned output that records the answer
deterministic behavior for fixed declared inputs
downstream fanout that reads the recorded answer instead of rediscovering it

If any of those are missing, the checkpoint usually leaves the workflow harder to review.

A good shape¶

flowchart TD
  raw["data/raw/"] --> checkpoint["checkpoint discover_samples"]
  checkpoint --> registry["results/discovered_samples.json"]
  registry --> fanout["per-sample targets"]
  fanout --> publish["publish/v1/discovered_samples.json"]

This picture matters because the discovery step leaves behind a durable registry. The checkpoint is not trusted because it is dynamic. It is trusted because its result is reviewable.

Minimal checkpoint pattern¶

checkpoint discover_samples:
    input:
        directory("data/raw")
    output:
        directory("results/discovery")
    run:
        import json
        from pathlib import Path

        outdir = Path(output[0])
        outdir.mkdir(parents=True, exist_ok=True)

        samples = sorted(
            path.name.removesuffix(".fastq.gz")
            for path in Path(input[0]).glob("*.fastq.gz")
        )
        (outdir / "samples.json").write_text(
            json.dumps({"samples": samples}, indent=2) + "\n",
            encoding="utf-8",
        )


def discovered_qc_targets(_wildcards):
    import json
    from pathlib import Path

    checkpoint_job = checkpoints.discover_samples.get()
    registry = Path(checkpoint_job.output[0]) / "samples.json"
    samples = json.loads(registry.read_text(encoding="utf-8"))["samples"]
    return expand("results/{sample}/qc.json", sample=samples)


rule all:
    input:
        discovered_qc_targets

Two parts matter here:

the checkpoint writes one owned registry
downstream expansion reads that registry, not the raw directory again

That is the core Module 02 pattern.

What the checkpoint must never hide¶

A checkpoint must never be used to smuggle in these facts:

undeclared input files
random or time-based discovery
side effects that later rules depend on but cannot name
a second private discovery pass inside downstream helper code

If the downstream DAG changes, the workflow should be able to point to one declared input surface and one concrete checkpoint output that explain why.

Determinism matters more than dynamism¶

The hardest checkpoint bug to debug is often not code failure. It is discovery drift.

Example bad shape:

import random

checkpoint discover_samples:
    output:
        directory("results/discovery")
    run:
        import json
        from pathlib import Path

        outdir = Path(output[0])
        outdir.mkdir(parents=True, exist_ok=True)
        samples = [f"sample{random.randint(1, 99)}" for _ in range(2)]
        (outdir / "samples.json").write_text(json.dumps({"samples": samples}))

This code proves the point:

the checkpoint output is not a fact about declared inputs
reruns can change the discovered set without any honest cause
downstream results become impossible to defend

The fix is never "dynamic DAGs are flaky." The fix is "discovery must be a deterministic function of declared inputs."

A good review checklist¶

When you review a checkpoint, ask:

what exact question is this checkpoint answering
what file records the answer
could I inspect that file after the run and understand the discovered set
do downstream rules read that file or perform their own fresh discovery
would the same declared inputs produce the same answer again

If those questions are easy to answer, the checkpoint is probably healthy.

The explanation a reviewer trusts¶

Strong explanation:

the checkpoint exists because the sample set is unknown until data/raw/ is inspected, it writes results/discovery/samples.json, and all downstream fanout reads that file, so the dynamic change to the DAG is explicit and reproducible.

Weak explanation:

we needed a checkpoint because Snakemake could not figure it out otherwise.

The first sentence explains the contract. The second sentence hides it.

End-of-page checkpoint¶

Before leaving this page, you should be able to:

explain a checkpoint as controlled DAG reevaluation
name one case where a checkpoint is justified
name one case where a checkpoint is really covering for weak target design
describe the durable output a healthy checkpoint should leave behind