Staging, Shared Filesystems, and Data Locality¶

Many production failures are blamed on "the cluster" or "Snakemake being weird" when the real issue is simpler:

the repository never stated what it assumes about where files live and when they become visible.

This page is about making those assumptions explicit.

Data locality is an operational boundary¶

The workflow meaning may be unchanged while the operating context changes:

local filesystem
CI workspace
shared cluster filesystem
scratch or staged working directory

Those contexts can differ safely, but only if the repository treats them as policy surfaces instead of invisible background facts.

The two questions to ask first¶

When an output "goes missing" or appears late, ask:

where was the job actually writing?
when should another process be allowed to trust that write?

Those questions are often more useful than staring at one failed command.

Shared filesystems add timing pressure¶

On a shared filesystem, a job may finish before another process can see its output. That does not mean the workflow semantics changed. It means the operating context needs an explicit patience policy: a stated limit on how long to wait for an output to become visible before treating it as missing.

This is where latency-handling settings, such as Snakemake's `--latency-wait` option, belong:

they acknowledge the storage model
they remain operational rather than semantic
they help the workflow remain honest under normal infrastructure lag

That is a legitimate profile concern.
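
To make the idea of a patience policy concrete, here is a minimal sketch of the behavior such a setting describes: keep checking for an output until a deadline passes, then report it as missing. This is not Snakemake's implementation. The function name `wait_for_output` and its parameters are hypothetical and exist only for illustration.

```python
import time
from pathlib import Path


def wait_for_output(path: Path, latency_wait: float, poll_interval: float = 0.5) -> bool:
    """Return True once `path` is visible, or False after `latency_wait` seconds.

    Hypothetical helper: it only illustrates what a latency setting means.
    """
    deadline = time.monotonic() + latency_wait
    while True:
        if path.exists():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, remaining))
```

The wait length is the policy. It can differ between a laptop and a cluster without changing which file the workflow expects.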

Scratch and staging are not semantic detours¶

Teams often stage work to local scratch or temporary directories for good reasons:

faster local IO
reduced pressure on shared storage
simpler cleanup during execution

That can be healthy. It becomes dangerous when the repository stops answering:

which paths are temporary
which path is the final trusted publication surface
how staged outputs become durable outputs

Staging is safe only when the final contract remains clear.
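
One way to answer the third question is a single, deliberate publish step. The sketch below assumes the job has already finished writing a file in scratch; it copies that file to a temporary name next to the final path and then renames it into place, so the final path never holds a half-written file. The function name `publish` and the `.partial` suffix are hypothetical, and this is a sketch under those assumptions rather than how any particular workflow engine stages files.

```python
import os
import shutil
import tempfile
from pathlib import Path


def publish(staged: Path, final: Path) -> Path:
    """Copy a finished scratch file into its declared final path."""
    final.parent.mkdir(parents=True, exist_ok=True)
    # The temporary copy lives in the destination directory because
    # os.replace is only atomic within a single filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=final.parent, prefix=f".{final.name}.", suffix=".partial"
    )
    os.close(fd)
    try:
        shutil.copy(staged, tmp_name)   # copies data and permission bits
        os.replace(tmp_name, final)     # the file appears under its final name in one step
    except BaseException:
        Path(tmp_name).unlink(missing_ok=True)
        raise
    return final
```

The scratch copy is left in place for the operating context to clean up. On a shared filesystem, other clients may still see the renamed file with some delay, which is a latency question rather than a staging one.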

One healthy mental model¶

flowchart LR
  input["declared input surface"] --> work["temporary work area or scratch"]
  work --> final["declared final outputs"]
  final --> publish["published contract surface"]

This model matters because it keeps three roles separate:

where work happens
where final workflow outputs live
what downstream users are allowed to trust

When those collapse into one vague directory story, incidents get harder to explain.

A weak staging habit¶

Weak operational habits:

write directly to whatever path is convenient on the current machine
move files around ad hoc when the scheduler changes
let each maintainer decide whether scratch is used

This creates repository behavior that feels situational rather than intentional.

The repository may still run, but another maintainer will not know which path story to trust.

A stronger staging pattern¶

Healthy staging design usually has these properties:

the final output path is still the declared contract
temporary or scratch paths are clearly operational
publication into the final path is deliberate
profiles or operating docs explain the context difference

This keeps the workflow semantics stable even while the operating context changes.

What should stay out of workflow meaning¶

The following often belong in policy rather than workflow meaning:

latency expectations
scratch or staging location
executor-facing storage behavior
log-location conventions for one operating context

The following usually do not belong purely in policy:

which files count as final outputs
whether a file is part of the publish boundary
which sample identities the workflow is meant to process

This is the same boundary between policy and workflow meaning that the rest of the module draws, applied here to storage.
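
The split can be sketched as two separate records: one for workflow meaning and one for operating policy. Every name and value below (`WorkflowContract`, `OperatingPolicy`, the sample names, the directories, the wait times) is a hypothetical placeholder chosen for illustration, not taken from any real repository.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Tuple


@dataclass(frozen=True)
class WorkflowContract:
    """Workflow meaning: identical in every operating context."""
    samples: Tuple[str, ...]
    results_dir: Path
    publish_dir: Path

    def final_outputs(self) -> List[Path]:
        return [self.results_dir / f"{s}.summary.tsv" for s in self.samples]


@dataclass(frozen=True)
class OperatingPolicy:
    """Operational choices: allowed to differ per context."""
    name: str
    scratch_dir: Path
    latency_wait_s: int
    log_dir: Path


# Hypothetical values, for illustration only.
CONTRACT = WorkflowContract(
    samples=("sampleA", "sampleB"),
    results_dir=Path("results"),
    publish_dir=Path("publish"),
)

POLICIES = {
    "local": OperatingPolicy("local", Path(".scratch"), 5, Path("logs/local")),
    "cluster": OperatingPolicy("cluster", Path("/scratch/job"), 60, Path("logs/cluster")),
}


def work_path(policy: OperatingPolicy, final: Path) -> Path:
    """Where the job writes while it runs; never a path others should trust."""
    return policy.scratch_dir / final.name
```

Switching from `POLICIES["local"]` to `POLICIES["cluster"]` changes where work happens and how long to wait, while `CONTRACT.final_outputs()` returns the same paths in both cases.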

Common failure modes¶

Failure mode	What it looks like	Better repair
shared-filesystem lag is treated as random workflow failure	reruns feel arbitrary	make latency handling explicit in policy
scratch usage changes the apparent final path story	maintainers cannot tell what is durable	keep final outputs and scratch paths separate
local and CI use different path assumptions without review	one context works and the other feels haunted	document and encode the context difference in profiles or operating docs
staging hides partial publication	files appear in final locations too early	keep publication explicit and deliberate
temporary paths become accidental contracts	downstream tools start depending on scratch layout	reserve stable trust only for declared final outputs
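
The last failure mode in the table can be caught with a simple review-time check. The sketch below assumes you can list the paths a downstream tool declares as inputs and the scratch roots used by your operating contexts; both lists, and the function name `scratch_dependencies`, are hypothetical. The comparison is purely lexical and does not resolve symlinks or relative segments.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def scratch_dependencies(
    declared_inputs: Iterable[str], scratch_roots: Iterable[str]
) -> List[str]:
    """Return the declared inputs that live under any scratch root."""
    roots = [PurePosixPath(r) for r in scratch_roots]
    leaked = []
    for raw in declared_inputs:
        path = PurePosixPath(raw)
        # A path leaks if it is a scratch root or sits anywhere below one.
        if any(root == path or root in path.parents for root in roots):
            leaked.append(raw)
    return leaked
```

An empty result means no declared input depends on scratch layout. Any path it returns is a candidate accidental contract that should be moved behind a declared final output.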

The explanation a reviewer trusts¶

Strong explanation:

the workflow may stage work in a context-specific scratch area, but the final outputs are still published into the same declared contract paths; profile settings handle latency and execution context, while publish and results paths remain semantically stable.

Weak explanation:

on the cluster we write files somewhere else first because that is just how it works.

The first explanation gives a boundary. The second gives a habit without a contract.

End-of-page checkpoint¶

Before leaving this page, you should be able to:

explain why data locality is an operational boundary
describe one legitimate policy use for latency handling
distinguish scratch space from final output contracts
explain one staging design that keeps workflow meaning stable across contexts