The First Honest Workflow Inventory¶
Page Maps¶
graph LR
family["Reproducible Research"]
program["Deep Dive DVC"]
section["Reproducibility Failures Real Teams"]
page["The First Honest Workflow Inventory"]
capstone["Capstone evidence"]
family --> program --> section --> page
page -.applies in.-> capstone
flowchart LR
orient["Orient on the page map"] --> read["Read the main claim and examples"]
read --> inspect["Inspect the related code, proof, or capstone surface"]
inspect --> verify["Run or review the verification path"]
verify --> apply["Apply the idea back to the module and capstone"]
Before you start using DVC, you need one skill that is more basic than any command:
they need to describe their current workflow without flattering it.
That is what this inventory is for.
What the inventory is trying to reveal¶
The goal is not to embarrass anyone or prove the workflow is bad.
The goal is to answer, in plain language:
- what inputs really matter
- what outputs people actually trust
- what steps are still manual or implicit
- what parts of the story still live in memory
Without that inventory, tool adoption often becomes cosmetic.
A simple inventory structure¶
Use five sections:
- source inputs
- control inputs
- execution assumptions
- outputs and who trusts them
- weak points and missing evidence
This is enough for Module 01. The later modules will sharpen each section.
Section 1: Source inputs¶
Write down:
- which datasets or raw inputs exist
- where they come from
- whether the team knows their exact identity or only their path
Bad sign:
the data is in
data/final.csv
Stronger note:
the workflow reads
data/final.csv, but the team does not yet have a durable way to prove which exact bytes that filename referred to over time.
That sentence already points toward why DVC will matter later.
Section 2: Control inputs¶
List the things that change behavior:
- config files
- parameter files
- CLI flags
- seeds
- thresholds or defaults buried inside code
If an important setting is remembered socially rather than recorded somewhere durable, put it in the inventory as a weakness.
Section 3: Execution assumptions¶
Ask what the run is quietly assuming:
- a particular Python or R environment
- local caches or temp directories
- notebook state
- machine-specific filesystems or services
- manually prepared folders
This is often the hardest section because teams are so used to these assumptions that they stop seeing them as inputs at all.
Section 4: Outputs and trust¶
Not every output is equally important.
Ask:
- which files or reports people actually share
- which files are only internal intermediates
- which outputs are treated as authoritative in meetings, pull requests, or releases
This matters because reproducibility work is partly about protecting what people truly trust, not every file in the working tree equally.
Section 5: Weak points and missing evidence¶
Finish by naming the weak points directly:
- unknown data identity
- undocumented manual preprocessing
- environment drift risk
- outputs trusted without a clear contract
- one-person memory bottlenecks
The point is not to fix them all in Module 01.
The point is to stop pretending they are not there.
A small example¶
Imagine writing this inventory:
- source input:
customers.csv - control input:
threshold=0.8insidescore.py - execution assumption: runs only from one laptop with a specific Conda env
- trusted output:
metrics.csvemailed to the team - weak point: nobody can prove which raw CSV produced last quarter's metric
That is already a strong Module 01 result.
It is concrete, honest, and ready for later modules.
A useful diagram for this inventory¶
flowchart LR
inputs["real inputs"] --> run["current workflow"]
controls["controls and defaults"] --> run
assumptions["execution assumptions"] --> run
run --> outputs["trusted outputs"]
outputs --> review["inventory of weak points"]
This is not a future-state diagram. It is a present-state mirror.
What a good inventory feels like¶
A good inventory often feels mildly uncomfortable because it replaces confidence with specificity.
That is useful discomfort.
The inventory is working when you can say:
- here is what we really trust
- here is what still depends on memory
- here is what we cannot yet recover cleanly
That is a much better starting point than "our workflow is fine, we just want better tooling."
Keep this standard¶
Do not let the first inventory turn into a wishlist about future tools.
Keep it about the present workflow:
- what it depends on
- what it produces
- what it cannot currently defend
That honesty is the real prerequisite for everything that follows in Deep Dive DVC.