Systematic Refactor¶
Concept Position¶
flowchart TD
family["Python Programming"] --> program["Python Functional Programming"]
program --> module["Module 10: Refactoring and Sustainment"]
module --> concept["Systematic Refactor"]
concept --> capstone["Capstone pressure point"]
flowchart TD
problem["Start with the design or failure question"] --> example["Study the worked example and trade-offs"]
example --> boundary["Name the boundary this page is trying to protect"]
boundary --> proof["Carry that question into code review or the capstone"]
Read the first diagram as a placement map: this page is one concept inside its parent module, not a detached essay, and the capstone is the pressure test for whether the idea holds. Read the second diagram as the working rhythm for the page: name the problem, study the example, identify the boundary, then carry one review question forward.
Module 10
Core question:
How do you iteratively refactor imperative spaghetti—entangled loops, shared state, hidden deps, and effects—into a layered FuncPipe pipeline with pure core, explicit boundaries, and verifiable equivalence, using a heuristics-guided cycle to handle real legacies safely?
In this core, we outline a heuristics-guided migration cycle for refactoring imperative code to functional pipelines in the FuncPipe RAG Builder (now at funcpipe-rag-10). We separate the abstract protocol (reusable framework), a realistic case study (tangled legacy script), and a verification appendix (patterns for equivalence evidence). Builds on Module 09 standards.
Motivation Bug: Tangled imperative code is flaky, and big-bang rewrites risk regressions; a heuristics-guided cycle with equivalence evidence enables safe, adaptive migration.
Delta from Module 09: Standards define good FP; this gives the adaptive playbook.
Audience: Engineers refactoring legacies safely.
Outcome: 1. Audit/classify impurities, including structural tangles. 2. Apply heuristics-guided cycle with diverging paths. 3. Migrate RAG, verifying equivalence/laws.
Migration Protocol (Abstract, Reusable)¶
- Cycle: Heuristics-guided loop (graph + scores below) until stop: no impurities, equivalence holds, laws pass, budget met.
- Semantics: Preserve evolving equivalence via code predicates (below); mocks for effects.
- Purity: Progressive: tag → parameterize → extract; full when untagged.
- Error Model Default: Per-record Result streams during migration (preserve continuation); collapse to batch post-purity when (a) adapters idempotent, (b) no per-record branching, (c) consumers don't need partials.
- Resource Safety: Ports return ContextManager[Iterator[Result[T, E]]]; IoAction enters/exits during perform. Core never stores raw iterators.
- Integration: Apply to legacies; multi-layered verification (appendix).
- Mypy Config: --strict incremental; shims for sigs (schedule below).
- Stop Condition: No tags, equivalence holds, laws pass, budget met.
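The per-record Result default above can be sketched minimally. Note this is an illustrative stand-in, not FuncPipe's actual types: `Ok`, `Err`, and `parse_records` are assumed names.

```python
from dataclasses import dataclass
from typing import Generic, Iterator, TypeVar, Union

T = TypeVar("T")
E = TypeVar("E")

@dataclass(frozen=True)
class Ok(Generic[T]):
    value: T

@dataclass(frozen=True)
class Err(Generic[E]):
    error: E

Result = Union[Ok[T], Err[E]]

def parse_records(rows: list[str]) -> Iterator[Result]:
    # Per-record Results preserve continuation: one bad row
    # does not abort the rest of the stream.
    for row in rows:
        try:
            yield Ok(int(row))
        except ValueError:
            yield Err(f"not an int: {row!r}")

results = list(parse_records(["1", "x", "3"]))
oks = [r.value for r in results if isinstance(r, Ok)]
errs = [r.error for r in results if isinstance(r, Err)]
```

Collapsing to a batch-level error model later simply means short-circuiting on the first `Err`, which is safe only under the three gating conditions above.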
Equivalence Predicates (Code, Normative)¶
Parameterized to domain.
from collections import Counter
from typing import Any, Callable, Hashable

def eq_pure(xs: list, ys: list, key: Callable[[Any], Hashable], order_sensitive: bool = False) -> bool:
    if order_sensitive:
        return [key(x) for x in xs] == [key(y) for y in ys]
    # Multiset comparison; assumes hashable keys. If keys aren't hashable,
    # map them to a stable serialization or use sorted lists under a total order.
    return Counter(key(x) for x in xs) == Counter(key(y) for y in ys)

def eq_boundary(a: dict, b: dict, normalize: Callable[[dict], dict]) -> bool:
    return normalize(a) == normalize(b)  # Per-artifact; e.g., strip timestamps/logs

def eq_error(e1: list[ErrInfo], e2: list[ErrInfo], classify: Callable[[ErrInfo], tuple[str, str]]) -> bool:
    return Counter(classify(e) for e in e1) == Counter(classify(e) for e in e2)  # Multisets of (code, stage)
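As a quick sanity check of how the predicates parameterize to a domain (repeating `eq_pure` so the snippet runs standalone; the dict-based chunks and `doc_id`/`text` key are illustrative):

```python
from collections import Counter
from typing import Any, Callable, Hashable

def eq_pure(xs: list, ys: list, key: Callable[[Any], Hashable],
            order_sensitive: bool = False) -> bool:
    if order_sensitive:
        return [key(x) for x in xs] == [key(y) for y in ys]
    return Counter(key(x) for x in xs) == Counter(key(y) for y in ys)

# Domain parameterization for chunk-like records:
chunk_key = lambda c: (c["doc_id"], c["text"])
old = [{"doc_id": "a", "text": "t1"}, {"doc_id": "b", "text": "t2"}]
new = [{"doc_id": "b", "text": "t2"}, {"doc_id": "a", "text": "t1"}]  # same multiset, reordered

assert eq_pure(old, new, key=chunk_key)                        # order-insensitive: equal
assert not eq_pure(old, new, key=chunk_key, order_sensitive=True)
```

The choice of `key` is the whole game: too coarse and regressions slip through, too fine and incidental fields (timestamps, ids) cause false failures.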
Decision Graph + Heuristics (Operational)¶
Graph:
graph TD
start["Start: audit and tag smells, then choose the highest-scoring path"]
pure["Extract Pure Islands<br/>score: mutation density + cyclomatic complexity"]
boundaries["Extract Boundaries<br/>score: I/O density + exception density"]
errors["Normalize Errors<br/>score: exception density + silent swallows"]
deps["Parameterize Dependencies<br/>score: hidden globals + fan-in or fan-out of shared state"]
verify["Verify and optimize if over budget"]
stop["Stop when impurities are isolated, equivalence holds, laws pass, and budget is met"]
pureFail["Laws or equivalence fail -> loop or undo"]
boundaryFail["Swappability fails -> loop or undo"]
errorFail["Error equivalence fails -> loop or undo"]
depFail["Equivalence fails -> loop or undo"]
verifyFail["Verification fails -> loop or escalate"]
start --> pure
start --> boundaries
start --> errors
start --> deps
pure --> pureFail
boundaries --> boundaryFail
errors --> errorFail
deps --> depFail
pure --> verify
boundaries --> verify
errors --> verify
deps --> verify
verify --> verifyFail
verify --> stop
Heuristics (measurable). Tie-breaker: choose pure if tied. Unit = smallest refactorable region, usually a loop nest or helper; split regions over 80 LOC by effect boundary first. Normalize each score to [0,1] per region; sum with equal weights unless the domain dictates otherwise. If the top two scores are within 10%, do a paired pass (e.g., parameterize + pure extract) to avoid churn. To approximate without tooling: grep symbols for fan-in/out; count branches/loops for CC. Metrics:
- Mutation density: mutating ops (assignments, `.append`) / lines.
- Cyclomatic complexity (CC): branches + loops per region.
- % lines in I/O: I/O calls / total lines.
- Exception density: try/except blocks / lines.
- # globals/hidden: count of reads/writes to globals/time/RNG/env.
- Fan-in/out: count of references to shared objects.
Iterations continue until the stop condition holds; the number of passes varies widely across codebases.
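The normalize-and-sum rule can be sketched concretely. The region names and raw metric values below are hypothetical; `score_paths` pairs metrics per path exactly as the decision graph labels them.

```python
def minmax(vals: list[float]) -> list[float]:
    # Min-max normalize one metric across regions to [0, 1].
    lo, hi = min(vals), max(vals)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in vals]

def score_paths(metrics: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Per region: normalize each metric across regions, then sum the
    equal-weighted metric pair behind each decision-graph path."""
    names = list(metrics)
    keys = sorted({k for m in metrics.values() for k in m})
    norm: dict[str, dict[str, float]] = {n: {} for n in names}
    for k in keys:
        for n, v in zip(names, minmax([metrics[n].get(k, 0.0) for n in names])):
            norm[n][k] = v
    paths = {"pure": ("mut", "cc"), "boundaries": ("io", "exc"),
             "errors": ("exc", "swallow"), "deps": ("globals", "fan")}
    return {n: {p: sum(norm[n][k] for k in ks) for p, ks in paths.items()}
            for n in names}

# Hypothetical measurements for two regions of a legacy script:
scores = score_paths({
    "parse_loop":  {"mut": 0.30, "cc": 8, "io": 0.05, "exc": 0.10,
                    "swallow": 2, "globals": 6, "fan": 3},
    "write_phase": {"mut": 0.05, "cc": 2, "io": 0.60, "exc": 0.00,
                    "swallow": 0, "globals": 1, "fan": 1},
})
# Ties resolve toward "pure" because it is listed first.
best = max(scores["parse_loop"], key=scores["parse_loop"].__getitem__)
```

Here `parse_loop` ties at the top across pure/errors/deps and resolves to the pure path, while `write_phase` scores highest on boundaries, matching the I/O-heavy write phase.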
Resource Pattern (Canonical Combinator)¶
from typing import Callable, ContextManager, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")

def within(cm_factory: Callable[[], ContextManager[Iterator[T]]],
           fn: Callable[[Iterator[T]], Iterator[U]]) -> IoAction[Iterator[U]]:
    def thunk() -> Iterator[U]:
        with cm_factory() as it:
            yield from fn(it)  # Enter/exit happen during perform, not at construction; partial consumption stays safe
    return IoAction.delay(thunk)

# Usage
read_desc = within(lambda: storage.read_docs(path), identity)  # Streaming-safe
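To make the timing concrete, here is a runnable sketch with a minimal `IoAction` stand-in (FuncPipe's real type is assumed richer) and a fake source that records enter/exit events:

```python
from contextlib import contextmanager
from typing import Callable, ContextManager, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")

class IoAction:
    """Minimal stand-in: defers the effect until perform() is called."""
    def __init__(self, thunk: Callable[[], object]):
        self._thunk = thunk
    @staticmethod
    def delay(thunk: Callable[[], object]) -> "IoAction":
        return IoAction(thunk)
    def perform(self) -> object:
        return self._thunk()

def within(cm_factory: Callable[[], ContextManager[Iterator[T]]],
           fn: Callable[[Iterator[T]], Iterator[U]]) -> IoAction:
    def thunk() -> Iterator[U]:
        # Context enters when the iterator is first advanced,
        # exits when it is exhausted or closed.
        with cm_factory() as it:
            yield from fn(it)
    return IoAction.delay(thunk)

events: list[str] = []

@contextmanager
def fake_source():
    events.append("open")
    try:
        yield iter([1, 2, 3])
    finally:
        events.append("close")

action = within(fake_source, lambda it: (x * 10 for x in it))
out = list(action.perform())  # open/close happen here, not at construction
```

Constructing `action` touches no resources; only draining the performed iterator opens and closes the source, which is why the core may pass these actions around but must never stash the raw iterator.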
Artifact Taxonomy (For eq_boundary)¶
Artifacts classed as: (a) line logs (normalize: sort + strip timestamps), (b) structured records (normalize: key-based multisets), (c) blobs (normalize: hash/content eq). Each port exposes normalizer for its class.
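One illustrative normalizer per class might look like the following; the timestamp regex and record field names are assumptions, not FuncPipe's actual conventions:

```python
import hashlib
import re
from collections import Counter

TS = re.compile(r"\d{4}-\d{2}-\d{2}T[\d:.]+Z?\s*")  # illustrative ISO-timestamp prefix

def normalize_line_log(text: str) -> list[str]:
    # (a) Line logs: strip timestamps, then sort lines.
    return sorted(TS.sub("", line) for line in text.splitlines())

def normalize_records(records: list[dict]) -> Counter:
    # (b) Structured records: key-based multiset; volatile fields ignored.
    return Counter((r["id"], r["payload"]) for r in records)

def normalize_blob(data: bytes) -> str:
    # (c) Blobs: content hash suffices for equality.
    return hashlib.sha256(data).hexdigest()

# Two runs of the same pipeline, differing only in timestamps and order:
a = "2024-01-02T10:00:00Z ok\n2024-01-02T10:00:01Z done\n"
b = "2024-01-03T09:59:59Z done\n2024-01-03T09:59:58Z ok\n"
```

Each normalizer is exactly the `normalize` argument `eq_boundary` expects for its artifact class.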
Signature Evolution Schedule (3-Phase)¶
- Shadow: Run new in parallel; log/assert equivalence; no output switch.
- Canary: Subset traffic to new; monitor; non-flaky tests via mocks/seeds.
- Cutover: Switch; remove legacy/shims after stability (e.g., 1 sprint).
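The shadow phase can be sketched as a wrapper that runs both implementations, logs divergence, and keeps serving the legacy result (`shadowed` and its equivalence hook are illustrative, not part of FuncPipe):

```python
import logging
from typing import Callable, TypeVar

R = TypeVar("R")
log = logging.getLogger("migration")

def shadowed(legacy: Callable[..., R], new: Callable[..., R],
             equivalent: Callable[[R, R], bool]) -> Callable[..., R]:
    """Shadow phase: run both, assert/log equivalence, no output switch."""
    def wrapper(*args, **kwargs) -> R:
        old_out = legacy(*args, **kwargs)
        try:
            new_out = new(*args, **kwargs)
            if not equivalent(old_out, new_out):
                log.warning("shadow divergence: %r vs %r", old_out, new_out)
        except Exception:
            log.exception("shadow run failed")  # new path must never break callers
        return old_out  # output switches only at cutover
    return wrapper

total = shadowed(lambda xs: sum(xs), lambda xs: sum(sorted(xs)),
                 equivalent=lambda a, b: a == b)
```

Canary is the same wrapper with a traffic predicate deciding which output to return; cutover deletes the wrapper and the legacy branch.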
Case Study: Tangled RAG God Script Migration¶
Legacy (Before: Realistic Ugliness)¶
Anonymized from internal: partials feed phases; hidden coupling; mutable backoff; shared mutables.
# real_tangled_rag.py (anonymized)
import csv, json, random, time, os  # Hidden deps

CONFIG = {"chunk_size": 256, "retry": 3, "backoff": [1, 2, 4]}  # Global mutable
SHARED_STATE = {"partials": [], "error_counts": {}, "cache": {}}  # Shared mutables across phases/modules

def tangled_rag(in_path: str, out_path: str, temp_dir: str) -> list[Chunk]:  # Coupled to OS via temp
    global CONFIG, SHARED_STATE
    chunks = []
    with open(in_path) as f:
        reader = csv.DictReader(f)
        for row in reader:  # Parsing + retry + partials intertwined
            backoff = CONFIG['backoff'][:]  # Mutable copy; modified over retries
            for attempt in range(CONFIG['retry']):
                try:
                    doc = RawDoc(**row)
                    SHARED_STATE['cache'][doc.doc_id] = doc  # Mutate shared
                    break
                except KeyError as e:
                    SHARED_STATE['error_counts'][str(e)] = SHARED_STATE['error_counts'].get(str(e), 0) + 1
                    if SHARED_STATE['error_counts'][str(e)] > 5:
                        break  # Cross-cutting threshold
                    time.sleep(backoff[attempt % len(backoff)])  # Hidden time dependency
                    backoff[attempt % len(backoff)] *= 1.5  # Mutate backoff state
            if 'cs.' not in row.get('categories', ''):
                continue  # Filter flag buried in loop
            cleaned = CleanDoc(...)  # ...
            SHARED_STATE['partials'].append(cleaned)  # Feed to next phase
            # Inner loop: chunk + check partials (coupling)
            temp_file = os.path.join(temp_dir, f"{cleaned.doc_id}.tmp")  # OS coupling
            with open(temp_file, 'w') as tf:  # Interleaved IO
                offset = random.randint(0, 10)
                for i in range(0, len(cleaned.abstract), CONFIG['chunk_size'] + offset):
                    chunk = Chunk(...)
                    if any(chunk.text in p.abstract for p in SHARED_STATE['partials']):
                        continue  # Cross-phase check
                    tf.write(json.dumps(chunk.to_dict()) + '\n')
                    chunks.append(chunk)
    # Reuse partials in write phase
    with open(out_path, 'w') as f:
        for chunk in chunks:
            if chunk.doc_id in SHARED_STATE['cache']:  # Hidden coupling
                f.write(json.dumps(chunk.to_dict()) + '\n')
    return chunks
Iteration 1: Audit + Parameterize Deps (High # Globals/Fan-in)¶
Score: High globals (6+) + fan-in (partials/cache/error_counts). Micro-commit Delta (Tagged → Parameterized):
# Before (tagged excerpt)
global CONFIG, SHARED_STATE  # Impurity: global mutable
backoff = CONFIG['backoff'][:]  # Impurity: mutable copy
time.sleep(backoff[attempt % len(backoff)])  # Impurity: hidden time
backoff[attempt % len(backoff)] *= 1.5  # Impurity: mutate state
offset = random.randint(0, 10)  # Impurity: hidden RNG

# After (parameterized)
def tangled_rag(..., config: dict | None = None, shared_state: dict | None = None,
                clock: Callable[[float], None] = time.sleep,
                rng: random.Random = random.Random(42)) -> list[Chunk]:
    config = config or CONFIG.copy()
    shared_state = shared_state or {"partials": [], "error_counts": {}, "cache": {}}
    ...
    backoff = config['backoff'][:]  # Still mutable; next iteration
    clock(backoff[attempt % len(backoff)])
    offset = rng.randint(0, 10)

# Preserve eq_pure(chunks, key=lambda c: (c.doc_id, c.text)); eq_error(classify=lambda e: (type(e), e.key))
Iteration 2: Normalize Errors + Extract Boundaries (High Exceptions/IO %)¶
Score: exception density high; ~30% of lines in I/O. Port: as protocol; adapter handles retries/backoff immutably. Verify: eq_error; metamorphic (retry idempotence).
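A sketch of the boundary extraction: a port protocol the core depends on, plus an adapter-side retry helper whose backoff schedule is an immutable tuple rather than the legacy's mutated list. `DocSource` and `with_retry` are illustrative names.

```python
from typing import Callable, Iterator, Protocol, TypeVar

T = TypeVar("T")

class DocSource(Protocol):
    # Port: the pure core depends on this protocol only.
    def read_rows(self) -> Iterator[dict]: ...

def with_retry(fetch: Callable[[], T], schedule: tuple[float, ...],
               sleep: Callable[[float], None]) -> T:
    """Adapter-side retry: the schedule is never mutated in place,
    and the clock is injected so tests stay deterministic."""
    last: Exception | None = None
    for delay in schedule:
        try:
            return fetch()
        except KeyError as e:
            last = e
            sleep(delay)
    raise last  # retries exhausted; error normalized at the boundary

# Deterministic test double: fails twice, then succeeds.
calls: list[float] = []
attempts = iter([KeyError("k"), KeyError("k"), {"id": 1}])

def flaky():
    r = next(attempts)
    if isinstance(r, Exception):
        raise r
    return r

row = with_retry(flaky, schedule=(0.0, 0.0, 0.0), sleep=calls.append)
```

Because `sleep` is a parameter, eq_error and retry-idempotence checks run instantly under a mock clock.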
Loop: Mutations remain → Next pure.
Iteration 3: Extract Pure Islands (High Mutations/Complexity)¶
Extract cleaning/chunking; shared state → explicit dicts. Assume constructors are pure; if not, treat them as effect boundaries and port them first. Score: Mutation density high; complexity from loops. Collapse errors? No—adapters not idempotent. Verify: Metamorphic families (appendix).
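The extracted island can be sketched as a standalone pure function; `chunk_text` and its stride rule are illustrative, mirroring the legacy inner loop's `chunk_size + offset` stride with the RNG-derived offset now passed in as data:

```python
def chunk_text(text: str, chunk_size: int, offset: int) -> list[str]:
    """Pure island: same stride as the legacy loop, but offset is a
    parameter, so the function is deterministic and law-checkable."""
    stride = chunk_size + offset
    return [text[i:i + stride] for i in range(0, len(text), stride)]

# Determinism, plus a monotonicity spot-check (smaller chunks -> more chunks):
a = chunk_text("abcdefghij", chunk_size=3, offset=0)
b = chunk_text("abcdefghij", chunk_size=3, offset=0)
```

With mutation gone, the metamorphic families in the appendix (monotonicity, permutation) apply directly to this function.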
Stop: Checks pass.
Verification Appendix (Patterns)¶
Multi-Layered Evidence¶
- Equivalence: Per pass, using predicates + domain params.
- Local Laws: Idempotence/associativity for new pure (Module 05/06).
- Swappability: Mock adapter == real (eq_boundary).
- Metamorphic: Derive via:
- Invariant (e.g., order irrelevance, retry idempotence, chunk monotonicity, add-noise, scale).
- Transform (permute, rerun with seed, vary size, add errors, scale input).
- Assert: eq_pure/eq_error post-transform.
Canonical Families:
- Permutation: for order-irrelevance.
- Retry: same input/seeds → same output (idempotence).
- Monotonicity: smaller chunk_size → more chunks.
- Noise: add irrelevant records → same core outputs.
- Associativity: split/recombine folds → same aggregate.
... (examples per iteration in case study)
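Two of the families above in miniature, following the invariant → transform → assert recipe (`pipeline` is an illustrative stand-in for a migrated pure core; `eq_multiset` inlines the order-insensitive predicate so the snippet runs standalone):

```python
import random
from collections import Counter

def pipeline(records: list[dict]) -> list[dict]:
    # Stand-in for the migrated pure core: filter + project.
    return [{"id": r["id"]} for r in records if r.get("keep")]

def eq_multiset(xs, ys, key) -> bool:
    return Counter(map(key, xs)) == Counter(map(key, ys))

records = [{"id": i, "keep": i % 2 == 0} for i in range(10)]

# Permutation family (invariant: order irrelevance; transform: shuffle).
shuffled = records[:]
random.Random(0).shuffle(shuffled)  # seeded for reproducibility
assert eq_multiset(pipeline(records), pipeline(shuffled), key=lambda r: r["id"])

# Noise family (invariant: irrelevant records don't change core output).
noisy = records + [{"id": 99, "keep": False}]
assert eq_multiset(pipeline(records), pipeline(noisy), key=lambda r: r["id"])
```

Each family is cheap to generate, which is why metamorphic evidence scales better than hand-written golden outputs.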
Runtime Preservation Guarantee¶
Set system-specific budget (multi-dimensional: latency, throughput, peak RSS). Profile per pass; fuse if over (imperative behind pure); defer non-blocking regressions to Core 2.
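A minimal per-pass budget check might look like this; `tracemalloc` tracks Python-level allocations as a proxy for peak RSS (swap in a process-level tool for true RSS), and the budget numbers are placeholders:

```python
import time
import tracemalloc
from typing import Callable

def within_budget(fn: Callable, args: tuple,
                  max_seconds: float, max_peak_bytes: int) -> tuple[bool, float, int]:
    """Profile one refactoring pass against a latency + peak-memory budget."""
    tracemalloc.start()
    t0 = time.perf_counter()
    fn(*args)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return (elapsed <= max_seconds and peak <= max_peak_bytes, elapsed, peak)

ok, elapsed, peak = within_budget(sorted, (list(range(10_000)),),
                                  max_seconds=1.0, max_peak_bytes=50_000_000)
```

Run this after every pass; a failing result triggers the fuse-or-defer decision rather than an immediate rollback.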
Anti-Patterns & Immediate Fixes¶
| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Vibes-based choices | Inefficient paths | Use heuristics/scores |
| Naïve predicates | Regressions pass undetected | Parameterize + multisets |
| Early collapse | Lose continuation | Gate on adapter idempotence |
| Synthetic examples | Unrealistic prep | Use anonymized real code |
Pre-Core Quiz¶
- Choose paths by…? → Heuristics/scores
- Predicates parameterized by…? → Domain key/normalize
- Collapse errors when…? → Adapters idempotent
- Metamorphic derivation? → Invariant + transform + assert
- Sig evolution? → 3-phase schedule
Post-Core Exercise¶
- Audit real tangled function; score + run 1 iteration.
- Define predicates + metamorphic family.
- Iterate to stop; profile.
Continue with: Performance Budgeting