Pydantic Smart Constructors

Make the Pydantic boundary explicit enough that you stop arguing with yourself about where it belongs. The answer in this course is simple: use it where raw data enters or leaves, then cross into plain domain values and keep the core clean.

Start With the Boundary Leak

Many teams either skip validation entirely or let framework models leak through the whole system. There is a narrower, more durable middle path.

  • If raw dicts become core dataclasses directly, bad data may explode much later than it should.
  • If Pydantic models are passed everywhere, framework semantics start replacing domain semantics.
  • If derived fields and serialization rules are not centralized at the edge, the same invariants get reimplemented repeatedly.

Core question
How do you use Pydantic v2 only at the edges as smart constructors — enforcing runtime invariants, providing stable serialization, and computing derived fields — while keeping the core domain as plain frozen dataclasses for maximum performance and purity?

This lesson introduces Pydantic as an edge-only construction tool:

  • validate and normalize raw input once
  • compute derived fields where the boundary information still exists
  • convert to plain domain values before business logic and hot paths begin

The motivating raw-JSON example matters because it shows the whole failure chain: silent acceptance now, expensive confusion later.

The naïve pattern everyone writes first:

# BEFORE – raw dict → core dataclass, silent garbage
chunk = make_chunk(**raw_json)   # accepts missing fields, wrong types, NaN embedding
serialized = json.dumps(asdict(chunk))   # order-unstable, no version, no validation on read

This is the boundary leak to catch early.
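
To make the failure chain concrete, here is a minimal sketch using only the standard library. `NaiveChunk` and `mean_embedding` are hypothetical stand-ins for the course's core type and a downstream consumer, not the real `make_chunk` API. Dataclasses record type hints but do not enforce them, so the bad payload is accepted and the damage only shows up later.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NaiveChunk:
    """Hypothetical core type: annotations are hints, not runtime checks."""
    text: str
    embedding: Tuple[float, ...] = ()


def mean_embedding(chunk: NaiveChunk) -> float:
    """Hypothetical downstream step that trusts its input."""
    return sum(chunk.embedding) / len(chunk.embedding)


raw_json = {"text": 42, "embedding": [0.1, float("nan")]}  # wrong type + NaN

chunk = NaiveChunk(**raw_json)   # no error: construction succeeds
score = mean_embedding(chunk)    # the NaN silently poisons the result

print(type(chunk.text).__name__)  # int, despite the `str` annotation
print(math.isnan(score))          # True
```

Nothing fails at construction time; the first visible symptom is a NaN score far from the line that let the bad data in.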

The production pattern keeps runtime validation at the edge, then crosses into pure domain values quickly so the rest of the system stays predictable and fast.

# AFTER – safe at edge, pure in core
validated = ChunkModel.model_validate(raw_json)   # clear ValidationError early
core_chunk = to_core_chunk(validated)             # → frozen dataclass, zero runtime cost inside pipeline
serialized = validated.model_dump_json(by_alias=True)  # stable, versioned, reproducible

That one-time validation boundary is the key design idea of this lesson.

Use this when you have debugged bad input too late and want strong runtime validation without dragging framework models through the core.

Outcome

  1. Every raw JSON/dict → validated Pydantic model → core frozen ADT.
  2. Runtime invariants enforced exactly once at the edge.
  3. Stable, versioned, round-trippable serialization.

Tiny Non-Domain Example – Production Config Loading

from pydantic import BaseModel, ConfigDict, Field, computed_field, model_validator

class ProdConfigModel(BaseModel):
    model_config = ConfigDict(extra="forbid", frozen=True, populate_by_name=True)

    port: int = Field(ge=1, le=65535)
    host: str = Field(pattern=r"^[a-z0-9.-]+$")
    timeout_ms: int = Field(gt=0)
    debug: bool = False

    @model_validator(mode="after")
    def _no_localhost(self) -> "ProdConfigModel":
        if self.host in {"localhost", "127.0.0.1", "::1"}:
            raise ValueError("localhost disallowed in prod")
        return self

    @computed_field
    @property  # explicit property: attribute access type-checks as float
    def timeout_seconds(self) -> float:
        return self.timeout_ms / 1000.0

# Usage
config_model = ProdConfigModel.model_validate(raw_dict)   # raises clear error if bad
core_config = CoreConfig(
    port=config_model.port,
    host=config_model.host,
    timeout_seconds=config_model.timeout_seconds,
    debug=config_model.debug,
)

All checks live in one place, the derived field comes for free, and the core stays a plain frozen dataclass.
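
`CoreConfig` is not defined in the snippet above. A minimal sketch of what it could look like, assuming it carries exactly the four fields passed in the usage example; the `describe` helper is hypothetical and only shows a core function consuming the value without importing Pydantic.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class CoreConfig:
    """Assumed shape of the core config: plain data, no Pydantic import."""
    port: int
    host: str
    timeout_seconds: float
    debug: bool = False


def describe(cfg: CoreConfig) -> str:
    """Hypothetical core function: it trusts values validated at the edge."""
    return f"{cfg.host}:{cfg.port} (timeout {cfg.timeout_seconds:.1f}s)"


core_config = CoreConfig(port=8080, host="api.example.com", timeout_seconds=2.5)
print(describe(core_config))  # api.example.com:8080 (timeout 2.5s)

try:
    core_config.port = 9090  # type: ignore[misc]
except FrozenInstanceError:
    print("core config is immutable")
```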

Why Pydantic at the Edges Only? (Three bullets every engineer should internalise)

  • Runtime enforcement: model_validator, field constraints, and discriminated unions reject illegal states at the boundary.
  • Stable serialization: model_dump_json(by_alias=True) emits fields in declaration order; together with an explicit version field and discriminators, that gives versioned, reproducible JSON.
  • No validation cost in core: validate once at the edge → convert to a frozen dataclass → full speed and exhaustive mypy checking inside the pipeline.

Pydantic is only for I/O and config. Core domain stays pure frozen dataclasses.
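
The lesson's models pass `by_alias=True` and `populate_by_name=True` but never declare an alias, so the sketch below shows what those two settings do. `WireChunk` and the `sourcePath` alias are hypothetical names chosen for illustration, assuming Pydantic v2.

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict, Field


class WireChunk(BaseModel):
    """Hypothetical edge model with one aliased field."""
    model_config = ConfigDict(frozen=True, extra="forbid", populate_by_name=True)

    version: Literal[1] = 1
    text: str
    source_path: str = Field(alias="sourcePath")


# The wire name and the Python name are both accepted on input.
from_wire = WireChunk(text="hi", sourcePath="a.md")
from_code = WireChunk(text="hi", source_path="a.md")

# by_alias=True writes the wire name; fields appear in declaration order.
payload = from_wire.model_dump_json(by_alias=True)
print(payload)  # {"version":1,"text":"hi","sourcePath":"a.md"}

# The payload validates back into an equal model.
reloaded = WireChunk.model_validate_json(payload)
print(reloaded == from_wire)  # True
```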

1. Laws & Invariants (machine-checked)

| Invariant | Description | Enforcement |
|---|---|---|
| Construction Invariant | Invalid input raises ValidationError early | Pydantic validation + tests |
| Round-Trip | deserialize_model(serialize_model(x), type(x)) == x (serialized with exclude_unset=True; omitted fields fall back to their defaults) | Hypothesis property tests |
| Schema Stability | Schema changes are explicit and reviewed | Snapshot tests |
| Discriminator Uniqueness | No ambiguous union parsing | Pydantic + tests |
| Computed Field Purity | Derived fields are deterministic and free of side effects | Reproducibility tests |

2. Decision Table – Where to Use Pydantic

| Location | Need runtime validation? | Need stable serde? | Use Pydantic? |
|---|---|---|---|
| Ingress (JSON → domain) | Yes | Yes | Yes |
| Core pipeline | No (already validated) | No | No |
| Egress (domain → JSON) | No | Yes | Yes |
| Config loading | Yes | Yes | Yes |
| Hot loops | No | No | No |

3. Public API (boundaries/pydantic_edges.py – mypy --strict clean)

from __future__ import annotations

from typing import Annotated, Any, Dict, List, Literal, TypeVar
from pydantic import BaseModel, Field, ConfigDict, model_validator, computed_field, TypeAdapter
import math

from funcpipe_rag.fp.core import Chunk, make_chunk  # plain frozen core

__all__ = [
    "ChunkModel",
    "to_core_chunk",
    "from_core_chunk",
    "serialize_model",
    "deserialize_model",
]

T = TypeVar("T")

StrictConfig = ConfigDict(
    strict=True,
    frozen=True,
    extra="forbid",
    populate_by_name=True,
)

class ChunkModel(BaseModel):
    model_config = StrictConfig

    version: Literal[1] = 1
    text: str = Field(min_length=1, max_length=200_000)
    metadata: Dict[str, Any] = Field(default_factory=dict)
    embedding: List[float] | None = None

    @model_validator(mode="after")
    def _validate_embedding(self) -> "ChunkModel":
        if self.embedding is None:
            return self
        if not self.embedding:
            raise ValueError("embedding must be non-empty if present")
        if len(self.embedding) > 8192:
            raise ValueError("embedding too long")
        for i, v in enumerate(self.embedding):
            if not math.isfinite(v):
                raise ValueError(f"embedding[{i}] not finite")
            if abs(v) > 100.0:
                raise ValueError(f"embedding[{i}] out of reasonable range")
        return self

    @computed_field
    @property  # explicit property: `model.length` type-checks as int
    def length(self) -> int:
        return len(self.text)

def to_core_chunk(model: ChunkModel) -> Chunk:
    return make_chunk(
        text=model.text,
        path=(),
        metadata=model.metadata,
    )

def from_core_chunk(core: Chunk) -> ChunkModel:
    return ChunkModel(
        text=core.text,
        metadata=core.metadata,
    )

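def serialize_model(model: BaseModel) -> str:
    # Computed fields are derived output, not input. Leaving them out keeps the
    # payload acceptable to extra="forbid" when it is validated again.
    computed = set(type(model).model_computed_fields)
    return model.model_dump_json(by_alias=True, exclude_unset=True, exclude=computed)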

def deserialize_model(json_str: str, typ: type[T]) -> T:
    return TypeAdapter(typ).validate_json(json_str)

3.1 Pattern: Discriminated Unions for Core ADTs (e.g. Result)

from typing import Annotated, Generic, Literal, TypeAlias, TypeVar, Union
from pydantic import BaseModel, ConfigDict, Field

StrictConfig = ConfigDict(
    strict=True,
    frozen=True,
    extra="forbid",
    populate_by_name=True,
)

T = TypeVar("T")

class ErrInfoModel(BaseModel):
    model_config = StrictConfig
    code: str
    msg: str

class OkModel(BaseModel, Generic[T]):
    model_config = StrictConfig
    kind: Literal["ok"] = "ok"
    value: T

class ErrModel(BaseModel):
    model_config = StrictConfig
    kind: Literal["err"] = "err"
    error: ErrInfoModel

ResultModel: TypeAlias = Annotated[Union[OkModel[T], ErrModel], Field(discriminator="kind")]
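
A sketch of how such a union might be parsed at the edge and bridged into a core sum type, assuming Pydantic v2. To stay short it uses a concrete `OkIntModel` instead of the generic `OkModel[T]`; the core `Ok`/`Err` dataclasses and `parse_result` are hypothetical names, not part of the course API.

```python
from dataclasses import dataclass
from typing import Annotated, Literal, Union

from pydantic import BaseModel, ConfigDict, Field, TypeAdapter, ValidationError

StrictConfig = ConfigDict(strict=True, frozen=True, extra="forbid")


class ErrInfoModel(BaseModel):
    model_config = StrictConfig
    code: str
    msg: str


class OkIntModel(BaseModel):  # concrete variant, for brevity
    model_config = StrictConfig
    kind: Literal["ok"] = "ok"
    value: int


class ErrModel(BaseModel):
    model_config = StrictConfig
    kind: Literal["err"] = "err"
    error: ErrInfoModel


IntResultModel = Annotated[Union[OkIntModel, ErrModel], Field(discriminator="kind")]
INT_RESULT = TypeAdapter(IntResultModel)  # build once, reuse for every payload


@dataclass(frozen=True)
class Ok:  # hypothetical core ADT
    value: int


@dataclass(frozen=True)
class Err:  # hypothetical core ADT
    code: str
    msg: str


def parse_result(payload: str) -> Union[Ok, Err]:
    """Validate JSON against the tagged union, then cross into core values."""
    model = INT_RESULT.validate_json(payload)
    if isinstance(model, OkIntModel):
        return Ok(model.value)
    return Err(model.error.code, model.error.msg)


print(parse_result('{"kind": "ok", "value": 3}'))  # Ok(value=3)
print(parse_result('{"kind": "err", "error": {"code": "E1", "msg": "boom"}}'))

try:
    parse_result('{"kind": "maybe", "value": 3}')  # unknown tag
except ValidationError as exc:
    print(f"rejected with {exc.error_count()} error(s)")
```

The `kind` tag selects exactly one variant, so a payload with an unknown tag is rejected rather than being tried against each member in turn.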

4. Reference Implementations

4.1 Before vs After – Chunk Ingestion

# BEFORE – raw dict → core dataclass, silent garbage
chunk = make_chunk(**raw_json)   # accepts negative length, NaN embedding, etc.

# AFTER – validated at edge, safe core
validated = ChunkModel.model_validate(raw_json)   # clear ValidationError if bad
core_chunk = to_core_chunk(validated)             # → pure frozen dataclass

4.2 RAG Integration – Safe Ingestion Pipeline

def ingest_raw_chunk(raw: dict[str, Any]) -> Chunk:
    validated = ChunkModel.model_validate(raw)
    return to_core_chunk(validated)

def persist_chunk(core: Chunk) -> str:
    model = from_core_chunk(core)
    return serialize_model(model)   # stable, versioned JSON
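
`ingest_raw_chunk` lets `ValidationError` propagate. For batch ingestion it can be handy to collect the failures instead; the sketch below shows one way, assuming Pydantic v2. `MiniChunkModel` is a trimmed, hypothetical stand-in for `ChunkModel`, and `try_ingest` is not part of the course API.

```python
import math
from typing import Any, Dict, List, Optional, Union

from pydantic import BaseModel, ConfigDict, Field, ValidationError, field_validator


class MiniChunkModel(BaseModel):
    """Trimmed, hypothetical stand-in for ChunkModel."""
    model_config = ConfigDict(strict=True, frozen=True, extra="forbid")

    text: str = Field(min_length=1)
    embedding: Optional[List[float]] = None

    @field_validator("embedding")
    @classmethod
    def _finite(cls, v: Optional[List[float]]) -> Optional[List[float]]:
        if v is not None and not all(math.isfinite(x) for x in v):
            raise ValueError("embedding must be finite")
        return v


def try_ingest(raw: Dict[str, Any]) -> Union[MiniChunkModel, List[str]]:
    """Return the validated model, or a flat list of 'field: message' strings."""
    try:
        return MiniChunkModel.model_validate(raw)
    except ValidationError as exc:
        return [f"{'.'.join(map(str, e['loc']))}: {e['msg']}" for e in exc.errors()]


ok = try_ingest({"text": "hello", "embedding": [0.1, 0.2]})
errors = try_ingest({"text": "", "embedding": [1.0, float("nan")], "surprise": 1})

print(ok)
for line in errors:
    print(line)  # one line per offending field
```

Pydantic reports every failing field in one pass, so a single bad payload yields the complete list of problems rather than only the first.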

5. Property-Based Proofs (capstone/tests/test_pydantic_edges.py)

import math
import pytest
from hypothesis import given, strategies as st
from funcpipe_rag.boundaries.pydantic_edges import ChunkModel, serialize_model, deserialize_model

nonfinite = st.sampled_from([float("nan"), float("inf"), float("-inf")])

@given(text=st.text(min_size=1, max_size=1000),
       metadata=st.dictionaries(st.text(), st.integers() | st.text()))
def test_chunk_roundtrip(text, metadata):
    model = ChunkModel(text=text, metadata=metadata)
    json_str = serialize_model(model)
    reloaded = deserialize_model(json_str, ChunkModel)
    assert model == reloaded
    assert reloaded.length == len(text)

@given(bad_emb=st.lists(nonfinite, min_size=1))
def test_nonfinite_embedding_rejected(bad_emb):
    with pytest.raises(ValueError):
        ChunkModel(text="x", embedding=bad_emb)

@given(emb=st.lists(st.floats(allow_nan=False, allow_infinity=False), min_size=1))
def test_large_embedding_rejected_if_out_of_range(emb):
    if any(abs(x) > 100 for x in emb):
        with pytest.raises(ValueError):
            ChunkModel(text="x", embedding=emb)

def test_schema_stable(snapshot):
    assert ChunkModel.model_json_schema() == snapshot
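
`test_schema_stable` relies on a `snapshot` fixture that a pytest snapshot plugin would have to provide; the lesson does not say which one. As a plugin-free alternative, here is a minimal sketch using only the standard library. `check_snapshot` is a hypothetical helper, and the two schema dicts stand in for the output of `ChunkModel.model_json_schema()`.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def check_snapshot(name: str, schema: Dict[str, Any], directory: Path) -> bool:
    """Write the snapshot on first run; afterwards report whether it still matches."""
    path = directory / f"{name}.schema.json"
    current = json.dumps(schema, indent=2, sort_keys=True)
    if not path.exists():
        path.write_text(current, encoding="utf-8")
        return True
    return path.read_text(encoding="utf-8") == current


# Stand-ins for a model's JSON schema; any JSON-compatible dict works.
schema_v1 = {
    "title": "ChunkModel",
    "required": ["text"],
    "properties": {"text": {"type": "string"}},
}
schema_v2 = {**schema_v1, "required": ["text", "path"]}  # simulated drift

with tempfile.TemporaryDirectory() as tmp:
    snapshots = Path(tmp)
    results = [
        check_snapshot("chunk", schema_v1, snapshots),  # True: snapshot created
        check_snapshot("chunk", schema_v1, snapshots),  # True: unchanged
        check_snapshot("chunk", schema_v2, snapshots),  # False: schema drifted
    ]

print(results)  # [True, True, False]
```

In a real repository the snapshot directory would be committed, so a schema change shows up as a reviewed diff rather than a silent break.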

6. Big-O & Allocation Guarantees

| Operation | Time | Heap | Notes |
|---|---|---|---|
| Validation | O(#fields) | O(#fields) | Once at edge |
| Serialization | O(#fields) | O(#fields) | Fields emitted in declaration order |
| computed_field | O(1) or O(N) | O(1) | Recomputed on access; keep pure & fast |

7. Anti-Patterns & Immediate Fixes

| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Raw **kwargs → dataclass | Silent invalid states | Pydantic model_validate at edge |
| Manual JSON serde | Unstable order, no versioning | model_dump_json / validate_json |
| Pydantic in hot path | 10–100× slowdown | Validate once → convert to frozen core |
| Missing discriminator | Union parse ambiguity | Annotated[Union[...], Field(discriminator="kind")] |
| Mutable models in core | Accidental mutation | frozen=True + extra="forbid" |

8. Pre-Core Quiz

  1. Pydantic at edges for…? → Runtime validation + stable serde
  2. model_validator(mode="after") for…? → Cross-field checks
  3. Discriminated unions use…? → kind tag
  4. computed_field gives…? → Pure derived properties
  5. Core stays…? → Plain frozen dataclasses

9. Post-Core Exercise

  1. Wrap one core ADT in a Pydantic model → add model_validator + computed_field.
  2. Add discriminated union for a sum type → test parsing.
  3. Replace one raw JSON → dataclass with Pydantic edge + bridge.
  4. Add schema snapshot test for a model → verify stability.

Continue with: Pattern Matching

You now have bulletproof I/O: every external payload is validated exactly once at the edge, serialized stably forever, and the core pipeline runs at full speed on pure frozen ADTs. The rest of Module 5 adds pattern matching for orchestration and final serialization contracts.