Health, Readiness, and Drain¶
Atlas exposes separate ideas that operators should not collapse into one boolean:
- health
- readiness
- overload or drain state
Endpoint Model¶
flowchart LR
Runtime[Atlas runtime] --> Health[Health route]
Runtime --> Ready[Readiness route]
Runtime --> Overload[Overload route]
Runtime --> Live[Liveness route]
The endpoint model exists to prevent one of the most common operator mistakes: treating every probe as if it answered the same operational question.
Why the Distinction Matters¶
flowchart TD
Healthy[Process is alive] --> NotReady[May still be unready]
Ready[Can accept traffic] --> Draining[May later drain traffic]
Overloaded[Overload state] --> Traffic[Traffic shaping decisions]
This distinction diagram explains why Atlas exposes multiple routes. A runtime can be alive, unready, or intentionally shedding work in different combinations, and traffic policy should respond accordingly.
Health answers “is the process alive enough to answer basic liveness checks?”
Readiness answers “should this instance currently receive normal traffic?”
Drain or overload state answers “is the instance reducing or refusing certain work classes?”
Operators get into trouble when they collapse those into a single success signal. Atlas exposes separate endpoints because a process can be alive but not yet ready, or ready but already overloaded, and each combination calls for a different response.
Operational Usage¶
- use liveness checks to detect dead processes
- use readiness checks to gate traffic
- use overload or drain signals to avoid making a bad situation worse
- decide traffic routing from readiness and overload, not from liveness alone
Practical Checks¶
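# health: is the process alive enough to answer?
curl -s http://127.0.0.1:8080/healthz
# readiness: should this instance receive normal traffic?
curl -s http://127.0.0.1:8080/readyz
# overload or drain state
curl -s http://127.0.0.1:8080/healthz/overload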
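With `-s`, the commands above print each response body. For scripted checks it is often easier to compare HTTP status codes. The following is a minimal sketch, assuming each route signals its state through the status code and that any 2xx means the check passed; this page does not specify the response format, so confirm it against the contracts listed under Source of Truth. `probe_code` and `classify_code` are hypothetical helper names, not part of Atlas.

```shell
#!/bin/sh
# Hypothetical helpers; assumes each route reports state via its HTTP status code.

# Print the HTTP status code for a URL, or 000 if the request could not complete.
probe_code() {
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$1" 2>/dev/null) || code=000
    printf '%s\n' "${code:-000}"
}

# Map a status code to a coarse result (assumption: any 2xx means "pass").
classify_code() {
    case "$1" in
        2??) echo pass ;;
        000) echo unreachable ;;
        *)   echo fail ;;
    esac
}
```

For example, `classify_code "$(probe_code http://127.0.0.1:8080/readyz)"` prints one word that a wrapper script can compare across the three routes.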
Operator Advice¶
- do not route normal traffic based only on liveness
- treat readiness regression as a first-class operational signal
- observe overload behavior under stress before calling a deployment “ready for production”
- do not declare an incident resolved just because /healthz came back
What a Healthy Probe Story Looks Like¶
- liveness stays boring and stable
- readiness reflects whether the instance should receive normal traffic
- overload and drain signals help prevent healthy-looking saturation failures
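One way to check that a signal is "boring and stable" is to count how often it changes across a series of samples. The sketch below is illustrative only: `count_transitions` is a hypothetical helper, and the sample values (`pass`, `fail`) are assumed labels rather than anything Atlas emits.

```shell
#!/bin/sh
# Hypothetical helper: count how often consecutive samples differ.
# A result of 0 means the signal never changed during the sampling window.
count_transitions() {
    prev=""
    changes=0
    for sample in "$@"; do
        if [ -n "$prev" ] && [ "$sample" != "$prev" ]; then
            changes=$((changes + 1))
        fi
        prev=$sample
    done
    echo "$changes"
}
```

Calling `count_transitions pass pass fail pass` reports two changes. On a liveness series that would warrant a closer look, while on a readiness series it may simply reflect a restart or a drain.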
Purpose¶
This page explains how Atlas separates health, readiness, and drain signals, and points readers to the canonical checked-in workflow or boundary for this topic.
Source of Truth¶
- ops/observe/readiness.json
- ops/observe/contracts/endpoint-observability-contract.json
- ops/observe/contracts/overload-behavior-contract.json
- docs/bijux-atlas-ops/kubernetes/rollout-safety.md
Probe and Decision Map¶
Use the endpoint surfaces for different operational decisions:
- liveness decides whether the process should be restarted
- readiness decides whether the instance should receive normal traffic
- overload or drain state decides whether the instance should shed or limit work even while it remains alive
These signals should feed rollout and service-routing decisions differently.
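The decision map above can be sketched as a small function that takes the three signals and returns one action. This is an illustration under stated assumptions, not Atlas behavior: `decide_action`, its argument values (`alive`, `ready`, `normal`, `overloaded`), and the action names are all hypothetical.

```shell
#!/bin/sh
# Hypothetical decision map. Arguments:
#   $1 liveness:  alive | dead
#   $2 readiness: ready | unready
#   $3 overload:  normal | overloaded
decide_action() {
    live=$1 ready=$2 overload=$3
    if [ "$live" != alive ]; then
        echo restart            # liveness decides restarts
    elif [ "$ready" != ready ]; then
        echo hold-traffic       # readiness gates normal traffic
    elif [ "$overload" != normal ]; then
        echo reduce-load        # alive and ready, but shedding or limiting work
    else
        echo route-normally
    fi
}
```

The ordering is a design choice: liveness is checked first because a dead process makes the other two signals meaningless, and overload is checked last because it only matters for instances that would otherwise receive traffic.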
When Readiness Passes but User Latency Fails¶
Treat this as a real operational mismatch, not as a false alarm. It usually means:
- the instance is technically available but overloaded
- the readiness contract is narrower than the user-facing performance contract
- load, alert, or dashboard evidence must be consulted before promotion
In that situation, do not promote just because readiness is green. Cross-check overload behavior, latency alerts, and rollout-under-load evidence first.
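A promotion gate that follows this advice might look like the sketch below. It assumes the caller already has a readiness result, an overload result, and an observed latency figure from alerts or dashboards. The function name `promotion_gate`, the argument values, and the latency budget are hypothetical and would need to match the real performance contract.

```shell
#!/bin/sh
# Hypothetical promotion gate. Arguments:
#   $1 readiness:        ready | unready
#   $2 overload:         normal | overloaded
#   $3 observed latency: integer milliseconds (from alerts or dashboards)
#   $4 latency budget:   integer milliseconds (assumed threshold)
promotion_gate() {
    ready=$1 overload=$2 latency_ms=$3 budget_ms=$4
    if [ "$ready" != ready ]; then
        echo "block: not ready"
    elif [ "$overload" = overloaded ]; then
        echo "block: overloaded"
    elif [ "$latency_ms" -gt "$budget_ms" ]; then
        echo "block: latency over budget"   # readiness is green, users are not
    else
        echo "promote"
    fi
}
```

For example, `promotion_gate ready normal 900 250` blocks the promotion even though readiness passes. A single `curl -w '%{time_total}'` measurement is only a rough sample, so the latency input should preferably come from load-test or alerting evidence.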
Stability¶
This page is part of the canonical Atlas docs spine. Keep it aligned with the current repository behavior and adjacent contract pages.