Failure Injection Load

Atlas combines failure injection and load scenarios so resilience claims are measured under degraded conditions, not only happy-path traffic.

Purpose

Use this page when validating graceful degradation, correctness preservation, or recovery behavior while Atlas is under meaningful traffic.

Source of Truth

  • ops/e2e/scenarios/failure/
  • ops/load/scenarios/
  • ops/load/thresholds/

Combined Resilience Model

The failure program defines injections such as invalid config, missing artifacts, corrupted shards, disk exhaustion, ingest crashes, query crashes, bad request floods, and slow-query warning conditions. The load program pairs those injections with traffic scenarios such as:

  • store-outage-under-spike
  • noisy-neighbor-cpu-throttle
  • pod-churn
  • stampede
  • cheap-only-survival
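
One way to picture the pairing is a small run plan that names one injection and one load scenario. The sketch below is illustrative only: the RunPlan type, the plan_matrix helper, and the injection identifiers are hypothetical and not taken from the Atlas repository. Only the load scenario names come from the list above, and not every pairing is necessarily a meaningful test.

```python
from dataclasses import dataclass
from itertools import product

# Load scenario names as listed on this page.
LOAD_SCENARIOS = [
    "store-outage-under-spike",
    "noisy-neighbor-cpu-throttle",
    "pod-churn",
    "stampede",
    "cheap-only-survival",
]

# Hypothetical identifiers for the injections described above; the real
# names live under ops/e2e/scenarios/failure/ and may differ.
INJECTIONS = [
    "invalid-config",
    "missing-artifacts",
    "corrupted-shards",
    "disk-exhaustion",
    "ingest-crash",
    "query-crash",
    "bad-request-flood",
    "slow-query-warning",
]


@dataclass(frozen=True)
class RunPlan:
    """One combined run: a single injection applied during a single load scenario."""
    injection: str
    load_scenario: str


def plan_matrix(injections, load_scenarios):
    """Return every injection/load pairing as a RunPlan, in a stable order."""
    return [RunPlan(i, s) for i, s in product(injections, load_scenarios)]


plans = plan_matrix(INJECTIONS, LOAD_SCENARIOS)
print(len(plans))  # 8 injections x 5 scenarios = 40
print(plans[0])
```

In practice an operator would likely trim this matrix to the pairings that match a stated hypothesis instead of running all of them.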

What Operators Are Testing

Operators should state the hypothesis before running the scenario:

  • graceful degradation: protected traffic classes stay available and Atlas reports overload honestly
  • correctness preservation: degraded conditions do not return wrong data or break contract semantics
  • recovery: the service stabilizes after the injected failure is removed
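
A minimal sketch of recording the hypothesis before a run, assuming a plain Python record is acceptable. The Hypothesis enum, the DeclaredRun type, and the declare_run helper are hypothetical names for illustration and are not part of Atlas tooling. The three hypothesis labels are the ones listed above.

```python
from dataclasses import dataclass
from enum import Enum


class Hypothesis(Enum):
    GRACEFUL_DEGRADATION = "graceful degradation"
    CORRECTNESS_PRESERVATION = "correctness preservation"
    RECOVERY = "recovery"


@dataclass(frozen=True)
class DeclaredRun:
    scenario: str
    hypothesis: Hypothesis
    expectation: str  # free-text statement of what should hold during the run


def declare_run(scenario: str, hypothesis: str, expectation: str) -> DeclaredRun:
    """Build a run record; raise ValueError if the hypothesis label is
    unknown or the expectation is empty."""
    if not expectation.strip():
        raise ValueError("state the expected behavior before running the scenario")
    # Enum lookup by value raises ValueError for an unrecognized label.
    return DeclaredRun(scenario, Hypothesis(hypothesis), expectation.strip())


run = declare_run(
    "pod-churn",
    "recovery",
    "the service stabilizes after the injected failure is removed",
)
print(run.hypothesis.value)
```

Rejecting an empty expectation up front keeps the hypothesis from being written after the results are known.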

How to Judge the Outcome

  • graceful degradation means Atlas may slow down or shed selected traffic, but the protected surface stays within the declared thresholds
  • correctness failure means the service returns invalid, inconsistent, or contract-breaking results even if latency looks acceptable
  • resilience failure means the service does not recover or the incident surface becomes opaque to operators
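
The judgment rules above can be sketched as a small classifier. This is an illustration under stated assumptions, not Atlas code: the Observation fields and the judge function are hypothetical, the order of the checks is a choice made here, and the "threshold breach" label for the remaining case is invented for this example because the page does not name it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    protected_within_thresholds: bool  # protected surface met the declared thresholds
    results_correct: bool              # no invalid, inconsistent, or contract-breaking results
    recovered: bool                    # service stabilized after the injection was removed
    signals_visible: bool              # incident surface stayed readable to operators


def judge(obs: Observation) -> str:
    """Map one run's observations to an outcome label."""
    # Correctness is checked first: acceptable latency does not excuse wrong data.
    if not obs.results_correct:
        return "correctness failure"
    # No recovery, or an opaque incident surface, counts as a resilience failure.
    if not obs.recovered or not obs.signals_visible:
        return "resilience failure"
    # Slowing down or shedding traffic is acceptable if the protected surface held.
    if obs.protected_within_thresholds:
        return "graceful degradation"
    # Hypothetical label: the page does not name this case.
    return "threshold breach"


print(judge(Observation(True, True, True, True)))
print(judge(Observation(True, False, True, True)))
```

The boolean inputs would have to be derived from real measurements, for example by comparing load results against the files in ops/load/thresholds/.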