Test Stability Policy
This document defines how the project treats flaky and unstable tests: how often a test may be retried, when it may be quarantined, and who is responsible for triaging failures. The goal is a test suite where a red run means a real regression — not noise — so contributors can trust CI.
A flaky test is one that passes and fails non-deterministically on the same code (same commit, same inputs). Flakiness erodes trust in CI, hides real regressions, and wastes reviewer time, so we treat it as a defect with the same seriousness as a failing test.
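The definition above can be checked mechanically against CI history. The following is a minimal sketch, assuming run results are available as (commit, test, outcome) records; the `RunResult` type, the `find_flaky` helper, and the sample data are hypothetical and not part of the project's tooling.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class RunResult(NamedTuple):
    commit: str   # commit SHA the run executed against
    test: str     # test identifier
    passed: bool  # outcome of that run


def find_flaky(results: Iterable[RunResult]) -> set[str]:
    """Return tests that both passed and failed on the same commit."""
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for r in results:
        outcomes[(r.commit, r.test)].add(r.passed)
    # A test is flaky only if one commit produced both outcomes.
    return {test for (_, test), seen in outcomes.items() if seen == {True, False}}


# Hypothetical history: names and SHAs are illustrative only.
history = [
    RunResult("abc123", "test_roundtrip", True),
    RunResult("abc123", "test_roundtrip", False),  # same commit, different outcome
    RunResult("abc123", "test_golden", False),
    RunResult("def456", "test_golden", True),      # different commit: not evidence of flakiness
]
print(find_flaky(history))
```

Note that `test_golden` is not reported: it failed on one commit and passed on another, which is consistent with a real regression that was later fixed.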
Principles
- Determinism is the default. Tests must use fixed seeds, fixed clocks, and deterministic inputs. The suite already enforces this for generated code via `tests/check_determinism.py` and for wire formats via `tests/check_golden.py`; new tests are expected to hold the same bar.
- A failure is real until proven flaky. Never re-run red CI hoping for green without first inspecting the failure. Re-running to “make it pass” is prohibited.
- Flakiness is a bug. A confirmed flaky test gets an issue, an owner, and a deadline — exactly like any other defect.
- Quarantine is a last resort, never a destination. Quarantining buys time to fix; it does not close the problem.
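As an illustration of the determinism principle, here is a minimal sketch of a test that passes its random source and clock in explicitly instead of reading global state. The function `make_packet_id` is a hypothetical stand-in for code under test and does not exist in the project.

```python
import random
from datetime import datetime, timezone
from typing import Callable


def make_packet_id(rng: random.Random, now: Callable[[], datetime]) -> str:
    """Hypothetical code under test: an ID built from a timestamp and a random suffix."""
    stamp = now().strftime("%Y%m%dT%H%M%S")
    return f"{stamp}-{rng.randrange(10_000):04d}"


def fixed_now() -> datetime:
    """Fixed clock: always returns the same instant."""
    return datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)


def test_packet_id_is_deterministic() -> None:
    # Two generators with the same seed yield the same sequence.
    first = make_packet_id(random.Random(1234), fixed_now)
    second = make_packet_id(random.Random(1234), fixed_now)
    assert first == second
    assert first.startswith("20240101T120000-")


test_packet_id_is_deterministic()
```

Because both inputs are injected, the test never touches the wall clock or the module-level random state, so its result does not depend on when or where it runs.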
Retry policy
Retries exist to absorb genuinely external, non-reproducible failures (network hiccups installing dependencies, runner provisioning errors), not to mask product or test defects.
| Scope | Max automatic retries | Where it applies |
|---|---|---|
| Whole CI job (infrastructure) | 1 re-run of the job | Manual “Re-run failed jobs” only, with justification in the PR |
| Network / dependency install step | 2 | apt-get, pip install, npm ci, toolchain setup |
| Individual test case logic | 0 | The functional assertions in tests/ |
Rules:
- No automatic retries are added around functional assertions. The test bodies under `tests/` must not wrap logic in retry loops to paper over non-determinism. If a test needs a retry to pass, it is flaky and must be fixed or quarantined.
- Infrastructure retries must be bounded (see table) and must back off. Unbounded `while` retry loops are not allowed.
- Re-running a red CI run manually is allowed at most once and only when the failure is identified as infrastructure (not a test or product defect). The person who re-runs records why in the PR conversation.
- Timeouts count as failures, not as a reason to retry. A test that times out intermittently is flaky.
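A bounded, backing-off retry for an infrastructure step might look like the sketch below. It assumes the step is a Python callable and that `OSError` (which includes `ConnectionError`) is what marks a failure as infrastructure; both are illustrative choices, and `retry_with_backoff` is a hypothetical helper rather than existing project code. The default of 2 retries mirrors the network/dependency row of the table.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    step: Callable[[], T],
    max_retries: int = 2,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `step`, retrying at most `max_retries` times with exponential backoff."""
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    for attempt in range(max_retries + 1):
        try:
            return step()
        except OSError:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the failure
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay
    raise AssertionError("unreachable")


# Usage: a step that fails twice, then succeeds. Delays are recorded, not slept.
calls = {"n": 0}


def flaky_install() -> str:
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ConnectionError("temporary network failure")
    return "installed"


delays: list[float] = []
print(retry_with_backoff(flaky_install, sleep=delays.append), delays)
```

The loop is a `for` over a fixed range, so the retry count cannot grow past the configured bound, and any exception that is not an `OSError` propagates immediately without a retry.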
Quarantine policy
Quarantine isolates a confirmed-flaky test so it stops blocking unrelated PRs while its owner fixes the root cause.
A test may be quarantined when all of the following hold:
- It has failed non-deterministically on `main` (or on an unrelated PR) at least twice, with evidence linked (CI run URLs).
- A tracking issue exists, labelled `flaky-test`, with an assigned owner.
- The flake is not masking a real regression, i.e. the failure mode has been understood well enough to be confident it is non-deterministic noise.
How to quarantine:
- Preferred: narrow the scope. Skip only the unstable case, not the whole file, using the suite’s existing skip mechanism (`--skip-lang` for a whole language phase; an explicit early `return`/skip with a `# QUARANTINE: <issue-url>` comment for a single case).
- Every quarantine must reference its tracking issue in a `QUARANTINE: <issue-url>` comment adjacent to the skip, so the reason is discoverable in code review and greppable in CI logs.
- A quarantined test still runs in non-gating mode where practical (e.g. a `continue-on-error` matrix leg, mirroring how `test.yml` already tolerates the .NET 8 leg) so we keep collecting signal on the flake.
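Because the marker is meant to be greppable, a small audit can confirm that every `QUARANTINE` comment actually carries an issue link. The sketch below is one possible shape for such a check; `audit_quarantine_markers` is hypothetical, the sample test names are invented, and the URL is a placeholder rather than a real issue.

```python
import re

# Matches "QUARANTINE:" followed by an http(s) URL on the same line.
MARKER = re.compile(r"QUARANTINE:\s*(https?://\S+)")


def audit_quarantine_markers(source: str) -> tuple[list[str], list[int]]:
    """Return (issue URLs found, 1-based line numbers of markers missing a URL)."""
    urls: list[str] = []
    missing: list[int] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if "QUARANTINE" not in line:
            continue
        match = MARKER.search(line)
        if match:
            urls.append(match.group(1))
        else:
            missing.append(lineno)
    return urls, missing


# Illustrative input: one well-formed marker, one without an issue link.
sample = '''\
def test_dispatch_order():
    # QUARANTINE: https://example.invalid/issues/1
    return  # skipped while the flake is being fixed

def test_timeout():
    # QUARANTINE: flaky, will fix later
    return
'''
print(audit_quarantine_markers(sample))
```

A marker reported in the second list is a quarantine without a tracking issue, which the policy does not allow.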
Quarantine has a maximum lifetime of 14 days. If the tracking issue is not resolved by then, the owner must either (a) delete the test if it provides no value, or (b) escalate to the maintainers to reprioritise. Quarantined tests are reviewed at least weekly.
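The 14-day limit lends itself to an automated reminder during the weekly review. The following sketch assumes the quarantine start date of each test is known (for example, from the tracking issue) and treats a quarantine as overdue once more than 14 full days have elapsed; that boundary reading, the `overdue_quarantines` helper, and the sample data are assumptions.

```python
from datetime import date, timedelta

MAX_QUARANTINE = timedelta(days=14)  # lifetime stated in this policy


def overdue_quarantines(started: dict[str, date], today: date) -> list[str]:
    """Return tests whose quarantine exceeded the maximum lifetime, oldest first."""
    overdue = [test for test, day in started.items() if today - day > MAX_QUARANTINE]
    return sorted(overdue, key=lambda test: started[test])


# Hypothetical quarantine start dates.
started = {
    "test_dispatch_order": date(2024, 1, 1),
    "test_timeout": date(2024, 1, 10),
}
print(overdue_quarantines(started, today=date(2024, 1, 16)))
```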
Triage ownership
| Role | Responsibility |
|---|---|
| PR author | Investigates any failure on their PR before re-running. Must not merge over a red or quarantined-without-issue state. |
| Flake owner (assignee on the `flaky-test` issue) | Reproduces, root-causes, and fixes the flake within the quarantine lifetime; closes the tracking issue. |
| Maintainers | Own flakes seen on main with no obvious owner; assign an owner within 2 business days; run the weekly quarantine review. |
Default ownership routing for a flake on main:
- The author of the most recent change to the affected test or the code under test is the first-line owner.
- If that is unclear, the maintainers triage and assign.
Lifecycle of a flaky test
- Detect: a test fails non-deterministically. Capture the CI run URL(s).
- File: open an issue labelled `flaky-test` with the evidence, the affected test, and a hypothesis. Cross-link it from the Test Coverage Triage table if the flake also represents a coverage gap.
- Assign: an owner is set per the routing above.
- Quarantine (optional): if the flake is blocking others and the criteria above are met, quarantine it with a `QUARANTINE:` comment linking the issue.
- Fix: make the test deterministic (fix seeds/clocks/ordering, remove shared mutable state, stabilise timing assumptions) or fix the underlying product bug the flake exposed.
- Un-quarantine: re-enable the test, confirm it is green across several consecutive runs, and close the issue.
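The final step depends on a run of consecutive green results. A minimal sketch of that check is shown below, assuming the outcomes of the re-enabled test are available oldest-first as booleans. The policy does not fix how many runs count as "several", so the threshold is a required argument; both helper names are hypothetical.

```python
def trailing_green_streak(outcomes: list[bool]) -> int:
    """Count consecutive passes at the end of the history (most recent runs)."""
    streak = 0
    for passed in reversed(outcomes):
        if not passed:
            break
        streak += 1
    return streak


def ready_to_close(outcomes: list[bool], required: int) -> bool:
    """True once the most recent `required` runs were all green."""
    return trailing_green_streak(outcomes) >= required


# Oldest first: a failure, then three passes since the fix landed.
recent = [True, False, True, True, True]
print(trailing_green_streak(recent))
```

Only the trailing streak counts: passes recorded before the most recent failure say nothing about whether the fix worked.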
Common flake sources to check first
- Unseeded randomness. When property tests are promoted to gating CI, pin Hypothesis seeds or set `derandomize` (the current `tests/test_property_roundtrip.py` suite is non-gating). Fuzz harnesses (`tests/c/fuzz_parser.c`, `tests/py/fuzz_parser.py`, the Rust target) save the reproducing input on failure; attach it to the issue.
- Ordering / shared state. Tests that depend on filesystem ordering or reuse generated artifacts under `tests/generated/` without regenerating.
- Timing. Real sleeps, wall-clock comparisons, or transport timeouts in the SDK tests. Prefer injected clocks and mock transports (as the `StructFrameSdk` subscribe/dispatch tests already do).
- Toolchain drift. Behaviour that differs across the GCC/Clang, Python, Node, or .NET matrix legs; pin versions and reproduce on the failing leg.
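Two of the sources above, filesystem ordering and timing, have small mechanical fixes. The sketch below shows one way to apply them: sort directory listings before use, and drive timeout logic from a manually advanced clock instead of real sleeps. `list_generated`, `FakeClock`, and `Deadline` are hypothetical helpers for illustration and are not the project's SDK types.

```python
import tempfile
from pathlib import Path
from typing import Callable


def list_generated(root: Path) -> list[str]:
    """List file names in a stable order; raw directory iteration order is unspecified."""
    return sorted(p.name for p in root.iterdir() if p.is_file())


class FakeClock:
    """Manually advanced clock that can stand in for time.monotonic."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


class Deadline:
    """Hypothetical timeout helper that takes its clock as a parameter."""

    def __init__(self, timeout: float, clock: Callable[[], float]) -> None:
        self._clock = clock
        self._expires = clock() + timeout

    def expired(self) -> bool:
        return self._clock() >= self._expires


# Ordering: files created out of order are still listed deterministically.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("b.bin", "a.bin", "c.bin"):
        (root / name).write_bytes(b"")
    assert list_generated(root) == ["a.bin", "b.bin", "c.bin"]

# Timing: the deadline expires when the test says so, with no real waiting.
clock = FakeClock()
deadline = Deadline(timeout=5.0, clock=clock)
assert not deadline.expired()
clock.advance(5.0)
assert deadline.expired()
```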
Related
- Test Coverage — the generated coverage matrix and its auto-derived triage table.
- Testing — how to run the suite locally.