Testing

Test Flakiness

Tests that produce inconsistent results — sometimes passing, sometimes failing — without code changes, often caused by timing issues or shared state.

What Is Test Flakiness?

Test flakiness describes the phenomenon where automated tests produce inconsistent results — passing on one run and failing on the next — without any changes to the code under test. A flaky test is unreliable: it fails intermittently for reasons unrelated to the correctness of the application, such as timing issues, shared mutable state, network latency, resource contention, or non-deterministic behavior.

Flaky tests are one of the most damaging problems in software development. Google reported in a widely cited analysis that roughly 1.5% of its test runs produced flaky results, and that these flaky tests were responsible for a disproportionate share of developer frustration and wasted time. When developers see a test that “sometimes fails,” they learn to distrust the test suite. They start re-running failed builds without investigating, marking test failures as “known flaky,” and eventually ignoring test failures altogether. This erosion of trust defeats the purpose of automated testing.

Flakiness is most common in end-to-end tests and integration tests because these tests interact with external systems (browsers, databases, APIs, queues) that introduce non-determinism. Unit tests can also be flaky if they depend on system clock values, random number generators, file system ordering, or shared global state. The root cause is always the same: the test assumes deterministic behavior from something that is not deterministic.
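
As a small illustration of that last point, a unit test whose assertion depends on Math.random will pass or fail depending on the draw; stubbing the generator makes every run repeatable. A minimal Jest-style sketch, with the logic inlined for brevity:

// FLAKY: The outcome depends on whatever Math.random returns this run
test("applies the promotional discount", () => {
  const discount = Math.random() < 0.5 ? 0 : 10; // inlined stand-in for real logic
  expect(discount).toBe(10); // passes only on roughly half of the runs
});

// STABLE: Pins the random source so every run takes the same branch
test("applies the promotional discount when the draw is high", () => {
  jest.spyOn(Math, "random").mockReturnValue(0.9);
  const discount = Math.random() < 0.5 ? 0 : 10;
  expect(discount).toBe(10); // deterministic once Math.random is stubbed
  jest.restoreAllMocks();
});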

How It Works

Flaky tests fail for specific, identifiable reasons. Understanding the common causes is the first step toward eliminating flakiness.

Race conditions and timing issues are the most common cause. A test waits for an asynchronous operation to complete but uses a fixed timeout instead of waiting for a specific condition:

// FLAKY: Uses a fixed timeout
test("shows success message after save", async ({ page }) => {
  await page.click('[data-testid="save-button"]');
  await page.waitForTimeout(1000); // Hope that 1 second is enough
  const message = await page.textContent('[data-testid="status"]');
  expect(message).toBe("Saved successfully");
});

// STABLE: Waits for the actual condition
test("shows success message after save", async ({ page }) => {
  await page.click('[data-testid="save-button"]');
  await expect(page.locator('[data-testid="status"]'))
    .toHaveText("Saved successfully", { timeout: 5000 });
});

Shared mutable state causes flakiness when tests modify data that other tests depend on:

# FLAKY: Tests share database state
class TestOrders:
    def test_create_order(self):
        create_order(user_id=1, product="Widget")
        orders = get_orders(user_id=1)
        assert len(orders) == 1  # Fails if another test also created an order

    def test_delete_order(self):
        # Deletes orders created by other tests
        delete_all_orders(user_id=1)

# STABLE: Each test manages its own state
class TestOrders:
    def setup_method(self):
        self.user_id = create_test_user()

    def teardown_method(self):
        delete_test_user(self.user_id)

    def test_create_order(self):
        create_order(user_id=self.user_id, product="Widget")
        orders = get_orders(user_id=self.user_id)
        assert len(orders) == 1

Test order dependencies occur when tests pass only when run in a specific order:

// FLAKY: the second test depends on the first one running before it
describe("Cart", () => {
  let cartId;

  test("creates a cart", async () => {
    const cart = await createCart();
    cartId = cart.id; // Sets state for the next test
  });

  test("adds item to cart", async () => {
    // Fails if run in isolation or if the previous test is skipped
    await addItem(cartId, { product: "Widget" });
  });
});

// STABLE: each test is independent
describe("Cart", () => {
  test("adds item to a new cart", async () => {
    const cart = await createCart();
    await addItem(cart.id, { product: "Widget" });
    const items = await getItems(cart.id);
    expect(items).toHaveLength(1);
  });
});

Non-deterministic data like timestamps, random IDs, and floating-point arithmetic causes flakiness when used in assertions:

// FLAKY: Timestamp changes between creation and assertion
test("creates a record with current timestamp", () => {
  const record = createRecord();
  expect(record.createdAt).toBe(new Date().toISOString());
  // Fails if the clock advances between the two calls
});

// STABLE: Pins the clock with fake timers
test("creates a record with current timestamp", () => {
  jest.useFakeTimers();
  jest.setSystemTime(new Date("2025-01-15T12:00:00Z"));
  const record = createRecord();
  expect(record.createdAt).toBe("2025-01-15T12:00:00.000Z");
  jest.useRealTimers();
});
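
The same concern applies to floating-point assertions: computed values can differ from the “obvious” decimal result, and in real code the exact bits may vary with evaluation order, so compare with a tolerance rather than exact equality. A small Jest sketch using toBeCloseTo:

// BRITTLE: Exact equality on computed floats
test("tax adds up", () => {
  const total = 0.1 + 0.2;
  expect(total).toBe(0.3); // fails: 0.1 + 0.2 evaluates to 0.30000000000000004
});

// STABLE: Compares within a tolerance
test("tax adds up within rounding error", () => {
  const total = 0.1 + 0.2;
  expect(total).toBeCloseTo(0.3); // default precision (2 decimal digits) is plenty here
});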

Why It Matters

Flaky tests have a corrosive effect on development velocity and team culture that goes far beyond the individual test failures. When the test suite is unreliable, developers stop trusting it. They re-run failures without investigating, assume failed tests are “just flaky,” and eventually merge code without waiting for green builds. This is precisely the state of affairs that allows real bugs to slip through to production.

The cost of flaky tests is measurable. Engineers at Google reported that investigating a single flaky test failure typically took between 2 and 16 minutes, and with millions of test runs per day, this added up to an enormous amount of wasted time. At many organizations, flaky tests are the single largest source of CI pipeline delays, requiring re-runs that double or triple the time from commit to merge.

Flaky tests also mask real failures. When a test that usually flakes starts failing due to an actual bug, developers dismiss the failure and merge the change. The real bug reaches production, and the team loses hours or days diagnosing and fixing an incident that the test suite should have caught.

The psychological impact is equally important. A test suite that reliably fails for no reason teaches developers that test failures are noise rather than signal. Rebuilding that trust — even after the flakiness is resolved — takes months of consistent green builds.

Best Practices

  • Quarantine flaky tests immediately. When a test is identified as flaky, move it to a quarantine suite that runs separately from the main build (see the sketch after this list). This prevents the flaky test from blocking deployments while you investigate the root cause.
  • Track flakiness metrics. Monitor which tests fail intermittently and how often. Tools like BuildPulse and Datadog CI Visibility provide flakiness detection and tracking across CI runs.
  • Use deterministic replacements. Mock the system clock, use seeded random number generators, and control network responses with stubs. Eliminate every source of non-determinism from your test environment.
  • Isolate test state. Each test should create its own data, use its own database transactions (rolled back after each test), and never depend on the output of another test; a transaction-based sketch follows this list. Test independence is the foundation of reliability.
  • Replace sleep calls with explicit waits. Never use sleep(2000) in a test. Use framework-provided waiting mechanisms that poll for a specific condition: waitForSelector, waitForResponse, eventually, or polling assertions.
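
For the quarantine step, one lightweight approach is to tag known-flaky tests and filter on the tag. The sketch below uses Playwright title tags; the @flaky tag name is just a convention, not a built-in:

// Tag the unreliable test so it can be filtered out of the blocking suite
test("syncs inventory after checkout @flaky", async ({ page }) => {
  // ... unchanged test body, kept running in a separate quarantine job
});

The main build then runs with npx playwright test --grep-invert "@flaky", while a separate, non-blocking job runs npx playwright test --grep "@flaky" so the quarantined tests keep executing until they are fixed.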

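For the state-isolation point, wrapping each test in a database transaction that is rolled back afterwards is a common pattern. In the sketch below, db.beginTransaction and the tx helpers are hypothetical stand-ins for whatever database client the project actually uses:

// STABLE: Every test starts from a clean database and leaves no trace behind
let tx;

beforeEach(async () => {
  tx = await db.beginTransaction(); // hypothetical helper on your database client
});

afterEach(async () => {
  await tx.rollback(); // nothing written during the test survives into the next one
});

test("creates an order for a fresh user", async () => {
  const user = await tx.insert("users", { name: "Test User" });
  await tx.insert("orders", { userId: user.id, product: "Widget" });
  const orders = await tx.select("orders", { userId: user.id });
  expect(orders).toHaveLength(1);
});
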
Common Mistakes

  • Retrying flaky tests instead of fixing them. Adding automatic retries (for example Playwright’s --retries=3 or Jest’s jest.retryTimes(3)) masks flakiness without addressing the root cause. The underlying issue — a race condition, a resource leak, a timing dependency — still exists and may affect production behavior.
  • Dismissing flaky tests as “infrastructure issues.” While CI infrastructure can contribute to flakiness (resource contention, slow disk I/O), the test itself is usually at fault for not being resilient to these conditions. A well-written test should produce consistent results even on a slow machine.
  • Deleting flaky tests instead of fixing them. Removing a flaky test also removes whatever coverage it provided. Instead, fix the root cause: replace timeouts with condition-based waits, isolate shared state, and mock non-deterministic dependencies.
  • Not reproducing flakiness locally. Developers often say “it passes on my machine” and move on. Use repetition flags to run the suspect test hundreds of times locally (see the command below), revealing timing-dependent failures that only appear under repetition.
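
Repetition flags make this practical. For example, Playwright can run a single spec file repeatedly in one command (the file path is illustrative):

npx playwright test tests/checkout.spec.ts --repeat-each=200

pytest users can get the same effect with the pytest-repeat plugin and its --count option. If the test fails even once in a few hundred repetitions, you have a local reproduction to debug against.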
