---
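title: "Flaky Test Troubleshooting Guide: 5 Steps to Reproduce, Isolate, and Fix at the Root"
description: "Flaky tests can't be papered over with retries. A systematic reproduce → isolate → diagnose → fix → prevent workflow, covering common root causes such as race conditions, timing, and environment pollution."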
category: automation
tags: [flaky-test, debugging, ci, automation, e2e]
date: 2026-06-10
---

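# Flaky Test Troubleshooting Guide: 5 Steps to Reproduce, Isolate, and Fix at the Root

A flaky test is one that gives different results when the same code runs twice. It is worse than a failing test: **you fix a failure, but you get used to flakiness**. Once the team is used to hitting retry and the CI pass rate has slid from 99% to 80%, nobody trusts the tests anymore.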

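## The nature of flakiness

```
Inconsistent test results = the test or the system under test has an "uncontrolled variable"
```

Possible uncontrolled variables:

- **Time**: not waiting long enough, or running too fast
- **Order**: state left behind by an earlier test
- **Environment**: DB, cache, third-party state
- **Concurrency**: race conditions
- **Resources**: behavior changes under CPU / RAM pressure
- **External**: network, third-party APIs

As a rule of thumb, about 90% of flaky tests come from the first four.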

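## Why you can't cover it up with retries

`retries: 3` looks like a cheap fix, but in practice:

1. **CI gets slower**: every retry is another run of the failing test, so with `retries: 3` a flaky test can run up to 4 times
2. **It hides real bugs**: a race condition is a real problem that will blow up in production
3. **It breeds a "flaky is fine" culture**: the limit keeps loosening until you end up at `retries: 10`
4. **It muddies diagnosis**: you can no longer tell whether a red build is a regression or just flaky

**Retry is a last resort, not a first response.**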
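
To make the masking effect concrete, here is a minimal sketch of the arithmetic. It assumes each attempt fails independently with the same probability, which real flaky tests only approximate; `reportedFailureRate` is a hypothetical helper, not part of any test runner.

```typescript
// Probability that a test is still reported as failed after all retries,
// assuming every attempt fails independently with probability `failRate`.
function reportedFailureRate(failRate: number, retries: number): number {
  // The test is reported red only if the first attempt AND every retry fail.
  return Math.pow(failRate, retries + 1);
}

// A test that fails 20% of the time, run with retries: 3
const visible = reportedFailureRate(0.2, 3); // ≈ 0.0016, i.e. about 0.16% of runs
console.log(`reported red in ${(visible * 100).toFixed(2)}% of runs`);
```

Under these assumptions a test that is broken one run in five looks almost perfectly green on the dashboard, which is why the underlying problem stops getting attention.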

## The 5-step process

```
1. Reproduce → 2. Isolate → 3. Diagnose → 4. Fix → 5. Prevent
```

### Step 1: Reproduce (force it into the open)

The most painful part of a flaky test is "I can't reproduce it". **Goal: get a stable, measurable failure rate.**

```bash
# Playwright: run it 50 times in a row
for i in {1..50}; do
  npx playwright test login.spec.ts --workers=1 || echo "FAIL #$i"
done
```

```bash
# pytest: use pytest-repeat
pytest test_login.py --count=50
```

```bash
# Jest: no built-in repeat flag, so wrap it in a loop
for i in {1..50}; do npx jest test/login.test.js || echo "FAIL #$i"; done
```

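Run it 50 times and look at the failure rate:

- **0%**: your environment differs from CI, so reproduce it on CI (add a `workflow_dispatch` trigger to run the workflow manually)
- **2-5%**: genuinely flaky, keep going
- **80%+**: what you took for flaky is actually consistently broken; it's a regression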
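
Whether 50 runs is enough depends on how rare the failure is. The sketch below estimates that, assuming runs are independent and share one fixed failure probability; both function names are hypothetical.

```typescript
// Chance of seeing at least one failure in `runs` independent runs.
function chanceOfSeeingFailure(failRate: number, runs: number): number {
  return 1 - Math.pow(1 - failRate, runs);
}

// Smallest number of runs that shows at least one failure
// with the requested confidence (e.g. 0.95).
function runsNeeded(failRate: number, confidence: number): number {
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - failRate));
}

console.log(chanceOfSeeingFailure(0.02, 50).toFixed(2)); // roughly 0.64
console.log(runsNeeded(0.02, 0.95));                     // 149
```

So a clean batch of 50 runs does not rule out a 2% flake; for rare failures, raise the run count before concluding the problem only exists on CI.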

### Step 2: Isolate (find which case, which step)

There are three common patterns:

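#### A. Cross-case pollution

```bash
# Run the case on its own: not flaky
# (--workers=1 keeps parallelism out of the picture)
npx playwright test login.spec.ts:25 --workers=1 --repeat-each=50  # → 100% pass

# Run it together with the cases before it: flaky
npx playwright test login.spec.ts --workers=1 --repeat-each=50     # → 5% fail
```

→ **An earlier case left state behind** (user session, test data, cache)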

#### B. A race condition inside a step

One step inside the case fails now and then. **Add more logs, video, and traces**:

```typescript
test('login flow', async ({ page }) => {
  await page.goto('/login');
  await page.getByLabel('Email').fill('test@example.com');
  await page.getByLabel('Password').fill('password');
  await page.getByRole('button', { name: 'Log in' }).click();

  // This line fails intermittently
  await expect(page.getByText('Welcome')).toBeVisible();
});
```

**Check the trace**: at the moment of failure, was it "button clicked → no navigation" or "navigated → text slow to appear"?
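
Traces are only useful if they are recorded when the failure happens. A minimal config sketch, assuming a standard `playwright.config.ts` at the project root, that keeps artifacts only for failed tests:

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    trace: 'retain-on-failure',      // record a trace, keep it only when the test fails
    video: 'retain-on-failure',      // same for video
    screenshot: 'only-on-failure',   // capture a screenshot when a test fails
  },
});
```

A retained trace can then be opened with `npx playwright show-trace <path-to-trace.zip>` to step through the DOM snapshots and network calls around the failing assertion.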

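#### C. Parallel tests stepping on each other

```bash
# One worker: not flaky
npx playwright test --workers=1 --repeat-each=20  # → pass

# Four workers: flaky
npx playwright test --workers=4 --repeat-each=20  # → 10% fail
```

→ **Several tests share the same test data** (for example, one account modified by multiple tests)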

### Step 3: Diagnose (find the root cause)

The five most common categories:

#### Category 1: Timing / not waiting long enough

```typescript
// ❌ flaky
await page.click('button');
await page.click('#next-button');  // the first click may still be loading

// ✅ Wait for the action to finish
// (coarse: Playwright's docs discourage relying on 'networkidle' in tests)
await page.click('button');
await page.waitForLoadState('networkidle');
await page.click('#next-button');

// ✅ Better: wait for a specific condition
await page.click('button');
await expect(page.getByText('Saved')).toBeVisible();
await page.click('#next-button');
```

**Core principle**: **wait for state, not for time**.
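
Playwright's web-first assertions already retry until a condition holds. For code outside the browser (a queue, a database row, a background job), the same principle can be sketched as a small polling helper. `pollUntil` is a hypothetical name, and the timeout and interval values are illustrative assumptions.

```typescript
interface PollOptions {
  timeoutMs: number;  // give up after this long
  intervalMs: number; // delay between checks
}

// Poll a condition until it holds, instead of sleeping for a fixed time.
async function pollUntil(
  condition: () => boolean | Promise<boolean>,
  { timeoutMs, intervalMs }: PollOptions,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await condition()) return; // state reached: continue immediately
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Unlike a fixed sleep, this returns as soon as the state is reached and fails loudly with a timeout when it never is, so a slow CI machine costs time rather than correctness.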

#### Category 2: Auto-wait doesn't cover it

```typescript
// ❌ page.screenshot() does not wait for the modal's animation to finish
await page.click('#open-modal');
expect(await page.screenshot()).toMatchSnapshot();  // captured mid-animation

// ✅ Wait for the animation to end (check the CSS transition, or use a class as the signal;
//    this assumes the app adds `open` once the modal has settled)
await page.click('#open-modal');
await expect(page.locator('.modal')).toHaveClass(/open/);
expect(await page.screenshot()).toMatchSnapshot();
```

#### Category 3: Test data pollution

```typescript
// ❌ Two tests share the same account
test('change password', async () => {
  await login('user1@test.com', 'oldpass');
  await changePassword('newpass');  // afterwards the password is newpass
});

test('login with password', async () => {
  await login('user1@test.com', 'oldpass');  // fails if 'change password' ran first
});

// ✅ Each test gets its own account (created on the fly by a fixture)
test('change password', async ({ freshUser }) => {
  await login(freshUser.email, freshUser.pass);
  await changePassword('newpass');
});
```
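
A fixture like `freshUser` needs credentials that never collide, even across parallel workers. The sketch below covers only that part; `makeFreshUser` and the `TestUser` shape are hypothetical, and creating the account in the backend is left out because it depends on the application.

```typescript
interface TestUser {
  email: string;
  pass: string;
}

let sequence = 0; // per-process counter, so two calls in one worker never collide

// Build credentials that are unique per worker and per call.
function makeFreshUser(workerIndex: number): TestUser {
  sequence += 1;
  // Random suffix reduces collisions with leftovers from earlier runs.
  const suffix = Math.random().toString(36).slice(2, 8);
  return {
    email: `user-w${workerIndex}-${sequence}-${suffix}@test.com`,
    pass: `pw-${suffix}`,
  };
}

console.log(makeFreshUser(0).email); // e.g. user-w0-1-<random>@test.com
```

In Playwright the worker number is available as `testInfo.workerIndex`, so a fixture could pass it straight into a helper like this.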

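#### Category 4: Order dependency

```typescript
// ❌ Order-dependent: 'step 2' only passes if 'step 1' ran first in the same worker
test('step 1', async () => { /* ... */ });
test('step 2', async () => { /* depends on step 1 */ });

// With fullyParallel: true the two can run in different workers, in any order → flaky

// ✅ Each test does its own setup
test('step 2 with prereqs', async () => {
  await setupStep1Data();
  // then do what step 2 is actually about
});
```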
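
Order dependencies tend to stay hidden until the order changes. One way to surface them, in the spirit of `pytest-randomly`, is to run test files in a shuffled but reproducible order. This is a sketch only: `seededRandom` and `shuffleWithSeed` are hypothetical helpers, and wiring the result into a runner is left out.

```typescript
// Small seeded PRNG (mulberry32-style); the same seed gives the same sequence.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the seeded PRNG; the input is not modified.
function shuffleWithSeed<T>(items: readonly T[], seed: number): T[] {
  const rand = seededRandom(seed);
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}

const files = ['a.spec.ts', 'b.spec.ts', 'c.spec.ts', 'd.spec.ts'];
console.log(shuffleWithSeed(files, 20260610)); // same seed → same order every time
```

Logging the seed with each run means a failing order can be replayed exactly, which turns an order-dependent flake into a deterministic failure.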

#### Category 5: Third-party / external dependencies

```typescript
// ❌ Hits the third party directly
test('payment', async () => {
  await pay({ card: '4111...' });  // the Stripe sandbox times out now and then
});

// ✅ Mock or stub
test('payment', async ({ page }) => {
  // A regex also matches subdomains such as api.stripe.com
  await page.route(/stripe\.com/, route => route.fulfill({
    status: 200,
    contentType: 'application/json',
    body: JSON.stringify({ status: 'succeeded' }),
  }));
  await pay({ card: '4111...' });
});
```

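### Step 4: Fix (treat the cause, not the symptom)

| Root cause | Fix |
|-----------|------|
| Timing | `expect(locator).toBeVisible()` instead of `waitForTimeout` |
| Animation | Screenshot with `animations: 'disabled'`, or wait for the final state |
| Test data pollution | Fresh data per test (fixture) |
| Order dependency | Make tests independent, or mark them explicitly as serial |
| Parallel tests colliding | State the intent explicitly with `test.describe.parallel` / `test.describe.serial` |
| Third party | mock / stub |
| Race condition (real bug) | **Fix the product code, not the test** |

The last row matters most: **a flaky test is sometimes a real bug**. A race condition shows up as flakiness in tests and as an intermittent bug in production. **Rule out a real bug before you touch the test.**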
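
As an illustration of what "a real bug" can look like, here is a minimal sketch of a lost update in product code. The in-memory `store` and both functions are hypothetical stand-ins for a read-modify-write against a database, and the promise chain is only one of several possible fixes.

```typescript
// Hypothetical in-memory record standing in for a database row.
const store = { balance: 100 };

// Simulated I/O: yields to the event loop so other calls can interleave.
const tick = () => new Promise<void>((resolve) => setTimeout(resolve, 0));

// Buggy: reads, waits, then writes back a value based on the stale read.
async function withdrawUnsafe(amount: number): Promise<void> {
  const current = store.balance;
  await tick();
  store.balance = current - amount;
}

// One possible fix: serialize updates through a promise chain.
let queue: Promise<void> = Promise.resolve();
function withdrawSafe(amount: number): Promise<void> {
  queue = queue.then(async () => {
    const current = store.balance;
    await tick();
    store.balance = current - amount;
  });
  return queue;
}

async function demo(): Promise<{ unsafe: number; safe: number }> {
  store.balance = 100;
  await Promise.all([withdrawUnsafe(30), withdrawUnsafe(50)]);
  const unsafe = store.balance; // 50: the first withdrawal was overwritten

  store.balance = 100;
  await Promise.all([withdrawSafe(30), withdrawSafe(50)]);
  const safe = store.balance; // 20: both withdrawals applied
  return { unsafe, safe };
}
```

Here the interleaving is forced, so the bug shows every time. In a real system it depends on timing, which is exactly why a test around such code passes most of the time and no amount of waiting in the test will fix it.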

### Step 5: Prevent (keep it from coming back)

Fix one and another will appear. **Add guard rails**:

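#### A. Run a stress test on CI

```yaml
# .github/workflows/flaky-detector.yml
name: Flaky Detector
on:
  schedule:
    - cron: '0 2 * * *'  # nightly at 02:00 UTC
  workflow_dispatch:

jobs:
  detect:
    runs-on: ubuntu-latest
    steps:
      # Setup steps assume an npm project with a lockfile
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test --repeat-each=10
```

Every test runs 10 times in a row, which surfaces the flaky cases.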

#### B. Set a quality gate

```yaml
- name: Check pass rate
  run: |
    PASS_RATE=$(jq '.passed / .total' summary.json)
    if (( $(echo "$PASS_RATE < 0.95" | bc -l) )); then
      echo "Pass rate below 95% — failing"
      exit 1
    fi
```

#### C. Flag it in code review

Add to the PR template:

```markdown
- [ ] Were tests added?
- [ ] Are the tests independent and safe to re-run?
- [ ] No sleep / waitForTimeout?
```

#### D. Give flaky tests a lifecycle

```
[Detected] → [Investigating] → [Skipped temporarily] → [Fixed] → [Verified stable for 7 days] → [Closed]
```

Don't skip forever. Once a test has been skipped for more than a week, file a ticket and schedule the fix.
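
The lifecycle can be enforced rather than just documented. A minimal sketch, assuming states are tracked in a script or ticketing integration: the state names mirror the diagram above, while the direct `investigating → fixed` edge and the `fixed → investigating` fallback are assumptions added for illustration.

```typescript
type FlakyState =
  | 'detected' | 'investigating' | 'skipped' | 'fixed' | 'verified' | 'closed';

// Allowed transitions between lifecycle states.
const NEXT: Record<FlakyState, FlakyState[]> = {
  detected: ['investigating'],
  investigating: ['skipped', 'fixed'], // direct fix without skipping: assumption
  skipped: ['fixed'],
  fixed: ['verified', 'investigating'], // back to investigating if it flakes again: assumption
  verified: ['closed'],
  closed: [],
};

function canTransition(from: FlakyState, to: FlakyState): boolean {
  return NEXT[from].indexOf(to) !== -1;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// True once a test has been skipped for longer than `maxDays` (one week by default).
function isSkipOverdue(skippedAt: Date, now: Date, maxDays = 7): boolean {
  return now.getTime() - skippedAt.getTime() > maxDays * DAY_MS;
}
```

A nightly job could call `isSkipOverdue` for every skipped test and open a ticket for each one that returns `true`.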

## Flaky detection script

A simple detector:

```bash
#!/usr/bin/env bash
# flaky-detector.sh: run the given test N times and report the pass rate

TEST=${1:?usage: flaky-detector.sh <test-file> [runs]}
RUNS=${2:-20}
FAILED=0

for i in $(seq 1 "$RUNS"); do
  # Keep each run's output so failures can be inspected afterwards
  if ! npx playwright test "$TEST" --workers=1 > "/tmp/result-$i.log" 2>&1; then
    FAILED=$((FAILED + 1))
  fi
done

PASS_RATE=$(echo "scale=2; ($RUNS - $FAILED) * 100 / $RUNS" | bc)
echo "Pass rate: $PASS_RATE% ($((RUNS - FAILED))/$RUNS)"

# Any failure at all counts as flaky
if [ "$FAILED" -gt 0 ]; then
  echo "FLAKY DETECTED"
  exit 1
fi

Run it:

```bash
./flaky-detector.sh tests/login.spec.ts 50
```
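
The detector's counts can be read against the thresholds from Step 1. A small sketch: `classify` is a hypothetical helper, and treating everything between 0% and 80% as flaky is an assumption, since Step 1 only names the 2-5% and 80%+ bands.

```typescript
type Verdict = 'not-reproduced' | 'flaky' | 'likely-regression';

// Map a failure count to the Step 1 interpretation.
function classify(failed: number, runs: number): Verdict {
  if (runs <= 0) throw new Error('runs must be positive');
  const failRate = failed / runs;
  if (failRate === 0) return 'not-reproduced';      // reproduce on CI instead
  if (failRate >= 0.8) return 'likely-regression';  // consistently broken, not flaky
  return 'flaky';                                   // continue with Step 2
}

console.log(classify(2, 50));  // 'flaky'
console.log(classify(45, 50)); // 'likely-regression'
```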

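## Common pitfalls

1. **Blaming the test framework**: Playwright and Cypress themselves are very stable. 99% of flakiness comes from how the test is written or from the product code.
2. **"Passes locally" = OK**: your machine has 4 cores, CI has 2. Races only surface when resources are tight. Always verify on CI.
3. **Treating the symptom with sleep**: `sleep(5000)` passes today, is flaky again the day after tomorrow, and the number keeps growing. Always wait for state, not time.
4. **Leaving `it.only` behind**: you fix half of something, switch tasks, and forget to remove it. CI should fail when it finds `.only`.
5. **Not keeping flaky history**: if the same case has been "fixed" three times, the root cause was never addressed.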
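
For pitfall 4, Playwright has a built-in guard. A minimal config sketch, assuming the CI provider sets the `CI` environment variable (GitHub Actions does):

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Fail the run on CI if any test.only is left in the code.
  forbidOnly: !!process.env.CI,
});
```

Locally `test.only` keeps working for focused debugging; on CI a forgotten `.only` fails the run instead of silently running a single test.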

## Tools

| Tool | Purpose |
|------|------|
| Playwright Trace Viewer | Inspect the DOM / network at the moment of failure |
| `pytest-repeat` / `pytest-randomly` | Repeat runs / randomize order |
| `jest --runInBand --testSequencer` | Control execution order |
| Allure history | See which cases have been flaky in the past |
| `--workers=1` (Playwright) | Force a single worker while debugging |

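## Anti-pattern roundup

1. Treating retry as the fix
2. Treating skip as the fix
3. Failing to tell flaky tests apart from product bugs
4. Pushing because it passed locally
5. Not tracking flaky statistics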

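## Closing

Flaky tests reveal a team's testing culture. **A team that tolerates 5% flaky will be at 20% a year later, and CI becomes decoration.** Track every flaky test closely and fold it into your incident review process; that is what keeps automation effective over the long run.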