---
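title: RAG System Testing - A Complete QA Flow for Retrieval, Augmentation, and Generation
description: A complete testing guide for RAG (Retrieval-Augmented Generation) systems. Covers retrieval evaluation (recall/precision), chunking strategy tests, citation verification, hallucination detection, and knowledge base drift.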
category: ai-qa
tags: [rag, retrieval, llm, knowledge-base, evaluation]
date: 2026-06-13
faq:
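  - q: What is the biggest difference between testing RAG and testing a plain LLM system?
    a: RAG adds a retrieval layer. You have to test not only whether the LLM answers correctly, but also whether it retrieved the right documents. If retrieval is wrong, even the strongest LLM cannot recover. List metrics per layer as early as the spec stage.
  - q: How do you test a chunking strategy?
    a: After changing the chunking, run the retrieval eval set and watch how recall@5 and recall@10 move. 512 tokens with an overlap of 50 is a common starting point, but it depends on your domain (code chunks behave differently from legal-clause chunks).
  - q: How do you know a RAG system is hallucinating?
    a: Three techniques. (1) Force citations and refuse to answer without a source; (2) check the facts in the output against the retrieved chunks; (3) use LLM-as-judge to rate whether the answer is grounded in the context.
  - q: What should be retested after a knowledge base update?
    a: At minimum, run the retrieval eval set (recall / precision) and spot-check 20 typical questions to see whether the answers still hold. After recomputing embeddings, verify that vector search can still find the old documents.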
---

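# RAG System Testing: Three-Layer QA for Retrieval, Augmentation, and Generation

"We built an AI support agent that answers from our knowledge base" describes a RAG system. **Testing it takes roughly twice the work of a plain LLM**, because you have to test "did it find the right thing" and "did it answer correctly" separately. This article gives you a complete framework.

## What RAG Is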

```mermaid
flowchart LR
    User[User question] --> Q[Embed query]
    Q --> Vec[Vector Search]
    Vec --> KB[(Knowledge Base)]
    KB --> Chunks[Top-K chunks]
    Chunks --> Prompt[Augment prompt<br>with chunks]
    Prompt --> LLM[LLM]
    LLM --> Answer[Answer + citations]

    style Vec fill:#06b6d4,color:#fff
    style LLM fill:#a855f7,color:#fff
    style Answer fill:#10b981,color:#fff
```

**Three layers**:
1. **Retrieval**: find the right chunks
2. **Augmentation**: assemble the prompt
3. **Generation**: the LLM produces the answer

Every layer needs its own tests.

## Why RAG Is Harder to Test Than a Plain LLM

```mermaid
flowchart TD
    Pure[Plain LLM] --> P1["input → output<br>one eval layer"]
    RAG[RAG] --> R1["input → retrieve → augment → output<br>three eval layers"]

    RAG --> R2[Retrieval wrong → LLM cannot recover]
    RAG --> R3[Chunks right → LLM can still hallucinate]
    RAG --> R4[Knowledge base updated → retest everything]

    style Pure fill:#10b981,color:#fff
    style RAG fill:#ef4444,color:#fff
```

## Layer 1: Retrieval Testing

**The question**: for a given query, does vector search return the right chunks?

### Example Retrieval Eval Set

```json
[
  {
    "id": "R-001",
    "query": "How do I set up multi-language support?",
    "expected_doc_ids": ["doc-i18n-setup", "doc-language-switcher"],
    "min_recall_at_5": 1.0
  },
  {
    "id": "R-002",
    "query": "API rate limiting",
    "expected_doc_ids": ["doc-rate-limit", "doc-throttling"],
    "min_recall_at_5": 0.5
  }
]
```
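One way to put an eval set like this to work is a small runner that checks every case against its own `min_recall_at_5`. The sketch below is only an illustration: `search(query, k)` is a hypothetical callable that returns a ranked list of doc ids, and `stub_search` with its canned results stands in for a real vector search.

```python
import json

EVAL_SET = json.loads("""
[
  {"id": "R-001", "query": "multi-language setup",
   "expected_doc_ids": ["doc-i18n-setup", "doc-language-switcher"],
   "min_recall_at_5": 1.0},
  {"id": "R-002", "query": "API rate limiting",
   "expected_doc_ids": ["doc-rate-limit", "doc-throttling"],
   "min_recall_at_5": 0.5}
]
""")


def recall_at_k(expected_ids, retrieved_ids, k=5):
    """Share of expected docs that appear in the first k retrieved ids."""
    hits = set(expected_ids) & set(retrieved_ids[:k])
    return len(hits) / len(expected_ids)


def run_eval_set(eval_set, search, k=5):
    """Return the cases whose recall@k falls below their own threshold."""
    failures = []
    for case in eval_set:
        retrieved = search(case["query"], k)
        score = recall_at_k(case["expected_doc_ids"], retrieved, k)
        if score < case["min_recall_at_5"]:
            failures.append({"id": case["id"], "recall@5": score})
    return failures


def stub_search(query, k):
    # Hypothetical retriever with canned results, standing in for vector search
    index = {
        "multi-language setup": ["doc-i18n-setup", "doc-faq"],
        "API rate limiting": ["doc-rate-limit", "doc-billing"],
    }
    return index.get(query, [])[:k]


failures = run_eval_set(EVAL_SET, stub_search)
```

Keeping the threshold on each case lets strict queries (a single canonical doc) and loose ones (several acceptable docs) live in the same file.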

### Key Metrics

```mermaid
flowchart LR
    Metrics[Retrieval Metrics] --> M1["Recall@K<br>share of all relevant docs found in the top K"]
    Metrics --> M2["Precision@K<br>share of the top K that are relevant"]
    Metrics --> M3["MRR<br>Mean Reciprocal Rank"]
    Metrics --> M4["NDCG<br>rank-aware relevance"]

    style Metrics fill:#06b6d4,color:#fff
```

```python
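def retrieval_eval(query, expected_docs, retrieved_docs, k=5):
    """Score one query. `query` is unused here; callers keep it for logging."""
    if not expected_docs or k <= 0:
        raise ValueError("expected_docs must be non-empty and k must be positive")
    expected = set(expected_docs)
    top_k = retrieved_docs[:k]
    # Assumes each retrieved doc exposes an `.id`; a set avoids double-counting
    hits = {d.id for d in top_k if d.id in expected}
    recall = len(hits) / len(expected)  # share of relevant docs that were found
    precision = len(hits) / k           # share of the k slots that are relevant
    return {"recall@k": recall, "precision@k": precision}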
```
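Recall and precision ignore where in the top K a relevant doc lands, which is what MRR and NDCG add. The following is a minimal sketch of both, assuming binary relevance (a doc is either expected or not) and unique ids in the retrieved list; the function names are illustrative, not taken from any library.

```python
import math


def reciprocal_rank(expected_ids, retrieved_ids):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    expected = set(expected_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(cases):
    """cases: list of (expected_ids, retrieved_ids) pairs."""
    return sum(reciprocal_rank(e, r) for e, r in cases) / len(cases)


def ndcg_at_k(expected_ids, retrieved_ids, k=5):
    """NDCG with binary relevance: gain 1 if the doc is expected, else 0."""
    expected = set(expected_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved_ids[:k], start=1)
        if doc_id in expected
    )
    # Ideal ranking puts every relevant doc (up to k) at the top
    ideal_hits = min(len(expected), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

If your eval set carries graded relevance instead of a flat list of expected ids, the gain term would change from 1 to the grade.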

## Chunking Strategy Testing

```mermaid
flowchart TD
    Doc[Document] --> Strategy{Chunking}

    Strategy --> S1["Fixed size<br>512 tokens"]
    Strategy --> S2["Semantic<br>by paragraph"]
    Strategy --> S3["Hierarchical<br>chapter/section/paragraph"]
    Strategy --> S4["Sliding window<br>overlap 50"]

    S1 --> R1[Simple, may cut sentences]
    S2 --> R2[Semantically complete, uneven length]
    S3 --> R3[Good for long documents]
    S4 --> R4[Continuous context, larger index]
```
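To make the fixed-size, sliding-window, and paragraph strategies concrete, here is a rough sketch. It assumes the document is already tokenized into a list (how you tokenize is up to your embedding model), and both function names are hypothetical.

```python
def chunk_fixed(tokens, size=512, overlap=0):
    """Split a token list into windows of `size` tokens.

    With overlap > 0 this becomes a sliding window: each chunk repeats the
    last `overlap` tokens of the previous one.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def chunk_by_paragraph(text):
    """Semantic-style split: one chunk per blank-line-separated paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

For example, `chunk_fixed(tokens, size=512, overlap=50)` corresponds to the "512 tokens, overlap 50" setup in the diagram. The overlap keeps context continuous across chunk boundaries at the cost of a larger index.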

**How to tell which strategy works better**:

1. Keep the same eval set
2. Change the chunking and re-index
3. Run the retrieval eval
4. Compare recall@5 / NDCG

```bash
# A/B test
python eval/run.py --chunking fixed_512 --output ./out_a/
python eval/run.py --chunking semantic --output ./out_b/
python eval/compare.py ./out_a/ ./out_b/
```

## Layer 2: Augmentation Testing

```mermaid
flowchart TD
    Aug[Augmentation tests] --> A1["Prompt structure correct?"]
    Aug --> A2["Context size within model limit?"]
    Aug --> A3["Citations labeled correctly?"]
    Aug --> A4["System prompt not overridden?"]
    Aug --> A5["Sensitive data masked?"]

    style Aug fill:#a855f7,color:#fff
```

### Example: Prompt Assembly Test

```python
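def test_prompt_assembly():
    # build_rag_prompt is the function under test, imported from your RAG code
    chunks = [{"id": "doc-1", "text": "Lorem ipsum...", "score": 0.9}]
    user_q = "How?"
    prompt = build_rag_prompt(user_q=user_q, chunks=chunks)

    assert "Lorem ipsum" in prompt  # chunk text reaches the prompt
    assert "Source: doc-1" in prompt or "[1]" in prompt  # citation marker present
    assert len(prompt) < 16000  # character count as a rough proxy for the context limit
    # Input sanity check only; a real injection test feeds a hostile query
    # and inspects how the assembled prompt handles it
    assert "ignore previous" not in user_q.lower()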
```
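The test above calls `build_rag_prompt` without showing it. As a reference point, here is one hypothetical minimal version, not the implementation the test was written against: it numbers the sources so the answer can cite them as `[n]` and enforces a character budget. The header wording and the `max_chars` parameter are assumptions.

```python
def build_rag_prompt(user_q, chunks, max_chars=16000):
    """Assemble a prompt with numbered sources, best-scoring chunks first.

    Stops at the first chunk that would push the prompt over `max_chars`.
    """
    header = "Answer using only the sources below. Cite them as [n].\n\n"
    footer = f"\nQuestion: {user_q}\n"
    budget = max_chars - len(header) - len(footer)

    parts = []
    ranked = sorted(chunks, key=lambda c: c["score"], reverse=True)
    for n, chunk in enumerate(ranked, start=1):
        block = f"[{n}] Source: {chunk['id']}\n{chunk['text']}\n"
        if len(block) > budget:
            break  # this chunk and all lower-ranked ones are left out
        parts.append(block)
        budget -= len(block)

    return header + "".join(parts) + footer
```

A character budget is only a stand-in. In practice you would count tokens with the tokenizer of the model you call.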

## Layer 3: Generation Testing

This is back to ordinary LLM eval, with **two extra dimensions**:

```mermaid
flowchart TD
    Gen[Generation Eval] --> G1[Correctness]
    Gen --> G2[Fluency]
    Gen --> G3[Relevance]
    Gen --> G4[Grounded?<br>RAG-specific]
    Gen --> G5[Citations correct?<br>RAG-specific]

    style G4 fill:#ef4444,color:#fff
    style G5 fill:#ef4444,color:#fff
```

### Groundedness Evaluation

**Does the content of the answer really come from the retrieved chunks?**

```python
GROUNDED_PROMPT = """
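Evaluate whether the answer is based on the context.

Context:
{chunks}

Answer:
{answer}

For each factual claim, return a JSON object with:
- "claim": "the claim",
- "in_context": true/false,
- "source_chunk": "chunk_id or null"

Finally, give an overall grounded score from 0 to 1.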
"""
```
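Once the judge replies, it can be safer to recompute the score from the per-claim verdicts than to trust the model's self-reported number. The sketch below assumes, hypothetically, that the judge returns a JSON object of the form `{"claims": [...]}`; the prompt above does not pin down the envelope, so adjust the parsing to whatever your judge actually emits.

```python
import json


def score_groundedness(judge_output):
    """Recompute the grounded score as the share of claims found in context.

    Assumes a reply shaped like:
    {"claims": [{"claim": "...", "in_context": true, "source_chunk": "c1"}]}
    """
    claims = json.loads(judge_output)["claims"]
    if not claims:
        # No factual claims were made, so there is nothing to ground
        return {"score": 1.0, "ungrounded": []}
    ungrounded = [c["claim"] for c in claims if not c["in_context"]]
    score = 1 - len(ungrounded) / len(claims)
    return {"score": score, "ungrounded": ungrounded}
```

Returning the list of ungrounded claims alongside the score makes failed cases much quicker to review by hand.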

### Citation Verification

```python
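import re


def verify_citations(answer, chunks):
    # Citations are 1-based markers such as [1], [2] that index into `chunks`
    citations = re.findall(r"\[(\d+)\]", answer)
    for c in citations:
        idx = int(c) - 1
        if idx < 0 or idx >= len(chunks):
            return {"valid": False, "error": f"Citation [{c}] points to a nonexistent chunk"}
    return {"valid": True}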
```
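Checking that citation indices exist does not catch sentences that carry no citation at all. A rough complementary check is sketched below. It assumes English-style sentence punctuation and that the `[n]` marker sits before the closing period; `uncited_sentences` is a hypothetical helper name.

```python
import re


def uncited_sentences(answer):
    """Return the sentences in `answer` that carry no [n] citation marker."""
    # Split after ., ! or ? followed by whitespace (naive sentence splitter)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not re.search(r"\[\d+\]", s)]
```

An answer that returns a non-empty list here can be flagged for review or rejected outright, depending on how strict your "no source, no answer" policy is.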

## Hallucination Detection

```mermaid
flowchart LR
    Out[LLM Output] --> Extract[Extract factual claims]
    Extract --> Check{Verify against chunks}
    Check -->|found| OK[✓ Grounded]
    Check -->|not found| Hall[❌ Possible hallucination]
    Hall --> Action[Flag or reject]

    style OK fill:#10b981,color:#fff
    style Hall fill:#ef4444,color:#fff
```

Tools:
- **Vectara HHEM**: open-source hallucination detection model
- **RAGAS**: RAG evaluation framework (includes faithfulness)
- **TruLens**: real-time RAG monitoring
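Before wiring in one of these tools, the "verify against chunks" step can be prototyped with plain word overlap. This is a deliberately crude sketch and not how any of the tools above work: it misses paraphrases and cannot detect negation, and the `min_overlap` threshold of 0.6 is an arbitrary assumption.

```python
import re


def _words(text):
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def check_claims(claims, chunks, min_overlap=0.6):
    """Label each claim by its best word overlap with any single chunk."""
    results = []
    for claim in claims:
        claim_words = _words(claim)
        if not claim_words:
            best = 0.0
        else:
            # Fraction of the claim's words that appear in the chunk
            best = max(
                (len(claim_words & _words(c)) / len(claim_words) for c in chunks),
                default=0.0,
            )
        results.append({"claim": claim, "overlap": best, "grounded": best >= min_overlap})
    return results
```

Claims that come back with `grounded` set to `False` are candidates for the "flag or reject" branch in the diagram, ideally confirmed by a stronger model-based check.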

## What to Test After a Knowledge Base Update

```mermaid
flowchart TD
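    Update[KB update] --> Q{What changed?}
    Q --> Q1[New documents added]
    Q --> Q2[Existing documents edited]
    Q --> Q3[Documents deleted]
    Q --> Q4[Embedding model swapped]

    Q1 --> T1[Can the new docs be retrieved?]
    Q2 --> T2[Are answers to related queries still correct?]
    Q3 --> T3[Do old queries have a fallback?]
    Q4 --> T4[Full reindex + full retest]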

    style Q4 fill:#ef4444,color:#fff
```
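The "added" and "deleted" branches lend themselves to a quick automated check. The sketch below assumes a hypothetical `search(query, k)` callable that returns a ranked list of doc ids, and that you maintain one probe query per changed document; neither comes from a specific library.

```python
def check_kb_update(search, added=None, deleted=None, k=5):
    """Run post-update retrieval checks and return a list of problems.

    added:   {doc_id: probe_query} for new docs that must now be retrievable
    deleted: {doc_id: old_query} for removed docs that must no longer appear
    """
    problems = []
    for doc_id, query in (added or {}).items():
        if doc_id not in search(query, k):
            problems.append(f"new doc {doc_id} not in top {k} for {query!r}")
    for doc_id, query in (deleted or {}).items():
        if doc_id in search(query, k):
            problems.append(f"deleted doc {doc_id} still returned for {query!r}")
    return problems
```

An empty list means the update passed these two checks. Edited documents and embedding model swaps still need the full retrieval and generation evals.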

### Automated Regression Alerts

```python
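# CI: run the retrieval eval set every day
# load, run_eval, alert, and page are project-specific helpers (not shown)
def daily_rag_health_check():
    baseline = load("baseline_metrics.json")
    today = run_eval()

    # Negative values mean the metric dropped relative to the baseline
    diff = {k: today[k] - baseline[k] for k in baseline}

    if diff["recall@5"] < -0.05:
        alert(f"⚠️ Retrieval recall dropped by {abs(diff['recall@5']):.3f}")

    if diff["faithfulness"] < -0.10:
        page("on-call", "Faithfulness dropped sharply")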
```

## Complete RAG QA Workflow

```mermaid
flowchart LR
    Dev[Dev changes chunking / prompt] --> CI[CI triggered]
    CI --> R[Retrieval Eval<br>50 queries]
    R --> G[Generation Eval<br>50 examples]
    G --> H[Hallucination Check]
    H --> Comp[Compare with baseline]
    Comp --> Pass{">= threshold?"}
    Pass -->|yes| Merge[Merge]
    Pass -->|no| Block[Block PR]

    Merge --> Prod[Production]
    Prod --> Mon[Monitor metrics]
    Mon --> Sample[Manual sample review]
    Sample --> Feedback[Feedback into eval set]

    style CI fill:#06b6d4,color:#fff
    style Block fill:#ef4444,color:#fff
    style Merge fill:#10b981,color:#fff
```
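The "compare with baseline, then merge or block" step can be as small as the sketch below. The metric names, the `max_drop` tolerance of 0.05, and the function names are all assumptions; the only real convention used is that a non-zero exit code fails a CI job.

```python
def gate(current, baseline, max_drop=0.05):
    """Return the metrics that fell more than `max_drop` below the baseline.

    A metric missing from `current` is treated as 0.0, so it always fails.
    """
    return {
        name: (baseline[name], current.get(name, 0.0))
        for name in baseline
        if baseline[name] - current.get(name, 0.0) > max_drop
    }


def main(current, baseline):
    """Print regressions and return a process exit code for CI."""
    regressions = gate(current, baseline)
    for name, (old, new) in regressions.items():
        print(f"REGRESSION {name}: {old:.2f} -> {new:.2f}")
    return 1 if regressions else 0  # non-zero blocks the PR
```

In a CI script you would load both metric dicts from the eval output and pass the result of `main(...)` to `sys.exit`.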

## Tool Map

| Tool | Purpose |
|------|------|
| **RAGAS** | RAG-specific eval (faithfulness, answer_relevancy) |
| **TruLens** | Open source observability |
| **LangSmith** | LangChain workflow tracing + eval |
| **Phoenix (Arize)** | Embedding visualization |
| **Promptfoo** | YAML-based eval, CLI |
| **DeepEval** | Pytest-style |
| **Vectara HHEM** | Hallucination detection |

## Anti-Patterns

```mermaid
flowchart TD
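    Anti[RAG testing anti-patterns] --> A1["Testing only the LLM answer, ignoring retrieval"]
    Anti --> A2["No enforced citations"]
    Anti --> A3["No retest after KB updates"]
    Anti --> A4["Chunking chosen by gut feeling"]
    Anti --> A5["No prompt injection tests"]
    Anti --> A6["No production faithfulness monitoring"]
    Anti --> A7["Embedding upgrade without re-embedding the index to match"]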

    style A1 fill:#ef4444,color:#fff
    style A2 fill:#ef4444,color:#fff
    style A3 fill:#ef4444,color:#fff
    style A4 fill:#ef4444,color:#fff
    style A5 fill:#ef4444,color:#fff
    style A6 fill:#ef4444,color:#fff
    style A7 fill:#ef4444,color:#fff
```

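## 5 Rules for RAG QA

1. **If retrieval is wrong, the LLM cannot save it. Test layer by layer**
2. **Always verify citations; no source, no answer**
3. **Every chunking change means a full retrieval eval run**
4. **Groundedness > correctness**
5. **A KB update means a full retest, not "it should be fine"**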

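## Final Thoughts

RAG system testing is among the most complex areas of LLM applications. **If you are new, start with 50 retrieval eval cases plus faithfulness monitoring**; within half a year you will be the AI QA person your team cannot do without. For a traditional QA engineer, moving into RAG can mean roughly 30% higher pay and twice as many job openings.

Further reading:
- [LLM Evaluation Testing](/ai-qa/llm-evaluation-testing.html)
- [AI / LLM Feature Spec Review](/spec-review/ai-feature-spec-review.html)
- [A QA Toolkit for Working Alongside AI](/ai-qa/ai-toolkit-for-qa.html)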