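How is AI-generated test data better than Faker?

Faker produces fake data in the right format; an LLM produces fake data that is realistic for the business. Take 100 orders: Faker pairs random SKUs with random customers, while an LLM can follow business logic such as "VIP customers tend to buy electronics" and "new customers spend less."

Should I use GPT-4, Claude, or a local model to generate test data?

It depends on data volume and privacy. At the 10,000-record scale, a local Llama 3 saves cost; anything containing PII should always stay on a local model.

How do I make sure the generated data is valid?

Constrain the output with a JSON Schema, enforced through Outlines, Instructor, or OpenAI function calling. After generation, run a validator and reject or regenerate anything that fails.

Is production data sampling safe?

Only with anonymization, ideally backed by differential privacy. Copying production records that contain personal data straight into a test environment can violate GDPR. Recommended: have the LLM generate synthetic data from the production schema, keeping the statistical distributions and dropping the personal data.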

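AI Test Data Generation: Using LLMs to Generate Realistic Fake Data

"Faker's test data is too fake to surface real bugs" is a common QA complaint. An LLM can generate business-realistic fake data and catch edge cases that Faker misses. This post walks through a complete workflow plus cost optimization.

Why Faker Is Not Enough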

flowchart LR
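    Faker[Faker / random generator] -->|produces| Surface[Correctly formatted fake data]
    Surface --> B1[Random SKU]
    Surface --> B2[Random customer name]
    Surface --> B3[Random amount]

    Real[Real business] --> R1[VIPs often buy electronics]
    Real --> R2[New customers spend less]
    Real --> R3[Weekend order spikes]
    Real --> R4[Refunds cluster in certain categories]

    LLM[LLM-generated test data] -->|produces| Realistic[Business-realistic]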
    Realistic --> R1
    Realistic --> R2
    Realistic --> R3
    Realistic --> R4

    style Faker fill:#9ca3af,color:#fff
    style LLM fill:#10b981,color:#fff

5 Scenarios for LLM-Generated Test Data

mindmap
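  root((LLM Test Data<br>Scenarios))
    1 Synthetic Production
      Realistic data from the schema
      Preserves statistical distributions
      No PII
    2 Edge Case Generation
      Negative / empty / very long
      Special characters / Unicode
      Cultural boundaries
    3 Adversarial Data
      SQL injection attempts
      XSS payload
      Prompt injection
    4 Localization
      Multilingual names
      Address formats
      Phone / national ID per country
    5 Domain-specific
      Medical terminology
      Legal documents
      Source code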

Workflow 1: Schema-Aware Generation

from openai import OpenAI
from pydantic import BaseModel
from typing import List

class Order(BaseModel):
    order_id: str
    customer_email: str
    items: List[str]
    total_usd: float
    status: str
    created_at: str

class TestDataResponse(BaseModel):
    orders: List[Order]

client = OpenAI()

def generate_test_orders(count=20, persona="VIP customer"):
    # Structured Outputs: the SDK parses the reply into TestDataResponse
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You generate e-commerce test orders. The data must be business-realistic, not purely random."},
            {"role": "user", "content": f"""
            Generate {count} orders for a {persona}.
            Requirements:
            - email uses the qa-test-<id>@example.com format
            - realistic amount distribution (VIP average $300, with $5000 outliers)
            - items are real electronics product names (iPhone / MacBook / AirPods, etc.)
            - status: 70% paid / 20% pending / 10% refunded
            - created_at within the past 30 days
            """},
        ],
        response_format=TestDataResponse,
    )
    return response.choices[0].message.parsed.orders

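Enforcing the schema means every completed response parses into the declared types, so no JSON post-processing is needed. It does not guarantee that the values follow the business rules in the prompt (amount distribution, status ratios, date window), so those still need to be checked.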
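Schema enforcement covers shape, not business rules. The sketch below shows one way to check the rules from the prompt and regenerate when a batch fails. The `generate` callable, the dict-shaped orders, and the `max_refunded_share` threshold are assumptions for illustration; `created_at` is assumed to be a timezone-aware ISO 8601 string.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def business_issues(orders, now=None, max_refunded_share=0.3):
    """Return a list of business-rule violations for a batch of order dicts."""
    now = now or datetime.now(timezone.utc)
    window_start = now - timedelta(days=30)
    issues = []
    for order in orders:
        if order["total_usd"] <= 0:
            issues.append(f"{order['order_id']}: non-positive total")
        # Assumes a timezone-aware ISO 8601 timestamp
        created = datetime.fromisoformat(order["created_at"])
        if not (window_start <= created <= now):
            issues.append(f"{order['order_id']}: created_at outside 30-day window")
    statuses = Counter(order["status"] for order in orders)
    unknown = set(statuses) - {"paid", "pending", "refunded"}
    if unknown:
        issues.append(f"unknown statuses: {sorted(unknown)}")
    if orders and statuses["refunded"] / len(orders) > max_refunded_share:
        issues.append("refunded share too high")
    return issues


def generate_until_valid(generate, max_attempts=3):
    """Call generate() until a batch passes business_issues(), else raise."""
    issues = []
    for _ in range(max_attempts):
        batch = generate()
        issues = business_issues(batch)
        if not issues:
            return batch
    raise ValueError(f"no valid batch after {max_attempts} attempts: {issues}")
```

In practice `generate` would wrap something like `generate_test_orders` and convert each parsed order to a dict. Capping the attempts keeps a bad prompt from burning API budget in a loop.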

Workflow 2: Edge Case Enumeration

EDGE_CASE_PROMPT = """
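For the "product price" field of an e-commerce order system, list 30 edge-case test values.

Cover:
- Boundaries (0, -1, 0.001, 999999.99)
- Special characters (currency symbols / thousands separators)
- Cross-region (USD / TWD / EUR)
- Attacks (SQL injection / overflow / underflow)
- Cultural (Arabic numerals / Chinese financial numerals such as 「壹仟」)

Return a JSON array where each item contains:
- "value": the test value
- "category": "boundary" / "special" / "locale" / "attack" / "cultural"
- "expected_behavior": "accept" / "reject" / "normalize"
- "reason": why this value is worth testing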
"""

The LLM returns 30 cases, including ones you would not have thought of. Review them by hand, keep about 20, and add those to the test suite.
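Before the manual review, it can help to drop items that do not match the format the prompt asked for. A minimal sketch, assuming the model's reply is available as a raw JSON string; `load_edge_cases` is a hypothetical helper name.

```python
import json

# Allowed values come straight from EDGE_CASE_PROMPT
ALLOWED_CATEGORIES = ("boundary", "special", "locale", "attack", "cultural")
ALLOWED_BEHAVIORS = ("accept", "reject", "normalize")
REQUIRED_KEYS = {"value", "category", "expected_behavior", "reason"}


def load_edge_cases(raw_json):
    """Parse the model's JSON array and split it into usable and rejected items."""
    usable, rejected = [], []
    for item in json.loads(raw_json):
        ok = (
            isinstance(item, dict)
            and REQUIRED_KEYS.issubset(item)
            and item["category"] in ALLOWED_CATEGORIES
            and item["expected_behavior"] in ALLOWED_BEHAVIORS
        )
        (usable if ok else rejected).append(item)
    return usable, rejected
```

The `usable` list is what goes to human review; the `rejected` list is worth a glance too, since a run of malformed items usually means the prompt needs tightening.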

Workflow 3: Production Data Anonymization

flowchart LR
    Prod[(Production DB)] --> Schema[Extract schema + statistical distributions]
    Schema --> Prompt[Feed to LLM<br>'Generate realistic synthetic data for this schema']
    Prompt --> Synthetic[Synthetic Data]
    Synthetic --> Verify[Anonymity verify]
    Verify --> Staging[(Staging DB)]

    Prod -.->|"❌ Direct copy = GDPR violation"| Staging

    style Prod fill:#ef4444,color:#fff
    style Synthetic fill:#10b981,color:#fff

The Safe Approach

import json

def generate_synthetic_users(count=1000, stats=None):
    """
    Generate synthetic data from production statistical distributions (no personal data).
    stats: {"age_mean": 35, "age_std": 12, "city_dist": {"Taipei": 0.4, ...}}
    """
    # `llm` and `parse` are placeholders for your own LLM client and response parser
    response = llm.complete(f"""
    Generate {count} synthetic users matching these distributions:
    {json.dumps(stats)}

    Use Faker-style fake names / emails.
    Do NOT use any real person's data.
    """)
    return parse(response)

Always think in differential-privacy terms: no synthetic record should be traceable back to a real person.
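The `stats` argument has to come from somewhere. One possible sketch of the extraction step is below; it reduces rows to aggregates and folds small groups into "other" so that a rare city cannot single out a person. The row fields (`age`, `city`) and the `min_group_size` threshold are assumptions. Note that this is plain aggregation with small-group suppression, not formal differential privacy, which would also add calibrated noise.

```python
import statistics
from collections import Counter


def extract_stats(rows, min_group_size=5):
    """Reduce production rows to aggregate statistics with no per-person fields."""
    ages = [row["age"] for row in rows]
    city_counts = Counter(row["city"] for row in rows)
    city_dist = {}
    other = 0
    for city, n in city_counts.items():
        if n >= min_group_size:
            city_dist[city] = round(n / len(rows), 3)
        else:
            other += n  # too few people to report on their own
    if other:
        city_dist["other"] = round(other / len(rows), 3)
    return {
        "age_mean": round(statistics.mean(ages), 1),
        "age_std": round(statistics.pstdev(ages), 1),
        "city_dist": city_dist,
    }
```

Only the returned dict should leave the production boundary; names, emails, and row-level values never reach the prompt.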

Workflow 4: Multilingual / Localization

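def generate_l10n_users(locales=("zh-TW", "ja-JP", "ar-SA", "de-DE"), per_locale=50):
    # Tuple default avoids the shared mutable default argument pitfall
    all_users = []
    for locale in locales:
        prompt = f"""
        Generate {per_locale} test users for the {locale} locale.
        - Names must be culturally appropriate
        - Addresses follow that country's format
        - Phone numbers follow that country's format
        - RTL (right-to-left) cases must be covered for Middle Eastern locales
        """
        # llm_generate is a placeholder for your own generation function
        users = llm_generate(prompt)
        all_users.extend(users)
    return all_users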

Essential for catching i18n bugs (RTL layout, string expansion, date formats).
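A model asked for Arabic users sometimes returns romanized names, which would leave the RTL path untested. The sketch below checks for that with the standard library's `unicodedata` module. The user dict shape (`locale`, `name`) and the helper names are assumptions, since the return format of `llm_generate` is not specified here.

```python
import unicodedata


def contains_rtl(text):
    """True if any character has a right-to-left bidi class (R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def rtl_coverage_issues(users, rtl_locales=("ar-SA",)):
    """Flag users in RTL locales whose name contains no RTL characters."""
    return [
        f"{user['locale']}: name {user['name']!r} has no RTL characters"
        for user in users
        if user["locale"] in rtl_locales and not contains_rtl(user["name"])
    ]
```

Running this on each generated batch gives a cheap signal for whether the locale-specific data actually exercises RTL rendering.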

Workflow 5: Adversarial / Security

ADVERSARIAL_PROMPT = """
Generate 50 malicious inputs for testing user input fields (such as signup email / name).

Cover:
- SQL injection (' OR 1=1--)
- XSS (<script>alert(1)</script>)
- LDAP injection
- Path traversal (../../etc/passwd)
- Unicode normalization attack
- Buffer overflow (very long strings)
- Prompt injection (for LLM-backed systems)
- NoSQL injection
- Control characters (null byte / CRLF)

Return a JSON array with "payload" and "category" for each item.
"""

Cost Optimization

flowchart TD
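    Cost[Cost control strategies] --> S1[Batch: 100 records in one call<br>beats 100 calls of 1 record]
    Cost --> S2[Local Llama 3 runs for free<br>use it for bulk data]
    Cost --> S3[Cache prompt<br>reuse for the same schema]
    Cost --> S4[Tier strategy<br>small seed set with GPT-4 + bulk expansion]
    Cost --> S5[Sampling<br>generate a 500-record baseline<br>expand to 50,000 in code]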

    style S5 fill:#10b981,color:#fff
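The batching advantage comes from paying for the instruction prompt once per call instead of once per record. A back-of-the-envelope sketch is below. All token counts and prices passed in are placeholders that you supply, not real pricing, and `estimate_cost_usd` is a hypothetical helper.

```python
import math


def estimate_cost_usd(records, batch_size, prompt_tokens, tokens_per_record,
                      usd_per_1k_input, usd_per_1k_output):
    """Estimate spend when the prompt is re-sent once per call."""
    calls = math.ceil(records / batch_size)
    input_tokens = calls * prompt_tokens          # prompt overhead scales with calls
    output_tokens = records * tokens_per_record   # output scales with records
    return (input_tokens * usd_per_1k_input
            + output_tokens * usd_per_1k_output) / 1000


# Same 100 records, one call each versus one batched call (placeholder numbers)
one_by_one = estimate_cost_usd(100, 1, 500, 100, 0.01, 0.03)
batched = estimate_cost_usd(100, 100, 500, 100, 0.01, 0.03)
```

Output cost is identical in both cases; only the repeated prompt overhead shrinks. Batch size is still bounded by the model's output length limit, so very large batches have to be split.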

Tier Strategy Example

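import copy
import random
import uuid

# Tier 1: 100 high-quality "seed" records from GPT-4
# generate_with_gpt4 and detailed_prompt are placeholders for your own generator and prompt
seeds = generate_with_gpt4(count=100, prompt=detailed_prompt)

# Tier 2: expand the seeds to 10,000 records in code
def augment(seeds, target=10000):
    expanded = []
    for _ in range(target):
        base = random.choice(seeds)
        new = copy.deepcopy(base)  # deep copy so nested lists are not shared with the seed
        # Programmatic variation
        new["order_id"] = str(uuid.uuid4())
        # Clamp so the jitter cannot push the total to zero or below
        new["total"] = round(max(0.01, new["total"] + random.uniform(-50, 50)), 2)
        new["email"] = f"qa-test-{uuid.uuid4()}@example.com"
        expanded.append(new)
    return expanded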

Cost drops from $50 to $0.50, because only the 100 seeds go through the LLM.

Quality Validation

from collections import Counter

from pydantic import ValidationError


def validate_generated_data(data, schema):
    """Validate generated data quality. `schema` is assumed to be a Pydantic v2 model class."""
    issues = []

    # 1. Schema validity
    for item in data:
        try:
            schema.model_validate(item)
        except ValidationError as e:
            issues.append(f"Schema fail: {e}")

    # 2. Uniqueness (guards against the LLM repeating its output)
    ids = [item["id"] for item in data]
    if len(ids) != len(set(ids)):
        issues.append("Duplicate IDs detected")

    # 3. Reasonable distribution (avoid mode collapse)
    dist = Counter(item["status"] for item in data)
    if data and max(dist.values()) / len(data) > 0.9:
        issues.append(f"Mode collapse: {dist}")

    # 4. No real PII (is_synthetic_email is a helper you supply)
    for item in data:
        if not is_synthetic_email(item.get("email", "")):
            issues.append(f"Possible real email: {item}")

    return issues
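The validator calls `is_synthetic_email` without defining it. One possible implementation, shown as a sketch, treats an address as synthetic only when its domain is one of the names reserved for documentation and testing (`example.com`, `example.net`, `example.org`, and the `.test`, `.example`, `.invalid`, `.localhost` TLDs). This choice is an assumption; a team using its own test domain would add it to the sets.

```python
RESERVED_DOMAINS = {"example.com", "example.net", "example.org"}
RESERVED_TLDS = {"test", "example", "invalid", "localhost"}


def is_synthetic_email(email):
    """True if the address sits on a domain reserved for documentation or testing."""
    local, sep, domain = email.strip().lower().rpartition("@")
    if not sep or not local or not domain:
        return False  # not shaped like an email at all
    if domain in RESERVED_DOMAINS:
        return True
    if any(domain.endswith("." + reserved) for reserved in RESERVED_DOMAINS):
        return True  # subdomain such as mail.example.org
    return domain.rsplit(".", 1)[-1] in RESERVED_TLDS
```

An allowlist is stricter than trying to detect "real-looking" addresses: anything on an unreserved domain gets flagged for review, even if it happens to be fake.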

CI Integration

name: Generate Test Data
on:
  workflow_dispatch:
  schedule:
    - cron: '0 0 * * 0'  # regenerate every Sunday

jobs:
  generate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: pip install -r requirements.txt

      - run: python scripts/gen_test_data.py --count 1000
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      - run: python scripts/validate_data.py test-data.json

      - uses: actions/upload-artifact@v4
        with:
          name: test-data
          path: test-data.json
          retention-days: 30

Tool Map

Tool	Purpose
Outlines	Structured LLM output with an enforced schema
Instructor	Pydantic + LLM, type safety
Faker	Traditional random data (fills the gaps)
Mostly AI	Commercial synthetic data
Gretel.ai	Privacy-safe synthetic data
Synthesized	Enterprise-oriented
OpenAI function calling	Structured JSON output
Anthropic Tool Use	Structured JSON output

Anti-Patterns

flowchart TD
    Anti[Test Data Gen anti-patterns] --> A1["No schema constraint"]
    Anti --> A2["Copying prod data directly"]
    Anti --> A3["One LLM call per record"]
    Anti --> A4["No uniqueness / distribution checks"]
    Anti --> A5["Output contains real people's PII"]
    Anti --> A6["No prompt caching"]
    Anti --> A7["Using output without review"]

    style A1 fill:#ef4444,color:#fff
    style A2 fill:#ef4444,color:#fff
    style A3 fill:#ef4444,color:#fff
    style A4 fill:#ef4444,color:#fff
    style A5 fill:#ef4444,color:#fff
    style A6 fill:#ef4444,color:#fff
    style A7 fill:#ef4444,color:#fff

Connecting to Test Data Management

Generate the data with an LLM, then manage it with Test Data Management strategies (Fixture / Factory / Seed).
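As one illustration of the Factory strategy, LLM-generated records can act as seeds that a factory clones and customizes per test. The `OrderFactory` class and its field names are hypothetical; the fixed RNG seed is an assumption made to keep test runs reproducible.

```python
import itertools
import random


class OrderFactory:
    """Build order dicts from LLM-generated seeds, with per-test overrides."""

    def __init__(self, seeds, rng_seed=0):
        self._seeds = list(seeds)
        self._rng = random.Random(rng_seed)  # fixed seed keeps runs reproducible
        self._counter = itertools.count(1)

    def build(self, **overrides):
        # Shallow copy: top-level fields are independent of the seed record
        order = dict(self._rng.choice(self._seeds))
        order["order_id"] = f"test-{next(self._counter):06d}"
        order.update(overrides)
        return order
```

A test that needs a refunded order would call `factory.build(status="refunded")` and inherit every other field from a realistic seed.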

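5 Takeaways for QA

Use the LLM for business-realistic data and Faker to fill in formats
Always enforce a schema constraint
A tier strategy cuts cost by 100x
Always anonymize prod data that contains PII
After generating, always verify uniqueness, distribution, and the absence of real people's data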

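Final Thoughts

AI test data generation is the fastest skill for QA to pick up when moving into AI. It is simpler than testing LLM systems, and the payoff shows immediately. Start by defining one schema and generating 100 records with OpenAI function calling, and your test data quality goes from "it runs" to "it catches real bugs."

Further reading:
- Test Data Management
- Generating Test Cases with an LLM
- Test Data Generator (front-end-only tool)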
