Prompt Caching: A Technique That Cuts Repeated-Input Costs by 90%

Claude API & Agent SDK Complete Guide, Part 7 of 15


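If you have started building applications on the Claude API, the bill has probably begun to sting.

Don't worry, I have been down this road too. Before I put prompt caching into production, one of my RAG applications cost about $800 a month in API fees. After the rollout, that dropped to $180. This is not a miracle; it is what prompt caching normally delivers.

In this chapter I cover how prompt caching works, how to use it, and what I have learned from running it in production.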

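Why is this the most important cost optimization technique?

Conclusion first, then the explanation.

Of all the cost optimization techniques for the Claude API, prompt caching has the highest return on investment. The reason is simple:

Most AI applications share one structural trait: a repeated prefix and a changing suffix.

Your system prompt is identical on every request. The document context you use for RAG barely changes within a single query session. Your few-shot examples are the same handful every time. Your tool definitions almost never change.

Without caching, this repeated prefix is reprocessed on every API request, and that is pure waste. Prompt caching solves exactly this problem.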

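How caching works

Claude's prompt caching rests on an intuitive idea: an identical input prefix only needs to be computed once.

When you mark a piece of content as cacheable, Claude's backend will:

1. Compute a hash of the prompt prefix up to and including that content
2. On the first request, compute the full KV cache and store it (this is a "cache write")
3. On later requests with the same prefix, read the cache and skip the computation (this is a "cache read")

Key point: caching is based on matching the complete prefix, which is assembled in the order tools, system, messages. If you have three blocks marked for caching, Claude must find a cache entry in which all three match in order to hit on the third. Change the first block and the caches for the second and third blocks are invalidated as well.

This property matters; we will rely on it later when designing the cache layout.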
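The prefix-matching rule can be illustrated with a toy model. The sketch below is not Anthropic's implementation; `prefix_keys` and `lookup` are hypothetical names, and the only thing it assumes is that each cache entry is keyed by a hash of everything up to and including a marked segment.

```python
import hashlib


def prefix_keys(segments: list[str]) -> list[str]:
    """Return one cache key per segment, each covering the whole prefix up to it."""
    hasher = hashlib.sha256()
    keys = []
    for segment in segments:
        data = segment.encode("utf-8")
        # Length-prefix each segment so ["ab", "c"] and ["a", "bc"] hash differently.
        hasher.update(len(data).to_bytes(8, "big"))
        hasher.update(data)
        keys.append(hasher.copy().hexdigest())
    return keys


def lookup(cache: set[str], segments: list[str]) -> int:
    """Count how many leading segments hit the cache, then store every key."""
    keys = prefix_keys(segments)
    hits = 0
    for key in keys:
        if key not in cache:
            break
        hits += 1
    cache.update(keys)
    return hits


cache: set[str] = set()
first = lookup(cache, ["system prompt", "document A", "examples"])       # cold cache
second = lookup(cache, ["system prompt", "document A", "examples"])      # identical prefix
third = lookup(cache, ["system prompt v2", "document A", "examples"])    # first segment changed
fourth = lookup(cache, ["system prompt", "document A", "new examples"])  # only the tail changed
print(first, second, third, fourth)
```

Changing the first segment produces zero hits even though the later segments are textually unchanged, while changing only the last segment still reuses the first two.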

Using cache_control

At the API level, you use the cache_control field to mark which content should be cached:

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
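        {
            "type": "text",
            "text": "You are a professional technical documentation assistant who helps engineers understand complex API documentation.",
        },
        {
            "type": "text",
            "text": """Below is the complete API reference (50,000 words in total):

            [Put your long document content here]
            """,
            "cache_control": {"type": "ephemeral"}  # mark this block as cacheable
        }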
    ],
    messages=[
        {
            "role": "user",
            "content": "How do I use the pagination parameters of the /api/v1/orders endpoint?"
        }
    ],
)

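The cache_control type currently has only one option: {"type": "ephemeral"}.

"Ephemeral" sounds short-lived, but the cache lifetime is not that short: the default is 5 minutes, refreshed each time the cache is read, and it can be extended to 1 hour by adding an optional ttl field ({"type": "ephemeral", "ttl": "1h"}), at a higher cache-write price.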
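As a minimal sketch of the lifetime setting, the helper below builds a cacheable text block with an optional TTL. `cached_text_block` is a hypothetical helper, and the `ttl` field with the values "5m" and "1h" is an assumption based on Anthropic's prompt caching documentation; confirm it is available for your model and SDK version before relying on it.

```python
from typing import Optional


def cached_text_block(text: str, ttl: Optional[str] = None) -> dict:
    """Build a text content block marked as cacheable (hypothetical helper)."""
    cache_control = {"type": "ephemeral"}
    if ttl is not None:
        # Assumption: the API accepts "5m" (the default) or "1h".
        if ttl not in ("5m", "1h"):
            raise ValueError("ttl is assumed to be '5m' or '1h'")
        cache_control["ttl"] = ttl
    return {"type": "text", "text": text, "cache_control": cache_control}


default_block = cached_text_block("Long reference manual ...")
hour_block = cached_text_block("Long reference manual ...", ttl="1h")
print(default_block["cache_control"])
print(hour_block["cache_control"])
```

The longer lifetime is mainly useful when requests for the same prefix arrive more than five minutes apart, for example a user who pauses between questions.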

Where can cache markers go?

cache_control can be added in three places:

1. System prompt blocks

system=[
    {"type": "text", "text": "Short instructions, not cached"},
    {
        "type": "text",
        "text": "Very long background knowledge document...",
        "cache_control": {"type": "ephemeral"}
    }
]

2. User content in messages

messages=[
    {
        "role": "user",
        "content": [
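            {
                "type": "text",
                "text": "This is a very long contract. Please help me analyze it...\n[full contract text]",
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": "First question: on which page is the penalty clause?"  # dynamic question, not cached
            }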
        ]
    }
]

3. Tool definitions

tools=[
    {
        "name": "search_database",
        "description": "Search the database...",
        "input_schema": {
            "type": "object",
            "properties": {...}
        },
        "cache_control": {"type": "ephemeral"}  # tool definitions are usually long, so they cache well
    }
]
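Because a cache marker covers the entire prefix up to and including the marked block, a list of several tools only needs a marker on the last one. The sketch below shows one way to apply that; `with_cached_tools` and the two tool definitions are hypothetical and exist only for illustration.

```python
import copy


def with_cached_tools(tools: list[dict]) -> list[dict]:
    """Return a copy of tools with a single cache marker on the last definition."""
    if not tools:
        return []
    marked = copy.deepcopy(tools)  # leave the caller's list untouched
    for tool in marked:
        tool.pop("cache_control", None)  # drop stray markers on earlier tools
    marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


# Hypothetical tool definitions used only for this example.
tools = [
    {
        "name": "search_database",
        "description": "Search the database...",
        "input_schema": {"type": "object", "properties": {}},
    },
    {
        "name": "get_order",
        "description": "Fetch one order by ID...",
        "input_schema": {"type": "object", "properties": {}},
    },
]

cached_tools = with_cached_tools(tools)
print([("cache_control" in tool) for tool in cached_tools])
```

Keeping the tool order stable matters as much as the marker itself, since reordering the list changes the prefix.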

Cache pricing: writes cost more, reads cost less

This is a detail many people overlook, and it has to be clear:

Operation	Rate relative to standard input tokens
Standard input (no cache)	1x
Cache write (creating the cache)	1.25x (25% more expensive)
Cache read (reading the cache)	0.1x (90% cheaper)

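Using the $15/MTok input price that this article assumes for Claude Opus 4.5 (confirm against the current pricing page for your model):

Standard input: $15 per million tokens
Cache write: $18.75 per million tokens
Cache read: $1.50 per million tokens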

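A cache write costs 25% more than standard input. That means if your cache never hits (every request is a write, never a read), you pay more than you would without caching.

So the core strategy of prompt caching is: maximize the cache hit rate.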
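The break-even point follows directly from the two multipliers. This is a small sketch under one simplifying assumption: every request that misses the cache pays the full write price for the cacheable prefix. The function names are illustrative.

```python
WRITE_MULTIPLIER = 1.25  # cache write, relative to standard input
READ_MULTIPLIER = 0.10   # cache read, relative to standard input


def relative_cost(hit_rate: float) -> float:
    """Cost of the cacheable prefix with caching, relative to no caching (1.0 = equal)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return (1.0 - hit_rate) * WRITE_MULTIPLIER + hit_rate * READ_MULTIPLIER


def break_even_hit_rate() -> float:
    """Solve 1.25 * (1 - h) + 0.10 * h = 1 for h."""
    return (WRITE_MULTIPLIER - 1.0) / (WRITE_MULTIPLIER - READ_MULTIPLIER)


for rate in (0.0, 0.5, 0.8):
    print(f"hit rate {rate:.0%}: {relative_cost(rate):.3f}x of the uncached cost")
print(f"break-even hit rate: {break_even_hit_rate():.1%}")
```

Under these multipliers caching pays for itself once a little more than one request in five is a cache read, which is why even a single follow-up question on the same prefix is usually enough.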

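What content is worth caching?

Given how the cache behaves, the content worth caching is:

Well suited:

System prompt: identical on almost every request
Long document context (RAG documents, contracts, manuals): unchanged within a session
Few-shot examples: a fixed example set
Tool definitions: almost never change

Moderately suited:

Conversation history: in multi-turn conversations, the earlier turns can be cached

Not suited:

Dynamic user input: different every time, so the hit rate approaches zero
Content containing timestamps or random IDs: these make the prefix differ on every request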
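One way to handle the timestamp problem is to keep volatile values out of the cached prefix and attach them to the final user message instead. The sketch below only builds request dictionaries; `build_request` and the prompt text are hypothetical.

```python
from datetime import datetime, timezone

STABLE_SYSTEM_PROMPT = "You are a support assistant. Answer from the reference documents."


def build_request(question: str, now: datetime) -> dict:
    """Keep the system prompt byte-for-byte stable; put the timestamp in the user turn."""
    system = [
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,  # no timestamp or request ID in here
            "cache_control": {"type": "ephemeral"},
        }
    ]
    user_text = f"Current time: {now.isoformat()}\n\nQuestion: {question}"
    return {"system": system, "messages": [{"role": "user", "content": user_text}]}


first = build_request("What is the rate limit?", datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc))
second = build_request("How do I retry?", datetime(2025, 1, 1, 9, 5, tzinfo=timezone.utc))

# The cacheable prefix is identical across requests even though the time changed.
print(first["system"] == second["system"])
```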

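Designing a prompt layout with a high cache hit rate

This is the most critical part, and the one I spent the most time figuring out.

Core principle: put stable content first and changing content last.

A typical RAG application's prompt should be laid out like this:

[System Prompt - cached] ← stable
[Document 1 - cached] ← relatively stable (same session)
[Document 2 - cached] ← relatively stable
[Conversation history - consider caching] ← grows over time
[Current user question - not cached] ← changes every time

If you put the user question before the documents, the document cache will never hit, because the prefix changes with every question.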

def build_rag_request(
    system_prompt: str,
    documents: list[str],
    conversation_history: list[dict],
    user_question: str
) -> dict:
    """Build a RAG request whose layout maximizes cache hits."""

    # The system prompt is the most stable part, so mark it as cacheable
    system = [
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"}
        }
    ]

    # Build the messages list
    messages = []

    # Put the documents in the first user turn and mark them as cacheable
    if documents:
        doc_content = "\n\n---\n\n".join(
            f"[Document {i+1}]\n{doc}" for i, doc in enumerate(documents)
        )
        messages.append({
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Here are the reference documents:\n\n{doc_content}",
                    "cache_control": {"type": "ephemeral"}
                }
            ]
        })
        # A fixed assistant acknowledgement keeps user/assistant turns alternating
        messages.append({
            "role": "assistant",
            "content": "Understood. I have read the reference documents and am ready to answer your questions."
        })

    # Append the conversation history
    messages.extend(conversation_history)

    # Add the current question last (not cached, changes every time)
    messages.append({
        "role": "user",
        "content": user_question
    })

    return {
        "system": system,
        "messages": messages
    }

Complete Python implementation

A real-world use case: a document Q&A system with cache monitoring:

import anthropic
from dataclasses import dataclass

@dataclass
class CacheStats:
    cache_creation_tokens: int = 0
    cache_read_tokens: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    @property
    def cache_hit_rate(self) -> float:
        total_cacheable = self.cache_creation_tokens + self.cache_read_tokens
        if total_cacheable == 0:
            return 0.0
        return self.cache_read_tokens / total_cacheable

    @property
    def estimated_savings_usd(self) -> float:
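        # Assumes Claude Opus 4.5 input pricing of $15/MTok
        price_per_token = 15 / 1_000_000
        # Each cache read costs 0.1x instead of 1x, saving 90% of the standard rate
        saved = self.cache_read_tokens * price_per_token * 0.9
        # Each cache write costs 1.25x, so subtract the 25% write premium
        premium = self.cache_creation_tokens * price_per_token * 0.25
        return saved - premium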


class DocumentQASystem:
    def __init__(self, documents: list[str], system_prompt: str):
        self.client = anthropic.Anthropic()
        self.documents = documents
        self.system_prompt = system_prompt
        self.conversation_history = []
        self.stats = CacheStats()

    def _build_system(self) -> list[dict]:
        return [
            {
                "type": "text",
                "text": self.system_prompt,
                "cache_control": {"type": "ephemeral"}
            }
        ]

    def _build_document_block(self) -> dict:
        doc_text = "\n\n---\n\n".join(
            f"[Document {i+1}]\n{doc}" for i, doc in enumerate(self.documents)
        )
        return {
            "type": "text",
            "text": f"Reference document library:\n\n{doc_text}",
            "cache_control": {"type": "ephemeral"}
        }

    def ask(self, question: str) -> str:
        # Build the messages list
        messages = []

        # First turn: include the documents (with the cache marker)
        if not self.conversation_history:
            messages.append({
                "role": "user",
                "content": [
                    self._build_document_block(),
                    {"type": "text", "text": question}
                ]
            })
        else:
            # Later turns: conversation_history already starts with the first
            # user message (which holds the documents), so just extend it
            messages = self.conversation_history.copy()
            messages.append({
                "role": "user",
                "content": question
            })

        response = self.client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            system=self._build_system(),
            messages=messages,
        )

        # Update the statistics (the cache fields may be missing or None)
        usage = response.usage
        self.stats.cache_creation_tokens += getattr(usage, 'cache_creation_input_tokens', 0) or 0
        self.stats.cache_read_tokens += getattr(usage, 'cache_read_input_tokens', 0) or 0
        self.stats.input_tokens += usage.input_tokens
        self.stats.output_tokens += usage.output_tokens

        answer = response.content[0].text

        # Update the conversation history
        if not self.conversation_history:
            self.conversation_history.append({
                "role": "user",
                "content": [
                    self._build_document_block(),
                    {"type": "text", "text": question}
                ]
            })
        else:
            self.conversation_history.append({
                "role": "user",
                "content": question
            })
        self.conversation_history.append({
            "role": "assistant",
            "content": answer
        })

        return answer

    def print_stats(self):
        print(f"Cache hit rate: {self.stats.cache_hit_rate:.1%}")
        print(f"Cache write tokens: {self.stats.cache_creation_tokens:,}")
        print(f"Cache read tokens: {self.stats.cache_read_tokens:,}")
        print(f"Estimated savings: ${self.stats.estimated_savings_usd:.4f}")


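# Usage example
if __name__ == "__main__":
    # Simulate a RAG system with a large document set
    documents = [
        "Product specification v2.3...\n[5,000 words of document content]",
        "API reference manual...\n[8,000 words of document content]",
        "Frequently asked questions...\n[3,000 words of document content]",
    ]

    system_prompt = """You are a professional technical support engineer who knows all of the company's products and APIs.
Answer user questions based on the documents provided.
Answers must be accurate and concise, quoting specific passages from the documents when needed."""

    qa_system = DocumentQASystem(documents, system_prompt)

    # First question: cache write
    answer1 = qa_system.ask("What is the API rate limit?")
    print(f"Answer 1: {answer1}\n")

    # Second question: cache read (the cached prefix is billed at 0.1x)
    answer2 = qa_system.ask("How do I handle a 429 Too Many Requests error?")
    print(f"Answer 2: {answer2}\n")

    # Third question: cache read
    answer3 = qa_system.ask("Which programming languages does the SDK support?")
    print(f"Answer 3: {answer3}\n")

    qa_system.print_stats()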

TypeScript implementation

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

interface ConversationTurn {
  role: 'user' | 'assistant';
  content: string | Anthropic.ContentBlockParam[];
}

async function buildRagRequest(
  systemPrompt: string,
  documentContext: string,
  conversationHistory: ConversationTurn[],
  userQuestion: string
): Promise<Anthropic.Messages.MessageCreateParamsNonStreaming> {
  const system: Anthropic.Messages.TextBlockParam[] = [
    {
      type: 'text',
      text: systemPrompt,
      cache_control: { type: 'ephemeral' },
    },
  ];

  const messages: ConversationTurn[] = [];

  if (conversationHistory.length === 0) {
    // First request: documents and question share the first user turn
    messages.push({
      role: 'user',
      content: [
        {
          type: 'text',
          text: documentContext,
          cache_control: { type: 'ephemeral' },
        } as Anthropic.Messages.TextBlockParam,
        {
          type: 'text',
          text: userQuestion,
        },
      ],
    });
  } else {
    // Later requests: replay the history and put the new question last
    messages.push(...conversationHistory);
    messages.push({
      role: 'user',
      content: userQuestion,
    });
  }

  return {
    model: 'claude-opus-4-5',
    max_tokens: 1024,
    system,
    messages: messages as Anthropic.Messages.MessageParam[],
  };
}

async function main() {
  const systemPrompt = 'You are a professional technical support engineer...';
  const documentContext = 'Reference document library:\n\n[large document content goes here]';
  const question1 = 'What is the API rate limit?';
  const question2 = 'How do I handle a 429 error?';

  const history: ConversationTurn[] = [];

  // First request (cache write)
  const request1 = await buildRagRequest(
    systemPrompt,
    documentContext,
    history,
    question1
  );

  const response1 = await client.messages.create(request1);
  const answer1 = (response1.content[0] as Anthropic.Messages.TextBlock).text;

  console.log('Answer 1:', answer1);
  console.log('Cache stats:', {
    cacheWrite: (response1.usage as any).cache_creation_input_tokens ?? 0,
    cacheRead: (response1.usage as any).cache_read_input_tokens ?? 0,
    inputTokens: response1.usage.input_tokens,
  });

  // Update the history; the first user turn must match the first request exactly
  history.push({
    role: 'user',
    content: [
      {
        type: 'text',
        text: documentContext,
        cache_control: { type: 'ephemeral' },
      } as Anthropic.Messages.TextBlockParam,
      { type: 'text', text: question1 },
    ],
  });
  history.push({ role: 'assistant', content: answer1 });

  // Second request (cache read)
  const request2 = await buildRagRequest(
    systemPrompt,
    documentContext,
    history,
    question2
  );

  const response2 = await client.messages.create(request2);
  const answer2 = (response2.content[0] as Anthropic.Messages.TextBlock).text;

  console.log('\nAnswer 2:', answer2);
  console.log('Cache stats:', {
    cacheWrite: (response2.usage as any).cache_creation_input_tokens ?? 0,
    cacheRead: (response2.usage as any).cache_read_input_tokens ?? 0,
    inputTokens: response2.usage.input_tokens,
  });
}

// Surface request failures instead of leaving an unhandled rejection
main().catch(console.error);

Calculating and monitoring the cache hit rate

In the API response, the usage object contains the following fields:

usage = response.usage

# Standard input tokens (the part neither read from nor written to the cache)
input_tokens = usage.input_tokens

# Cache write tokens (tokens used to create the cache on this request)
cache_creation_input_tokens = usage.cache_creation_input_tokens  # may be None or 0

# Cache read tokens (tokens served from the cache)
cache_read_input_tokens = usage.cache_read_input_tokens  # may be None or 0

Computing the cache hit rate:

def calculate_cache_efficiency(usage) -> dict:
    cache_write = getattr(usage, 'cache_creation_input_tokens', 0) or 0
    cache_read = getattr(usage, 'cache_read_input_tokens', 0) or 0
    input_tokens = usage.input_tokens

    total_processed = input_tokens + cache_write + cache_read
    hit_rate = cache_read / (cache_write + cache_read) if (cache_write + cache_read) > 0 else 0

    return {
        "hit_rate": hit_rate,
        "cache_write_tokens": cache_write,
        "cache_read_tokens": cache_read,
        "standard_input_tokens": input_tokens,
        "total_tokens_processed": total_processed,
    }

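The ideal cache hit rate depends on the application, but my baselines are:

Document Q&A systems: >70%
Multi-turn conversations: >50% (every turn from the second onward should hit)
Batch processing: >90% (same system prompt, many different questions)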
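To watch these baselines over time rather than per request, a rolling window works well. The sketch below is one possible shape, not part of any SDK: `HitRateMonitor` is a hypothetical class, and `SimpleNamespace` objects stand in for real `response.usage` values.

```python
from collections import deque
from types import SimpleNamespace


class HitRateMonitor:
    """Track the token-weighted cache hit rate over the most recent requests."""

    def __init__(self, window: int = 100, threshold: float = 0.7):
        self.samples = deque(maxlen=window)  # old samples fall off automatically
        self.threshold = threshold

    def record(self, usage) -> None:
        # The cache fields may be missing or None, so fall back to 0.
        write = getattr(usage, "cache_creation_input_tokens", 0) or 0
        read = getattr(usage, "cache_read_input_tokens", 0) or 0
        self.samples.append((write, read))

    @property
    def hit_rate(self) -> float:
        total_write = sum(write for write, _ in self.samples)
        total_read = sum(read for _, read in self.samples)
        total = total_write + total_read
        return total_read / total if total else 0.0

    def below_threshold(self) -> bool:
        return bool(self.samples) and self.hit_rate < self.threshold


monitor = HitRateMonitor(window=3, threshold=0.7)
monitor.record(SimpleNamespace(cache_creation_input_tokens=15_500, cache_read_input_tokens=0))
monitor.record(SimpleNamespace(cache_creation_input_tokens=None, cache_read_input_tokens=15_500))
monitor.record(SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=15_500))
print(f"{monitor.hit_rate:.1%}", monitor.below_threshold())
```

In production, `record` would be called with each `response.usage`, and `below_threshold()` could feed an alert when the rate drops after a prompt change.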

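Real-world case: cost savings for a RAG system

I ran the numbers for an internal knowledge-base Q&A system:

Without caching (per month):

Tokens per query: system prompt 500 + document context 15,000 + question 200 = 15,700 tokens
1,000 queries per day = 30,000 queries per month
Total input tokens: 30,000 × 15,700 = 471M tokens
Cost (Claude Opus 4.5 at $15/MTok): $7,065/month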

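With caching (per month):

System prompt + document context = 15,500 tokens, billed at 1.25x only on the first request of each session
Later requests: question 200 tokens (standard rate) + 15,500 tokens (0.1x rate)
Assume each session averages 5 queries, all completed while the 5-minute cache is alive
Effective cache hit rate: about 80% (4 of every 5 requests)
Cost: 6,000 writes × 15,500 × 1.25 + 24,000 reads × 15,500 × 0.1 + 30,000 × 200 ≈ 159M billed token-equivalents, about $2,390/month (ignoring the growing conversation history)

Savings: about 66%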

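This is a conservative estimate. With longer sessions or longer documents, the savings ratio climbs higher.

Prompt caching is the highest-ROI Claude API optimization I have seen. Setup takes roughly 2 to 4 hours, and the bill drops right away.

In the next chapter we switch topics to another way of cutting cost and raising throughput: the Batch API. If you need to process thousands of documents at once, the Batch API lets you do it at 50% of the price.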
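The estimate can be reproduced with a few lines of arithmetic. The sketch below uses only the assumptions listed above ($15/MTok, 30,000 queries per month, a 15,500-token cacheable prefix, 200-token questions, 5 queries per session) and simplifies by ignoring conversation history and output tokens; the constant and function names are illustrative.

```python
PRICE_PER_MTOK = 15.0        # assumed input price in USD
QUERIES_PER_MONTH = 30_000
PREFIX_TOKENS = 15_500       # system prompt + document context
QUESTION_TOKENS = 200
QUERIES_PER_SESSION = 5      # one cache write, then four cache reads


def monthly_cost_without_cache() -> float:
    tokens = QUERIES_PER_MONTH * (PREFIX_TOKENS + QUESTION_TOKENS)
    return tokens / 1_000_000 * PRICE_PER_MTOK


def monthly_cost_with_cache() -> float:
    writes = QUERIES_PER_MONTH // QUERIES_PER_SESSION  # first request of each session
    reads = QUERIES_PER_MONTH - writes                 # every later request hits the cache
    billed_tokens = (
        writes * PREFIX_TOKENS * 1.25          # cache writes at 1.25x
        + reads * PREFIX_TOKENS * 0.10         # cache reads at 0.1x
        + QUERIES_PER_MONTH * QUESTION_TOKENS  # questions at the standard rate
    )
    return billed_tokens / 1_000_000 * PRICE_PER_MTOK


baseline = monthly_cost_without_cache()
cached = monthly_cost_with_cache()
print(f"without cache: ${baseline:,.2f}")
print(f"with cache:    ${cached:,.2f}")
print(f"savings:       {1 - cached / baseline:.1%}")
```

With exactly these inputs the sketch yields about $2,392 per month, a saving of roughly 66%; a higher hit rate or longer sessions push the saving up.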
