About prompt caching: why isn't the content I send being cached?

justin.wu · October 9, 2024, 4:16am

I sent the same request twice. According to the documentation, the second request should have hit the cache, but it didn't: cached_tokens is 0 in both responses. What's wrong?

Request:

curl 'https://api.openai.com/v1/chat/completions' \
  -H 'accept: text/event-stream' \
  -H 'accept-language: zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,zh-TW;q=0.6,zh-HK;q=0.5' \
  -H 'authorization: Bearer xxx' \
  -H 'content-type: application/json' \
  -H 'openai-organization: xxx' \
  -H 'openai-project: xxx' \
  -H 'origin: https://platform.openai.com' \
  -H 'priority: u=1, i' \
  -H 'referer: https://platform.openai.com/' \
  -H 'sec-ch-ua: "Chromium";v="128", "Not;A=Brand";v="24", "Google Chrome";v="128"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  -H 'sec-fetch-dest: empty' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-site: same-site' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36' \
  --data-raw '{"messages":[{"role":"system","content":[{"type":"text","text":""}]},{"role":"user","content":[{"type":"text","text":"## OpenAI\n\nhttps://platform.openai.com/docs/guides/prompt-caching\nhttps://openai.com/index/api-prompt-caching/\n\n- 对于长 Prompt 可以节省 80% 的延迟和 50%的费用\n- 支持的模型 \n\t- gpt-4o (excludes gpt-4o-2024-05-13 and chatgpt-4o-latest)\n\t- gpt-4o-mini\n\t- o1-preview\n\t- o1-mini\n- 对于 Prompt 长度 >= 1024 token 的自动启动（不需要任何处理）\n- 缓存时间是 5-10 分钟，高峰时可能缓存 1 小时\n- 费用: Cache 的 Input 部分，费用是原来 50%，Output 部分，费用不变\n\nAzure-Openai:\n- 目前没有支持：https://learn.microsoft.com/en-us/answers/questions/2085985/prompt-caching-in-azure-openai\n\n## Claude\n\nhttps://www.anthropic.com/news/prompt-caching\n\n- 对于长文本，减少 85% 的延迟，减少 90% 的费用\n- 费用：使用 Cache 之后，Input 价钱提升约 25%，Cache 之后的 Input 只需要 10% 的费用，Output 部分费用不变\n- 支持的模型\n\t- Claude 3.5 Sonnet\n\t- Claude 3 Haiku\n\t- Claude 3 Opus\n如何使用：\n- 在 Prompt 中添加 ```\"cache_control\": {\"type\": \"ephemeral\"}```\n- 长度大于 1024 tokens(3.5 Sonnet, 3 Opus) 2048(3 Haiku) 才会生效\n- 最多支持 4 个 cache_control 标记位\n\nDify:\n- 目前没有支持 https://github.com/langgenius/dify/issues/7382\nAWS-Bedrock\n- 没有支持 https://github.com/boto/boto3/issues/4262\n### Best Practices\n\n- 把固定的、可以复用的内容，长上下文，经常要用的 tool 定义，放在 Prompt 的前面。\n\n\n\n\n关于 Prompt Caching\n优点：可以减少延迟，节省费用。\n缺点：我们现在还使用不了。\n\n现在只有 OpenAI 和 Claude 两家自己的模型支持 Prompt caching。azure 和 aws-bedrock 都不支持 caching。\n\nClaude 使用 Prompt caching，需要在 Prompt 数组中加入 cache_control 的标记位，而且最多支持 4 个标记位。\n\n\n## OpenAI\n\nhttps://platform.openai.com/docs/guides/prompt-caching\nhttps://openai.com/index/api-prompt-caching/\n\n- 对于长 Prompt 可以节省 80% 的延迟和 50%的费用\n- 支持的模型 \n\t- gpt-4o (excludes gpt-4o-2024-05-13 and chatgpt-4o-latest)\n\t- gpt-4o-mini\n\t- o1-preview\n\t- o1-mini\n- 对于 Prompt 长度 >= 1024 token 的自动启动（不需要任何处理）\n- 缓存时间是 5-10 分钟，高峰时可能缓存 1 小时\n- 费用: Cache 的 Input 部分，费用是原来 50%，Output 部分，费用不变\n\nAzure-Openai:\n- 目前没有支持：https://learn.microsoft.com/en-us/answers/questions/2085985/prompt-caching-in-azure-openai\n\n## 
Claude\n\nhttps://www.anthropic.com/news/prompt-caching\n\n- 对于长文本，减少 85% 的延迟，减少 90% 的费用\n- 费用：使用 Cache 之后，Input 价钱提升约 25%，Cache 之后的 Input 只需要 10% 的费用，Output 部分费用不变\n- 支持的模型\n\t- Claude 3.5 Sonnet\n\t- Claude 3 Haiku\n\t- Claude 3 Opus\n如何使用：\n- 在 Prompt 中添加 ```\"cache_control\": {\"type\": \"ephemeral\"}```\n- 长度大于 1024 tokens(3.5 Sonnet, 3 Opus) 2048(3 Haiku) 才会生效\n- 最多支持 4 个 cache_control 标记位\n\nDify:\n- 目前没有支持 https://github.com/langgenius/dify/issues/7382\nAWS-Bedrock\n- 没有支持 https://github.com/boto/boto3/issues/4262\n### Best Practices\n\n- 把固定的、可以复用的内容，长上下文，经常要用的 tool 定义，放在 Prompt 的前面。\n\n\n\n\n关于 Prompt Caching\n优点：可以减少延迟，节省费用。\n缺点：我们现在还使用不了。\n\n现在只有 OpenAI 和 Claude 两家自己的模型支持 Prompt caching。azure 和 aws-bedrock 都不支持 caching。\n\nClaude 使用 Prompt caching，需要在 Prompt 数组中加入 cache_control 的标记位，而且最多支持 4 个标记位。\n\n\n总结上面内容"}]}],"temperature":1,"max_tokens":2048,"top_p":1,"frequency_penalty":0,"presence_penalty":0,"seed":null,"model":"gpt-4o-mini","response_format":{"type":"text"},"stream":false}'

Response 1:

{
  "id": "chatcmpl-AGILLKKyB37RNaYl2fnhGgyVgNkVt",
  "object": "chat.completion",
  "created": 1728447123,
  "model": "gpt-4o-mini-2024-07-18",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "### 关于 Prompt Caching 的总结\n\n#### 优点\n- **减少延迟**：可以显著缩短响应时间，OpenAI 节省约 80% 的延迟，Claude 节省约 85% 的延迟。\n- **节省费用**：可以降低访问成本，OpenAI 节省约 50% 的费用，Claude 通过缓存可以将输入费用降低至 10%，尽管使用缓存可能导致输入费用总体提升约 25%。\n\n#### 缺点\n- **目前可用性限制**：不所有平台均支持 Prompt Caching。目前只有 OpenAI 和 Claude 的模型支持这一特性，其它平台如 Azure 和 AWS-Bedrock 目前还不支持。\n\n#### 支持的模型\n- **OpenAI**：\n  - gpt-4o（不包括 gpt-4o-2024-05-13 和 chatgpt-4o-latest）\n  - gpt-4o-mini\n  - o1-preview\n  - o1-mini\n  - 自动在长度 >= 1024 tokens 时启用缓存。\n  - 缓存时间为 5-10 分钟，高峰时可延长至 1 小时。\n\n- **Claude**：\n  - Claude 3.5 Sonnet\n  - Claude 3 Haiku\n  - Claude 3 Opus\n  - 需在 Prompt 中添加标记 `{\"cache_control\": {\"type\": \"ephemeral\"}}`，并且有效长度分别为超过 1024 tokens（3.5 Sonnet、3 Opus）或超过 2048 tokens（3 Haiku），支持最多 4 个 cache_control 标记位。\n\n#### 最佳实践\n- 在 Prompt 的前面放置固定、可复用的内容、长上下文和经常使用的工具定义，以提高缓存的效果。\n\n综上所述，Prompt Caching 是一种提高模型响应效率并降低使用成本的有效手段，但当前的实现仍然有限，仅限于一些特定的模型和平台。",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1111,
    "completion_tokens": 434,
    "total_tokens": 1545,
    "prompt_tokens_details": {
      "cached_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0
    }
  },
  "system_fingerprint": "fp_f85bea6784"
}

Response 2:

{
  "id": "chatcmpl-AGILgXNpgVkk06Cum7k8zZdjTBQI1",
  "object": "chat.completion",
  "created": 1728447144,
  "model": "gpt-4o-mini-2024-07-18",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "### Prompt Caching 概述\n\n**优点：**\n- 降低延迟。\n- 节省费用。\n\n**缺点：**\n- 当前仅限于 OpenAI 和 Claude 的模型支持，Azure 和 AWS Bedrock 不支持。\n\n### OpenAI 的 Prompt Caching\n\n- **优势：**\n  - 对于长 Prompt，能节省 80% 的延迟和 50% 的费用。\n  \n- **支持的模型：**\n  - gpt-4o（不包括 gpt-4o-2024-05-13 和 chatgpt-4o-latest）\n  - gpt-4o-mini\n  - o1-preview\n  - o1-mini\n  \n- **使用规则：**\n  - 自动支持对于长度 ≥ 1024 tokens 的 Prompt（不需额外处理）。\n  - 缓存时间为 5-10 分钟，高峰时长可达 1 小时。\n\n- **费用结构：**\n  - Cached Input 部分费用降至原来的 50%，Output 部分费用不变。\n\n- **Azure OpenAI：**\n  - 当前不支持 Prompt Caching（参见 [Microsoft 文档](https://learn.microsoft.com/en-us/answers/questions/2085985/prompt-caching-in-azure-openai)）。\n\n### Claude 的 Prompt Caching\n\n- **优势：**\n  - 对于长文本，减少 85% 的延迟和 90% 的费用。\n\n- **支持的模型：**\n  - Claude 3.5 Sonnet\n  - Claude 3 Haiku\n  - Claude 3 Opus\n\n- **使用规则：**\n  - 在 Prompt 中添加 `{\"cache_control\": {\"type\": \"ephemeral\"}}`。\n  - 支持的 Prompt 长度为：> 1024 tokens（3.5 Sonnet, 3 Opus）和 2048 tokens（3 Haiku）。\n  - 最多支持 4 个 cache_control 标记位。\n\n- **费用结构：**\n  - 使用缓存后，Input 的价格提升约 25%，但缓存之后的 Input 仅需支付 10% 的费用，Output 部分费用不变。\n\n### 其他平台支持情况\n\n- **Dify 和 AWS Bedrock：**\n  - 目前均不支持 Prompt Caching。\n\n### 最佳实践\n\n- 将固定的、可复用的内容和常用工具定义放在 Prompt 的前面，以利于缓存的最大化利用。\n\n### 总结\n\n目前，只有 OpenAI 和 Claude 的模型支持 Prompt Caching，带来明显的延迟和费用节省，但 Azure 和 AWS Bedrock 仍未实现此功能，用户应谨慎选择使用场景。",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1111,
    "completion_tokens": 572,
    "total_tokens": 1683,
    "prompt_tokens_details": {
      "cached_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0
    }
  },
  "system_fingerprint": "fp_f85bea6784"
}
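For anyone comparing runs like the two above, here is a minimal Python sketch that pulls the relevant fields out of a response body. It only reads the usage fields shown in the responses above. The function name cached_token_summary and the returned key names are made up for illustration, and the 1024-token minimum is taken from the notes quoted in the request, not verified independently.

```python
import json

# Minimum prompt length for caching, as stated in the notes quoted in the
# request above (assumption: the figure is still accurate).
MIN_CACHEABLE_PROMPT_TOKENS = 1024


def cached_token_summary(response: dict) -> dict:
    """Summarize cache-related usage fields of a chat completion response.

    Hypothetical helper: it only reads the JSON fields shown in this post.
    """
    usage = response.get("usage") or {}
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens", 0)
    return {
        "prompt_tokens": prompt_tokens,
        "cached_tokens": cached_tokens,
        "cache_hit": cached_tokens > 0,
        "meets_min_length": prompt_tokens >= MIN_CACHEABLE_PROMPT_TOKENS,
    }


# Usage block copied from Response 2 (other fields trimmed for brevity).
response_2 = json.loads("""
{
  "usage": {
    "prompt_tokens": 1111,
    "completion_tokens": 572,
    "total_tokens": 1683,
    "prompt_tokens_details": {"cached_tokens": 0}
  }
}
""")

print(cached_token_summary(response_2))
```

For Response 2 this reports a prompt long enough to qualify (1111 tokens) but no cache hit, which is exactly the mismatch I'm asking about.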
