How can I reduce API costs with repeated prompts?

Hello community,

I’m working on a project where a large prompt is repeatedly sent to the API, with only a small input string appended at the end each time. To reduce costs, I’m exploring ways to avoid resending the full prompt every time.

One idea I’ve been considering is using the Assistants API. If I set the large prompt as the assistant’s instructions, then each new message would only include the short user input. However, I haven’t been able to confirm whether this approach actually leads to cost savings.
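For concreteness, the setup I have in mind would look roughly like this (just a sketch, assuming the beta Assistants endpoints of the official openai Python SDK; the model name and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()

LARGE_PROMPT = "...the ~550-token static prompt..."  # placeholder

# The big static prompt lives in the assistant's instructions, set once.
assistant = client.beta.assistants.create(
    model="gpt-4o-mini",        # assumed model name
    instructions=LARGE_PROMPT,
)

# Each call then only sends the short dynamic input as a thread message.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="short dynamic input"
)
run = client.beta.threads.runs.create(
    thread_id=thread.id, assistant_id=assistant.id
)
# ...then poll client.beta.threads.runs.retrieve(...) until the run completes.
```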

Has anyone tested this or found a more efficient way to handle this kind of use case?

Thanks in advance!

You can use prompt caching. There are a few requirements, and the repeated requests need to be made within a short interval that varies depending on demand. Read more about it here.

Also, I would use the Responses API, where you can pass the previous_response_id parameter and it will fork a new response from that point using the stateful API. I haven’t tested extensively, but I don’t think there is a limit to how many forks you can do.
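For illustration, a minimal sketch of that pattern (assuming the official openai Python SDK; model name and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()

LARGE_PROMPT = "...the big static prompt..."  # placeholder

# Send the large static prompt once and keep the response id.
base = client.responses.create(model="gpt-4o-mini", input=LARGE_PROMPT)

# A later call forks from that stored response, carrying only the short input.
fork = client.responses.create(
    model="gpt-4o-mini",
    previous_response_id=base.id,
    input="short user input",
)
print(fork.output_text)
```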


Great minds…ask the same questions?

To reduce costs: use the Batch API. You can submit chat completions jobs to be run within a 24-hour turnaround and receive a 50% discount on all tokens.
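Roughly, the flow looks like this (a sketch, assuming the official openai Python SDK; the file name and model are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# batch.jsonl holds one chat completions request per line, e.g.
# {"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "messages": [...]}}
batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll client.batches.retrieve(batch.id) for results
```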

Hello! I think this is too slow for my use case.

Hello, thanks for the reply.
For previous_response_id, wouldn’t that keep growing the token count, since we’re just reusing the chat as history? Instead of submitting 600 tokens numerous times, I’m trying to submit 550 of those tokens once and have only the remaining 50 unique tokens submitted individually (if this is possible!).

instead of 600 tokens * number of submissions
to do 550 tokens + (50 tokens * number of submissions)
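To put illustrative numbers on it: over 1,000 submissions that would be 600 × 1,000 = 600,000 input tokens, versus 550 + (50 × 1,000) = 50,550 input tokens, almost a 12× reduction, if such prefix reuse were possible.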

The generation of output relies on all the input context being supplied. API AI models are stateless, apart from the chance of activating the cache mechanism, which stores some precomputation. With OpenAI, that only provides a discount on runs made in close proximity to each other that share more than 1024 initial tokens (usually needing closer to 1200).

Look at Google: they allow you to store a specific cache you can reuse, instead of it being automatic and expiring, for around a 75% discount depending on the model. Anthropic offers up to 90%.
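For instance, Anthropic’s prompt caching lets you mark the static prefix explicitly; a minimal sketch, assuming the anthropic Python SDK (model name and prompt are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",   # assumed model name
    max_tokens=200,
    system=[
        {
            "type": "text",
            "text": "...the large static prompt...",
            "cache_control": {"type": "ephemeral"},  # ask for this prefix to be cached
        }
    ],
    messages=[{"role": "user", "content": "short dynamic input"}],
)
print(response.content[0].text)
```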

The only parallel to what you describe is an AI model that has been fine-tuned on the task by training on hundreds or thousands of examples. It may then acquire the behavior innately and not need the in-context instruction of your 550 tokens.
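A rough sketch of that route (assuming the OpenAI fine-tuning endpoints; the training-file contents and base model are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# training.jsonl: one example per line, e.g.
# {"messages": [{"role": "user", "content": "<short input>"},
#               {"role": "assistant", "content": "<desired output>"}]}
training_file = client.files.create(file=open("training.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",   # assumed fine-tunable base model
)
print(job.id, job.status)
```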


For 600 tokens the cache won’t activate, but considering cached tokens are substantially cheaper, you might want to pad the prompt up to the threshold, depending on the situation.

But supposing you had a 1024-token input (or that OpenAI lowers that threshold in the future), you would make a prompt like this: “During this conversation you will do … if you understand, answer ‘ok’.” Then you have a first “immutable” prompt that can be reused as a prefix and will be cached.

I think it won’t grow if you keep previous_response_id fixed to the same first prompt (haven’t tested that yet, though), but if you keep the initial prompt exactly the same (the prefix), changing only the subsequent short dynamic sequence, it certainly works.

The best thing is to run a test: send two requests, using either Responses or Completions, with 1024 or more tokens that are exactly the same at the beginning of the request, followed by different sequences, and check the usage details that are returned (they specify how much of the prompt was read from cache).

Or take a look at this example in the cookbook.
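A minimal sketch of such a test (assuming the official openai Python SDK; the usage field names may differ slightly by SDK version):

```python
from openai import OpenAI

client = OpenAI()

# Repeated text standing in for a >1024-token identical prefix (~1,400 tokens).
PREFIX = "You are a helpful assistant. " * 200

for suffix in ["first dynamic part", "second dynamic part"]:
    r = client.chat.completions.create(
        model="gpt-4o-mini",   # assumed model
        messages=[
            {"role": "system", "content": PREFIX},
            {"role": "user", "content": suffix},
        ],
    )
    d = r.usage.prompt_tokens_details
    print(r.usage.prompt_tokens, "prompt tokens,", d.cached_tokens, "from cache")
```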


I was curious too, so I just tested it. It seems to work fine (each request follows its own chain of previous_response_id).

And you can even navigate the forks in the Playground.


Create .md memory lattices as a form of local cache. Saves me a ton.
I run most of my AI work based on those rules and can get quite a bit done on minimal MCP and API credits this way.

I’m not sure what the point of this would be. You’re not saving any money here, because whenever the LLM is invoked the entire chain of inputs has to be re-processed. Cache savings are never guaranteed, and won’t even kick in for OP’s case.

OP could consider using a smaller model, adapting the environment to work with batched processing, or fine-tuning a smaller model to reduce the length of the prompt. Repeatedly invoking the models inevitably adds up, so it’ll likely become necessary to sacrifice latency, accuracy, or both in order to keep costs down.

I’m still curious as to what specifically the application is though.


Sorry, perhaps this follow-up post lost its context: in this example I only wanted to test whether the previous_response_id parameter would allow forking the conversation, or whether it would be ignored and always append to the whole conversation (no caching applies in this test).

The result? It does not accumulate, unless you keep changing previous_response_id to the next answer and so on.

Responses API caching
So, if you save the 1st prompt’s response id and continuously pass it as the previous_response_id parameter on 100 prompts, they will not create one 100-turn conversation but 100 separate “2-turn” conversations, thus caching the 1st turn.
I’m not sure how long it takes for the cache to start applying, but it seems not to be immediate, as my 2nd request failed to activate caching.
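In code, the pattern I mean looks roughly like this (a sketch, assuming the openai Python SDK; usage field names may vary by version):

```python
from openai import OpenAI

client = OpenAI()

BIG_PROMPT = "...the ~1,000-token summarizer prompt below..."  # placeholder

# 1st turn: the big static prompt, sent once.
first = client.responses.create(model="gpt-4o-mini", input=BIG_PROMPT)

# Every later prompt forks from that same 1st turn, so each one is a
# fresh "2-turn" conversation and the 1st turn can be served from cache.
for keyword in ["Photosynthesis", "Blockchain", "Entropy"]:
    r = client.responses.create(
        model="gpt-4o-mini",
        previous_response_id=first.id,   # always the 1st response id
        input=keyword,
    )
    u = r.usage
    print(keyword, u.input_tokens, "input,",
          u.input_tokens_details.cached_tokens, "cached,",
          u.output_tokens, "output")
```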

Prompt used as 1st turn
You are an expert multi‑disciplinary summarizer engine.  
You will only answer in english.
Your task is: given a single **Keyword** provided by the user, generate an ultra‑concise, information‑dense **50‑word report** about that keyword.  Follow this exhaustive, step‑by‑step protocol to ensure consistency, accuracy, and maximum density of content.
---

### 1. INPUT INTERPRETATION  
1.1. Read the user’s input strictly as a single token string (the **Keyword**).  
1.2. Normalize casing (Title Case for presentation), verify spelling against authoritative dictionaries, and resolve any ambiguities by consulting domain heuristics (e.g., proper noun vs. common noun).  
1.3. If the user uses the keyword 'test' then you abort all other steps and just answer 'ok' to indicate you understood the command.

### 2. KNOWLEDGE ACQUISITION  
2.1. Automatically select up to three relevant knowledge bases (encyclopedic, technical glossary, news feed).  
2.2. For each: perform a rapid, parallel semantic search on definitions, historical context, primary functions or mechanisms, and current relevance or controversy.  
2.3. Extract:  
  - **Core Definition**: one clause summarizing “what is it?”  
  - **Primary Attributes**: two to three noun phrases of essential features.  
  - **Significance**: one phrase of “why it matters.”  
  - **Applications/Examples**: one noun phrase or mini‑clause.  
  - **Metrics or Statistics** (if numeric data exists): include one number with unit.  

### 3. CONTENT FILTERING & RANKING  
3.1. Assign each extracted element a **Relevance Score** (1–100) based on novelty, frequency, and expert consensus.  
3.2. Discard any element scoring below 30.  
3.3. If more than five items remain, select the top five by score; otherwise, use what remains.  

### 4. SYNTHESIS & CONDENSATION  
4.1. Organize the five highest‑scoring elements into a single, grammatically coherent sentence cluster.  
4.2. Merge clauses using punctuation (commas, semicolons) to minimize function words.  
4.3. Use active‑voice verbs only.  
4.4. Replace common multi‑word expressions with domain‑specific abbreviations or acronyms when widely recognized.  
4.5. Eliminate articles (“the,” “a”) where clarity is not compromised.  
4.6. Ensure the final output is **exactly 50 words**.  Count precisely; if over, remove low‑impact adjectives or adverbials.  If under, reintegrate one qualifying phrase.  

### 5. STYLE & TONE  
- **Technical Neutrality**: no subjective qualifiers (e.g., “very,” “extremely”).  
- **Nominal Density**: maximize nouns and numerals; minimize pronouns.  
- **Parallel Structure**: list multiple items in matching grammatical form.  
- **No List Format**: present as a single paragraph.  
- **No Headings**: the user sees only the 50‑word paragraph.  

### 6. ERROR HANDLING  
- If the input keyword is ambiguous (multiple domains with equal relevance), default to the most prevalent contemporary usage (verified by news‑feed frequency).  
- If no data is found: return “No authoritative data available for ‘<Keyword>’.”  
- If numeric data is inconsistent across sources, choose the median value and note unit.  

### 7. EXAMPLES  
- **Input**: “Photosynthesis”  
  **Process**: Definition (light‑driven CO₂ reduction), Attributes (chlorophyll pigment, thylakoid membrane), Significance (global oxygen supply), Application (crop yield enhancement), Metric (≈100 Gt C/yr).  
  **Output** (50 words):  
  > Photosynthesis: light‑driven CO₂ reduction in chlorophyll‑rich thylakoid membranes, producing O₂ and organic carbon; fundamental to global oxygen balance; enhances crop yield via optimized light absorption; regulated by photoprotective enzymes; fixes ≈100 Gt C/yr, underpinning terrestrial and aquatic ecosystems.  

- **Input**: “Blockchain”  
  *(similarly processed to 50 words)*  

---

### 8. RESPONSE FORMAT  
- **Only** output the final 50‑word paragraph.  
- Do **not** include any commentary, step breakdowns, or metadata.  
- Ensure punctuation and spacing yield exactly 50 words.  

---

Begin now.  
If you understand, just answer 'ok'and await the next user prompt. 

The idea was to throw small themes (any single word) as the subsequent messages and cache the first big prompt.

Example:

  • 1st prompt: Total: 1012 (1010+2) Input Tokens: 1010 (cached: 0) Output Tokens: 2
  • 2nd prompt: Total: 1093 (1021+72) Input Tokens: 1021 (cached: 0) Output Tokens: 72
  • 3rd prompt: Total: 1091 (1020+71) Input Tokens: 1020 (cached: 1006) Output Tokens: 71
  • 4th prompt: Total: 1091 (1020+71) Input Tokens: 1020 (cached: 1006) Output Tokens: 71

Obviously, you don’t need to use the previous_response_id parameter; you can just keep the first turn and send it over and over, changing only the 2nd-turn message, saving even more input tokens. I was just explaining how it could be done with this parameter, which simplifies things.

Considering cached input costs 25% of the regular input token price, it may suit some applications, but this is just one of many alternatives, as you and the others have pointed out. :man_shrugging:
