Hope you all are having a great day.
I’m working on developing an app using the GPT-4 Turbo model. However, I’ve noticed that the associated costs are quite high. I previously tried the GPT-3.5 Turbo model, but I didn’t achieve the same quality of results.
I was wondering if any of you have experimented with any improvements or specific approaches to reduce token consumption and, consequently, associated costs. I would greatly appreciate any advice, suggestions, or experiences you can share on this matter.
Thank you in advance for your time and assistance!
Yep, I noticed the exact same thing in my development and needed to take steps to reduce the tokens being sent to and received from the Completions API (and hence reduce dev costs). My main approach has been to limit the input content to what is strictly needed. In the case of my application, this has meant carefully classifying how each input in a sequence relates to the current request (I had ChatGPT do the classifying). This allowed me to cut input tokens significantly without hurting the quality of the responses.
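As a minimal sketch of that idea (the `label` values and `filter_relevant` helper here are hypothetical stand-ins for whatever classification the model produces, not the actual classifier described above), filtering a sequence of inputs down to the relevant ones before building the prompt might look like:

```python
# Sketch: keep only the inputs classified as relevant before building
# the prompt. The labels are assumed to come from a cheap upstream
# classification step (e.g. a small model call), which is not shown.

def filter_relevant(inputs: list[dict]) -> list[str]:
    """Drop inputs whose classification marks them as unrelated."""
    return [item["text"] for item in inputs if item["label"] == "relevant"]

def build_prompt(question: str, inputs: list[dict]) -> str:
    context = "\n".join(filter_relevant(inputs))
    return f"Context:\n{context}\n\nQuestion: {question}"

history = [
    {"text": "User asked about order #123 shipping.", "label": "relevant"},
    {"text": "Small talk about the weather.", "label": "unrelated"},
    {"text": "Order #123 was shipped on Monday.", "label": "relevant"},
]
prompt = build_prompt("Where is order #123?", history)
```

Every input that gets dropped here is input tokens you never pay for, so the savings scale with how aggressively you can classify.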
From GPT-4 but useful, I think…
To optimize token usage and reduce costs when using GPT-4 or GPT-3.5 Turbo for your app, you can consider the following strategies:
- Streamline Input and Output: Ensure that the input text is concise and directly relevant to what you want the model to do. Trim unnecessary details or redundancies. Similarly, format the output to be as brief as possible while still conveying the necessary information.
- Use Stop Sequences and Token Limits: Implement stop sequences to signal the model when to stop generating text. Set a maximum token limit for the output to prevent overly lengthy responses.
- Batch Requests: If possible, batch multiple requests into a single prompt. This can be more efficient than processing each request separately.
- Cache Responses: For frequently asked questions or common prompts, cache the responses. This way, you can reuse them without having to process the same request multiple times.
- Optimize Prompt Engineering: Fine-tune your prompts to be more effective. This might involve experimenting with different phrasings to achieve the desired response in fewer tokens.
- Monitor and Analyze Usage: Keep track of how tokens are being used. Identify patterns or types of requests that consume more tokens and see if there’s a way to handle them more efficiently.
- Feedback Loop for Improvement: Implement a feedback system where users can indicate if the response was helpful or not. Use this data to refine your prompts and reduce unnecessary token usage.
- Utilize Pre-Processing: Pre-process the data before sending it to the model. This might involve summarizing or rephrasing lengthy inputs.
Remember, the key is to balance efficiency with effectiveness to ensure that the user experience remains high while minimizing costs.
- Are you making model calls via the chat completions API, or are you using “assistants”, which gives you no control?
- How are you managing past conversation turns? Do you count tokens or count the number of turns? What determines the cutoff point where old chat is no longer sent?
- Are you making use of functions that have extensive descriptions?
- Are you sending excessive system instructions along with each request?
Without knowing what you are doing, it is hard to recommend improvements. It is equivalent to asking “How can I get to San Jose faster?”
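On the conversation-turn question above, one common cutoff is a token budget: keep the system message, then keep the newest turns until the budget is exhausted and drop the rest. A rough sketch, using a crude ~4 characters per token heuristic in place of a real tokenizer such as tiktoken:

```python
def approx_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    # A real implementation would use a tokenizer like tiktoken instead.
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the system message plus the most recent turns that fit
    within the token budget; older turns are dropped first."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    kept: list[dict] = []
    used = sum(approx_tokens(m["content"]) for m in system)
    for msg in reversed(turns):  # walk newest-first
        cost = approx_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))

history = [
    {"role": "system", "content": "You are a support bot."},
    {"role": "user", "content": "old question " * 50},
    {"role": "assistant", "content": "old answer " * 50},
    {"role": "user", "content": "Where is my order?"},
]
trimmed = trim_history(history, budget=60)
# The long old turns are dropped; the system message and latest user turn remain.
```

Counting tokens (even approximately) usually beats counting turns, because turns vary wildly in length.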