I racked up $200 over an hour of running chatgpt on some data it was cleaning. I had an assistant with probably 5 paragraphs of text in the intro box, and 1 paragraph of content on every request limited to 8000 characters.
What goes into the calculation for the tokens? I see $0.03 per 1K tokens for gpt-4, which is what I was using, but does my 5-paragraph intro text go into the calculation on every request? Or does the response data size count against me? What exactly goes into the calculation? I can't figure it out.
At first I was using gpt-3.5 and was getting charged $1.50, $2; then I switched to gpt-4, ran it for an hour, and I'm at $200.
Everything in the conversation goes to the model every time you make a call; that is how the conversational aspect of the model works. The model is internally stateless and so needs to be fed all of the context required for that query, which means all of the prior context: A+B, then A+B+C, then A+B+C+D, then …
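A minimal sketch of that pattern (send_to_api is a hypothetical stub standing in for the real API call):

def send_to_api(messages):
    # Hypothetical stub standing in for client.chat.completions.create(...)
    print(f"sending {len(messages)} messages this call")

history = []
for turn in ["A", "B", "C", "D"]:  # stand-ins for real user messages
    history.append({"role": "user", "content": turn})
    # The whole history goes out on every call, and you are billed for all of it.
    send_to_api(history)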
Yes, up to a maximum determined by which model you are using; beyond that the system will truncate the message history and lose things from the start. If you are using a 128K model, this can be around 300 pages' worth of text before truncating.
You can of course handle the thread yourself by removing elements and keeping it smaller, but you may lose contextual accuracy and awareness.
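A rough sketch of that kind of self-managed trimming, assuming tiktoken for counting (the budget value and helper names are illustrative):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def count_tokens(messages):
    # Rough count: content tokens only, ignoring per-message overhead.
    return sum(len(enc.encode(m["content"])) for m in messages)

def trim_history(messages, budget=4000):
    # Keep the system prompt; drop the oldest other messages until under budget.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and count_tokens(system + rest) > budget:
        rest.pop(0)
    return system + rest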
If I am using message history, then it is going to max out my token usage after a few calls, right? So 8192 tokens every call: 0.03 × 8.192 ≈ 25 cents per input, and roughly 50 cents per output. That is basically 75 cents a call! Am I doing this correctly?
At that point it's cheaper to hire a human to do what I've been doing lol.
What is your technique for limiting the request size? If I don’t use message history anymore, does the intro text count against me for the assistant? Would it be better to just go back to the completions API?
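For reference, a quick check of that arithmetic at gpt-4's published per-1K rates ($0.03 input, $0.06 output); note that input and output share the same 8K window, so filling both is a worst-case overestimate:

INPUT_RATE, OUTPUT_RATE = 0.03, 0.06  # dollars per 1K tokens (gpt-4 8K)

def call_cost(input_tokens, output_tokens):
    return input_tokens / 1000 * INPUT_RATE + output_tokens / 1000 * OUTPUT_RATE

print(call_cost(8192, 8192))  # ~0.74: about 25 cents in, 49 cents out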
It sounds like you're using gpt-4 rather than gpt-4-1106-preview? The preview model is around 2.75 times cheaper, and it can take a higher tokens-per-minute limit (probably 300,000, depending on your usage tier).
You don't need to rebuild the entire conversation into your new prompt as context either; just append the latest message onto the running history.
import streamlit as st
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One-time session setup: model name and message history live in session state
if "openai_model" not in st.session_state:
    st.session_state["openai_model"] = "gpt-4-1106-preview"
if "messages" not in st.session_state:
    st.session_state.messages = []

if prompt := st.chat_input("What is up?"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        message_placeholder = st.empty()
        full_response = ""
        for response in client.chat.completions.create(
            model=st.session_state["openai_model"],
            messages=[
                {"role": m["role"], "content": m["content"]}
                for m in st.session_state.messages
            ],
            stream=True,
            max_tokens=4000,
        ):
            # 'choices' is a list; take the first Choice object.
            choice = response.choices[0]
            # The streamed increment lives on the choice's delta attribute.
            choice_delta = choice.delta
            # 'content' can be None on some chunks (e.g. the final one), so guard it.
            message_content = choice_delta.content
            full_response += message_content if message_content is not None else ""
            # Re-render the accumulated text with a cursor while streaming.
            message_placeholder.markdown(full_response + "▌")
        message_placeholder.markdown(full_response)
    st.session_state.messages.append({"role": "assistant", "content": full_response})
This is how I do mine for my Streamlit interface (it's like GPT Plus for my co-workers without actually getting GPT Plus).
You cannot "handle the thread yourself". With what? There is no "truncate chat" feature on offer. You can add more metadata, but it is unclear whether that is just more tokens for the AI to read and ignore.
You can only put user messages in; only the AI can write assistant messages. That is a hard firewall against utility, and against constructing a smaller conversation yourself.
role (string, Required)
The role of the entity that is creating the message. Currently only user is supported.
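For context, this is the call in question; a sketch using the v1 Python SDK's beta threads endpoint, with a placeholder thread ID, showing that only role="user" can be supplied:

from openai import OpenAI

client = OpenAI()
client.beta.threads.messages.create(
    thread_id="thread_abc123",  # placeholder thread ID
    role="user",                # per the docs above, the only accepted value
    content="Condensed recap of the earlier conversation.",
)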
First, does message history provide a material benefit to your use case? If so, consider reducing it by half; remember we are dealing with compounding data here, so cutting the history in half produces a much larger reduction in total tokens over a session.
Even if you avoid a context completely filled with messages, by restarting threads so the assistant can't even answer "what about the other one", any retrieval will still make sure the context is filled before the AI is set loose iterating on function calls against your API or the code interpreter.
The thread object and the messages it contains can be modified, so just reduce the messages to 50% of what they were, as in, lose the top 50%. Next time you perform a run, there will be less data to process. Context may be lost, but that is a cost of token reduction. Do this if the messages are … let's say >4096 tokens' worth.
The models tend to repeat the instructions they were given anyway. You could fairly easily enforce a max-token context or a maximum number of prompt-response interactions (maybe you only want the most recent 3 back-and-forths included); see the sketch below. I agree with @Foxalabs
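A minimal sketch of that kind of cap, assuming a list of strictly alternating user/assistant messages (the helper name is illustrative):

def last_n_exchanges(messages, n=3):
    # Keep any system prompt, plus only the most recent n user/assistant pairs.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-2 * n:]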