Prompt Fatigue Question For API Calls

Trying to process large transcripts of spoken English to remove natural stutters and filler words. It works with >99% accuracy for about 3 pages, then devolves: first it forgets to do anything and just prints the input text, and then it just summarizes the remaining text.

I should note that since this is a large text file, over the token limit, I have my Python script breaking the text into blocks before making the call to the API. But it would seem that even though these calls are supposed to be completely independent, they are happening together, i.e., I'm getting significant prompt fatigue.

I have a lot to do, so if I have to manually break each page apart and make a new script, I might as well not use this at all and go back to what I was doing before… I'm just hoping for a workaround so that the output quality of the first 3 pages is maintained. Some way of quitting and initializing a new instance of the model, etc., but I can't figure it out or find any documentation on removing prompt fatigue.

The model doesn't need any prior context since it's just performing rote editing. So that's not an issue either.

Thanks!

Code for reference. I call this method once per page, with a prompt that includes my task plus that page of text. max_tokens is 4000. But I want each call here to be a totally new instance of the LLM to avoid prompt fatigue.

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}  # prompt = my task instructions plus one page of text
    ],
    max_tokens=MAX_TOKENS,
    temperature=0,
    top_p=0,
    frequency_penalty=0,
    presence_penalty=0
)

I'm afraid with gpt-3.5-turbo you won't be able to fix it.
Try 4o or 4o-mini to compare and know for sure :slight_smile:

What do you mean? My results with 4o were indistinguishable from 3.5-turbo so I obviously went with the far cheaper option. The issue isn't the understanding, but the carrying of context through 30+ subsequent calls of the model.

I want each call to be independent.

If I manually break the 30 pages into 30 files, write 30 scripts to call the API, then I get good results. But surely there's a way to make the calls to the API in a similar fashion where each one is independent?

I would sure think so, but I guess I am not understanding how the script call and the API calls are currently different.

client.chat.completions.create() would still be done 30 times and those would not have any 'connection' to each other. Or is that what you are experiencing?

I am experiencing a clear prompt fatigue when calling client.chat.completions.create() 30x in one script. The first few do great with near perfect accuracy and then the rest gets worse and worse. I'm also not understanding how the API calls are working, as I thought they were independent as well, but clearly something is happening.

It's confusing to me though, as the limits on the model are 500 calls per minute. So if OpenAI is trying to get that level of usage from their larger customers, they would clearly need to not have prompt fatigue after 5 calls of the same prompt. So I'm assuming I did something wrong in how I designed my script but idk.

I have not heard of prompt fatigue yet, but that doesn't mean it can't be real :slight_smile:
Is it somehow possible (especially since you mention your 30-script version going OK, and those are also API calls) that some value is re-used, like 'response' in your example code, and you keep adding a message to the same chat? Curious to see what is in your dashboard (platform.openai.com), where you can now see your completions. It should show 30 completions.

There is no connection between API calls, except for other models that can use cache for a discount.

It sounds like you might be growing a chat history and re-sending that. A symptom of gpt-3.5-turbo (and especially 16k) is not performing the rewrite task at max context.
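
To illustrate what that would look like versus truly independent calls — this is just a guess at the failure mode, not the poster's actual code; client, pages, and history are placeholder names:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
pages = ["page 1 text...", "page 2 text..."]  # placeholder page chunks

# Growing-history anti-pattern: every later call re-sends all earlier pages
# and model replies, so each request keeps getting bigger.
history = [{"role": "system", "content": "You are a helpful assistant."}]
for page in pages:
    history.append({"role": "user", "content": page})
    reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
    history.append({"role": "assistant", "content": reply.choices[0].message.content})

# Stateless pattern: a fresh messages list is built for every page, so each
# call is fully independent of the previous ones.
for page in pages:
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": page},
        ],
    )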

Or if truly single messages, that the quality of the task on mid-document is not as high. You might also include an outline or summary of what the document is about, outside of the data needing production.

I do see multiple API calls in my account platform. But as you both asked, I am not making the same call with a larger and larger prompt, as my token count in the platform corresponds with a correct approach where each API call has about a page of input and a page of output.

Thing is, if there were no connection between API calls, then how do you explain what is happening? First 3 pages >99% task fidelity down to <50% for the remainder as the model "forgets" what it was doing.

Context is irrelevant, as I'm only asking the model to read a page at a time and remove things like "um", "like", stutters and repeated words, etc.

It appears to work fine with calls in separate scripts.

First: a less generic system message can improve the quality overall.

"You have a singular automated task and purpose: to improve the quality of text transcripts, removing any pauses or unnatural interjections in English writing that may have been transcribed. You do this by focusing on one sentence at a time from input while you produce a new version of that sentence in the output, constructing a cleaned document as your response."

Then for prompt, you would make a briefer command:

Here is the current section from within a transcript for you to rewrite.
Output only the better version of this text as response:

"""
{document}
"""

Are you making the distinction here between "I run one script about rewriting page 3" and "I send the same identical API calls in a loop, and my successive calls become lower in quality"? There is just no mechanism for that "prompt fatigue" to happen. You can run thousands of API calls a minute and the API doesn't care.

I would look more closely at your document splitting and parsing code techniques, recording exactly what "messages" are being sent to the API in a log. Temporarily replace chat.completions.create(**params) with a function to create timestamp log files instead, for example.
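
For example, something like this wrapper (the function name and filename pattern are just illustrative) writes each request payload to its own timestamped JSON file before forwarding it, so you can inspect exactly what each of the 30 calls contained:

import json
import time

def create_with_logging(client, **params):
    # Dump the exact request payload to a timestamped file for inspection,
    # then forward it to the real endpoint so the script still runs.
    with open(f"request_{time.time_ns()}.json", "w", encoding="utf-8") as f:
        json.dump(params, f, ensure_ascii=False, indent=2)
    return client.chat.completions.create(**params)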

Or if you think this effect is actually happening, in your iterating script, put in "if page_loop == 3: api_call" so only one page is submitted in the same processing style to test.
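
That spot check could be a simple guard in the loop (page_loop and process_chunk are illustrative names, assuming a loop over pre-split chunks and a helper that makes the API call):

# Submit only the fourth chunk through the exact same code path as a test.
for page_loop, chunk in enumerate(chunks):
    if page_loop == 3:
        print(process_chunk(chunk))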

You will have highest quality in general if you don't go by existing "pages", but instead keep the input data to be processed at a time in one model call under 500 words or so. If you can split at logical paragraphs, even better.
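
A rough sketch of that paragraph-based splitting, assuming blank lines separate paragraphs in the transcript (adjust the separator if they don't; nothing here is from the original script):

def split_into_paragraph_chunks(text, max_words=500):
    # Group whole paragraphs (separated by blank lines) into chunks of at
    # most ~max_words words, so no chunk ends mid-sentence. A single
    # paragraph longer than max_words still becomes its own chunk.
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks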

Also, try gpt-4o-mini - it might not forget what its job is. It sounds like the problem is forgetting the "task" with larger inputs (and further into the growing response), and instead just getting into a pattern of repeating the input sequence to output.

Thank you, I will try these things, but I just don't know how we're getting from 99% accuracy to just completely appearing to miss the prompt entirely by the end.

Yes, I mean that I wrote a different script for each page, and that significantly improved the response.

I didn't review the entire thing word for word in both cases, but running everything in one script resulted in half of the page count in the output since the model began to significantly summarize everything, whereas in the first couple of pages there was hardly a 5% reduction in word count because it correctly removed only filler words. That's a clear deviation in behavior using the same prompt across multiple calls. Running one script per page (it took me much longer and that's what I've been trying to avoid) resulted in an output that was much better and nearly doubled in page count from the single-script output. So I'm not understanding what's happening, if there is truly no mechanism for deviation from the prompt across multiple calls.

But I understand what you mean, and I agree: how can it handle 1,000s of calls per minute with no issue if there is prompt fatigue? But regardless, I can't determine what IS causing the issue. I'll try better prompting, but I don't see how that could be the cause of that behavior.

And yes, I do have the script parse the text block based on character count, I just didn't provide that detail before.

And yes, I did try 4o-mini, but the behavior across 4o, 4o-mini, and 3.5-turbo was indistinguishable. This is a rote editing task and does not require an advanced LLM. And yeah, I was surprised that the alleged prompt fatigue happened with all models.

My best guess is also that you are inadvertently sending more and more context without realizing it when using only one script.

My suggestion is to compare your expected prompt and input text token count to what you see in your platform's billing and usage page. You can do this manually using the tool linked below and it should take only a few minutes.

https://platform.openai.com/tokenizer

You can also share your code.
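
If pasting pages into the web tool gets tedious, the same count can be scripted locally with the tiktoken package (assuming it is installed; chunks stands for whatever list your splitter produces):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(text: str) -> int:
    # Token count as the model would see it, rather than a word count.
    return len(enc.encode(text))

# Example: print the token count of each chunk before it is sent.
# for i, chunk in enumerate(chunks):
#     print(i, count_tokens(chunk))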

Tokenizer says 16,000 tokens.

OpenAI platform says I used 17,500 uncached and 1,500 cached tokens. I'm not sure what that means unless the cached are the instructions?

That doesn't seem like I'm making the call with the entire body of text or sending back the same thing multiple times, as I would expect at least some multiple, like 100k tokens or something.


Currently working on a much tighter prompt, just so that I try and eliminate that as a possibility. But I don't see how my previous prompt was the issue if the first calls resulted in >99% accuracy in removing filler words and stutters, and then later just changed to summarizing the text. That's what keeps stumping me.

Ever tried to add things like:

"Let's think step by step."

Some courses, and even OpenAI itself in some documentation on here, say that you can give the model time to think.

So "Let's think step by step." would be one way of giving it that time.
Another one could be "Take a deep breath and count to 20…" and the like.

I guess that should help at least somewhat.

GPT-3.5 and its turbo variants have a maximum number of output tokens equal to 4,096.

If your task is proofreading, then the number of output tokens should be approximately 16,000 as well.

Also, the model allows only 16k input tokens.

Something doesn't add up.

https://platform.openai.com/docs/models#gpt-3-5-turbo

@vb I'm not sure what you mean.

I've set max_tokens to 4,000 for the model and then break my 30-page document into roughly 1,500-token chunks to pass as input to the model. So each call to the API should only send a small bit of the document. Then there's some standard Python code to take each response and append it to the output.

@hugebelts I was just rewriting my prompt on another user's suggestion…

Results of the prompt update: with more explicit, step-by-step instructions, the output is significantly worse.

I went into detail and had a "better" prompt. First, the formatting. Then, the revisions. Examples and specifics: first look for this "example", then look for this "example". You aren't allowed to summarize at all. Success is when the output closely matches the input.

Well, now the model completely summarized the entire first page. Then, the rest of the document was printed completely unedited.

I was extremely, and I mean extremely, impressed by my initial output. I seriously breezed through my review of the first 3 pages in minutes at double speed of the audio and made a single correction. So I know the model is capable, I just don't know what is happening. And I'm not seeing anything that's not working in the code (which ChatGPT also wrote and I fixed up; I don't use Python much, so I don't know all of the syntax).

Here's my code. If you take the time to look at this… thank you.

from openai import OpenAI
import os

# Set your OpenAI API key here
client = OpenAI(api_key="MY_API_KEY")

# Define the maximum token limit for GPT-3 (example: 4000 tokens)
MAX_TOKENS = 4000  # Adjust depending on your model (GPT-3.5, GPT-4, etc.)

# Function to split the text into manageable chunks
def split_text_into_chunks(text, max_tokens=MAX_TOKENS):
    # Tokenize the text and split it based on token limit
    tokens = text.split()  # Simple word-based split (you could use a real tokenizer for more accuracy)
    chunks = []
    current_chunk = []

    for token in tokens:
        current_chunk.append(token)
        # Wrap the chunk at a third of the token limit to leave space for output
        if len(current_chunk) > max_tokens // 3:
            chunks.append(" ".join(current_chunk[:-1]))  # Add the current chunk (without the last token)
            current_chunk = [current_chunk[-1]]  # Start a new chunk with the last token

    if current_chunk:
        chunks.append(" ".join(current_chunk))  # Add the last chunk

    return chunks

def process_chunk(chunk):
    # Updated API usage with the new 'chat' method and gpt-3.5-turbo model
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # Use a newer model (or gpt-4 if you have access)
        messages=[
            {"role": "system", "content": f"You have a singular automated task and purpose: to improve the quality of text transcripts. You do this by first taking the entire transcript and making the text into a single paragraph. Then, given the context, you make one new paragraph for each speaker in the audio, with **Speaker 1:** at the start of each paragraph. You may use speaker names if they are presented in the text, but you may also just number the speakers sequentially. Next, in each paragraph, you perform a revision according to the following instructions: You are not allowed to remove any words of content, such that when comparing the output back to the input, there should be no loss of meaning. You are not allowed to correct grammar or substitute incorrect word usage (such as replacing mute point with moot point if the speaker uses the common incorrect word), or add words for clarity. You are only allowed to remove stutters or repeated words (except when the repeated words are for emphasis, such as saying it was very, very, difficult), you may also remove a thought fragment (such as a speaker saying I was--He told me something...) when the speaker changes thought entirely midsentence. You may also remove an interjection from another speaker that disrupts a thought from the current speaker and does not add content or new information, especially single word interjections like wow! or really? (but not necessarily limited to single word interjections). In this case, the interjection speaker paragraph should be removed entirely and the current speakers dialogue should continue uninterrupted in a single paragraph. You may also remove filler words or speakers ticks, such as repeated usage of um like you know, but you should not remove natural lead ins or transitive words that naturally break up dialogue, such as now or then. You are not allowed to review content for correctness, in fact, other than performing your removal of filler words, you do not care about context at all. You succeed when the output has the maximum possible similarity with the input in content and the minimum possible extraneous words according to the above guidelines, nothing was summarized in the output, no words are present in the output text that was not in the input text aside from the new paragraph headers (this task is purely reductive in nature), and each speaker has a single paragraph of text per speaking engagement with a paragraph header to identify the speaker."},
            {"role": "user", "content": f"Here is the current section of the transcript for you to revise according to your instructions. Remember, do not summarize anything or add to the text except for the paragraph headers. Focus only on your instructions. {chunk}"}
        ],
        max_tokens=MAX_TOKENS,
        temperature=0,
        top_p=0,
        frequency_penalty=0,
        presence_penalty=0
    )
    return response.choices[0].message.content.strip()

# Function to edit a large transcript
def edit_large_transcript(input_file, output_file):
    # Read the large text file
    with open(input_file, "r", encoding="utf-8") as f:
        text = f.read()

    # Split the text into manageable chunks
    chunks = split_text_into_chunks(text)

    # Process each chunk and store the results
    edited_text = ""
    for chunk in chunks:
        edited_text += process_chunk(chunk) + "\n"

    # Write the final edited text to an output file
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(edited_text)

    print(f"Editing completed. Results saved to {output_file}")

# Example usage
if __name__ == "__main__":
    input_file = r"C:\Users\mcmas\Desktop\TriDot Podcast\Output Files\TDP - 253 - Swim Straight! Your Guide to Open-Water Success.txt"  # Input file path
    output_file = r"C:\Users\mcmas\Desktop\TriDot Podcast\Revised Output Files\TDP - 253 - Swim Straight! Your Guide to Open-Water Success.txt"  # Output file path

    edit_large_transcript(input_file, output_file)

The worse prompt with a great output for the first 3 pages was:

{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Please edit the following transcript by removing filler words, stutters, repeated words (unless for emphasis), and interjections by other speakers that do not add content (such as 'wow!'). Group dialogue from each speaker into single paragraphs, only start a new paragraph when the speaker changes or appears to change. Start each paragraph with the speaker's name, bolded, followed by a colon (e.g., 'Andrew:'). If a speaker's name is unknown, use 'Speaker 1:', 'Speaker 2:', etc. Do not make any changes to the content of the text or summarize anything, the output text should closely match the content and length of the input text, with only extraneous words removed. Transcript: {chunk}"}

I would refrain from patterns/structures/wordings/sentences like

Don't x.

Instead either write:

  • Refrain from x.
  • Or use y instead of x.
  • Instead of what not to do, tell what to do.
  • etc.

Normally this is a good exercise to get happier.
BUT, turns out we can also use this for prompting:

for example:

Don't write a sad text. → Write a happy text.
Don't say "no". → Say "yes".

AND, there are sentences where you might have to rebuild the sentence completely. That's how much we're used to phrasing sentences in a negative way (in terms of wording).

For example:

Don't let it feel forced.

Instead you may use something like:

Aim for a relaxed and effortless feel.

P.S.: Or in other words:

Instead of narrowing the context by negation, do so by specification.

It's sometimes not easy for us humans to wrap our heads around a single "no", like in:

"Are you not sick?"

Sorry, the prompt is hard to read the way the code appears above…

But you're saying it's all a prompt issue?

Again, it all goes back to the fact that I had a much more positive-sounding prompt, written the way I typically do, that resulted in >99% accuracy for the first 3 pages and then devolved as subsequent calls were made. It doesn't make sense to me: if the prompt was already good enough to get that level of accuracy, then how can I improve it?

edit because I've reached max posts for new accounts:

to reply to the below…

In the previous prompt that worked well for a bit, I did positive wording. I only added the negative "don't summarize" to try and resolve the summarizing after page 3. Changing the one "don't" to "refrain from" probably won't make any difference.

"Please edit the following transcript by removing filler words, stutters, repeated words (unless for emphasis), and interjections by other speakers that do not add content (such as 'wow!'). Group dialogue from each speaker into single paragraphs, only start a new paragraph when the speaker changes or appears to change. Start each paragraph with the speaker's name, bolded, followed by a colon (e.g., 'Andrew:'). If a speaker's name is unknown, use 'Speaker 1:', 'Speaker 2:', etc. Do not make any changes to the content of the text or summarize anything, the output text should closely match the content and length of the input text, with only extraneous words removed. Transcript: {chunk}"

But I will say that when I added a bunch of "no"s it did get a lot worse! So you aren't wrong, I just don't think that's the issue. The above prompt worked great… for a bit. Idk what's causing it to not work the more times I call the model.