Prompt Fatigue Question For API Calls

Trying to process large transcripts of spoken English to remove natural stutters and filler words. It works with >99% accuracy for about 3 pages, then devolves: first it forgets to do anything and just prints the input text, and then it just summarizes the remaining text.

I should note that since this is a large text file, over the token limit, I have my Python script breaking the text into blocks before making the call to the API. But it would seem that even though these calls are supposed to be completely independent, they are happening together, i.e., I'm getting significant prompt fatigue.

I have a lot to do, so if I have to manually break each page apart and make a new script, I might as well not use this at all and go back to what I was doing before… I'm just hoping for a workaround so that the output quality of the first 3 pages is maintained. Some way of quitting and initializing a new instance of the model, etc., but I can't figure it out or find any documentation on removing prompt fatigue.

The model doesn't need any prior context since it's just performing rote editing. So that's not an issue either.

Thanks!

Code for reference. I call this method once per page, with a prompt that includes my task plus that page of text. max_tokens is 4000. But I want each call here to be a totally new instance of the LLM to avoid prompt fatigue.

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}  # prompt = my task instructions plus one page of text
    ],
    max_tokens=MAX_TOKENS,
    temperature=0,
    top_p=0,
    frequency_penalty=0,
    presence_penalty=0
)

I'm afraid with gpt-3.5-turbo you won't be able to fix it.
Try 4o or 4o-mini to compare and know for sure :slight_smile:

What do you mean? My results with 4o were indistinguishable from 3.5-turbo so I obviously went with the far cheaper option. The issue isn't the understanding, but the carrying of context through 30+ subsequent calls of the model.

I want each call to be independent.

If I manually break the 30 pages into 30 files, write 30 scripts to call the API, then I get good results. But surely there's a way to make the calls to the API in a similar fashion where each one is independent?

I would sure think so, but I guess I am not understanding how the script call and the API calls are currently different.

client.chat.completions.create() would still be done 30 times and those would not have any 'connection' to each other. Or is that what you are experiencing?

I am experiencing a clear prompt fatigue when calling client.chat.completions.create() 30x in one script. The first few do great with near perfect accuracy and then the rest gets worse and worse. I'm also not understanding how the API calls are working, as I thought they were independent as well, but clearly something is happening.

It's confusing to me though, as the limits on the model are 500 calls per minute. So if OpenAI is trying to get that level of usage from their larger customers, they would clearly need to not have prompt fatigue after 5 calls of the same prompt. So I'm assuming I did something wrong in how I designed my script but idk.

I have not heard of prompt fatigue yet, but that doesn't mean it can't be real :slight_smile:
Is it somehow possible (especially since you mention your 30-script version going OK, and those are also API calls) that some value is re-used, like 'response' in your example code, and you keep adding a message to the same chat? Curious to see what is in your dashboard (platform.openai.com), where you can now see your completions. It should show 30 completions.

There is no connection between API calls, except for other models that can use cache for a discount.

It sounds like you might be growing a chat history and re-sending that. A symptom of gpt-3.5-turbo (and especially 16k) is not performing the rewrite task at max context.
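
To illustrate what that would look like versus truly independent calls — this is just a guess at the failure mode, not the poster's actual code; client, pages, and history are placeholder names:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
pages = ["page 1 text...", "page 2 text..."]  # placeholder page chunks

# Growing-history anti-pattern: every later call re-sends all earlier pages
# and model replies, so each request keeps getting bigger.
history = [{"role": "system", "content": "You are a helpful assistant."}]
for page in pages:
    history.append({"role": "user", "content": page})
    reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
    history.append({"role": "assistant", "content": reply.choices[0].message.content})

# Stateless pattern: a fresh messages list is built for every page, so each
# call is fully independent of the previous ones.
for page in pages:
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": page},
        ],
    )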

Or if truly single messages, that the quality of the task on mid-document is not as high. You might also include an outline or summary of what the document is about, outside of the data needing production.

I do see multiple API calls in my account platform. But as you both asked, I am not making the same call with a larger and larger prompt, as my token count in the platform corresponds with a correct approach where each API call has about a page of input and a page of output.

Thing is, if there were no connection between API calls, then how do you explain what is happening? First 3 pages >99% task fidelity down to <50% for the remainder as the model "forgets" what it was doing.

Context is irrelevant, as I'm only asking the model to read a page at a time and remove things like "um", "like", stutters and repeated words, etc.

It appears to work fine with calls in separate scripts.

First: a less generic system message can improve the quality overall.

"You have a singular automated task and purpose: to improve the quality of text transcripts, removing any pauses or unnatural interjections in English writing that may have been transcribed. You do this by focusing on one sentence at a time from input while you produce a new version of that sentence in the output, constructing a cleaned document as your response."

Then for prompt, you would make a briefer command:

Here is the current section from within a transcript for you to rewrite.
Output only the better version of this text as response:

"""
{document}
"""

Are you making the distinction here between "I run one script about rewriting page 3" and "I send the same identical API calls in a loop, and my successive calls become lower in quality"? There is just no mechanism for that "prompt fatigue" to happen. You can run thousands of API calls a minute and the API doesn't care.

I would look more closely at your document splitting and parsing code techniques, recording exactly what "messages" are being sent to the API in a log. Temporarily replace chat.completions.create(**params) with a function to create timestamp log files instead, for example.
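
For example, something like this wrapper (the function name and filename pattern are just illustrative) writes each request payload to its own timestamped JSON file before forwarding it, so you can inspect exactly what each of the 30 calls contained:

import json
import time

def create_with_logging(client, **params):
    # Dump the exact request payload to a timestamped file for inspection,
    # then forward it to the real endpoint so the script still runs.
    with open(f"request_{time.time_ns()}.json", "w", encoding="utf-8") as f:
        json.dump(params, f, ensure_ascii=False, indent=2)
    return client.chat.completions.create(**params)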

Or if you think this effect is actually happening, in your iterating script, put in "if page_loop == 3: api_call" so only one page is submitted in the same processing style to test.
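
That spot check could be a simple guard in the loop (page_loop and process_chunk are illustrative names, assuming a loop over pre-split chunks and a helper that makes the API call):

# Submit only the fourth chunk through the exact same code path as a test.
for page_loop, chunk in enumerate(chunks):
    if page_loop == 3:
        print(process_chunk(chunk))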

You will have highest quality in general if you don't go by existing "pages", but instead keep the input data to be processed at a time in one model call under 500 words or so. If you can split at logical paragraphs, even better.
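
A rough sketch of that paragraph-based splitting, assuming blank lines separate paragraphs in the transcript (adjust the separator if they don't; nothing here is from the original script):

def split_into_paragraph_chunks(text, max_words=500):
    # Group whole paragraphs (separated by blank lines) into chunks of at
    # most ~max_words words, so no chunk ends mid-sentence. A single
    # paragraph longer than max_words still becomes its own chunk.
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks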

Also, try gpt-4o-mini - it might not forget what its job is. It sounds like the problem is forgetting the "task" with larger inputs (and further into the growing response), and instead just getting into a pattern of repeating the input sequence to output.

Thank you, I will try these things, but I just don't know how we're getting from 99% accuracy to just completely appearing to miss the prompt entirely by the end.

Yes, I mean that I wrote a different script for each page, and that significantly improved the response.

I didn't review the entire thing word for word in both cases, but running everything in one script resulted in half of the page count in the output since the model began to significantly summarize everything, whereas in the first couple of pages there was hardly a 5% reduction in word count because it correctly removed only filler words. That's a clear deviation in behavior using the same prompt across multiple calls. Running one script per page (it took me much longer and that's what I've been trying to avoid) resulted in an output that was much better and nearly doubled in page count from the single-script output. So I'm not understanding what's happening, if there is truly no mechanism for deviation from the prompt across multiple calls.

But I understand what you mean, and I agree: how can it handle 1,000s of calls per minute with no issue if there is prompt fatigue? But regardless, I can't determine what IS causing the issue. I'll try better prompting, but I don't see how that could be the cause of that behavior.

And yes, I do have the script parse the text block based on character count, I just didn't provide that detail before.

And yes, I did try 4o-mini, but the behavior across 4o, 4o-mini, and 3.5-turbo was indistinguishable. This is a rote editing task and does not require an advanced LLM. And yeah, I was surprised that the alleged prompt fatigue happened with all models.

My best guess is also that you are inadvertently sending more and more context without realizing it when using only one script.

My suggestion is to compare your expected prompt and input text token count to what you see in your platform's billing and usage page. You can do this manually using the tool linked below and it should take only a few minutes.

https://platform.openai.com/tokenizer

You can also share your code.
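
If pasting pages into the web tool gets tedious, the same count can be scripted locally with the tiktoken package (assuming it is installed; chunks stands for whatever list your splitter produces):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(text: str) -> int:
    # Token count as the model would see it, rather than a word count.
    return len(enc.encode(text))

# Example: print the token count of each chunk before it is sent.
# for i, chunk in enumerate(chunks):
#     print(i, count_tokens(chunk))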

Tokenizer says 16,000 tokens.

OpenAI platform says I used 17,500 uncached and 1,500 cached tokens. I'm not sure what that means unless the cached are the instructions?

That doesn't seem like I'm making the call with the entire body of text or sending back the same thing multiple times, as I would expect at least some multiple, like 100k tokens or something.


Currently working on a much tighter prompt, just so that I try and eliminate that as a possibility. But I don't see how my previous prompt was the issue if the first calls resulted in >99% accuracy in removing filler words and stutters, and then later just changed to summarizing the text. That's what keeps stumping me.

Ever tried to add things like:

"Let's think step by step."

Some courses, and even OpenAI itself in some documentation on here, say that you can give the model time to think.

So "Let's think step by step." would be one way of giving it that time.
Another one could be "Take a deep breath and count to 20…" and the like.

I guess that should help at least somewhat.

GPT-3.5 and its turbo variants have a maximum number of output tokens equal to 4,096.

If your task is proofreading, then the number of output tokens should be approximately 16,000 as well.

Also, the model allows only 16k input tokens.

Something doesn't add up.

https://platform.openai.com/docs/models#gpt-3-5-turbo

@vb I'm not sure what you mean.

I've set max_tokens to 4,000 for the model and then break my 30-page document into roughly 1,500-token chunks to pass as input to the model. So each call to the API should only send a small bit of the document. Then there's some standard Python code to take each response and append it to the output.

@hugebelts I was just rewriting my prompt on another user's suggestion…

Results of the prompt update: with more explicit, step-by-step instructions, the output is significantly worse.

I went into detail and had a "better" prompt. First, the formatting. Then, the revisions. Examples and specifics: first look for this "example", then look for this "example". You aren't allowed to summarize at all. Success is when the output closely matches the input.

Well, now the model completely summarized the entire first page. Then, the rest of the document was printed completely unedited.

I was extremely, and I mean extremely, impressed by my initial output. I seriously breezed through my review of the first 3 pages in minutes at double speed of the audio and made a single correction. So I know the model is capable, I just don't know what is happening. And I'm not seeing anything that's not working in the code (which ChatGPT also wrote and I fixed up; I don't use Python much, so I don't know all of the syntax).

Here's my code. If you take the time to look at this… thank you.

from openai import OpenAI
import os

# Set your OpenAI API key here
client = OpenAI(api_key="MY_API_KEY")

# Define the maximum token limit for GPT-3 (example: 4000 tokens)
MAX_TOKENS = 4000  # Adjust depending on your model (GPT-3.5, GPT-4, etc.)

# Function to split the text into manageable chunks
def split_text_into_chunks(text, max_tokens=MAX_TOKENS):
    # Tokenize the text and split it based on token limit
    tokens = text.split()  # Simple word-based split (you could use a real tokenizer for more accuracy)
    chunks = []
    current_chunk = []

    for token in tokens:
        current_chunk.append(token)
        # Wrap the chunk at a third of the token limit to leave space for output
        if len(current_chunk) > max_tokens // 3:
            chunks.append(" ".join(current_chunk[:-1]))  # Add the current chunk (without the last token)
            current_chunk = [current_chunk[-1]]  # Start a new chunk with the last token

    if current_chunk:
        chunks.append(" ".join(current_chunk))  # Add the last chunk

    return chunks

def process_chunk(chunk):
    # Updated API usage with the new 'chat' method and gpt-3.5-turbo model
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # Use a newer model (or gpt-4 if you have access)
        messages=[
            {"role": "system", "content": f"You have a singular automated task and purpose: to improve the quality of text transcripts. You do this by first taking the entire transcript and making the text into a single paragraph. Then, given the context, you make one new paragraph for each speaker in the audio, with **Speaker 1:** at the start of each paragraph. You may use speaker names if they are presented in the text, but you may also just number the speakers sequentially. Next, in each paragraph, you perform a revision according to the following instructions: You are not allowed to remove any words of content, such that when comparing the output back to the input, there should be no loss of meaning. You are not allowed to correct grammar or substitute incorrect word usage (such as replacing mute point with moot point if the speaker uses the common incorrect word), or add words for clarity. You are only allowed to remove stutters or repeated words (except when the repeated words are for emphasis, such as saying it was very, very, difficult), you may also remove a thought fragment (such as a speaker saying I was--He told me something...) when the speaker changes thought entirely midsentence. You may also remove an interjection from another speaker that disrupts a thought from the current speaker and does not add content or new information, especially single word interjections like wow! or really? (but not necessarily limited to single word interjections). In this case, the interjection speaker paragraph should be removed entirely and the current speakers dialogue should continue uninterrupted in a single paragraph. You may also remove filler words or speakers ticks, such as repeated usage of um like you know, but you should not remove natural lead ins or transitive words that naturally break up dialogue, such as now or then. You are not allowed to review content for correctness, in fact, other than performing your removal of filler words, you do not care about context at all. You succeed when the output has the maximum possible similarity with the input in content and the minimum possible extraneous words according to the above guidelines, nothing was summarized in the output, no words are present in the output text that was not in the input text aside from the new paragraph headers (this task is purely reductive in nature), and each speaker has a single paragraph of text per speaking engagement with a paragraph header to identify the speaker."},
            {"role": "user", "content": f"Here is the current section of the transcript for you to revise according to your instructions. Remember, do not summarize anything or add to the text except for the paragraph headers. Focus only on your instructions. {chunk}"}
        ],
        max_tokens=MAX_TOKENS,
        temperature=0,
        top_p=0,
        frequency_penalty=0,
        presence_penalty=0
    )
    return response.choices[0].message.content.strip()

# Function to edit a large transcript
def edit_large_transcript(input_file, output_file):
    # Read the large text file
    with open(input_file, "r", encoding="utf-8") as f:
        text = f.read()

    # Split the text into manageable chunks
    chunks = split_text_into_chunks(text)

    # Process each chunk and store the results
    edited_text = ""
    for chunk in chunks:
        edited_text += process_chunk(chunk) + "\n"

    # Write the final edited text to an output file
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(edited_text)

    print(f"Editing completed. Results saved to {output_file}")

# Example usage
if __name__ == "__main__":
    input_file = r"C:\Users\mcmas\Desktop\TriDot Podcast\Output Files\TDP - 253 - Swim Straight! Your Guide to Open-Water Success.txt"  # Input file path
    output_file = r"C:\Users\mcmas\Desktop\TriDot Podcast\Revised Output Files\TDP - 253 - Swim Straight! Your Guide to Open-Water Success.txt"  # Output file path

    edit_large_transcript(input_file, output_file)

The worse prompt with a great output for the first 3 pages was:

{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Please edit the following transcript by removing filler words, stutters, repeated words (unless for emphasis), and interjections by other speakers that do not add content (such as 'wow!'). Group dialogue from each speaker into single paragraphs, only start a new paragraph when the speaker changes or appears to change. Start each paragraph with the speaker's name, bolded, followed by a colon (e.g., 'Andrew:'). If a speaker's name is unknown, use 'Speaker 1:', 'Speaker 2:', etc. Do not make any changes to the content of the text or summarize anything, the output text should closely match the content and length of the input text, with only extraneous words removed. Transcript: {chunk}"}

I would refrain from patterns/structures/wordings/sentences like

Don't x.

Instead either write:

  • Refrain from x.
  • Or use y instead of x.
  • Instead of what not to do, tell what to do.
  • etc.

Normally this is a good exercise to get happier.
BUT, turns out we can also use this for prompting:

for example:

Don't write a sad text. → Write a happy text.
Don't say "no". → Say "yes".

AND, there are sentences where you might have to rebuild the sentence completely. That's how much we're used to phrasing sentences in a negative way (in terms of wording).

For example:

Don't let it feel forced.

Instead you may use something like:

Aim for a relaxed and effortless feel.

P.S.: Or in other words:

Instead of narrowing the context by negation, do so by specification.

It's sometimes not easy for us humans to wrap our heads around a single "no", like in:

"Are you not sick?"

Sorry, the prompt is hard to read the way the code appears above…

But you're saying it's all a prompt issue?

Again, it all goes back to the fact that I had a much more positive-sounding prompt, written the way I typically do, that resulted in >99% accuracy for the first 3 pages and then devolved as subsequent calls were made. It doesn't make sense to me: if the prompt was already good enough to get that level of accuracy, then how can I improve it?

edit because I've reached max posts for new accounts:

to reply to the below…

In the previous prompt that worked well for a bit, I did positive wording. I only added the negative "don't summarize" to try and resolve the summarizing after page 3. Changing the one "don't" to "refrain from" probably won't make any difference.

"Please edit the following transcript by removing filler words, stutters, repeated words (unless for emphasis), and interjections by other speakers that do not add content (such as 'wow!'). Group dialogue from each speaker into single paragraphs, only start a new paragraph when the speaker changes or appears to change. Start each paragraph with the speaker's name, bolded, followed by a colon (e.g., 'Andrew:'). If a speaker's name is unknown, use 'Speaker 1:', 'Speaker 2:', etc. Do not make any changes to the content of the text or summarize anything, the output text should closely match the content and length of the input text, with only extraneous words removed. Transcript: {chunk}"

But I will say that when I added a bunch of "no"s it did get a lot worse! So you aren't wrong, I just don't think that's the issue. The above prompt worked great… for a bit. Idk what's causing it to not work the more times I call the model.