I say: it may be. If the result is still the same after this, then another issue may be causing the problem.
P.S.: Above this are suggestions. And a description.
MAX_TOKENS // 3 does not optimize for the ideal input length, as larger chunks can degrade the model's understanding. A more effective strategy involves semantic-aware splitting:
Below is also some bot-written code after a few iterations.
You can put the framing messages back once they are actionable rather than word spew.
Python code implementing semantic-aware chunking using word counts, with a complete workflow for first preparing the chunks, processing them via the API, logging the inputs and outputs for diagnostics, and reassembling the final document.
import re

# Temporary diagnostic files
CHUNK_INPUT_FILE = "chunk_input.txt"
LOG_FILE = "chunk_log.txt"


# Function to split text by paragraphs
def split_by_paragraphs(text):
    """Split text into paragraphs using double newlines as delimiters."""
    paragraphs = text.split("\n\n")
    return [para.strip() for para in paragraphs if para.strip()]


# Function to split text into sentences
def split_by_sentences(text):
    """Split text into sentences using punctuation delimiters."""
    return re.split(r'(?<=[.!?])\s', text)


# Function for semantic-aware splitting
def semantic_chunk_split(text, target_words=500, max_words=800, hard_limit=1000):
    """
    Split text semantically with priorities:
    - Paragraph splits at ~500 words.
    - Sentence-level splits up to ~800 words if no paragraph boundary found.
    - Chunks that still exceed ~1000 words are flushed and tagged "word".
    """
    paragraphs = split_by_paragraphs(text)
    chunks = []
    current_chunk = []
    current_word_count = 0
    for para in paragraphs:
        para_word_count = len(para.split())
        # Check if adding this paragraph would exceed the max word count
        if current_word_count + para_word_count > max_words:
            # If target length not reached, split by sentences
            if current_word_count < target_words and para_word_count <= max_words:
                sentences = split_by_sentences(para)
                for sentence in sentences:
                    sentence_word_count = len(sentence.split())
                    if current_word_count + sentence_word_count > max_words:
                        # Close the chunk at a sentence boundary (skip empty chunks)
                        if current_chunk:
                            chunks.append({
                                "content": " ".join(current_chunk).strip(),
                                "split_at": "sentence"
                            })
                        current_chunk = [sentence]
                        current_word_count = sentence_word_count
                    else:
                        current_chunk.append(sentence)
                        current_word_count += sentence_word_count
            else:
                # Save the current chunk (if any) and start a new one
                if current_chunk:
                    chunks.append({
                        "content": " ".join(current_chunk).strip(),
                        "split_at": "paragraph"
                    })
                current_chunk = [para]
                current_word_count = para_word_count
        else:
            # Add paragraph to the current chunk
            current_chunk.append(para)
            current_word_count += para_word_count
        # Flush the chunk if it has grown past the hard limit
        if current_word_count > hard_limit:
            chunks.append({
                "content": " ".join(current_chunk).strip(),
                "split_at": "word"
            })
            current_chunk = []
            current_word_count = 0
    # Add the last chunk
    if current_chunk:
        chunks.append({
            "content": " ".join(current_chunk).strip(),
            "split_at": "paragraph" if current_word_count <= target_words else "word"
        })
    return chunks


# Function to process each chunk using the OpenAI API
def process_chunk(chunk, client):
    """Send a chunk of text to the OpenAI API and return the response."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a document assistant."},
            {"role": "user", "content": f"Here is a section of the transcript: {chunk}"}
        ],
        max_tokens=4000,
        temperature=0,
        top_p=0,
        frequency_penalty=0,
        presence_penalty=0
    )
    return response.choices[0].message.content.strip()


# Main function to edit a large transcript
def edit_large_transcript(input_file, output_file, client):
    # Read the large text file
    with open(input_file, "r", encoding="utf-8") as f:
        text = f.read()
    # Split the text into chunks
    chunks = semantic_chunk_split(text)
    # Write the chunks to the diagnostic file
    with open(CHUNK_INPUT_FILE, "w", encoding="utf-8") as f:
        for i, chunk in enumerate(chunks):
            f.write(f"Chunk {i + 1} (split at {chunk['split_at']}):\n")
            f.write(repr(chunk["content"]) + "\n\n")
    # Process each chunk and store the results
    edited_chunks = []
    with open(LOG_FILE, "w", encoding="utf-8") as log:
        for i, chunk in enumerate(chunks):
            log.write(f"Processing Chunk {i + 1} (split at {chunk['split_at']}):\n")
            log.write(repr(chunk["content"]) + "\n\n")
            response = process_chunk(chunk["content"], client)
            log.write(f"Response for Chunk {i + 1}:\n")
            log.write(response + "\n\n=====SPLIT=====\n\n")
            edited_chunks.append({
                "content": response,
                "split_at": chunk["split_at"]
            })
    # Reassemble the final document
    final_text = ""
    for chunk in edited_chunks:
        final_text += chunk["content"]
        if chunk["split_at"] == "paragraph":
            final_text += "\n\n"
        else:
            # Sentence- and word-level splits are rejoined with a space
            final_text += " "
    # Write the final output to the file
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(final_text.strip())
    print(f"Editing completed. Results saved to {output_file}")


# Example usage
if __name__ == "__main__":
    from openai import OpenAI  # Importing OpenAI client
    client = OpenAI(api_key="MY_API_KEY")  # Replace with your API key
    input_file = "input_transcript.txt"  # Replace with your input file path
    output_file = "output_transcript.txt"  # Replace with your output file path
    edit_large_transcript(input_file, output_file, client)
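Semantic Chunk Splitting:
Each chunk carries a tag (split_at) for tracking the splitting method used, making reassembly possible.
Diagnostic Logging:
Chunks are written to a file (chunk_input.txt) in repr() format for debugging. Inputs and responses are written to a log (chunk_log.txt).
API Limits: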
Improvements: ideally this would be written so that progress can be resumed after a complete crash.
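One way the resume idea could be sketched: append each finished chunk to a checkpoint file as soon as its response arrives, and skip those chunks on the next run. The names here (`process_resumable`, `load_done`, `chunk_checkpoint.jsonl`) are made up for illustration and are not part of the code above. `process_fn` stands in for whatever function sends one chunk to the API.

```python
import json
import os


def load_done(checkpoint_path):
    """Return {chunk_index: response} for chunks finished in earlier runs."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    # A crash mid-write can leave a truncated line; ignore it.
                    continue
                done[record["index"]] = record["response"]
    return done


def process_resumable(chunks, process_fn, checkpoint_path="chunk_checkpoint.jsonl"):
    """Process chunks in order, skipping any already recorded in the checkpoint."""
    done = load_done(checkpoint_path)
    with open(checkpoint_path, "a", encoding="utf-8") as f:
        # Start on a fresh line in case the previous run died mid-record.
        f.write("\n")
        for i, chunk in enumerate(chunks):
            if i in done:
                continue
            response = process_fn(chunk)
            f.write(json.dumps({"index": i, "response": response}) + "\n")
            f.flush()  # hand the record to the OS before starting the next call
            done[i] = response
    return [done[i] for i in range(len(chunks))]
```

The checkpoint is keyed by chunk index, so it is only valid while the input file and the chunking parameters stay the same. Delete the checkpoint file when either changes.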
I ran your code on some of our Fireflies transcripts.
Here are my thoughts. I agree with the AI-generated help on chunking. At the very least, you need to make sure that chunking happens so that a new ‘page’ always starts with a speaker.
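A rough sketch of chunking on speaker boundaries follows. It assumes each turn begins on its own line with a label and a colon (for example "Speaker 1: ..." or "Jane Doe: ..."). That format is an assumption, so adjust the pattern to whatever your transcript export actually produces. `SPEAKER_LINE`, `split_into_turns` and `chunk_by_speaker` are illustrative names.

```python
import re

# Assumed turn format: a capitalized label of up to ~40 characters, then a colon.
# Lines such as "Note: ..." inside a speech would also match, so tighten the
# pattern if your transcripts contain them.
SPEAKER_LINE = re.compile(r"^[A-Z][\w .'-]{0,40}:\s")


def split_into_turns(text):
    """Group lines into speaker turns; unlabeled lines join the previous turn."""
    turns = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if SPEAKER_LINE.match(line) or not turns:
            turns.append(line.strip())
        else:
            turns[-1] += " " + line.strip()
    return turns


def chunk_by_speaker(text, max_words=800):
    """Pack whole turns into chunks so each chunk starts at a turn boundary."""
    chunks, current, count = [], [], 0
    for turn in split_into_turns(text):
        words = len(turn.split())
        # Close the chunk before a turn that would push it past the budget.
        if current and count + words > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(turn)
        count += words
    if current:
        chunks.append("\n".join(current))
    return chunks
```

A single turn longer than `max_words` is kept whole here rather than split, so a very long monologue still produces an oversized chunk.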
The part of your prompt that mentions ‘Speaker 1’ is going to cause problems on transcripts that HAVE speaker names. When I ran it several times on different models, the output sometimes used Speaker 1, 2, 3, etc. even though there are speaker names in my transcript. If you leave that part out completely, it should work just fine: it will keep either the speaker names or Speaker 1 etc., whichever the transcript already has.
I would also remove the whole first part about paragraphs.
Now my biggest recommendation in this case would be to simply use Google Gemini for this. It is a super simple update to the code and it has a huge context window. But the price might be too high? It is so much easier and faster for this job.
BUT I would say you can completely avoid all of this by simply using Google Gemini (1.5). I ran your episode 253 on that without any problem. Send me an email and I will share the outputs from both my local test and the Gemini output.
I really appreciate everyone’s time.
Looking back at this today, a couple of thoughts.
I still struggle to identify the cause of the issue, since the program ran very well at first. Maybe prompt fatigue was mis-identified as the cause, but if so, then what is the cause?
I will experiment with prompting. Also I do see the discussion about breaking the text into smaller chunks, or at least semantically, but I think that’s overdesigning. I wanted a quick code/prompt that’s 99% accurate. I have to review regardless of how accurate it is, so 1 mistake per page due to poor chunking is still within design requirements.
The formatting stuff is because Whisper outputs a line per 2 seconds of audio, so sometimes there’s 1 word, sometimes 10, depending on how fast the speaker is talking. Also, there are no speaker tags or other identification in the initial transcript. But overall, formatting is a much smaller time saving than the revisions. I should probably break it out into a separate API call with a different prompt first.
I’m not new to coding, but I’m new to Python and AI APIs. So maybe my entire approach is wrong… so to revisit my design requirements…
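The line-per-two-seconds layout could be flattened locally before any API call, which also gives the model cleaner input. This is a minimal sketch under the assumption that the raw transcript is plain text with one short fragment per line. `merge_short_lines` and `target_words` are illustrative names.

```python
def merge_short_lines(raw_text, target_words=80):
    """Join short fragments into paragraphs, breaking only after a sentence ends."""
    paragraphs, current = [], []
    for line in raw_text.splitlines():
        words = line.split()
        if not words:
            continue
        current.extend(words)
        # Start a new paragraph once it is long enough AND the line ends a sentence.
        if len(current) >= target_words and words[-1][-1] in ".!?":
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

Because the output uses blank lines between paragraphs, it can be fed straight into a paragraph-based splitter afterwards.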
DESIGN REQUIREMENTS:
Listen to an audio file and output a transcript that is 99% accurate in content, but excluding repeated words, stutters, filler words, and interjections.
Ideally also assist with formatting and speaker tags.
I don’t really have the option with my company to use expensive or subscription software that specializes in this (yet) but I calculated that OpenAI could do one for about $0.05 and that was a go.
My current solution:
1-call Whisper to perform the initial transcription.
2-initial transcript is “too accurate” and includes every word and stutter.
3-in a new script, call gpt-3.5-turbo to perform transcript cleanup.
4-(TBD) I should make a second step for formatting.
Trying to fix step 3, but also, maybe my entire process is sub-optimal and there’s a much better way to do this. Open to suggestions either way. Will experiment with prompting and breaking the text into smaller pieces (or potentially running calls asynchronously).
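If the asynchronous route is tried, the usual shape is a bounded fan-out: start all chunk calls, but let only a few run at once so rate limits are not hit. The sketch below is generic. `process_one` is a placeholder for any coroutine that handles one chunk (for instance one built on the `AsyncOpenAI` client from the `openai` package), and `max_concurrency=4` is an arbitrary choice.

```python
import asyncio


async def process_all(chunks, process_one, max_concurrency=4):
    """Run process_one on every chunk, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(chunk):
        async with semaphore:
            return await process_one(chunk)

    # gather returns results in the same order as the input chunks,
    # regardless of which call finishes first.
    return await asyncio.gather(*(guarded(c) for c in chunks))


async def fake_process(chunk):
    """Stand-in for a real API call, used only to demonstrate the pattern."""
    await asyncio.sleep(0.01)
    return chunk.upper()


if __name__ == "__main__":
    print(asyncio.run(process_all(["first chunk", "second chunk"], fake_process)))
```

Since results come back in input order, reassembly works the same way as in the sequential version.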
But at this point I’m heading down a path where I’m investing a ton of time in the coding, and if I don’t get it to work, I won’t earn that time back. I could probably perform step 3 manually in 1-2 hours. My hope, if this worked perfectly, is to get the entire process down to 30-40 minutes.
I don’t experience prompt fatigue, but I took a different approach to repetitive prompts on a large volume of text – I used the Batch API. This lowers the cost of processing significantly, although it may take minutes rather than seconds to do the same task.
There is a learning curve to using the Batch API, but once you create your first program, it’s easy to adapt it to other use cases. Let me know if you’d like any sample code.
Just a thought, hope it helps.
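For anyone who has not used the Batch API, a rough sketch of the first step might look like this: write one JSON request per chunk into a `.jsonl` file, then upload it and create the batch. The system prompt, the `chunk-{i}` ID scheme, and the function names are placeholders for illustration. `submit_batch` is untested here because it needs a real client and API key.

```python
import json


def write_batch_file(chunks, path, model="gpt-3.5-turbo",
                     system_prompt="You are a document assistant."):
    """Write one Batch API request per chunk, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for i, chunk in enumerate(chunks):
            request = {
                # custom_id is how each result is matched back to its chunk.
                "custom_id": f"chunk-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": chunk},
                    ],
                    "temperature": 0,
                },
            }
            f.write(json.dumps(request) + "\n")


def submit_batch(client, path):
    """Upload the request file and start a batch (client is an openai.OpenAI instance)."""
    with open(path, "rb") as f:
        batch_file = client.files.create(file=f, purpose="batch")
    return client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
```

Results are not guaranteed to come back in request order, so reassembly should sort the output lines by `custom_id` rather than rely on file order.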