Just for fun, I decided to create an application, assisted by OpenAI, that takes a video and turns it into a readable article. I used the `audio/transcriptions` endpoint to transcribe the video, and the `chat/completions` endpoint to generate a header for the article. My issue is that almost the entire transcription comes back as one solid block of text. It added line breaks at the beginning, but after that it's just a solid chunk. You can see the transcription here: Open AI transcription - Pastebin.com

How can I use OpenAI to fix this? Can I tweak the `audio/transcriptions` call somehow? Can I use some other endpoint to insert line breaks?
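For reference, my two calls look roughly like this (a minimal sketch using the Node `openai` v3 SDK; the file name and the header prompt are placeholders):

```ts
import fs from "fs";
import { Configuration, OpenAIApi } from "openai";

const openai = new OpenAIApi(
  new Configuration({ apiKey: process.env.OPENAI_API_KEY })
);

// 1. Transcribe the extracted audio track with Whisper.
const transcription = await openai.createTranscription(
  fs.createReadStream("video-audio.mp3") as any, // the SDK's types expect File
  "whisper-1"
);

// 2. Ask chat/completions for an article header based on the transcript.
const header = await openai.createChatCompletion({
  model: "gpt-3.5-turbo",
  messages: [
    {
      role: "user",
      content: `Write a short, descriptive headline for this article:\n\n${transcription.data.text}`,
    },
  ],
});

console.log(header.data.choices[0].message?.content);
```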
I ran into the same problem using completions: the text to "translate" (in my use case, to correct for grammar and spelling) is always returned as a single block of text, regardless of paragraph breaks in the input.

My prompts, using text-davinci-003, are "Correct this to standard English" or "Correct this to informal English." I've added "and preserve paragraphs" and other variations, to no effect.
Attempts to place markers in the text (particles, bracketed or escaped code, whole sentences) for subsequent replacement are also defeated: the returned text strips those out, too.
The only solution I've found so far has been to "explode" the original text into an array at the paragraph breaks, send a request for each paragraph, then re-assemble the text. The problem then is that the translations/corrections are of lower quality, presumably because each paragraph no longer benefits from the surrounding context: I've seen clear mistakes in the original get caught in the whole-block submission but missed in the paragraph-by-paragraph submissions. A minimal sketch of this workaround is below.
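Here is roughly what that explode/re-assemble workaround looks like (Node `openai` v3 SDK; the function name and token limit are my own choices):

```ts
import { Configuration, OpenAIApi } from "openai";

const openai = new OpenAIApi(
  new Configuration({ apiKey: process.env.OPENAI_API_KEY })
);

async function correctPreservingParagraphs(text: string): Promise<string> {
  // Split on blank lines, correct each paragraph on its own, then re-join
  // with the original paragraph breaks. Trades correction quality for layout.
  const paragraphs = text.split(/\n\s*\n/);
  const corrected: string[] = [];
  for (const paragraph of paragraphs) {
    const res = await openai.createCompletion({
      model: "text-davinci-003",
      prompt: `Correct this to standard English:\n\n${paragraph}`,
      max_tokens: 1024,
      temperature: 0,
    });
    corrected.push(res.data.choices[0].text?.trim() ?? paragraph);
  }
  return corrected.join("\n\n");
}
```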
If anyone has a solution for conveying paragraph breaks from the original through to the returned text, I'd be interested in hearing it.
You can request the `verbose_json` response format; the response will then be typed as `AxiosResponse<VerboseJSONTranscriptionResponse>`, per the definitions below. You can then use the segment information to arrange the response; a sketch follows the type definitions.
```ts
export type TranscriptionSegment = {
  id: number;
  seek: number;
  start: number;
  end: number;
  text: string;
  tokens: number[];
  temperature: number;
  avg_logprob: number;
  compression_ratio: number;
  no_speech_prob: number;
  transient: boolean;
};

export type VerboseJSONTranscriptionResponse = {
  task: "transcribe";
  language: string;
  duration: number;
  segments: TranscriptionSegment[];
  text: string;
};
```
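A minimal sketch of how you could use those segments, assuming the Node `openai` v3 SDK (the file name and the 1.5-second pause threshold are placeholders of mine, not part of the API):

```ts
import fs from "fs";
import { AxiosResponse } from "axios";
import { Configuration, OpenAIApi } from "openai";

const openai = new OpenAIApi(
  new Configuration({ apiKey: process.env.OPENAI_API_KEY })
);

// Request verbose_json so the response carries per-segment timing.
const res = (await openai.createTranscription(
  fs.createReadStream("video-audio.mp3") as any, // the SDK's types expect File
  "whisper-1",
  undefined,      // prompt
  "verbose_json"  // response_format
)) as AxiosResponse<VerboseJSONTranscriptionResponse>;

// One segment per line; treat a pause longer than ~1.5 s between segments
// as a paragraph break (a heuristic, not something the API reports).
let article = "";
let previousEnd = 0;
for (const segment of res.data.segments) {
  if (article.length > 0) {
    article += segment.start - previousEnd > 1.5 ? "\n\n" : "\n";
  }
  article += segment.text.trim();
  previousEnd = segment.end;
}
```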
If that's still not good enough, there are third-party services such as Deepgram, which offer more advanced processing and formatting of the responses, built on top of Whisper or other transcription models.
Thanks for your reply, but things probably got confused because the original question referred to transcription, while I've been trying the completions and edits endpoints for other purposes.

Also, the OpenAI Playground uses a completions-based example for grammar correction.
Though `verbose_json` is not an option for either completions or edits, I've found that by switching to the edits endpoint I can preserve paragraphing. In fact, the model seems to add new line breaks where none are present in the original. I'm still experimenting with different instructions; a sketch of the call is below. I'm also getting a higher frequency of timeouts, but that may be the time of day or some other network issue.
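For anyone following along, the edits call that preserved paragraphing for me looks roughly like this (Node `openai` v3 SDK; the sample input is a placeholder):

```ts
import { Configuration, OpenAIApi } from "openai";

const openai = new OpenAIApi(
  new Configuration({ apiKey: process.env.OPENAI_API_KEY })
);

const originalText = "First paragraph...\n\nSecond paragraph...";

// The edits endpoint takes the text as `input` and the task as `instruction`;
// in my tests it kept (and sometimes added) paragraph breaks.
const res = await openai.createEdit({
  model: "text-davinci-edit-001",
  input: originalText,
  instruction: "Correct this to standard English.",
  temperature: 0,
});

const corrected = res.data.choices[0].text;
```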
Depending on how things go, I might start my own thread on this question.
Yes, I was talking about transcription, to be clear.