From embeds to fine-tuning, there are so many options available for achieving what I want. Which approach should I take?
Here’s what I’m trying to do:
- Transcribe an audio interview (already have this working via API + Whisper)
- Summarize the entire interview in 2-3 paragraphs
- Capture details about the guest (name, location, income, etc.) in a structured format
I’ve tried splitting the transcript into chunks and using a refine approach, but I’m fairly certain that combining the structured-data request with the summary request in a single prompt is giving me poor results. Here’s what that looks like…
- Create transcript chunks (each within a set character limit, plus a context window of the last N sentences from the previous chunk)
- Send the first chunk with prompt
- Send the second chunk with alternate prompt and include previous response
I hoped this would summarize the content and update the summary as new chunks are processed in combination with the previous response, but I’m losing a ton of context and the summary is very poor quality.
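For reference, the refine flow described above can be sketched like this. `call_llm` stands in for a real chat/completions request; the prompt wording, function names, and the idea of passing the previous summary forward are assumptions drawn from the steps listed, not a tested recipe.

```python
# Sketch of the chunk-by-chunk "refine" flow described above.
# `call_llm` is a stand-in for a real chat/completions API call.
FIRST_PROMPT = "Summarize this interview transcript chunk:\n\n{chunk}"
REFINE_PROMPT = (
    "Here is the summary so far:\n\n{summary}\n\n"
    "Refine it using this new transcript chunk:\n\n{chunk}"
)

def refine_summarize(chunks, call_llm):
    """Summarize the first chunk, then fold each later chunk
    into the running summary with the refine prompt."""
    summary = call_llm(FIRST_PROMPT.format(chunk=chunks[0]))
    for chunk in chunks[1:]:
        summary = call_llm(REFINE_PROMPT.format(summary=summary, chunk=chunk))
    return summary
```

The context loss you’re seeing is a known weakness of this pattern: each refine step can only keep what the previous summary preserved.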
Can I even do what I’m hoping with chat/completions?
Would a map reduce approach be better?
Do I need to consider using embeddings (ugh)?
Can I ask how long is the audio data of the interviews?
Once you have the interview converted to text you need to create an embedding.
You then search the embedding with pinpointed questions, and the result can be raw text or text summarised and formatted by ChatGPT.
This is a very common method for processing agreements and business documents.
(So text splitter, chunks, embeddings, then query will give you very nice summary, category classification etc of the data)
I think I’d store each of the chunks (with top-and-tail overlap) in a vector database and then run queries against that data to build up my standard structured format, i.e. ask a series of questions about salary, location etc., then use those vector retrievals to build up a context and ask the AI to produce a section at a time, probably using function calls to standardise the output.
The audio is 45-60 minutes long. Since Whisper has a 25MB limit, I downsample to 96k before splitting into chunks via FFmpeg (detecting silence).
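For anyone following along, the FFmpeg side of that can be sketched as command builders. The file names, bitrate default, and silence thresholds are assumptions; `silencedetect` prints `silence_start`/`silence_end` timestamps to stderr, which you’d parse to pick split points under the 25MB limit.

```python
# Hypothetical helpers that build the FFmpeg commands for the approach above.
# Run the returned lists with subprocess.run(..., capture_output=True).

def downsample_cmd(src, dst, bitrate="96k"):
    """Re-encode the audio at a lower bitrate to shrink the file."""
    return ["ffmpeg", "-i", src, "-b:a", bitrate, dst]

def silence_detect_cmd(src, noise_db=-30, min_silence=0.5):
    """Detect silences to use as chunk boundaries; FFmpeg logs the
    silence_start/silence_end timestamps to stderr."""
    return ["ffmpeg", "-i", src,
            "-af", f"silencedetect=noise={noise_db}dB:d={min_silence}",
            "-f", "null", "-"]
```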
That’s what I figured. Anyone have a sense of the cost of this approach? What I’m afraid of is needing to set up a purpose-built vector database to handle all of this, separate from my main app. It’s a little bit more complex than I was hoping for and the cost factor is something to consider as well.
you may not even need a full blown vectordb tbh.
if it’s just transient, you can use faiss with langchain for example. that’s basically free. you could be up and running with a prototype in jupyter in like an hour or two. the embeddings cost virtually nothing. if your app is not a python stack then faiss might be more challenging, but doable. if you can’t be bothered to figure it out and your use case generally splits into fewer than ~1000 chunks, you can probably get away with vector comparison in a for loop (it’s called a flat index lol). you just take the cosine similarity of each chunk against your query, which costs basically nothing.
a lot of people are running pinecone, but I’ve personally never used it. getting a milvus stack running in docker is pretty trivial.
So, I’ll try and pitch in here.
You could probably modify my technique I posted recently to help gather data about the guest: Data Distillation: Generate custom instructions for ChatGPT using your own data
If you want more granular help with that I’m happy to whip up a process for you.
Are you using ChatGPT at all? Do you have access to it?
I’ve been able to successfully summarize data (even outside the aforementioned technique) to my needs by leveraging the Data Analysis plugin, which can sometimes give me a leg up in creating a pseudo-broader context window. If we can figure out what makes your summary insufficient and if you could be more specific about what things are getting lost in context, that could help me out in finding a solution for you.
Also, while I haven’t tried it with the API specifically, I’ve given chunks of text before for the model to successfully summarize in aggregate, but the logic is slightly different.
It’s basically: chunk->summary1, chunk->summary2, chunk->summary3
Summary1 + Summary2 + Summary3 = Sufficient summary.
Granted, that’s for my specific use cases and from using the Data Analysis plugin to leverage the model summarizing internally before it presents an output, but maybe this could still help you?
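The chunk→summary1, summary1 + summary2 + summary3 logic above is essentially a one-level map-reduce, which can be sketched as follows. The `summarize` callable is a stub for a model call; only the pipeline shape is taken from the post.

```python
# Map-reduce sketch of the aggregate-summary idea described above.
# `summarize` stands in for an LLM call.
def map_reduce_summary(chunks, summarize):
    partials = [summarize(c) for c in chunks]   # map: summarize each chunk
    return summarize("\n".join(partials))       # reduce: summarize the summaries
```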
I like the sound of that!
So here are a few more questions for you:
- What’s the right chunk size for this type of application? A few words? A paragraph? As big as possible? The data is essentially unstructured, so there’s no concept of chapters or pages I could use as a key for the chunks.
- Would a map-reduce or refine method still be preferred for summarization?
I’ll admit, that’s something I’m still trying to probe myself to figure out better.
It’s been one of those cases where I don’t know how it works better, only that it does. My educated guess has to do with token counts and token limits.
Since it appears you seem relatively comfortable around the API, you could use tiktoken and some logic to parse it into chunks of around, I think, ~10k tokens? Someone else is going to have to pitch in and find what the input limit was; I can’t find it right off the bat for some reason.
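A token-count chunker along those lines might look like the sketch below. The encoder is pluggable: with tiktoken you’d pass `tiktoken.get_encoding("cl100k_base").encode` and `.decode`. The 10k default and the overlap size are assumptions; check your model’s actual context limit.

```python
# Chunk text by token count, with some overlap between chunks so context
# isn't cut mid-thought. encode/decode are pluggable (e.g. from tiktoken).
def chunk_by_tokens(text, encode, decode, max_tokens=10_000, overlap=200):
    tokens = encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        end = start + max_tokens
        chunks.append(decode(tokens[start:end]))
        if end >= len(tokens):
            break
        start = end - overlap   # step back so adjacent chunks share tokens
    return chunks
```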
The vector mappings could definitely help if you’re comfortable working with them. @Diet 's solution should work, or maybe the combination of our suggestions.
To answer your second question: for me personally, refinement is a natural, intrinsic part of this process, but I’m realizing it’s not always necessary. For this, though, definitely. I’m assuming map-reduce means using vector embeddings/mappings to achieve this; to me that’s just the earlier step in this process, before you refine for the summary you want.
I’d call it a “reiterative” approach. You’re iterating over the process as you go, giving it chunks of data that allows it to change and refine its summary as you feed it new data.
TL;DR chunk it via token count. I’m not the right person to ask for names or preferences of methods yet; all of my methods are self-taught through my own personal trial and error before I knew prompt-engineering was even a thing.
Ok, so after some helpful replies and deep rabbit holes, I think I have a sense of what I need to do.
- Transcribe audio using Whisper/OpenAI API
- Save transcript to text file
I’ve already built out my own simple chunking system that handles a summarization process using my own refine flow, but it’s not getting me the results I expect. LangChain has all of this built in, so…
Here’s what I need to build
- Pass name of text file from my current app to a lightweight Python server (this server will be running LangChain and ChromaDB)
- Let LangChain handle chunking of text file and orchestration
- Vectorize the text and store in transient ChromaDB store (only really needed for processing a request)
- Run a defined set of queries against the store and pass the results back to the main app as JSON output.
Then, the main app can parse the JSON, store it, etc. Does this sound like a reasonable approach? I’m not re-writing my main app in python but I can use HTTP / cURL to handle comms between them. Starting with local env, and not even sure this part of it needs to go to prod.
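The query-and-return step of that plan could be sketched like this. The field names, questions, and the `answer` callable (which would wrap the LangChain retrieval + LLM call against ChromaDB) are all illustrative assumptions; only the "run a defined set of queries, hand back JSON" contract comes from the plan above.

```python
import json

# Hypothetical query set for the structured guest data.
QUERIES = {
    "name": "What is the guest's name?",
    "location": "Where does the guest live?",
    "income": "What is the guest's income?",
}

def build_payload(answer):
    """Run every defined query through `answer` (a stand-in for the
    retrieval + LLM step) and return the JSON string for the main app."""
    return json.dumps({field: answer(q) for field, q in QUERIES.items()})
```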
Yep, sounds like a very good approach to me!
Let us know how it goes for you. We’ll do our best to troubleshoot as necessary.
Glad we could all help!
So the ideal chunk size is a logical unit of conversation. Character limits are mostly irrelevant here IMO. In a movie it would be like a scene or a shot. If you have paragraphs, that would probably be good.
The way I do paragraphs, though, is that I import/augment information that is relevant to the paragraph, so that the paragraph can be understood as a standalone unit - and then embed that. A trivial example: if they keep using certain jargon or acronyms, or referencing names, it’s typically a good idea to disambiguate that. YMMV; sometimes it’s not worth the effort to do a two-pass approach like this, depending on how ‘heavy’ your content is.
but if you do that, you have excellent basis data which you can then excellently map/filter/reduce/expand
map reduce is often a very good option for most tasks.
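The standalone-chunk augmentation described above could be as simple as the sketch below. The glossary and context header are illustrative assumptions; the point is just that each chunk carries its own disambiguation before being embedded.

```python
# Make a chunk self-contained before embedding: expand jargon/acronyms
# and prepend a short context header. Glossary contents are hypothetical.
def augment_chunk(text, glossary, context=""):
    for short, full in glossary.items():
        text = text.replace(short, f"{short} ({full})")
    return f"Context: {context}\n{text}" if context else text
```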
I’ve been working on the same but on a smaller scale, a 2.5-minute average per call.
I recommend checking Azure batch transcription, if only for the speaker diarization and the fewer limitations and hallucinations; better than Whisper IMO, and the same pricing.
It also lets you keep the original audio quality, since the limit with diarization enabled is 240 minutes / 1 GB per file.
If you’re using .NET, there are a few in memory Vector databases you can use.
Best of luck.
I’m working on a couple of similar tasks, might be worth talking on a call if you don’t mind. Feel free to look up my posts in this forum for some more details