From embeds to fine-tuning, there are so many options available for achieving what I want. Which approach should I take?
Here’s what I’m trying to do:
- Transcribe an audio interview (already have this working via API + Whisper)
- Summarize the entire interview in 2-3 paragraphs
- Capture details about the guest (name, location, income, etc.) in a structured format
I’ve tried splitting the transcript into chunks and using a refine approach, but I’m fairly certain that combining the structured-data request with the summary request in a single prompt is giving me poor results. Here’s what that looks like…
- Create transcript chunks (each within a set character limit, plus a context window of the last N sentences from the previous chunk)
- Send the first chunk with prompt
- Send the second chunk with alternate prompt and include previous response
I hoped this would summarize the content and update the summary as new chunks are processed in combination with the previous response, but I’m losing a ton of context and the summary is very poor quality.
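For reference, the refine flow described above can be sketched like this. `call_llm` stands in for a real chat/completions request; the prompt wording, function names, and the idea of passing the previous summary forward are assumptions drawn from the steps listed, not a tested recipe.

```python
# Sketch of the chunk-by-chunk "refine" flow described above.
# `call_llm` is a stand-in for a real chat/completions API call.
FIRST_PROMPT = "Summarize this interview transcript chunk:\n\n{chunk}"
REFINE_PROMPT = (
    "Here is the summary so far:\n\n{summary}\n\n"
    "Refine it using this new transcript chunk:\n\n{chunk}"
)

def refine_summarize(chunks, call_llm):
    """Summarize the first chunk, then fold each later chunk
    into the running summary with the refine prompt."""
    summary = call_llm(FIRST_PROMPT.format(chunk=chunks[0]))
    for chunk in chunks[1:]:
        summary = call_llm(REFINE_PROMPT.format(summary=summary, chunk=chunk))
    return summary
```

The context loss you’re seeing is a known weakness of this pattern: each refine step can only keep what the previous summary preserved.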
Can I even do what I’m hoping with chat/completions?
Would a map reduce approach be better?
Do I need to consider using embeddings (ugh)?
Can I ask how long is the audio data of the interviews?
Once you have the interview converted to text you need to create an embedding.
You then search the embedding with pinpointed questions, and the result can be raw text or text summarised and formatted by ChatGPT.
This is a very common method for processing agreements and business documents.
(So text splitter, chunks, embeddings, then query will give you very nice summary, category classification etc of the data)
I think I’d store each of the chunks (with top-and-tail overlap) in a vector database and then run queries against that data to build up my standard structured format, i.e. ask a series of questions about salary, location etc., then use those vector retrievals to build up a context and ask the AI to produce a section at a time, probably using function calls to standardise the output.
The audio is 45-60 minutes long. Since Whisper has a 25MB limit, I downsample to 96k before splitting into chunks via FFmpeg (detecting silence).
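For anyone following along, the FFmpeg side of that can be sketched as command builders. The file names, bitrate default, and silence thresholds are assumptions; `silencedetect` prints `silence_start`/`silence_end` timestamps to stderr, which you’d parse to pick split points under the 25MB limit.

```python
# Hypothetical helpers that build the FFmpeg commands for the approach above.
# Run the returned lists with subprocess.run(..., capture_output=True).

def downsample_cmd(src, dst, bitrate="96k"):
    """Re-encode the audio at a lower bitrate to shrink the file."""
    return ["ffmpeg", "-i", src, "-b:a", bitrate, dst]

def silence_detect_cmd(src, noise_db=-30, min_silence=0.5):
    """Detect silences to use as chunk boundaries; FFmpeg logs the
    silence_start/silence_end timestamps to stderr."""
    return ["ffmpeg", "-i", src,
            "-af", f"silencedetect=noise={noise_db}dB:d={min_silence}",
            "-f", "null", "-"]
```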
That’s what I figured. Anyone have a sense of the cost of this approach? What I’m afraid of is needing to set up a purpose-built vector database to handle all of this, separate from my main app. It’s a little bit more complex than I was hoping for and the cost factor is something to consider as well.
you may not even need a full blown vectordb tbh.
if it’s just transient, you can use faiss with langchain for example. that’s basically free. you could be up and running with a prototype in jupyter in like an hour or two. the embeddings cost virtually nothing. if your app is not a python stack then faiss might be more challenging, but doable. if you can’t be bothered to figure it out and your use case generally splits into fewer than ~1000 chunks, you can probably get away with vector comparison in a for loop (it’s called a flat index lol). you just take the cosine similarity of each chunk against your query, which costs basically nothing.
a lot of people are running pinecone, but I’ve personally never used it. getting a milvus stack running in docker is pretty trivial.
So, I’ll try and pitch in here.
You could probably modify my technique I posted recently to help gather data about the guest: Data Distillation: Generate custom instructions for ChatGPT using your own data
If you want more granular help with that I’m happy to whip up a process for you.
Are you using ChatGPT at all? Do you have access to it?
I’ve been able to successfully summarize data (even outside the aforementioned technique) to my needs by leveraging the Data Analysis plugin, which can sometimes give me a leg up in creating a pseudo-broader context window. If we can figure out what makes your summary insufficient and if you could be more specific about what things are getting lost in context, that could help me out in finding a solution for you.
Also, while I haven’t tried it with the API specifically, I’ve given chunks of text before for the model to successfully summarize in aggregate, but the logic is slightly different.
It’s basically: chunk->summary1, chunk->summary2, chunk->summary3
Summary1 + Summary2 + Summary3 = Sufficient summary.
Granted, that’s for my specific use cases and from using the Data Analysis plugin to leverage the model summarizing internally before it presents an output, but maybe this could still help you?
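The chunk→summary1, summary1 + summary2 + summary3 logic above is essentially a one-level map-reduce, which can be sketched as follows. The `summarize` callable is a stub for a model call; only the pipeline shape is taken from the post.

```python
# Map-reduce sketch of the aggregate-summary idea described above.
# `summarize` stands in for an LLM call.
def map_reduce_summary(chunks, summarize):
    partials = [summarize(c) for c in chunks]   # map: summarize each chunk
    return summarize("\n".join(partials))       # reduce: summarize the summaries
```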
I like the sound of that!
So here are a few more questions for you:
- What’s the right chunk size for this type of application? A few words? A paragraph? As big as possible? The data is essentially unstructured, so there’s no concept of chapters or pages I could use as a key for the chunks.
- Would a map-reduce or refine method still be preferred for summarization?
I’ll admit, that’s something I’m still trying to probe myself to figure out better.
It’s been one of those cases where I don’t know how it works better, only that it does. My educated guess has to do with token counts and token limits.
Since it appears you seem relatively comfortable around the API, you could use tiktoken and some logic to parse it into chunks of around, I think, ~10k tokens? Someone else is going to have to pitch in and find what the input limit was; I can’t find it right off the bat for some reason.
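A token-count chunker along those lines might look like the sketch below. The encoder is pluggable: with tiktoken you’d pass `tiktoken.get_encoding("cl100k_base").encode` and `.decode`. The 10k default and the overlap size are assumptions; check your model’s actual context limit.

```python
# Chunk text by token count, with some overlap between chunks so context
# isn't cut mid-thought. encode/decode are pluggable (e.g. from tiktoken).
def chunk_by_tokens(text, encode, decode, max_tokens=10_000, overlap=200):
    tokens = encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        end = start + max_tokens
        chunks.append(decode(tokens[start:end]))
        if end >= len(tokens):
            break
        start = end - overlap   # step back so adjacent chunks share tokens
    return chunks
```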
The vector mappings could definitely help if you’re comfortable working with them. @Diet 's solution should work, or maybe the combination of our suggestions.
To answer your second question: for me personally, refinement is a natural, intrinsic part of this process, but I’m realizing it’s not always necessary. For this, though, definitely. I’m assuming map-reduce means using vector embeddings/mappings to achieve this; to me that’s just the earlier step in this process, before you refine for the summary you want.
I’d call it a “reiterative” approach. You’re iterating over the process as you go, giving it chunks of data that allows it to change and refine its summary as you feed it new data.
TL;DR chunk it via token count. I’m not the right person to ask for names or preferences of methods yet; all of my methods are self-taught through my own personal trial and error before I knew prompt-engineering was even a thing.
Ok, so after some helpful replies and deep rabbit holes, I think I have a sense of what I need to do.
- Transcribe audio using Whisper/OpenAI API
- Save transcript to text file
I’ve already built out my own simple chunking system that handles a summarization process using my own refine flow, but it’s not getting me the results I expect. LangChain has all of this built in, so…
Here’s what I need to build
- Pass name of text file from my current app to a lightweight Python server (this server will be running LangChain and ChromaDB)
- Let LangChain handle chunking of text file and orchestration
- Vectorize the text and store in transient ChromaDB store (only really needed for processing a request)
- Run a defined set of queries against the store and pass the results back to the main app as JSON output.
Then, the main app can parse the JSON, store it, etc. Does this sound like a reasonable approach? I’m not re-writing my main app in python but I can use HTTP / cURL to handle comms between them. Starting with local env, and not even sure this part of it needs to go to prod.
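The query-and-return step of that plan could be sketched like this. The field names, questions, and the `answer` callable (which would wrap the LangChain retrieval + LLM call against ChromaDB) are all illustrative assumptions; only the "run a defined set of queries, hand back JSON" contract comes from the plan above.

```python
import json

# Hypothetical query set for the structured guest data.
QUERIES = {
    "name": "What is the guest's name?",
    "location": "Where does the guest live?",
    "income": "What is the guest's income?",
}

def build_payload(answer):
    """Run every defined query through `answer` (a stand-in for the
    retrieval + LLM step) and return the JSON string for the main app."""
    return json.dumps({field: answer(q) for field, q in QUERIES.items()})
```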
Yep, sounds like a very good approach to me!
Let us know how it goes for you. We’ll do our best to troubleshoot as necessary.
Glad we could all help!
So the ideal chunk size is a logical unit of conversation. Character limits are mostly irrelevant here IMO. In a movie it would be like a scene or a shot. If you have paragraphs, that would probably be good.
The way I do paragraphs, though, is that I import/augment information that is relevant to the paragraph, so that the paragraph can be understood as a standalone unit - and then embed that. A trivial example: if they keep using certain jargon or acronyms, or referencing names, it’s typically a good idea to disambiguate that. YMMV; sometimes it’s not worth the effort to do a two-pass approach like this, depending on how ‘heavy’ your content is.
but if you do that, you have excellent basis data which you can then excellently map/filter/reduce/expand
map reduce is often a very good option for most tasks.
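The standalone-chunk augmentation described above could be as simple as the sketch below. The glossary and context header are illustrative assumptions; the point is just that each chunk carries its own disambiguation before being embedded.

```python
# Make a chunk self-contained before embedding: expand jargon/acronyms
# and prepend a short context header. Glossary contents are hypothetical.
def augment_chunk(text, glossary, context=""):
    for short, full in glossary.items():
        text = text.replace(short, f"{short} ({full})")
    return f"Context: {context}\n{text}" if context else text
```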
I’ve been working on the same but on a smaller scale, a 2.5-minute average per call.
I recommend checking Azure batch transcription, if only for the speaker diarization and the fewer limitations and hallucinations; better than Whisper IMO, and the same pricing.
It also lets you keep the original audio quality, since the limit with diarization enabled is 240 minutes / 1 GB per file.
If you’re using .NET, there are a few in memory Vector databases you can use.
Best of luck.
I’m working on a couple of similar tasks, might be worth talking on a call if you don’t mind. Feel free to look up my posts in this forum for some more details