I’ll admit, that’s something I’m still trying to figure out myself. It’s been one of those cases where I don’t know why it works better, only that it does. My educated guess has to do with token counts and token limits.
Since you seem relatively comfortable around the API, you could use tiktoken and some logic to parse the document into chunks of around ~10k tokens, I think? Someone else is going to have to pitch in with the exact input limit; I can’t find it right off the bat for some reason.
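Just as a rough sketch of what I mean (the ~10k chunk size is only what I recall, and the model name here is an assumption, so swap in whatever limit your model actually has):

```python
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 10_000,
                    model: str = "gpt-3.5-turbo") -> list[str]:
    # Encode the whole document, then slice the token list into
    # fixed-size windows and decode each slice back to text.
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]
```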
The vector mappings could definitely help if you’re comfortable working with them. @Diet’s solution should work, or maybe a combination of our suggestions.
To answer your second question: for me personally, refinement is a natural, intrinsic part of this process, but I’m realizing it’s not always necessary. For this, though, it definitely is. I’m assuming map-reduce means using vector embeddings/mappings to achieve this; to me that’s just the earlier step in the process, before you refine toward the summary you want.
I’d call it a “reiterative” approach: you iterate over the process as you go, feeding it chunks of data so it can change and refine its summary with each new piece.
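Something like this is what I have in mind, as a minimal sketch only; the model name and prompt wording are placeholders, not a recommendation:

```python
from openai import OpenAI

client = OpenAI()

def refine_summary(chunks: list[str], model: str = "gpt-3.5-turbo") -> str:
    # Carry a running summary forward and ask the model to revise it
    # with each new chunk, instead of summarizing everything at once.
    summary = ""
    for chunk in chunks:
        prompt = (
            "Here is the summary so far:\n"
            f"{summary or '(empty)'}\n\n"
            "Here is the next chunk of the document:\n"
            f"{chunk}\n\n"
            "Update and refine the summary so it also covers this chunk."
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        summary = response.choices[0].message.content
    return summary
```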
TL;DR: chunk it by token count. I’m not the right person to ask about names or preferred methods yet; everything I do is self-taught through personal trial and error, from before I even knew prompt engineering was a thing.