Splitting / Chunking Large Input Text for Summarisation (Greater than 4096 Tokens)

Thank you so much for providing these langchain links! Exactly what I needed.
I tried to explain a little bit, in layman’s terms, how embeddings work and how they can be used.
I think summarizing everything before it is actually needed might be expensive overkill, since summarization costs significantly more than computing embeddings.

I am thinking about creating “rolling” embeddings with a 2k-long overlap, so that whenever I detect a “long but interesting” part of a document, I can process only that part iteratively. I will test the approach in the next few days.
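For anyone who wants to try the “rolling” idea, here is a minimal sketch of overlapping windows. It assumes the text is already tokenized into a list and that the 2k overlap is measured in tokens. The function name `rolling_windows` and the default window size of 4000 are only illustrative, not taken from any library.

```python
def rolling_windows(tokens, window=4000, overlap=2000):
    """Split a token list into windows that share `overlap` tokens with their neighbour.

    Hypothetical helper: `window` and `overlap` are assumed to be token counts.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    step = window - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        # Stop once a window has reached the end of the input,
        # so no trailing window is fully contained in the previous one.
        if start + window >= len(tokens):
            break
    return chunks


# Toy usage: whitespace splitting stands in for a real tokenizer.
words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in rolling_windows(words, window=4, overlap=2):
    print(" ".join(chunk))
```

Each window could then be embedded separately, and only the windows that look interesting would be sent on for the more expensive summarization step.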

My friend is developing a tool dedicated to this task that works in both server-side and client-side JS.

Any feedback is appreciated 🙂

Update: moved here Embedbase Documentation

Thank you so much for providing a solution to this problem. However, I want to pass in a list of reviews, say 10,000 hotel reviews, and generate a summary of the whole list. How can I split the list of reviews?
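One possible way to split a list like this is to pack whole reviews into batches that each stay under a token budget, summarize every batch, and then summarize the batch summaries. The sketch below covers only the batching step. The names `estimate_tokens` and `batch_reviews`, the 3000-token budget, and the “about 4 characters per token” heuristic are all assumptions for illustration; a real tokenizer would give more accurate counts.

```python
def estimate_tokens(text):
    # Rough heuristic (assumption): about 4 characters per token for English text.
    return max(1, len(text) // 4)


def batch_reviews(reviews, max_tokens=3000, count_tokens=estimate_tokens):
    """Greedily group whole reviews into batches of at most `max_tokens` each.

    A single review that exceeds the budget on its own still gets a batch
    to itself and would need further splitting before being sent to a model.
    """
    batches, current, used = [], [], 0
    for review in reviews:
        cost = count_tokens(review)
        # Start a new batch when adding this review would exceed the budget.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(review)
        used += cost
    if current:
        batches.append(current)
    return batches


# Toy usage with made-up reviews.
reviews = ["Great location, friendly staff.", "Room was small but clean."] * 3
for i, batch in enumerate(batch_reviews(reviews, max_tokens=15)):
    print(i, len(batch))
```

Keeping each review intact inside a batch avoids cutting a review in half, at the cost of batches that are slightly uneven in size.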

Hello. I’m working on a solution that combines summarisation and extraction. Basically, I need to make sure that every important piece of information from a call is recorded in the database.

Most calls are under the limit; however, some of them are over 8k tokens.

I’m wondering how small the chunks should be for good summarisation. I expect that the smaller the chunks are, the more information is extracted, but there is also a higher chance of “hallucinations”.

What’s your opinion on that? Which chunk size is optimal when retrieving data from a dialogue is important?

Did you try the doctran interrogation method? You could predefine the parameters you want to extract and then summarize; you might get a better understanding that way. I’m a noob, so this might not be the solution 😅