I am wondering how to summarize large research articles. Each article is larger than the current maximum token limit.
I went through the fine-tuning route, but it did not produce the expected results, and I don't think fine-tuning is supposed to solve this use case.
Lastly I tried embeddings, but even with embeddings I hit a wall because the token limit comes into play when I ask the API to summarize.
I can do recursive summarization of the article, but I am not sure it is an efficient way to solve this use case.
Can I ask for the community's help in pointing me in the right direction?
Recursive summarization is likely the best approach for a general summary.
I have had pretty good success summarizing large documents manually using chunking, and I am now working on code to do this for me. The techniques I have considered are below, but don't start at the complex end of the list: I have had good results with chunking and the summaries-of-summaries version.
Chunking: Divide the text into smaller chunks, ensuring that each chunk is within the model’s token limit. Summarize each chunk independently and then combine the summaries at the end to create a final summary. This method might lead to a loss of coherence and context in the final summary.
Iterative summarization: Similar to chunking, divide the text into smaller chunks. Start by summarizing the first chunk, then use the summary as input for the next chunk summarization, and so on. This method may help maintain better context and coherence compared to independent chunking but might still suffer from information loss.
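The iterative variant can be sketched as below; again `summarize` is a placeholder for a real model call, and the prompt wording is only illustrative.

```python
def summarize(prompt: str) -> str:
    # Placeholder for an LLM call; real code would send `prompt` to the API.
    return prompt[-200:]

def iterative_summary(chunks: list[str]) -> str:
    summary = ""
    for chunk in chunks:
        # Carry the running summary forward so each step keeps prior context.
        prompt = (f"Current summary:\n{summary}\n\n"
                  f"Additional text:\n{chunk}\n\n"
                  "Rewrite the summary to include the new information.")
        summary = summarize(prompt)
    return summary
```

Note the running summary itself must stay well under the token limit, or it crowds out the new chunk.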
Summarize summaries: Divide the text into smaller chunks and create summaries for each chunk independently. Once you have all the summaries, use the model to generate a summary of these summaries. This approach may be more coherent than combining summaries from the chunking method but may still lose some context.
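The two-pass summaries-of-summaries scheme is a short wrapper around the same pieces; `summarize` remains a hypothetical stub for the model call.

```python
def summarize(text: str) -> str:
    # Placeholder for an LLM call; returns the first 80 characters here.
    return text[:80]

def summary_of_summaries(chunks: list[str]) -> str:
    # First pass: one independent summary per chunk.
    partials = [summarize(c) for c in chunks]
    # Second pass: summarize the concatenated partial summaries.
    return summarize("\n".join(partials))
```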
Hierarchical summarization: Divide the text into sections or chapters, summarize each section independently, and then summarize the section summaries. This method can be particularly useful for well-structured documents like books or reports, as it leverages the inherent organization of the content.
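For hierarchical summarization the only new piece is splitting on the document's own structure. The heading pattern below assumes markdown-style sections and would need adapting to your articles; `summarize` is again a stub.

```python
import re

def summarize(text: str) -> str:
    # Placeholder for an LLM call.
    return text[:100]

def split_sections(doc: str) -> list[str]:
    # Assumed markdown-style headings; adjust the regex to your documents.
    parts = re.split(r"(?m)^#+ ", doc)
    return [p for p in parts if p.strip()]

def hierarchical_summary(doc: str) -> str:
    # Summarize each section, then summarize the section summaries.
    section_summaries = [summarize(s) for s in split_sections(doc)]
    return summarize("\n".join(section_summaries))
```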
Sliding window: Use a sliding window approach to create overlapping chunks, ensuring that the model has enough context from the previous and next parts of the text. Summarize each window, and then merge or select the most relevant parts from the overlapping summaries. This method can help maintain context and coherence but may require more complex post-processing to merge the summaries effectively.
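A sliding-window split can be sketched as follows; the merge step here is just concatenation, whereas a real pipeline would de-duplicate the overlapping content. `summarize` is a placeholder for the model call.

```python
def summarize(text: str) -> str:
    # Placeholder for an LLM call.
    return text[:60]

def sliding_window_summary(text: str, size: int = 500, overlap: int = 100) -> str:
    words = text.split()
    # Step forward by (size - overlap) so consecutive windows share context.
    step = size - overlap
    windows = [" ".join(words[i:i + size])
               for i in range(0, max(len(words) - overlap, 1), step)]
    # Naive merge: a real pipeline would reconcile the overlapping summaries.
    return "\n".join(summarize(w) for w in windows)
```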
Thanks, Paul, for the pointers; really appreciate it.
Btw, I always wonder how OpenAI was able to achieve the same feat on wiki articles, books, etc. Maybe it's their trade secret.