Pretty much how I deal with this is to build a token chunker.
The idea is that you set a size limit per chunk, so that if data comes back over your limit it gets split into manageable chunks. One important thing is to make sure each chunk is separated at a valid point, not in the middle of a message.
You can then pass each chunk, together with the user query, to be summarized against the query. You process all the chunks this way.
At the end you combine the summaries, then check the size again. If it is still too big, you repeat until it gets down to a manageable size.
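The loop above can be sketched roughly like this. The tokenizer, the size limit, and the `summarize` stand-in are all assumptions for illustration; a real system would use the model's tokenizer and an LLM call that summarizes the chunk against the query.

```python
MAX_TOKENS = 50  # assumed size limit per chunk


def tokenize(text):
    # Naive whitespace tokenizer; swap in your model's real tokenizer.
    return text.split()


def chunk_text(text, max_tokens=MAX_TOKENS):
    """Split text into chunks, breaking only at sentence boundaries
    so a chunk never stops in the middle of a message."""
    chunks, current, count = [], [], 0
    for sentence in text.replace("\n", " ").split(". "):
        n = len(tokenize(sentence))
        if current and count + n > max_tokens:
            chunks.append(". ".join(current) + ".")
            current, count = [], 0
        current.append(sentence.rstrip("."))
        count += n
    if current:
        chunks.append(". ".join(current) + ".")
    return chunks


def summarize(chunk, query):
    # Placeholder: a real system would call an LLM with the query and
    # the chunk. Here we crudely keep the chunk's first sentence.
    return chunk.split(". ")[0]


def recursive_summary(text, query, max_tokens=MAX_TOKENS):
    """Chunk -> summarize each chunk against the query -> combine,
    repeating until the combined result fits the size limit."""
    while len(tokenize(text)) > max_tokens:
        summaries = [summarize(c, query) for c in chunk_text(text, max_tokens)]
        combined = " ".join(summaries)
        if len(tokenize(combined)) >= len(tokenize(text)):
            break  # not shrinking any further; stop
        text = combined
    return text
```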
This works really well for contextual data. Do not use this method on embeddings: embeddings are predefined patterns, and altering them will cause your algorithms to calculate incorrectly. It's best to use your embeddings only to narrow the data down to the relevant information, then pull the context and run the chunker against your size limit.
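The ordering described here can be sketched as follows: the stored embeddings are used only to *select* the relevant passages, never to rewrite them, and the size-based chunking then runs on the selected original text. The vectors below are toy values; a real system would use a model's embeddings and a vector store.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def narrow_with_embeddings(query_vec, store, top_k=2):
    """store: list of (embedding, original_text) pairs.
    Returns the original text of the top_k most similar entries,
    untouched, ready to be chunked and summarized afterwards."""
    ranked = sorted(store, key=lambda e: cosine(query_vec, e[0]), reverse=True)
    return [text for _, text in ranked[:top_k]]
```

The selected texts would then be fed to the chunk/summarize loop, so the embeddings themselves are never modified.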
If you don't mind going a step further, how do you deal with non-standard Q&A questions? From "Does my company support XYZ?" to:
1. Summarise the whole document
2. Top 5 takeaways
3. Please give me a list of all addresses listed in the document
4. How many times does the phrase "XXX ZZZ" appear in the document?
How do you know which to use? Should you run a similarity query (I use Azure Search) and get the top 5 chunks, vs.:
Questions 1 and 2: the summarised version of the document should be the RAG source.
Question 4: open the original document and do a text search.
Question 3: what would be the solution for this use case? A summarised version will not work, since some of the addresses may be omitted; and can you even do a text search, when a simple text search does not know what an address is?
What do you mean non-standard, haha. A question/query is a query; all questions are standard.
You have to build AI logic stacks to handle the various types of questions. Data input is the key, though: design your data storage with metadata, generate your embeddings at input time, and store context for each chunk, since you are working with documents. You could, for instance, build chapter summaries to use in your stack for manuals and the like, with intent logic on incoming messages to determine which logic path to take based on your understanding of what the user is asking of the data.
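A minimal sketch of such an intent-routing layer, assuming a keyword-based classifier as a cheap stand-in for an LLM intent call. The route names and keyword rules here are entirely hypothetical.

```python
# Hypothetical mapping from query keywords to logic paths.
ROUTES = {
    "summarise": "summary_rag",      # answer from stored summaries
    "takeaway": "summary_rag",
    "how many": "full_text_search",  # exact counting needs the original doc
    "list of": "full_text_search",
    "appear": "full_text_search",
}


def route_query(query):
    """Pick a logic path for the query; fall back to similarity search
    over embeddings when no intent rule matches."""
    q = query.lower()
    for keyword, path in ROUTES.items():
        if keyword in q:
            return path
    return "similarity_search"
```

In practice the keyword table would be replaced by an LLM or classifier that infers intent, but the shape of the stack is the same: classify first, then dispatch to the logic path built for that question type.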
Hope that helps. There's no simple answer without a lot of code.
PS: I highly recommend taking some online courses for Neo4j; I can't say enough about it. They now offer their own AI server stack to help get people started. Although it's a very simple design they offer to learn from, it would get you started on what you want.
Myself, I do not use any of their stuff, but I did download it to look through all the code and see how they did it. That is another way to gain valuable insights: reading open-source code and talking it over with AIs.
It helps that I have a broad background in IT, automation, electrical/mechanical (sensors etc.), generative AI, databases, and so on, so I can play full-stack dev and take it all the way into robotics when that day comes. Being a one-man team with AI as my only work partner is a lot of work, though. I put in 8 hours a day programming at my day job and 9 hours a night learning and building with AI, so this is a life project for me. I have the focus and drive, and money is not a limiter for development; it's like a game for me, haha.