First-timer here; I hope I didn’t post in the wrong place.
I’ve been reading several tutorials on how to build a chatbot that uses user-provided files as context. Many of them suggest chunking and embedding the files, using a vector database to find the chunks most relevant to the user’s query, and then passing those chunks to GPT along with the query so GPT can answer using the relevant information.
However, no single chunk can answer a “global” question about the entire file, such as “What is this file about?” or “Give me an outline of this file.”
I tried putting the entire file into the very first “system” message, but that has a few drawbacks:
- Token cost.
- When the file(s) get long, they exceed the max_token limit in a single request.
- As the conversation gets longer, I notice GPT tends to “forget” the file contents in the first system message.
I don’t see how a command like “give me an outline of this file” could work at all: once the file is chunked into pieces of knowledge, the file as a whole no longer exists.
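For reference, here is roughly the retrieval step those tutorials describe, as a toy sketch. The `embed` function below is a stand-in (a bag-of-words count over a tiny fixed vocabulary); a real system would call an embedding API and a real vector database instead:

```python
import math

VOCAB = ("banana", "bananas", "countries", "india", "china", "cell")

def embed(text):
    # Stand-in for a real embedding model: counts words from a tiny
    # fixed vocabulary. A real system would call an embedding API.
    words = [w.strip(".,?!").lower() for w in text.split()]
    return [float(words.count(v)) for v in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(query, chunks, k=2):
    # Rank stored chunks by cosine similarity to the query embedding;
    # this is what the vector database does at scale.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Bananas are grown in 135 countries.",
    "India and China are the largest banana producers.",
    "The mitochondria is the powerhouse of the cell.",
]
best = top_chunks("Which countries produce bananas?", chunks, k=2)
```

This works well for local factual questions, which is exactly why it fails on global ones: no stored chunk is similar to “give me an outline of this file.”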
You can widen the range of answerable questions by adding more kinds of information to the knowledge database. Entries like “The list of files used to train the AI: list” or “Knowledgebase article title: xxx, Article summary: yyy” can themselves be instances of knowledge.
Then a question like “do you have any papers that discuss mouse behavior” may produce embeddings that match those entries and return some file summaries.
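Generating such records could look like the sketch below. The field names and the second file are made up for illustration; the point is just that file-level metadata becomes ordinary embeddable strings:

```python
def metadata_records(files):
    # Turn file-level metadata into extra "knowledge" strings that get
    # embedded and stored alongside the regular content chunks.
    records = ["The list of files used to train the AI: "
               + ", ".join(f["title"] for f in files)]
    for f in files:
        records.append(f"Knowledgebase article title: {f['title']}, "
                       f"Article summary: {f['summary']}")
    return records

files = [
    {"title": "Bananas: A deep look",
     "summary": "All about banana cultivars, speciation, growing, harvesting"},
    {"title": "Mouse behavior survey",  # hypothetical second file
     "summary": "Laboratory studies of mouse exploration and nesting"},
]
records = metadata_records(files)
```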
Consider another way a chunk returned to the AI could let it answer questions:
Data source
Title: Bananas: A deep look
Summary: All about banana cultivars, speciation, growing, harvesting
Download source: mycompany.com/papers/banana.pdf
Page: 6
Musa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in New Guinea. They are grown in 135 countries, primarily for their fruit, and to a lesser extent to make fiber, banana wine, and banana beer, and are sometimes even grown as ornamental plants. The world’s largest producers of bananas in 2017 were India and China, which together accounted for approximately 38% of total production. As of 2023, India was producing nearly 30.5 million tons of bananas each year, a little less than 20 million tons more than China.
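Assembling a chunk in that shape from stored metadata might look like this (a sketch; the dict keys simply mirror the example above):

```python
def format_chunk(meta, text):
    # Prepend document-level metadata to the passage so the model can
    # answer questions about the source, not just about the text itself.
    header = "\n".join(f"{k}: {v}" for k, v in meta.items())
    return f"Data source\n{header}\n\n{text}"

chunk = format_chunk(
    {"Title": "Bananas: A deep look",
     "Summary": "All about banana cultivars, speciation, growing, harvesting",
     "Download source": "mycompany.com/papers/banana.pdf",
     "Page": 6},
    "Musa species are native to tropical Indomalaya and Australia...",
)
```

With chunks formatted this way, the model can cite the title, page, and download link alongside the passage content.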
If the file has an abstract or summary at the top, you could simply use that.
Otherwise, if you want to do this cheaply: embed the entire document in 8k-token chunks, cluster the embedding vectors, and have a summary produced for each cluster. Then concatenate all the cluster summaries into the overall summary, or run them through a further round of summarization.
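The clustering step can be as plain as k-means over the chunk embeddings. A minimal stdlib sketch, using toy 2-D vectors in place of real embeddings (in practice you would use scikit-learn’s KMeans and then summarize the chunks in each cluster with the LLM):

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cluster_chunks(embeddings, k, iters=10, seed=0):
    # Plain k-means: assign each chunk embedding to its nearest center,
    # then move each center to the mean of its members, and repeat.
    rng = random.Random(seed)
    centers = rng.sample(embeddings, k)
    assign = [0] * len(embeddings)
    for _ in range(iters):
        for i, v in enumerate(embeddings):
            assign[i] = min(range(k), key=lambda c: dist2(v, centers[c]))
        for c in range(k):
            members = [embeddings[i] for i, a in enumerate(assign) if a == c]
            if members:
                centers[c] = mean(members)
    return assign

# Toy 2-D "embeddings": two obvious topical groups.
embs = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels = cluster_chunks(embs, k=2)
```

Each cluster then gets one summarization call, so the cost scales with the number of topics rather than the number of chunks.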
The more expensive but simpler version is to feed the document to GPT-4 32k (or Turbo 16k if it fits) and have it summarized there. You only need to do this once per article; if a user asks the same question again, use a classifier to detect that and return the cached version.
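The once-per-article caching could be sketched like this. The keyword-based `is_summary_request` is a stand-in for a real classifier, and `fake_summarize` is a placeholder for the one expensive long-context model call:

```python
summary_cache = {}

def is_summary_request(query):
    # Stand-in for a real classifier: keyword match on summary-style
    # questions. In practice this could be a cheap model call instead.
    q = query.lower()
    return any(kw in q for kw in
               ("summary", "summarize", "outline", "what is this file about"))

def get_summary(doc_id, summarize):
    # Run the expensive long-context summarization once per document,
    # then serve repeat requests from the cache.
    if doc_id not in summary_cache:
        summary_cache[doc_id] = summarize(doc_id)
    return summary_cache[doc_id]

calls = []

def fake_summarize(doc_id):
    # Placeholder for the single long-context summarization call.
    calls.append(doc_id)
    return f"Summary of {doc_id}"

first = get_summary("banana.pdf", fake_summarize)
second = get_summary("banana.pdf", fake_summarize)
```

The second request returns the cached text without touching the model, which is what keeps this approach affordable despite the large one-time call.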