Guidance on summarizing questions from a collection of PDFs

Hi everyone,

I am looking for some guidance on how to approach my current project.

I have a collection of PDFs which contain protocols from oral exams from my university class. I want to use gpt-4 to ask questions about these protocols, such as:

1. Which topics were asked about?
2. Extract all the questions that were asked and group them into categories.


Sometimes this works just fine. But with other data, it tells me that it wasn’t provided with any protocols, or that it can’t help me with that.
I don’t fully understand where this is coming from.

Here is my approach:

  • I concatenate all protocols into a single string, separated by a line saying “Protocol Number {num}”
  • I use langchain Chroma and the OpenAI Embeddings to create an embedding database from the string
  • I use langchain’s ConversationalRetrievalChain.from_llm + ChatOpenAI with gpt-4 with my embeddings to ask questions
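For reference, the concatenation step above can be sketched like this (a minimal sketch — `concatenate_protocols` and `protocol_texts` are hypothetical names, assuming the protocols are already extracted from the PDFs as plain text):

```python
def concatenate_protocols(protocol_texts):
    # Build one long string, with a "Protocol Number {num}" line
    # in front of each protocol, as described above.
    parts = []
    for num, text in enumerate(protocol_texts, start=1):
        parts.append(f"Protocol Number {num}\n{text.strip()}")
    return "\n\n".join(parts)

corpus = concatenate_protocols([
    "Q: Define a vector space.",
    "Q: State the rank-nullity theorem.",
])
```

The separator line matters: as long as it is predictable, the corpus can later be split back into individual protocols.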

I would really appreciate some pointers here … I would also be willing to change my whole approach if this isn’t the optimal one.


Welcome to the community!

Sounds like you’re using the API? Do you have access to the Advanced Data Analysis plugin?

The approach itself doesn’t look bad, but the way you’re amalgamating the data and feeding it to GPT may be the problem here. You might be missing a “parsing” layer.

If it were up to me, I would store the protocol strings in some kind of dictionary data structure or database. Lots of people have their own methods for this. I’m a tinkerer so I just make h5 files, but it’s more or less up to you.

Remember, these models have context-window limits. With something like Advanced Data Analysis you can upload an entire database as a file, but the model itself cannot handle an extremely long string in a single prompt.
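As a rough sanity check before sending a prompt, you can estimate whether a string will fit. This is a heuristic sketch, not a real tokenizer — the ~4 characters per token figure and the function names are assumptions (for exact counts, use a proper tokenizer library):

```python
def rough_token_count(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def fits_in_context(text: str, context_limit: int = 8192,
                    reserve_for_answer: int = 1024) -> bool:
    # Leave headroom for the model's reply.
    return rough_token_count(text) <= context_limit - reserve_for_answer
```

If the concatenated corpus fails this check, that is a strong hint that the string needs to be split before it ever reaches the model.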

Also, are you embedding the entire string as a single embedding? I’d recommend parsing the data so that each protocol gets its own vector embedding, and using those for context retrieval. If that’s what you intend already, make sure your code is actually set up to do it.
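A minimal sketch of that parsing layer, assuming the separators follow the “Protocol Number {num}” scheme from the original post (the embedding call itself is left out; the LangChain usage in the trailing comment is only a suggestion):

```python
import re

def split_protocols(corpus: str):
    """Split the concatenated corpus back into per-protocol chunks."""
    # re.split with a capturing group yields:
    # [prefix, num1, text1, num2, text2, ...]
    pieces = re.split(r"Protocol Number (\d+)\n", corpus)
    protocols = []
    for num, text in zip(pieces[1::2], pieces[2::2]):
        protocols.append({"protocol": int(num), "text": text.strip()})
    return protocols

# Each dict can then become its own document/embedding, e.g. with
# Chroma.from_texts([p["text"] for p in protocols],
#                   embedding,
#                   metadatas=[{"protocol": p["protocol"]} for p in protocols])
```

Keeping the protocol number as metadata also lets the retriever tell you *which* protocol a retrieved question came from.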

You’re very much on the right track, but from what you’ve shown us here, it seems you’re missing an extra parsing layer for feeding the information to GPT. It CAN summarize large sets of data, so your goal is very achievable; you just can’t feed it the elephant all in one prompt. Do it iteratively, or provide a database so GPT can summarize each string iteratively itself.
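The iterative route can be sketched as a map-reduce loop; `summarize_with_gpt` here is a hypothetical stand-in for your actual GPT-4 call, and the prompt wording is just an illustration:

```python
def summarize_iteratively(protocol_texts, summarize_with_gpt):
    # Map step: handle each protocol on its own, well within the context limit.
    partial_summaries = [
        summarize_with_gpt(f"Extract the exam questions from:\n{text}")
        for text in protocol_texts
    ]
    # Reduce step: combine the (much shorter) partial results into one answer.
    combined = "\n".join(partial_summaries)
    return summarize_with_gpt(
        f"Group these questions into categories:\n{combined}"
    )
```

Because each call only ever sees one protocol (or the short combined summaries), no single prompt has to swallow the whole corpus.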