Leveraging LLMs with Vast Mechanic Datasets and Guides

Denial · August 31, 2023, 9:52am

Hey OpenAI Community! I’ve come across an intriguing challenge and could use some guidance.

Imagine you have:

A comprehensive 800-page PDF guide detailing every aspect of repairing, tweaking, and enhancing a specific Mercedes model. This isn’t just your regular guide; it delves deep into the nuances of the art.
An expansive dataset containing product specs: torque, speed, HP, weight, and spare parts information.

Here’s the scenario: Using “mechanic” as an example, I want to leverage these resources to enable an LLM to give me detailed suggestions. For instance, if I ask, “I want to increase the HP of my 190e from 1996. What modern parts would be suitable?” I’d want precise recommendations based on the dataset and insights from the guide.

Both resources far exceed standard token limits, so integrating them directly into the LLM seems impractical. Has anyone here tackled a similar challenge? How can I best utilize these datasets with LLMs to extract actionable insights?

EricGT · August 31, 2023, 9:55am

I’m thinking RAG (Retrieval-Augmented Generation).

EDIT

For more details see

Denial · August 31, 2023, 11:54am

Thank you!
After scratching the surface of RAG, it seems like a good way forward. But I feel a token limit looming when the retriever needs to look through all the data. If so, could there be a solution to this?

merefield · August 31, 2023, 12:06pm

Yeah, that’s too big to feed to the LLM in one big prompt!

The architecture to prevent issues with that is roughly along these lines:

upload the pdf(s) into your store
using a suitable library, read the pdf(s), chunking it into suitable sizes
for each chunk, generate an indexed row in a table that records the source pdf and a deep link to the chunk. better yet, store the entire chunk of text locally in the table.
retrieve and store embeddings for each of those indexed chunks
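The chunking and indexing steps above can be sketched roughly as follows. This is a minimal illustration, not the code from any particular library: `ChunkRow`, `chunk_text`, and `build_index` are hypothetical names, the chunk size and overlap are arbitrary, and the text is assumed to have already been extracted from the pdf. The embedding step is left out because it depends on which model you pick.

```python
from dataclasses import dataclass


@dataclass
class ChunkRow:
    chunk_id: int
    source: str  # which pdf the chunk came from
    start: int   # character offset, usable as a crude deep link
    text: str    # the full chunk text, stored locally in the table


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[tuple[int, str]]:
    """Split text into fixed-size character windows that overlap slightly."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece.strip():
            chunks.append((start, piece))
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def build_index(source: str, text: str, size: int = 200, overlap: int = 50) -> list[ChunkRow]:
    """Create one indexed row per chunk."""
    return [
        ChunkRow(chunk_id=i, source=source, start=start, text=piece)
        for i, (start, piece) in enumerate(chunk_text(text, size, overlap))
    ]
```

The overlap is there so that a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks. Splitting on headings or paragraphs instead of raw character counts would probably work better for a structured repair guide.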

some moments later:

user enters query
retrieve an embedding for the query
using vector search, find the best-matching rows, which will point to the best chunks of text
return those text chunks to the user and/or deep links to pdfs.
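The query-time steps above might look like the sketch below. Note the big assumption: `embed` here is a toy word-count vector standing in for a real embedding model, so it only matches on shared words. The sample chunks are made-up placeholder text, and `embed`, `cosine`, and `search` are hypothetical names. Only the shape of the flow (embed the query, score every stored row, return the top few) is the point.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, index: list[tuple[str, Counter]], k: int = 3) -> list[tuple[float, str]]:
    """Return the k stored chunks most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up sample chunks; embeddings are computed once and stored with the text.
chunks = [
    "Torque the wheel bolts in a star pattern after lowering the car.",
    "The cooling system should be flushed and refilled with fresh coolant.",
    "Engine power can be raised by improving intake and exhaust flow.",
]
index = [(c, embed(c)) for c in chunks]

best = search("how do I increase engine power", index, k=1)
```

The key property is that only the query is embedded at request time. The chunk embeddings were computed once up front, so the retriever never has to push the whole guide through the LLM.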

A more advanced solution is to build a chatbot to wrap all that in natural language.

you can take the top 3-5 best results, and inject the source text into the bot’s prompt, allowing it to respond to the user with the best source information.
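Injecting the top results into the prompt could be done along these lines. It is only a sketch: `build_prompt` is a hypothetical helper, the instruction wording is an example, and the character budget is a crude assumed stand-in for a proper token count.

```python
def build_prompt(question: str, chunks: list[str], max_chars: int = 2000) -> str:
    """Assemble a prompt from the best chunks, staying under a size budget."""
    parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[Source {i}]\n{chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # stop before the context grows past the budget
        parts.append(entry)
        used += len(entry)
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Numbering the sources makes it easier to ask the bot to cite which chunk it relied on, which in turn lets you map the answer back to a deep link.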

This is how chatbot agents work with source material.

It completely avoids having to feed the entire source to the LLM at the time of the query.

I’ve recently built one that works with Discourse Posts as source material, but PDFs really differ only in the interface: GitHub - merefield/discourse-chatbot: A cloud chatbot adaptor for Discourse, currently supporting OpenAI. As a matter of fact, I’ve been thinking about adding a PDF browser to it that would allow it to access, index and retrieve all the PDFs uploaded to the site. PR welcome!

Denial · August 31, 2023, 1:03pm

Nice!
I’ll have to look into your chatbot. Using a chatbot to “bake” the final response seems like the right way to go.
What do you think about using 2 RAGs?

Q: How do I change tires on my specific car?

One query is sent to the database containing “the mechanic knowledge” on changing tires.
One query is sent to the database containing “product knowledge” on tires, rims, bolt sizes etc.

The retrieved info is sent to the chatbot. The chatbot basically gets the prompt “based on the information above, how does the user change his tires?”

A: Jack your car up. Make sure to have your car on a stable surface. Your tires are mounted with 1" bolts, so use a lug wrench that fits. etc…

So in short, is it a good or bad idea to have two separate databases?
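To make the two-database idea concrete, here is a rough sketch. Everything in it is assumed for illustration: the two stores are tiny in-memory dicts with made-up text, `keyword_search` is a simple word-overlap lookup standing in for a real retriever, and `build_context` is a hypothetical name.

```python
import re


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(store: dict[str, str], query: str, k: int = 2) -> list[str]:
    """Return up to k entries that share the most words with the query."""
    words = tokens(query)
    scored = []
    for key, text in store.items():
        overlap = len(words & tokens(text))
        if overlap:
            scored.append((overlap, key))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [store[key] for _, key in scored[:k]]


# Made-up placeholder content for the two separate stores.
GUIDE = {
    "tires": "To change tires, park on a stable surface and jack the car up.",
    "oil": "Drain the oil with the engine warm.",
}
PRODUCTS = {
    "bolts": "Wheel bolts for tires: size and thread listed per rim.",
    "filter": "Oil filter part numbers.",
}


def build_context(question: str) -> str:
    """Query both stores and label each group of results for the chatbot."""
    guide_hits = keyword_search(GUIDE, question)
    product_hits = keyword_search(PRODUCTS, question)
    return (
        "Repair guidance:\n" + "\n".join(guide_hits)
        + "\n\nProduct data:\n" + "\n".join(product_hits)
        + "\n\nBased on the information above, answer: " + question
    )


context = build_context("How do I change tires on my car?")
```

Labelling the two result groups separately tells the chatbot which text is procedure and which is spec data.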

merefield · August 31, 2023, 1:11pm

that’s an interesting scenario!

my chatbot uses the latest functions capability of the OpenAI LLM API to request functions be run with specific arguments (locally).

The agent iterates “internal thoughts” to work through the problem and then once it has all the data and answers from the functions, responds to user.

This is a standard approach you will see in such problem solving. Mine is bespoke, but you can see standard ones in the Langchain library.

Yes, that’s more or less exactly what the last prompt to the LLM does.

The source location of the information is irrelevant; you might simply want to package these as different tools or functions and expose those interfaces to the LLM so it knows about them.

I have several interfaces to external APIs that I do not even maintain!

Yes, you could have two interfaces: one that retrieves practical guidance, and another that retrieves the products the person might use to achieve those goals.
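As a rough sketch of what exposing two such interfaces could look like: each tool gets a name, a description, and a JSON Schema for its arguments, and a local dispatcher runs whichever function the model asks for. This is not the code from the chatbot mentioned above. The function names, the placeholder return values, and `dispatch` are all hypothetical, and the call to the LLM itself is left out.

```python
import json


def get_repair_guidance(topic: str) -> str:
    # Placeholder: a real version would query the guide's vector store.
    return f"(guide excerpts about {topic})"


def get_product_specs(part: str) -> str:
    # Placeholder: a real version would query the product dataset.
    return f"(dataset rows for {part})"


def make_schema(name: str, description: str, arg: str) -> dict:
    """Describe one single-argument function in JSON Schema form."""
    return {
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": {arg: {"type": "string"}},
            "required": [arg],
        },
    }


TOOLS = {
    "get_repair_guidance": (
        get_repair_guidance,
        make_schema("get_repair_guidance", "Look up how-to steps in the repair guide.", "topic"),
    ),
    "get_product_specs": (
        get_product_specs,
        make_schema("get_product_specs", "Look up specs for a part in the product dataset.", "part"),
    ),
}


def dispatch(name: str, arguments_json: str) -> str:
    """Run the function the model requested; arguments arrive as a JSON string."""
    if name not in TOOLS:
        return f"error: unknown function {name!r}"
    handler, _schema = TOOLS[name]
    try:
        return handler(**json.loads(arguments_json))
    except (json.JSONDecodeError, TypeError) as exc:
        return f"error: bad arguments for {name}: {exc}"
```

Returning errors as plain strings rather than raising lets the agent loop feed the error back to the model so it can retry with corrected arguments.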

Balthazar · August 31, 2023, 1:38pm

There are ChatGPT plugins that will do RAG from a pdf. I’d start doing a feasibility check with that:

find a diverse handful of questions you’d imagine a user might have
copy relevant info from your sources and paste into one shorter pdf
load the pdf in a ChatGPT pdf plugin
experiment with prompting and your questions to see if they get answered correctly
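If you later move from a plugin to your own pipeline, the same feasibility check can be scripted. The sketch below assumes a hypothetical `answer_fn` that wraps whatever system you are testing, and it scores answers by checking for expected keywords, which is a crude assumption. Reading the answers yourself is still the better judge.

```python
from typing import Callable


def run_checks(
    answer_fn: Callable[[str], str],
    cases: list[tuple[str, list[str]]],
) -> list[tuple[str, bool]]:
    """Ask each question and check that every expected term appears in the answer."""
    results = []
    for question, expected_terms in cases:
        answer = answer_fn(question).lower()
        ok = all(term.lower() in answer for term in expected_terms)
        results.append((question, ok))
    return results


def pass_rate(results: list[tuple[str, bool]]) -> float:
    """Fraction of questions whose answer contained all expected terms."""
    return sum(ok for _, ok in results) / len(results) if results else 0.0
```

Keeping the question list diverse matters more than making it long: a handful of questions that each touch a different part of the guide will show quickly whether chunking and retrieval are working.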

Caveats: with a plugin you might not be able to tweak the RAG as much as needed, regarding chunking of the doc, how many chunks to add etc, but it might still give you a good sense of feasibility. Next step could be LlamaIndex or LangChain.

As merefield points out, you can also start with just the search/retrieval problem. Depending on the use case, that might already provide value, and you can add the chat later. Might need quite some tweaking for it to work nicely, and you’ll probably still want to have the source links in order to verify the answer I guess?
