Using A Fine-Tuned Model To Query A PDF / Database

Hi,

I've done a lot of research on when you should fine-tune and when you should use embeddings for accurate Q&A with a database, etc.

My question is about my use case: what I'm trying to do is have the LLM query the document I upload to the database and answer questions in our “brand voice”. I.e., I have found ways to do Q&A with a PDF, and I have also successfully fine-tuned a model, but for some reason whenever I try to combine the two features it fails.

Is it not possible to use a gpt-3.5-turbo model that I've fine-tuned on my data to go through the PDF we upload and have it reply in the patterns it was fine-tuned on?

The reason we need this is that every PDF a user uploads will contain different content, so the answers will never be the same, but the questions will be similar.

The closest thing I found was building a RAG pipeline, but when I try to swap the LLM from the pre-trained base model to my fine-tuned one, it always fails and doesn't generate the pattern we want. It either ignores the content of the PDF, or it answers from the PDF but without the brand voice. Hope this makes sense.
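
Conceptually, what I'm after is something like this (a rough sketch of the idea, not working code; the retrieval step is a hypothetical stand-in and the model ID is a placeholder):

import openai

# Sketch of the goal: retrieved PDF text supplies the facts, and the
# fine-tuned model supplies the brand voice.
def answer_in_brand_voice(question, pdf_chunks):
    # pdf_chunks would come from some retrieval step over the uploaded
    # PDF (hypothetical here; this is the part I can't get working)
    context = "\n\n".join(pdf_chunks)
    response = openai.ChatCompletion.create(
        model="ft:gpt-3.5-turbo-0613:my-org::abc123",  # placeholder fine-tuned ID
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response["choices"][0]["message"]["content"]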

Thanks

Hi and welcome to the Developer Forum!

This very much depends on what your fine-tuning dataset looks like: how many examples there are to train on, the quality of the data, whether the examples are consistent, etc.

You go on to say that you combine both features. How? What embedding database are you using? How are you splitting your PDFs into chunks? Are you using chunk overlap, and if so, how much? How big are the chunks? Are you creating chunk boundaries at the paragraph, page, sentence, or word level?
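
To illustrate what I mean by overlap: a simple character-based splitter looks something like this (a minimal sketch; the sizes are illustrative, not recommendations):

# Minimal sketch of character-based chunking with overlap; the sizes
# are illustrative only.
def chunk_text(text, chunk_size=1000, overlap=200):
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Advance by less than the chunk size so consecutive chunks
        # share `overlap` characters of context.
        start += chunk_size - overlap
    return chunks

Overlap matters because a passage that straddles a chunk boundary can otherwise be split in half and retrieved poorly.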

You mention RAG and then mention changing the LLM? That does not sound like an OpenAI product; are you using some kind of framework?

The more information you can provide, the better. Also, if you are using non-OpenAI products, consider checking with the creators of those products for solutions as well.


Sorry, I am very new to this and have no engineering background, but I've managed to get this far using online resources.

Previously I had created a PDF reader using PyPDF2 and my fine-tuned model to do Q&A with the bot:

import openai
import os
from PyPDF2 import PdfReader

# Function to read a PDF and convert it to a string, truncating if necessary
# (note: max_tokens actually counts characters here, not tokens)
def read_pdf(filename, max_tokens=2000):
    with open(filename, 'rb') as file:
        reader = PdfReader(file)
        text = ''
        for i in range(len(reader.pages)):
            text += reader.pages[i].extract_text()
        # Truncate or otherwise reduce the text
        truncated_text = text[:max_tokens]
        return truncated_text

# Read the 30-page document
pdf_text = read_pdf('enter file name')

# Set your OpenAI API key
openai.api_key = os.environ['api-key']

# Your fine-tuned model ID
model_id = "enter fine tuned model id"

# Initialize the conversation with the PDF text as context
conversation = [
    {"role": "system", "content": "enter system prompt"},
    {"role": "user", "content": pdf_text}
]

# Begin the conversation loop
while True:
    user_input = input("You: ")
    if user_input.lower() == 'exit':
        break

    # Add user message to conversation
    conversation.append({"role": "user", "content": user_input})

    print()  # Add a line break after user input

    # Using the ChatCompletion API
    try:
        response = openai.ChatCompletion.create(
            model=model_id,
            messages=conversation
        )
        # Extract and print the assistant's reply
        assistant_reply = response['choices'][0]['message']['content'].strip()
        print("Assistant:", assistant_reply)

        # Add assistant's reply to conversation
        conversation.append({"role": "assistant", "content": assistant_reply})

    except openai.error.OpenAIError as e:
        print(f"An error occurred: {e}")

    print()  # Add a line break after each turn

    print("_" * 40)  # Add a line of underscores to separate turns visually
    print()  # Add a line break after each turn

But it was giving weird, inconsistent responses and not following the fine-tuned training as I hoped. I had given it about 39 examples and trained for 16 epochs. The examples are pretty consistent in format and the quality is good, but I'm not sure how that's actually judged. I just know they're consistent in “brand voice”.

So then I did research on RAG and found people were using LangChain with Weaviate/Chroma, and OpenAI as the LLM. I cloned their code from GitHub and it was reading the PDF accurately, but I could not get it to run with my OpenAI fine-tuned model ID (it was using the base OpenAI API models).
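
For reference, this is roughly the setup from the tutorials I followed, with the model swapped to a fine-tuned ID (a minimal sketch assuming the 2023-era LangChain API; the file name, model ID, and question are placeholders):

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Load and chunk the PDF
docs = PyPDFLoader("my_document.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)

# Embed the chunks into a local Chroma store
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())

# Point the chain at the fine-tuned model instead of the base model
llm = ChatOpenAI(model_name="ft:gpt-3.5-turbo-0613:my-org::abc123", temperature=0)

qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())
print(qa.run("enter question"))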

I also found platforms that help build these structures with no code, but when I do the Q&A there, it ignores my fine-tuned training format. And I made sure that the PDF I was querying as a test was the same PDF whose input and output formats I had included in the fine-tuning data.

Not sure if this makes sense.
