Sorry, I'm very new to this and have no engineering background, but I've managed to get this far using online resources.
Previously I had created a PDF reader using PyPDF2 and my fine-tuned model to do Q&A with the bot:
```python
import openai
import os
from PyPDF2 import PdfReader

# Function to read a PDF and convert it to a string, truncating if necessary
def read_pdf(filename, max_tokens=2000):
    with open(filename, 'rb') as file:
        reader = PdfReader(file)
        text = ''
        for page in reader.pages:
            text += page.extract_text()
    # Truncate the text (note: this slices characters, not tokens)
    truncated_text = text[:max_tokens]
    return truncated_text

# Read the 30-page document
pdf_text = read_pdf('enter file name')

# Set your OpenAI API key
openai.api_key = os.environ['api-key']

# Your fine-tuned model ID
model_id = "enter fine tuned model id"

# Initialize the conversation
conversation = [
    {"role": "system", "content": "enter system prompt"},
    {"role": "user", "content": pdf_text}
]

# Begin the conversation loop
while True:
    user_input = input("You: ")
    if user_input.lower() == 'exit':
        break

    # Add user message to conversation
    conversation.append({"role": "user", "content": user_input})
    print()  # Add a line break after user input

    # Using the ChatCompletion API
    try:
        response = openai.ChatCompletion.create(
            model=model_id,
            messages=conversation
        )
        # Extract and print the assistant's reply
        assistant_reply = response['choices'][0]['message']['content'].strip()
        print("Assistant:", assistant_reply)
        # Add assistant's reply to conversation
        conversation.append({"role": "assistant", "content": assistant_reply})
    except openai.error.OpenAIError as e:
        print(f"An error occurred: {e}")

    print()  # Add a line break after each turn
    print("_" * 40)  # Add a separator line between turns
    print()
```
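One thing I noticed while re-reading my own code: `text[:max_tokens]` slices by characters, not tokens, so the parameter name is misleading and the model sees far less of the PDF than I intended. A rough sketch of the difference, assuming OpenAI's ~4-characters-per-token rule of thumb (just an approximation, and `truncate_by_chars` is a name I made up):

```python
def truncate_by_chars(text, max_tokens=2000, chars_per_token=4):
    # text[:max_tokens] keeps only max_tokens CHARACTERS (~max_tokens / 4 tokens);
    # to keep roughly max_tokens tokens, slice ~4x as many characters instead.
    return text[:max_tokens * chars_per_token]

sample = "word " * 3000                 # 15000 characters of dummy text
print(len(sample[:2000]))               # 2000 chars kept (~500 tokens)
print(len(truncate_by_chars(sample)))   # 8000 chars kept (~2000 tokens)
```

For an exact count you'd apparently need a tokenizer like tiktoken, but even this rough scaling changes how much of a 30-page document survives.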
but it was giving weird, inconsistent responses, and not following the fine-tuned training as I hoped. I had given it about 39 examples and trained at 16 epochs. The examples are pretty consistent in format and the quality is good, but I'm not sure how that's actually judged; I just know it's consistent in "brand voice".
So then I did research on RAG and found people were using LangChain with Weaviate/Chroma and OpenAI as the LLM. I cloned their code from GitHub and it was reading the PDF accurately, but I could not get it to run using my OpenAI fine-tuned model ID (it was falling back to the base OpenAI API models).
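In case it helps explain what I mean: from what I understand, the RAG code I cloned pastes the retrieved PDF chunks into the prompt itself, which might be why the output stops looking like my fine-tuning examples even when the model ID is right. A bare sketch of what I think that prompt assembly looks like (the function name and prompt wording are my own guesses, not taken from the repos I cloned):

```python
def build_rag_messages(system_prompt, retrieved_chunks, question):
    # The retrieved chunks get concatenated into the user message as "context",
    # so the model sees a very different input shape than my fine-tuning
    # examples did -- the fine-tuned format can get drowned out.
    context = "\n\n".join(retrieved_chunks)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

msgs = build_rag_messages("enter system prompt", ["chunk one", "chunk two"], "What is X?")
print(msgs[1]["content"])  # shows the stuffed-context prompt the model actually receives
```

If that's roughly what's happening, then matching my fine-tuning examples to this context-stuffed input shape (rather than to the raw PDF Q&A) might be the missing piece, but I'm not sure.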
I found platforms that help build these structures with no code, but when I do the Q&A it ignores my fine-tuned training format. And I made sure that the PDF I was querying as a test was already based on the input and output formats from the same PDF I included in the fine-tuning data.
Not sure if this makes sense.