Training with Large PDF Files

Suppose I want to train GPT-3 to be able to chat about a specialized area of knowledge.

And suppose I have a large PDF file discussing the content I would like GPT-3 to have available.

Do I need to break up the PDF into individual paragraphs each under 4000 tokens? Is that what OpenAI did when training ChatGPT?

If I do this do I want to use fine-tuning or embedding for this training data?

Do I need to use Pinecone to convert this PDF file to a vector database?


I am also looking into a similar issue.
Any advice?

Did anyone find an answer to this? I need to do the same and don’t want to reinvent the wheel.
Maybe I’ll ask ChatGPT :slight_smile:


To train GPT-3 on a specific topic using a large PDF file, you would need to convert the PDF file into a format that GPT-3 can understand and then fine-tune the model using that data. Here are the general steps you can follow:

  1. Convert the PDF file into a text format that GPT-3 can understand. You can use tools like Adobe Acrobat, PDFtoText, or PyPDF2 to extract the text from the PDF file. Make sure to clean the text by removing any unnecessary elements like page numbers, headers, and footers.
  2. Split the text into smaller segments that can be used as training examples. For example, you can split the text into paragraphs or sentences.
  3. Format the data into the appropriate format for fine-tuning GPT-3. For GPT-3, each training example should be a single line of text, with no newlines or other formatting.
  4. Fine-tune the GPT-3 model using the formatted data. You can use OpenAI’s API to fine-tune the model, as I explained in my previous answer.
  5. Test the fine-tuned model to see how well it performs on the specialized area of knowledge you want it to chat about. You can generate text using the fine-tuned model and evaluate it manually or with an automated metric like perplexity.
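
The splitting in steps 2–3 can be sketched in a few lines of Python, assuming the PDF-to-text extraction from step 1 (PyPDF2, pdftotext, etc.) has already produced a string. Character count is used here as a rough proxy for tokens (~4 characters per token), and the chunk size and overlap values are illustrative assumptions, not recommendations:

```python
def chunk_text(text, max_chars=2000, overlap=200):
    """Split extracted text into overlapping chunks of at most max_chars.

    Character count is a rough stand-in for tokens (~4 chars/token);
    tune max_chars to stay safely under the model's context limit.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap keeps context across chunk boundaries
    return chunks
```

A token-aware splitter (e.g. one based on the model's actual tokenizer) would be more precise, but the overlapping-window idea is the same.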

It’s important to note that fine-tuning GPT-3 on a specialized area of knowledge requires a significant amount of data and computational resources. You may need to experiment with different amounts of training data and fine-tuning configurations to achieve good results. Additionally, make sure to follow best practices for fine-tuning language models, such as using a validation set to monitor the model’s performance and avoiding overfitting.
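
For step 3 specifically: the GPT-3 fine-tuning endpoint expects a JSONL training file, one JSON object per line with "prompt" and "completion" fields (this is what “a single line of text” refers to). A minimal sketch, using the separator and leading-space conventions from OpenAI’s data-preparation guidance; the example pair you would pass in is of course your own data:

```python
import json

def to_jsonl(pairs):
    """Render (prompt, completion) pairs as fine-tuning JSONL lines."""
    lines = []
    for prompt, completion in pairs:
        record = {
            "prompt": prompt + "\n\n###\n\n",  # fixed separator marks end of prompt
            "completion": " " + completion,     # leading space aids tokenization
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)
```

Each output line is one training example; write the joined string to a `.jsonl` file and upload it via the API.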


You likely want to use embeddings rather than fine-tuning. Look into LangChain or similar projects that provide helper scripts in accomplishing this.
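
A minimal sketch of what the embeddings approach does, assuming each chunk has already been embedded with an embeddings model (e.g. text-embedding-ada-002): embed the user’s question, rank chunks by cosine similarity, and paste the top chunks into the prompt. The tiny 3-dimensional vectors in the test are only there to illustrate the ranking step:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_chunks(query_vec, chunk_vecs, k=1):
    """Indices of the k chunk vectors most similar to the query vector."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

The retrieved chunk text then goes into the prompt as context, so the model answers from your document instead of being retrained on it.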


thank you for the info. I’ll try that
enjoy the weekend


Here is an excellent tutorial which covers all the questions posed: GPT-4 Tutorial: How to Chat With Multiple PDF Files (~1000 pages of Tesla's 10-K Annual Reports) - YouTube

In addition, I have discovered that when you break a document down into multiple chunks, you should have a strategy for maintaining a contextual relationship between the source document and its chunks.
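
One simple strategy for maintaining that contextual relationship is to prefix every chunk with metadata naming its source document and position, so a chunk retrieved in isolation still identifies where it came from. The header format below is an illustrative assumption, not a standard:

```python
def contextualize(chunks, title):
    """Prefix each chunk with its source title and position in the document."""
    return [f"[source: {title} | chunk {i + 1}/{len(chunks)}]\n{chunk}"
            for i, chunk in enumerate(chunks)]
```

Vector stores also support attaching this kind of information as structured metadata instead of inlining it in the text.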


@rkaplan you can use the LangChain library to ingest large PDF data. LangChain has a PDF loader that takes the input PDF, plus text-splitting (chunking) functionality that can split your entire document. You can then compute OpenAI embeddings and store the resulting vectors in a vector database like Chroma. After that, you can ask any question about your data.

https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/pdf.html
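
The load → split → embed → store → query pipeline described above can be mirrored with a tiny in-memory store to see the data flow. This is a self-contained sketch, not LangChain’s or Chroma’s API; `embed` is a stand-in for a real embeddings call:

```python
import math

class TinyVectorStore:
    """In-memory stand-in for a vector database like Chroma."""

    def __init__(self, embed):
        self.embed = embed   # function: str -> list of floats
        self.entries = []    # (vector, text) pairs

    def add(self, texts):
        """Embed and store each chunk of text."""
        for text in texts:
            self.entries.append((self.embed(text), text))

    def query(self, question):
        """Return the stored chunk most similar to the question."""
        qv = self.embed(question)
        def score(entry):
            vec, _ = entry
            dot = sum(a * b for a, b in zip(qv, vec))
            norm = (math.sqrt(sum(a * a for a in qv)) *
                    math.sqrt(sum(b * b for b in vec)))
            return dot / norm if norm else 0.0
        return max(self.entries, key=score)[1]
```

In the real pipeline, `embed` would call an embeddings API, and the question plus the retrieved chunk would be assembled into the chat prompt.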


Are there any guidelines for formatting data in a plain text file uploaded as a training file?

I have seen asterisks used (e.g. around a heading) and ## used to annotate text, but what is the best way to mark up a text document so that the AI understands it?