Training with Large PDF Files

Suppose I want to train GPT-3 to be able to chat about a specialized area of knowledge.

And suppose I have a large PDF file discussing the content I would like GPT-3 to have available.

Do I need to break up the PDF into individual paragraphs each under 4000 tokens? Is that what OpenAI did when training ChatGPT?

If I do this, should I use fine-tuning or embeddings for this training data?

Do I need to use Pinecone to convert this PDF file to a vector database?


I am also looking into a similar issue.
Any advice?

Did anyone find an answer to this? I need to do the same and don’t want to reinvent the wheel.
Maybe I’ll ask ChatGPT 🙂

To train GPT-3 on a specific topic using a large PDF file, you would need to convert the PDF file into a format that GPT-3 can understand and then fine-tune the model using that data. Here are the general steps you can follow:

  1. Convert the PDF file into plain text. You can use tools like Adobe Acrobat, pdftotext, or PyPDF2 to extract the text from the PDF file. Make sure to clean the text by removing any unnecessary elements like page numbers, headers, and footers.
  2. Split the text into smaller segments that can be used as training examples. For example, you can split the text into paragraphs or sentences.
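  3. Format the data into the appropriate format for fine-tuning GPT-3. GPT-3 fine-tuning expects a JSONL file: one JSON object per line, each with a “prompt” field and a “completion” field. Newlines inside the text are escaped within the JSON strings, so each training example still occupies a single line of the file.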
  4. Fine-tune the GPT-3 model using the formatted data. You can use OpenAI’s fine-tuning API to do this.
  5. Test the fine-tuned model to see how well it performs on the specialized area of knowledge you want it to chat about. You can generate text using the fine-tuned model and evaluate it manually or with an automated metric like perplexity.

It’s important to note that fine-tuning GPT-3 on a specialized area of knowledge requires a significant amount of data and computational resources. You may need to experiment with different amounts of training data and fine-tuning configurations to achieve good results. Additionally, make sure to follow best practices for fine-tuning language models, such as using a validation set to monitor the model’s performance and avoiding overfitting.
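
The data-preparation steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive pipeline: it assumes the third-party PyPDF2 package for extraction and the prompt/completion JSONL layout used for GPT-3 fine-tuning. The function names, the prompt template, the " END" stop marker, and the 10% validation fraction are all placeholders chosen for illustration.

```python
import json
import random
import re


def extract_text(pdf_path):
    """Extract text page by page. Requires the third-party PyPDF2 package."""
    from PyPDF2 import PdfReader  # imported lazily so the rest runs without it

    reader = PdfReader(pdf_path)
    return "\n\n".join((page.extract_text() or "") for page in reader.pages)


def split_paragraphs(text, min_chars=40):
    """Split on blank lines and drop short fragments such as page numbers."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        block = " ".join(block.split())  # collapse newlines and runs of spaces
        if len(block) >= min_chars:
            paragraphs.append(block)
    return paragraphs


def to_records(paragraphs, topic):
    """Build prompt/completion pairs. The prompt template is a placeholder."""
    return [
        {"prompt": f"Write a passage about {topic}:\n\n###\n\n",
         "completion": " " + paragraph + " END"}
        for paragraph in paragraphs
    ]


def train_validation_split(records, validation_fraction=0.1, seed=0):
    """Shuffle reproducibly and hold out a validation set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def write_jsonl(records, path):
    """Write one JSON object per line; json.dumps escapes embedded newlines."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A typical call chain would be `split_paragraphs(extract_text("manual.pdf"))`, then `to_records`, `train_validation_split`, and one `write_jsonl` call per split, where `manual.pdf` is a made-up file name. Plain paragraphs rarely make good prompt/completion pairs on their own, so expect to rework the template for your own material.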

You likely want to use embeddings rather than fine-tuning. Look into LangChain or similar projects that provide helpers for accomplishing this.
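
For anyone unsure what the embeddings approach looks like in practice, here is a rough sketch of the idea: turn each chunk and the question into vectors, rank chunks by cosine similarity, and paste the best matches into the prompt. The `toy_embed` function is a hypothetical stand-in (a bag-of-words count) so that the example runs without any API. In a real setup you would swap in vectors from an embedding model, and the prompt wording is only an assumption.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(question, chunks, k=2):
    """Return the k chunks most similar to the question, best first."""
    q = toy_embed(question)
    scored = [(cosine_similarity(q, toy_embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


def build_prompt(question, chunks, k=2):
    """Assemble a prompt that asks the model to answer from retrieved context."""
    context = "\n---\n".join(top_k_chunks(question, chunks, k))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The model itself is never retrained here. Only the prompt changes from question to question, which is why this route is usually cheaper to set up and to update than fine-tuning when the goal is answering questions about a document.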


Thank you for the info. I’ll try that.
Enjoy the weekend.

Here is an excellent tutorial which covers all the questions posed: GPT-4 Tutorial: How to Chat With Multiple PDF Files (~1000 pages of Tesla's 10-K Annual Reports) - YouTube

In addition, I have discovered that when you break a document down into multiple chunks, you should have a strategy for maintaining a contextual relationship between the source document and its chunks.
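
One possible way to keep that relationship, offered as a sketch rather than the specific strategy meant above, is to store metadata next to every chunk (source file, page, position) and to let neighbouring chunks overlap slightly so that sentences cut at a boundary appear in both. The `Chunk` class, the field names, and the default sizes below are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # file name of the original document
    page: int    # 1-based page the chunk came from
    index: int   # position of the chunk within that page
    text: str


def chunk_pages(pages, source, size=500, overlap=50):
    """Split each page into overlapping character windows, tagging each one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    step = size - overlap
    for page_number, page_text in enumerate(pages, start=1):
        for index, start in enumerate(range(0, len(page_text), step)):
            chunks.append(Chunk(source, page_number, index, page_text[start:start + size]))
            if start + size >= len(page_text):
                break  # this window already reached the end of the page
    return chunks


def cite(chunk):
    """Human-readable pointer back to where the chunk came from."""
    return f"{chunk.source}, page {chunk.page}, chunk {chunk.index}"
```

With this in place, an answer built from retrieved chunks can quote `cite(chunk)` for each one, so a reader can check the claim against the original PDF.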


@rkaplan you can use the LangChain library to ingest the large PDF. LangChain has a PDF loader that reads the input file, and text-splitting (chunking) functionality that breaks the whole document into pieces. You can then create OpenAI embeddings for the chunks and store the vectors in a vector database such as Chroma. After that, you can ask any question about your data.
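
To show what the chunking step does, here is a small standalone sketch of a recursive splitter. It is not LangChain's code, only an illustration of the general idea: try to split on paragraph breaks first, then on line breaks, then on spaces, and fall back to hard cuts only when nothing else works. The function name, the separator list, and the `max_chars` default are assumptions.

```python
def split_text(text, max_chars=1000, separators=("\n\n", "\n", " ")):
    """Recursively split text so that every piece is at most max_chars long."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    for position, sep in enumerate(separators):
        if sep not in text:
            continue
        # Greedily pack parts back together while they still fit.
        pieces, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    pieces.append(current)
                current = part
        if current:
            pieces.append(current)
        # Recurse with finer separators on any piece that is still too long.
        result = []
        for piece in pieces:
            result.extend(split_text(piece, max_chars, separators[position + 1:]))
        return result
    # No separator applies: fall back to hard cuts.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Sizes here are in characters. If you need to stay under a token limit, a common rough estimate for English text is about four characters per token, but a real tokenizer is the only reliable way to count.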

Are there any guidelines for formatting data in a plain text file uploaded as a training file?

I have seen asterisks used, e.g. **Heading**, and ## used to annotate text, but what is the best way to mark up a plain text document so that the AI understands it?