Training with Large PDF Files

Suppose I want to train GPT-3 to be able to chat about a specialized area of knowledge.

And suppose I have a large PDF file discussing the content I would like GPT-3 to have available.

Do I need to break up the PDF into individual paragraphs each under 4000 tokens? Is that what OpenAI did when training ChatGPT?
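For context on what I mean by breaking the PDF up: after extracting the text, a rough chunking pass might look like the sketch below. This is only an approximation I'm considering, not anything OpenAI documents: it assumes about 4 characters per token for English text (an actual tokenizer such as OpenAI's tiktoken would give exact counts) and splits only on paragraph boundaries, so a single paragraph larger than the budget would pass through unsplit.

```python
def chunk_text(text, max_tokens=4000, chars_per_token=4):
    """Split text into paragraph-aligned chunks under an approximate token budget.

    chars_per_token=4 is a rough heuristic for English; a real tokenizer
    (e.g. tiktoken) would give exact counts. A single paragraph longer
    than the budget is NOT split further by this sketch.
    """
    max_chars = max_tokens * chars_per_token
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # If adding this paragraph would exceed the budget, start a new chunk.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk would then be small enough to send to the API (or to embed) on its own.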

If I do break it up, should I use fine-tuning or embeddings for this training data?

Do I need to use Pinecone to turn this PDF's content into a vector database?
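From what I understand, Pinecone is one hosted option, but for a single PDF the same idea can be sketched as a plain in-memory list of (chunk, vector) pairs with cosine-similarity lookup. In this sketch the vectors are assumed to come from somewhere else (e.g. OpenAI's embeddings endpoint); no Pinecone-specific API is used:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=3):
    """Return the k chunk texts whose vectors are most similar to query_vec.

    `index` is a list of (chunk_text, vector) pairs -- an in-memory
    stand-in for a vector database such as Pinecone.
    """
    scored = sorted(index,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in scored[:k]]
```

The retrieved chunks would then be pasted into the prompt as context, which is my understanding of how the embeddings approach differs from fine-tuning.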