Suppose I want to train GPT-3 to be able to chat about a specialized area of knowledge.
And suppose I have a large PDF file discussing the content I would like GPT-3 to have available.
Do I need to break the PDF up into individual chunks (paragraphs, say), each under the 4,000-token context limit? Is that what OpenAI did when training ChatGPT?
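To make that question concrete, here is roughly what I imagine the chunking step would look like. This is just my sketch, not anything OpenAI has confirmed: I'm assuming `pypdf` for text extraction and `tiktoken` for token counting, and a hypothetical helper called `chunk_pdf`:

```python
# My rough idea of chunking a PDF by paragraph under a token budget.
# Assumes: pip install pypdf tiktoken  (my choice of libraries, not OpenAI's pipeline)
from pypdf import PdfReader
import tiktoken

MAX_TOKENS = 4000  # the context limit I mentioned above

def chunk_pdf(path: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split a PDF's text into paragraph-based chunks, each under max_tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        n = len(enc.encode(para))
        # Start a new chunk if adding this paragraph would bust the budget.
        # (A single paragraph longer than max_tokens would need further splitting.)
        if count + n > max_tokens and current:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```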
If I do break it up this way, should I use fine-tuning or embeddings for this training data?
And do I need a tool like Pinecone to turn this PDF into a vector database?
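For the embedding route, my understanding is something like the sketch below: embed each chunk once, then at question time embed the query and retrieve the most similar chunks to paste into the prompt. The model name and the plain-NumPy similarity search are my assumptions; I'm guessing Pinecone would just replace the NumPy part once the number of chunks gets large, but I'd welcome correction:

```python
# Embedding-and-retrieval sketch (my assumption of how this usually works).
# Assumes: pip install openai numpy, and OPENAI_API_KEY set in the environment.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts with OpenAI's embeddings endpoint."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

# Index the chunks once (chunk_pdf is the hypothetical helper sketched above).
chunks = chunk_pdf("my_specialized_doc.pdf")
chunk_vecs = embed(chunks)
chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True)

# At question time: embed the query, take the 3 most similar chunks,
# and use them as context in the prompt to the chat model.
query_vec = embed(["What does the document say about X?"])[0]
query_vec /= np.linalg.norm(query_vec)
top = np.argsort(chunk_vecs @ query_vec)[-3:][::-1]
context = "\n\n".join(chunks[i] for i in top)
```

Is that roughly the right mental model, or is fine-tuning on the chunks the better approach here?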