How to train the API on ~100 documents (docx, xlsx, pptx, pdf)

I want to create an API that knows more than 100 company documents (manuals, guidelines, activity records, among others), so that the entire company can use it without buying a ChatGPT Plus seat for each of our 200+ people (paying USD 20 per person per month is costly).

I am investigating how feasible it is to train the API this way. Any guide or suggestion would be appreciated.

I have already reviewed several related topics, but I would like a guide on where to start this journey.

I’m already studying Python diligently, and I have solid SQL skills, as well as HTML and JavaScript (from QA automation).

Additional info: my company wants to train a ChatGPT model to answer employee questions using all the files we have on our site, and they say we can use them however we want. Is this possible? There are about 100 documents, roughly 250MB in total.

Thank you.

Hi! Have you found a solution for training the API with your documents? Have you tried a RAG-based approach in combination with the Assistants API?
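For anyone weighing that suggestion, here is a minimal sketch of a plain RAG pipeline built directly on the openai Python SDK: embed the document chunks once, then retrieve the most similar chunks per question and pass them to the model. The embedding model choice, the naive chunking, and the sample question are my assumptions, not something established in this thread:

```python
# Minimal RAG sketch: index document chunks with embeddings, then answer
# questions from the top-matching chunks. Assumes OPENAI_API_KEY is set.
import numpy as np
from openai import OpenAI

client = OpenAI()

def chunk(text: str, size: int = 1000) -> list[str]:
    # Deliberately naive fixed-size chunking; a real pipeline would
    # split on headings/paragraphs instead.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(texts: list[str]) -> np.ndarray:
    # For ~100 documents you would batch this call rather than send
    # every chunk at once.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1) Index: extract plain text from your docx/pdf files first
#    (e.g. with python-docx / pypdf), then chunk and embed it.
documents = ["...extracted text of manual 1...", "...extracted text of guideline 2..."]
chunks = [c for doc in documents for c in chunk(doc)]
index = embed(chunks)

# 2) Query: retrieve the top-k chunks by cosine similarity, then ask the model.
def answer(question: str, k: int = 4) -> str:
    q = embed([question])[0]
    scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    context = "\n---\n".join(chunks[i] for i in np.argsort(scores)[-k:])
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("What is our expense approval process?"))
```

The upside of this over the Assistants API is that you control storage and retrieval yourself, so per-assistant file limits don't apply.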

Following too. The Assistants API can't handle more than 20 docs.

Customer service told me:

Hi there,

Yes, it’s possible to train a ChatGPT model to answer employee questions using the documents you have on your site. This process is known as fine-tuning, where you train a model on your specific dataset to tailor its responses to your needs. Given that you have around 100 documents totaling approximately 250MB, this falls well within the capabilities of the OpenAI API for fine-tuning. Here are the key steps and considerations for fine-tuning a model with your documents:

1. Prepare Your Data: Your documents need to be formatted correctly for fine-tuning. Typically, this involves creating a JSONL file (JSON Lines format) where each line is a separate JSON object representing a training example. For document-based training, each example might include a prompt from the document and the expected response or summary.
2. Upload Your Data: Once your data is prepared and formatted, upload it using the Files API. The OpenAI platform supports file uploads for the purpose of fine-tuning, and your 250MB of documents are well within the size limits.
3. Create a Fine-Tuning Job: After uploading your data, create a fine-tuning job specifying the model you wish to fine-tune (e.g., gpt-3.5-turbo) and the file you’ve uploaded. This process customizes the model based on your data.
4. Use Your Fine-Tuned Model: Once the fine-tuning process is complete, you’ll receive a model that’s tailored to your documents. You can then use it to generate responses to employee questions by making API requests.

Please note that while fine-tuning can significantly improve the model’s performance on tasks similar to your training data, it’s important to review and test the model’s outputs to ensure they meet your expectations. For detailed instructions on preparing your data, uploading files, and creating fine-tuning jobs, please refer to the Fine-tuning - OpenAI API documentation.

If you have any further questions or need assistance with the fine-tuning process, feel free to reach out.

Best,
OpenAI Team
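For reference, those four steps map onto the openai Python SDK roughly like this. This is a minimal sketch; the file name, the example Q&A content, and the answer text are placeholders I made up, not anything support provided:

```python
# Rough sketch of the fine-tuning flow the support message describes.
# File names and example content are placeholders.
import json
from openai import OpenAI

client = OpenAI()

# 1) Prepare: one JSON object per line, in chat format. A real job needs
#    many more examples than this (OpenAI enforces a minimum).
examples = [
    {"messages": [
        {"role": "system", "content": "You answer questions about company policy."},
        {"role": "user", "content": "How many vacation days do new hires get?"},
        {"role": "assistant", "content": "New hires get 15 vacation days per year."},
    ]},
]
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# 2) Upload via the Files API.
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

# 3) Create the fine-tuning job.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)

# 4) Once the job finishes, use the resulting model name (ft:gpt-3.5-turbo:...)
#    in regular chat completion requests.
```

Note that fine-tuning learns from prompt/response pairs like these; it does not ingest the documents verbatim, which is why the RAG approach suggested earlier in the thread is often a better fit for Q&A over documents.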

Then I sent the following:

The last time we chatted (yesterday) you said: “For document-based training, each example might include a prompt from the document and the expected response or summary. Upload Your Data: Once your data is prepared and formatted, you’ll need to upload it using the Files API. The OpenAI platform supports file uploads for the purpose of fine-tuning, and your 250MB of documents are well within the size limits.” And I need to know:

- The maximum size per file.
- Could I use multiple JSONLs for training?
- The cost per day of storage per GB.
- Could I create a friendly environment for my company to upload the JSONL without access to the OpenAI API? Same question for training the model and generating the key.

And their response:

Hi there,

Thank you for reaching out with your questions. I’ll address each of your queries in turn:

1. The maximum size per file: For fine-tuning, the maximum file upload size is 1 GB. This limit should accommodate your needs for uploading documents for the purpose of fine-tuning. You can find more details in our Fine-tuning guide.
2. Using multiple JSONLs for training: Yes, you can use multiple JSONL files for training. When creating a fine-tuning job, you can specify multiple files for both training and validation. This allows you to organize your data in a way that best suits your project’s needs. For more information on how to structure your fine-tuning job, please refer to our Fine-tuning guide.
3. The cost per day of storage per GB: Currently, OpenAI does not charge for the storage of files used for fine-tuning on a per-day basis. Files uploaded for fine-tuning are stored only as long as necessary for the fine-tuning process and are not meant for long-term storage. For detailed information on pricing, including any updates or changes, please refer to our Pricing page.
4. Creating a friendly environment for your company: To create a user-friendly environment for your company to upload JSONL files and train models without direct access to the OpenAI API, you might consider developing an internal tool or interface that interacts with the OpenAI API on behalf of your users. This tool can handle file uploads, initiate fine-tuning jobs, and manage API keys securely, abstracting away direct interaction with the OpenAI API. This approach requires some development effort on your part to create a secure, user-friendly interface that meets your company’s specific needs. For managing API keys securely and following best practices, please see our guide on Production best practices.

Please note that while you can create a more controlled environment for your team, all interactions with the OpenAI API, including file uploads and fine-tuning, ultimately require the use of API keys for authentication. Therefore, your internal tool will need to securely manage these keys.

I hope this answers your questions. If you have any further queries or need additional assistance, please don’t hesitate to ask.
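A minimal sketch of the internal wrapper support describes, assuming FastAPI (my choice, not theirs); the endpoint paths and the model name are placeholders:

```python
# Minimal internal upload/ask service that keeps the OpenAI API key
# server-side, so employees never touch the API or the key directly.
import os
from fastapi import FastAPI, UploadFile
from openai import OpenAI

app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # key stays on the server

@app.post("/upload")
async def upload(file: UploadFile):
    # Forward an employee-provided JSONL file to the Files API for fine-tuning.
    data = await file.read()
    f = client.files.create(file=(file.filename, data), purpose="fine-tune")
    return {"file_id": f.id}

@app.post("/ask")
def ask(question: str):
    # Answer a question with the (placeholder) model of your choice.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # or your fine-tuned model id
        messages=[{"role": "user", "content": question}],
    )
    return {"answer": resp.choices[0].message.content}
```

You would run this with uvicorn and put your company's SSO or network controls in front of it; as support notes, the API key still authenticates every call, it just never leaves your server.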