How to make a larger amount of data available for ChatGPT?

First of all, I apologize if this question has already been answered elsewhere - I didn’t find anything about it. Regarding OpenAI I feel like a toddler taking his first steps.

Initial situation

I have a large collection (a little more than 800 MB) of prepared and cross-linked XML documents on German legislation. These documents are not available online in this format and therefore cannot have been part of the AI's training data.


I am looking for a way to make these documents available in ChatGPT and link them to its broad general knowledge to facilitate meaningful research for lawyers and interested citizens.

Trials (and errors)

I tried the following instructions and also tested some Python snippets, but without success:

  1. Install the OpenAI CLI by following the instructions on the OpenAI GitHub page:
  2. Login to your OpenAI account using the CLI: openai login
  3. Use the openai datasets create command to create a new dataset that contains your German law texts. The dataset should be in the form of a file with one text per line.
  4. Use the openai datasets use command to select the dataset you just created and prepare it for fine-tuning by splitting it into a training and validation set.
  5. Use the openai models create command to create a new model and specify the GPT-3 model architecture you want to use.
  6. Use the openai models fine-tune command to fine-tune the model on your dataset. You can specify the number of training steps, the batch size, and other training parameters.
  7. Use the openai models evaluate command to evaluate the model’s performance on the validation set.
  8. Once you are satisfied with the model’s performance, use the openai models serve command to start a server that serves the fine-tuned model and allows you to use it for your task.
    Please note that for fine-tuning large models like GPT-3 you will need access to powerful GPUs, and you will also need a GPT-3 subscription; some other OpenAI models can be used with an API key. Furthermore, fine-tuning GPT-3 can take a considerable amount of computational resources, time, and data, so make sure you have enough of each.

The “openai datasets create” command is used to create a new dataset in the OpenAI Datasets library. The command takes several arguments, which you can see by running “openai datasets create --help”. Some of the important arguments include:

  • --name: the name of the dataset you want to create.
  • --version: the version of the dataset you want to create.
  • --file: the file or directory containing the data for the dataset.
  • --metadata: a JSON file containing metadata about the dataset.
  • --description: a text file containing a description of the dataset.
  • --url: a URL where the data for the dataset can be downloaded.
  • --size: the size of the dataset in bytes.
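
import requests

api_key = "YOUR_API_KEY"
# f-string so the key is actually substituted into the header
headers = {"Authorization": f"Bearer {api_key}"}
data = {
    "name": "My XML Dataset",
    "description": "A dataset of XML files for training",
    "metadata": {"source_url": "https://url pointing to one xml file"},
}
# The endpoint URL was not preserved in this post; it is left empty here.
response = requests.post("", headers=headers, json=data)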


Does anyone know how this can be accomplished? I don’t think “embeddings” help either: surely it doesn’t make sense to transfer such large amounts of data in every session?


I think that the Embeddings API is the solution here.
Check this tutorial: OpenAI API

For quick access, you should store the vectors in a database. More information and a list of vector databases: OpenAI API

This thread may be helpful: Storing embeddings in SQL Server? Latency between Redis & Pinecone? Vector DB recommendations? - #2 by raymonddavey
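
To illustrate the idea behind this suggestion, here is a minimal sketch of embedding-based retrieval. It is not the OpenAI API: embed() is a toy stand-in (a hashed bag-of-words vector) so the example runs offline, and TinyVectorStore is a hypothetical in-memory substitute for a real vector database. The sample texts are made up and are not real statutes. The point it tries to show is that the documents are embedded once and stored, and each question only pulls the few most similar chunks instead of sending all the data in every session.

```python
import math
import re
import zlib
from collections import Counter


def embed(text, dim=1024):
    """Toy stand-in for an embeddings API: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for word, count in Counter(re.findall(r"\w+", text.lower())).items():
        # crc32 gives a stable bucket per word across runs
        vec[zlib.crc32(word.encode("utf-8")) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity; both vectors are already unit length."""
    return sum(x * y for x, y in zip(a, b))


class TinyVectorStore:
    """Hypothetical in-memory replacement for a vector database."""

    def __init__(self):
        self.items = []  # list of (vector, text) pairs

    def add(self, text):
        self.items.append((embed(text), text))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


# Made-up sample chunks, embedded once and kept in the store.
store = TinyVectorStore()
for chunk in [
    "Section 1: A tenant may terminate the lease with three months notice.",
    "Section 2: The purchase contract requires notarisation for real estate.",
    "Section 3: Employees are entitled to paid leave every calendar year.",
]:
    store.add(chunk)

# Only the best-matching chunk(s) would be sent along with the question.
top = store.search("How much notice must a tenant give to terminate a lease?", k=1)
print(top[0])
```

In a real setup, embed() would be replaced by a call to an embeddings endpoint and the store by one of the vector databases mentioned above; the retrieval logic stays the same in spirit.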


Thank you. I will try that and post the steps and the result here.


Where are the results? I am trying to do something similar.

My results with the embedding approach using vector databases were not very useful. The introduction of the OpenAI plugin concept has made all this obsolete.
I myself was able to successfully implement a project using langchain. See here: Tutorial: ChatGPT Over Your Data
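
For anyone trying something similar with XML sources, here is a rough sketch of the preparation steps such a pipeline typically needs: turning XML into plain-text chunks and assembling a prompt from the chunks that were retrieved. It uses only the Python standard library, not langchain's own API. The element names (law, section, title, p), the nr attribute, and the helper names xml_to_chunks and build_prompt are assumptions for illustration; the real schema of the documents will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real XML schema is not known here.
SAMPLE_XML = """<law id="demo">
  <section nr="1"><title>Scope</title><p>This act applies to sample cases.</p></section>
  <section nr="2"><title>Definitions</title><p>A sample case is a case used for illustration.</p></section>
</law>"""


def xml_to_chunks(xml_text, max_chars=200):
    """Flatten each <section> to plain text and cut it into pieces of max_chars."""
    root = ET.fromstring(xml_text)
    chunks = []
    for section in root.iter("section"):
        # Join all text nodes and collapse runs of whitespace
        text = " ".join(" ".join(section.itertext()).split())
        for start in range(0, len(text), max_chars):
            chunks.append({"section": section.get("nr"),
                           "text": text[start:start + max_chars]})
    return chunks


def build_prompt(question, chunks):
    """Put the retrieved chunks in front of the question as context."""
    context = "\n".join(f"[Section {c['section']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


chunks = xml_to_chunks(SAMPLE_XML)
prompt = build_prompt("What is a sample case?", chunks)
print(prompt)
```

The resulting prompt string is what would be sent to the chat model, with only the relevant chunks included rather than the whole collection.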
