Train ChatGPT model using JSON data

Hello All,

I have JSON data and I would like to train a ChatGPT model on it. I followed the steps to create a JSONL file from my JSON, but got this error: “ERROR in necessary_column validator: prompt column/key is missing. Please make sure you name your columns/keys appropriately, then retry”
Here’s what my JSON data looks like:

{
  "folders": [
    {
      "name": "User1",
      "id": "dfsFGGES:1",
      "projects": [
        {
          "name": "chatgpt",
          "id": "Noc29sdX",
          "files": [
            {
              "name": "Test1",
              "id": "guI4wU",
              "createdBy": "User 1",
              "data": {
                "name": "Test1-new",
                "items": [
                  {
                    "name": "abc",
                    "type": "cloth",
                    "items": [
                      {
                        "name": "abc",
                        "type": "fabric"
                      },
                      {
                        "name": "pqr",
                        "type": "fabric"
                      }
                    ]
                  }
                ]
              }
            },
            {
              "name": "Test1",
              "id": "guI4wU",
              "createdBy": "User 1",
              "data": {
                "name": "Test1-new",
                "items": [
                  {
                    "name": "abc",
                    "type": "cloth",
                    "items": [
                      {
                        "name": "abc",
                        "type": "fabric"
                      },
                      {
                        "name": "pqr",
                        "type": "fabric"
                      }
                    ]
                  }
                ]
              }
            }
          ]
        },
        {
          "name": "Test1",
          "id": "guI4wU",
          "createdBy": "User 1",
          "data": {
            "name": "Test1-new",
            "items": [
              {
                "name": "abc",
                "type": "cloth",
                "items": [
                  {
                    "name": "abc",
                    "type": "fabric"
                  },
                  {
                    "name": "pqr",
                    "type": "fabric"
                  }
                ]
              }
            ]
          }
        }
      ]
    },
    {
      "name": "User1",
      "id": "dfsFGGES:1",
      "projects": [
        {
          "name": "chatgpt",
          "id": "Noc29sdX",
          "files": [
            {
              "name": "Test1",
              "id": "guI4wU",
              "createdBy": "User 1",
              "data": {
                "name": "Test1-new",
                "items": [
                  {
                    "name": "abc",
                    "type": "cloth",
                    "items": [
                      {
                        "name": "abc",
                        "type": "fabric"
                      },
                      {
                        "name": "pqr",
                        "type": "fabric"
                      }
                    ]
                  }
                ]
              }
            },
            {
              "name": "Test1",
              "id": "guI4wU",
              "createdBy": "User 1",
              "data": {
                "name": "Test1-new",
                "items": [
                  {
                    "name": "abc",
                    "type": "cloth",
                    "items": [
                      {
                        "name": "abc",
                        "type": "fabric"
                      },
                      {
                        "name": "pqr",
                        "type": "fabric"
                      }
                    ]
                  }
                ]
              }
            }
          ]
        },
        {
          "name": "Test1",
          "id": "guI4wU",
          "createdBy": "User 1",
          "data": {
            "name": "Test1-new",
            "items": [
              {
                "name": "abc",
                "type": "cloth",
                "items": [
                  {
                    "name": "abc",
                    "type": "fabric"
                  },
                  {
                    "name": "pqr",
                    "type": "fabric"
                  }
                ]
              }
            ]
          }
        }
      ]
    }
  ]
}

I followed the steps mentioned in the “OpenAI API” documentation.
Can someone help me understand how I can train ChatGPT using my JSON data?

Welcome to the community @Rahul.Khande

What use-case do you want to build for?

I’m asking because there are several misconceptions in your understanding of the API.

  1. All models are pre-trained. The P in GPT stands for pre-trained. You can only fine-tune them.

  2. Only base models are available for fine-tuning.

  3. Fine-tuning is done with a JSONL file in which each individual line is a valid JSON object consisting of a prompt-completion pair with appropriate formatting.

I highly recommend reading the docs to develop a better understanding of the capabilities offered.
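To illustrate point 3: each line of a fine-tuning JSONL file must be a standalone JSON object with `prompt` and `completion` keys, which is exactly what the “prompt column/key is missing” validator is complaining about. A minimal sketch using only Python’s standard library (the training pairs below are made up for illustration, not derived from any real dataset):

```python
import json

# Hypothetical prompt/completion pairs for illustration only.
pairs = [
    {"prompt": "List the files in project chatgpt ->",
     "completion": " Test1"},
    {"prompt": "Who created file Test1 ->",
     "completion": " User 1"},
]

with open("training_data.jsonl", "w") as f:
    for pair in pairs:
        # One valid JSON object per line; each must contain the
        # "prompt" and "completion" keys the validator checks for.
        f.write(json.dumps(pair) + "\n")

# Sanity check: every line parses and has both required keys.
with open("training_data.jsonl") as f:
    for line in f:
        obj = json.loads(line)
        assert "prompt" in obj and "completion" in obj
```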

Thank you @sps. I appreciate your answer and completely get what you mean.
Earlier I thought I needed to fine-tune existing models with my data, but now I get your point.

My use case is: I have huge JSON data, like what I shared in my earlier post, and I would like to develop a chat app that answers questions based on the data in the JSON.
For example: “give me a list of all files in the project named ‘chatgpt’”.

Can you please share instructions on how this can be achieved? Thank you.

If you’re looking to search through JSON data, a simple recursive search can be performed for the entity extracted from the user’s message, and the result can be supplied back to the user directly.
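A recursive search over nested JSON along those lines might look like this (a sketch; the function name and sample data are illustrative):

```python
def find_values(obj, key):
    """Recursively collect every value stored under `key`
    anywhere inside a nested dict/list structure."""
    results = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                results.append(v)
            results.extend(find_values(v, key))
    elif isinstance(obj, list):
        for item in obj:
            results.extend(find_values(item, key))
    return results

data = {
    "projects": [
        {"name": "chatgpt",
         "files": [{"name": "Test1"}, {"name": "Test2"}]}
    ]
}

# All "name" values anywhere in the structure.
print(find_values(data, "name"))  # → ['chatgpt', 'Test1', 'Test2']
```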

Thanks for the suggestion @sps.
This is indeed one way to do it, but I’m worried about search performance for a huge dataset, like a 500 MB JSON file. The time taken to search would be high, right? So I was thinking of leveraging ChatGPT here.

My thought process was that by integrating ChatGPT into the application, I could leverage its language-understanding capabilities to enhance the search functionality and provide a more user-friendly experience on large datasets.

Just to understand more: isn’t this a use case for ChatGPT?

In that case, you can use an indexing technique that preprocesses the JSON object and stores a value-to-key mapping in a dictionary or hash table.

This way, when you need to search for a value, you can look it up in the index directly, which is much faster than traversing the entire JSON object each time you want to find a value.
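As a sketch of that idea (the function name and sample data are illustrative): walk the JSON once up front and build a value-to-path index, so later lookups become a single dictionary access instead of a full traversal per query.

```python
from collections import defaultdict

def build_index(obj, path=()):
    """One-time pass: map every scalar value to the list of
    key paths where it occurs in the nested structure."""
    index = defaultdict(list)
    if isinstance(obj, dict):
        for k, v in obj.items():
            for value, paths in build_index(v, path + (k,)).items():
                index[value].extend(paths)
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            for value, paths in build_index(item, path + (i,)).items():
                index[value].extend(paths)
    else:
        index[obj].append(path)
    return index

data = {"projects": [{"name": "chatgpt",
                      "files": [{"name": "Test1"}]}]}
index = build_index(data)

# Constant-time lookup instead of re-traversing the whole JSON.
print(index["chatgpt"])  # → [('projects', 0, 'name')]
```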

You can’t train a pre-trained model, but you can fine-tune it. Even then, it will be computationally expensive. Instead, use semantic search with vector DBs.
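A rough sketch of that pipeline, with a toy bag-of-words embedding standing in for a real embedding model and a plain in-memory list standing in for a vector DB (all names and text chunks here are illustrative):

```python
import math
from collections import Counter

def embed(text):
    """Toy word-count embedding -- a real system would call an
    embedding model here instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Flatten each JSON record into a text chunk, then "index" it.
chunks = [
    "project chatgpt contains file Test1 created by User 1",
    "folder User1 has id dfsFGGES:1",
]
store = [(chunk, embed(chunk)) for chunk in chunks]  # stand-in vector DB

def search(query):
    q = embed(query)
    return max(store, key=lambda item: cosine(q, item[1]))[0]

print(search("list files in project chatgpt"))
# → "project chatgpt contains file Test1 created by User 1"
```

The retrieved chunk would then be passed to the model as context so it can phrase the answer, rather than asking the model to memorize the data via fine-tuning.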

I use LangChain with a Pydantic extractor for this.

@MaskedAttention

How would you accomplish semantic search with vector DBs on JSON data?