Train ChatGPT model using JSON data

Hello All,

I have JSON data and I would like to train a ChatGPT model on it. I followed the steps to create a JSONL file from my JSON, but got this error: “ERROR in necessary_column validator: prompt column/key is missing. Please make sure you name your columns/keys appropriately, then retry”
Here’s what my JSON data looks like:

{
  "folders": [
    {
      "name": "User1",
      "id": "dfsFGGES:1",
      "projects": [
        {
          "name": "chatgpt",
          "id": "Noc29sdX",
          "files": [
            {
              "name": "Test1",
              "id": "guI4wU",
              "createdBy": "User 1",
              "data": {
                "name": "Test1-new",
                "items": [
                  {
                    "name": "abc",
                    "type": "cloth",
                    "items": [
                      {
                        "name": "abc",
                        "type": "fabric"
                      },
                      {
                        "name": "pqr",
                        "type": "fabric"
                      }
                    ]
                  }
                ]
              }
            },
            {
              "name": "Test1",
              "id": "guI4wU",
              "createdBy": "User 1",
              "data": {
                "name": "Test1-new",
                "items": [
                  {
                    "name": "abc",
                    "type": "cloth",
                    "items": [
                      {
                        "name": "abc",
                        "type": "fabric"
                      },
                      {
                        "name": "pqr",
                        "type": "fabric"
                      }
                    ]
                  }
                ]
              }
            }
          ]
        },
        {
          "name": "Test1",
          "id": "guI4wU",
          "createdBy": "User 1",
          "data": {
            "name": "Test1-new",
            "items": [
              {
                "name": "abc",
                "type": "cloth",
                "items": [
                  {
                    "name": "abc",
                    "type": "fabric"
                  },
                  {
                    "name": "pqr",
                    "type": "fabric"
                  }
                ]
              }
            ]
          }
        }
      ]
    },
    {
      "name": "User1",
      "id": "dfsFGGES:1",
      "projects": [
        {
          "name": "chatgpt",
          "id": "Noc29sdX",
          "files": [
            {
              "name": "Test1",
              "id": "guI4wU",
              "createdBy": "User 1",
              "data": {
                "name": "Test1-new",
                "items": [
                  {
                    "name": "abc",
                    "type": "cloth",
                    "items": [
                      {
                        "name": "abc",
                        "type": "fabric"
                      },
                      {
                        "name": "pqr",
                        "type": "fabric"
                      }
                    ]
                  }
                ]
              }
            },
            {
              "name": "Test1",
              "id": "guI4wU",
              "createdBy": "User 1",
              "data": {
                "name": "Test1-new",
                "items": [
                  {
                    "name": "abc",
                    "type": "cloth",
                    "items": [
                      {
                        "name": "abc",
                        "type": "fabric"
                      },
                      {
                        "name": "pqr",
                        "type": "fabric"
                      }
                    ]
                  }
                ]
              }
            }
          ]
        },
        {
          "name": "Test1",
          "id": "guI4wU",
          "createdBy": "User 1",
          "data": {
            "name": "Test1-new",
            "items": [
              {
                "name": "abc",
                "type": "cloth",
                "items": [
                  {
                    "name": "abc",
                    "type": "fabric"
                  },
                  {
                    "name": "pqr",
                    "type": "fabric"
                  }
                ]
              }
            ]
          }
        }
      ]
    }
  ]
}

I followed the steps mentioned in the “OpenAI API” documentation.
Can someone help me understand how I can train ChatGPT using my JSON data?

Welcome to the community @Rahul.Khande

What use-case do you want to build for?

I’m asking because there are several misconceptions in your understanding of the API.

  1. All models are pre-trained. The P in GPT stands for pre-trained. You can only fine-tune them.

  2. Only base models are available for fine-tuning.

  3. Fine-tuning is done with a JSONL file in which each individual line is a valid JSON object consisting of a prompt-completion pair with appropriate formatting.

I highly recommend reading the docs to develop a better understanding of the capabilities offered.
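To illustrate point 3: each line of a fine-tuning JSONL file must be a standalone JSON object with `prompt` and `completion` keys, which is exactly what the “prompt column/key is missing” validator is complaining about. A minimal sketch using only Python’s standard library (the training pairs below are made up for illustration, not derived from any real dataset):

```python
import json

# Hypothetical prompt/completion pairs for illustration only.
pairs = [
    {"prompt": "List the files in project chatgpt ->",
     "completion": " Test1"},
    {"prompt": "Who created file Test1 ->",
     "completion": " User 1"},
]

with open("training_data.jsonl", "w") as f:
    for pair in pairs:
        # One valid JSON object per line; each must contain the
        # "prompt" and "completion" keys the validator checks for.
        f.write(json.dumps(pair) + "\n")

# Sanity check: every line parses and has both required keys.
with open("training_data.jsonl") as f:
    for line in f:
        obj = json.loads(line)
        assert "prompt" in obj and "completion" in obj
```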

Thank you @sps. I appreciate your answer and completely get what you mean.
Earlier I thought I needed to fine-tune existing models with my data, but now I get your point.

My use case is: I have huge JSON data, like what I shared in my earlier post, and I would like to develop a chat app that answers questions based on the data in the JSON.
For example: “give me a list of all files in the project named ‘chatgpt’”.

Can you please share instructions on how this can be achieved? Thank you.

If you’re looking to search through JSON data, a simple recursive search can be performed for the entity extracted from the user’s message, and the result can be supplied back to the user directly.
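A recursive search over nested JSON along those lines might look like this (a sketch; the function name and sample data are illustrative):

```python
def find_values(obj, key):
    """Recursively collect every value stored under `key`
    anywhere inside a nested dict/list structure."""
    results = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                results.append(v)
            results.extend(find_values(v, key))
    elif isinstance(obj, list):
        for item in obj:
            results.extend(find_values(item, key))
    return results

data = {
    "projects": [
        {"name": "chatgpt",
         "files": [{"name": "Test1"}, {"name": "Test2"}]}
    ]
}

# All "name" values anywhere in the structure.
print(find_values(data, "name"))  # → ['chatgpt', 'Test1', 'Test2']
```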

Thanks for the suggestion @sps.
This is indeed one way to do it, but I’m worried about search performance for a huge dataset, like a 500 MB JSON file. The time taken to search would be high, right? So I was thinking of leveraging ChatGPT here.

My thought process was that by integrating ChatGPT into the application, I could leverage its language-understanding capabilities to enhance the search functionality and provide a more user-friendly experience on large datasets.

Just to understand more: isn’t this a use case for ChatGPT?

In that case, you can use an indexing technique that preprocesses the JSON object and stores a value-to-key mapping in a dictionary or hash table.

This way, when you need to search for a value, you can look it up in the index directly, which is much faster than traversing the entire JSON object each time you want to find a value.
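As a sketch of that idea (the function name and sample data are illustrative): walk the JSON once up front and build a value-to-path index, so later lookups become a single dictionary access instead of a full traversal per query.

```python
from collections import defaultdict

def build_index(obj, path=()):
    """One-time pass: map every scalar value to the list of
    key paths where it occurs in the nested structure."""
    index = defaultdict(list)
    if isinstance(obj, dict):
        for k, v in obj.items():
            for value, paths in build_index(v, path + (k,)).items():
                index[value].extend(paths)
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            for value, paths in build_index(item, path + (i,)).items():
                index[value].extend(paths)
    else:
        index[obj].append(path)
    return index

data = {"projects": [{"name": "chatgpt",
                      "files": [{"name": "Test1"}]}]}
index = build_index(data)

# Constant-time lookup instead of re-traversing the whole JSON.
print(index["chatgpt"])  # → [('projects', 0, 'name')]
```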

You can’t train a pre-trained model, but you can fine-tune it. Even then, it will be computationally expensive. Instead, use semantic search with vector DBs.
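A rough sketch of that pipeline, with a toy bag-of-words embedding standing in for a real embedding model and a plain in-memory list standing in for a vector DB (all names and text chunks here are illustrative):

```python
import math
from collections import Counter

def embed(text):
    """Toy word-count embedding -- a real system would call an
    embedding model here instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Flatten each JSON record into a text chunk, then "index" it.
chunks = [
    "project chatgpt contains file Test1 created by User 1",
    "folder User1 has id dfsFGGES:1",
]
store = [(chunk, embed(chunk)) for chunk in chunks]  # stand-in vector DB

def search(query):
    q = embed(query)
    return max(store, key=lambda item: cosine(q, item[1]))[0]

print(search("list files in project chatgpt"))
# → "project chatgpt contains file Test1 created by User 1"
```

The retrieved chunk would then be passed to the model as context so it can phrase the answer, rather than asking the model to memorize the data via fine-tuning.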

I use LangChain with a Pydantic extractor for this.

@MaskedAttention

How would you accomplish semantic search with vector DBs on JSON data?