How to teach a model relational data?

I have a set of relational data that I would like a model to understand. I tried using the Assistants API to create an assistant with a large set of data in a JSON file, and I also tried Markdown. The results from both experiments were awful.

The data has a few simple fields and one very large text blob in one of the fields. I basically want to fetch some of the simple fields (id, url) based on analysis of the large text blob for that item. However, the model (gpt-4-turbo-preview) seems to choke on the data, and when it does spit out an answer it can never give me the correct ID or URL; instead it fabricates an ID that looks plausible but doesn’t exist in the retrieval document.

The document is 5.4 MB and contains about 109,000 tokens. Am I doing something wrong here? Should I be using a different format? Is there any way to get a ChatGPT model to “understand” relational data the way we do when, for example, writing SQL queries and interpreting the results?

Thanks for any help or shared experiences with this.

Have you tried adding a JSON schema alongside the object? Seems to help in my experience :man_shrugging:

For Markdown, you can achieve a similar result by using a ToC (Table of Contents) for the document.
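For illustration, a ToC for the Markdown variant might look something like this (the section names, ids, and URLs are made-up placeholders, not anything from your data):

```markdown
## Table of Contents

- [Record 001](#record-001) (id: 001, url: https://example.com/001)
- [Record 002](#record-002) (id: 002, url: https://example.com/002)

## Record 001

id: 001
url: https://example.com/001

<large text blob for record 001…>
```

The ToC gives the model a compact index to navigate from, so it can find the right section without scanning every blob.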

Thanks for your suggestion. Could you give me an example? Are you referring to the spec at json-schema dot org? If so, do I just add that as a separate file? How do you do it?

No, not the spec - though if you use Draft-07, the model is well versed and you shouldn’t run into issues. I haven’t looked too much into the newer versions, so this would be at your own risk (to your sanity :upside_down_face:).

What I mean is to define a schema for the JSON object you created (based on Draft-07 or similar), which is typically used for validation purposes (of the specific object you created, or objects like it). Here is an example (kudos to ChatGPT - though I did briefly double-check and it doesn’t include any obvious errors) to illustrate:

Below is an example JSON Schema adhering to Draft 07 specifications. This schema is designed to validate a JSON object that describes a person, including their name, age, and a list of emails. The schema ensures that the name is a string, the age is an integer, and the emails array contains valid email addresses.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Person",
  "type": "object",
  "properties": {
    "name": {
      "type": "string",
      "description": "The person's name"
    },
    "age": {
      "type": "integer",
      "description": "The person's age",
      "minimum": 0
    },
    "emails": {
      "type": "array",
      "description": "List of email addresses",
      "items": {
        "type": "string",
        "format": "email"
      }
    }
  },
  "required": ["name", "age", "emails"]
}

This schema uses basic data types (string, integer, array) and integrates common schema keywords like type, properties, description, and required. The format keyword is used to validate the format of the email addresses in the emails array.

If you need more detailed guidance on creating JSON Schemas or explanations on how the different keywords work, the JSON Schema documentation is an excellent resource.

P.s.: Yes, add it as a separate file, and also provide instructions or context that tells the model to use it (to then understand the JSON object).
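To make the concept concrete, here is a minimal Python sketch of what such a schema encodes (the sample data is made up; for real validation you would use the jsonschema package rather than this hand-rolled check):

```python
# Condensed version of the Person schema above, as a Python dict.
schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Person",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "emails": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "age", "emails"],
}

# Hand-rolled check of just the "required" and "type" keywords, to show
# what the schema captures. In practice, use the jsonschema package:
# jsonschema.validate(instance, schema).
TYPE_MAP = {"string": str, "integer": int, "array": list, "object": dict}

def check(obj, schema):
    # Every required key must be present...
    for key in schema.get("required", []):
        if key not in obj:
            return False
    # ...and every present property must have the declared type.
    for key, spec in schema.get("properties", {}).items():
        if key in obj and not isinstance(obj[key], TYPE_MAP[spec["type"]]):
            return False
    return True

check({"name": "Ada", "age": 36, "emails": ["ada@example.com"]}, schema)  # True
```

The same dict, serialized to its own file, is what you hand to the assistant alongside the data.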

1 Like

Thanks! Yes, I didn’t mean include the spec itself, but a schema based on that spec. I did as you described; however, my main problem now is that it takes forever to get an answer. Actually, I don’t get an answer at all. I think this has to do with the file size / token count. I tried with a much smaller file and it seemed to work alright.

The JSON file containing all the data is about 8.5MB and has about 1.6 million tokens. Am I just being way too ambitious here?

1 Like

That’s a tough one to answer outright - it kind of depends:

  • Is it necessary to read all* of the data at once? *gpt-4-turbo-preview can hold at most 128,000 tokens in its context (“memory”) - fewer for most other models.
  • Can you break the data up into multiple files? Ideally each file should be smaller than the model’s context/token limit. Keep in mind that for any given query, the answer must be achievable with fewer tokens than the model’s limit - you might have to try summarizing some of the data, for example, but this is… complicated.
  • A good rule of thumb is that the length and complexity of the data have an inverse relationship with the model’s ability to use it: more data/complexity, less model ability (or more mistakes).
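As a sketch of the second point, here is one (assumed) way to split a list of records into chunks that each fit under a token budget, using the rough 4-characters-per-token heuristic - for accurate counts you would use a real tokenizer such as tiktoken:

```python
import json

def chunk_records(records, token_budget=100_000, chars_per_token=4):
    """Group records so each group fits a rough token budget.

    The chars_per_token ratio is a crude heuristic for English text;
    swap in a real tokenizer (e.g. tiktoken) for accurate counts.
    """
    chunks, current, used = [], [], 0
    for rec in records:
        # Estimate this record's token cost from its serialized length.
        cost = len(json.dumps(rec)) // chars_per_token + 1
        if current and used + cost > token_budget:
            chunks.append(current)  # start a new chunk when the budget is hit
            current, used = [], 0
        current.append(rec)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be written to its own file, keeping every file comfortably under the model’s context limit.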
1 Like

Additionally, you could try to give the model a way to access specific parts of the data (or, even better, a sequence of steps to follow) so that it doesn’t try to read more data than it needs to (per run / completion).

1 Like

Additionally, you could try to give the model a way to access specific parts of the data (or, even better, a sequence of steps to follow) so that it doesn’t try to read more data than it needs to (per run / completion).

This is exactly what I was starting to think about… in my case, if the model could call out to an API I set up to perform searches, and then only analyze the large text blobs for those search results, I think that would fit within the 128k token limit.
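Roughly what I have in mind - the tool name, parameter fields, and sample records below are placeholders for my real data, and the backend handler would query my actual store:

```python
import json

# Function tool the assistant could call instead of reading the whole
# 8.5 MB file; the model supplies the arguments, my server runs the search.
search_tool = {
    "type": "function",
    "function": {
        "name": "search_records",
        "description": "Search the relational data and return matching id/url pairs.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Free-text search over the large text field",
                },
                "limit": {"type": "integer", "minimum": 1, "maximum": 20},
            },
            "required": ["query"],
        },
    },
}

# Stand-in for the server-side handler (real version hits my database).
RECORDS = [
    {"id": "a1", "url": "https://example.com/a1", "text": "quarterly sales report"},
    {"id": "b2", "url": "https://example.com/b2", "text": "engineering roadmap"},
]

def search_records(query, limit=5):
    hits = [
        {"id": r["id"], "url": r["url"]}
        for r in RECORDS
        if query.lower() in r["text"].lower()
    ]
    # Return JSON so the tool output can go straight back to the model.
    return json.dumps(hits[:limit])
```

Only the handful of matching blobs would then be sent back for analysis, instead of the entire dataset.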

1 Like

Let me know how you go! I have yet to try anything that large, so I would be interested to know if that approach is successful.

Happy coding!