How to teach a model relational data?

I have a set of relational data that I would like a model to understand. I tried using the Assistants API to create an assistant with a large set of data in a JSON file, and I also tried Markdown. The results from both experiments were awful.

The data has a few simple fields and one very large text blob in one of the fields. I basically want to fetch some of the simple fields (id, url) based on analysis of the large text blob for that item. However, the model (gpt-4-turbo-preview) seems to choke on the data, and when it does spit out an answer it can never give me the correct ID or URL; instead it fabricates an ID that looks plausible but doesn’t exist in the retrieval document.

The document is 5.4 MB and contains about 109,000 tokens. Am I doing something wrong here? Should I be using a different format? Is there any way to get a ChatGPT model to “understand” relational data the way we do when, for example, writing SQL queries and interpreting the results?

Thanks for any help or shared experiences with this.

Have you tried adding a JSON schema alongside the object? Seems to help in my experience :man_shrugging:

For Markdown, you can achieve a similar result by using a ToC (Table of Contents) for the document.
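For illustration, a ToC for the Markdown variant might look something like this (the section names, ids, and URLs are made-up placeholders, not anything from your data):

```markdown
## Table of Contents

- [Record 001](#record-001) (id: 001, url: https://example.com/001)
- [Record 002](#record-002) (id: 002, url: https://example.com/002)

## Record 001

id: 001
url: https://example.com/001

<large text blob for record 001…>
```

The ToC gives the model a compact index to navigate from, so it can find the right section without scanning every blob.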

Thanks for your suggestion. Could you give me an example? Are you referring to the spec at json-schema dot org? If so, do I just add that as a separate file? How do you do it?

No, not the spec - though if you use Draft-07, the model is well versed and you shouldn’t run into issues. I haven’t looked too much into the newer versions, so this would be at your own risk (to your sanity :upside_down_face:).

What I mean is to define a schema for the JSON object you created (based on Draft-07 or similar), which is typically used for validation purposes (of the specific object you created, or objects like it). Here is an example (kudos to ChatGPT - though I did briefly double-check and it doesn’t include any obvious errors) to illustrate:

Below is an example JSON Schema adhering to Draft 07 specifications. This schema is designed to validate a JSON object that describes a person, including their name, age, and a list of emails. The schema ensures that the name is a string, the age is an integer, and the emails array contains valid email addresses.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Person",
  "type": "object",
  "properties": {
    "name": {
      "type": "string",
      "description": "The person's name"
    },
    "age": {
      "type": "integer",
      "description": "The person's age",
      "minimum": 0
    },
    "emails": {
      "type": "array",
      "description": "List of email addresses",
      "items": {
        "type": "string",
        "format": "email"
      }
    }
  },
  "required": ["name", "age", "emails"]
}

This schema uses basic data types (string, integer, array) and integrates common schema keywords like type, properties, description, and required. The format keyword is used to validate the format of the email addresses in the emails array.

If you need more detailed guidance on creating JSON Schemas or explanations on how the different keywords work, the JSON Schema documentation is an excellent resource.

P.s.: Yes, add it as a separate file, and also provide instructions or context that tells the model to use it (to then understand the JSON object).
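To make the concept concrete, here is a minimal Python sketch of what such a schema encodes (the sample data is made up; for real validation you would use the jsonschema package rather than this hand-rolled check):

```python
# Condensed version of the Person schema above, as a Python dict.
schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Person",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "emails": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "age", "emails"],
}

# Hand-rolled check of just the "required" and "type" keywords, to show
# what the schema captures. In practice, use the jsonschema package:
# jsonschema.validate(instance, schema).
TYPE_MAP = {"string": str, "integer": int, "array": list, "object": dict}

def check(obj, schema):
    # Every required key must be present...
    for key in schema.get("required", []):
        if key not in obj:
            return False
    # ...and every present property must have the declared type.
    for key, spec in schema.get("properties", {}).items():
        if key in obj and not isinstance(obj[key], TYPE_MAP[spec["type"]]):
            return False
    return True

check({"name": "Ada", "age": 36, "emails": ["ada@example.com"]}, schema)  # True
```

The same dict, serialized to its own file, is what you hand to the assistant alongside the data.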

1 Like

Thanks! Yes, I didn’t mean include the spec itself, but a schema based on that spec. I did as you described; however, my main problem now is that it takes forever to get an answer. Actually, I don’t get an answer at all. I think this has to do with the file size / token count. I tried with a much smaller file and it seemed to work alright.

The JSON file containing all the data is about 8.5MB and has about 1.6 million tokens. Am I just being way too ambitious here?

1 Like

That’s a tough one to answer outright - it kind of depends:

  • Is it necessary to read all* of the data at once? *gpt-4-turbo-preview can hold at most 128,000 tokens in its context (“memory”) - fewer for most other models.
  • Can you break the data up into multiple files? Ideally each file should be smaller than the model’s context/token limit. Keep in mind that for any given query, the answer must be achievable with fewer tokens than the model’s limit - you might have to try summarizing some of the data, for example, but this is… complicated.
  • A good rule of thumb is that the length and complexity of the data have an inverse relationship with the model’s ability to use it: more data/complexity, less model ability (or more mistakes).
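As a sketch of the second point, here is one (assumed) way to split a list of records into chunks that each fit under a token budget, using the rough 4-characters-per-token heuristic - for accurate counts you would use a real tokenizer such as tiktoken:

```python
import json

def chunk_records(records, token_budget=100_000, chars_per_token=4):
    """Group records so each group fits a rough token budget.

    The chars_per_token ratio is a crude heuristic for English text;
    swap in a real tokenizer (e.g. tiktoken) for accurate counts.
    """
    chunks, current, used = [], [], 0
    for rec in records:
        # Estimate this record's token cost from its serialized length.
        cost = len(json.dumps(rec)) // chars_per_token + 1
        if current and used + cost > token_budget:
            chunks.append(current)  # start a new chunk when the budget is hit
            current, used = [], 0
        current.append(rec)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be written to its own file, keeping every file comfortably under the model’s context limit.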
1 Like

Additionally, you could try to give the model a way to access specific parts of the data (or, even better, a sequence of steps to follow) so that it doesn’t try to read more data than it needs to (per run / completion).

1 Like

Additionally, you could try to give the model a way to access specific parts of the data (or, even better, a sequence of steps to follow) so that it doesn’t try to read more data than it needs to (per run / completion).

This is exactly what I was starting to think about… in my case, if the model could call out to an API I set up to perform searches, and then only analyze the large text blobs for those search results, I think that would fit within the 128k token limit.
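Roughly what I have in mind - the tool name, parameter fields, and sample records below are placeholders for my real data, and the backend handler would query my actual store:

```python
import json

# Function tool the assistant could call instead of reading the whole
# 8.5 MB file; the model supplies the arguments, my server runs the search.
search_tool = {
    "type": "function",
    "function": {
        "name": "search_records",
        "description": "Search the relational data and return matching id/url pairs.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Free-text search over the large text field",
                },
                "limit": {"type": "integer", "minimum": 1, "maximum": 20},
            },
            "required": ["query"],
        },
    },
}

# Stand-in for the server-side handler (real version hits my database).
RECORDS = [
    {"id": "a1", "url": "https://example.com/a1", "text": "quarterly sales report"},
    {"id": "b2", "url": "https://example.com/b2", "text": "engineering roadmap"},
]

def search_records(query, limit=5):
    hits = [
        {"id": r["id"], "url": r["url"]}
        for r in RECORDS
        if query.lower() in r["text"].lower()
    ]
    # Return JSON so the tool output can go straight back to the model.
    return json.dumps(hits[:limit])
```

Only the handful of matching blobs would then be sent back for analysis, instead of the entire dataset.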

1 Like

Let me know how you go! I have yet to try anything that large, so I would be interested to know if that approach is successful.

Happy coding!