I have some data in tables that may have 3 or more columns, such as Name|DOB|City|State. How should I go about creating embeddings for such data? Should I create an embedding for each table row together with the header, as below:
Name|DOB|City|State: Sam Walker|1/1/1997|Paducah|KY
or something else?
Try this:
Column name: row value
For example, if you have columns Name and Date:
Name: Brian
Date: 2023-03-23
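A minimal sketch of that format in Python, assuming each row is already available as a dict (the function name row_to_text is just for illustration):

```python
def row_to_text(row: dict) -> str:
    """Render one table row as 'Column name: row value' lines."""
    return "\n".join(f"{column}: {value}" for column, value in row.items())


row = {"Name": "Brian", "Date": "2023-03-23"}
text = row_to_text(row)
print(text)
```

The resulting string is what you would pass to the embedding model, one string per row.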
In this case, how would it establish the relationship between two data points? For example:
Name: Tom
DOB: 01/01/01
Name: Mot
DOB: 10/10/10
If I ask GPT, “what’s DOB for MOT”, how would it know?
Just include a piece of metadata as part of the embedding, like “stats for person 1: Name: Tom, DOB…etc.” GPT won’t have any trouble interpreting that. Assign a unique ID to each person.
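Roughly, that could look like the sketch below. It assumes the rows are dicts and uses a simple running counter as the unique ID; the helper name rows_to_chunks and the "person" label are made up for illustration.

```python
def rows_to_chunks(rows, label="person"):
    """Build one text chunk per row, prefixed with a unique record ID."""
    chunks = []
    for record_id, row in enumerate(rows, start=1):
        fields = ", ".join(f"{column}: {value}" for column, value in row.items())
        # Keeping the ID and all fields in one chunk ties Name and DOB together.
        chunks.append(f"stats for {label} {record_id}: {fields}")
    return chunks


rows = [
    {"Name": "Tom", "DOB": "01/01/01"},
    {"Name": "Mot", "DOB": "10/10/10"},
]
chunks = rows_to_chunks(rows)
for chunk in chunks:
    print(chunk)
```

Because each chunk carries the whole row, a question about Mot's DOB retrieves a chunk that contains both values.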
Asking the same question in a different context. Let’s say that I have a PDF file that may contain multiple tables. Using the Adobe API, I can extract the tables as Excel as well as JSON. How should I go about turning such a table into embeddings? The problem arises when the table has headings and subheadings, as shown in the image.
I used a simple CSV format for the file embedding.
Name,DateOfBirth,City,State
Jake Kirby,1987/05/02,Los Angeles,CA
Annete Madison,1974/07/11,San Francisco,CA
Mike Browning,1992/03/25,Hope,AR
Beth Drupal,1996/10/05,Atlanta,GA
Naomi Miyazaki,2002/12/04,Topeka,KS
Here is the result of the conversation:
user: what is the birthdate of beth drupal?
assistant: The birthdate of Beth Drupal is 1996/10/05.
user: where was she born?
assistant: I couldn't find the answer to that question.
user: what city and state is beth drupal born?
assistant: Beth Drupal was born in Atlanta, GA.
user: how about annete madison? what can you tell me about her?
assistant: I couldn't find the answer to that question.
user: tell me the birthdate of annete madison and what city and state she was born.
assistant: The birthdate of Annete Madison is 1974/07/11. However, I couldn't find information about the city and state she was born in.
user: When was Mike Browning born?
assistant: Mike Browning was born on 1992/03/25.
user: where is the birth place of Mike Browning?
assistant: I couldn't find the answer to that question.
user: what city and state is Mike Browning born?
assistant: Mike Browning was born in Hope, AR.
I just ran into this today. ChatGPT seems to have a hard time recognizing these headers, and the best solution I’ve found so far is to do some preprocessing on that table and combine the headers and sub-headers into one column.
I.e., instead of a header sitting over subheader1 and subheader2, you combine them, so the columns become header-subheader1 and header-subheader2.
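A rough sketch of that preprocessing step, assuming the table has been extracted as two header rows and that a merged header cell shows up as an empty string in the columns it spans (both assumptions; the function name flatten_headers is hypothetical):

```python
def flatten_headers(top_row, sub_row, sep="-"):
    """Combine a header row and a sub-header row into one row of column names."""
    flat = []
    current_top = ""
    for top, sub in zip(top_row, sub_row):
        top, sub = top.strip(), sub.strip()
        if top:
            # A non-empty cell starts a new header group; blanks inherit it.
            current_top = top
        if current_top and sub:
            flat.append(f"{current_top}{sep}{sub}")
        else:
            flat.append(current_top or sub)
    return flat


top_row = ["Name", "header", ""]
sub_row = ["", "subheader1", "subheader2"]
columns = flatten_headers(top_row, sub_row)
print(columns)
```

Note that a blank top cell always inherits the previous header here, so a column that genuinely has no parent header after a merged group would need special handling.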
Let me know if you find anything more elegant!
Normally I add the meaning of each column to the context, and that works fine. Explain the data like a data dictionary. (If you add a description to each column when you create the table, you can automate this task.)
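For illustration, here is a small sketch of putting a data dictionary in front of the table text. The column names come from the CSV example earlier in the thread; the descriptions themselves and the build_context helper are invented for this example.

```python
# Hypothetical descriptions; in practice these could come from column
# comments stored with the table definition.
COLUMN_DESCRIPTIONS = {
    "Name": "Full name of the person",
    "DateOfBirth": "Date of birth, formatted YYYY/MM/DD",
    "City": "City where the person was born",
    "State": "Two-letter US state code for that city",
}


def build_context(descriptions, csv_text):
    """Put a data dictionary in front of the raw table text."""
    lines = ["Column descriptions:"]
    lines += [f"- {name}: {meaning}" for name, meaning in descriptions.items()]
    lines += ["", "Data:", csv_text.strip()]
    return "\n".join(lines)


csv_text = """Name,DateOfBirth,City,State
Beth Drupal,1996/10/05,Atlanta,GA
"""
context = build_context(COLUMN_DESCRIPTIONS, csv_text)
print(context)
```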
I have the same dilemma. It’s impractical to adjust tabular data by hand before creating embeddings, and this is a very common scenario. Does anyone have a good solution? Answers from the AI are unreliable if we fill the vector database with tabular data.
I wonder if you can use the vision capability here: feed each table in as an image and have it summarized. Then use the summaries to pick the right image when a user asks a question, and ask OpenAI to provide the answer from that image.
You mean using the vision capability to recognise that a part of the document is a table? Do you have a particular example?
Keep in mind that data which follows a schema is better kept in a database, and GPT (Custom GPTs and Assistants) is most likely better off using function calling to retrieve the data.
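As a sketch of the database side only, using Python's built-in sqlite3 and two rows from the CSV example earlier in the thread. The table layout and the lookup_person function are assumptions for illustration; how the function is registered with the model for function calling is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, date_of_birth TEXT, city TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Beth Drupal", "1996/10/05", "Atlanta", "GA"),
        ("Mike Browning", "1992/03/25", "Hope", "AR"),
    ],
)


def lookup_person(name):
    """Hypothetical function the model could call instead of searching embeddings."""
    row = conn.execute(
        "SELECT name, date_of_birth, city, state FROM people "
        "WHERE name = ? COLLATE NOCASE",
        (name,),
    ).fetchone()
    if row is None:
        return None
    return dict(zip(("name", "date_of_birth", "city", "state"), row))


print(lookup_person("beth drupal"))
```

The lookup returns the whole row, so follow-up questions about the city or state do not depend on which chunk happened to be retrieved.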
I tried multiple ways to convert tables into a suitable “JSON-like” (text) format so that I can pass them into an embedding model for my RAG operation.
The text for your example would be:
Chunk 1:
"Personal Dtls - Name":"Deepak",
"Info 1-Basic info-Location":"Surat",
"Info 1-Basic info-Gender":"male",
"Info 2- Additional Data-Color":"red",
"Info 2- Additional Data-Date Of Birth":"07/02/1994"
Chunk 2:
"Personal Dtls - Name":"Ajinkya",
"Info 1-Basic info-Location":"KP",
"Info 1-Basic info-Gender":"male",
"Info 2- Additional Data-Color":"red",
"Info 2- Additional Data-Date Of Birth":"07/02/1994"
This works well for queries whose answer sits in a single chunk (e.g. what’s Deepak’s gender? or when was Ajinkya born?).
This method fails for queries that require information from multiple chunks (e.g. how many males were born with red hair? or how many people were born on 07/02/1994?).
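Questions like these are counts over every row, so retrieving a few chunks is not enough. One possible workaround, in line with the database suggestion earlier in the thread, is to keep the rows in structured form and compute the answer directly. A minimal sketch, assuming the rows are kept as dicts with the same flattened keys (the helper count_matching is hypothetical, and only the fields needed for the counts are shown):

```python
records = [
    {
        "Info 1-Basic info-Gender": "male",
        "Info 2- Additional Data-Color": "red",
        "Info 2- Additional Data-Date Of Birth": "07/02/1994",
    },
    {
        "Info 1-Basic info-Gender": "male",
        "Info 2- Additional Data-Color": "red",
        "Info 2- Additional Data-Date Of Birth": "07/02/1994",
    },
]


def count_matching(records, conditions):
    """Count the rows whose fields equal every value in `conditions`."""
    return sum(
        all(record.get(key) == value for key, value in conditions.items())
        for record in records
    )


males_with_red = count_matching(
    records,
    {"Info 1-Basic info-Gender": "male", "Info 2- Additional Data-Color": "red"},
)
born_same_day = count_matching(
    records, {"Info 2- Additional Data-Date Of Birth": "07/02/1994"}
)
print(males_with_red, born_same_day)
```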