Creating embeddings of tabular data

chinmay1 · March 23, 2023, 7:47pm

or something else?

wfhbrian · March 23, 2023, 7:53pm

Try this:

Column name: row value

For example, if you have columns Name and Date:

Name: Brian
Date: 2023-03-23
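
A minimal sketch of that format in Python, assuming each row is already available as a dict keyed by column name. The helper name `row_to_text` is made up for illustration; the resulting string is what you would pass to your embedding model.

```python
def row_to_text(row: dict) -> str:
    """Render one table row as 'Column name: row value' lines."""
    return "\n".join(f"{column}: {value}" for column, value in row.items())


text = row_to_text({"Name": "Brian", "Date": "2023-03-23"})
print(text)
```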

chinmay1 · March 24, 2023, 12:16am

In this case, how would it establish relationship between two data points? For example:
Name: Tom
DOB: 01/01/01
Name: Mot
DOB: 10/10/10
If I ask GPT, “what’s DOB for MOT”, how would it know?

lmccallum · March 24, 2023, 12:22am

Just include a piece of metadata as part of the text you embed, like “stats for person 1: Name: Tom, DOB…etc.” GPT won’t have any trouble interpreting that. Assign a unique ID to each person, so that every record keeps its name and DOB together.
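
One way this could look, as a sketch: each row becomes its own text with an ID prefix, so the name and DOB of one person are embedded together. The function name `rows_to_documents` and the `{"id", "text"}` output shape are assumptions, not part of any library.

```python
def rows_to_documents(rows, entity="person"):
    """Turn each row into one self-contained text, prefixed with a unique ID."""
    documents = []
    for record_id, row in enumerate(rows, start=1):
        fields = ", ".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"id": record_id, "text": f"Stats for {entity} {record_id}: {fields}"}
        )
    return documents


rows = [
    {"Name": "Tom", "DOB": "01/01/01"},
    {"Name": "Mot", "DOB": "10/10/10"},
]
docs = rows_to_documents(rows)
for doc in docs:
    print(doc["text"])
```

Embedding one document per row means a question about Mot should retrieve a text that contains both “Mot” and the matching DOB.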

chinmay1 · July 6, 2023, 4:29am

I’m asking the same question in a different context. Let’s say I have a PDF file that may contain multiple tables. Using the Adobe API, I can extract the tables as Excel as well as JSON. How should I go about turning such a table into embeddings? The problem arises when the table has headings and subheadings, as shown in the image.

supershaneski · July 6, 2023, 6:03am

I used a simple CSV format for the file embedding.

Name,DateOfBirth,City,State
Jake Kirby,1987/05/02,Los Angeles,CA
Annete Madison,1974/07/11,San Francisco,CA
Mike Browning,1992/03/25,Hope,AR
Beth Drupal,1996/10/05,Atlanta,GA
Naomi Miyazaki,2002/12/04,Topeka,KS

Here is the result of the conversation:

user: what is the birthdate of beth drupal?
assistant: The birthdate of Beth Drupal is 1996/10/05.
user: where was she born?
assistant: I couldn't find the answer to that question.
user: what city and state is beth drupal born?
assistant: Beth Drupal was born in Atlanta, GA.
user: how about annete madison? what can you tell me about her?
assistant: I couldn't find the answer to that question.
user: tell me the birthdate of annete madison and what city and state she was born.
assistant: The birthdate of Annete Madison is 1974/07/11. However, I couldn't find information about the city and state she was born in.
user: When was Mike Browning born?
assistant: Mike Browning was born on 1992/03/25.
user: where is the birth place of Mike Browning?
assistant: I couldn't find the answer to that question.
user: what city and state is Mike Browning born?
assistant: Mike Browning was born in Hope, AR.

zachary.suzuki · July 7, 2023, 6:25am

I just ran into this today. ChatGPT seems to have a hard time recognizing these headers, and the best solution I’ve found so far is to do some preprocessing on that table and combine the headers and sub-headers into one column.

i.e. instead of header being over subheader1 and subheader2, you have to combine them, and it becomes header-subheader1 and header-subheader2.
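
A rough sketch of that preprocessing step, assuming the extracted table gives you the header rows as lists of strings and that a merged header cell exports as the label followed by blank cells to its right. The helper `flatten_headers` is hypothetical, and real exports may lay out merged cells differently.

```python
def flatten_headers(header_rows, sep="-"):
    """Combine stacked header rows into one combined name per column."""
    filled_rows = []
    for depth, row in enumerate(header_rows):
        is_last = depth == len(header_rows) - 1
        filled, previous = [], ""
        for cell in row:
            label = cell.strip()
            # A blank cell in an upper row is assumed to continue the merged
            # header to its left. The bottom row is left untouched.
            if not label and not is_last:
                label = previous
            filled.append(label)
            previous = label
        filled_rows.append(filled)
    # Join the non-empty labels of each column from top to bottom.
    return [sep.join(part for part in parts if part) for parts in zip(*filled_rows)]


header_rows = [
    ["Name", "header", ""],
    ["", "subheader1", "subheader2"],
]
columns = flatten_headers(header_rows)
print(columns)
```

The combined names can then be used as ordinary column names when writing each row out as text.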

Let me know if you find anything more elegant!

fagnermoura · December 4, 2023, 11:33am

Normally I add the meaning of each column to the context, and that works fine. Explain the data like a data dictionary (if you add a description to each column when you create the table, you could automate this task).
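
As an illustration, the data dictionary can be prepended to the rows when building the context. This is only a sketch: the column descriptions below are invented for the CSV example earlier in the thread, and `build_context` is a hypothetical helper.

```python
# Illustrative descriptions; in practice these would come from your schema.
DATA_DICTIONARY = {
    "Name": "Full name of the person",
    "DateOfBirth": "Date of birth, written as YYYY/MM/DD",
    "City": "City where the person was born",
    "State": "Two-letter code of the state where the person was born",
}


def build_context(data_dictionary, rows):
    """Put the column descriptions ahead of the rows in one context string."""
    lines = ["Column descriptions:"]
    lines += [f"- {column}: {meaning}" for column, meaning in data_dictionary.items()]
    lines += ["", "Rows:"]
    for row in rows:
        lines.append(", ".join(f"{column}: {row[column]}" for column in data_dictionary))
    return "\n".join(lines)


rows = [
    {"Name": "Beth Drupal", "DateOfBirth": "1996/10/05", "City": "Atlanta", "State": "GA"},
]
context = build_context(DATA_DICTIONARY, rows)
print(context)
```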

mymetricsoftware · December 27, 2023, 12:03am

I have the same dilemma. It’s impractical to adjust tabular data by hand before creating embeddings, and this is a very common scenario. Does anyone have a good solution? Answers from the AI are unreliable if we fill a vector database with tabular data.

chinmay1 · January 19, 2024, 7:32pm

I wonder if you can use the Vision capability here: feed each table in as an image and have it summarized. Then use the summaries to pick the right image when a user asks a question, and ask OpenAI to provide the answer.

mymetricsoftware · February 6, 2024, 7:21pm

You mean that with the vision capability we recognise which part of the document is a table? Do you have a particular example?

anon10827405 · February 6, 2024, 8:01pm

Keep in mind that data which follows a schema is better kept in a database, and GPT (Custom GPTs and Assistants) is most likely better off using function calling to retrieve the data.
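
A sketch of the database side of that idea, using Python’s built-in `sqlite3` and the sample rows from earlier in the thread. The table layout and the `lookup_person` function are assumptions for illustration; wiring the function up as a tool for function calling is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, date_of_birth TEXT, city TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Jake Kirby", "1987/05/02", "Los Angeles", "CA"),
        ("Annete Madison", "1974/07/11", "San Francisco", "CA"),
        ("Beth Drupal", "1996/10/05", "Atlanta", "GA"),
    ],
)


def lookup_person(name):
    """Hypothetical function the model could call instead of searching embeddings."""
    row = conn.execute(
        "SELECT name, date_of_birth, city, state FROM people "
        "WHERE name = ? COLLATE NOCASE",
        (name,),
    ).fetchone()
    if row is None:
        return None
    return dict(zip(("name", "date_of_birth", "city", "state"), row))


result = lookup_person("beth drupal")
print(result)
```

With this approach the model only has to extract the name from the question; the lookup itself is exact rather than similarity-based.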

sshirgaonkar20comp · September 27, 2024, 7:40am

I tried multiple ways to convert tables into a suitable “JSON-like” (text) format that I can pass to an embedding model for my RAG operation.

The text for your example would be :
Chunk 1 :-
“Personal Dtls - Name”:“Deepak”,
“Info 1-Basic info-Location”:“Surat”,
“Info 1-Basic info-Gender”:“male”,
“Info 2- Additional Data-Color”:“red”,
“Info 2- Additional Data-Date Of Birth”:“07/02/1994”
Chunk 2 :-
“Personal Dtls - Name”:“Ajunkya”,
“Info 1-Basic info-Location”:“KP”,
“Info 1-Basic info-Gender”:“male”,
“Info 2- Additional Data-Color”:“red”,
“Info 2- Additional Data-Date Of Birth”:“07/02/1994”

This works well for queries that retrieve from a single chunk (e.g. what’s Deepak’s gender? or when was Ajinkya born?).
This method fails when you ask queries that require information from multiple chunks (e.g. how many males were born with red hair? or how many people were born on 07/02/1994?).
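
One possible workaround for the multi-chunk case, sketched under the assumption that the original records are still available outside the vector store: compute counts over all rows directly instead of relying on retrieved chunks. `count_matching` is a hypothetical helper, and the keys are the flattened column names from the chunks above.

```python
records = [
    {
        "Personal Dtls - Name": "Deepak",
        "Info 1-Basic info-Location": "Surat",
        "Info 1-Basic info-Gender": "male",
        "Info 2- Additional Data-Color": "red",
        "Info 2- Additional Data-Date Of Birth": "07/02/1994",
    },
    {
        "Personal Dtls - Name": "Ajunkya",
        "Info 1-Basic info-Location": "KP",
        "Info 1-Basic info-Gender": "male",
        "Info 2- Additional Data-Color": "red",
        "Info 2- Additional Data-Date Of Birth": "07/02/1994",
    },
]


def count_matching(records, conditions):
    """Count the records whose fields equal every value in `conditions`."""
    return sum(
        all(record.get(column) == value for column, value in conditions.items())
        for record in records
    )


males_with_red = count_matching(
    records,
    {"Info 1-Basic info-Gender": "male", "Info 2- Additional Data-Color": "red"},
)
born_on_date = count_matching(
    records, {"Info 2- Additional Data-Date Of Birth": "07/02/1994"}
)
print(males_with_red, born_on_date)
```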
