Preparing complex data for embedding that is originally in JSON

Hi, I have a bunch of data I want to embed. The data is originally in JSON format, and describes a lot of different items with the same kinds of attributes but in different amounts/distributions. Think like “Sword: 20 damage, 15 Stamina, 10 Strength, 20 Speed, 1 Intelligence, 3 Agility”.

I’ve tried converting it into natural language with a script that outputs text like this: “The sword has very high amounts of damage, high stamina, average Strength, excellent Speed, abysmal intelligence and bad Agility”. My aim is for a prompt like “give me an item with some of the best speed but bad intelligence” to return an item that fits these criteria. But this didn’t work out at all. My theory is that it is struggling because all of the data is formatted in such a similar way (all items have the same attributes, so “strength, stamina…” is mentioned in every single piece of data).
In reality, the items have a lot more stats, about 30-40 in total; I just simplified it here.

So in my next attempt, I formatted the data so that it only mentions the key strengths and weaknesses of each item and leaves out the average stats. For this sword, it would be something like: “The sword has very high amounts of damage, high Stamina, excellent speed and abysmal intelligence”. Another item with mediocre Strength will not have Strength in its text at all, etc.
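
A minimal sketch of what such a script could look like. The thresholds, the wording and the `describe` function are assumptions made up for illustration, not the actual script or its real values:

```python
# Hypothetical thresholds and adjectives, chosen only for this example.
def describe(name, stats, lo=5, hi=15):
    """Mention only stats that fall outside the 'average' band [lo, hi]."""
    parts = []
    for stat, value in stats.items():
        if value > hi:
            parts.append(f"high {stat}")
        elif value < lo:
            parts.append(f"low {stat}")
        # Values inside [lo, hi] count as average and are left out.
    if not parts:
        return f"The {name} has average stats."
    return f"The {name} has " + ", ".join(parts) + "."


sword = {"damage": 20, "stamina": 15, "strength": 10,
         "speed": 20, "intelligence": 1, "agility": 3}
text = describe("sword", sword)
print(text)
```

With these example thresholds, stamina (15) and strength (10) count as average and are dropped from the sentence.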

But even this way, it is not returning accurate results. I am also noticing that my most similar item usually scores less than 10% above the least similar items (so the most similar might have a score of 79% and the least similar 72%). Is there something wrong with the way I am formatting my data? What would be a good way to turn my JSON data into natural language? I guess a last resort would be calling the ChatGPT API on every JSON item to turn it into natural language, but that would be very pricey with the amount of data.

Hi @Vaughn, I am also trying to do something similar. I have JSON data I want to embed and am not sure what the best approach is, as constructing English sentences will probably not work for me. Have you tried embedding the whole JSON data as is? Any other learnings you want to share, please? Thanks

The problem with embeddings is that they just return text based on similarity to your prompt, like Google but for your own data.
You may be able to achieve your goal by fine-tuning on your embedded data.

I also faced a similar problem. I think we’ll have to pre-process the complex data first. It starts with feature engineering, i.e. identifying the features that are important to the use case.

Once those are identified, they need to be cleaned and normalized using techniques like min-max scaling for integer values or one-hot encoding for categorical values. Text values can be embedded normally using the embedding models.
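
As a rough sketch of those two steps in plain Python (the helper names `min_max_scale` and `one_hot`, and the sample values, are assumptions for illustration only):

```python
def min_max_scale(values):
    """Rescale numbers to [0, 1] using the min and max of the given list."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(value, categories):
    """Return a vector with 1.0 at the position of `value`, 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]


# Hypothetical data: the speed stat of three items and an item-type field.
speeds = [20, 5, 10]
scaled = min_max_scale(speeds)

kinds = ["sword", "bow", "staff"]
encoded = one_hot("bow", kinds)
print(scaled, encoded)
```

Scaling each stat over the whole dataset puts all 30-40 stats on a comparable range, so no single stat dominates just because its raw numbers are larger.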

After this process, we’ll get a vector for each item which can then be used for similarity search. But a bit of fine-tuning is required to get this working.
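
One possible way to use such vectors, sketched under assumptions: the items, their already-scaled values and the hand-built query vector below are all hypothetical, and cosine similarity is just one common choice of comparison.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical items, stats already scaled to [0, 1]: [speed, intelligence]
items = {
    "sword": [1.0, 0.05],
    "staff": [0.2, 0.9],
    "dagger": [0.9, 0.5],
}

# Hand-built query for "best speed but bad intelligence".
query = [1.0, 0.0]

best = max(items, key=lambda name: cosine(query, items[name]))
print(best)
```

Here the query is expressed directly in the same numeric space as the items, so turning a free-text prompt into such a vector would still need a separate step.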

Note: this answer is based on my findings and experiments so far. Don’t know if there is a better way.