Optimize natural-language data creation from JSON

Hi, I am using embeddings (text-embedding-ada-002) to inject football player data into ChatGPT so it can answer questions. The results are okay, but I am not completely happy. For example, it often retrieves players who don’t play in the specified position: when I ask for a left-back, it gives me left-footed players who are not necessarily left-backs. For other information, such as strong foot or nationality, it works basically perfectly. I think the way to improve performance here is to improve the natural-language data that I am using for the embeddings.

The original data I have access to is in JSON format and I wrote a script to transform it into natural language. In the JSON, a player might be represented like this:

[
  {
    "id": "2000020333",
    "name": "Olivier Aertssen",
    "age": "17",
    "currentAbility": 95,
    "potentialAbility": 160,
    "club": "Ajax",
    "nationalities": [
      "NED"
    ],
    "positions": [
      "D (C)",
      "DM"
    ],
    "askingPrice": "€9.75M",
    "contractLength": "30/6/2025",
    "personality": "Balanced",
    "searchString": "Olivier Aertssen",
    "attributes": {
      "technicals": {
        "crossing": 6,
        "corners": 5,
        "firstTouch": 10,
        "finishing": 5,
        "dribbling": 11,
        "heading": 11,
        "freekicks": 7,
        "marking": 9,
        "longThrow": 6,
        "longshots": 6,
        "passing": 13,
        "penalties": 7,
        "tackling": 10,
        "technique": 10
      },
      "mentals": {
        "workrate": 13,
        "vision": 12,
        "teamwork": 12,
        "positioning": 12,
        "offTheBall": 11,
        "leadership": 8,
        "flair": 12,
        "determination": 12,
        "decisions": 11,
        "concentration": 11,
        "composure": 11,
        "bravery": 12,
        "anticipation": 10,
        "aggression": 10
      },
      "physicals": {
        "acceleration": 11,
        "agility": 11,
        "balance": 9,
        "jumpingReach": 12,
        "naturalFitness": 11,
        "pace": 10,
        "strength": 12,
        "stamina": 13
      }
    }
  }
]

which I then turn into this natural-language text to embed:

Olivier Aertssen is an Either-footed player currently playing for Ajax. His nationality is NED and his preferred positions are Central Defender. The player who just appeared on people’s radar fetches a decent price of €9.75M. Still being very young and even a teenager at 17 years, his current ability is not up to par and he is considered to be a real talent and a wonderkid. He has a lot of potential left to grow. He shapes the game with his pinpoint accurate passes. He can play great opening passes, get lots of assists and play key passes. He has lots of bravery which is why he generally doesn’t back down from challenges. He has good levels of determination and works hard to achieve his goals on and off the pitch. He shows his flair on the ball regularly, doing unpredictable moves and passes. He has good defensive positioning and leaves little room for attackers. His workrate is really admirable. He runs up and down the field the whole game and is willing to go the extra mile. He has really impressive stamina and is a proper workhorse. He will still look fit at the end of the game and won’t have to be subbed off. Finishing really isn’t one of his strong suits. His shots often miss the goal. He shouldn’t take corner kicks under any circumstance.

Basically, I’m looking at his strengths and weaknesses and generating one or two sentences about each of them. Does anyone have suggestions on how I could improve this data to get better results from my embeddings? More specifically, how can I make the embeddings put more emphasis on the player’s position instead of ignoring it?
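For readers who want to picture the strengths-and-weaknesses step, here is a minimal Python sketch of it. It is not the original script: the thresholds `STRONG` and `WEAK`, the template dictionaries, and the function name `describe_attributes` are all assumptions made for illustration, and the trimmed `player` dict only mirrors the shape of the JSON above.

```python
# Hypothetical thresholds on the attribute scale; tune them to your data.
STRONG = 13
WEAK = 5

# Hypothetical sentence templates, keyed by attribute names from the JSON.
STRONG_TEXT = {
    "passing": "He shapes the game with accurate passes.",
    "stamina": "He has impressive stamina.",
    "workrate": "His work rate is admirable.",
}
WEAK_TEXT = {
    "finishing": "Finishing is not one of his strong suits.",
    "corners": "He should not take corner kicks.",
}


def describe_attributes(player: dict) -> list[str]:
    """Return one sentence per notably strong or weak attribute."""
    sentences = []
    # Walk every attribute group (technicals, mentals, physicals).
    for group in player["attributes"].values():
        for name, value in group.items():
            if value >= STRONG and name in STRONG_TEXT:
                sentences.append(STRONG_TEXT[name])
            elif value <= WEAK and name in WEAK_TEXT:
                sentences.append(WEAK_TEXT[name])
    return sentences


# Trimmed-down player record with the same structure as the JSON above.
player = {
    "attributes": {
        "technicals": {"corners": 5, "finishing": 5, "passing": 13},
        "mentals": {"workrate": 13},
        "physicals": {"stamina": 13},
    }
}
print(" ".join(describe_attributes(player)))
```

Keeping the templates in dictionaries makes it easy to see which attributes produce text at all, and to shorten or reorder sentences without touching the loop.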

You can try either using the full position name instead of the short form, or adding a prompt to the NL generator in which you provide the full form of every position and then have it generate the description.

The first approach is likely to be the better one in most cases.
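The first approach, expanding each short form into its full name, might look roughly like the sketch below. The `POSITION_NAMES` mapping and the function `position_sentence` are assumptions for illustration, and the full names are one plausible reading of the codes rather than an official list. Note that the sketch names every entry in `positions`, whereas the sample text above mentions only "Central Defender" even though the JSON also lists "DM".

```python
# Assumed mapping from position codes to full names; extend it to cover
# every code that occurs in your data.
POSITION_NAMES = {
    "GK": "goalkeeper",
    "D (L)": "left-back",
    "D (C)": "central defender",
    "D (R)": "right-back",
    "DM": "defensive midfielder",
}


def position_sentence(name: str, codes: list[str]) -> str:
    """Build a sentence naming every position the player can play."""
    if not codes:
        return f"{name} has no listed position."
    # Fall back to the raw code so unknown positions are not silently dropped.
    names = [POSITION_NAMES.get(code, code) for code in codes]
    if len(names) == 1:
        listed = names[0]
    else:
        listed = ", ".join(names[:-1]) + " and " + names[-1]
    return f"{name} plays as a {listed}."


print(position_sentence("Olivier Aertssen", ["D (C)", "DM"]))
```

Falling back to the raw code for unmapped positions keeps the information in the text, so a missing dictionary entry shows up as an odd-looking sentence instead of a silently dropped position.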

I am already doing that; see how “D (C)” becomes “Central Defender” in the natural-language text.