Should I use YAML or JSON for embeddings text?

Should I format my input data as human-readable YAML?
Product: iPhone
Review: I dont like the new version

or machine-readable JSON?
"Product": "iPhone",
"Review": "I dont like the new version"

Does the embedding vector change based on the data format?

Hi and welcome to the developer forum!

You pay per token, so the less of them you have, the cheaper it will be.
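To get a rough feel for the size difference, here is a minimal sketch that serializes the record from the question both ways and compares lengths. Character count is only a rough proxy for token count, since the real number depends on the model's tokenizer. The helper names `to_yaml_like` and `to_json` are made up for illustration.

```python
import json

# The example record from the question.
record = {"Product": "iPhone", "Review": "I dont like the new version"}

def to_yaml_like(rec):
    # Plain "key: value" lines, as in the YAML example above.
    return "\n".join(f"{key}: {value}" for key, value in rec.items())

def to_json(rec):
    # Standard JSON object with quotes, braces and commas.
    return json.dumps(rec)

yaml_text = to_yaml_like(record)
json_text = to_json(record)

# Character counts are a proxy only; use the model's tokenizer for real numbers.
print("YAML-like chars:", len(yaml_text))
print("JSON chars:", len(json_text))
```

The JSON form carries extra quotes and braces, so it comes out longer for the same content.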

How you format the data will of course change the vectors, but by how much is something you will need to determine by experiment. Create a subset of your data, embed it in both formats, run some predefined queries against each, and see which gives you the best results.
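The experiment described above could be sketched roughly like this. Note that `toy_embed` is a stand-in (character trigram counts), not a real embedding model, and the second record and both queries are made-up test data. To run a real comparison, pass a function that calls your embedding API as `embed`, with a similarity function that matches its output.

```python
import json
import math
from collections import Counter

def toy_embed(text):
    # Stand-in embedder: counts of lowercase character trigrams. NOT a real model.
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def as_yaml(rec):
    return "\n".join(f"{k}: {v}" for k, v in rec.items())

def as_json(rec):
    return json.dumps(rec)

def top_match(query, records, fmt, embed=toy_embed):
    # Index of the record whose formatted text is most similar to the query.
    q = embed(query)
    scores = [cosine(q, embed(fmt(r))) for r in records]
    return max(range(len(records)), key=scores.__getitem__)

# Small hand-labelled subset: each query maps to the index of the expected record.
records = [
    {"Product": "iPhone", "Review": "I dont like the new version"},
    {"Product": "Pixel", "Review": "Great camera and battery life"},
]
queries = {"unhappy with the new iPhone version": 0, "phone with a great camera": 1}

def accuracy(fmt, embed=toy_embed):
    # Fraction of queries whose top match is the expected record.
    hits = sum(top_match(q, records, fmt, embed) == want for q, want in queries.items())
    return hits / len(queries)

print("yaml:", accuracy(as_yaml), "json:", accuracy(as_json))
```

With real embeddings and a larger labelled subset, whichever format scores higher on your own queries is the one to keep.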


I had the same experience: mostly it makes no difference how you structure your data. GPT focuses mostly on the contents, not the structure of the data.

My experience is somewhat different. GPT will actually skip some of your items.
If you have 100 items like this:
product name:xxx
product state:xxx
and you ask 100 questions in one prompt, one for each item,

you will get fewer than 100 answers. GPT will skip some of them, and I don't know why.

So the problem you will probably have is not the data structure, but how GPT understands your question and gives back answers.


Thanks, @davidcurious977! I am planning to use embeddings, not prompts. Do embeddings focus on the content rather than the structure of the data, the way GPT prompts do?