What data format to use for an embedding model

Hi folks.

I have decided to start learning about embeddings. My plan is to start with GPT4All and the nomic-embed-text model it provides.

Everywhere I read that “you can feed pdf, xls, doc, your mother… whatever” to an embedding model. But should I?

For example, does it help the embedding model if I arrange my data as simple XML?

<xml>
<data>
<date>12345</date>
<text>foo barr is funny thing</text>
</data>
<data>
<date>54321</date>
...
..

Or maybe in an even simpler form:

START DATA:
date: 12345
text: foo barr is funny thing
END DATA:
START DATA:
date:
...
...

Or does it really not matter?

Yes, it does matter.

You’d have to understand the underlying model to answer this question in detail. Typically, the structure of what you embed should match the structure of whatever you’re comparing it with. If a transformation is necessary, a general rule of thumb is to use Markdown, although if your data is structured, XML may make more sense.

Since you’re learning, it’s a great time to tinker, find out, and report back your results.
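
If it helps, here is a rough sketch of how such an experiment could be set up: render the same record as XML, as Markdown, and as plain text, embed each version together with a query, and compare the cosine similarities. Everything in it is an assumption for illustration. The names `to_xml`, `to_markdown`, `to_plain`, `toy_embed` and `compare_formats` are hypothetical, and `toy_embed` is only a hashed bag-of-words stand-in so the script runs without downloading a model. Its scores say nothing about how nomic-embed-text behaves; for real results you would pass in your actual model's embedding function.

```python
import math
import re
import zlib


def to_xml(record):
    # Render one record in the XML layout from the question.
    return (f"<data><date>{record['date']}</date>"
            f"<text>{record['text']}</text></data>")


def to_markdown(record):
    # Render the same record as a small Markdown list.
    return f"- date: {record['date']}\n- text: {record['text']}"


def to_plain(record):
    # Only the free text, with no markup at all.
    return record["text"]


def toy_embed(text, dims=64):
    # Stand-in embedder (an assumption, NOT a real model):
    # counts lowercase alphanumeric tokens into hashed buckets.
    vec = [0.0] * dims
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % dims] += 1.0
    return vec


def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def compare_formats(record, query, embed=toy_embed):
    # Embed the query once, then score each rendering of the record.
    query_vec = embed(query)
    renderings = {
        "xml": to_xml(record),
        "markdown": to_markdown(record),
        "plain": to_plain(record),
    }
    return {name: cosine(embed(text), query_vec)
            for name, text in renderings.items()}


if __name__ == "__main__":
    record = {"date": "12345", "text": "foo barr is funny thing"}
    for name, score in compare_formats(record, "funny thing").items():
        print(f"{name:10s} {score:.3f}")
```

To try this with GPT4All, you could pass a real embedding function as the `embed` argument, for example the `embed` method of an `Embed4All` instance from the `gpt4all` Python package (check its documentation for how to select nomic-embed-text). Running it over a handful of realistic records and queries should show fairly quickly whether the markup helps or hurts for your data.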

OK, thanks. I got similar responses elsewhere, so I’ll stick with XML and consider Markdown.