Data to use for embed model

krisu.virtanen · February 18, 2025, 4:39pm

Hi folks.

I've been thinking about learning embeddings. My plan is to start with GPT4All and the nomic-embed-text model that comes with it.

Everywhere it is said that “You can feed pdf, xls, doc, your mother…whatever” to an embedding model. But should I?

For example, does it help the embedding model if I arrange my data as simple XML?

<xml>
<data>
<date>12345</date>
<text>foo barr is funny thing</text>
</data>
<data>
<date>54321</date>
...
..

Or maybe in an even simpler form:

START DATA:
date: 12345
text: foo barr is funny thing
END DATA:
START DATA:
date:
...
...

Or does it really not matter?
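For concreteness, here is a minimal Python sketch that renders the same record in both layouts, so each version can be passed to the embedding model and compared. The field names (`date`, `text`) and values come from the examples above; the helper names `to_xml` and `to_plain` are made up for illustration.

```python
from xml.sax.saxutils import escape

# One sample record, using the placeholder values from the post.
records = [
    {"date": "12345", "text": "foo barr is funny thing"},
]


def to_xml(record):
    """Render one record as a small XML fragment (values are XML-escaped)."""
    return (
        "<data>\n"
        f"<date>{escape(record['date'])}</date>\n"
        f"<text>{escape(record['text'])}</text>\n"
        "</data>"
    )


def to_plain(record):
    """Render one record in the simpler START DATA / END DATA layout."""
    return (
        "START DATA:\n"
        f"date: {record['date']}\n"
        f"text: {record['text']}\n"
        "END DATA:"
    )


# One chunk per record; each chunk is what would be sent to the embedder.
xml_chunks = [to_xml(r) for r in records]
plain_chunks = [to_plain(r) for r in records]

print(xml_chunks[0])
print(plain_chunks[0])
```

Both lists hold the same information; only the wrapping differs, which is exactly the variable in question.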

mat.eo · February 18, 2025, 7:23pm

Yes, it does matter.

You’d have to understand the underlying model to answer this question. Typically, the structure should match whatever you’re comparing against. If a transformation is necessary, a general rule of thumb is to use Markdown, although if your data is structured, XML may make more sense.

Since you’re learning, it’s a great time to tinker, find out, and report back with your results.
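One way to run that kind of experiment is to embed the same record in each format, embed a query, and compare cosine similarities. The sketch below is only an illustration: `compare_formats` and `toy_embed` are hypothetical names, and `toy_embed` is a stand-in (a hashed bag of words, not a real model) so the script runs without anything installed. With GPT4All, the assumption is that you would pass a real embedding function instead, for example the `embed` method of `Embed4All` from the `gpt4all` Python package.

```python
import math
import re
import zlib


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_embed(text, dim=64):
    """Stand-in embedder: hashed bag of words. NOT a real embedding model."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def compare_formats(variants, query, embed):
    """Return {format name: similarity between the query and that variant}."""
    query_vec = embed(query)
    return {name: cosine(query_vec, embed(text)) for name, text in variants.items()}


# The same record written three ways.
variants = {
    "xml": "<data><date>12345</date><text>foo barr is funny thing</text></data>",
    "plain": "date: 12345\ntext: foo barr is funny thing",
    "markdown": "- **date**: 12345\n- **text**: foo barr is funny thing",
}

scores = compare_formats(variants, "funny thing", toy_embed)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

The toy scores mean nothing in themselves; the point is the plumbing. Swapping `toy_embed` for a real embedder and running several realistic queries should show whether the format shifts the ranking for your data.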

krisu.virtanen · February 18, 2025, 7:33pm

OK, thanks. I got similar responses elsewhere, so I'll stick with XML and consider Markdown.
