How do you train with non-text data?

Hi folks,
So far, all the training/fine-tuning I’ve read about involves simple data in plain text, CSV, etc.
I was wondering how training can be done with more complex data like an XML document. For instance, training to extract specific information from XML documents, where the structure can vary a bit.
How are datasets like these created?


GPT-3 can parse text, so first have it extract the kind of material you’re looking for from the XML with a natural-language instruction. The parameters matter, so look for examples in the Playground that you can adapt.

Or you could skip that pre-processing step altogether and just see if GPT-3 can do what you want with the raw XML itself.
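For example, something like this rough sketch, using the legacy Completions endpoint from the GPT-3-era openai-python (v0.x) library. The model name, prompt wording, sample XML, and parameters are all placeholders to tune in the Playground:

```python
# Sketch: ask GPT-3 to pull fields straight out of raw XML.
import openai

openai.api_key = "YOUR_API_KEY"  # set your own key

xml_snippet = """<order id="1042">
  <customer>Jane Doe</customer>
  <item sku="A-7">Widget</item>
  <total currency="USD">19.99</total>
</order>"""

prompt = (
    "Extract the customer name and order total from the XML below.\n\n"
    f"{xml_snippet}\n\n"
    "Customer:"
)

response = openai.Completion.create(
    model="text-davinci-002",  # any GPT-3 completion model
    prompt=prompt,
    max_tokens=60,
    temperature=0,             # deterministic output suits extraction
)
print(response["choices"][0]["text"])
```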

I’m pretty sure fine-tuning is only good for applications that have a really patterned response, along the lines of “this is my desired input, this is my desired output.” If there’s some kind of regular response format you have in mind, just feed the input/output pairs to the fine-tuning endpoint. I don’t know what the length limit is for the examples, but again, you could just pass the XML and state what you want from it.
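For instance, a fine-tuning file might look something like this, a sketch of the prompt/completion JSONL format the GPT-3 fine-tuning endpoint takes. The XML pairs here are made up, and the separator and stop tokens are common conventions, not requirements:

```python
# Sketch: build a prompt/completion JSONL file for GPT-3 fine-tuning.
import json

examples = [
    {
        "prompt": '<order id="1"><total>19.99</total></order>\n\n###\n\n',
        "completion": " total=19.99 END",
    },
    {
        "prompt": '<invoice no="7"><amount>42.50</amount></invoice>\n\n###\n\n',
        "completion": " total=42.50 END",
    },
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Then upload with the CLI, e.g.:
#   openai api fine_tunes.create -t train.jsonl -m curie
```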

The only big limitation with GPT-3, I think, is the length of the input. Whatever you’re doing, you have to find a way to break it into pieces GPT-3 can work with. It can’t parse a 100-page document, but it can parse, classify, etc., one page at a time, for example. That’s the only slightly cumbersome aspect of designing with GPT-3, in my opinion (so far).
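A rough chunking helper like this is one way to handle it. This sketch splits on paragraphs with a character budget; a real pipeline would count tokens instead:

```python
# Sketch: break a long document into pieces that fit the context window.
def chunk_text(text, max_chars=4000):
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # start a new chunk if adding this paragraph would overflow
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then go through the same extraction prompt,
# one call per chunk, with the results merged afterwards.
```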