How do you train with non-text data?

Hi folks,
So far, all the training/fine-tuning I’ve read about involves simple data in plain text, CSV, etc.
I was wondering how training can be done with more complex data like an XML document. For instance, training to extract specific information from XML documents, where the structure can vary a bit.
How are datasets like these created?


GPT-3 can parse text, so first have it extract the kind of material you’re looking for from the XML with a natural-language instruction. The parameters matter, so look for examples in the Playground that you can adapt.

Or you could skip that pre-processing step altogether and just see if GPT-3 can do what you want with the raw XML itself.
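For example, something like this rough sketch, using the legacy Completions endpoint from the GPT-3-era openai-python (v0.x) library. The model name, prompt wording, sample XML, and parameters are all placeholders to tune in the Playground:

```python
# Sketch: ask GPT-3 to pull fields straight out of raw XML.
import openai

openai.api_key = "YOUR_API_KEY"  # set your own key

xml_snippet = """<order id="1042">
  <customer>Jane Doe</customer>
  <item sku="A-7">Widget</item>
  <total currency="USD">19.99</total>
</order>"""

prompt = (
    "Extract the customer name and order total from the XML below.\n\n"
    f"{xml_snippet}\n\n"
    "Customer:"
)

response = openai.Completion.create(
    model="text-davinci-002",  # any GPT-3 completion model
    prompt=prompt,
    max_tokens=60,
    temperature=0,             # deterministic output suits extraction
)
print(response["choices"][0]["text"])
```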

I’m pretty sure fine-tuning is only good for applications that have a really patterned response, along the lines of “this is my desired input, this is my desired output.” If there’s some kind of regular response format you have in mind, just feed the input/output pairs to the fine-tuning endpoint. I don’t know what the length limit is for the examples, but again, you could just pass the XML and state what you want from it.
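For instance, a fine-tuning file might look something like this, a sketch of the prompt/completion JSONL format the GPT-3 fine-tuning endpoint takes. The XML pairs here are made up, and the separator and stop tokens are common conventions, not requirements:

```python
# Sketch: build a prompt/completion JSONL file for GPT-3 fine-tuning.
import json

examples = [
    {
        "prompt": '<order id="1"><total>19.99</total></order>\n\n###\n\n',
        "completion": " total=19.99 END",
    },
    {
        "prompt": '<invoice no="7"><amount>42.50</amount></invoice>\n\n###\n\n',
        "completion": " total=42.50 END",
    },
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Then upload with the CLI, e.g.:
#   openai api fine_tunes.create -t train.jsonl -m curie
```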

The only big limitation with GPT-3, I think, is the length of the input. Whatever you’re doing, you have to find a way to break it into pieces GPT-3 can work with. It can’t parse a 100-page document, but it can parse, classify, etc., one page at a time, for example. That’s the only slightly cumbersome aspect of designing with GPT-3, in my opinion (so far).
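A rough chunking helper like this is one way to handle it. This sketch splits on paragraphs with a character budget; a real pipeline would count tokens instead:

```python
# Sketch: break a long document into pieces that fit the context window.
def chunk_text(text, max_chars=4000):
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # start a new chunk if adding this paragraph would overflow
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then go through the same extraction prompt,
# one call per chunk, with the results merged afterwards.
```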