Unstructured text to dataset

Hi, I am experimenting with training/fine-tuning GPT-3 specifically for my niche.

My aim is really to have a model specifically trained for my niche, instead of relying on few-shot prompting.

I have lots of unstructured text in the form of reports, ebooks, etc. (tens of thousands of pages).

I want to convert that unstructured text into a structured dataset so that I can use it to fine-tune GPT-3 for my niche.

And then I will use this niche-fine-tuned model for various tasks like question answering, completion, classification, etc.

There are several proposed methods like using NER detection, BERT classification, and some others.

Manually annotating/labeling documents that run to thousands of pages would obviously take forever and cost a lot.

I would appreciate some expert direction here as I could not find any best practices on the net.

Thank you.

I think you have too many steps in this. I don’t think you need to create structured data, nor do you need fine-tuning. I think you just need the right system of embeddings, chunking, search and prompts.
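
To make that a bit more concrete, here is a rough sketch of how chunking, embeddings, search and prompts could fit together. It is only an illustration under some loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, and all function names, the chunk sizes and the prompt wording are hypothetical, not from any particular library. In practice you would swap `embed` for calls to an actual embedding model and send the final prompt to the completion model.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=50, overlap=10):
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy embedding (ASSUMPTION): word counts instead of a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Put the retrieved chunks and the question into one prompt string."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    # Hypothetical stand-ins for chunks cut from your reports and ebooks.
    docs = [
        "the pump requires monthly maintenance",
        "annual revenue grew last year",
        "valve inspection schedule",
    ]
    question = "pump maintenance"
    print(build_prompt(question, search(question, docs, top_k=1)))
```

The idea is that the model only ever sees the few chunks relevant to each question, so the tens of thousands of pages never need to be labeled or turned into a training set.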

I will need to study this. I appreciate the direction…
