Unstructured text to dataset

mkural2016 · August 4, 2022, 8:06am

Hi, I am experimenting with fine-tuning GPT-3 specifically for my niche.

My aim is to have a model trained specifically for my niche, instead of relying on few-shot prompting.

I have lots of unstructured text in the form of reports, ebooks, etc. (tens of thousands of pages).

I want to convert this unstructured text into a structured dataset so that I can use it to fine-tune GPT-3 for my niche.

And then I will use this niche-fine-tuned model for various tasks like question answering, completion, classification, etc.
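One way the conversion described above could be sketched, assuming the prompt/completion JSONL layout used for GPT-3 fine-tuning files. The paragraph-splitting rule, the half-and-half split into prompt and completion, and the names `split_paragraphs`, `to_records`, and `to_jsonl` are all hypothetical choices for illustration, not a recommended recipe:

```python
import json
import re


def split_paragraphs(text):
    """Split raw text on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def to_records(text, min_words=8):
    """Turn each paragraph into one prompt/completion pair.

    Assumption: the first half of a paragraph serves as the prompt and
    the second half as the completion. Paragraphs shorter than
    `min_words` are skipped.
    """
    records = []
    for para in split_paragraphs(text):
        words = para.split()
        if len(words) < min_words:
            continue
        cut = len(words) // 2
        records.append({
            "prompt": " ".join(words[:cut]),
            # The leading space keeps the completion separated from the prompt.
            "completion": " " + " ".join(words[cut:]),
        })
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)
```

Pairs built this way only teach the model to continue text in the style of the documents; question answering or classification would need pairs shaped like those tasks.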

There are several proposed methods like using NER detection, BERT classification, and some others.

Manual annotation/labeling of documents running to thousands of pages would obviously take forever and cost a lot.

I would appreciate some expert direction here as I could not find any best practices on the net.

Thank you.

daveshapautomator · August 4, 2022, 11:53am

I think you have too many steps in this. I don’t think you need to create structured data, nor do you need fine-tuning. I think you just need the right system of embeddings, chunking, search, and prompts.
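A minimal sketch of that pipeline, under stated assumptions: the word-count `embed` function below is a toy stand-in for a real embedding model, and the chunk size, overlap, prompt wording, and all function names are hypothetical, chosen only to show how the pieces fit together:

```python
import math
import re
from collections import Counter


def chunk_text(text, size=50, overlap=10):
    """Split text into overlapping chunks of roughly `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def search(query, chunks, top_k=2):
    """Return the `top_k` chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, chunks, top_k=2):
    """Assemble a prompt that places the retrieved chunks ahead of the question."""
    context = "\n\n".join(search(query, chunks, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

The string returned by `build_prompt` would then be sent to the model as an ordinary completion request. In a real system the chunk embeddings would be computed once and stored, rather than recomputed on every search as this sketch does.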

mkural2016 · August 4, 2022, 3:04pm

I will need to study this; I appreciate the direction…
