How do you find so much content to generate JSONL?

Where do you get so much content from? For example, for my paraphrasing model I had to Google for paraphrase examples by hand, and I ended up with only about 100. Is there a better way of doing it?

Thanks

Hi @davut,

Can you please elaborate a little bit?

Are you wanting to put your data into .jsonl format?

Search Kaggle Datasets and Google Datasets first.

https://datasetsearch.research.google.com/

Yeah, I’ve had a lot of luck with Kaggle and Wikipedia data. You can also try Common Crawl, but that data is harder to mine.

No, I just want to find a dataset or create one easily. For example, let’s say I want to make a fine-tuned model for Instagram captions. I would probably spend hours trying to scrape examples.

Wow, this is so good @daveshapautomator, thank you. You are so helpful, I see you everywhere :smiley:

Depending on future model requirements:

https://registry.opendata.aws

Thank you so much, so many datasets

One way could be to create another model that is good at generating .jsonl files.
The generated .jsonl files can then be used to fine-tune or train new models for your task.
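
Whichever way the examples are obtained (a downloaded dataset or model-generated text), the last step is usually just writing them out one JSON object per line. Here is a minimal sketch of that step in Python. The `prompt` and `completion` field names, the helper names, and the sample pairs are assumptions for illustration, so adjust them to whatever layout your fine-tuning tool expects.

```python
import json


def write_jsonl(pairs, path):
    """Write (prompt, completion) pairs to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, completion in pairs:
            # Field names are an assumption; rename to match your tool.
            record = {"prompt": prompt, "completion": completion}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical paraphrase pairs standing in for rows from a real dataset.
pairs = [
    ("The weather is nice today.", "It is a pleasant day outside."),
    ("She arrived late to the meeting.", "She got to the meeting late."),
]
```

Calling `write_jsonl(pairs, "paraphrase.jsonl")` would produce a two-line file, and `read_jsonl` can be used to spot-check it before uploading.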