What are your favorite text-based dialog datasets?

daveshapautomator · July 17, 2021, 11:18am

I am looking for some open source conversational data for finetuning. I’d like a variety of topics in the data - scientific, medical, interpersonal, philosophical, political, etc.

I will be finetuning a couple of different models:

Summarizer - generate a concise (but complete) summary of arbitrary text (including conversations)
Question generator - Generate salient follow-up questions to arbitrary text (also including conversations)
“Who speaks next?” - Specific to dialog, try and anticipate the next speaker, great for use with chatbots and conversational agents so they can predict when it is their turn to speak

Furthermore, I will need some other data for the “arbitrary text” - such as blog posts, Wikipedia articles, news articles, and fiction. Since this is for finetuning I only need a few hundred examples, but I want a wide array.

daveshapautomator · July 17, 2021, 11:45am

Okay this was easier to find than I realized it would be. Here’s my list of sources so far:

Multiple domain (Stack Exchange): GitHub - Guzpenha/MANtIS: MANtIS - a multi-domain information seeking dialogues dataset
Multiple domain (Stack Exchange): MANtIS - a multi-domain information seeking dialogues dataset | MANtIS
Many lists for chatbot training: 36 Best Machine Learning Datasets for Chatbot Training | Kili Technology
Medical NLP: GitHub - socd06/medical-nlp: Dataset for Natural Language Processing using a corpus of medical transcriptions and custom-generated clinical stop words and vocabulary.
General purpose summarization: WikiHow Summarization | Kaggle
News summarization: BBC News Summary | Kaggle
News summarization: NEWS SUMMARY | Kaggle
Movie reviews: Movie Reviews | Kaggle
Supreme court cases: US Supreme Court Cases, 1946-2016 | Kaggle
Data science Reddit: Reddit Data Science Posts (500k+) | Kaggle
Reddit broad spectrum: Reddit Top 1000 Posts | Kaggle
Reddit broad spectrum: Reddit's 2400 Posts Dataset | Kaggle

daveshapautomator · July 18, 2021, 3:23pm

Okay, this is rocking and rolling. I had to discard much of what I found for various reasons, but I’ve got movie dialog, Reddit posts, Stack Exchange posts, medical cases, and news. These 5 classes give a pretty good cross section, especially since the Reddit and Stack Exchange posts contain multiple domains, such as dating, mental health, school, work, and hobbies.
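For anyone who wants to draw an even mix from several source classes like these, here is a minimal sketch of one way to do it. It is not necessarily how I built my set; the function name `sample_balanced`, the `per_class` parameter, and the dictionary layout are all assumptions made for illustration.

```python
import random

def sample_balanced(sources, per_class, seed=0):
    """Draw up to `per_class` texts from each source class.

    `sources` maps a class label (e.g. "reddit", "news") to a list of texts.
    The layout and names here are hypothetical, not from any dataset's API.
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    picked = []
    for label, texts in sources.items():
        k = min(per_class, len(texts))  # small classes contribute what they have
        picked.extend((label, text) for text in rng.sample(texts, k))
    rng.shuffle(picked)  # mix the classes together
    return picked
```

With five classes and `per_class=60`, this yields at most 300 labeled examples, which is in the "few hundred" range.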

In case anyone is curious, this is the kind of thing I’m doing with it:

INSTRUCTIONS: Write a list of the most important questions to ask about the following passage:

PASSAGE:
1, Spermatocelectomy, Epididymectomy, & Vasectomy, PREOPERATIVE DIAGNOSES:,1. Left spermatocele.,2. Family planning.,POSTOPERATIVE DIAGNOSES:,1. Left spermatocele.,2. Family planning.,PROCEDURE PERFORMED:,1. Left spermatocelectomy/epididymectomy.,2. Bilateral partial vasectomy.,ANESTHESIA:, General.,ESTIMATED BLOOD LOSS:, Minimal.,SPECIMEN:, Left-sided spermatocele, epididymis, and bilateral partial vasectomy.,DISPOSITION: ,To PACU in stable condition.,INDICATIONS AND FINDINGS:, This is a 48-year-old male with a history of a large left-sided spermatocele with significant discomfort. The patient also has family status complete and desired infertility. The patient was scheduled for elective left spermatocelectomy and bilateral partial vasectomy.,FINDINGS:, At this time of the surgery, significant left-sided spermatocele was noted encompassing almost the entirety of the left epididymis with only minimal amount of normal appearing epididymis remaining.,DESCRIPTION OF PROCEDURE:, After informed consent was obtained, the patient was moved to the operating room. A general anesthesia was induced by the Department of Anesthesia.,The patient was prepped and draped in the normal sterile fashion for a scrotal approach. A #15 blade was used to make a transverse incision on the left hemiscrotum. Electrocautery was used to carry the incision down into the tunica vaginalis and the testicle was delivered into the field. The left testicle was examined. A large spermatocele was noted. Metzenbaum scissors were used to dissect the tissue around the left spermatocele. Once the spermatocele was identified, as stated above, significant size was noted encompassing the entire left epididymis. Metzenbaum scissors as well as electrocautery was used to dissect free the spermatocele from its testicular attachments and spermatocelectomy and left epididymectomy was completed with electrocautery. Electrocautery was used to confirm excellent hemostasis. 
Attention was then turned to the more proximal aspect of the cord. The vas deferens was palpated and dissected free with Metzenbaum scissors. Hemostats were placed on the two aspects of the cord, approximately 1 cm segment of cord was removed with Metzenbaum scissors and electrocautery was used to cauterize the lumen of the both ends of vas deferens and silk ties used to ligate the cut ends. Testicle was placed back in the scrotum in appropriate anatomic position. The dartos tissue was closed with running #3-0 Vicryl and the skin was closed in a horizontal interrupted mattress fashion with #4-0 chromic. Attention was then turned to the right side. The vas was palpated in the scrotum. A small skin incision was made with a #15 blade and the vas was grasped with a small Allis clamp and brought into the surgical field. A scalpel was used to excise the vas sheath and vas was freed from its attachments and grasped again with a hemostat. Two ends were hemostated with hemostats and divided with Metzenbaum scissors. Lumen was coagulated with electrocautery. Silk ties used to ligate both cut ends of the vas deferens and placed back into the scrotum. A #4-0 chromic suture was used in simple fashion to reapproximate the skin incision. Scrotum was cleaned and bacitracin ointment, sterile dressing, fluffs, and supportive briefs applied. The patient was sent to Recovery in stable condition. He was given prescriptions for doxycycline 100 mg b.i.d., for five days and Vicodin ES 1 p.o. q.4h. p.r.n., pain, #30 for pain. The patient is to followup with Dr. X in seven days.surgery, partial vasectomy, spermatocele, epididymis, family planning, vas deferens, metzenbaum scissors, vasectomy, spermatocelectomy, epididymectomy, testicle, deferens, hemostats, electrocautery,
END PASSAGE

IMPORTANT QUESTIONS:
What type of procedure is the surgeon performing and why?
What did the doctor find at the beginning of the surgery?
How does he proceed to complete the surgery?
What is wrong with this person, why does he need surgery?

I’m creating finetuning data for asking questions because the ability to ask questions (internally or externally) is the cornerstone of intelligence. By creating a training dataset, I can have a purpose-built “Question Asking” transformer. These questions can then be piped to a Question Answering service.

Question Asking & Answering is what leads to the creation of everything from stories to moon rockets. What happens next in the story? How do you build a hydrogen fueled rocket engine? Most of the questions we ask ourselves are automatic and inarticulate, but with AGI, we will need to articulate those questions into Natural Language for the sake of interpretability and transparency. We want to know what the AGI is thinking and why.

In other news, I’ve ordered a proof of my book, in which I go into much greater detail on all of this. As soon as I get the proof, and if everything looks to be in order, it will go on sale (paperback and EPUB). The paperback will be $7.95 (I will make less than $1 per copy) and the EPUB will be free.

daveshapautomator · July 18, 2021, 5:31pm

The prompt is the passage and the response is just the questions. So the fine-tuned model will take in any arbitrary text and output a list of questions.
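As a rough sketch of what one training record could look like under that scheme: the `prompt` and `completion` field names and the one-JSON-object-per-line layout are assumptions about the fine-tuning file format, and `make_record` and `to_jsonl` are hypothetical helper names.

```python
import json

def make_record(passage, questions):
    """Pair a passage (the prompt) with its questions (the response)."""
    return {
        "prompt": passage.strip(),
        # One question per line in the response.
        "completion": "\n".join(q.strip() for q in questions),
    }

def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Newlines inside the passage or the questions are escaped by `json.dumps`, so each record still occupies exactly one line of the file.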

I also have other boilerplate questions such as “What is the next step?” and “What does this mean?” So those don’t need to be generated. But every situation is different, so the ability to dynamically generate questions is critical.

daveshapautomator · July 18, 2021, 6:01pm

I used a zero-shot prompt to generate the questions, and then I clean them up by hand if needed. The more open-ended the prompt, the more creative GPT-3 can be, which is why it can already generate better questions than most people.
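Here is a minimal sketch of assembling that zero-shot prompt in the INSTRUCTIONS / PASSAGE / IMPORTANT QUESTIONS layout from the example earlier in the thread. The helper names `build_prompt` and `parse_questions` are hypothetical, and the actual call to the model is left out on purpose.

```python
INSTRUCTIONS = (
    "Write a list of the most important questions to ask "
    "about the following passage:"
)

def build_prompt(passage):
    """Wrap a passage in the zero-shot template; the model continues after the last line."""
    return (
        f"INSTRUCTIONS: {INSTRUCTIONS}\n\n"
        f"PASSAGE:\n{passage.strip()}\nEND PASSAGE\n\n"
        "IMPORTANT QUESTIONS:\n"
    )

def parse_questions(completion):
    """Split the model's raw output into one question per non-empty line."""
    return [line.strip() for line in completion.splitlines() if line.strip()]
```

The parsing only trims whitespace and drops blank lines; anything beyond that is the hand cleanup.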
