To-do detection with GPT-3

Hi all,

For a research project, I am investigating whether GPT-3 can be used for to-do detection. I have several transcripts that contain tasks, and I want to extract them automatically. Because these transcripts are too long for GPT-3's context window, I split them into smaller sub-transcripts. For each sub-transcript, I have annotated whether it contains any to-dos, and if so, which to-dos and who is assigned to them.
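
The splitting itself is nothing sophisticated; it works roughly like this (simplified sketch, using word count as a crude proxy for tokens):

def split_transcript(lines, max_words=1500):
    """Split a list of 'Microphone X: ...' utterances into
    sub-transcripts that stay under a rough word budget."""
    chunks, current, count = [], [], 0
    for line in lines:
        n = len(line.split())
        # close the current sub-transcript before the budget is exceeded
        if current and count + n > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(line)
        count += n
    if current:
        chunks.append("\n".join(current))
    return chunks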

To detect the tasks, I have come up with a small pipeline:

  1. Ask GPT-3 whether the conversation contains any to-dos.
  2. If the answer is yes, ask GPT-3 to write down the to-dos and who is assigned to them.

If I skip step 1 and only use the second prompt, GPT-3 writes down many irrelevant or incorrect to-dos.
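
In code, the pipeline is roughly this (a simplified sketch using the pre-1.0 openai Python package; the model name and parameters are just examples, not necessarily what I run):

import openai  # assumes OPENAI_API_KEY is set in the environment

def detect_todos(sub_transcript):
    # Step 1: ask the yes/no gate question.
    step1 = (
        sub_transcript
        + "\n\nQuestion: Does this conversation contain any explicit"
        " to-dos for Blue, Yellow, or Red?\nAnswer (yes/no):"
    )
    gate = openai.Completion.create(
        model="text-davinci-002",  # example model name
        prompt=step1,
        max_tokens=1,              # room for a single "yes"/"no" token
        temperature=0,
    )
    if gate["choices"][0]["text"].strip().lower() != "yes":
        return None  # step 1 says there are no to-dos

    # Step 2: only extract tasks when step 1 answered yes.
    step2 = (
        sub_transcript
        + "\n\nWrite down the tasks, and put the person who has to do"
        " each task in brackets behind it."
    )
    tasks = openai.Completion.create(
        model="text-davinci-002",
        prompt=step2,
        max_tokens=256,
        temperature=0,
    )
    return tasks["choices"][0]["text"].strip()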

The prompts look like the following:

Step 1:

Microphone RED: …
Microphone BLUE: …
Microphone RED: …
Microphone RED: …
Microphone YELLOW: …
Microphone YELLOW: …
Microphone BLUE: …
Microphone BLUE: …
Microphone YELLOW: …
Microphone BLUE: …
Microphone RED: …
Microphone BLUE: …
Microphone YELLOW: …
Microphone RED: …
Microphone RED: …
Microphone RED: …
Microphone YELLOW: …
Microphone YELLOW: …
Microphone RED: …
Microphone BLUE: …
Microphone RED: …
Microphone RED: …
Microphone RED: …
Microphone YELLOW: …
Microphone YELLOW: …
Microphone BLUE: …
Microphone BLUE: …
Microphone BLUE: …
Microphone YELLOW: …
Microphone RED: …

Question: Does this conversation contain any explicit to-dos for Blue, Yellow, or Red?
Answer (yes/no):

Step 2:

(same sub-transcript as in step 1)

Write down the tasks, and put the person who has to do each task in brackets behind it.
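
For a sub-transcript that does contain tasks, the output I am after looks like this (made-up example, not from my data):

Order new batteries for the microphones (Blue)
Send the meeting notes to everyone (Red)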

Now I have some questions about this project:

  1. Is my prompt for step 1 a good way to ask this, or do you have a better suggestion?
  2. I want to see if fine-tuning improves the performance, especially for step 1. I fine-tuned a model on 91 prompts (I realize this is a small dataset, but I just wanted to see whether it already changes the performance). However, the fine-tuned model gave strange output. With the regular GPT-3 model, I receive either yes or no as an answer, and it performed OK, with an accuracy of 0.62. (It was also consistent, giving the same response when run 5 times.) With my fine-tuned model, the output was no longer a single yes or no, but looked more like this:
    yes yes no no no no no no yes no yes yes yes yes
    I have no idea what is going wrong here, because in the prompts I used to fine-tune the model I only ever had yes or no as the ideal output. Does anyone have an idea what I did wrong? Or is GPT-3 simply not able to answer this question?
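
For reference, my fine-tuning data is a JSONL file formatted roughly like this (simplified sketch; the transcript text is redacted, and the completions are just yes or no with nothing after them):

{"prompt": "Microphone RED: …\nMicrophone BLUE: …\n\nQuestion: Does this conversation contain any explicit to-dos for Blue, Yellow, or Red?\nAnswer (yes/no):", "completion": " yes"}
{"prompt": "…\n\nQuestion: Does this conversation contain any explicit to-dos for Blue, Yellow, or Red?\nAnswer (yes/no):", "completion": " no"}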

Thank you!