What is the best format to input Q&A data for ChatGPT or the OpenAI Assistant?

Hello everyone,
I am currently working on summarizing long texts, approximately 10,000 to 20,000 tokens in length.

My current approach involves segmenting the long text into smaller paragraphs, having AI generate summaries for each, and then integrating these into a final comprehensive summary for analysis.

My original data is stored in Excel in the following format:
1 \t Q \t question 1
\t A \t answer 1
2 \t Q \t question 2
\t A \t answer 2
… and so forth.

If I want to input this into ChatGPT to process the Q&A content, how should I modify the format? Would a direct text string input be better? Should the Q&A headings include specific numbers, like:
Q1: XXXXXX
A1: YYYYYYY
Q2: ZZZZZZZ
A2: LLLLLLLLLL

Or should each Q&A pair be on the same line, for example:
Q: XXXXXXX \t A: YYYYYYYY
Q: ZZZZZZZZ \t A: LLLLLLLLL

I’ve also considered using the JSON format, but given the length of my texts, I’d like to minimize unnecessary token usage.

Thank you all for your help!

Best format to feed to GPT is JSON:

{
“QA_set”: [
{
“question”: “Question 1”,
“response”: “response 1”
},
{
“question”: “Question 2”,
“response”: “response 2”
},
{
“question”: “Question 3”,
“response”: “response 3”
}
]
}

advice: to save on tokens, shorten keys (question / response) to ‘Q’ and ‘R’, then add row ‘Q = question, R = response’

Thank you for your response. Previously, I used to input text directly to the assistant like this:

Q: Q1 \n
A: A1 \n
Q: Q2 \n
A: A2 \n
......

However, when reviewing the thread content, it appears all stuck together like this: Q: Q1 A: A1 Q: Q2 A: A2…, which is quite strange.

Regarding your suggestion to use JSON, besides the challenge of converting the format, I have some questions:

  1. Are the brackets, curly braces, and quotes in JSON format { } “”) counted in the token consumption?
  2. Since JSON format includes quotation marks (" "), should I remove any quotation marks from my Chinese Q&A content beforehand? Otherwise, it might affect the JSON parsing, like this:
{
  “question”: “Question 1”,
  “response”: “response 1”
}
  1. You suggested shortening keys (question/response) to ‘Q’ and ‘R’. Can I assign numbers to Q and R, like Q1, R1, Q2, R2, until the end?
{
  “question”: “Question 1”,
  “response”: “response 1”
}

Best regards;

Hi,

  1. Every character in prompt is counted as token.

  2. Before sending request to OpenAI API you have to encapsulate special characters with backslash - \

2.1. To make it easier, ask Chat GPT to write that piece of code to prepare raw texts into JSON / API applicable format on your preferable programming language.

1 Like

Thank you very much. So, if using the OpenAI Assistant API to analyze QA datasets, is it easier for the AI to understand if the data is in JSON format?

From the example you provided, can the “question” and “response” keys be sequentially numbered like Q1, Q2, Q3…Qn and A1, A2, A3…An?

If an interview dataset has too many QA pairs and I need to split them, what format should the QA_set be in?

Thank you.