Sentiment and clarity analysis of customer service calls using GPT-3.5 Turbo

I am building a web app for a customer service company and would like to use gpt-3.5-turbo to do basic sentiment analysis on the conversations.

The basic idea is that I want to label each line of the transcription with a sentiment label (Positive, Negative, or Neutral) and a clarity label (Clear or Unclear), based on whether the customer understood the request.

My text is in Arabic and is formatted as follows:
[hh:mm:ss] Speaker(Agent/Customer) : Content
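For reference, a line in that format can be split into its parts with a small parser. This is a sketch under the assumption that the speaker tag is literally "Agent" or "Customer" and the timestamp is always bracketed as shown:

```python
import re

# Matches lines of the form "[hh:mm:ss] Agent : content" or "[hh:mm:ss] Customer : content"
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s*(Agent|Customer)\s*:\s*(.*)$")

def parse_line(line: str):
    """Return (timestamp, speaker, content), or None if the line doesn't match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return m.group(1), m.group(2), m.group(3)
```

Pre-parsing like this also lets you number the lines yourself before sending them to the model, which helps with the labeling issues discussed below.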

I pass the whole transcription to the API in one go, with this prompt as initialization:

prompt: Your job is to extract the timestamp at the start of each line, then read the content of the line and, based on the sentiment expressed in it, add one of the following tags: {Positive}, {Negative}, or {Neutral}. Additionally, include an {Unclear} label if the text in the line was unclear or not understood by the customer, or a {Clear} label if it was understood and clear. Include only the timestamp, sentiment, and clarity for each line. Make sure to analyze the sentiment and clarity for every line and every timestamp in the input; don't skip anything.

My problem is that it struggles to identify clarity, especially when the context of the misunderstanding comes more than two time blocks earlier. Its choice of sentiment label also sometimes seems random. How would I solve this?

Also, would passing the input line by line instead of as one big block help at all, and would it cost more?

I am fairly new to using OpenAI's API, so any guidance would be greatly appreciated.

Yes: the more elements you include, especially if they are labeled in an unconventional way, the more incoherent the AI's answers become, to the point of being essentially fabrications.

To start, set the API parameter top_p to 0.001 so that the top-ranked token is actually chosen instead of a randomly sampled one.
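In the Python SDK that parameter goes alongside the model and messages. A minimal sketch (the model name and prompt text are illustrative placeholders, not a prescribed setup):

```python
def build_request(transcript_block: str) -> dict:
    """Assemble near-deterministic classification parameters for the Chat Completions API."""
    return {
        "model": "gpt-3.5-turbo",
        "top_p": 0.001,  # sample only from the top of the distribution: reproducible labels
        "messages": [
            {"role": "system", "content": "Label each line with sentiment and clarity."},
            {"role": "user", "content": transcript_block},
        ],
    }

# Sending it (requires `pip install openai` and OPENAI_API_KEY in the environment):
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(**build_request(text))
# print(response.choices[0].message.content)
```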


I made a classifier with a clearly-labeled input:

{1: "I am building a web app for a customer service company and would like to use gpt-3.5-turbo to do basic sentiment analyses on the conversations." , 2: "The basic idea is i want to label each line of the transcription with either a Positive , Negative or Neutral label , and a Clarity label either Clear or Unclear based on whether the customer understood the request." , 3: "My text is in Arabic and if formatted as follows..." , ...}

Then, secondly, a system prompt with the goal of making sure each response is itself labeled, carries a text excerpt for reference, and that the score token is only generated after what is actually being scored has been reiterated:

For each input item, output an integer in the range 1-10, where:
1: very confused and angry, up to
10: confident, assured, correct

Output format:
[{
  1: five words of text (elided),
  assuredness: score
},{
  2: five words...

Then get the AI's answer (note that what it is being asked to score here is itself hard to judge):


[{
  1: "Using GPT-3.5-turbo for sentiment analysis",
  assuredness: 7
},{
  2: "Labeling transcription lines with sentiment and clarity",
  assuredness: 6
},{
  3: "Arabic text format for transcription",
  assuredness: 8
},{
  4: "Passing transcription content to API",
  assuredness: 7
},{
  5: "Challenges with clarity and sentiment labeling",
  assuredness: 5
},{
  6: "Effect of passing input line by line",
  assuredness: 6
}]

You then have a list of dictionaries of answers that can be parsed directly: in each dictionary, the key that is not "assuredness" gives you the item number, and the values give you the excerpt and the score.
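A parsing sketch, assuming you tighten the format instruction so the model emits strict JSON (quoted keys; the literal output above, with bare keys, would not load as JSON):

```python
import json

# Example reply in strict-JSON form (two items shown for brevity)
reply = """[
  {"1": "Using GPT-3.5-turbo for sentiment analysis", "assuredness": 7},
  {"2": "Labeling transcription lines with sentiment and clarity", "assuredness": 6}
]"""

def parse_scores(text: str) -> list:
    """Return (item_number, excerpt, score) tuples from the model's JSON reply."""
    results = []
    for item in json.loads(text):
        # The one key that is not "assuredness" is the item number;
        # its value is the short text excerpt being scored.
        number = next(k for k in item if k != "assuredness")
        results.append((int(number), item[number], item["assuredness"]))
    return results
```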

By keeping a clear focus on matching each output to the original item it refers to, a processor handling multiple items can do a much better job.