Bad data format for semantic search in RAG? Implementing an internal chatbot for troubleshooting an SDK


I am a software developer, and the company I work for has “Tech exploration days” where we are free to try any new tech that we think could help the company. I had the idea of implementing a chatbot that could help my team with support tickets regarding our SDK.

When customers experience bugs or have trouble integrating the SDK, they can reach our support by creating a ticket, which someone on my team receives and helps them through.

I want the chatbot to help out with new incoming support tickets. To do so, I can provide all of the resolved tickets, of which there are thousands. I want the chatbot to find relevant tickets where similar problems have been resolved and provide solutions and suggestions based on them.

The relevant data from these tickets is in conversation format, between someone from the support team and a customer, and it contains a lot of irrelevant and misleading information.

I have done some experimenting with ChatGPT-4o and noticed that it is way too much data for it to handle, therefore I was thinking of implementing RAG so that the task is divided into subtasks, hopefully improving the results.

My main concern is that the format of the data won't make it possible for the semantic search in the RAG pipeline to retrieve the relevant chunks. I am also worried that conversations will be split across different chunks, which would be quite bad. The conversations can range from 5 to 80 responses back and forth. I am not really sure how to approach these problems; are there workarounds? Tips on implementation, data formatting, RAG, etc. are all welcome, as well as tips regarding the whole project.

Thanks in advance!

Hi and welcome to the developer forum!

I think I would first pre-process the conversational data with GPT-4, asking it to look at each interaction and generate a condensed version of the original question and the end solution. You could then use those condensed versions as RAG chunks. This would allow you to avoid the “blah, blah” along the way and to have high-quality data to search on.
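A minimal sketch of that pre-processing step, assuming the OpenAI Python SDK (v1+); the model name, the prompt wording, and the `condense_ticket` helper are all illustrative choices, not a fixed recipe:

```python
# Hedged sketch: condense one raw support conversation into a short
# question + solution pair, one API call per ticket.

def build_condense_messages(transcript: str) -> list[dict]:
    """Build the chat messages that ask the model to condense one ticket."""
    system = (
        "You condense SDK support conversations. Extract the customer's "
        "original question and the final working solution, nothing else."
    )
    user = f"Condense this support conversation:\n\n{transcript}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

def condense_ticket(client, transcript: str, model: str = "gpt-4") -> str:
    """Call the chat API once per ticket; returns the condensed Q/A text."""
    response = client.chat.completions.create(
        model=model,
        messages=build_condense_messages(transcript),
    )
    return response.choices[0].message.content
```

You would then loop this over all resolved tickets, e.g. `client = OpenAI()` followed by `condense_ticket(client, raw_transcript)` for each transcript, and store the condensed outputs for embedding.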

Hello and thank you for your welcoming and response!

Gotcha! So you would recommend embedding a single text file containing all the condensed versions of the tickets instead of uploading multiple files, one per ticket? In that case, what is the smartest way of ensuring that each ticket ends up in its own chunk and is not split?

Also, any tips on how to prompt GPT-4 to extract the important parts into the condensed versions?

Thank you!

No, I would “chunk”, that is split, the data into single Q/A pairs as determined by the AI analysing the conversations. Ask it to extract the question and the final answer from the text and give that as a Q/A pair, then embed each of those Q/A pairs in its own embedding. That way you will likely get better hits on topics that contain relevant information, and you are assured that each one will contain an answer.
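A rough sketch of what "one embedding per Q/A pair" could look like, assuming the OpenAI embeddings endpoint; the embedding model name is an assumption, and `cosine_similarity`/`top_matches` are hand-rolled helpers here, not SDK functions:

```python
# Hedged sketch: embed each condensed Q/A pair separately, then rank
# them against a query embedding by cosine similarity.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def embed_texts(client, texts: list[str],
                model: str = "text-embedding-3-small") -> list[list[float]]:
    """One embedding per Q/A pair, so each chunk is one whole ticket."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

def top_matches(query_vec, qa_vecs, k: int = 3) -> list[int]:
    """Indices of the k most similar Q/A pairs to the query."""
    scored = sorted(range(len(qa_vecs)),
                    key=lambda i: cosine_similarity(query_vec, qa_vecs[i]),
                    reverse=True)
    return scored[:k]
```

Because each Q/A pair is embedded as one unit, the "conversation split across chunks" worry goes away by construction. For thousands of tickets a proper vector store would replace the brute-force `top_matches`, but at this scale the linear scan is fine for prototyping.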

As for prompting, just use the text “Please extract the user question and the final correct answer to that query from the following text {{{$transcript}}} and then give output as a Question-and-Answer pair suitable for embedding and vector search”.

I suggest testing it out with a small number of tickets (maybe 50), and then running similarity searches with queries that you would expect to retrieve a particular ticket. It might work better than you expect - it’s better to prototype than to speculate.
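One way to make that small-scale test concrete: for each trial query, note which ticket *should* come back, then measure how often it appears in the top-k results. This `hit_rate` helper is a hypothetical sketch, not a standard metric implementation:

```python
# Hedged sketch: simple retrieval check for the ~50-ticket prototype.

def hit_rate(expected: list[int], retrieved: list[list[int]]) -> float:
    """Fraction of queries whose expected ticket id appears in its
    retrieved top-k list."""
    hits = sum(1 for exp, got in zip(expected, retrieved) if exp in got)
    return hits / len(expected)
```

If the hit rate is low, that tells you whether to work on the condensing prompt, the chunking, or the embedding model before scaling up to thousands of tickets.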