Best architecture for searching historical emails semantically?

sergeliatko · July 24, 2024, 11:08pm

That’s exactly the kind of application I was looking for as a side project. It would be great to discuss. On my side: my past experience and my thinking. On yours: good knowledge of the domain and the underlying issues only you would know.

Let me know if you’re still interested:

https://www.linkedin.com/in/sergeliatko

sergeliatko · July 24, 2024, 11:18pm

Totally agree. That’s why, in my opinion, instead of searching the emails directly, the app should distill the solutions and the approach taken, and match them with the underlying issues and with how the user described them.

Then I would store 4 classes of data:

1. How the user described the issue
2. What the real issue was (and how it was confirmed)
3. What approach was taken and how the issue was solved
4. The original emails

All 4 would have references to each other.
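
A minimal sketch of how those four classes could be tied together, assuming one record per support case. The names `CaseRecord` and `history_for` and all field names are hypothetical and chosen only for illustration; the emails are assumed to live in a separate store keyed by ID.

```python
from dataclasses import dataclass, field


@dataclass
class CaseRecord:
    """One support case, split into the four classes of data listed above."""
    case_id: str
    user_description: str   # 1. how the user described the issue
    real_issue: str         # 2. what the real issue was
    confirmation: str       # 2. how the real issue was confirmed
    resolution: str         # 3. approach taken and how it was solved
    email_ids: list[str] = field(default_factory=list)  # 4. links to original emails


def history_for(case: CaseRecord, emails: dict[str, str]) -> list[str]:
    """Follow the references from a case back to its original emails."""
    return [emails[eid] for eid in case.email_ids if eid in emails]
```

Because every class hangs off the same `case_id`, a match on the user description leads straight to the confirmed issue, the resolution, and the email history.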

The retrieval workflow would basically go from #1 through #3, and searching the emails would simply let the team grab the full history right up front.

Another nice thing would be to automate the documentation (internal and public) based on #3.

milver · August 21, 2024, 7:20pm

@jeff8 what did you end up using?

jeff8 · August 21, 2024, 7:39pm

@milver I ended up not using a vector database but storing the embeddings in a FAISS index. I also perform an SQL search in parallel with the semantic search so that I can obtain exact-match search results. I score the SQL results and the semantic search results, normalize the two sets of scores, and return a mixture of SQL and semantic results to the user.
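
For readers who want something concrete, here is one possible way to normalize and mix two score sets. It is a sketch, not the code described above: the min-max scaling, the weighted sum, the `sql_weight` parameter, and the function names are all assumptions, and both inputs are assumed to be "higher is better" scores keyed by ticket ID (a distance from an L2 index would need to be inverted first).

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them all to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def blend(sql_scores, semantic_scores, sql_weight=0.5, top_k=5):
    """Normalize both score sets, then rank by a weighted sum of the two."""
    sql_n, sem_n = min_max(sql_scores), min_max(semantic_scores)
    combined = {}
    for doc_id in set(sql_n) | set(sem_n):
        # A document missing from one result set contributes 0 for that side.
        combined[doc_id] = (sql_weight * sql_n.get(doc_id, 0.0)
                            + (1.0 - sql_weight) * sem_n.get(doc_id, 0.0))
    # Sort by score descending, breaking ties by ID for a stable order.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

Raising `sql_weight` favors exact matches; lowering it favors semantic neighbors.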

milver · August 22, 2024, 6:58pm

Do you happen to have shareable code? I’m curious about 1) the rationale for choosing a vector library vs. a vector DB (e.g., do you plan on not adding more emails to the corpus?), and 2) what DB you use to store the objects (emails). Thanks!

jeff8 · August 22, 2024, 7:38pm

The client owns the IP to the code so I can’t share it. The “emails” were actually tickets stored in an MS SQL Server database. Most were initiated from emails. The rationale for storing the embeddings in the index was speed. New tickets will be added to the database and then the FAISS index will be recreated periodically. There wasn’t a requirement to be able to search and find newly entered tickets.
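
To illustrate the rebuild-periodically trade-off, here is a toy sketch using only the Python standard library. Everything in it is a stand-in and an assumption: `sqlite3` replaces the MS SQL Server database, the brute-force `SnapshotIndex` replaces the FAISS index, `toy_embed` replaces a real embedding model, and the `tickets` table layout is invented.

```python
import math
import sqlite3


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: normalized letter counts."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class SnapshotIndex:
    """Brute-force in-memory index, standing in for a FAISS index."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def rebuild(self, conn: sqlite3.Connection) -> None:
        """Re-read every ticket and replace the whole index."""
        rows = conn.execute("SELECT id, body FROM tickets ORDER BY id").fetchall()
        self.ids = [row[0] for row in rows]
        self.vectors = [toy_embed(row[1]) for row in rows]

    def search(self, query: str, k: int = 3):
        """Return up to k (ticket id, cosine similarity) pairs, best first."""
        q = toy_embed(query)
        scored = [(tid, sum(a * b for a, b in zip(q, v)))
                  for tid, v in zip(self.ids, self.vectors)]
        return sorted(scored, key=lambda pair: -pair[1])[:k]
```

A ticket inserted after the last `rebuild` call stays invisible to `search` until the next rebuild, which is acceptable only when fresh tickets do not need to be searchable.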

Also, I decided to build the embeddings from the first part of each ticket only. This is where the customer states the problem and support asks clarifying questions. Maybe it was the wrong call, but I decided the tickets were way too long to embed in full: much of the text would be noise to the search rather than increasing the odds of finding more relevant results. This is different from many document search apps, where the full text would be critical.
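
A minimal sketch of that truncation step, assuming a simple word-count cutoff. The function name `embedding_text` and the default of 200 words are hypothetical; the thread does not say where the cut was made or how it was measured.

```python
def embedding_text(ticket_body: str, max_words: int = 200) -> str:
    """Keep only the opening words of a ticket as the text to embed."""
    words = ticket_body.split()
    return " ".join(words[:max_words])
```

A cutoff based on the embedding model's tokenizer, or on the first few messages of the ticket, would be equally plausible ways to isolate the problem statement.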
