I’ve been working with diverse datasets consisting of thousands of documents for about a year now, constantly trying to figure out how to get the best responses from the widest variety of queries. The biggest problem I’ve found so far? Noise. Cosine similarity does a great job of finding needles in haystacks, but in my experience, the smaller the haystack, the better the response.
Trust me – thousands of unstructured emails are going to have a LOT of noise. Remember that these models retrieve on semantic similarity, so the returned chunks may or may not have any contextual relationship with each other – i.e., noise.
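To make the haystack point concrete, here’s a rough sketch (plain NumPy, made-up embeddings) of what the retriever is actually doing: it ranks every chunk by cosine similarity and hands back the top-k whether or not those chunks have anything to do with each other.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for real embeddings: one query vector, a thousand email chunks.
query_vec = np.random.rand(1536)
chunk_vecs = [np.random.rand(1536) for _ in range(1000)]

# Rank every chunk by similarity; the top-k come back whether or not
# they share any real context with each other -- that's the noise.
scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
top_k = sorted(scores, key=lambda s: s[1], reverse=True)[:5]
```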
I address this in two ways (sketches of both below):

1. I try to organize my documents as much as possible according to their semantic hierarchies (i.e., Semantic Chunking), and
2. I utilize metadata and filtering as much as possible.
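For the chunking side, here’s a minimal sketch of the idea: split along the document’s own structure instead of fixed-size windows. The markdown-heading regex is just an assumption – use whatever structure your documents actually have.

```python
import re

def chunk_by_headings(text: str) -> list[dict]:
    """Split a document on its own headings so each chunk stays
    within one semantic unit instead of a fixed-size window."""
    sections = re.split(r"\n(?=#+ )", text)  # assumes markdown-style headings
    chunks = []
    for section in sections:
        heading = section.splitlines()[0].lstrip("# ").strip() if section else ""
        chunks.append({"heading": heading, "text": section.strip()})
    return chunks
```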
Emails typically have just two categorization elements: Date and Subject. If you are trying to build a product support knowledgebase, the date probably won’t be of much help, so the subject line will be the only thing you have to categorize and filter your chunks with. I would suggest, if possible, adding metadata to these emails like product/service, issue/complaint, resolution, etc. Being able to filter on this kind of metadata, even with data as unstructured as emails, will help reduce the size of your “haystacks” and improve the quality of the responses.
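Here’s a rough sketch of that kind of metadata filtering, using ChromaDB as a stand-in for whatever vector store you’re on (the field names and product are made up):

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("support_emails")

# Each email chunk carries the metadata fields suggested above.
collection.add(
    ids=["email-0017"],
    documents=["Customer reports the ProWidget app crashes on login..."],
    metadatas=[{"product": "ProWidget", "issue": "crash", "resolution": "patch 2.3.1"}],
)

# Filtering on metadata shrinks the haystack before similarity search runs.
results = collection.query(
    query_texts=["app crashes when logging in"],
    n_results=5,
    where={"product": "ProWidget"},
)
```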
I mentioned PDFs because that’s what I mostly work with. You’re right, you can use any text structure you like as long as your embedding structure supports it. But think about this: When a customer asks a question and the model responds, what do you give the customer to corroborate the response? I give them links back to the source PDFs. Consider doing something similar.
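A sketch of what that can look like, assuming your chunks kept a source_url (and maybe a page number) in their metadata when you indexed them – adjust the keys to whatever your pipeline actually stores:

```python
def format_answer_with_sources(answer: str, retrieved_chunks: list[dict]) -> str:
    """Append links to the source PDFs the retrieved chunks came from,
    so the customer can corroborate the response themselves."""
    sources = {
        (c["metadata"]["source_url"], c["metadata"].get("page"))
        for c in retrieved_chunks
    }
    lines = [answer, "", "Sources:"]
    for url, page in sorted(sources, key=lambda s: (s[0], s[1] or 0)):
        lines.append(f"- {url}" + (f" (page {page})" if page else ""))
    return "\n".join(lines)
```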
BTW, you can use the AI to do the categorizations for you!
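For example, something like this (OpenAI Python client, gpt-3.5-turbo; the label schema is just a suggestion) can tag each email before you index it:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def categorize_email(body: str) -> str:
    """Ask the model to tag an email with the metadata fields we filter on."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Label this support email. Return JSON with keys: "
                        "product, issue, resolution (or null if unresolved)."},
            {"role": "user", "content": body},
        ],
    )
    return response.choices[0].message.content  # parse/validate before storing as metadata
```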
You might find these helpful:
- Retrieval-Augmented Generation for Large Language Models: A Survey
- Lessons Learned on LLM RAG Solutions
I have found this to be the best strategy, for sure, when dealing with gpt-3.5: I now structure my prompts in XML and I’m getting far better responses.
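For what it’s worth, the tag names below are just illustrative – the point is that explicit XML sections give the model clearer boundaries than free-form prose:

```python
# Illustrative tag names; swap in whatever sections your prompts need.
prompt = """<instructions>
Answer the customer's question using only the provided context.
If the context is insufficient, say so.
</instructions>
<context>
{retrieved_chunks}
</context>
<question>
{customer_question}
</question>"""

filled = prompt.format(
    retrieved_chunks="...top-k chunks from the vector store...",
    customer_question="How do I reset my password?",
)
```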