How to approach this task: email boilerplate removal

So we have thousands of emails from over a hundred online shops. These emails may have different topics, some confirming an order, updating an order status, reminding customers about their account, marketing, etc. Emails from the same vendor are generally consistently formatted. These emails are written in HTML, may have header/footers that contain generic images, links, promotional content. Emails also have a main body that a specific subject (as said above).

I have been experimenting with prompting GPT to process these emails, to remove ‘boilerplates’ including the header, footer, and signature sections, and then only output the plain text of the main body message. The prompt I use is:

The following text delimited by triple backticks is the HTML source code of an email. Extract the clean text of the main message, but remove the header and footer of the message. If the email has a signature section, also remove everything within and after the signature section. Text:\n\n{text}

The results are rather mixed. Probably in over 80% cases the output is ok. But in quite a notable cases, the text is cut out arbitrarily. For example, if in the HTML source code there is a

This is a long sentence long sentence long sentence long sentence

, sometimes it fails to retain all the text between the two

tags, but may return something like ’ This is a long’.

Obviously boilerplate removal would work better if we can ask the model to look at emails from the same client at the same time as they could be using the same formatting. But that would not be possible due to the limited input context.

Any suggestions please? Anything I have done wrong or any tips, thoughts?

Thanks

Hi and welcome to the developer forum!

Are you using ChatGPT or the API? If I assume it’s the API, what exactly is the flow of your data, how are you getting it in to the model, are you vectorising it? any code snippets you can share?

I suggest not to process the html formatted email directly but pre-process it first, removing tags, etc. (e.g. use something like html-to-text).

Then do an initial prompt to check what kind of email it is. As you mentioned, some emails from same client might have unique format. But rather than client check, maybe type of email? Try to categorize what kind of email you guys usually receive.

Then for each category, prepare a special prompt. Emails inquiring about products, or orders to supplier asking for meetings, etc. surely have different information that you need to extract.

This sounds like a project that a regular text processing library is much better for.

Something like:

  1. extract text using a text extractor (such as beautiful soup or whatever)
  2. calculate hashes of successive prefixes and postfixes (and perhaps infixes)
  3. store counts of common hashes. remove pre/post/infix line spans that have very high commonality counts

Amazed to see so many replies, many many thanks to all your help! To reply to some of the points raised so far:

  • Our workflow is like this: 1) BeautifulSoup to do a light-weight parsing of the html, to find one single html block within and one that has the longest text; 2) Pass that html block (html source code) to chat gpt, using the prompt above; 3) take the response as the cleaned message
  • Our task is not to extract information from emails, we ultimately is training an email generator and to do so, our plan is to use our current email repository to fine tune GPT. That is why we want to clean every email message to remove the’boilerplate’ but only keeping the ‘semantic’ content. But we do not want to classify the message or extract any information from it. We need the complete message
  • I have been thinking about the regex per-website approach and that may be my last resort, if GPT cannot do a much better job there is no point of spending money on it. But calculating the common prefix/suffic/infix texts seems to be an NP hard problem, as the search space is infinite. Are there any APIs or algorithms for doing this?

Thank you again, any more comments are highly appreciated.

I recently completed a project to do this - exactly.
It was the first stage (pre-processing stage) of an AI digital assistant app.

The results are over 99% accurate.

Model used: GPT - 3.5 16k

Prompt format:

  1. Main prompt
  2. 5 step-by-step worked examples
  3. 5 one step worked examples
  4. Current email to be parsed

When an error occurs (edge case) I add the corrected example to the prompt examples (manually) so the output improves over time (basic learning).

This simple strategy works well enough for my app but there are likely additional strategies you may consider.

That’s a very interesting approach! I assume few shot learning helps a lot.

Can I check if your HTML source code is relatively short? Ours are very long even after extracting the main block mostly likely to contain the body text. And sometimes they even exceed the 16k context limit, so we have not been able to do few-shot learning.

Just noticed the additional info you added to clarify your workflow/use case and I am actually extracting and analyzing the semantic content of the emails for intent/category - versus simply parsing the original email.

semantic parsing v syntactic parsing

https://mailparser.io/ will suit your use case for syntactic parsing I believe, we use it to do the initial parsing prior to passing the parsed emails to AI.

Very interesting, I am working on a similar project to categorize emails. Can you tell how you handle when the email is too long to fit into the context window? I have several emails that has been replied to a very long conversation, and they don’t even fit into the prompts when I use context windows of 128K. What would be the approach to strip these emails intelligently?

I never had to deal with email threads, but single, independent email messages. But if the text exceeds 128k, I would guess there is a lot of useless html code. The emails I process for our clients are typically 150-300 words, imagine how ‘long’ a thread would be if we were to fit 150-300 words emails into 128k tokens…

Did you try ‘truncating’ or removing any html code that you know will be useless? For me, I firstly remove any link/script/style/meta tags within the html, then find the ‘body’ element, then wrap the ‘body’ html with for chat gpt to process.

Do you think this will help reduce your content?

Your approach is solid, but HTML parsing can be tricky due to inconsistent formatting across emails. Here are a few suggestions to improve accuracy:

Preprocessing: Clean up the HTML before feeding it to GPT. Use libraries like BeautifulSoup (Python) to strip unnecessary tags or normalize the structure.

Chunking: Break the email into smaller sections (e.g., header, body, footer) and process them separately. This can help GPT focus on the main body.

Pattern Recognition: Identify common patterns in headers/footers for each vendor and create rules to remove them before passing the text to GPT.

Fine-Tuning: If possible, fine-tune GPT on a dataset of cleaned emails to improve its understanding of your specific use case.

Fallback Mechanism: Implement a fallback to retain the full text if GPT’s output seems incomplete or truncated.

Vendor-Specific Prompts: If you know the vendor, tailor the prompt to their email structure (e.g., “Remove [Vendor X]’s header/footer”).

Thank you ziqizhang, in fact I directly work with plaintext of the email, so the HTML overhead is not the issue I face. Here’s a more detailed explanation. A typical email might be a reply to a previous mail which quotes the previous one. Then it is replied and this new mail contains the previous two, quoted. And this is replied and the message grows. I have single emails that contain the previous 60 - 70 messages inside, all quoted.
So, in short, stripping HTML does not seem to help in this case, as it has already been stripped; but definitely must be suitable for alternative scenarios.

thank you ta7ha124, I am already using the plaintext of the email body which I understand carries me to the “chunking” step you have suggested. So I guess I need to find a way to chunk down the mail body, particularly removing all the quoted messages prior to the one processed.
I used many regex patterns to recognize and remove the basic quoted sections but there are still ones being missed.
I want to believe that this is something that has already been solved by someone in the community, as email processing has a long history :slight_smile:

Did you try few-shot prompting an LLM to extract quoted messages from an email thread? My bold assumption is that it will work better than regex…

thanks for the recommendation, ziqizhang. what I cannot understand is, I have several messages that cannot fit into the context window alone, so how can I send them to the llm with a prompt that has also example truncations within? that is the reason I am stuck with solutions like regex and cannot use llms for truncation purpose

Ah sorry I missed that… How about this.

If an email quotes others in a long thread, then it means that there will be contents (paragraphs, blocks) that are repeated. But the first time a block appears, it means someone starts new content, quoting others. So:

  1. have a process that identifies repeated and non-repeated blocks of texts in your massive email.txt file
  2. for every repeated blocks, you do ‘index of’ to find the first appearance in the massive email file, assuming that’s the start of a new email quoting earlier messages.
  3. split your massive email.txt into smaller chunks based the indexes discovered in step 2. Practically, I hope you only need to do a couple of splits to have a minimum number of chunks that fit the 128k window.

Of course in practice there will be a lot more to consider and this may prove impractical… E.g., for 1), you may need to experiment with what a ‘block’ is. A single paragraph may not suffice as there could be very short paragraphs, signoffs, etc… maybe a paragraph of at least X words? Or maybe choose the blocks that rank high in terms of repetition, or a combination of both? Or maybe your email text file gives you some patterns?

I can’t see the actual content so I don’t really know. But I wonder if this is something in theory worth trying.

absolutely that approach is doable, thank you very much. do you recommend any approaches for step 1, i.e. identifying repeating blocks?

It’s a bit difficult to advise without seeing what the data looks like, but I guess you’d start with splitting the text by ‘\n\n’ then try counting frequencies of each chunks… but I can imagine this won’t work when the quoted messages are preceded with ‘>’ or ‘>>’ like:

Good morning, can I remind you the following…
> Hi John
> We are hosting a quiz night at…

Are you able to post some dummy samples to show your data structure?