How to approach this task: email boilerplate removal

We have thousands of emails from over a hundred online shops. These emails cover different topics: some confirm an order, some update an order status, some remind customers about their account, some are marketing, etc. Emails from the same vendor are generally consistently formatted. The emails are written in HTML and may have headers/footers containing generic images, links, and promotional content. Each email also has a main body devoted to a specific subject (as noted above).

I have been experimenting with prompting GPT to process these emails: to remove 'boilerplate', including the header, footer, and signature sections, and then output only the plain text of the main body message. The prompt I use is:

The following text delimited by triple backticks is the HTML source code of an email. Extract the clean text of the main message, but remove the header and footer of the message. If the email has a signature section, also remove everything within and after the signature section. Text:\n\n{text}
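For concreteness, a minimal sketch of how this prompt might be sent through the Chat Completions API, assuming the `openai` Python package. The model name, temperature, and helper names are illustrative assumptions, not taken from the original post:

```python
# Sketch of sending the boilerplate-removal prompt to the Chat Completions
# API. Model name and temperature are placeholder choices.
PROMPT_TEMPLATE = (
    "The following text delimited by triple backticks is the HTML source "
    "code of an email. Extract the clean text of the main message, but "
    "remove the header and footer of the message. If the email has a "
    "signature section, also remove everything within and after the "
    "signature section. Text:\n\n{text}"
)

def build_prompt(html: str) -> str:
    """Fill the prompt template with the email's HTML source."""
    return PROMPT_TEMPLATE.format(text=html)

def clean_email(html: str, model: str = "gpt-3.5-turbo-16k") -> str:
    # Imported lazily; requires the `openai` package and OPENAI_API_KEY.
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output helps when comparing runs
        messages=[{"role": "user", "content": build_prompt(html)}],
    )
    return response.choices[0].message.content
```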

The results are rather mixed. In probably over 80% of cases the output is fine, but in a notable number of cases the text is cut off arbitrarily. For example, if the HTML source contains a tag pair enclosing the text

This is a long sentence long sentence long sentence long sentence

the model sometimes fails to retain everything between the two tags, and may instead return something like 'This is a long'.

Obviously boilerplate removal would work better if we could ask the model to look at several emails from the same vendor at once, since they likely share the same formatting. But that is not possible given the limited input context.

Any suggestions please? Anything I have done wrong, or any tips or thoughts?


Hi and welcome to the developer forum!

Are you using ChatGPT or the API? Assuming it's the API, what exactly is the flow of your data? How are you getting it into the model? Are you vectorising it? Any code snippets you can share?

I suggest not processing the HTML-formatted email directly but pre-processing it first, removing tags, etc. (e.g. use something like html-to-text).
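As a minimal illustration of that pre-processing step, here is a tag-stripping sketch using only Python's standard library, as a stand-in for a dedicated converter such as html-to-text or BeautifulSoup's `get_text()`:

```python
# Minimal tag-stripper: keeps visible text, drops markup and the
# contents of non-prose tags like <script> and <style>.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    SKIP = {"script", "style", "head"}  # tags whose content is never prose

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_plain_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```

A dedicated library will handle far more edge cases (entities, tables, inline images), but even this much shrinks the input and removes a large class of distractions before the model sees it.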

Then run an initial prompt to check what kind of email it is. As you mentioned, emails from the same client might share a unique format, but rather than checking per client, maybe check by type of email? Try to categorize the kinds of email you usually receive.

Then, for each category, prepare a dedicated prompt. Emails inquiring about products, orders to suppliers, requests for meetings, etc. surely contain different information that you need to extract.

This sounds like a project for which a regular text-processing library is much better suited.

Something like:

  1. Extract text using a text extractor (such as Beautiful Soup or similar).
  2. Calculate hashes of successive prefixes and postfixes (and perhaps infixes).
  3. Store counts of common hashes, then remove pre/post/infix line spans that have very high commonality counts.
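The hashing idea above might be sketched like this. All thresholds, span lengths, and helper names are illustrative assumptions; the inputs are assumed to already be plain text (one email per string):

```python
# Sketch: hash the first and last k-line spans of each email, count how
# often each span recurs across the corpus, then strip spans whose count
# is high (i.e. shared boilerplate). Thresholds are illustrative only.
import hashlib
from collections import Counter

def span_hash(lines):
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()

def count_spans(emails, max_span=5):
    """Count prefix and suffix line-spans across all emails."""
    counts = Counter()
    for text in emails:
        lines = [l for l in text.splitlines() if l.strip()]
        for k in range(1, min(max_span, len(lines)) + 1):
            counts[span_hash(lines[:k])] += 1   # prefix of k lines
            counts[span_hash(lines[-k:])] += 1  # suffix of k lines
    return counts

def strip_common_spans(text, counts, min_count=3, max_span=5):
    """Drop the longest prefix/suffix span seen at least min_count times."""
    lines = [l for l in text.splitlines() if l.strip()]
    for k in range(min(max_span, len(lines) - 1), 0, -1):
        if counts[span_hash(lines[:k])] >= min_count:
            lines = lines[k:]
            break
    for k in range(min(max_span, len(lines) - 1), 0, -1):
        if counts[span_hash(lines[-k:])] >= min_count:
            lines = lines[:-k]
            break
    return "\n".join(lines)
```

In practice you would normalize the lines first (strip dates, order numbers, and other per-email tokens), otherwise near-identical boilerplate hashes differently and never crosses the count threshold.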

Amazed to see so many replies; many thanks for all your help! To reply to some of the points raised so far:

  • Our workflow is: 1) use BeautifulSoup to do a lightweight parse of the HTML and find the single HTML block that has the longest text; 2) pass that block (as HTML source code) to ChatGPT, using the prompt above; 3) take the response as the cleaned message.
  • Our task is not to extract information from emails. Ultimately we are training an email generator, and to do so our plan is to fine-tune GPT on our current email repository. That is why we want to clean every email message to remove the 'boilerplate' while keeping only the 'semantic' content. We do not want to classify the message or extract specific information from it; we need the complete message.
  • I have been thinking about the regex-per-website approach, and that may be my last resort; if GPT cannot do a much better job, there is no point spending money on it. But computing the common prefix/suffix/infix texts seems like a hard search problem, as the space of candidate spans is very large. Are there any APIs or algorithms for doing this?
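Step 1 of the workflow above (finding the block with the longest text) might look like the following sketch, assuming the `bs4` package. The tag whitelist and the "direct text only" heuristic are illustrative assumptions, not the poster's actual code:

```python
# Sketch: pick the element that directly contains the most text, so a
# wrapper like <body> (whose text lives in its children) does not win.
from bs4 import BeautifulSoup, NavigableString

def _direct_text_len(el):
    """Length of text nodes that are immediate children of el."""
    return sum(
        len(child.strip())
        for child in el.children
        if isinstance(child, NavigableString)
    )

def longest_text_block(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    candidates = soup.find_all(["div", "td", "p", "body"])
    if not candidates:
        return html
    best = max(candidates, key=_direct_text_len)
    return str(best)
```

Scoring by *direct* text rather than total descendant text is one way to avoid always selecting the outermost container; a production version would likely combine both signals.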

Thank you again, any more comments are highly appreciated.

I recently completed a project to do exactly this.
It was the first stage (pre-processing stage) of an AI digital assistant app.

The results are over 99% accurate.

Model used: GPT-3.5 16k

Prompt format:

  1. Main prompt
  2. 5 step-by-step worked examples
  3. 5 one step worked examples
  4. Current email to be parsed
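The four-part format above can be assembled into a chat message list along these lines. The instruction text and examples here are placeholders, not the poster's actual prompt; the common convention is to encode each worked example as a prior user/assistant turn pair:

```python
# Sketch: few-shot prompt assembly. Worked examples become prior
# user/assistant turns; the email to parse goes last.
MAIN_PROMPT = "Extract the clean main-body text from the email below."

def build_messages(step_by_step_examples, one_step_examples, email_html):
    """Each example is an (email_html, worked_answer) pair."""
    messages = [{"role": "system", "content": MAIN_PROMPT}]
    for email, worked in step_by_step_examples + one_step_examples:
        messages.append({"role": "user", "content": email})
        messages.append({"role": "assistant", "content": worked})
    messages.append({"role": "user", "content": email_html})
    return messages
```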

When an error occurs (an edge case), I add the corrected example to the prompt examples (manually), so the output improves over time (basic learning).

This simple strategy works well enough for my app but there are likely additional strategies you may consider.

That’s a very interesting approach! I assume few shot learning helps a lot.

Can I check whether your HTML source code is relatively short? Ours is very long even after extracting the main block most likely to contain the body text, and sometimes it even exceeds the 16k context limit, so we have not been able to do few-shot learning.

Just noticed the additional info you added to clarify your workflow/use case. I am actually extracting and analyzing the semantic content of the emails for intent/category, versus simply parsing the original email.

Regarding semantic parsing vs. syntactic parsing: I believe syntactic parsing will suit your use case. We use it to do the initial parsing prior to passing the parsed emails to the AI.