So we have thousands of emails from over a hundred online shops. These emails cover different topics: confirming an order, updating an order status, reminding customers about their account, marketing, etc. Emails from the same vendor are generally consistently formatted. The emails are written in HTML and may have headers/footers that contain generic images, links, and promotional content. Each email also has a main body with a specific subject (one of the topics above).
I have been experimenting with prompting GPT to process these emails: remove the ‘boilerplate’, including the header, footer, and signature sections, and output only the plain text of the main body message. The prompt I use is:
The following text delimited by triple backticks is the HTML source code of an email. Extract the clean text of the main message, but remove the header and footer of the message. If the email has a signature section, also remove everything within and after the signature section. Text:\n\n
{text}
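A minimal sketch of how a prompt like this could be assembled in Python. The helper name `build_prompt` is made up for illustration, and the placement of the triple backticks directly around `{text}` is an assumption based on the wording of the prompt. No API call is shown.

```python
# Triple backticks, built this way so the snippet itself stays easy to read.
FENCE = "`" * 3

PROMPT_TEMPLATE = (
    "The following text delimited by triple backticks is the HTML source code "
    "of an email. Extract the clean text of the main message, but remove the "
    "header and footer of the message. If the email has a signature section, "
    "also remove everything within and after the signature section. "
    "Text:\n\n{fence}{text}{fence}"
)


def build_prompt(email_html: str) -> str:
    """Fill the template with one email's HTML (hypothetical helper)."""
    # str.format only interprets braces in the template, so braces inside
    # the email HTML (e.g. inline CSS) are passed through unchanged.
    return PROMPT_TEMPLATE.format(fence=FENCE, text=email_html)
```

Wrapping the HTML in explicit delimiters on both sides makes it unambiguous to the model where the email starts and ends.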
The results are rather mixed. In probably over 80% of cases the output is OK. But in a notable number of cases, the text is cut off arbitrarily. For example, if in the HTML source code there is a pair of tags wrapping the text
This is a long sentence long sentence long sentence long sentence
, sometimes it fails to retain all the text between the two tags, and may return something like ‘This is a long’.
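To measure how often this kind of cut-off happens, a rough check along these lines might help. It is only a sketch using Python's standard `html.parser`: the names `TextNodeCollector` and `find_truncated` and the `min_words` threshold are made up for illustration, and text that is split by inline tags (e.g. bold inside a paragraph) is treated as separate pieces.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects whitespace-normalised text runs, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.nodes.append(text)


def find_truncated(html_source, model_output, min_words=3):
    """Return (full_text, kept_prefix) pairs for text runs that appear in
    the model output only as a proper prefix of at least min_words words."""
    collector = TextNodeCollector()
    collector.feed(html_source)
    collector.close()
    output = " ".join(model_output.split())
    truncated = []
    for node in collector.nodes:
        if node in output:
            continue  # kept in full
        words = node.split()
        # Try the longest proper prefix first, then shorter ones.
        for n in range(len(words) - 1, min_words - 1, -1):
            prefix = " ".join(words[:n])
            if prefix in output:
                truncated.append((node, prefix))
                break
    return truncated
```

Text runs that are missing from the output entirely (such as a removed header) are not reported, since no prefix of them is found. Only partially retained text is flagged.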
Obviously, boilerplate removal would work better if we could ask the model to look at several emails from the same client at the same time, as they are likely to use the same formatting. But that is not possible due to the limited input context.
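The "same formatting" idea can also be sketched without a model, which avoids the context limit: text lines that recur across most emails from one sender are likely header/footer boilerplate. The following is an untested assumption rather than a proven method. The names `shared_lines` and `strip_shared` and the 0.8 threshold are made up for illustration, and it assumes each email has already been reduced to a list of text lines.

```python
from collections import Counter


def shared_lines(emails, min_share=0.8):
    """emails: list of emails, each given as a list of text lines.
    Returns the set of lines that occur in at least min_share of them."""
    counts = Counter()
    for lines in emails:
        counts.update(set(lines))  # count each line at most once per email
    threshold = min_share * len(emails)
    return {line for line, n in counts.items() if n >= threshold}


def strip_shared(lines, boilerplate):
    """Drop the lines of one email that were identified as shared."""
    return [line for line in lines if line not in boilerplate]
```

For example, with three emails that all contain the lines "ShopX" and "Unsubscribe" plus one distinct body line each, `shared_lines` returns those two lines and `strip_shared` leaves only the body line. Emails on the same topic may share body lines too, so the threshold would need tuning on real data.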
Any suggestions, please? Is there anything I have done wrong, or do you have any tips or thoughts?
Thanks