HTML in text uploaded via the Files API

If the content of the file I upload to help train OpenAI has HTML embedded in it (i.e. span, div, etc.), is that going to confuse the system? Should I be stripping all the HTML out before I provide it?

I’m essentially feeding OpenAI the contents of our website to get it to answer questions about it, but I’m not sure whether I should grab the text only (the API to my CMS returns JSON that has all the HTML included).
Using the example in the docs:
{"text": "puppy A is happy", "metadata": "emotional state of puppy A"}
{"text": "puppy B is sad", "metadata": "emotional state of puppy B"}

Would this still work?
{"text": "puppy A < span>is happy< /span>", "metadata": "emotional state of puppy A"}
{"text": "puppy < span>B is sad< /span>", "metadata": "emotional state of puppy B"}

Or would the span tags just confuse the hell out of it? (Spaces added to the span tags as they weren’t showing up.)


Hi @ddrechsler,

This is a very interesting problem that you’re facing. In my opinion, if the purpose is to help users find information listed on your website, you won’t need the HTML. However, if the purpose is to help users navigate the site to do something, that’s a whole different problem.

Also, while we are on the subject, a better way to help users find information might be something like Google Custom Search, because it can handle any updates to the website, whereas with GPT-3 you’d have to fine-tune again after every update to accommodate the changes.


In the end I ran the HTML through some Python code that strips out just the text, then passed that to OpenAI. It’s working well, and I have a chatbot that can answer questions directly from the text of our website. I have to whitelist the questions it can successfully answer, though; you can use the knowledge bases in Dialogflow for that purpose. It’s kind of cool, like the ultimate website search where it answers from the content of the pages.
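
For anyone wanting to try the same thing, here is a rough sketch of what a stripping step like that could look like. This is not the actual code used for the site, just one way to do it with Python's standard library html.parser. The names TextExtractor, strip_html and to_jsonl_line are made up for illustration, and the list of block-level tags is an assumption you would probably want to tune for your own CMS output.

```python
import json
from html.parser import HTMLParser

# Assumption: tags whose boundaries should become a space so that
# adjacent paragraphs or list items do not run together.
BLOCK_TAGS = {"p", "div", "br", "li", "ul", "ol", "tr", "td", "th",
              "h1", "h2", "h3", "h4", "h5", "h6", "section", "article"}
# Content inside these tags is never useful as page text.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, drops tags, scripts, styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def to_jsonl_line(html, metadata):
    """Build one line in the text/metadata format from the docs example."""
    return json.dumps({"text": strip_html(html), "metadata": metadata})


if __name__ == "__main__":
    print(to_jsonl_line("puppy A <span>is happy</span>",
                        "emotional state of puppy A"))
```

You would call to_jsonl_line once per page or section returned by the CMS and write each result on its own line of the file you upload. If your pages have messy or malformed markup, a dedicated HTML library may cope better than this sketch.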