Read HTML page, and generate code to scrape the content

Hi all,

Using the Advanced data analysis from within chatGPT UI allows you to upload HTML content, then ask GPT to do all sorts of actions on that content. For example, you can ask it to generate python code to scrape the product titles and prices.

How would this be achieved through the API? I’ve noticed that the openai File create only accepts JSON data and purpose is fine-tune. How can I upload a file that gpt can use to inspect and generate code as per the example above?

You would just post the HTML as part of your prompt, and explain to the AI that you’d like to generate code that can parse that kind of HTML. I bet that will work. The JSON is just the mechanism for how you talk to the API, but you can send it any kind of text as it’s prompt.

Thanks for the response, but that only works if the HTML content is small. If you have a large page for example, say page size is over 500kb, that will be about 280K tokens. Far too large to pass into a prompt, I think the max for gpt-3.5 is 9800 tokens.

To get around that I’d replace the text content in the HTML with placeholder text like “{X}” and then just literally tell the AI the “{X}” is a placeholder, and that you want it to generate a parser still.

You might have to write code that, parses the HTML, and then writes it out in this compact format, but I can also tell you that’s like 10 lines of JS.

Tell GPT to write that JS for you. :slight_smile:

I’m probably looking for a way to upload the file to GPT without any modifications, and then have it reference that file. I’ve been looking into embeddings, but not sure that’s the correct way.

If you’re talking about analyzing a particular file structure in enough detail for the AI to be able to write a parser for the file, you’re going to need to feed it an exact actual example of the file you want to parse.

If the context length limitations are less than the size of your actual file, then you’ll just have to feed it something shorter, like what I suggested. I can’t think of any other way, but let’s see if anyone on this forum can prove me wrong. I hope so, for the sake of my own learning! :slight_smile:

1 Like

you answer your own question:

you run that code yourself.

Welcome to the OpenAI community @anthony9

Hi @sps , thanks, it’s good to be here :slight_smile:

I have tried embedding python code into the message sent to the API, however it still cannot access / open a file that is locally hosted. I will have a look at open-interpreter, however I would like to understand how the “model” is able to “see” the file.