How do I write a prompt to scrape the components (HTML, CSS, image, link, header, title, body, etc) from a webpage?

I am able to instantiate a OpenAI instance and get a ChatCompletions object but my prompt is not good enough to show all the comps of the webpage (saved as a html file in my drive).

My prompt is as :

prompt = “”"
You have a content page from a website.
Get the individual components on this page like header, footer, image, title, body, metadata, links or any other content.
For each component, return:
- a title or heading
- a text content from that component if any
- any other metadata if present
- any images if available
Output the result in form of a JSON array with each element representing a component with its extracted fields.
Here is the HTML content:
\“\”\“{html}\”\“\”
Return only the JSON array.
“”"

How do I improve it?

My get_components function is as follows:

def get_components(html):
messages = \[
{“role”: “system”, “content”: prompt},
\]



client = OpenAI(api_key=OPENAI_API_KEY)



completion = client.chat.completions.create(

 model="gpt-3.5-turbo",

 messages=messages,

 response_format= { "type":"json_object" },

 temperature=0.7

)



print('\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*Response is \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*')

print(completion)



response = completion.choices\[0\].message.content.strip()



try:

    data = json.loads(response)

    return data

except json.JSONDecodeError as e:

    print(e)

    return \[\]

Thanks

You should instead prompt something like “write me a python script that crawls a website”.. and then use that..

The whole idea of using a llm for that is just wrong.

When there is a programmatic solution you always prefer that.