How do I write a prompt to scrape the components (HTML, CSS, image, link, header, title, body, etc) from a webpage?

mk235 · August 25, 2025, 2:04pm

I am able to instantiate a OpenAI instance and get a ChatCompletions object but my prompt is not good enough to show all the comps of the webpage (saved as a html file in my drive).

My prompt is as :

prompt = “”"

You have a content page from a website.

Get the individual components on this page like header, footer, image, title, body, metadata, links or any other content.

For each component, return:

- a title or heading

- a text content from that component if any

- any other metadata if present

- any images if available

Output the result in form of a JSON array with each element representing a component with its extracted fields.

Here is the HTML content:

\“\”\“{html}\”\“\”

Return only the JSON array.

“”"

How do I improve it?

My get_components function is as follows:

def get_components(html):

messages = \[

{“role”: “system”, “content”: prompt},

\]



client = OpenAI(api_key=OPENAI_API_KEY)



completion = client.chat.completions.create(

 model="gpt-3.5-turbo",

 messages=messages,

 response_format= { "type":"json_object" },

 temperature=0.7

)



print('\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*Response is \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*')

print(completion)



response = completion.choices\[0\].message.content.strip()



try:

    data = json.loads(response)

    return data

except json.JSONDecodeError as e:

    print(e)

    return \[\]

Thanks

jochenschultz · August 25, 2025, 4:16pm

You should instead prompt something like “write me a python script that crawls a website”.. and then use that..

The whole idea of using a llm for that is just wrong.

When there is a programmatic solution you always prefer that.

Topic		Replies	Views
ChatCompletion Prompt is not giving the desired result API gpt-35-turbo	3	1094	November 20, 2023
Different Variations in Prompt - How to manage multiple prompts Prompting chatgpt , prompt , prompts-as-code , prompt-engineering	5	1173	August 23, 2024
Prompt Preamble/Prompt Template for Virtual Assistant Prompting api	4	4305	February 7, 2024
Issue with OpenAI API Ignoring Prompt Instructions to Exclude Specific HTML Tags Prompting openapi , html	16	601	December 22, 2024
Can LinkReader Plugin be used with the OpenAI API? API chatgpt	21	3177	June 30, 2023

How do I write a prompt to scrape the components (HTML, CSS, image, link, header, title, body, etc) from a webpage?

Related topics