How to deal with unstructured data scraping for a website using AI?

Hi,

I’ve been scratching my head about this for a few days now.

My goal is to scrape a website with mostly unstructured data and make it searchable with a vector DB.

My Problem:

  • How should I structure the data in the end to make it searchable?
  • What's a common approach to scrape it without using too many tokens?

The website is built like this:

| /products
| - /product-1-fiat-500
| - /product-bmw-x3
...

What I’m going to do is loop over each detail page:

  • Minimize it (remove header, footer, …)

  • Call OpenAI with the minimized markup plus a structured-data prompt.

  • (Like: *"Scrape this page and extract the data according to the schema below"*)

Schema Example:

{
  "title": "",
  "description": "",
  "price": "",
  "categories": ["car", "bike"]
}
  • Save the result to a JSON file and push it to a vector DB like PostgreSQL (with pgvector)
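
The "minimize it" step above can be sketched with just Python's standard library. This assumes the boilerplate lives in `<header>`/`<footer>`/`<nav>`/`<script>`/`<style>` tags; `minimize_html` is just an illustrative name, not a real library function:

```python
# Strip boilerplate tags from a page and keep only the visible text,
# using only the standard library's html.parser.
from html.parser import HTMLParser

STRIP_TAGS = {"header", "footer", "nav", "script", "style", "aside"}

class ContentExtractor(HTMLParser):
    """Collects text, skipping everything inside boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # >0 while we are inside a stripped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in STRIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in STRIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def minimize_html(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

Feeding the resulting plain text (instead of raw markup) to OpenAI is usually the single biggest token saving.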

My struggle now is that I’m calling OpenAI 300 times; it runs into rate limits pretty often, and every token costs money.

So I am trying to find a way to reduce the prompt further, but the page markup is quite large and so is my prompt.
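
For the rate-limit side specifically, a common pattern is retrying with exponential backoff. A minimal sketch (the bare `except Exception` is only for illustration; real code would catch `openai.RateLimitError` instead):

```python
# Retry a callable with exponential backoff between attempts.
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call`, sleeping 1s, 2s, 4s, ... between failures."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))
```

Wrapping each OpenAI call like `with_backoff(lambda: client.chat.completions.create(...))` lets the loop survive transient 429 responses instead of crashing.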

I think what I could try further is:

Convert to Markdown

I’ve seen that some people convert HTML to Markdown, which could remove a lot of overhead. But that alone wouldn’t help much.
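
As a rough illustration of that conversion, here's a deliberately naive regex-based HTML-to-Markdown sketch. Real projects usually reach for a library like html2text or markdownify, which handle nesting properly; this only shows where the token savings come from:

```python
# Naive HTML -> Markdown conversion: keep headings, paragraphs and list
# items, drop every other tag. Illustration only, not production-safe.
import re

def html_to_markdown(html: str) -> str:
    md = re.sub(r"<h1[^>]*>(.*?)</h1>", r"# \1\n", html, flags=re.S)
    md = re.sub(r"<h2[^>]*>(.*?)</h2>", r"## \1\n", md, flags=re.S)
    md = re.sub(r"<li[^>]*>(.*?)</li>", r"- \1\n", md, flags=re.S)
    md = re.sub(r"<p[^>]*>(.*?)</p>", r"\1\n", md, flags=re.S)
    md = re.sub(r"<[^>]+>", "", md)  # drop any remaining tags
    return re.sub(r"\n{2,}", "\n", md).strip()
```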

Do I even need to structure the data?

Maybe I can just convert the individual pages to Markdown and push the whole content into
a column for searching? (Single pages are approx. 1,000–3,000 words.)
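
If you go that route, it usually helps to split the 1,000–3,000-word pages into overlapping chunks before embedding them, since embedding models tend to work better on smaller passages. A plain word-based chunker sketch (the chunk sizes are illustrative, not a recommendation):

```python
# Split long text into overlapping word-based chunks for embedding.
def chunk_words(text: str, size: int = 300, overlap: int = 50):
    words = text.split()
    step = size - overlap
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + size]))
        if i + size >= len(words):
            break  # last chunk reached the end of the text
    return chunks
```

Each chunk then gets its own embedding row in pgvector, with the page URL stored alongside so search hits can link back to the product.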

Generate Static Script

Instead of calling OpenAI 300 times, I could generate a scraping script with AI once, save it, and reuse it.

> First problem:

Not every detail page is the same, so there’s no chance to use fixed selectors.
For example, sometimes the title, description or price is in a different position than on other pages.

> Second problem:

In my schema I have a category enum like [“car”, “bike”], and OpenAI finds a match and tells me whether it’s a car or a bike.
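
That categorization task is small enough that you can send only the title and description instead of the whole page. A sketch with a hypothetical prompt builder; the commented-out call shows standard OpenAI Python client usage:

```python
# Build a tiny classification prompt from already-extracted fields,
# so the category call costs a few dozen tokens instead of a whole page.
CATEGORIES = ["car", "bike"]

def build_category_prompt(title: str, description: str) -> str:
    return (
        f"Classify this product as exactly one of {CATEGORIES}.\n"
        f"Title: {title}\n"
        f"Description: {description}\n"
        "Answer with only the category name."
    )

# Usage (requires OPENAI_API_KEY; not run here):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": build_category_prompt(title, desc)}],
# )
# category = resp.choices[0].message.content.strip()
```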

Here’s a full tutorial from OpenAI, it includes web scraping:

How to build an AI that can answer questions about your website


In my opinion, you don’t need to send every page to OpenAI. I’d recommend starting with basic scraping and cleaning: extract the main content from the page, remove things like the header and footer, and then save the cleaned text in your vector database. This is usually enough for semantic search.
I’d also recommend using OpenAI only for small extra tasks, like mapping a page to a category or creating a short structured summary.

Thanks

Hey,

You can handle unstructured data scraping by first cleaning the page and extracting the main content in code, for example removing headers, footers and other unnecessary elements. Then store the cleaned text along with basic details like the title, price and URL. It’s best to use AI only where needed, for example for categorization or summarizing content, which reduces cost and improves efficiency.

For additional practical tips on scraping methods and APIs, this guide provides useful insights.

I hope this helps.