How to deal with unstructured data scraping for a website using AI?

Hi,

I’ve been scratching my head about this for a few days now.

My goal is to scrape a website with mostly unstructured data and make it searchable with a vector DB.

My problems:

  • How should I structure the data in the end to make it searchable?
  • What's a common approach to scraping it without using too many tokens?

The website is built like this

| /products
| - /product-1-fiat-500
| - /product-bmw-x3
...

What I’m going to do is loop over each detail page:

  • Minimize it (remove header, footer, …)

  • Call OpenAI with the minimized markup plus a structured-data prompt
    (like: "Scrape this page and extract the data according to the schema").

Schema Example:

{
title:
description:
price:
categories: ["car", "bike"]
}
  • Save it to a JSON file and push it to a vector DB like PostgreSQL (with pgvector)
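
The "minimize it" step above can be sketched with Python's standard library alone. This is a rough sketch, not a tested scraper: the `SKIP_TAGS` list and the `minimize_html` name are assumptions for illustration, and a real site will probably need its own list of tags to drop.

```python
from html.parser import HTMLParser

# Tags whose whole content is dropped. This list is an assumption; adjust per site.
SKIP_TAGS = {"script", "style", "header", "footer", "nav", "noscript", "svg"}


class TextExtractor(HTMLParser):
    """Collects visible text and ignores everything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while we are inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def minimize_html(html: str) -> str:
    """Return the page's visible text, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Sending only this text instead of the full markup should shrink each prompt considerably, at the cost of losing layout hints such as which element was the heading.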

My struggle now is that I’m calling OpenAI 300 times, it runs into rate limits pretty often, and every token costs money.
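
For the rate-limit part, one common pattern is to retry with exponential backoff. Below is a minimal sketch of that idea. `RateLimited` is a hypothetical stand-in for whatever rate-limit exception your API client raises, and `call_with_backoff` and its defaults are assumptions, not part of any library.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the rate-limit error raised by your API client."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on RateLimited, wait and retry with a doubling delay.

    The delay before retry number `attempt` is base_delay * 2**attempt plus a
    little random jitter. The last failure is re-raised to the caller.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimited:
            if attempt == max_retries - 1:
                raise
            jitter = random.uniform(0, 0.1 * base_delay)
            sleep(base_delay * (2 ** attempt) + jitter)
```

Usage would look like `call_with_backoff(lambda: extract(page_text))`, where `extract` is your own function that makes the API call and translates the client's rate-limit error into `RateLimited`. This does not reduce token cost; it only keeps the loop from failing when the limit is hit.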

So I am trying to find a way to reduce the prompt a bit more, but the page markup is quite large and so is my prompt.

I think what I could try further is:

Convert to Markdown

I’ve seen that some people convert HTML to Markdown, which could cut a lot of overhead. But I don't think that alone would help much.

Do I even need to structure the data?

Maybe I can just convert each page to Markdown and push the whole content into
a column for searching? (Single pages have approx. 1000 - 3000 words.)
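
If the whole page text goes into the database unstructured, a usual intermediate step is to split it into overlapping chunks and embed each chunk separately, since pages of 1000 - 3000 words tend to be long for a single embedding. A minimal sketch, where `chunk_words` and the sizes of 300 and 50 words are assumptions to tune rather than recommended values:

```python
def chunk_words(text, chunk_size=300, overlap=50):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in one of the two chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks
```

Each chunk would then become one row (page URL, chunk text, embedding), so a search hit points back to the page it came from.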

Generate Static Script

Instead of calling OpenAI 300 times, I could generate a scraping script with AI once, save it, and reuse it.

> First problem:

Not every detail page is the same, so there's no chance of using fixed selectors.
For example, the title, description, or price is sometimes in a different position than on other pages.

> Second problem:

In my schema I have a category enum like ["car", "bike"], and OpenAI finds a match and tells me whether it's a car or a bike. A static script can't make that call.
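
Whichever route produces the record, the enum itself can be enforced in plain code after the fact, so an unexpected category is caught before it reaches the database. A small sketch, assuming the model returns the schema above as a JSON string; `parse_record`, `ALLOWED_CATEGORIES`, and `REQUIRED_FIELDS` are hypothetical names:

```python
import json

# Assumed from the schema example: the only categories that may appear.
ALLOWED_CATEGORIES = {"car", "bike"}
REQUIRED_FIELDS = ("title", "description", "price", "categories")


def parse_record(raw: str) -> dict:
    """Parse the model's JSON answer and reject records that break the schema."""
    record = json.loads(raw)
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    unknown = set(record["categories"]) - ALLOWED_CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return record
```

A record that fails this check can be retried or set aside for manual review instead of being stored with a wrong category.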

Here’s a full tutorial from OpenAI that includes web scraping:

How to build an AI that can answer questions about your website
