How to deal with unstructured data scraping for a website using AI?

Hi,

I’ve been scratching my head about this for a few days now.

My goal is to scrape a website with mostly unstructured data and make it searchable with a vector DB.

My Problem:

  • How should I structure the data in the end to make it searchable?
  • What's a common approach to scrape it without using too many tokens?

The website is built like this:

| /products
| - /product-1-fiat-500
| - /product-bmw-x3
...

What I’m going to do is loop over each detail page:

  • Minimize it (remove header, footer, …)

  • Call OpenAI with the minimized markup plus a structured-data prompt.

  • (Like: *"Scrape this page and extract the data according to the schema below"*)

Schema Example:

{
  "title": "",
  "description": "",
  "price": "",
  "categories": ["car", "bike"]
}
  • Save the result to a JSON file and push it to a vector DB like PostgreSQL (with pgvector)
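
The "minimize it" step above can be sketched with just Python's standard library. This assumes the boilerplate lives in `<header>`/`<footer>`/`<nav>`/`<script>`/`<style>` tags; `minimize_html` is just an illustrative name, not a real library function:

```python
# Strip boilerplate tags from a page and keep only the visible text,
# using only the standard library's html.parser.
from html.parser import HTMLParser

STRIP_TAGS = {"header", "footer", "nav", "script", "style", "aside"}

class ContentExtractor(HTMLParser):
    """Collects text, skipping everything inside boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # >0 while we are inside a stripped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in STRIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in STRIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def minimize_html(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

Feeding the resulting plain text (instead of raw markup) to OpenAI is usually the single biggest token saving.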

My struggle now is that I’m calling OpenAI 300 times; it runs into rate limits pretty often, and every token costs money.

So I am trying to find a way to reduce the prompt further, but the page markup is quite large and so is my prompt.
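
For the rate-limit side specifically, a common pattern is retrying with exponential backoff. A minimal sketch (the bare `except Exception` is only for illustration; real code would catch `openai.RateLimitError` instead):

```python
# Retry a callable with exponential backoff between attempts.
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call`, sleeping 1s, 2s, 4s, ... between failures."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))
```

Wrapping each OpenAI call like `with_backoff(lambda: client.chat.completions.create(...))` lets the loop survive transient 429 responses instead of crashing.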

I think what I could try further is:

Convert to Markdown

I’ve seen that some people convert HTML to Markdown, which could remove a lot of overhead. But that alone wouldn’t help much.
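
As a rough illustration of that conversion, here's a deliberately naive regex-based HTML-to-Markdown sketch. Real projects usually reach for a library like html2text or markdownify, which handle nesting properly; this only shows where the token savings come from:

```python
# Naive HTML -> Markdown conversion: keep headings, paragraphs and list
# items, drop every other tag. Illustration only, not production-safe.
import re

def html_to_markdown(html: str) -> str:
    md = re.sub(r"<h1[^>]*>(.*?)</h1>", r"# \1\n", html, flags=re.S)
    md = re.sub(r"<h2[^>]*>(.*?)</h2>", r"## \1\n", md, flags=re.S)
    md = re.sub(r"<li[^>]*>(.*?)</li>", r"- \1\n", md, flags=re.S)
    md = re.sub(r"<p[^>]*>(.*?)</p>", r"\1\n", md, flags=re.S)
    md = re.sub(r"<[^>]+>", "", md)  # drop any remaining tags
    return re.sub(r"\n{2,}", "\n", md).strip()
```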

Do I even need to structure the data?

Maybe I can just convert the individual pages to Markdown and push the whole content into
a column for searching? (Single pages are approx. 1,000–3,000 words.)
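
If you go that route, it usually helps to split the 1,000–3,000-word pages into overlapping chunks before embedding them, since embedding models tend to work better on smaller passages. A plain word-based chunker sketch (the chunk sizes are illustrative, not a recommendation):

```python
# Split long text into overlapping word-based chunks for embedding.
def chunk_words(text: str, size: int = 300, overlap: int = 50):
    words = text.split()
    step = size - overlap
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + size]))
        if i + size >= len(words):
            break  # last chunk reached the end of the text
    return chunks
```

Each chunk then gets its own embedding row in pgvector, with the page URL stored alongside so search hits can link back to the product.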

Generate Static Script

Instead of calling OpenAI 300 times, I could generate a scraping script with AI once, save it, and reuse it.

> First problem:

Not every detail page is the same, so there’s no chance to use fixed selectors.
For example, sometimes the title, description or price is in a different position than on other pages.

> Second problem:

In my schema I have a category enum like [“car”, “bike”], and OpenAI finds a match and tells me whether it’s a car or a bike.
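
That categorization task is small enough that you can send only the title and description instead of the whole page. A sketch with a hypothetical prompt builder; the commented-out call shows standard OpenAI Python client usage:

```python
# Build a tiny classification prompt from already-extracted fields,
# so the category call costs a few dozen tokens instead of a whole page.
CATEGORIES = ["car", "bike"]

def build_category_prompt(title: str, description: str) -> str:
    return (
        f"Classify this product as exactly one of {CATEGORIES}.\n"
        f"Title: {title}\n"
        f"Description: {description}\n"
        "Answer with only the category name."
    )

# Usage (requires OPENAI_API_KEY; not run here):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": build_category_prompt(title, desc)}],
# )
# category = resp.choices[0].message.content.strip()
```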

Here’s a full tutorial from OpenAI, it includes web scraping:

How to build an AI that can answer questions about your website


In my opinion, you don’t need to send every page to OpenAI. I’d recommend starting with basic scraping and cleaning: extract the main content from the page, remove things like the header and footer, and then save the cleaned text in your vector database. This is usually enough for semantic search.
I’d also recommend using OpenAI only for small extra tasks, like mapping a page to a category or creating a short structured summary.

Thanks

Hey,

You can handle unstructured data scraping by first cleaning the page and extracting the main content in code, for example removing headers, footers and other unnecessary elements. Then store the cleaned text along with basic details like the title, price and URL. It’s best to use AI only where needed, for example for categorization or summarizing content, which reduces cost and improves efficiency.

For additional practical tips on scraping methods and APIs, this guide provides useful insights.

I hope this helps.