I want to extract unstructured data from a website, how do I do that?

Hello,

I’ll divide my problem into two contexts: the overall goal and the specific goal.

Overall and General Goal:
I want to build a website where OpenAI provides fitness consultations to a specific person.

Specific Goal:

I want to extract data from each exercise in Exercise Video Guides: 1500+ Exercises with Instructions & Tips. Whenever OpenAI suggests an exercise, I want to search for it on the website. The problem is that the data is not structured, so I can’t put it in an SQL file. How can I extract the data, specifically the title and video link, and then put them in an SQL file? Keep in mind that there are 1500+ exercises on this website.

P.S: I asked a similar question, but it was about the initial process. I thought this topic needed a question of its own.

Thanks in advance :slight_smile:

How are you expecting to get any data out of a video? Is there accompanying text? Has the text been verified? Can you be sure the text (if any exists) contains useful keywords?

You should spend some time looking at a few hundred examples and doing what you propose the AI does, but with the video disabled. That is, can you produce what you expect the AI to produce if, like the AI, you can’t see the video?

I checked the website you gave, and from a cursory glance at the pages and source code, it looks pretty structured to me, which makes web crawling easy. To me, this is more of a web crawling problem than an AI one. For example, you look for taxonomy-heading for the main exercise category, then under that go to exercise-category-list, which lists one category per cell; each cell contains an anchor tag that points to the exercise page, etc.
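
As a rough sketch of that crawl step, here is how the category links could be pulled out using only Python's standard library. The sample markup is a trimmed-down version of the page structure described above, and the names `CategoryLinkParser` and `extract_category_links` are made up for illustration. The live pages may differ, so treat this as a starting point rather than a finished crawler.

```python
from html.parser import HTMLParser

# Trimmed sample markup, modeled on the class names mentioned above.
# The real page contains more attributes and many more cells.
SAMPLE_HTML = """
<div class="mainpage-category-list exercise-category-list">
<div class="cell">
<a href="/exercises/abductors.html"><img alt="Abductors Exercises"></a>
<a href="/exercises/abductors.html">
<div class="category-name">Abductors</div>
</a>
</div>
<div class="cell">
<a href="/exercises/abs"><div class="category-name">Abs</div></a>
</div>
</div>
"""


class CategoryLinkParser(HTMLParser):
    """Collects (name, href) pairs for anchors wrapping a category-name div."""

    def __init__(self):
        super().__init__()
        self.links = []         # collected (name, href) tuples
        self._href = None       # href of the anchor currently open, if any
        self._in_name = False   # True while inside <div class="category-name">
        self._name_parts = []   # text fragments seen inside that div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._href = attrs.get("href")
        elif tag == "div" and "category-name" in (attrs.get("class") or "").split():
            self._in_name = True
            self._name_parts = []

    def handle_data(self, data):
        if self._in_name:
            self._name_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._in_name:
            # Closing the category-name div: record the name with its link.
            self._in_name = False
            name = "".join(self._name_parts).strip()
            if name and self._href:
                self.links.append((name, self._href))
        elif tag == "a":
            self._href = None


def extract_category_links(html):
    """Return a list of (category name, relative URL) tuples found in html."""
    parser = CategoryLinkParser()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    for name, href in extract_category_links(SAMPLE_HTML):
        print(name, href)
```

The same pattern (fetch a page, pick out the elements by class name, follow the anchors) would then repeat for each category page and each exercise page. A third-party HTML library would likely make the selectors shorter, but the idea is the same.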

On my website, I want each exercise to have a title, description, and video link. I don’t want the content in the video, but I am wondering how I can structure the data in an SQL format, either using ChatGPT or my own code. Or at least, what do I need to learn in order to know how to put the title, description, and video link in an SQL file?

Thanks for your response, btw !!!

Sorry, there was a miscommunication on my part. I am a frontend developer who is just getting into backend, so I thought structured data was data in a database. Anyway, what I do want is to put the structured data from the site into a database. This is a web crawling problem, and I am also new to web crawling. Is web crawling something that should be done with OpenAI? Either way, could you perhaps give me directions on how to do it?

Thanks for the reply btw !!

I think you would benefit greatly from these free short courses by Andrew Ng and OpenAI staff.

Yes, you can use the OpenAI Chat API to analyze the data from your web crawler.

To test this idea, I took a partial HTML excerpt from that site and fed it to ChatGPT with the prompt:

I have the following raw html, can you parse and organize the information in readable form?

<div class="content">
<h2 class="taxonomy-heading">Excercises by Muscle Group</h2>
<p class="subheading-text">Choose the muscle group you want to target. Once in the muscle group, use the sort and filter options to find the best exercises for the equipment you have, your training experience, and goals.</p>
<div class="mainpage-category-list exercise-category-list">
<div class="grid-x grid-margin-x grid-margin-y small-up-2 bp600-up-3 medium-up-4">
<div class="cell">
<a href="/exercises/abductors.html">
<img data-src="https://cdn.muscleandstrength.com/sites/default/files/taxonomy/image/videos/abductors.jpg" class=" lazyloaded" width="400" height="250" alt="Abductors Exercises" title="Abductors Exercises" src="https://cdn.muscleandstrength.com/sites/default/files/taxonomy/image/videos/abductors.jpg"></a>
<a href="/exercises/abductors.html">
<div class="category-name">Abductors</div>
</a>
...

And it gave the result:

**Exercise Categories by Muscle Group:**
1. **Abductors**
   - ![Image](https://cdn.muscleandstrength.com/sites/default/files/taxonomy/image/videos/abductors.jpg)
   - [Link](/exercises/abductors.html)

2. **Abs**
   - ![Image](https://cdn.muscleandstrength.com/sites/default/files/taxonomy/image/videos/abs_0.jpg)
   - [Link](/exercises/abs)
...

You can probably write a better prompt to give a cleaner result. But as a programmer, I think it will be faster if you parse it on your own.
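
If you do parse it yourself, getting the results into SQL is the easy part. Below is a minimal sketch using Python's built-in `sqlite3` module. The table layout, the function names, and the sample rows are all assumptions for illustration; the rows stand in for whatever title, description, and video link your own parser produces.

```python
import sqlite3

# Hypothetical rows, shaped like what a parser might return per exercise:
# (title, description, video_url). These are placeholders, not real site data.
EXERCISES = [
    ("Example Exercise A", "Placeholder description.", "https://example.com/videos/a"),
    ("Example Exercise B", "Another placeholder.", "https://example.com/videos/b"),
]


def build_database(rows, path=":memory:"):
    """Create an exercises table and insert (title, description, video_url) rows."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS exercises (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL UNIQUE,
            description TEXT,
            video_url TEXT
        )
        """
    )
    # Parameterized inserts avoid quoting problems in scraped text;
    # OR IGNORE skips titles that are already stored.
    conn.executemany(
        "INSERT OR IGNORE INTO exercises (title, description, video_url) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def find_exercise(conn, title):
    """Case-insensitive lookup by title; returns (title, video_url) or None."""
    return conn.execute(
        "SELECT title, video_url FROM exercises WHERE title = ? COLLATE NOCASE",
        (title,),
    ).fetchone()


def dump_sql(conn):
    """Return the whole database as SQL text, suitable for saving to a .sql file."""
    return "\n".join(conn.iterdump())


if __name__ == "__main__":
    conn = build_database(EXERCISES)
    print(find_exercise(conn, "example exercise a"))
    print(dump_sql(conn))
```

Here `find_exercise` is the kind of lookup you could run whenever the model suggests an exercise by name, and `dump_sql` gives you the SQL text if you really want a file rather than a live database. Exact-title matching is a simplification; suggested names will not always match the site's titles word for word.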

Get a 7B (very small, very fast) LLM to summarize it for you in sequence.

In a second pass, have it return cleaned-up, summarized code.

A third and fourth pass should either clean it up further or give a zero-shot decode.

This is actually really helpful, but I may have to start learning web scraping first (because I just want to lol). Once again, thanks for the reply!

Well, that’s really interesting. I’ll try to convert it into an SQL format. Once again, thank you for the help!

Do you mean an MPT-7B or am I talking about a completely different thing?

Wait there, friend…

I do hope you have learned what Hugging Face is since then.