I want to extract unstructured data from a website, how do I do that?

yousefmarey12 · September 4, 2023, 5:05am

Hello,

I’ll divide my problem into two contexts. The overall and general goal and the specific goal.

Overall and General Goal:
I want to build a website where OpenAI provides fitness consultations to a specific person.

Specific Goal:

I want to extract data for each exercise listed on "Exercise Video Guides: 1500+ Exercises with Instructions & Tips". Whenever OpenAI suggests an exercise, I want to search for it on the website. The problem is that the data is not structured, so I can’t put it in an SQL file. How can I extract the data, specifically the title and video link, and then put them in an SQL file? Keep in mind that there are 1500+ exercises on this website.

P.S: I asked a similar question, but it was about the initial process. I thought this topic needed a question of its own.

Thanks in advance

Foxalabs · September 4, 2023, 5:42am

How are you expecting to get any data out of a video? Is there accompanying text? Has the text been verified? Can you be sure the text (if any exists) contains useful keywords?

You should spend some time looking at a few hundred examples and doing by hand what you propose the AI should do, but with the video disabled. In other words, can you produce what you expect the AI to produce if you can’t see the video (just as the AI can’t)?

supershaneski · September 4, 2023, 6:53am

I checked the website you gave, and from a cursory glance at the pages and source code, it looks pretty structured to me, which makes web crawling easy. This is more of a web crawling problem than an AI problem, imo. For example, you look for taxonomy-heading for the main exercise category, then under that go to exercise-category-list, which lists one category per cell; each cell contains an anchor tag that points to the exercise page, etc.
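As a rough sketch of that traversal, here is a breadth-first crawl that starts at a listing page, collects anchor hrefs, resolves them to absolute URLs, and follows only those under a given prefix. The names `LinkCollector` and `crawl`, the `fetch` callable, and the `max_pages` cap are assumptions made up for illustration; a real HTTP client, rate limiting, and a robots.txt check are deliberately left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url, fetch, prefix, max_pages=50):
    """Breadth-first crawl from start_url, following only links under prefix.

    fetch is any callable (hypothetical here) that takes a URL and returns
    the page HTML as a string. Returns visited URLs in visit order.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        collector = LinkCollector()
        collector.feed(fetch(url))
        for href in collector.hrefs:
            absolute = urljoin(url, href)  # resolve relative links like /exercises/abs
            if absolute.startswith(prefix) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Passing `fetch` in as a parameter keeps the loop testable against an in-memory dict of pages before pointing it at a live site.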

yousefmarey12 · September 4, 2023, 11:56am

On my website, I want each exercise to have a title, description, and video link. I don’t want the content in the video, but I am wondering how I can structure the data in an SQL format, either using ChatGPT or my own code. Or at least, what do I need to learn in order to know how to put the title, description, and video link in an SQL file?

Thanks for your response, btw !!!

yousefmarey12 · September 4, 2023, 12:27pm

Sorry, there was a miscommunication on my part. I am a frontend developer who is just getting into backend, so I thought structured data meant data in a database. Anyway, what I do want is to put the structured data from the site into a database. This is a web crawling problem, and I am also new to web crawling. Is web crawling something that should be done with OpenAI? Either way, could you perhaps give me directions on how to do it?

Thanks for the reply btw !!

Foxalabs · September 4, 2023, 12:40pm

I think you would benefit greatly from these free short courses by Andrew Ng and OpenAI staff

supershaneski · September 4, 2023, 11:34pm

Yes, you can use the OpenAI Chat API to analyze the data from your web crawler.

To test this idea, I took a partial HTML of that site and fed it to ChatGPT with the prompt:

I have the following raw html, can you parse and organize the information in readable form?

<div class="content">
<h2 class="taxonomy-heading">Excercises by Muscle Group</h2>
<p class="subheading-text">Choose the muscle group you want to target. Once in the muscle group, use the sort and filter options to find the best exercises for the equipment you have, your training experience, and goals.</p>
<div class="mainpage-category-list exercise-category-list">
<div class="grid-x grid-margin-x grid-margin-y small-up-2 bp600-up-3 medium-up-4">
<div class="cell">
<a href="/exercises/abductors.html">
<img data-src="https://cdn.muscleandstrength.com/sites/default/files/taxonomy/image/videos/abductors.jpg" class=" lazyloaded" width="400" height="250" alt="Abductors Exercises" title="Abductors Exercises" src="https://cdn.muscleandstrength.com/sites/default/files/taxonomy/image/videos/abductors.jpg"></a>
<a href="/exercises/abductors.html">
<div class="category-name">Abductors</div>
</a>
...

And it gave the result:

**Exercise Categories by Muscle Group:**
1. **Abductors**
   - ![Image](https://cdn.muscleandstrength.com/sites/default/files/taxonomy/image/videos/abductors.jpg)
   - [Link](/exercises/abductors.html)

2. **Abs**
   - ![Image](https://cdn.muscleandstrength.com/sites/default/files/taxonomy/image/videos/abs_0.jpg)
   - [Link](/exercises/abs)
...

You can probably write a better prompt to give a cleaner result. But as a programmer, I think it will be faster if you parse it on your own.
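A minimal sketch of parsing it yourself, using only Python's standard-library `html.parser` and a trimmed, closed-off copy of the HTML quoted above. The class name `CategoryParser` and the helper `parse_categories` are made up for this example, and it assumes every category name sits in a `div` with class `category-name` wrapped by its link, as in the snippet.

```python
from html.parser import HTMLParser

# Trimmed copy of the snippet quoted in this thread, with closing tags added.
SAMPLE_HTML = """
<div class="mainpage-category-list exercise-category-list">
<div class="cell">
<a href="/exercises/abductors.html">
<img src="https://cdn.muscleandstrength.com/sites/default/files/taxonomy/image/videos/abductors.jpg" alt="Abductors Exercises"></a>
<a href="/exercises/abductors.html">
<div class="category-name">Abductors</div>
</a>
</div>
</div>
"""


class CategoryParser(HTMLParser):
    """Pairs each category-name <div> with the href of the <a> wrapping it."""

    def __init__(self):
        super().__init__()
        self.categories = []       # list of {"name": ..., "link": ...}
        self._current_href = None  # href of the <a> we are inside, if any
        self._in_name = False      # True while inside <div class="category-name">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._current_href = attrs.get("href")
        elif tag == "div" and "category-name" in (attrs.get("class") or "").split():
            self._in_name = True

    def handle_data(self, data):
        if self._in_name and data.strip():
            self.categories.append(
                {"name": data.strip(), "link": self._current_href}
            )

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_name = False
        elif tag == "a":
            self._current_href = None


def parse_categories(html):
    """Return the category names and links found in an HTML string."""
    parser = CategoryParser()
    parser.feed(html)
    return parser.categories
```

On the sample, `parse_categories(SAMPLE_HTML)` returns a single entry with name `Abductors` and link `/exercises/abductors.html`, matching the first item of the ChatGPT result.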

generalbadwolf · September 5, 2023, 12:27am

Get a 7B (very small, very fast) LLM to summarize it for you in sequence.

Have it return, in a second phase, cleaned-up, summarized code.

The third and fourth passes should either clean it up more or give a zero-shot decode.

yousefmarey12 · September 5, 2023, 11:00am

This is actually really helpful, but I may have to start learning web scraping first (because I just want to lol). Once again, thanks for the reply!

yousefmarey12 · September 5, 2023, 11:04am

Well, that’s really interesting. I’ll try to convert it into an SQL format. Once again, thank you for the help!
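For that conversion step, a minimal sketch assuming SQLite via Python's built-in `sqlite3` module. The table name `exercises`, its columns, and the helpers `save_exercises` and `find_exercise` are hypothetical choices for illustration, not anything prescribed in this thread; the input is assumed to be a list of dicts produced by whatever scraper is used.

```python
import sqlite3


def save_exercises(conn, exercises):
    """Create the table if needed and insert rows, replacing on duplicate title."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS exercises (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL UNIQUE,
            description TEXT,
            video_url TEXT
        )
        """
    )
    # Parameter placeholders (?) keep scraped text from being read as SQL.
    conn.executemany(
        "INSERT OR REPLACE INTO exercises (title, description, video_url) "
        "VALUES (?, ?, ?)",
        [(e["title"], e.get("description"), e.get("video_url")) for e in exercises],
    )
    conn.commit()


def find_exercise(conn, title):
    """Case-insensitive lookup; returns (title, description, video_url) or None."""
    return conn.execute(
        "SELECT title, description, video_url FROM exercises "
        "WHERE title = ? COLLATE NOCASE",
        (title,),
    ).fetchone()
```

The case-insensitive lookup is there because an exercise name suggested by the model may not match the site's capitalization exactly; fuzzier matching would need something beyond this sketch.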

yousefmarey12 · September 5, 2023, 11:09am

Do you mean an MPT-7B or am I talking about a completely different thing?

generalbadwolf · October 28, 2023, 6:47pm

Wait
There
Friend…

I do hope you have learned what Hugging Face is since then.
