Create an AI which will crawl pages and talk about them

Hi Folks,
I am trying to find out how we can have an AI that will crawl pages and give us some insight.
I saw a lot of online services with their own AI, but they are super expensive and I cannot get what I want. I would like to collect my own data, and eventually integrate with an LLM so that I can do some analysis based on my data. I hope I am being a little clear. Hopefully someone can put me on the right path. I have been searching for days but still couldn't find something that works for me.

Thanks in advance

Actually, you can try out the Live “Web” Research SDK TypeScript example from our open-source Policy Synth project. You’ll need an OpenAI API key and API keys for Google Custom Search; otherwise, it requires just four npm commands to start in dev mode: two to run the API and two more to run the web app.

Here is a screenshot:


If you’ve got a defined list of web pages that you’d like to crawl regularly, then you can build a solution that uses basic scraping techniques and then feeds the scraped data into an LLM for further processing and analysis. I’ve built something along those lines for a niche domain. Depending on the design, it does not have to involve much cost for development or ongoing operation.

If your search/crawl is open-ended, then that’s a different story.
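As an illustration of the pattern described above (not the poster's actual solution, which is low/no-code, see below), a minimal stdlib-only Python sketch might look like this; the URLs are placeholders:

```python
# Illustration only: scrape a fixed list of pages and normalize them to
# plain text. The URLs below are placeholders; swap in your own pages.
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def clean_html(html: str) -> str:
    """Strip tags and collapse whitespace, keeping only readable text."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.chunks).split())


def fetch_text(url: str) -> str:
    """Download one page and return its visible text."""
    with urlopen(url, timeout=30) as resp:
        return clean_html(resp.read().decode("utf-8", errors="ignore"))


PAGES = ["https://example.com/page-1", "https://example.com/page-2"]
# corpus = {url: fetch_text(url) for url in PAGES}
# Each entry of `corpus` can then be sent (in chunks) to an LLM for analysis.
```

The cleaning step matters because raw HTML wastes LLM context on markup; stripping scripts, styles, and tags keeps only the text worth analyzing.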

Thank you for your response, robert. I am not looking for a company chatbot solution. It is for personal learning purposes: to crawl pages (eventually via sitemaps), make some examples from there, and learn the topic better.
I hope I have explained it well. And of course I would like to keep the data for my own usage.
I will surely check the GitHub project, and I will certainly share my thoughts here.
As far as I can see, it does not scrape the provided pages; I think it is kind of a web-search app.
Maybe I am wrong, but that is how it looks from the screenshot.

Indeed, it is about basic scraping, feeding the results to ChatGPT, and of course further LLM processing and analysis.
If you have some basic code, can you share it with me so that I can check it and improve it further?

I’m using a low/no-code solution for the scraping part of my solution, so I can’t help out with code, I’m afraid.

Good luck!

You can use Policy Synth for private projects; it is open source, and you can run the SDK example on any laptop if you have an OpenAI API key. If you have a list of URLs, you can change the SDK example to skip the Google Search and instead just give it your list of URLs to scan. Just delete lines 40-90 in this link, which have to do with the search, and add your list of URLs (this could come from the web app with minor modifications): policy-synth/examples/liveResearch/webApi/src/liveResearchChatBot.ts at main · CitizensFoundation/policy-synth · GitHub

Thank you robertb, :heart:
I will test it, and eventually I can share it with you.
I can PM you for further investigation if you want.


Please share what solution you come up with. I and others would be interested.

Hi Folks,
I am still trying to write Python code that will collect the links;
the second step will be to create plain-text files from those links.
I am still working on it. I can only do this in my free time, as I also work full time, but I will keep you updated.
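A minimal sketch of those two steps in Python, assuming a standard sitemap.xml (the namespace below is the official sitemap schema; URLs and the output directory are placeholders):

```python
# Sketch of the two steps above: 1) collect the page URLs from a sitemap,
# 2) save each page's body as a plain-text file. Paths/URLs are placeholders.
import xml.etree.ElementTree as ET
from pathlib import Path
from urllib.request import urlopen

# Standard sitemap XML namespace, in ElementTree's {uri}tag notation.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def links_from_sitemap(xml_text: str) -> list:
    """Step 1: return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]


def save_as_text(url: str, out_dir: Path) -> Path:
    """Step 2: download one page and store its raw body as a .txt file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    name = url.rstrip("/").rsplit("/", 1)[-1] or "index"
    path = out_dir / (name + ".txt")
    with urlopen(url, timeout=30) as resp:
        path.write_text(resp.read().decode("utf-8", errors="ignore"))
    return path
```

Usage would be along the lines of `for url in links_from_sitemap(sitemap_xml): save_as_text(url, Path("pages"))`; stripping the HTML down to visible text before saving is a sensible extra step.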
Does someone already have a small Python script to feed the text to ChatGPT? Then I don’t have to spend time finding it again. I already have something, but I think the methods are deprecated.

from openai import OpenAI

# Read the plain text file
with open('your_text_file.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Set up OpenAI API credentials (or set the OPENAI_API_KEY env variable)
client = OpenAI(api_key='YOUR_API_KEY')

# Note: this sends a single completion request over the file's contents;
# it does not train or fine-tune anything. The old Completion API and
# text-davinci-002 are deprecated, so this uses the Chat Completions API.
response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[
        {'role': 'user', 'content': text},
    ],
    max_tokens=1000,
    temperature=0.8,
)

# Access the generated response
generated_text = response.choices[0].message.content.strip()

print(generated_text)

Thanks in advance for any help you can provide.