Need help scraping a MediaWiki site

felix822 · May 31, 2023, 4:47pm

Hello,

I’m trying to scrape a MediaWiki site that we use as a support wiki, but my script has trouble following links and finding pages. I based my code on this example: OpenAI API

Thank you for any feedback!

joyasree78 · May 31, 2023, 6:35pm

You may have to pay, but please look at the LangChain Apify module. It integrates with Apify, which does the crawl for you.

github.com

hwchase17/langchain/blob/master/langchain/document_loaders/apify_dataset.py

"""Logic for loading documents from Apify datasets."""
from typing import Any, Callable, Dict, List

from pydantic import BaseModel, root_validator

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class ApifyDatasetLoader(BaseLoader, BaseModel):
    """Logic for loading documents from Apify datasets."""

    apify_client: Any
    dataset_id: str
    """The ID of the dataset on the Apify platform."""
    dataset_mapping_function: Callable[[Dict], Document]
    """A custom function that takes a single dictionary (an Apify dataset item)
     and converts it to an instance of the Document class."""

    def __init__(

This file has been truncated.
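
The `dataset_mapping_function` field in the excerpt above takes one Apify dataset item (a dict) and returns a `Document`. Here is a minimal sketch of such a function. The `text` and `url` keys are assumptions about what the crawler writes to the dataset, so check your own dataset's fields. The `Document` dataclass here is only a stand-in so the snippet runs without langchain installed, and `item_to_document` is a hypothetical name.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for langchain's Document: text plus a metadata dict."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def item_to_document(item: dict) -> Document:
    """Convert one Apify dataset item into a Document.

    The "text" and "url" keys are assumed, not guaranteed by Apify.
    """
    return Document(
        page_content=item.get("text", ""),
        metadata={"source": item.get("url", "")},
    )
```

With langchain installed, you would import its real `Document` instead and pass `item_to_document` as the loader's `dataset_mapping_function`.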

SomebodySysop · September 18, 2023, 9:26pm

Definitely look at Browse.ai https://browse.ai

I struggled with a couple of scrapers for several days, and didn’t even try Apify because of the JavaScript requirements. Browse.ai uses AI to set up the scraper and, relatively speaking, couldn’t be easier to use. It also has integrations, including Google Sheets and Zapier.
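
Since the original problem was following links and finding pages, one more option for MediaWiki specifically: the wiki's own Action API can list every page, so no link crawling is needed. This is a rough sketch, not a tested scraper. The `API_URL` is hypothetical (the `api.php` path differs between wikis), the function names are made up for illustration, and a private wiki may require logging in first.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint; replace with your wiki's api.php location.
API_URL = "https://wiki.example.com/w/api.php"


def allpages_url(api_url, continuation=None, limit=50):
    """Build the request URL for one batch of page titles."""
    params = {
        "action": "query",
        "list": "allpages",
        "aplimit": limit,
        "format": "json",
    }
    # The API returns a "continue" object; send it back to get the next batch.
    params.update(continuation or {})
    return f"{api_url}?{urlencode(params)}"


def parse_allpages(payload):
    """Return (titles, continuation) from one decoded API response."""
    titles = [page["title"] for page in payload["query"]["allpages"]]
    return titles, payload.get("continue")


def iter_titles(api_url):
    """Yield every page title, requesting batches until none remain."""
    continuation = None
    while True:
        with urlopen(allpages_url(api_url, continuation)) as response:
            payload = json.load(response)
        titles, continuation = parse_allpages(payload)
        yield from titles
        if continuation is None:
            break
```

Calling `list(iter_titles(API_URL))` would give the full title list, which you can then fetch page by page instead of relying on link discovery.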
