Error when using Langchain WebResearchRetriever – RuntimeError: asyncio.run() cannot be called from a running event loop

I’ve been attempting to use WebResearchRetriever from Langchain in Python, and I’m running a segment of code that works for other people, but I keep getting this error:

RuntimeError: asyncio.run() cannot be called from a running event loop

I think the issue may be with my computer and not with the code itself, but here’s the code:

from langchain.retrievers.web_research import WebResearchRetriever
import os
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models.openai import ChatOpenAI
from langchain.utilities import GoogleSearchAPIWrapper

os.environ["OPENAI_API_KEY"] = "my_key"

vectorstore = Chroma(embedding_function=OpenAIEmbeddings(), persist_directory="./chroma_db_oai")

llm = ChatOpenAI(temperature=0)

os.environ["GOOGLE_CSE_ID"] = "my_key"
os.environ["GOOGLE_API_KEY"] = "my_key"
search = GoogleSearchAPIWrapper()

web_research_retriever = WebResearchRetriever.from_llm(
    vectorstore=vectorstore,
    llm=llm,
    search=search,
)

from langchain.chains import RetrievalQAWithSourcesChain
user_input = "How do LLM Powered Autonomous Agents work?"
qa_chain = RetrievalQAWithSourcesChain.from_chain_type(llm, retriever=web_research_retriever)
result = qa_chain({"question": user_input})
print(result)

Can anyone help me resolve this error? Any help would be much appreciated.

You’ll probably have better luck asking on the LangChain forums since the issue is with their code.


Are you running this in Jupyter? If so, the notebook already runs an event loop in the background, so asyncio.run() refuses to start a second one inside it.
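A quick way to confirm this (stdlib only) is to check whether a loop is already running where your code executes:

```python
import asyncio

def loop_is_running() -> bool:
    """Return True when called from inside a running event loop."""
    try:
        asyncio.get_running_loop()
        return True
    except RuntimeError:
        return False

# In a plain script this prints False; in a Jupyter cell it prints
# True, which is exactly why asyncio.run() raises RuntimeError there.
print(loop_is_running())
```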

I’ve tried it in Jupyter and on Google Colab. Do you know what I should use instead so that there isn’t already an event loop?

I had this same issue in Jupyter. You should be able to solve it by tweaking your code along these lines:

# Make sure nest_asyncio is installed
!pip install nest_asyncio

# Allow nested asyncio event loops
import asyncio
import nest_asyncio
nest_asyncio.apply()

# Move your use of qa_chain into an async function
async def main():
    result = qa_chain({"question": user_input})
    print(result)

# Run your async function in the existing event loop
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
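For what it's worth, outside a notebook you don't need nest_asyncio at all, since no loop is running yet. A minimal sketch (with a dummy coroutine standing in for the chain call):

```python
import asyncio

async def main():
    # stand-in for the qa_chain({"question": user_input}) call
    return "done"

# In a plain script there is no loop running yet, so creating one
# and driving the coroutine to completion works directly.
loop = asyncio.new_event_loop()
try:
    result = loop.run_until_complete(main())
finally:
    loop.close()
print(result)
```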


Thank you so much! The error is gone. However, I now have another error which I have no clue how to fix. Any idea how to fix this?

ClientConnectorCertificateError: Cannot connect to host python.langchain.com:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1002)')]

Actually, I’ve been getting that same SSL error while using WebResearchRetriever. From what I’ve read, I think the true fix involves tweaking the local Python SSL certificate config/handling, but the things I’ve tried haven’t worked.
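If it helps with diagnosing, you can at least see which CA bundle paths your Python is using (stdlib only):

```python
import ssl

# CERTIFICATE_VERIFY_FAILED usually means the CA bundle at these
# default paths is missing or out of date on the local machine.
paths = ssl.get_default_verify_paths()
print("cafile:", paths.cafile)
print("capath:", paths.capath)
```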

However, I did come up with a monkey-patch you can use to modify the behavior of WebResearchRetriever._get_relevant_documents to disable SSL cert verification. This isn’t ideal from a security point of view, but it works. The updated logic also checks for an empty docs list before adding to the vector database, which avoids a tuple exception when no docs could be decoded (if they are all PDFs, for example). I’ve added comments with my name # (BabellDev) so you can see the changes. Otherwise the code is the same as the current version on GitHub.

Add this function declaration somewhere before your existing code, and then call it before you call WebResearchRetriever.from_llm.

def patch_web_research_retriever():
    import logging
    from typing import List
    from langchain.retrievers.web_research import WebResearchRetriever
    from langchain.callbacks.manager import CallbackManagerForRetrieverRun
    from langchain.document_loaders import AsyncHtmlLoader
    from langchain.document_transformers import Html2TextTransformer
    from langchain.schema import Document

    logger = logging.getLogger(__name__)

    def _patched_get_relevant_documents(
        self,
        query: str,
        *,
        run_manager: CallbackManagerForRetrieverRun,
    ) -> List[Document]:

        # Get search questions
        logger.info("Generating questions for Google Search ...")
        result = self.llm_chain({"question": query})
        logger.info(f"Questions for Google Search (raw): {result}")
        questions = getattr(result["text"], "lines", [])
        logger.info(f"Questions for Google Search: {questions}")

        # Get urls
        logger.info("Searching for relevant urls...")
        urls_to_look = []
        for query in questions:
            # Google search
            search_results = self.search_tool(query, self.num_search_results)
            logger.info("Searching for relevant urls...")
            logger.info(f"Search results: {search_results}")
            for res in search_results:
                if res.get("link", None):
                    urls_to_look.append(res["link"])

        # Relevant urls
        urls = set(urls_to_look)

        # Check for any new urls that we have not processed
        new_urls = list(urls.difference(self.url_database))

        logger.info(f"New URLs to load: {new_urls}")
        # Load, split, and add new urls to vectorstore
        if new_urls:

            # (BabellDev) changed verify_ssl to False
            loader = AsyncHtmlLoader(new_urls, verify_ssl=False)
            html2text = Html2TextTransformer()
            logger.info("Indexing new urls...")
            docs = loader.load()
            docs = list(html2text.transform_documents(docs))
            docs = self.text_splitter.split_documents(docs)

            # (BabellDev) do not add if docs is empty (avoid tuple error)
            if docs is not None and len(docs) > 0:
                self.vectorstore.add_documents(docs)

            self.url_database.extend(new_urls)

        # Search for relevant splits
        logger.info("Grabbing most relevant splits from urls...")
        docs = []
        for query in questions:
            docs.extend(self.vectorstore.similarity_search(query))

        # Get unique docs
        unique_documents_dict = {
            (doc.page_content, tuple(sorted(doc.metadata.items()))): doc for doc in docs
        }
        unique_documents = list(unique_documents_dict.values())
        return unique_documents

    WebResearchRetriever._get_relevant_documents = _patched_get_relevant_documents

If you don’t like monkey-patching, you could derive your own class from WebResearchRetriever and override the _get_relevant_documents method.
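To illustrate that subclass route without pulling in langchain, the override pattern looks like this (the class and return values here are just placeholders, not the real retriever API):

```python
class BaseRetriever:
    """Stand-in for WebResearchRetriever."""

    def _get_relevant_documents(self, query):
        return ["original behaviour"]

class PatchedRetriever(BaseRetriever):
    """Override the one method instead of monkey-patching the base class."""

    def _get_relevant_documents(self, query):
        # put the modified logic (e.g. verify_ssl=False) here
        return ["patched behaviour for: " + query]

print(PatchedRetriever()._get_relevant_documents("llm agents"))
```

The advantage over the monkey-patch is that the original class stays untouched for any other code that uses it.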

Hope it helps!


I had the same issue. I think the solution is to adjust async_html.py in \Lib\site-packages\langchain\document_loaders.

First, add these two imports:

import ssl
import certifi

Then change the code block at the bottom to:

ssl_context = ssl.create_default_context(cafile=certifi.where())
conn = aiohttp.TCPConnector(ssl=ssl_context)
async with aiohttp.ClientSession(connector=conn) as session:


FYI, regarding the original question: it looks like this has been fixed in the latest version of langchain. If you take a look at load() in async_html.py, it now handles an already-running loop: