I’ve been attempting to use WebResearchRetriever from LangChain in Python. I’m running a snippet that works for other people, but I keep getting this error:
RuntimeError: asyncio.run() cannot be called from a running event loop
I think the issue may be with my computer and not with the code itself, but here’s the code:
from langchain.retrievers.web_research import WebResearchRetriever
import os
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models.openai import ChatOpenAI
from langchain.utilities import GoogleSearchAPIWrapper
I’ve been getting that same SSL error while using WebResearchRetriever. From what I’ve read, the proper fix involves adjusting the local Python SSL certificate configuration, but nothing I’ve tried so far has worked.
In the meantime, I came up with a monkey-patch that modifies WebResearchRetriever._get_relevant_documents to disable SSL certificate verification. This isn’t ideal from a security point of view, but it works. The updated logic also checks for an empty docs list before adding to the vector database, which avoids a tuple exception when none of the documents could be decoded (for example, when they are all PDFs). I’ve marked my changes with comments containing my name, # (BabellDev); otherwise the code is the same as the current version on GitHub.
Add this function declaration somewhere before your existing code, and then call it before you call WebResearchRetriever.from_llm:
def patch_web_research_retriever():
    import logging
    from typing import List

    from langchain.callbacks.manager import CallbackManagerForRetrieverRun
    from langchain.document_loaders import AsyncHtmlLoader
    from langchain.document_transformers import Html2TextTransformer
    from langchain.retrievers.web_research import WebResearchRetriever
    from langchain.schema import Document

    logger = logging.getLogger(__name__)

    def _patched_get_relevant_documents(
        self,
        query: str,
        *,
        run_manager: CallbackManagerForRetrieverRun,
    ) -> List[Document]:
        # Get search questions
        logger.info("Generating questions for Google Search ...")
        result = self.llm_chain({"question": query})
        logger.info(f"Questions for Google Search (raw): {result}")
        questions = getattr(result["text"], "lines", [])
        logger.info(f"Questions for Google Search: {questions}")

        # Get urls
        logger.info("Searching for relevant urls...")
        urls_to_look = []
        for query in questions:
            # Google search
            search_results = self.search_tool(query, self.num_search_results)
            logger.info("Searching for relevant urls...")
            logger.info(f"Search results: {search_results}")
            for res in search_results:
                if res.get("link", None):
                    urls_to_look.append(res["link"])

        # Relevant urls
        urls = set(urls_to_look)

        # Check for any new urls that we have not processed
        new_urls = list(urls.difference(self.url_database))
        logger.info(f"New URLs to load: {new_urls}")

        # Load, split, and add new urls to vectorstore
        if new_urls:
            # (BabellDev) changed verify_ssl to False
            loader = AsyncHtmlLoader(new_urls, verify_ssl=False)
            html2text = Html2TextTransformer()
            logger.info("Indexing new urls...")
            docs = loader.load()
            docs = list(html2text.transform_documents(docs))
            docs = self.text_splitter.split_documents(docs)
            # (BabellDev) do not add if docs is empty (avoid tuple error)
            if docs is not None and len(docs) > 0:
                self.vectorstore.add_documents(docs)
            self.url_database.extend(new_urls)

        # Search for relevant splits
        logger.info("Grabbing most relevant splits from urls...")
        docs = []
        for query in questions:
            docs.extend(self.vectorstore.similarity_search(query))

        # Get unique docs
        unique_documents_dict = {
            (doc.page_content, tuple(sorted(doc.metadata.items()))): doc
            for doc in docs
        }
        unique_documents = list(unique_documents_dict.values())
        return unique_documents

    WebResearchRetriever._get_relevant_documents = _patched_get_relevant_documents
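For example, here is a minimal end-to-end sketch built from the imports in your question (the Chroma directory and query string are just placeholders, and you’ll also need GOOGLE_API_KEY, GOOGLE_CSE_ID, and OPENAI_API_KEY set in your environment):

from langchain.chat_models.openai import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.web_research import WebResearchRetriever
from langchain.utilities import GoogleSearchAPIWrapper
from langchain.vectorstores import Chroma

# Apply the patch before building the retriever, so the class
# attribute is already replaced when from_llm constructs it.
patch_web_research_retriever()

llm = ChatOpenAI(temperature=0)
vectorstore = Chroma(
    embedding_function=OpenAIEmbeddings(),
    persist_directory="./chroma_db",  # placeholder path
)
search = GoogleSearchAPIWrapper()

retriever = WebResearchRetriever.from_llm(
    vectorstore=vectorstore,
    llm=llm,
    search=search,
)
docs = retriever.get_relevant_documents("How do LLM agents work?")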
If you don’t like monkey-patching, you could instead derive your own class from WebResearchRetriever and override the _get_relevant_documents method, as sketched below.
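A rough sketch of that approach, assuming you’ve moved the patched body above to module scope as _patched_get_relevant_documents (the class name here is just a placeholder):

from typing import List

from langchain.callbacks.manager import CallbackManagerForRetrieverRun
from langchain.retrievers.web_research import WebResearchRetriever
from langchain.schema import Document

class NoSSLVerifyWebResearchRetriever(WebResearchRetriever):
    def _get_relevant_documents(
        self,
        query: str,
        *,
        run_manager: CallbackManagerForRetrieverRun,
    ) -> List[Document]:
        # Reuse the patched body (verify_ssl=False loader plus the
        # empty-docs guard); everything else is inherited unchanged.
        return _patched_get_relevant_documents(self, query, run_manager=run_manager)

Then call NoSSLVerifyWebResearchRetriever.from_llm(...) wherever you currently call WebResearchRetriever.from_llm(...).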
FYI, regarding the original question: it looks like this has been fixed in the latest version of LangChain. If you take a look at load() in async_html, you’ll see it now handles an already-running event loop:
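From memory, the pattern looks roughly like this (paraphrased, so check the current source on GitHub for the exact code):

import asyncio
from concurrent.futures import ThreadPoolExecutor

def load(self):
    try:
        # Raises RuntimeError if no event loop is currently running.
        asyncio.get_running_loop()
        # A loop is already running (e.g. inside Jupyter), so calling
        # asyncio.run() here would raise; instead, run the coroutine in
        # a fresh loop on a worker thread.
        with ThreadPoolExecutor(max_workers=1) as executor:
            results = executor.submit(
                lambda: asyncio.run(self.fetch_all(self.web_paths))
            ).result()
    except RuntimeError:
        # No running loop, so asyncio.run() is safe.
        results = asyncio.run(self.fetch_all(self.web_paths))
    # ...then build Documents from results as before

That’s why the RuntimeError from your traceback no longer occurs on recent versions: asyncio.run() is only called when no loop is already running.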