After working through various failures and circumventing other problems, I am able to get all of the code cells to run. However, the assistant fails to answer the questions: either there is no response, or it simply replies "I don't know" to everything.
Here are the challenges I've encountered and the workarounds I tried. Nothing works.
The OpenAI website is too big to crawl, and the prototype eventually errors out or becomes unresponsive during the crawl. I tried limiting the crawl to 1,000 pages. With that, I'm able to get all of the code cells to run, but it still won't answer the questions.
I've tried crawling much smaller websites. Still no answers to the questions.
I tried different questions that were easily answered in some of the crawled webpages, e.g. "What does do" where the named person has a web page that was crawled.
Has anybody gotten this to work? Is this prototype supported by OpenAI?
I followed the instructions on the OpenAI Platform and copy-pasted all of the blocks verbatim from there into a single Python file. You can see all of the code on that page (expand the Web Crawler section for the code). Hope that is helpful.
The GitHub code is at openai-cookbook/tree/main/apps/ as web-crawl-q-and-a.
I downloaded this and am running it right now; it's doing the crawl. I'll monitor whether it runs through fine.
To answer your question, the changes I made were straightforward:
(1) Changed the crawled site from "openai.com" to another site. I could see that it was crawling fine, with no errors.
(2) Limited the OpenAI crawl to 1,000 links as follows:
Defined a counter before the while loop in the crawl(url) function:

    i = 1
    # While the queue is not empty, continue crawling
    while queue:
        …

Added the following inside the while loop, at the end of the loop body:

        print(i)
        if i == 1000:
            break
        i += 1
…
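For reference, the same page cap can be expressed without a separate counter. This is a minimal sketch of a capped breadth-first crawl, not the cookbook's actual code: `links_by_page` is a made-up stand-in for real HTTP fetching and link extraction, so the structure of the loop is the only thing being illustrated.

```python
from collections import deque

def crawl(start_url, links_by_page, max_pages=1000):
    """Breadth-first crawl capped at max_pages.

    links_by_page is a hypothetical fixture mapping a URL to the
    links found on that page, standing in for real fetching.
    """
    queue = deque([start_url])
    seen = {start_url}
    crawled = []
    # While the queue is not empty, continue crawling
    while queue:
        if len(crawled) >= max_pages:
            break  # stop once the page cap is reached
        url = queue.popleft()
        crawled.append(url)
        for link in links_by_page.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled

# Tiny fixture: page "a" links to "b" and "c"; "b" links to "d"
pages = {"a": ["b", "c"], "b": ["d"]}
print(crawl("a", pages, max_pages=3))  # → ['a', 'b', 'c']
```

Capping on the number of pages actually crawled (rather than on a loop counter) means the limit still holds if some iterations skip a page.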
To add new questions for the assistant, I just edited the questions at the end of the code file.
1. Prepare the data for embedding. In your case, this is the output of the web crawler. Check how the information is arranged, whether it is logical, and whether even you can understand it manually. Not all webpages are written properly; that is why even Bing or Bard sometimes cannot summarize just any website.
2. Send the data to the embeddings API, and it will return vector data. Save it for later.
3. Receive the user's question, get its embedding, and compare it against the vector data from the previous step. At this point you should already know whether you have an actual hit, i.e. whether you get any proper result.
4. Send the user's question together with the step 3 results to the chat API for a summary. There is an off chance that the AI will disregard the results from step 3 if it determines that the data you supplied does not match what was asked.
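Step 3 above is where you can verify whether you have a real hit before ever calling the chat API. A minimal sketch of that matching stage, using cosine similarity over already-computed vectors (the 3-dimensional vectors below are made-up placeholders, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def best_match(question_vec, corpus):
    """Return the (text, score) of the corpus entry closest to the question.

    corpus is a list of (text, vector) pairs, e.g. one per crawled chunk.
    A low best score suggests the crawl did not capture relevant text.
    """
    scored = [(text, cosine_similarity(question_vec, vec))
              for text, vec in corpus]
    return max(scored, key=lambda pair: pair[1])

# Made-up vectors standing in for embeddings of crawled chunks
corpus = [
    ("pricing page", [0.9, 0.1, 0.0]),
    ("careers page", [0.1, 0.8, 0.2]),
]
question = [0.85, 0.15, 0.05]
text, score = best_match(question, corpus)
print(text)  # → pricing page
```

Printing the best score for each question is a quick way to tell whether "I don't know" answers come from the matching stage or from the chat stage.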
Based on what you have written, I am guessing that the output of the web crawling is not good. Can you verify?