Scraping a website and feeding it to OpenAI to make a chatbot

Hi Folks

I am trying to scrape my internal Confluence pages (there can be 100 of them from the root URL) and feed them into OpenAI to build a chatbot. The chatbot should help users find the relevant pages they need, answer their queries, and provide hyperlinks to the pages that contain those answers.
I have been trying to do this with Beautiful Soup, but since there are so many pages, I am unable to stay within the token limit.
Please let me know what approach I can adopt to build this library, which should:
a) scrape any number of provided URLs and their subpages
b) stay within the token limit
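
For requirement (a), a minimal sketch of a breadth-first crawl that stays under a root URL might look like the following. It uses only the Python standard library rather than Beautiful Soup so that it is self-contained. The names `LinkAndTextParser`, `fetch_html`, and `crawl` are hypothetical, and `fetch_html` assumes pages can be fetched without authentication, which is unlikely to hold for an internal Confluence site.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkAndTextParser(HTMLParser):
    """Collects href targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links, self.text_parts = [], []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())


def fetch_html(url):
    """Hypothetical fetcher (assumption: no login is required)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(root_url, fetch=fetch_html, max_pages=100):
    """Breadth-first crawl of pages whose URL starts with root_url.

    Returns a dict mapping each visited URL to its extracted text.
    """
    seen, queue, pages = {root_url}, deque([root_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        parser = LinkAndTextParser()
        parser.feed(fetch(url))
        pages[url] = " ".join(parser.text_parts)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(root_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` as a parameter makes it possible to swap in an authenticated client later. The `max_pages` cap keeps the crawl bounded.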

You’ll want to look into text embeddings[1] for doing dynamic retrieval of context.


  1. Text Embeddings API Reference
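
To make "dynamic retrieval of context" concrete, here is a rough sketch of ranking text chunks against a query by cosine similarity. `toy_embed` is only a stand-in (a hashed bag-of-words vector) so the example runs offline; in practice you would replace it with a call to an embeddings API. All function names here are hypothetical.

```python
import math
import re
import zlib


def toy_embed(text, dim=256):
    """Stand-in for an embeddings API call: hashed bag-of-words, unit length."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_index(chunks, embed=toy_embed):
    """chunks: list of (url, text). Returns a list of (url, text, vector)."""
    return [(url, text, embed(text)) for url, text in chunks]


def top_k(query, index, k=3, embed=toy_embed):
    """Return the k (url, text) chunks most similar to the query, best first."""
    q = embed(query)
    scored = sorted(index, key=lambda row: cosine(q, row[2]), reverse=True)
    return [(url, text) for url, text, _ in scored[:k]]
```

The index is built once, when the pages are scraped. At question time only the query needs to be embedded, and only the top few chunks are passed on to the chat model.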

Thanks. Can you elaborate on what the architectural flow would be here?

Will it be:
scrape all web pages, create embeddings, store them in a CSV or similar, then use the query plus embeddings to answer? In that case, how do we limit the embeddings data storage, and how do we send only the user query plus the relevant embeddings to lower the token usage?
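
One way to picture the last step of that flow: the embeddings themselves are only used for searching, and what gets sent to the chat model is the text of the best-matching chunks, packed until a token budget runs out. The sketch below assumes a rough estimate of about 4 characters per token (a real tokenizer would give exact counts). The names `estimate_tokens`, `chunk_text`, and `build_prompt` are hypothetical.

```python
def estimate_tokens(text):
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def chunk_text(text, max_chars=800):
    """Split text on word boundaries into chunks of at most max_chars.

    A single word longer than max_chars becomes its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for word in text.split():
        if current and size + 1 + len(word) > max_chars:
            chunks.append(" ".join(current))
            current, size = [], 0
        size += len(word) + (1 if current else 0)
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


def build_prompt(question, ranked_chunks, budget_tokens=1500):
    """Pack best-first (url, text) chunks until the token budget is spent.

    Stops at the first chunk that would exceed the budget.
    """
    header = "Answer using only the context below and cite the source URLs.\n"
    used = estimate_tokens(header) + estimate_tokens(question)
    parts = []
    for url, text in ranked_chunks:
        block = f"Source: {url}\n{text}\n"
        cost = estimate_tokens(block)
        if used + cost > budget_tokens:
            break
        parts.append(block)
        used += cost
    return header + "\n".join(parts) + f"\nQuestion: {question}"
```

Keeping the page URL next to each chunk is what lets the answer include hyperlinks back to the source pages. Storage stays manageable because each chunk only needs its URL, its text, and one vector.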

If they’re your pages, why are you scraping them?

You should have the raw text already.


I don't have the raw text. Those are my team's pages, such as the common production issues page, the architecture page for our AWS flow, etc., and many more.
We want to make a chatbot that can answer any user query, such as "I got this error in prod today, what's the possible resolution?"
Ideally, for the above, it should automatically refer to the prod support common issues page and get the answer based on existing data.
The above is a very small use case; queries can be wider as well and span multiple pages.