Using OpenAI API and RAG to extract specific information from scraped website text

andrew.terhorst · April 23, 2024, 11:02pm

I want to extract specific information from scraped website text using a knowledge base of academic articles to improve what GPT4 extracts. Most RAG tutorials describe querying the knowledge base with a ChatBot. I want to query the scraped website text using API calls. The knowledge base is just a way to improve responses. I do not want to use a ChatBot. Can someone steer me to a tutorial that does something like this? Or explain how I go about this?

Creating a vector store from chunked PDFs is straightforward - I do not need help there. I need help with the API call - how do I access the vector store to improve responses?

Thanks.

vasyl · April 24, 2024, 5:32am

Hi, @andrew.terhorst ,

answering your question - your vector store has to index vector embeddings, means, after scraping a website you first chunk the data and send it to the embedding model, then index it in a vector store, then you compare your users queries to the chunks (documents) you have in your vector storage with cosine_similarity, finding this way the most relevant chunks, and send the top candidates to the API as a context together with the user query and a system prompt (kind of 'give the context below, answer the user’s question {question} , context: {context}.

But what I believe could be even a better for you is to send the whole web-site context as a meta-prompt (or system prompt what ever you prefer more) right to the GPT-4 API- each time. Literally, make the web site content an addition to all of your queries, every single time. Using the GPT-4 Turbo 128K is affordable (I don’t think you’d have thousands of requests per day) and with the 128K token context window you can put there a crazy amount of content. So unless your web-site is significantly more than 100k tokens and you have thousands of requests per day, I’d actually suggest you to avoid RAG and Vector stores and use this straight forward approach.

andrew.terhorst · April 26, 2024, 5:13am

Thanks for responding. I have been doing that with reasonable success. Perhaps I was overthinking RAG.

vasyl · April 26, 2024, 5:16am

Happy it was helpful! You’re very welcome.)

Topic		Replies	Views
RAG or Fine tuning for a domain specific QA chatbot API rag , development , chatbot , assistants-api	4	1371	July 3, 2024
Problem with doing RAG with 300k pages of PDFs Community gpt-4 , gpt-35-turbo , api	8	4767	March 7, 2024
How to Add Knowledge Base in API API api	12	18928	December 15, 2023
Easy RAG implementation for testing? API api , gpt4 , rag	5	9359	March 16, 2024
Seeking Advice on Building a RAG Chatbot Plugins / Actions builders chatgpt , rag	1	223	September 25, 2024

Using OpenAI API and RAG to extract specific information from scraped website text

Related topics