I want to extract specific information from scraped website text, using a knowledge base of academic articles to improve what GPT-4 extracts. Most RAG tutorials describe querying the knowledge base with a chatbot. I want to query the scraped website text using API calls; the knowledge base is just a way to improve responses. I do not want to use a chatbot. Can someone steer me to a tutorial that does something like this? Or explain how I go about this?
Creating a vector store from chunked PDFs is straightforward - I do not need help there. I need help with the API call - how do I access the vector store to improve responses?
To answer your question: your vector store has to index vector embeddings. That means after scraping a website you first chunk the data and send the chunks to the embedding model, then index the resulting vectors in a vector store. At query time you compare the user's query to the chunks (documents) in your vector store using cosine similarity to find the most relevant ones, then send the top candidates to the API as context together with the user query and a system prompt along the lines of 'Given the context below, answer the user's question: {question}. Context: {context}'. A rough sketch of that flow is below.
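Here is a minimal sketch of that retrieval loop using the openai Python client and numpy. The model names ("text-embedding-3-small", "gpt-4-turbo"), the file path, and the naive fixed-size chunking are assumptions for illustration, not your actual setup:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed a list of strings and return an (n, dim) array of vectors."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1) Chunk the scraped website text (naive fixed-size chunks for illustration).
website_text = open("scraped_site.txt").read()  # hypothetical path
chunks = [website_text[i:i + 1000] for i in range(0, len(website_text), 1000)]

# 2) Embed the chunks once and keep the vectors (this is your "vector store").
chunk_vectors = embed(chunks)

# 3) Embed the query and rank chunks by cosine similarity.
query = "What specific fields should be extracted from this page?"  # example query
q_vec = embed([query])[0]
sims = chunk_vectors @ q_vec / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q_vec)
)
top_chunks = [chunks[i] for i in np.argsort(sims)[::-1][:3]]

# 4) Send the top chunks as context together with the user query.
context = "\n\n---\n\n".join(top_chunks)
completion = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system",
         "content": "Given the context below, answer the user's question.\n\n"
                    f"Context:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(completion.choices[0].message.content)
```

In a real pipeline you would persist the embeddings in a proper vector store (Chroma, FAISS, pgvector, etc.) instead of recomputing them per run, but the query-time logic is the same.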
But what I believe could be even better for you is to send the whole website content as a meta-prompt (or system prompt, whichever term you prefer) straight to the GPT-4 API each time. Literally make the website content an addition to every single one of your queries. GPT-4 Turbo with its 128K-token context window is affordable (I don't think you'd have thousands of requests per day), and with 128K tokens you can fit a huge amount of content. So unless your website is significantly more than 100K tokens or you really do have thousands of requests per day, I'd actually suggest you skip RAG and vector stores and use this straightforward approach, sketched below.
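A minimal sketch of that "no RAG" variant, again with the openai Python client; the model name, file path, and example question are assumptions:

```python
from openai import OpenAI

client = OpenAI()
# The whole scraped site goes into every request; it must fit in the 128K context window.
website_text = open("scraped_site.txt").read()  # hypothetical path

def extract(question: str) -> str:
    """Ask GPT-4 Turbo a question with the full website text prepended as context."""
    completion = client.chat.completions.create(
        model="gpt-4-turbo",  # 128K-context model
        messages=[
            {"role": "system",
             "content": "Use the website content below to answer the user's question.\n\n"
                        + website_text},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content

print(extract("List the authors and publication dates mentioned on the site."))
```

The trade-off is cost per request versus pipeline complexity: you pay for the full context on every call, but you avoid chunking, embeddings, and retrieval entirely.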