How to search/answer with formatted documents on large knowledgebases

I just discovered OpenAI last week. My head is buzzing with ideas, and I am still reading up on few-shot ML applications. I did think of a use case I would like to implement, though, and was hoping to hear about your experiences with this type of application.

I would like to build a search/answer chatbot that helps with navigating knowledge bases that are dynamic in nature (they may change at any time, and frequently), may well be longer than 2000 tokens, and may have more than 200 pages. I want to find a fast and cost-effective way to respond to search requests and questions about the content of the knowledge base.

My plan is to first classify the question as a knowledge request, a search request, a support request, unrelated to the knowledge base, or inappropriate. In the same step I could also classify it by type of content. If I label the knowledge base documents beforehand with the same scheme (reclassifying each document whenever it is edited), I could filter by label before submitting the search, which would narrow the initial search. I have two questions about this. First, how do I do this generically for a wide array of concepts (my knowledge bases could be about pretty much anything)? Second, how do I keep generic labels that are not specific to the knowledge base from being used for the search and prompt classification? Does anybody have experience with such an approach?

My next concern is that my pages are stored in a custom JSON format representing text nodes but also tables, code snippets, lists, links to other documents, etc.
Pages containing more than 2000 tokens I would split by heading (then by next-level heading, and so on down to paragraphs) until I have logically coherent snippets of fewer than 2000 tokens.
What I am unsure about, though, is whether I should transform the text to HTML, Markdown, or plain text for my search, how I would engineer the prompt to handle this best, and whether I should use Codex or a normal engine.
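
To make the splitting idea concrete, here is a minimal sketch. It assumes the pages have already been converted to Markdown-style text with `#` headings (the custom JSON format is not shown here, so that conversion is left out), and it uses a whitespace word count as a rough stand-in for a real tokenizer. The names `count_tokens`, `pack_paragraphs`, and `split_by_heading` are made up for illustration.

```python
import re

MAX_TOKENS = 2000  # chunk limit mentioned in the post


def count_tokens(text):
    # Assumption: whitespace-separated words approximate the token count.
    # A real tokenizer would give different (usually higher) numbers.
    return len(text.split())


def pack_paragraphs(text, max_tokens):
    """Greedily pack paragraphs into chunks of at most max_tokens.

    A single paragraph that is larger than max_tokens is kept whole.
    """
    chunks, current = [], ""
    for para in re.split(r"\n\s*\n", text):
        if not para.strip():
            continue
        candidate = current + "\n\n" + para if current else para
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def split_by_heading(text, level=1, max_tokens=MAX_TOKENS, max_level=6):
    """Split at headings of increasing depth until every chunk fits."""
    if count_tokens(text) <= max_tokens:
        return [text] if text.strip() else []
    if level > max_level:
        # No heading levels left: fall back to paragraph boundaries.
        return pack_paragraphs(text, max_tokens)
    # Zero-width split before each heading of this level, so the heading
    # stays attached to the text that follows it.
    pattern = r"(?m)^(?=#{%d} )" % level
    chunks = []
    for part in re.split(pattern, text):
        if part.strip():
            chunks.extend(split_by_heading(part, level + 1, max_tokens, max_level))
    return chunks
```

Calling `split_by_heading(page_text)` returns a list of strings, each of which could then be embedded or searched separately. Keeping the heading inside each chunk preserves some context about where the snippet came from.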

Thanks in advance for the help!

Hi @michael.ilewicz

Welcome to the OpenAI Community.

When it comes to answering questions from large knowledge bases, the only solution that comes to my mind is Embeddings + Completions.

Hi @sps thanks for the quick response. I read up on embeddings and would like to make sure I understand how it would be applied to my problem.
Each document I have is converted into an embedding using a doc-type engine, and I store it for later use. Each time the document is updated, I recalculate its embedding. When I have a search query, I create an embedding of it and calculate the distance between my query embedding and each of my document embeddings. (Any suggestions for an efficient algorithm? I read something about cosine similarity, but is there a faster way to find the best match for one vector in a set of vectors?) Finally, I find the pages closest to my query in vector space and use them as context for my completion (I might have several pages similar to each other). Do I understand your approach correctly?
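
For the one-vector-against-many comparison described above, a common approach is to normalize all vectors to unit length so that cosine similarity reduces to a single matrix-vector product. The following is a minimal sketch with NumPy, not a tuned implementation; `normalize` and `top_k` are illustrative names, and the small 2-D vectors in the usage example stand in for real embeddings.

```python
import numpy as np


def normalize(vectors):
    # Scale each vector to unit length; the dot product of two unit
    # vectors equals their cosine similarity.
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def top_k(query, doc_matrix, k=3):
    """Return (indices, scores) of the k documents most similar to query.

    doc_matrix holds one document embedding per row.
    """
    scores = normalize(doc_matrix) @ normalize(query)
    k = min(k, len(scores))
    # argpartition selects the k best without sorting every score;
    # only those k are sorted afterwards.
    best = np.argpartition(-scores, k - 1)[:k]
    best = best[np.argsort(-scores[best])]
    return best, scores[best]


# Toy usage: four "documents" and one query in two dimensions.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
indices, similarities = top_k([1.0, 0.1], docs, k=2)
```

In practice the document matrix would be normalized once and stored, so each query costs one matrix-vector product. For a few hundred pages this brute-force scan is usually fast enough; approximate nearest-neighbor indexes only start to pay off at much larger scales.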

If so, wouldn’t it be better to use the question answering endpoint for the final evaluation? Also, in your experience, would I be better off splitting my documents into smaller chunks (for example, based on first-level headings) and calculating an embedding for each with one of the lower-dimensional engines, or should I break documents down only as necessary and use a higher-dimensional embedding?

Also, I’ve been thinking about how to decide when I need to recalculate the embedding for a document. Oftentimes a document change is only a minor typo fix that does not affect the meaning of the document at all, and I want to minimize the cost of computation.

I thought about creating an embedding of the changed strings and comparing it against the document embedding. If text was added that has little similarity to the document, the meaning might have changed, which could justify a recalculation. Text that was removed and has high similarity might change the meaning as well. But what if only a few characters change?

Can I encode the change into tokens and justify that low number tokens (for example only tokens smaller than 100) should be considered in the comparable embedding? Does this approach even make sense or has anybody tried something like this before?
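
One way to experiment with the idea above is sketched below. This is only an illustration of the heuristic as described, not a tested method: the thresholds are arbitrary placeholders, the diff is word-based rather than token-based, and `embed` is a hypothetical callable that you would supply (any function that maps a string to a list of floats).

```python
import difflib
import math


def changed_text(old, new):
    """Return (added, removed) words between two versions of a document."""
    old_words, new_words = old.split(), new.split()
    added, removed = [], []
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old_words[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_words[j1:j2])
    return " ".join(added), " ".join(removed)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def needs_reembedding(old, new, doc_vector, embed,
                      min_changed_words=3, added_below=0.5, removed_above=0.5):
    """Guess whether an edit changed the meaning enough to re-embed.

    All three thresholds are placeholder assumptions to be tuned on real data.
    """
    added, removed = changed_text(old, new)
    if len(added.split()) + len(removed.split()) < min_changed_words:
        return False  # very small edit, treated like a typo fix
    if added and cosine(embed(added), doc_vector) < added_below:
        return True  # new text looks unlike the rest of the document
    if removed and cosine(embed(removed), doc_vector) > removed_above:
        return True  # removed text was close to the document's meaning
    return False
```

Note that this check still calls `embed` on the changed text, so it only saves money when the changed text is much shorter than the whole document. A cheaper first filter is the word-count threshold alone, which skips the embedding call entirely for tiny edits.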

Hi, for an implementation of precisely what you are asking, you can sign up at BookMapp, description here

Thanks @vaibhav.garg, looks like you have a great startup there! You should sell to universities and ebook companies. Unfortunately, the system I want to build my project for will not be able to integrate with your startup, so I need to build my own solution. I was digging into knowledge graphs last night, and I think I’ve started to understand the principle and that it would be the best solution for me. I understand how I can extract relations from text with GPT3, but I am stuck on how to query my knowledge graph with GPT3. I don’t mean to become a competitor to you, but I need to apply something like this to a very specific problem of mine. If you don’t mind me asking:

  1. Are you extracting relations paragraph by paragraph, or do you generate prompts with a greater context?
  2. Do you store your graph in a specialized database and use prompt tuning to translate the request into SPARQL, or do you use adaptive prompt tuning?
  3. If you do the latter, how do you go about selecting the right prompts?
  4. How do you handle named entity recognition and question classification?
  5. How do you track your references back to the original paragraph of your book, especially if a concept is touched on in multiple places?
  6. Do you remove redundancy or use it for TF-IDF ranking with a reference to each appearance?

I know that’s a bunch of questions; I do generate a lot of them any given day. I would love to engage in a discussion, maybe I’ll have some ideas that could help your startup get off the ground as well? If you don’t want to respond here, you can also send me an email to

Thanks for your interest. I have responded over E-mail. Let’s take it forward over there.

Thanks for taking the time to read through the guide.

Yes, AFAIK you would have to recalculate the embedding.
Cosine similarity is definitely a good approach, as it has low computational complexity.


Tagging @lmccallum - you worked on something similar. Your input on this would be invaluable.
