Creating a long-document QA system

I have an idea to build a search tool for long environmental regulations. I have combed through the documentation, and although it seems possible, I'm not sure which approach would achieve the goal.

The regulations are .doc files that I would clean and convert to JSONL. Although there are hundreds of pages of text, the text file itself is under 2 MB.

I am wondering what the best method is to have it answer natural-language questions about that document, or possibly return the most relevant text.

For instance, one section of regulations deals with season when farmers are allowed to spread fertilizers on a field.

Given a prompt: “When is the latest date a farmer in northern region can spread nitrogen on a field?”
I would like a response: “According to regulation [number], the last date to spread nitrogen is [xyz]”
or possibly the relevant regulation itself.

After looking through the docs, it seems like the Answers API or semantic search would be close to what I need, since I can upload documents. The problem is that with the token limit being 2048, the document is far longer than the limit, even though the file size is small. I also looked at fine-tuning, but that seems to be more about shaping the prompt/response behavior and less about controlling what the model bases its answers on.
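One common way around the token limit is to split the document into passages before indexing, so each passage fits comfortably in a prompt. A minimal sketch, using whitespace word count as a rough stand-in for tokens (a real tokenizer would be more accurate):

```python
def chunk_text(text, max_tokens=500, overlap=50):
    """Split text into overlapping chunks, approximating tokens by words."""
    words = text.split()
    chunks = []
    step = max_tokens - overlap  # overlap keeps context across chunk boundaries
    for start in range(0, len(words), step):
        chunk = words[start:start + max_tokens]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + max_tokens >= len(words):
            break
    return chunks
```

Each chunk can then be embedded or indexed separately, and only the most relevant chunks go into the final prompt.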

Does anyone know what the right approach would be in a scenario like this? The data processing and Python I have no issue with; I'm just trying to figure out which route to go here.


I’ve built something for legal rules @jvanegmond93. Do you want to connect? I just recently applied to go live with my app.


Document-based QA

  1. Divide your document into passages.
  2. Index the passages with a search engine (e.g. Elasticsearch).
  3. Search for candidate passages that contain the answer using keywords (you can use a query-expansion method).
  4. Run MRC (machine reading comprehension) over the candidate passages and your question; MRC returns an answer and a score.
  5. Display the answers, or build a narrative by feeding the question and answers into ChatGPT.
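The retrieve-then-read shape of the steps above can be sketched in-memory; here Elasticsearch is swapped for a toy keyword-overlap scorer just to show the flow (the `score` and `top_passages` names are illustrative, not from any library):

```python
def score(passage, query):
    """Toy relevance score: number of query terms appearing in the passage."""
    passage_terms = set(passage.lower().split())
    query_terms = set(query.lower().split())
    return len(passage_terms & query_terms)

def top_passages(passages, query, k=3):
    """Return the k passages with the highest keyword overlap (step 3)."""
    ranked = sorted(passages, key=lambda p: score(p, query), reverse=True)
    return ranked[:k]
```

The returned passages, plus the original question, would then go to the MRC model or a completion prompt (step 4).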

The document-based QA approach mentioned above is right. We are using ES to index graph-based content structures (nodes, associations) as well as binary data (PDFs, Word documents, etc.). It’s branch-based (similar to Git), and we use Elasticsearch to do what smin mentioned.

ES also supports vector embeddings and some early NLP features. In our case, we use NLP to pre-process the questions and distill the essence of what is being asked (without losing semantic meaning or details). We perform some manual term expansion and also leverage ES. This lets us perform QA and prompted conversation not only with individual documents but with all content in a branch (effectively front-ending a conversational search).

I’d also add that ES supports highlighting, which you can use to generate citations. We’re using it so that answers in the UI link to the pages and paragraphs of content that ChatGPT used to reason out its answer. Users can then quickly verify and see why the given answer was provided.
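As a rough sketch of the highlighting idea: this builds an Elasticsearch request body that matches the question against a passage field and asks ES for highlighted fragments you can surface as citations. The field name `body` and the fragment settings here are assumptions for illustration, not anyone's actual configuration:

```python
def build_highlight_query(question, field="body"):
    """Build an Elasticsearch query DSL body: match the question against the
    passage field and request highlighted snippets for citation display."""
    return {
        "query": {"match": {field: question}},
        "highlight": {
            "fields": {field: {"fragment_size": 150, "number_of_fragments": 3}},
            "pre_tags": ["<mark>"],
            "post_tags": ["</mark>"],
        },
    }
```

The `highlight` section of each hit then contains the matched snippets, which the UI can render as "why this answer" links.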

It’s pretty cool. It’s not specific to legal rules and can be used for any kind of modeled content (a content-management engine on top of a graph structure, a GraphQL API, and so forth). If you have any questions, I’d be happy to talk about what has worked well and what hasn’t.

I’d also suggest looking at tools like AWS Lex or other intent engines (for the front-end processing), since you can use that same chat/bot to do different kinds of things, such as handle imperatives/commands, automate user tasks, and more. Also, I think the token limit is 4k now, which gives you a bit more room for prompting.


Hi, I took great interest in your post and it has (I believe) expanded my toolset quite a bit.

I’d like to know that I’m on the right path. I’m currently uploading my objects via the JSON API. Am I correct that no nesting is allowed? For example, I couldn’t do {"chapter": {"the_rise": data, "the_fall": data2}}. I did notice that having an array of items allows for filtering, e.g. {"colors": ["blue", "green"]}.
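If nesting does turn out to be disallowed, one workaround is to flatten nested objects into single-level keys before upload, keeping arrays intact for filtering. A hypothetical helper, not part of any API:

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into single-level underscore-joined keys;
    lists are left as-is so they remain usable for filtering."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))  # recurse into nested objects
        else:
            flat[name] = value
    return flat
```

For example, a nested "chapter" object becomes chapter_the_rise / chapter_the_fall keys, which a flat-metadata API can accept.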

Regarding vector embeddings: it seems that it’s all done through our own trained model, and not something as simple as uploading the embeddings to be used for immediate comparisons?
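For what it's worth, if you can precompute embeddings for your passages, comparing them against a query embedding is just cosine similarity. A dependency-free sketch (how you obtain the vectors is up to whichever embedding model you use):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec, passage_vecs):
    """Index of the precomputed passage embedding closest to the query."""
    return max(range(len(passage_vecs)),
               key=lambda i: cosine(query_vec, passage_vecs[i]))
```

So storing embeddings up front and comparing at query time is straightforward, whether or not the hosted tooling exposes it directly.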

Lastly, I’ve only used the query tester, but it says there’s a limit of 128 characters. That seems very restrictive to me?