Creating a long-document QA system

I have an idea to build a search tool for long environmental regulations. I have combed through the documentation, and although it seems possible, I'm not sure which approach would achieve the goal.

The regulations are .doc files that I would clean and convert to JSONL. Although there are hundreds of pages of text, the text file itself is under 2 MB.
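For concreteness, here's a rough sketch of the chunking/JSONL step I have in mind. The 300-word chunk size and the `{"text": ...}` schema are just placeholders I'd tune to whatever the upload endpoint actually expects:

```python
import json

def chunk_text(text, max_words=300):
    # Split the cleaned regulation text into fixed-size word chunks.
    # max_words is a placeholder; tune it so each chunk stays well
    # under the API's token limit.
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def write_jsonl(chunks, path):
    # One JSON object per line, which is the JSONL shape the
    # file-upload endpoints generally expect.
    with open(path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps({"text": chunk}) + "\n")
```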

I am wondering what the best method is to have it answer natural-language questions about that document, or possibly return the most relevant text.

For instance, one section of the regulations deals with the seasons when farmers are allowed to spread fertilizer on a field.

Given a prompt: “When is the latest date a farmer in the northern region can spread nitrogen on a field?”
I would like a response: “According to regulation [number], the last date to spread nitrogen is [xyz]”
or possibly the relevant regulation itself.

After looking through the docs, it seems like the Answers API or semantic search would be close to what I need, since both let me upload documents. The problem is the 2048-token limit: the document is far longer than that, even though the file size is small. I also looked at fine-tuning, but that seems geared toward shaping the prompt/response behavior rather than grounding the model's answers in specific source material.
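One workaround I'm considering for the token limit: split the document into chunks, pick the most relevant chunk locally, and only send that chunk (plus the question) to the API. A toy version using naive word overlap as the relevance score — a real version would use embeddings, but this shows the retrieve-then-read shape:

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase alphanumeric tokens; good enough for a toy scorer.
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, chunk):
    # Count how often each query word appears in the chunk.
    # Embedding similarity would replace this in a real system.
    query_words = set(tokenize(query))
    chunk_counts = Counter(tokenize(chunk))
    return sum(chunk_counts[w] for w in query_words)

def top_chunk(query, chunks):
    # Return the chunk with the highest overlap score; this is
    # the only text that would be sent along with the question.
    return max(chunks, key=lambda ch: score(query, ch))
```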

Does anyone know what the right approach would be in a scenario like this? The data processing and Python I have no issue with; I'm just trying to figure out which route to take here.


I’ve built something similar for legal rules, @jvanegmond93. Do you want to connect? I’m at lmccallum@lexata.ca. I just recently applied to go live with my app.
