I want to be able to query the tax code of my home country (India), and the entire tax code runs to roughly 800k tokens.
The approach I have seen recommended is to store embeddings of the tax code's provisions in a vector database like Chroma, search the user query against it first, and then send only the tokens of the matching provisions to the ChatGPT API.
In practice, this approach doesn't seem to work well at all, because the first step (the embedding vector search against the Chroma DB to shortlist sections) performs poorly, especially when the query is complicated and contains a lot of narrative English text (Alice invested x, Bob invested y, and so on).
Is this approach right in the first place? And is there any way ChatGPT could "process" the entire tax code via the API rather than going through a vector search first?
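For context, the retrieve-then-read loop described above can be sketched with a toy in-memory store. This is a minimal sketch, not the actual app: the real pipeline would use Chroma and the OpenAI embeddings endpoint, and `retrieve` and `build_prompt` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, k=3):
    """Rank stored (section_id, embedding, text) records by similarity."""
    ranked = sorted(store, key=lambda rec: cosine(query_vec, rec[1]), reverse=True)
    return ranked[:k]

def build_prompt(question, sections):
    """Pack only the shortlisted provisions into the chat prompt."""
    context = "\n\n".join(f"[Section {sid}]\n{text}" for sid, _, text in sections)
    return f"Answer using only these provisions:\n{context}\n\nQuestion: {question}"
```

The failure mode described above lives entirely in `retrieve`: if the query's embedding lands closer to the wrong sections, nothing downstream can recover the missing provisions.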
I'm not familiar with the Indian tax code, but I see two issues:
Law is often a graph, not a linear document. A one-off search will likely not yield the information necessary to solve a case; lookup will likely need to be a recursive process.
If you have more complex cases, you will likely need to split them up and potentially solve them iteratively. A search may or may not yield a result, so you need to track what worked, what didn't, what you have solved, and what still needs solving.
I'd suggest approaching it introspectively: how would a human solve this, how would a human think about this? If you can translate that into an abstract process, you can turn it into a program.
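One way to make that tracking concrete (a sketch under my own assumptions, not a full implementation): keep an explicit worklist of open sub-questions, record what was solved and what failed, and let each lookup enqueue any new sub-questions it surfaces. The `lookup` callable here is a hypothetical stand-in for a vector search plus an LLM call.

```python
def solve_case(subquestions, lookup, max_steps=20):
    """Iteratively resolve sub-questions; a lookup may surface new ones.
    lookup(q) returns (answer, new_subquestions) or None on failure."""
    pending = list(subquestions)
    solved, failed = {}, []
    steps = 0
    while pending and steps < max_steps:
        steps += 1
        q = pending.pop(0)
        result = lookup(q)
        if result is None:
            failed.append(q)
            continue
        answer, new_qs = result
        solved[q] = answer
        # Enqueue newly surfaced sub-questions, skipping duplicates.
        pending.extend(nq for nq in new_qs
                       if nq not in solved and nq not in pending)
    return solved, failed, pending  # non-empty pending => still unsolved
```

The point is the bookkeeping: a one-shot search has no notion of "this answer raised two more questions", whereas the worklist makes the recursion explicit and bounded.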
I’m sorry this may not be as simple as you’d hoped, but it’s certainly an interesting project if you decide to pursue it further!
Hi, my application uses GPT-4 and ada embeddings to answer users' questions about U.S. and Canadian securities law - a pretty similar use case to yours. I know nothing about the Indian tax code, but I presume it's divided hierarchically into divisions, chapters, parts, sections, subsections and the like. Have you already parsed the tax code according to these built-in semantic chunks? For each chunk, include metadata showing where the chunk sits in the hierarchy. Here is one hypothetical chunk to be embedded:
--Indian Tax Code
---Chapter 4 [add title, if any]
----Division B [add title]
-----Part 27 [add title]
------Section 2.4 [add section title]
-------[add text of Section 2.4]
Including the metadata means the section makes semantic sense standing on its own. That's the appropriate way to parse legal text (and many other kinds of text too, I think) before obtaining embeddings. For law especially, simply parsing text on a per-page basis, or at fixed character or word intervals, won't work well. If you've already tried this method and it hasn't worked, let me know and perhaps I can brainstorm further with you.
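A minimal sketch of that chunk builder, mirroring the dash-depth convention in the example above (`make_chunk` is a hypothetical helper of mine, not from any library):

```python
def make_chunk(hierarchy, text):
    """Prefix a statute chunk with its full position in the hierarchy so the
    chunk makes semantic sense on its own before it is embedded."""
    header = "\n".join("-" * (2 + depth) + label
                       for depth, label in enumerate(hierarchy))
    return header + "\n" + text

chunk = make_chunk(
    ["Indian Tax Code", "Chapter 4", "Division B", "Part 27", "Section 2.4"],
    "Text of Section 2.4 goes here.",
)
```

The resulting string is what gets sent to the embeddings endpoint, one call per section.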
Great to hear from you! Yes, the Indian tax code is indeed structured quite similarly; at a basic level it's Chapter → Section. Each section in turn has a number of sub-sections (for example, "deduction under 80HHC(1)" means Chapter VI → Section 80HHC → Sub-section (1)).
Yes, I believe we have parsed the tax code along similar lines. Each chunk carries metadata (the titles and numerals of the chapter and section) followed by the actual content of the section. We store these embeddings per section in Chroma and search against them when a query comes in. The results of this embedding search are then passed to the OpenAI API along with the query.
In short, the results from this embedding search are frankly quite sub-optimal: we don't get the top sections we should for a query. "Explain section 80" sometimes returns Section 80A first, i.e., the distance metric shows a smaller distance for 80A.
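One cheap mitigation for that particular failure mode (a sketch, assuming section labels follow the `80` / `80A` / `80HHC` pattern): detect explicit section references in the query with a regex and match them exactly against the index, falling back to the vector search only when the query names no section.

```python
import re

# Assumed label pattern: digits optionally followed by capital letters.
SECTION_RE = re.compile(r"section\s+(\d+[A-Z]*)", re.IGNORECASE)

def exact_section_hits(query, known_sections):
    """Return exact matches for sections named in the query, so that
    'Explain section 80' can never lose to 80A on vector distance."""
    refs = set(SECTION_RE.findall(query))
    return [s for s in known_sections if s in refs]
```

Embeddings are poor at distinguishing near-identical identifiers like "80" and "80A"; a literal match sidesteps the distance metric entirely for queries that name a section.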
Interesting. In my application, I noticed a significant degradation in answer quality after switching from davinci to ada embeddings (the former being deprecated). It took quite a bit of trial and error amending my instructions to get quality back to where it had been. I tried to advocate for OpenAI not to deprecate the davinci embeddings, but I never got a response; it was a huge disappointment, to be honest.

Anyway, back to your application. Even if the similarity scores are a bit off, the key is to collect a large enough number of embeddings for the prompt, so that the prompt includes all the information relevant to answering the question. Once the prompt is formulated, the scores don't matter anymore. Perhaps try increasing the size of your prompt token-wise, to fit more embeddings and more detailed instructions; that worked reasonably well for me. Which model are you using for the question answering: GPT-4 8K, 32K or 128K? I can't prove it, but I sense that GPT-4 Turbo does not perform as well as GPT-4 at precision question answering, so I settled on GPT-4 32K.
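The "fit more embeddings into the prompt" idea amounts to a token-budgeted packing step. A rough sketch, where the word-count tokenizer is a deliberate placeholder (a real app would count tokens with something like tiktoken):

```python
def pack_context(ranked_chunks, token_budget,
                 count_tokens=lambda s: len(s.split())):
    """Greedily add retrieved chunks in rank order until the budget is spent.
    Chunks that don't fit are skipped rather than truncated."""
    picked, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            continue
        picked.append(chunk)
        used += cost
    return picked
```

With a larger context window (32K instead of 8K) the budget simply grows, so more lower-ranked sections survive the cut and mis-ranking matters less.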
It is interesting to know about the davinci-to-ada degradation.
Agreed about retrieving enough embeddings. But to keep costs down while maintaining accuracy, we run a loop over the OpenAI API: first query the Chroma DB for the top 20 embeddings (each embedding corresponds to one section of the code), look up the "key terms" per section (pre-generated and stored using the API), feed these key terms with section IDs back to the OpenAI API to narrow the list of sections, and finally send the full content of those sections with the query to GPT-4 8K. The 32K model is a tad costly in Indian rupees for repeated use (due to the $→Rs conversion).
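The key-term narrowing step described above can be sketched as two small helpers: one that formats the top-20 candidates into a compact prompt for the cheap narrowing call, and one that parses the model's reply back into valid section IDs. Both helpers are hypothetical, and the actual Chroma and OpenAI calls are omitted.

```python
def keyterm_digest(candidates):
    """Format (section_id, key_terms) pairs into one compact prompt body
    for the cheap narrowing call."""
    return "\n".join(f"{sid}: {', '.join(terms)}" for sid, terms in candidates)

def parse_section_ids(reply, valid_ids):
    """Keep only section IDs the model echoed back that exist in the index,
    guarding against hallucinated IDs."""
    mentioned = {tok.strip(" ,.;") for tok in reply.split()}
    return [sid for sid in valid_ids if sid in mentioned]
```

Validating the reply against `valid_ids` matters because the final expensive GPT-4 call pays per token of full section text; a hallucinated ID would waste that budget.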
Do you know if anyone has tried creating their own model, trained on the tax code and tax questions?
Hi, I don't know of anyone who has tried this with the Indian tax code, but there is one company in Canada, Blue J Legal, that is tackling Canadian tax law. I suspect if you google something like "start-up tax law generative ai" you'll find some U.S. companies too. When davinci embeddings were deprecated, I could no longer get high-quality, reliable answers with the 8K model, I think because performance at identifying the best embeddings and ranking them highest dropped. So I started using the 32K model and made some changes to my prompt instructions.