Problems with long contexts - a GPT that solves law cases

I am trying to build a GPT that solves law cases. I give it proper instructions, all the relevant laws it needs to answer the question, and the question itself (the total content is under 20k tokens). With the 4o API it doesn’t even read the laws completely: if there are 10 lines about something related to part of a question, it reads the first few lines and ignores the rest, assuming nothing else is needed to answer that part, and so it gets wrong answers. Is this because models aren’t good with large amounts of tokens? The weird thing is, how can it have issues at 20k tokens if the limit is 128k? I also tried o1-preview on the ChatGPT website; it went through all the relevant material but still made some small mistakes because it didn’t follow some parts of the instructions. Ideally I want 4o to work perfectly for this. Any solutions, tips and insights will be HIGHLY APPRECIATED. Thanks.

2 Likes

Model attention falls off and hallucinations begin to rise long before you approach max context.

Have you compared it to other models like GPT-4 and GPT-4 Turbo for your use case?

I note there is a relevant discussion here:

2 Likes

First off, I am not an attorney, only a programmer, so take any legal-related advice with a grain of salt.

  1. Getting an LLM to give only factual and correct answers, every time, is a holy-grail objective. As someone who has been doing AI for a few decades and used other AI such as Prolog before LLMs, I can say every AI technology has its pros and cons; none is all pros. Trying to solve this on your own with just LLMs may not be possible with current LLM technology, but it might be possible with agents that can check the reasoning, cases and other details.
  2. Consider using the AI from LexisNexis: Lexis+ AI

Your question is an age-old one from attorneys seeking to use AI’s reasoning abilities to augment their work. I have seen these questions posed over many years for many AI technologies, and the answer is still that there is no perfect AI assistant, though AI assistants are becoming more useful in selected areas as time progresses.

In short, you may win the battle of resolving the context-length problem but lose the war by not achieving 100% accuracy.

HTH

3 Likes

Can fine-tuning a GPT model fix the accuracy issue completely?

No.

You can get certain concepts to be more accurate, but getting 100% accuracy for a field as large as legal is not possible with an LLM alone.


If we break down many of the tasks LLMs are used for, they follow this pipeline:

Natural language → translate natural language to problem statement → collect facts → apply rules → understand which rules with facts are valid → collect results → translate results to natural language

LLMs are great at

  1. translate natural language to problem statement
  2. collect facts - (i.e., using semantic search)
  3. apply rules - (think o1 model thinking; I would love to know how this really works for the o1 models, but it is not published).
  4. translate results to natural language

LLMs are terrible at

  1. understand which rules with facts are valid

While there are many ways to augment LLMs to improve on this process, and many other AIs that can do the same, here is the same breakdown for Prolog:

Prolog is great at

  1. collect facts - (The facts must be in the knowledge base; if they are not, they are unknown and will not be used.)
  2. apply rules - (The rules must be in the code; if not, the relation between known and generated facts will not be part of a result.)
  3. collect results - (Only if the result was discovered based on the known facts and rules in the Prolog knowledge base.) (See: Closed World)

Prolog is not so good at

  1. translate natural language to problem statement
  2. translate results to natural language

LLM agents are good at

  1. translate natural language to problem statement - (think subtask, e.g. openai-cookbook/examples/o1 at main · openai/openai-cookbook · GitHub)
  2. collect facts - (think of an agent highly customized to extract data from a very specific, well-formatted knowledge base, e.g. CSV, etc.)
  3. apply rules - (think using semantic triples or SQL joins, etc. These agents would be very highly specialized and most likely use a non-AI function to perform the rule to avoid hallucinations.)
  4. understand which rules with facts are valid - (These agents would be very highly specialized and most likely use a non-AI function to perform the rule to avoid hallucinations.)
  5. collect results - (Think of an agent that is customized to generate results of a specific format such as a table, LaTeX, outline, screenplay, etc.)
  6. translate results to natural language - (Think an agent that collects the formatted results and combines into a single result.)

LLM agents are not good at

  1. Being used for something they were not designed to do. (Agents are often highly customized and should not be thought of as a drop-in replacement for any task; they should focus on one thing and do it well.)

There are countless ways to augment LLMs, but this should give you some idea that LLMs are not a one-concept-solves-all-problems technology.
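
To make the pipeline and the agent breakdown above concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the prompts, the toy law_db, the naive keyword matching standing in for semantic search, and the example article. The point is that the validity check is plain code, not a model call.

```python
# Minimal sketch of an agent pipeline. Each "agent" is just a focused function;
# the citation check is ordinary code with no AI involved.
import re
from openai import OpenAI

client = OpenAI()

def ask(system: str, user: str) -> str:
    """One specialized agent = one narrowly scoped model call."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def to_problem_statement(case_text: str) -> str:
    # translate natural language -> problem statement
    return ask("Restate the legal question as a short, neutral problem statement.", case_text)

def collect_facts(problem: str, law_db: dict[str, str]) -> dict[str, str]:
    # collect facts: naive keyword overlap standing in for real semantic search
    words = set(problem.lower().split())
    return {ref: text for ref, text in law_db.items()
            if words & set(text.lower().split())}

def draft_answer(problem: str, facts: dict[str, str]) -> str:
    # apply rules: the drafting agent may only use the provided articles
    context = "\n".join(f"[{ref}] {text}" for ref, text in facts.items())
    return ask("Answer using ONLY the provided articles and cite each one as [reference].",
               f"{problem}\n\nArticles:\n{context}")

def validate_citations(answer: str, law_db: dict[str, str]) -> list[str]:
    # understand which rules are valid: a non-AI check that every cited
    # reference actually exists in the knowledge base
    cited = re.findall(r"\[([^\]]+)\]", answer)
    return [ref for ref in cited if ref not in law_db]

law_db = {"Art. 914 CC": "Whoever unlawfully and culpably causes damage to another must compensate it."}
problem = to_problem_statement("My neighbour's scaffolding fell on my parked car. Can I claim damages?")
facts = collect_facts(problem, law_db)
answer = draft_answer(problem, facts)
bad_refs = validate_citations(answer, law_db)
if bad_refs:
    answer = draft_answer(problem, facts)  # in a real system, loop with feedback about bad_refs
print(answer)
```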

4 Likes

What’s the best I can do (something that’s actually feasible)?

In the world of legal AI I would not attempt to enter that arena; LexisNexis is AFAIK the leader, and by some margin.

As for what would be feasible, that is hard to say; much of the low-hanging fruit has already been picked.

When the internet became public, many individuals had the idea that if they put up a site they would make money. Many companies wanted me to work for them because I was working with websites very early on, and I passed on some really good job offers because I could see those companies would fail within a year: they did not understand that without a good business model you cannot just create a website and have it magically become a money maker. They took an idea and assumed it would make them money simply because it involved the web; the same is happening today, only the word is now AI or LLM. Ever heard the expression AI winter? I have lived through a few of them now; this is not my first rodeo with AI.

1 Like

Thanks a lot for your responses.

I forgot to mention a few things. It’s going to be for Greek law specifically. Also, I already have a RAG setup where I am able to successfully retrieve the relevant laws; now I just need to perfect the question-answering part. And o1 is actually doing a pretty good job; the one issue (for now) is that it assumes things even though I have explicitly stated it should not.

You’re right that a very generic AI law tool is not feasible, but given this extra context, does anything come to mind that could potentially make it work? If I fine-tune with, say, a hundred examples of input/output, could it be at least 80-90% accurate? (I wanted to ask before trying, because preparing 100 examples is not going to be easy.) Another question that just came to mind: in the fine-tuning examples, do I need to provide the context (relevant laws) within the input, or would just the user question (law case) be enough?

1 Like

This is advice to help you win some battles. I still think you will lose the war, but at least this should keep you on what I consider the best next step.

Look very closely at the o1 examples noted earlier. They were created less than a month ago and touch on many of the points noted. However, the examples are high level; for your problem you would want many more LLM agents, with the fact-checking ones validating facts using code that does not rely on AI.


https://platform.openai.com/docs/guides/prompt-engineering

https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api


Also see:


A search for papers with legal and agent in the title:

https://arxiv.org/search/?query=agent+legal&searchtype=title&abstracts=hide&order=-announced_date_first&size=50



Side note:

While this is not a direct comparison or prediction, it is something to understand for possible quagmires to avoid.

How IBM’s Watson Went From the Future of Health Care to Sold Off for Parts

IBM Resurrects Watson to Ride the AI Hype Train


One of the more successful stories with LLMs is generating programming source code (often with bugs) and then feeding the compilation errors back until the code is correct. The programming language also has to be one the LLMs have had much accurate training on, e.g. JavaScript or Python, and not a language like Prolog for which there is relatively little training data. I note this because compilers provide the feedback on what is valid or not valid; this is often a critical step that is missing for many who apply LLMs and then fail.
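
That feedback loop is straightforward to sketch. The snippet below is only a hypothetical illustration (the prompt wording, fence handling and retry count are all arbitrary, and a real setup would run tests, not just a syntax check):

```python
# Sketch of the generate -> check -> feed the error back loop.
from openai import OpenAI

client = OpenAI()

def generate_with_feedback(task: str, max_rounds: int = 3) -> str:
    messages = [{"role": "user", "content": f"Write only Python code, no prose:\n{task}"}]
    code = ""
    for _ in range(max_rounds):
        resp = client.chat.completions.create(model="gpt-4o", messages=messages)
        code = resp.choices[0].message.content.strip()
        if code.startswith("```"):
            code = code.split("\n", 1)[1].rsplit("```", 1)[0]  # crude markdown-fence removal
        try:
            compile(code, "<generated>", "exec")  # the compiler provides the valid/invalid signal
            return code
        except SyntaxError as err:
            # feed the concrete compiler error back, like pasting a compiler message
            messages.append({"role": "assistant", "content": code})
            messages.append({"role": "user", "content": f"That does not compile: {err}. Fix it."})
    return code

print(generate_with_feedback("a function that returns the nth Fibonacci number"))
```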

1 Like

Using negation (no, not, etc.) in prompts is known to fail.

See:

LLMs Don’t Understand Negation

Instead of just saying what not to do, say what to do instead

Just knowing not to use negation in a prompt is often the difference between failure and success with basic prompts.
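
A hypothetical illustration of the rewrite (the wording is made up, not a tested prompt):

```python
# Instead of telling the model what not to do, state the positive behaviour you want.
negated  = "Do not assume facts that are not stated in the case."
positive = ("Use only facts explicitly stated in the case. "
            "If a fact you need is missing, write 'not stated in the case' rather than guessing.")
```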

However, with an o1 model, some of the regulars have seen a prompt result go from bad to good with the added directive “No bizarre thoughts.”

2 Likes

Thanks, I’ll check these things out. Some questions:

  1. One of the things you said I am not sure how to implement; I don’t see how to do such a thing programmatically without AI. Could you specify how this can be done: “understand which rules with facts are valid - (These agents would be very highly specialized and most likely use a non-AI function to perform the rule to avoid hallucinations.)”

If I were to build this as an agent system, what could the subtasks and flow potentially be?
I cannot think of how to divide the tasks in a way that would actually make a difference. The only agentic approach that comes to mind is getting a response, sending it to a reviewer to review and provide feedback, then giving it back along with that feedback, and repeating in a loop.

I read this in one of the links: “I would note that LLMs handle this task better if you slice the two documents into smaller sections and iterate section by section. They aren’t able to reason and have no memory so can’t structurally analyze two blobs of text beyond relatively small pieces. But incrementally walking through in much smaller pieces that are themselves semantically contained and related works very well.”
I am unable to understand how to do the section-by-section thing. Do you?

A general question: if I want to use a model like 4o, is there a way or a tip for improving/fixing the issue of it ignoring information in very long messages (15-20k tokens)? It could be something as simple as summarizing a text of 15k words.

Thanks again, you have been very helpful :slight_smile:

1 Like

Please see the research papers noted earlier

A search for papers with legal and agent in the title:

https://arxiv.org/search/?query=agent+legal&searchtype=title&abstracts=hide&order=-announced_date_first&size=50

If you start researching agents and such, you will find that there is no one specific way to do this. Agents are just function calls that use AI, and an agent can also call other agents or even plain old functions. So, just like with any programming, you modify the code/calls as needed.
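
To make the “non-AI function” idea concrete: one simple deterministic check is that every article the answer cites was actually in the context you supplied, and that every passage it quotes appears verbatim in the retrieved law text. A hypothetical sketch (the data shapes are invented for illustration):

```python
# Plain-code validation agent: no model call anywhere in this function.
def validate_answer(quoted_passages: list[str],
                    cited_articles: list[str],
                    retrieved_laws: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for article in cited_articles:
        if article not in retrieved_laws:
            problems.append(f"Cited article {article} was not in the provided context.")
    full_text = " ".join(retrieved_laws.values())
    for passage in quoted_passages:
        if passage not in full_text:
            problems.append(f"Quoted passage not found verbatim: {passage[:60]}...")
    return problems

# Any problems found can be fed back to the drafting agent for another pass.
```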


Most likely that is referring to setting up the RAG or semantic search.

See: https://help.openai.com/en/articles/8868588-retrieval-augmented-generation-rag-and-semantic-search-for-gpts

A few sentences seems to be the sweet spot, but only real-world trial and error will tell you what actually works. I would not go smaller than a single sentence or larger than a single paragraph without a very good reason.
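
A minimal sketch of “iterating section by section” (the chunking rule, the prompt and the 1200-character limit are assumptions; in practice you would tune the size as noted above):

```python
# Split a long law text into paragraph-sized sections and query them one at a time,
# then combine the per-section notes in a final, much shorter prompt.
from openai import OpenAI

client = OpenAI()

def split_into_sections(text: str, max_chars: int = 1200) -> list[str]:
    sections, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            sections.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        sections.append(current.strip())
    return sections

def relevant_points(question: str, document: str) -> list[str]:
    notes = []
    for section in split_into_sections(document):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user",
                       "content": f"Question: {question}\n\nSection:\n{section}\n\n"
                                  "List anything in this section relevant to the question, "
                                  "or reply exactly 'nothing relevant'."}],
        )
        answer = resp.choices[0].message.content
        if "nothing relevant" not in answer.lower():
            notes.append(answer)
    return notes
```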

Consider doing one or more of the free courses from DeepLearning.ai


Honestly, I have not worked with that, and since you are working with legal documents I don’t want to give advice that is only a pure guess. Try asking that as a separate question; I would like to see what others say.


You are quite welcome. :slightly_smiling_face:

2 Likes

Thanks for your response. I haven’t tried GPT-4 or Turbo. I tried o1 and it doesn’t hallucinate much (very small issues), but I wonder how it will do if I give it even more tokens, like 40k; I will have to see. Still, how is the 128k token limit useful for 4o if it hallucinates at 20k tokens? Is this something that will be fixed?

Hallucination is part of the “natural” behaviour of an LLM: because it is a statistical model, it will always output one of the more likely tokens, one after another, with a slight element of randomness. Unfortunately, you can get into situations where there is no high-certainty solution, and then you get rubbish that is of no value; the LLM will still return tokens dogmatically all the same.

There is research into adding further processing steps to evaluate how “sure” the model is. This may lead to ways in which we can flag and handle uncertainty.
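
One handle the API already exposes is token log-probabilities. They are not a true confidence score, but unusually low probabilities on answer tokens can be a useful flag. A rough sketch (the example question and the 0.3 threshold are arbitrary):

```python
# Inspect token log-probabilities as a crude uncertainty signal.
import math
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Under Greek law, who bears the burden of proof in a tort claim?"}],
    logprobs=True,
)

token_probs = [math.exp(t.logprob) for t in resp.choices[0].logprobs.content]
if token_probs and min(token_probs) < 0.3:  # arbitrary threshold for this illustration
    print("Low-confidence tokens present; consider routing to human review.")
print(resp.choices[0].message.content)
```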

1 Like

This approach presents several challenges:

  1. The model may not reliably reference the sources it uses.
  2. The sources could be outdated.
  3. The model might miss important relevant case law.

A possible solution could be to build a RAG (Retrieval-Augmented Generation) pipeline using a database containing all relevant case law and legislation. However, even with this, it’s still a complex challenge.
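
For reference, the retrieval step of such a pipeline can be sketched with the embeddings endpoint. This is a bare-bones illustration: the corpus snippets are invented, and a real system would need a vector database, proper chunking and far more than cosine ranking.

```python
# Bare-bones RAG retrieval: embed the corpus once, embed the query, rank by cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

corpus = [  # illustrative snippets only
    "Art. 914 CC: Whoever unlawfully and culpably causes damage to another must compensate it.",
    "Art. 281 CC: The exercise of a right is prohibited if it manifestly exceeds good faith.",
    "Art. 57 CC: A person whose personality is unlawfully infringed may demand the infringement cease.",
]
corpus_vecs = embed(corpus)

def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed([question])[0]
    sims = corpus_vecs @ q / (np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(q))
    return [corpus[i] for i in np.argsort(sims)[::-1][:k]]

print(retrieve("Is my neighbour liable for damage caused by falling scaffolding?"))
```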

In my opinion, only an experienced legal professional can truly leverage GenAI to save time. GenAI won’t replace legal professionals for quite some time.

2 Likes

I agree with all points raised!

1 Like

I am already using RAG. It works decently when a question needs only 2-3 laws to be solved; the issue occurs when we want it to solve a law case that requires many laws (still well within the 128k context window), but LLMs start missing information even at 20k tokens. I honestly don’t understand the benefit of 128k if it can’t fully understand/read/comprehend 20k tokens (there might be benefits I am unaware of; let me know if any).