Awful results with fine-tuning (legal docs)

Hi folks
So I had the idea of generating a JSONL file with (almost) the entire Portuguese IRS tax law.
Unfortunately, the completion results are really not satisfactory.
I’m not sure if this is because

  1. The fine-tuned model is curie, not davinci
  2. The way I built the JSONL file: I pretty much added one paragraph per line. I'm not sure this was the best way to keep "context"

Right now, I get better results in the playground using Davinci than my fine-tuned model.

Thoughts or advice? :slight_smile:


I suspect this is largely a difference between curie and davinci. Davinci is a lot better at understanding languages other than English, and at handling more complex language, such as legal text. This isn't easy to teach through fine-tuning. How many documents do you have? Fine-tuning a davinci model will perform better.

What are the legal tasks you’re trying to solve? My starting point would be davinci-instruct-beta-v3.

If you have a specific task, with some number of example inputs and outputs, then fine-tuning will significantly increase performance.

Hi @boris
Yes I did notice a significant drop in quality due to the language.
My experiment was basically a way to “ask” questions about the IRS law, without having to read through it all. Unfortunately I can’t use the answers endpoint because the document is too big and each question becomes extremely costly.

Answering your question: it's only 1 file, where I added every single paragraph of the law as a line, ~1600 lines in total. I followed this specific example from the documentation: Creating an expert model in the legal domain which understands internal company jargon
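Roughly what I did, as a sketch (the sample text and variable names here are illustrative, not my actual script; in practice the full statute is read from disk):

```python
import json

# Illustrative excerpt of the law; paragraphs separated by blank lines.
law_text = (
    "Artigo 2.º Rendimentos da categoria A\n\n"
    "1 - Consideram-se rendimentos do trabalho dependente todas as "
    "remunerações pagas ou postas à disposição do seu titular"
)

# One JSONL record per paragraph, with an empty prompt, as in the
# documentation's legal-domain example.
records = [
    {"prompt": "", "completion": p.strip()}
    for p in law_text.split("\n\n")
    if p.strip()
]
jsonl_lines = [json.dumps(r, ensure_ascii=False) for r in records]
```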

Can I attempt to get access to a davinci fine-tune, or is it not worth it for this kind of use case?


What did the prompts and responses look like in your JSONL?

@daveshapautomator here’s a short sample

{"prompt":"","completion":"2 - Os rendimentos, quer em dinheiro quer em espécie, ficam sujeitos a tributação, seja qual for o local onde se obtenham, a moeda e a forma por que sejam auferidos."}
{"prompt":"","completion":"Artigo 2.º Rendimentos da categoria A"}
{"prompt":"","completion":"1 - Consideram-se rendimentos do trabalho dependente todas as remunerações pagas ou postas à disposição do seu titular provenientes de:"}

I’m not sure whether splitting the text into such small parts is OK or whether context is lost. As you can see in the last one, it ends with “de:” and a bullet-point list would follow, but since this is split I don’t know if the model can infer the connection between the phrases.

You might want to try whole paragraphs instead of single sentences.

That is literally what he said he did in the OP.

True. But I’m going to experiment with bigger chunks.

@nunodonato I just finished a similar task, with a similar approach to yours and with legal documents. I followed the same documentation example. I am comparing results between fine-tuning and the answers endpoint.

My document has some structure. I used the same JSONL file, changing “metadata” to “prompt” and leaving it empty, as suggested in the documentation example for the legal domain.

So this,

{"text": "Artículo 13.- Distrito Nacional. La ciudad de Santo Domingo de Guzmán es el Distrito Nacional, capital de la República y asiento del gobierno nacional.", "metadata": "Artículo 13"}

after preparing data

$ openai tools fine_tunes.prepare_data -f ./constitucion.jsonl

became this

{"prompt":" ->","completion":" Artículo 13.- Distrito Nacional. La ciudad de Santo Domingo de Guzmán es el Distrito Nacional, capital de la República y asiento del gobierno nacional.\n"}

I am not satisfied either. Gotta learn some more, and test some more.

My results improved when I included bigger chunks of text in the completion, instead of short sentences/paragraphs.
It seems the context is more easily inferred, but I’m not sure.
try it out!
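For the bigger-chunks experiment, something like this sketch works for me: merging consecutive articles under a rough character budget (the budget is a crude stand-in for tokens, and the numbers are illustrative):

```python
def merge_into_chunks(articles, max_chars=2000):
    """Greedily merge consecutive articles into chunks of at most max_chars."""
    chunks, current = [], ""
    for art in articles:
        # Start a new chunk when adding this article would blow the budget.
        if current and len(current) + 1 + len(art) > max_chars:
            chunks.append(current)
            current = art
        else:
            current = current + "\n" + art if current else art
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then becomes one JSONL completion instead of one short paragraph.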

Unfortunately, I think we are severely limited until it’s possible to fine-tune davinci for use with languages other than English.


I went one article (artigo) per line.

How big are your chunks?
Any idea how big a line can be? The whole law :grimacing:?

More or less the same, one article per line. Sometimes that’s 1 or 2 sentences, sometimes 10.

I am still getting better results with the answers endpoint. But I definitely have to explore fine-tuning further. A matter of learning here.

From the fine-tuning documentation:

You can feed a large amount of legal textual data of high quality, such as contracts

{"prompt":"", "completion":" <legal document>"}

I wonder how large such a document could be?

A contract can be dozens of pages long; the 2005 Dell-FedEx contract was over 100 pages.

Anyway, let’s keep sharing. Thanks.

I did try to upload the whole thing in one “completion” field, but the file was rejected due to its size.
I also got better results with the answers endpoint, but the cost is problematic, as it counts the tokens in the files on every request.
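To put rough numbers on that cost concern, a back-of-the-envelope sketch (the ~4 characters-per-token ratio is a common rule of thumb for GPT-3, and the line/character counts are assumptions, not measurements):

```python
def rough_token_count(text: str) -> int:
    # ~4 characters per token is a common rough heuristic for GPT-3 tokenizers.
    return len(text) // 4

# Assumed corpus size: ~1600 lines at ~250 characters each.
doc = "x" * (1600 * 250)
# The answers endpoint counts tokens in the files on every question,
# so this estimate applies per request, not once.
tokens_per_request = rough_token_count(doc)
```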


Hi :wave: ,
Hope you found a solution.

Your use-case sounds pretty specific. I’d give the instruct series a try. The docs state:

The Instruct models share our base GPT-3 models’ ability to understand and generate natural language, but they’re better at understanding and following your instructions. You simply tell the model what you want it to do, and it will do its best to fulfill your instructions. This is an important step forward in our goal of building safe models that are aligned with human interests


The problem is not understanding instructions, but getting the facts right :slight_smile:

Ah! That sounds a lot like what the answers API is meant for when used with examples. Definitely costly when working with large files. Is even usage with Ada costly?

I didn’t even try. The performance of non-davinci models with languages other than English is not so good. And the doc is big, so it would still be a significant number of tokens on every request.

Just curious: what if you were to use English docs and pre-process the non-English queries with a translation API like Google or Bing?

@nunodonato if language-specific performance is a concern, have you considered translating from Portuguese to English first (you can pre-process your corpus if latency and cost are an issue), then performing inference with gpt3? There are lots of high-quality Portuguese-to-English translation APIs out there, or you can even fine-tune gpt3 on a Portuguese-English dataset.

Losing nuance in the translation could be a problem, but then again legal text is definitely more structured and unambiguous than less formal text, so it is perhaps easier to translate.
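If anyone tries the pre-translation route, here is a sketch of caching translations so each paragraph is only translated once (translate_fn is a stand-in for whatever real translation API you pick, not an actual client):

```python
import hashlib

def translate_cached(text, translate_fn, cache):
    """Translate once per unique paragraph; reuse the cached result afterwards."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = translate_fn(text)
    return cache[key]
```

Pass the same cache dict across the whole corpus (or persist it to disk) so re-runs and repeated paragraphs don’t hit the paid API again.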