Awful results with fine-tuning (legal docs)

nunodonato · December 1, 2021, 4:26pm

Hi folks
So I had this idea of generating a JSONL file with the entire (or almost) IRS legalities of Portugal.
Unfortunately, the results with completion are really not satisfactory.
I’m not sure whether this is because:

the fine-tuned model is curie, not davinci, or
the way I built the JSONL file, which was pretty much one paragraph per line. I’m not sure this was the best way to keep “context”.

Right now, I get better results in the playground using Davinci than my fine-tuned model.

Thoughts or advice?

boris · December 1, 2021, 4:32pm

I suspect this is largely a difference between curie and davinci. Davinci is a lot better at understanding languages other than English, and at understanding more complex language, such as legal language. This isn’t easy to teach through fine-tuning. How many documents do you have? A fine-tuned davinci model will perform better.

What are the legal tasks you’re trying to solve? My starting point would be davinci-instruct-beta-v3.

If you have a specific task, where you have some number of example inputs and outputs, then fine-tuning will significantly increase the performance.

nunodonato · December 1, 2021, 4:36pm

Hi @boris
Yes I did notice a significant drop in quality due to the language.
My experiment was basically a way to “ask” questions about the IRS law, without having to read through it all. Unfortunately I can’t use the answers endpoint because the document is too big and each question becomes extremely costly.

Answering your question: it’s only 1 file, where I added every single paragraph of the law as a line, for a total of ~1600 lines. I followed this specific example from the documentation: Creating an expert model in the legal domain which understands internal company jargon

Can I try to get access to a davinci fine-tune, or is it not worth it for this kind of use?

daveshapautomator · December 1, 2021, 7:19pm

What did the prompts and responses look like in your JSONL?

nunodonato · December 1, 2021, 8:35pm

@daveshapautomator here’s a short sample

{"prompt":"","completion":"2 - Os rendimentos, quer em dinheiro quer em espécie, ficam sujeitos a tributação, seja qual for o local onde se obtenham, a moeda e a forma por que sejam auferidos."}
{"prompt":"","completion":"Artigo 2.º Rendimentos da categoria A"}
{"prompt":"","completion":"1 - Consideram-se rendimentos do trabalho dependente todas as remunerações pagas ou postas à disposição do seu titular provenientes de:"}

I’m not clear if splitting the text into such small parts is ok or if context is lost. As you can see in the last one, it ends with “de:” and then a bullet point list would follow that, but since this is split I don’t know if it can infer the connection between the several phrases.

daveshapautomator · December 1, 2021, 9:28pm

You might want to try whole paragraphs instead of single sentences.

overbeck.christopher · December 1, 2021, 9:46pm

That is literally what he said he did in the OP.

nunodonato · December 1, 2021, 9:47pm

true. But I’m going to experiment with bigger chunks.

dandrade.jose · December 3, 2021, 2:53am

@nunodonato I just finished a similar task, with a similar approach to yours and also with legal documents. I followed the same documentation example. I am comparing results between fine-tuning and the answers endpoint.

My document has some structure. I utilized the same jsonl file changing “metadata” to “prompt” and leaving it empty as suggested in the documentation example for legal domain.

So this,

{"text": "Artículo 13.- Distrito Nacional. La ciudad de Santo Domingo de Guzmán es el Distrito Nacional, capital de la República y asiento del gobierno nacional.", "metadata": "Artículo 13"}

after preparing data

$ openai tools fine_tunes.prepare_data -f ./constitucion.jsonl

became this

{"prompt":" ->","completion":" Artículo 13.- Distrito Nacional. La ciudad de Santo Domingo de Guzmán es el Distrito Nacional, capital de la República y asiento del gobierno nacional.\n"}

I am not satisfied either. Gotta learn some more, and test some more.
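For reference, the conversion described above (dropping “metadata” and leaving the prompt empty) can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the leading space on the completion is an assumption based on the prepared output shown above. Unlike the prepare_data tool, it does not add a " ->" separator or a trailing newline.

```python
import json

def to_prompt_completion(record):
    """Map a {"text", "metadata"} record to an empty-prompt training example."""
    # Assumption: completions start with a single space, as in the prepared
    # output shown above. The "metadata" field is simply dropped.
    return {"prompt": "", "completion": " " + record["text"].strip()}

def convert_lines(lines):
    """Convert JSONL lines one by one; blank lines are skipped."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        example = to_prompt_completion(json.loads(line))
        # ensure_ascii=False keeps accented characters readable in the file.
        out.append(json.dumps(example, ensure_ascii=False))
    return out

sample = ['{"text": "Artículo 13.- Distrito Nacional.", "metadata": "Artículo 13"}']
converted = convert_lines(sample)
print(converted[0])
```

Each string in `converted` is one line of the new JSONL file.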

nunodonato · December 3, 2021, 2:24pm

My results improved when I included bigger chunks of text in the completion, instead of short sentences/paragraphs.
It seems the context is more easily inferred, but I’m not sure.
try it out!
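A minimal sketch of the “bigger chunks” idea, assuming every article starts with a heading line such as “Artigo 2.º ...” (as in the sample posted earlier). The regular expression and helper names are hypothetical and would need adjusting to the real document. Paragraphs are merged until the next heading, so a line ending in “de:” stays in the same completion as whatever follows it.

```python
import json
import re

# Assumption: article headings look like "Artigo 2.º Rendimentos da categoria A".
ARTICLE_HEADING = re.compile(r"^Artigo\s+\d+\.º")

def group_by_article(paragraphs):
    """Merge consecutive paragraphs into one chunk per article."""
    chunks, current = [], []
    for paragraph in paragraphs:
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # A new heading closes the chunk collected so far.
        if ARTICLE_HEADING.match(paragraph) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(paragraph)
    if current:
        chunks.append("\n".join(current))
    return chunks

def to_jsonl(chunks):
    """One empty-prompt training example per chunk."""
    return [
        json.dumps({"prompt": "", "completion": " " + chunk}, ensure_ascii=False)
        for chunk in chunks
    ]

paragraphs = [
    "2 - Os rendimentos, quer em dinheiro quer em espécie, ficam sujeitos a tributação, seja qual for o local onde se obtenham, a moeda e a forma por que sejam auferidos.",
    "Artigo 2.º Rendimentos da categoria A",
    "1 - Consideram-se rendimentos do trabalho dependente todas as remunerações pagas ou postas à disposição do seu titular provenientes de:",
]
chunks = group_by_article(paragraphs)
jsonl_lines = to_jsonl(chunks)
print(len(chunks))
```

With the three sample paragraphs, the first one stays on its own (it belongs to the previous article) and the heading plus the following paragraph become a single completion.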

Unfortunately, I think we are severely limited until it’s possible to fine-tune davinci for use with languages other than English.

dandrade.jose · December 3, 2021, 3:34pm

I went one article (artigo) per line.

How big are your chunks?
Any idea how big a line can be? The whole law?

nunodonato · December 3, 2021, 3:39pm

More or less the same, one article per line. Sometimes that’s 1 or 2 sentences, sometimes 10

dandrade.jose · December 3, 2021, 3:52pm

I am still getting better results with the answers endpoint. But I definitely have to explore fine-tuning. It’s a matter of learning here.

From fine tuning documentation:

You can feed a large amount of legal textual data of high quality, such as contracts

{"prompt":"", "completion":" <legal document>"}

I wonder how large such a document could be?

A contract can be dozens of pages long. Dell FedEx 2005 was 100+ pages.

Anyway, let’s keep sharing. Thanks.

nunodonato · December 3, 2021, 3:53pm

I did try to upload the whole thing in one “completion” field, but the file was rejected due to the size.
I also got better results with the answers endpoint, but the cost is a problem because the tokens in the uploaded files are counted on every request.
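To make the cost concern concrete, here is a rough back-of-the-envelope sketch. It assumes, as described above, that the document’s tokens are counted again on every request. The numbers are hypothetical, and the 4-characters-per-token ratio is only a common rule of thumb for English text; Portuguese or Spanish will likely use more tokens per character.

```python
def approx_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; a real tokenizer will give different counts."""
    return max(1, round(len(text) / chars_per_token))

def tokens_billed(document_tokens, question_tokens, n_questions, resend_document):
    """Total tokens counted over n_questions requests."""
    per_request = question_tokens
    if resend_document:
        # The whole document is counted again on each request.
        per_request += document_tokens
    return per_request * n_questions

# Hypothetical figures: a 100,000-token law, 30-token questions, 50 questions.
with_document = tokens_billed(100_000, 30, 50, resend_document=True)
without_document = tokens_billed(100_000, 30, 50, resend_document=False)
print(with_document)     # 5001500
print(without_document)  # 1500
```

The gap between the two totals is why a fine-tuned model, where the document is only paid for once at training time, looked attractive in the first place.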

sps · December 23, 2021, 5:25pm

Hi,
Hope you found a solution.

Your use-case sounds pretty specific. I’d give the instruct series a try. The docs state:

The Instruct models share our base GPT-3 models’ ability to understand and generate natural language, but they’re better at understanding and following your instructions. You simply tell the model what you want it to do, and it will do its best to fulfill your instructions. This is an important step forward in our goal of building safe models that are aligned with human interests.

nunodonato · December 23, 2021, 5:41pm

the problem is not in understanding instructions, but getting the facts right

sps · December 23, 2021, 5:46pm

Ah! That sounds very much like what the answers endpoint is meant for when used with examples. It’s definitely costly when working with large files. Is it costly even with Ada?

nunodonato · December 23, 2021, 6:12pm

I didn’t even try. The performance of non-davinci models with languages other than English is not so good. And the doc is big, so it would still be quite a significant number of tokens in every request.

sps · December 23, 2021, 6:26pm

Just curious: what if you were to use English docs and pre-process the non-English queries with a translation API like Google or Bing?

asabet · December 23, 2021, 7:15pm

@nunodonato if language-specific performance is a concern, have you considered translating from Spanish to English first (you can pre-process your corpus if latency and cost are an issue), then performing inference with GPT-3? There are lots of high-quality Spanish->English translation APIs out there, or you can even fine-tune GPT-3 on a Spanish-English dataset.

Losing nuance in the translation could be a problem, but then again legal text is definitely more structured and unambiguous compared to less formal text - perhaps easier to translate.
