Hi folks
So I had this idea of generating a JSONL file with the entire (or almost entire) Portuguese IRS (personal income tax) law.
Unfortunately, the completion results are really not satisfactory.
I’m not sure if this is because:
- the fine-tuned model is curie, not davinci
- the way I built the JSONL file: I pretty much added one paragraph per line, and I’m not sure this was the best way to keep “context”
Right now, I get better results in the playground using Davinci than my fine-tuned model.
I suspect this is largely a difference between curie and davinci. Davinci is a lot better at understanding languages other than English, and at understanding more complex language, such as legal language. This isn’t easy to teach within fine-tuning. How many documents do you have? A fine-tuned davinci model would perform better.
What are the legal tasks you’re trying to solve? My starting point would be davinci-instruct-beta-v3.
If you have a specific task, with some number of example inputs and outputs, then fine-tuning will significantly improve performance.
Hi @boris
Yes, I did notice a significant drop in quality due to the language.
My experiment was basically a way to “ask” questions about the IRS law without having to read through it all. Unfortunately, I can’t use the answers endpoint because the document is too big and each question becomes extremely costly.
{"prompt":"","completion":"2 - Os rendimentos, quer em dinheiro quer em espécie, ficam sujeitos a tributação, seja qual for o local onde se obtenham, a moeda e a forma por que sejam auferidos."}
{"prompt":"","completion":"Artigo 2.º Rendimentos da categoria A"}
{"prompt":"","completion":"1 - Consideram-se rendimentos do trabalho dependente todas as remunerações pagas ou postas à disposição do seu titular provenientes de:"}
I’m not clear whether splitting the text into such small parts is OK, or whether context is lost. As you can see, the last one ends with “de:”, and a bullet-point list would follow it, but since the text is split I don’t know if the model can infer the connection between the phrases.
@nunodonato I just finished a similar task, with a similar approach to yours and with legal documents. I followed the same documentation example, and I am comparing results between fine-tuning and the answers endpoint.
My document has some structure. I reused the same JSONL file, changing “metadata” to “prompt” and leaving it empty, as suggested in the documentation example for the legal domain.
So this:
{"text": "Artículo 13.- Distrito Nacional. La ciudad de Santo Domingo de Guzmán es el Distrito Nacional, capital de la República y asiento del gobierno nacional.", "metadata": "Artículo 13"}
became this:
{"prompt":" ->","completion":" Artículo 13.- Distrito Nacional. La ciudad de Santo Domingo de Guzmán es el Distrito Nacional, capital de la República y asiento del gobierno nacional.\n"}
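The conversion between the two formats is mechanical, so here is a small sketch of it (the function name is mine; the " ->" separator prompt, leading space, and trailing "\n" stop sequence follow the record shown above):

```python
def answers_to_finetune(record: dict) -> dict:
    """Convert an answers-endpoint record ({"text": ..., "metadata": ...})
    into a fine-tuning record: " ->" separator prompt, completion with a
    leading space and a trailing newline as the stop sequence."""
    return {"prompt": " ->", "completion": " " + record["text"] + "\n"}
```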
I am not satisfied either. Gotta learn some more, and test some more.
My results improved when I included bigger chunks of text in the completion, instead of short sentences/paragraphs.
It seems the context is more easily inferred, but I’m not sure.
try it out!
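One way to get those bigger chunks is to pack consecutive paragraphs together until a size budget is hit, so that related lines (e.g. a sentence ending in “de:” and the bullet list that follows it) stay in the same completion. A sketch, assuming a rough character budget (~4 characters per token is a common rule of thumb, not an exact count):

```python
def chunk_paragraphs(paragraphs, max_chars=2000):
    """Greedily pack consecutive paragraphs into chunks of at most
    max_chars characters, preserving order and content."""
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 1 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = (current + "\n" + p) if current else p
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then go into one completion field instead of one paragraph per record.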
Unfortunately, I think we are severely limited until it’s possible to fine-tune davinci for use with languages other than English.
I did try to upload the whole thing in one “completion” field, but the file was rejected due to the size.
I also got better results with answers, but the cost is problematic, since the tokens in the file count toward every request.
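To see why this adds up, here is a back-of-the-envelope sketch: with the answers endpoint the selected document tokens are billed on every question, so per-question cost scales with document size. The price and the ~4-characters-per-token ratio below are illustrative assumptions, not actual OpenAI pricing:

```python
def estimate_cost_per_question(doc_chars, question_tokens=50,
                               price_per_1k_tokens=0.06):
    """Rough per-question cost when the whole document is billed on
    every request (assumed price, ~4 chars/token heuristic)."""
    doc_tokens = doc_chars / 4  # rough heuristic, not a real tokenizer
    return (doc_tokens + question_tokens) / 1000 * price_per_1k_tokens
```

At these assumed numbers, a 400,000-character document costs on the order of a few dollars per question, which matches the concern above.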
Your use-case sounds pretty specific. I’d give the instruct series a try. The docs state:
The Instruct models share our base GPT-3 models’ ability to understand and generate natural language, but they’re better at understanding and following your instructions. You simply tell the model what you want it to do, and it will do its best to fulfill your instructions. This is an important step forward in our goal of building safe models that are aligned with human interests.
Ah! That sounds a lot like what the answers API is meant for when used with examples. It’s definitely costly when working with large files. Is even usage with Ada costly?
I didn’t even try. The performance of non-davinci models in languages other than English is not so good. And the document is big, so every request would still include quite a significant number of tokens.
@nunodonato if language-specific performance is a concern, have you considered translating from Spanish to English first (you can pre-process your corpus if latency and cost are an issue), and then performing inference with GPT-3? There are lots of high-quality Spanish->English translation APIs out there, or you can even fine-tune GPT-3 on a Spanish-English dataset.
Losing nuance in the translation could be a problem, but then again, legal text is definitely more structured and unambiguous than less formal text, so it is perhaps easier to translate.