I would like to fine-tune a model using a CEFR corpus for vocabulary, so the model will only give responses using the vocabulary contained in that corpus. Would embedding or fine-tuning be the more appropriate choice? If fine-tuning, how would you make the prompt-completion pairs when it is a corpus of thousands of vocabulary items?
Thanks in advance
The model may already know what CEFR is, but it will be a 2021 version of it. Also, there will never be 100% compliance with such a request; even with a temperature of 0 the responses will have variation, and token choice may vary.
An LLM's abilities are the result of building relationships between billions of words in the wild, so the model may well find a more plausible token (word part) for its responses that does not align with the CEFR vocabulary. It may get quite close and be acceptable for your use case, but to only use those words (I assume it is a large subset of thousands of words) would be a challenge.
Thanks for your reply.
Ideally, I would like to limit the use of vocabulary even further, so certain cases would only use vocabulary up to, say, A2 on the CEFR list. The context here is assessing foreign language ability in students, so there have to be limits on which words are used in order to accurately measure students at certain levels with IRT (item response theory). Do you think this is asking a bit much?
I think it may be a challenge!
One thing to absolutely try is to use ChatGPT and just ask it to follow those standards; see if the results it comes up with approach an acceptable standard for you.
If you get responses that look usable, then you can try the same in the Playground with the API and ensure you get similar results (tuning of model parameters such as temperature may be needed; use the Chat mode in the Playground).
Then, if you get some minor discrepancies, iterative prompt engineering may be able to add a further layer of refinement to the output.
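To make that concrete, an API call of that shape might look like this. A minimal sketch using the openai Python library; the prompt wording is purely illustrative and would need iteration:

```python
import openai  # openai Python library (v0.x-style API)

openai.api_key = "sk-..."  # your API key

# The system prompt wording here is illustrative only.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0,  # reduces, but does not eliminate, variation
    messages=[
        {"role": "system",
         "content": "You are writing for beginner language learners. "
                    "Use only simple, high-frequency vocabulary at CEFR "
                    "level A2 or below."},
        {"role": "user",
         "content": "Write three short sentences about the weather."},
    ],
)
print(response["choices"][0]["message"]["content"])
```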
The alternative would be to embed the entire CEFR corpus and attempt to use that as context for the model, but telling the model NOT to do something is always a challenge and usually fruitless; if the instruction can be reworded as "do" something rather than "do not"… possibilities exist there.
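A rough sketch of that "do" version: embed the vocabulary items once, retrieve the words most relevant to the topic at hand, and hand those to the model as a positive instruction. This assumes the ada-002 embedding model; the word list and prompt wording are placeholders:

```python
import numpy as np
import openai

def embed(texts):
    # text-embedding-ada-002 was the standard embedding model at the time
    resp = openai.Embedding.create(input=texts, model="text-embedding-ada-002")
    return np.array([item["embedding"] for item in resp["data"]])

# Tiny stand-in for the real word list, which would be the full A2 subset.
a2_words = ["weather", "rain", "sun", "cold", "hot", "today", "umbrella"]
word_vectors = embed(a2_words)

# Embed the topic of the item to be generated, then retrieve the allowed
# words most relevant to it by cosine similarity.
topic_vector = embed(["a short text about the weather"])[0]
scores = (word_vectors @ topic_vector) / (
    np.linalg.norm(word_vectors, axis=1) * np.linalg.norm(topic_vector)
)
relevant = [a2_words[i] for i in np.argsort(scores)[::-1][:50]]

# Phrase the constraint positively ("use these") rather than negatively.
context = "Write using only words from this list: " + ", ".join(relevant)
```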
Interesting use case.
(Note: fine-tuning adds new patterns and ways of thinking, new things to match against; it is not a good way to add additional data. That is, it is good for telling a system how to respond, but not what to respond with.)
I was starting to feel this might be the case. Forcing restrictions on language usage and the connections that can be made seems to go against all of its instincts!
Currently we are experimenting with the Playground and the API; a script shares an Excel file of examples with the model (gpt-3.5-turbo), which is asked to generate similar items. It works very well, but it would save a lot of work on our end if we could restrict vocabulary use. Being able to control that variable would also really broaden the application. Something to work on.
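For reference, our script is roughly this shape (simplified; the file and column names are made up for the example):

```python
import openai
import pandas as pd

# Made-up file and column names, just to show the shape of the pipeline.
examples = pd.read_excel("items.xlsx")["item_text"].tolist()

prompt = ("Here are some example test items:\n"
          + "\n".join(f"- {item}" for item in examples[:10])
          + "\n\nWrite five more items in the same style.")

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response["choices"][0]["message"]["content"])
```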
Thanks again for your advice
Well, you might have luck with fine-tuning a model if you have good Q/A or paired data readily on hand. Fine-tuning is a way of approximating a model's weights being updated by utilising a simpler matrix; it typically does not impart new "facts", but it can teach the model new ways to work, so… there is certainly a possibility that a fine-tune would yield results.
My only concern is that currently the top model that is fine-tuneable is Davinci, and while a competent model, it is nowhere near GPT-4; even GPT-3.5 is considerably more powerful. There are plans to have a tuneable version of GPT-4 and possibly 3.5 by the end of the year, and that may indeed be a viable option, but I think in this case I would still combine that fine-tuning process with semantic search to produce appropriate context. I'm not sure of the timescales and budgetary constraints on your task, but it seems like it would benefit from experimentation and time spent just "playing" with the models to get a feel for the quality of the results.
The fact that your use case is language-based means that an LLM is an excellent choice of tool, but I am somewhat hesitant to say it's a trivial task.
How to prepare the data for fine-tuning is one thing I've been wondering about: how to structure a corpus as prompt-completion pairs. Would you have one prompt with a corpus of thousands of words as the completion? That doesn't seem right. Or is there another way to train the model? It will probably take until the end of the year to work out those issues, so we might just be in time for the new tuneable versions of the more recent models. I'm not sure how much success we would have with Davinci anyway. We have a modest budget, but will continue working at it. An LLM seems ideal for the task, but as always, the devil is in the details.
A short question and a long answer is as valid as a long question with a short answer; the neural net is flexible enough to cope with both and everything in between. Consistency and quality are key; there are no logical inferences to be had from junk. A single word prompting a multi-thousand-word reply would be fine if there is logic behind it. If I say to you "aria" and the context is creative work, you know to write a song. If the word in question is highly technical with an expansive meaning… sure.
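Concretely, the data for the legacy fine-tuning endpoint is a JSONL file of many small prompt-completion pairs, not one giant pair holding the whole corpus. A minimal sketch with invented example items:

```python
import json

# Invented examples: many small pairs that demonstrate the behaviour.
# The trailing "->" separator, leading space, and " END" stop sequence
# follow the conventions in OpenAI's legacy fine-tuning guide.
pairs = [
    {"prompt": "Write a short sentence about food for an A2 learner. ->",
     "completion": " I like to eat bread and cheese for breakfast. END"},
    {"prompt": "Write a short sentence about travel for an A2 learner. ->",
     "completion": " We go to the beach by bus every summer. END"},
]

with open("train.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```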
Ahh yes, the details. I think it is eminently doable, but it is going to require a fair bit of R&D.
You can create synthetic questions with AI: "What question would most stimulate an AI to give this answer as a completion? Write the question in the informal, brief language of a chatbot user." Have it play Jeopardy.
Then you just have the preprocessing step of transforming each information chunk into answer-like content. Another AI task.
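A sketch of that Jeopardy step, reusing the prompt wording above (the helper name and everything else is illustrative):

```python
import openai

def synthesize_question(answer_text: str) -> str:
    """Ask the model to invent the question that would elicit this answer."""
    prompt = ("What question would most stimulate an AI to give this answer "
              "as a completion? Write the question in the informal, brief "
              "language of a chatbot user.\n\nAnswer:\n" + answer_text)
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]

# Each (synthesized question, original chunk) pair becomes one training example.
```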
You can look at open-source examples. Use instruct-eval to tune with the best question-answer matches.
The failure point of the whole idea will be "so the model will only give responses using the vocabulary contained in that corpus". You won't be able to dissuade the AI from natural language with your weightings.