Do you fine-tune? If so, why?

@mcavanaugh

In this context, the categorizer outputs single tokens. For example, “CCSS text stuff” → ‘ 0’ and “non-CCSS text stuff” → ‘ 1’.

Note the space preceding the value of 0 or 1. To run this, set temperature = 0 and max_tokens = 1.

Limiting it like this avoids confusion in the model, and makes it more reliable.

You can start with ada or babbage, and work your way up if needed.
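For anyone following along, a minimal call against such a fine-tuned classifier might look like the sketch below (the model name and separator are placeholders for whatever you used in training; this uses the legacy Completions endpoint that ada/babbage fine-tunes are served through):

```python
# Sketch: calling a fine-tuned single-token classifier.
# "ada:ft-your-org-2023-01-01" is a placeholder for your own fine-tune name.
import openai

def classify(text: str) -> str:
    resp = openai.Completion.create(
        model="ada:ft-your-org-2023-01-01",  # placeholder fine-tuned model
        prompt=text + "\n\n###\n\n",         # end with the same separator used in training
        temperature=0,                        # deterministic
        max_tokens=1,                         # force a single token: " 0" or " 1"
    )
    return resp["choices"][0]["text"].strip()

# classify("CCSS text stuff")      -> "0"
# classify("non-CCSS text stuff")  -> "1"
```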

1 Like

This was super informative and helpful, thank you!! It does seem like perhaps I do want to go with fine-tuning for classification, then. Would you agree?

If I’m feeding a model like 200 words of instructional content, I would want it to “classify” that content as aligning with one or more educational standards. Does that make sense?

1 Like

I think this makes sense now, thank you!

Do you think, then, that my objective should be to fine-tune a model to categorize selections of educational content as aligning with one or more standards?

1 Like

Yes, I would try that first.

1 Like

You know, this question really interested me. So, I asked GPT-4 its opinion on it. Mainly because I was torn between the thought of “well, if the current model only knows CCSS, it may be biased” and “well, embeddings don’t seem like the best option”. I think this response is the most fruitful.

To fine-tune the model for your specific use case, you should consider the following:

  1. Data format: Organize your dataset in a format that clearly presents the educational content and the corresponding non-CCSS standards. This might involve structuring the dataset as pairs of content and associated standards, or using a more elaborate structure if necessary. Ensure that the dataset is clean and well-prepared.
  2. Fine-tuning task: Since your goal is to have the model correlate educational content to alternative standards, you should fine-tune the model on a classification or a ranking task. In a classification task, the model will predict the correct standard for a given content, whereas in a ranking task, the model will rank the most relevant standards for the content. Choose the task that best suits your needs.
  3. Prompt/completion format: When fine-tuning the model, you should provide the input in a consistent format, such as:

[EDUCATIONAL_CONTENT] [SEPARATOR] [ASSOCIATED_STANDARD]

or

"The educational content is: [EDUCATIONAL_CONTENT]. The associated standard is: [ASSOCIATED_STANDARD]."

Use a consistent format for both training and inference. During inference, you can provide the educational content and ask the model to predict the associated standard.

  4. Model selection: Choose an appropriate model to fine-tune, such as a base or a large version of a pre-trained language model like BERT or GPT. Ensure that the model’s architecture and size are suitable for your computational resources and the complexity of your task.
  5. Evaluation: Set aside a portion of your dataset for validation and testing. This will allow you to evaluate the performance of the fine-tuned model and make adjustments as needed.

Fine-tuning a model in this way should help you achieve your goal of correlating educational content with alternative standards. Keep in mind that you might need to experiment with different formats, fine-tuning tasks, and model architectures to achieve the best results.
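To make the prompt/completion format above concrete, a training file in the JSONL format the legacy fine-tuning endpoint expects might look like the sketch below. The “ALT-STD-…” codes and the content snippets are invented placeholders, the “ ->” separator is arbitrary, and the leading space on each completion follows the advice earlier in this thread:

```jsonl
{"prompt": "Students compare two fractions with unlike denominators using visual fraction models. ->", "completion": " ALT-STD-101"}
{"prompt": "Learners identify the central idea of an informational text and cite supporting evidence. ->", "completion": " ALT-STD-207"}
```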

2 Likes

Seems like the best use case for fine tuning so far is classification… Sorry I’ve been a little AWOL from this thread. Been coding 🙂

I don’t do classification; I think there are many uses for fine-tuning besides that.

Apologies for the late reply, was off the grid for much of the weekend!

This is a really interesting response, appreciate it! It’ll be a lot of work to get the new data for this, but it seems like a really good way to go. Thanks again!!

2 Likes

Great thread to understand usage of embeddings vs fine-tuning (FT) - thank you very much @curt.kennedy , @AgusPG and others.

I want to build a fine-tuned (FT) model that can do the following for unseen articles:

  1. Classify the category of the article based on a summary of the article. I have 1000s of prompt-completion examples (summary+category) to train an FT model for this.
  2. Extract citations / references from the article. I can create training data pairs by using paragraphs from sample articles that include a citation (prompt) + the citation itself (completion).
  3. Identify keywords - I can create training data for this by giving key parts of the article (prompt) + keywords (completion). Note that the articles themselves are much larger than the FT limit of 2048 tokens, so they can’t be fed in whole.

**QUESTION:**

- Can I train a single FT model to do all of the above, or do I need to create 3 separate FT models? Or is there another approach I should consider?

- Would a few hundred training examples be enough for items 2 and 3 above?

Also, any other advice would be gratefully received 🙏

Thanks
Mim

1 Like

A fine-tune should work for this.

I don’t think a fine-tune will work for this. How will the AI “learn” of unknown or unseen citations? It can’t do this. You are better off with normal code pulling out the citations.

I would avoid a fine-tune here too. Create some sort of “word rarity index” and put all the rare words as keywords.
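For what it’s worth, here is one way that “word rarity index” idea could be sketched in Python (my interpretation, not a specific library: rank each word in an article by how rarely it appears across the whole corpus, and keep the rarest ones as keywords):

```python
# Sketch of a "word rarity index" keyword picker: words that appear in few
# other articles score high and are treated as keywords. The length filter
# and top_k are arbitrary illustrative choices.
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def document_frequencies(corpus):
    """corpus: list of article strings -> Counter of how many articles contain each word."""
    df = Counter()
    for article in corpus:
        df.update(set(tokenize(article)))
    return df

def rare_keywords(article, df, n_docs, top_k=10):
    """Rank the article's words by inverse document frequency (rarer = higher score)."""
    scores = {
        w: math.log(n_docs / (1 + df.get(w, 0)))
        for w in set(tokenize(article))
        if len(w) > 3  # crude filter for short function words
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# df = document_frequencies(articles)              # articles: your list of texts
# print(rare_keywords(articles[0], df, len(articles)))
```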

1 Like

Thanks for the quick response @curt.kennedy !

Re the citations, they have a very specific format, such as:

[2023] ABCD 123 Full title of the article

Of course, the year varies, the ‘ABCD’ can have one of 6 pre-set values, the 123 can be any number, and the full title also varies. I was thinking of giving it hundreds of such citations (making sure I include at least 20 versions of each of those 6 pre-set ‘ABCD’ values) so that it could recognise the pattern. A sample prompt would be something like:

prompt
“The matter under discussion related to a previous project, documented under [2021] IECD 79 Company A - Manual Handling of Loads, that included guidelines for handling goods on pallets.”

completion
[2021] IECD 79 Company A - Manual Handling of Loads

I thought this would work similarly to sentiment classification for Customer reviews, whereby a model trained with a few hundred sample reviews can classify unseen reviews correctly even if they contain phrases that were never seen in training e.g. “Assembling the tent was a complete fiasco” → negative.
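For what it’s worth, the “normal code” route suggested above could be as simple as a regex over that pattern. A minimal sketch, assuming the citation always looks like “[YYYY] CODE NUM Title” and the title runs up to the next punctuation mark (both assumptions on my part; swap in your 6 real acronyms):

```python
# Sketch of pattern-based citation extraction.
import re

ACRONYMS = ("ABCD", "IECD", "IDCD")  # placeholder: your 6 pre-set values

CITATION_RE = re.compile(
    r"\[(?P<year>\d{4})\]\s+"                      # [2021]
    r"(?P<code>" + "|".join(ACRONYMS) + r")\s+"    # IECD
    r"(?P<number>\d+)\s+"                          # 79
    r"(?P<title>[^,.;\n]+)"                        # title, up to the next , . ; or newline
)

def extract_citations(text):
    return [m.group(0).strip() for m in CITATION_RE.finditer(text)]

# text = ("The matter under discussion related to a previous project, documented under "
#         "[2021] IECD 79 Company A - Manual Handling of Loads, that included guidelines "
#         "for handling goods on pallets.")
# extract_citations(text)
# -> ['[2021] IECD 79 Company A - Manual Handling of Loads']
```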

I think a one- or two-shot prompt is enough to do this then, since the citations are consistently formatted.

Here is Turbo 1-shot:
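(The screenshot itself hasn’t carried over here. A 1-shot chat call along those lines might look roughly like the sketch below; this is an illustrative reconstruction using the example pair from the previous post via the legacy ChatCompletion endpoint, not the exact prompt from the screenshot.)

```python
# Sketch of a 1-shot citation-extraction prompt with gpt-3.5-turbo ("Turbo").
# The example pair is taken from the post above; the wording is illustrative.
import openai

EXAMPLE_TEXT = ("The matter under discussion related to a previous project, documented "
                "under [2021] IECD 79 Company A - Manual Handling of Loads, that included "
                "guidelines for handling goods on pallets.")
EXAMPLE_CITATION = "[2021] IECD 79 Company A - Manual Handling of Loads"

def extract_citation(text):
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": "Extract the citation from the user's text. "
                                          "Reply with the citation only."},
            {"role": "user", "content": EXAMPLE_TEXT},           # 1-shot example input
            {"role": "assistant", "content": EXAMPLE_CITATION},  # 1-shot example output
            {"role": "user", "content": text},                   # new, unseen text
        ],
    )
    return resp["choices"][0]["message"]["content"].strip()
```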

1 Like

@curt.kennedy thanks very much for the screenshot - I’ll try this approach.

I had tried including instructions about the format in the prompt (“the format is the year in square brackets, followed by 1 of these acronyms - ABCD, etc. - followed by a number and a title”), and giving it 2 example citations (“… for example, ‘[2021] IECD 79 Company A - Manual Handling of Loads’ or ‘[2004] IDCD 79 Company B - Stacking Shelves in Retail’”), and it got most of the citations, but it still missed some every time. I didn’t put those examples in context as you have, so I will try that now - thank you!!

1 Like

For OpenAI, I do not fine tune; I use prompt engineering and embedding retrieval.

However, for the custom query language I’m working with (which wasn’t on the web in 2021), the models aren’t smart enough to understand all the nuances of the language from prompting alone, so I have to use fine-tuning there. When I tried fine-tuning davinci it didn’t perform very well, so I’m now using a fine-tuned MPT-7B model for this special case.
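As a rough illustration of the embedding-retrieval half of that setup, here is a minimal sketch (legacy Embedding endpoint; the chunking helper and prompt wording are hypothetical, and storage/caching are omitted):

```python
# Sketch of embedding retrieval: embed document chunks once, then at query time
# embed the question, rank chunks by cosine similarity, and paste the best ones
# into the prompt as context.
import numpy as np
import openai

EMBED_MODEL = "text-embedding-ada-002"

def embed(texts):
    resp = openai.Embedding.create(model=EMBED_MODEL, input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

def top_chunks(question, chunks, chunk_vecs, k=3):
    q = embed([question])[0]
    # ada-002 vectors are unit length, so a dot product is the cosine similarity
    scores = chunk_vecs @ q
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# chunks = split_docs_into_passages(...)   # hypothetical chunking helper
# chunk_vecs = embed(chunks)               # compute once and cache
# context = "\n\n".join(top_chunks(question, chunks, chunk_vecs))
# prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```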