Re-using a corpus to use tokens efficiently

Hi All, is it possible to reuse a corpus when doing text classification with DaVinci? We may later have questions about the same corpus that were not included in the original prompt. Since the corpus is massive, re-using it would reduce the cost significantly. Please advise, thank you.

Yes you can fine-tune a previous fine-tune (from the docs):

Continue fine-tuning from a fine-tuned model
If you have already fine-tuned a model for your task and now have additional training data that you would like to incorporate, you can continue fine-tuning from the model. This creates a model that has learned from all of the training data without having to re-train from scratch.

To do this, pass in the fine-tuned model name when creating a new fine-tuning job (e.g. -m curie:ft--). Other training parameters do not have to be changed, however if your new training data is much smaller than your previous training data, you may find it useful to reduce learning_rate_multiplier by a factor of 2 to 4.


Thank you, Curt. We are not actually fine-tuning the model; this is only text classification using InstructGPT (DaVinci).

So, if I understand correctly, you want to leverage your previous classifications without classifying further?

So, say you have a database of all your previous classifications, and what they map to. Example: Input Text → Output Classification using DaVinci.

Then embed all your previous Input Text. When New Text comes in, embed this, correlate it against the Input Text embeddings, and determine the classification for the New Text to be the one for the closest previously classified Input Text.
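The nearest-neighbor lookup described above can be sketched as follows. This is a minimal sketch: the category names and 3-dimensional vectors are toy placeholders, and in practice the embeddings would come from an embeddings endpoint rather than being hand-written.

```python
# Nearest-neighbor classification over cached embeddings.
# The 3-d vectors and labels below are toy placeholders; real
# embeddings would come from an embeddings API and be stored
# alongside each previously classified Input Text.
import numpy as np

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_by_nearest(new_embedding, cache):
    """cache: list of (embedding, label) pairs from prior classifications."""
    best_label, best_score = None, -1.0
    for emb, label in cache:
        score = cosine_similarity(new_embedding, emb)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

# Previously classified Input Text -> Output Classification,
# with each text's stored embedding.
cache = [
    ([0.9, 0.1, 0.0], "complaint"),
    ([0.0, 0.8, 0.2], "praise"),
    ([0.1, 0.1, 0.9], "question"),
]

# New Text comes in: embed it, then take the label of the closest match.
label, score = classify_by_nearest([0.85, 0.2, 0.05], cache)
print(label, round(score, 3))
```

Each new classification can then be appended to the cache, so the lookup gets better over time without re-sending the corpus.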

Make sense? Or am I off here?

So I have a corpus of social media posts. I would like to classify each of them into one of 4 pre-defined categories that I provide. Once this exercise is completed successfully, I now realize I would also like to understand the sentiment of each of these posts. Instead of providing the same corpus again for this new task, is there a way I can provide the corpus a single time (and have it stored in some sort of memory) and then later run queries against it? I am asking so that I can save the costs of providing the corpus multiple times, since the corpus itself is the largest contributor to the cost of running this model. Thanks again.

For sentiment, you don’t have to use DaVinci, and for classification into four categories you don’t have to use DaVinci either.

For example, for classification I normally use a fine-tuned Babbage, but you can also use Ada. Costs are much lower and you can categorize anything on the fly in real time, save all results to a database, and if a repeat comes in, you just use that stored result instead of computing a duplicate. My fine-tune consists of over 4000 examples. Works perfectly. Use 1-token labels, so in your case ' 1', ' 2', ' 3', ' 4' (the preceding spaces help the tokenizer). When running, set max_tokens = 1 and temperature = 0. BOOM! Done.
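A sketch of what preparing that kind of training file might look like. This assumes the legacy prompt/completion JSONL fine-tune format; the file name, example posts, and category names are hypothetical, and the key detail from above is the single-token completions with a leading space.

```python
# Build a JSONL fine-tune file with 1-token labels.
# Example posts, categories, and file name are hypothetical.
import json

# Map each category to a single-token completion; the leading
# space helps the tokenizer treat each label as one token.
LABELS = {"billing": " 1", "outage": " 2", "feature": " 3", "other": " 4"}

examples = [
    ("My invoice was charged twice this month", "billing"),
    ("The app has been down since noon", "outage"),
]

# A fixed separator at the end of each prompt marks where the
# completion should begin.
lines = [
    json.dumps({"prompt": text + "\n\n###\n\n", "completion": LABELS[cat]})
    for text, cat in examples
]

with open("classifier_train.jsonl", "w") as f:
    f.write("\n".join(lines) + "\n")

print(lines[0])
```

At inference time you would send the post plus the same separator as the prompt, with max_tokens = 1 and temperature = 0, so the model can only answer ' 1' through ' 4'.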

For sentiment, there is so much to choose from. Personally, again, I would go with an Ada or Babbage fine-tune, or AWS Comprehend (which is what I use) if you don't want to create the classification data set yourself.

Avoid DaVinci; run cheaper models or cheaper services. Store everything in a database; I use DynamoDB.


That makes perfect sense, thank you, Curt!
