Fine-tuning with identical prompts - does it make sense?

Let’s say I want to fine-tune a model that will generate social media bios.

I could take 500 of the best bios from different large accounts in a certain niche and use each of them as a completion for the same prompt:

Create an Instagram profile bio for a finance influencer

So the training data would contain this prompt 500 times, each time paired with a different bio as the completion.
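In JSONL terms (the format the fine-tuning endpoint expects), the file would look something like this, with invented bios standing in for the real ones:

{"prompt": "Create an Instagram profile bio for a finance influencer", "completion": "Helping you master your money | Investing made simple | Author"}
{"prompt": "Create an Instagram profile bio for a finance influencer", "completion": "Crypto & personal finance | Daily market insights | DMs open"}
...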

Is this a useful way to fine-tune a model? If not, how else should I be doing it?

Thanks for the help :pray:

EDIT

Edited to add more information regarding the way the model would be used.

A request/prompt to this fine-tuned model could be just a series of keywords representing the person. E.g.

Generate a bio inspired by these keywords: Personal finance, crypto, financial independence

And this would generate a bio based on the 500 top bios the model was fine-tuned on…

Same input, multiple outputs would be confusing for a fine-tune. Why not embed, then retrieve, and have GPT create a related bio?


Thanks for the answer Curt!

Interesting. I’m new to all of this - can you explain what you mean by “embed”?

Do you mean this? :thinking:

Thanks :slight_smile:

Edit 1

Would you be so kind as to give more pointers on how to achieve this?

First off, what would I be embedding - the 500 example bios in one string?

Then, what should I retrieve? I don’t see a retrieve endpoint in the Embeddings API.

Finally, what is the link between the embeddings and the completion endpoint (the completion endpoint would be used for this final part, I suppose: “have GPT create a related bio”)?

Thank you :grin:

Edit 2

I guess I should create an embedding of each of the 500 bios and then store those embeddings for continual use, but I don’t understand what I should do with those embeddings that would let me use the completion endpoint to create similar bios.

Good questions.

I would embed each bio, assuming they are less than 1500 words; if they are larger, break them into chunks. Also embed any relevant metadata, like high-level descriptions.

So for each bio, embed:

Keyword1, Keyword2, ..., KeywordN

<Entire bio, or chunk of bio text>

So now you have a whole set of vectors, each tied to its bio text (or bio chunk text) and keywords.
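Here is a minimal sketch of that embedding step in Python, assuming the pre-1.0 openai library and a hypothetical bios list; the keyword prefix is optional, as noted below:

import openai  # reads OPENAI_API_KEY from the environment

bios = [
    "Keyword1, Keyword2\nPersonal finance tips | Crypto | Road to financial independence",
    # ... the rest of the ~500 bios (or bio chunks)
]

def embed(text):
    # Turn a string into a unit-length embedding vector
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return response["data"][0]["embedding"]

# Keep each bio's text next to its vector for later retrieval
bio_vectors = [{"text": bio, "vector": embed(bio)} for bio in bios]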

Then you have a new question come in from the user: “Create a bio on cryptocurrency, with a flair for Fintech”

This gets embedded (turned into a vector) and is used to correlate against your vectors of bios.

Take the top N correlations, feed the text behind those correlations into a prompt, and ask GPT to “Create a bio on cryptocurrency, with a flair for Fintech” against the context of your embeddings. So you may need to preface your retrieved embedding text with a Context label.
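Continuing the sketch above (reusing the hypothetical embed() and bio_vectors), the retrieval step could look like this; numpy is used only for convenience:

import numpy as np

query = "Create a bio on cryptocurrency, with a flair for Fintech"
query_vector = np.array(embed(query))

# Correlate the query vector against every stored bio vector and keep the top N
N = 5
scored = [(float(np.dot(query_vector, item["vector"])), item["text"]) for item in bio_vectors]
top_bios = [text for score, text in sorted(scored, reverse=True)[:N]]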

So example prompt:

Create a bio on cryptocurrency, with a flair for Fintech.  Use the following context as your information source.

Context: <Bio1>

<Bio 2>

...

<BioN>

Just be sure the total tokens of your input bio data and your output bio don’t exceed the max tokens for the model you are using. You may need to play with the prompt wording, but this is the general idea.
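Assembling that prompt from the retrieved bios might look like the following sketch; the model name and the rough 4-characters-per-token check are assumptions you would tune:

context = "\n\n".join(top_bios)
prompt = (
    "Create a bio on cryptocurrency, with a flair for Fintech. "
    "Use the following context as your information source.\n\n"
    "Context: " + context
)

# Crude sanity check: ~4 characters per token, leaving room for the output
assert len(prompt) / 4 + 200 < 4000, "Trim the context to fit the model's window"

response = openai.Completion.create(model="text-davinci-003", prompt=prompt, max_tokens=200)
print(response["choices"][0]["text"])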

The keywords are optional, as the embeddings might do that for you. It’s your call, since they could be a hassle to generate for each bio.


Oh got it! That makes sense.

So I would basically be using embeddings to find the best examples to feed GPT-3 via the prompt, as opposed to using a fine-tuned model.

The only thing that I don’t totally understand from your embed example are the keywords:

Keyword1, Keyword2, ..., KeywordN

<Entire bio, or chunk of bio text>

I understand that I should embed each bio, along with any other useful (meta)data from a user’s profile (other descriptions, titles, etc.), but in this example, what are Keyword1, Keyword2, ..., KeywordN? Are they simply this other useful metadata we mentioned?

I ask because I wonder if those are keywords that I would need to manually tag each bio with. Hopefully not; I would like to automate the whole process and avoid having to manually tag each bio.

Thanks :pray:

Sorry, I did a late edit above:

So no, don’t worry about the keywords. Try the straight embedding path; it should extract that information for you. But if it doesn’t, or if the searches lock onto unrelated bios, you might want to enhance the embeddings by adding the keywords.


Awesome.

So to find the “top N correlations”, I would simply need to do something like what is described in this guide (more precisely, the “Recommendations using embeddings” example)?

So basically, create a Python script that computes those correlations.

It would be great if I could do it in Node.JS, but it seems like this kind of stuff is mostly done in Python :face_holding_back_tears:

Cheers Curt!


ChatGPT (GPT-4) is quite powerful. I would be surprised if it couldn’t translate the Python code into JS for you, although it may use an outdated library. But it’s a great start with a template.


You can really do it in any language. All you are doing is taking the dot-product of your input vector against each of the bio vectors. To compute the dot-product, multiply the corresponding terms of the two vectors, then sum those products. Since the embedding vectors are unit vectors, the maximum it will ever be is +1. If you use ada-002 for embeddings, the space isn’t isotropic, so you might see a minimum around 0.7, even though theoretically it should be -1.

Anyway, the algorithm is so simple, you don’t need Python, and can just code it straight in any language of your choice, including JavaScript.
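For instance, the whole computation in dependency-free Python (a sketch; it carries over almost line for line into JavaScript):

def dot(a, b):
    # Multiply corresponding terms of the two vectors and sum the products
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total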


I see.

So I need to iterate over each bio, calculate the dot-product (of the prompt vector and the current bio’s vector), and see which dot-products are the closest to 1?

Then the top 5 closest dot-products to 1.0 would be the top 5 correlations?

Correct. Then you would feed those 5 bios to GPT and ask it to create a new bio based on the 5 bios in the prompt.
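Tying it together with the plain dot() above and the hypothetical embed() and bio_vectors from the earlier sketches:

query_vector = embed("Create a bio on cryptocurrency, with a flair for Fintech")

# Score every bio and keep the five whose dot-products are closest to 1
scored = [(dot(query_vector, item["vector"]), item["text"]) for item in bio_vectors]
top_five = [text for score, text in sorted(scored, reverse=True)[:5]]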
