Exploring the Intricacies of Fine-Tuning Through Loss Metrics

Introduction

Hello.

I’ve fine-tuned gpt-3.5 on a training set of 7,776 examples, with a validation set of 864 examples (89%/11%).

My goal is for gpt-3.5 to know a certain topic very well in Italian: it needs to use the right terminology and minimize hallucinations.

I’m new to fine-tuning and I’ve done ~10 runs so far. In this case I used Auto hyperparameters.

So, based on this graph, what do I need to change or improve about the fine-tuning?

If you need more information please ask me.


Hi there!

Taking a step back here: From what it sounds like in your post, you are trying to use fine-tuning as a means to inject knowledge. Is this the correct understanding? If so, then please be mindful that this is not what fine-tuning is intended for. Fine-tuning is commonly used to get the model to behave in a certain way or produce output in a certain style or format.

For your use case, you want to look at a RAG solution, i.e. converting your knowledge base into vector embeddings and then using these to perform semantic search with a distance metric such as cosine similarity to find the best matches for a given user query.
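To make that concrete, here is a minimal sketch of embeddings-based retrieval in Python. The chunk texts, the model choice and the pre-1.0 openai SDK calls are illustrative assumptions, not a recommended implementation.

import numpy as np
import openai

# Embed a single string (assumes the pre-1.0 openai SDK and a valid API key)
def embed(text, model="text-embedding-3-large"):
    return openai.Embedding.create(input=text, model=model)["data"][0]["embedding"]

# Cosine similarity between two embedding vectors
def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical knowledge chunks - in practice these come from your own documents
knowledge_chunks = [
    "chunk one of your knowledge base ...",
    "chunk two of your knowledge base ...",
]
chunk_embeddings = [embed(c) for c in knowledge_chunks]

# Embed the user query with the same model and pick the closest chunk for the prompt context
query = "an example user question"
query_embedding = embed(query)
scores = [cosine(query_embedding, e) for e in chunk_embeddings]
best_chunk = knowledge_chunks[int(np.argmax(scores))]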

With that in mind, here are a few OpenAI resources you want to have a look at for additional context and some initial guidance on embeddings-based RAG:

  1. Overview of common use cases for fine-tuning

  2. A new OpenAI guide on “Accuracy Optimization”, which brings together the different concepts of fine-tuning, RAG and other mechanisms for achieving optimal outcomes, and discusses which mechanism to use for which purpose

  3. Some initial guidance on embeddings, including a link to an OpenAI cookbook with a worked example of how to use embeddings for Q&A


Hello @jr.2509

Thank you for the response and for sharing useful links.

Yes, my goal is to inject knowledge into an LLM to improve its understanding, specifically of the rules of soccer.

And from my tests (at least in Italian) gpt-3.5 and 4 lack precise terminology and make mistakes.

So from my understanding:

  1. I need to gather the rules of soccer
  2. Convert them into a “language” that is easier for the model to work with (vector embeddings)
  3. Implement a way for the model to find the best piece of knowledge (semantic search)

Is that correct?

I have a few follow-up questions:
a. Is there a specific way to clean the text before it is vectorized? For example, is plain text with Markdown OK?
b. Are there limitations on how much knowledge I can inject into the model? In total it will be 75k-100k tokens’ worth of knowledge.
c. How would you handle updates to the knowledge base? Do you need to reprocess the entire knowledge base to “update” the embeddings?
d. How do you evaluate the performance of a RAG system?
e. Can I use gpt-4o as the model that receives the injected knowledge?

Your basic understanding is correct. To be clear, when you use embeddings you are not de facto injecting knowledge into the model itself; you are injecting knowledge into the context. Semantic search via embeddings is a way to retrieve the most relevant pieces of information from a (proprietary) knowledge base in response to a specific user query/question. The information associated with your top-ranked embeddings is included in the context when making an API call.

I think it depends. I personally don’t have a strong view on that. You might want to use the Forum’s search function to discover discussions around this point. Besides formatting, there are lots of other things to consider, including how to chunk your knowledge, whether to include metadata, etc. The optimal approach will depend heavily on the nature of your knowledge as well as the type of user queries you are expecting. As you get started, you will want to invest some time experimenting to identify what works and what doesn’t.
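Purely to illustrate the chunking and metadata point, here is one possible sketch that splits a Markdown file on level-2 headings and attaches simple metadata; the heading pattern and the field names ("title", "source") are assumptions you would adapt to your material.

import re

# Sketch: split a Markdown document into one chunk per "## ..." section and attach metadata.
def chunk_markdown(text, source="knowledge.md"):
    chunks = []
    sections = re.split(r"\n(?=## )", text)  # split just before each level-2 heading
    for i, section in enumerate(sections):
        first_line = section.strip().splitlines()[0] if section.strip() else ""
        chunks.append({
            "id": f"{source}-{i}",
            "title": first_line.lstrip("# ").strip(),  # heading text, usable later for filtering
            "source": source,
            "text": section.strip(),
        })
    return chunks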

As per my earlier point, since you are not injecting knowledge into the model itself, there are no such constraints. Vector databases often contain a significant volume of vector embeddings. Your main constraint is how much text, i.e. knowledge, you can include in the context when making an API call. That in turn depends on the GPT model you are using. In the case of gpt-4o, the context window is 128k tokens, which should be more than plenty in your case.
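As a rough illustration of that context budget, you can count tokens with tiktoken and only include as many retrieved chunks as fit; the 128k limit and the margin reserved for the answer are assumptions to tune.

import tiktoken

# Sketch: keep adding retrieved chunks (ordered by relevance) until an assumed token budget is reached.
def fit_chunks_to_budget(chunks, model="gpt-4o", context_limit=128_000, reserved_for_answer=4_000):
    encoding = tiktoken.encoding_for_model(model)
    budget = context_limit - reserved_for_answer
    selected, used = [], 0
    for chunk in chunks:
        n = len(encoding.encode(chunk))
        if used + n > budget:
            break
        selected.append(chunk)
        used += n
    return selected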

You simply add new vectors to your vector database that encapsulate the new knowledge.
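For example, if you keep your chunks and embeddings in a simple CSV store, an update can be an append rather than a full re-embedding run; the file name and the "text"/"embedding" columns below are assumptions.

import openai
import pandas as pd

# Sketch: embed only the new chunks and append them to an existing CSV of chunks + embeddings.
def append_new_chunks(new_chunks, file_name="embeddings.csv", model="text-embedding-3-large"):
    new_rows = []
    for chunk in new_chunks:
        emb = openai.Embedding.create(input=chunk, model=model)["data"][0]["embedding"]
        new_rows.append({"text": chunk, "embedding": emb})
    df = pd.read_csv(file_name)
    df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
    df.to_csv(file_name, index=False)

If an existing chunk changes, you would also drop its old row before appending the re-embedded version.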

The basic performance indicator is whether the top-k matches returned from your semantic search contain the information required to answer a user query. That said, there are other considerations. Here again, you should look through previous Forum posts for further deep-dive discussions on the matter.
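One simple, hedged way to quantify that is a top-k hit rate over a small hand-written test set; the retrieve callable, the test questions and the substring check below are all placeholders.

# Sketch: top-k retrieval hit rate - does any retrieved chunk contain the expected evidence?
test_set = [
    {"question": "an example user question", "expected": "a phrase the right chunk must contain"},
    # ... more hand-written question / expected-evidence pairs
]

def retrieval_hit_rate(test_set, retrieve, k=5):
    hits = 0
    for case in test_set:
        chunks = retrieve(case["question"], top_n=k)  # assumed to return the top-k chunk texts
        if any(case["expected"].lower() in chunk.lower() for chunk in chunks):
            hits += 1
    return hits / len(test_set)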

You can use any GPT model to formulate a response to a user question using RAG. The process of information retrieval is separate from the process of actually using the retrieved knowledge to formulate an answer. The main point to be aware of is that the user query must be converted into an embedding with the same embedding model that you used to embed your knowledge base.
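As a tiny illustration of that last point, pinning the model name in one place guarantees the corpus and the queries never drift apart; the model choice itself is just an assumption.

import openai

EMBEDDING_MODEL = "text-embedding-3-large"  # assumed choice; what matters is using the same model twice

def embed_corpus(chunks):
    return [openai.Embedding.create(input=c, model=EMBEDDING_MODEL)["data"][0]["embedding"] for c in chunks]

def embed_query(query):
    return openai.Embedding.create(input=query, model=EMBEDDING_MODEL)["data"][0]["embedding"]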

I hope this helps as a starting point.

As indicated in a couple of places, you should take advantage of other Forum posts to dive deeper into specific questions around RAG.

#### Introduction

Hi @jr.2509. I’ve run my first test and it works quite well.

I’ve converted a Markdown txt file with my knowledge into CSV embeddings (as a test I started with 6,000 tokens) and then incorporated them into a context window.

Thanks to your explanation, I’ve tested the model with various questions and the responses are correct.

Here is my process.

#### Markdown text file into embeddings (CSV file)

  • Reads the text from the specified file.
  • Splits the text into chunks.
  • Generates embeddings for each chunk.
  • Saves the chunks and embeddings to a CSV file.

import openai
import os
import pandas as pd
import tiktoken
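# Note: these scripts use the pre-1.0 openai Python SDK interface (openai.Embedding / openai.ChatCompletion);
# newer versions of the openai package expose a different client object, so pin openai<1.0 to run them as-is.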

# Set up API key
openai.api_key = os.getenv("OPENAI_API_KEY", "sk-proj-12345...")

# Function to read the text file
def read_text_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return text

# Function to chunk text into manageable parts
def chunk_text(text, max_tokens=2000, model="gpt-4o"):
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk = tokens[i:i + max_tokens]
        chunks.append(encoding.decode(chunk))
    return chunks

# Function to generate embeddings for each chunk
def generate_embeddings(text_chunks, model="text-embedding-3-large"):
    embeddings = []
    for chunk in text_chunks:
        response = openai.Embedding.create(
            input=chunk,
            model=model
        )
        embeddings.append(response['data'][0]['embedding'])
    return embeddings

# Function to save embeddings to CSV
def save_embeddings_to_csv(chunks, embeddings, file_name="embeddings.csv"):
    df = pd.DataFrame({
        "text": chunks,
        "embedding": embeddings
    })
    df.to_csv(file_name, index=False)

# Main execution
file_path = "(1) Il terreno di gioco - REGOLA 1.txt"
text = read_text_file(file_path)
chunks = chunk_text(text, max_tokens=2000, model="gpt-4o")  # Adjust max_tokens as needed
embeddings = generate_embeddings(chunks, model="text-embedding-3-large")
save_embeddings_to_csv(chunks, embeddings)

#### Incorporating embeddings into a context window

  • Defines an example question in Italian.
  • Uses the user’s question as the query to find relevant text chunks based on cosine similarity of embeddings.
  • Constructs a context window from the relevant text chunks.
  • Queries GPT-4o with the user’s question and the constructed context to get an answer in Italian.
  • Prints the answer.

import ast  # used to parse the embeddings stored as strings in the CSV
import openai
import os
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Set up API key
openai.api_key = os.getenv("OPENAI_API_KEY", "sk-proj-12345...")

# Load embeddings from the CSV file
df = pd.read_csv('embeddings-regola-1.csv')

# Function to get the embedding for a query
def get_query_embedding(query, model="text-embedding-3-large"):
    response = openai.Embedding.create(
        input=query,
        model=model
    )
    return response['data'][0]['embedding']

# Function to find the most relevant chunks
def find_relevant_chunks(query, df, top_n=5):
    query_embedding = get_query_embedding(query)
    # Embeddings were saved as strings in the CSV; parse them safely with ast.literal_eval
    df['similarity'] = df['embedding'].apply(
        lambda x: cosine_similarity([query_embedding], [ast.literal_eval(x)])[0][0]
    )
    return df.nlargest(top_n, 'similarity')

# Function to create a context window
def create_context_window(chunks):
    context_window = "\n\n".join(chunks['text'].tolist())
    return context_window

# Function to ask GPT-4o a question using the context window and get a response in Italian
def ask_gpt(question, context, model="gpt-4o"):
    prompt = f"Context:\n{context}\n\nDomanda: {question}\nRisposta:"
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "Rispondi alle domande in italiano."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=150,
        temperature=0.7
    )
    return response['choices'][0]['message']['content']

# Example usage
# Example question in Italian
question = "Di che colore devono essere le superfici artificiali?"

# Use the user's question as the query
relevant_chunks = find_relevant_chunks(question, df)
context_window = create_context_window(relevant_chunks)

# Get the answer from GPT-4o in Italian
answer = ask_gpt(question, context_window)
print(answer)

#### Conclusion

  1. In your opinion, what should be improved in my process? Do you see anything wrong?
  2. I’m still missing one thing. With this process I have not “injected gpt-4o with knowledge”. I’ve just told gpt-4o something like “look at this knowledge before responding”, right?
  3. Now, what would be my next step to inject this new knowledge into gpt-4o? I’m concerned about the context window growing and becoming too costly. Can you help me address this issue?

How happy are you with the results you are receiving? Did you run into any issues?

Your understanding is correct. There is currently no direct way to “inject knowledge” into a gpt model.

Look at the chunks that are retrieved through your approach and identify options for further filtering and/or cutting down the number of chunks to include in the context window. You can for example add metadata for each chunk and then further filter results by metadata fields.
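As a rough sketch of that filtering idea, assuming you add a metadata column such as "rule" to the CSV (that column is hypothetical, it is not produced by your current script), you could pre-filter before ranking by similarity:

import ast
from sklearn.metrics.pairwise import cosine_similarity

# Sketch: restrict candidates by an assumed "rule" metadata column, then rank by cosine similarity.
def find_relevant_chunks_filtered(query, df, rule=None, top_n=3):
    query_embedding = get_query_embedding(query)  # from your script above
    candidates = df if rule is None else df[df["rule"] == rule]
    candidates = candidates.copy()
    candidates["similarity"] = candidates["embedding"].apply(
        lambda x: cosine_similarity([query_embedding], [ast.literal_eval(x)])[0][0]
    )
    return candidates.nlargest(top_n, "similarity")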

Finally, depending on what your longer-term goals are, you might want to consider storing the embeddings in a vector database, such as Pinecone.
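If you go down that route, storing the same vectors in Pinecone looks roughly like the sketch below; the index name, the 3072 dimension (for text-embedding-3-large) and the metadata layout are assumptions, so please check the current Pinecone client documentation for the exact interface.

from pinecone import Pinecone

# Rough sketch with the pinecone-client (v3+) interface; the index is assumed to exist
# with dimension=3072 and metric="cosine".
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("soccer-rules")

# Upsert one vector per chunk, keeping the original text as metadata
index.upsert(vectors=[
    {"id": f"chunk-{i}", "values": emb, "metadata": {"text": chunk}}
    for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
])

# Query: embed the question with the same embedding model, then retrieve the top matches
results = index.query(vector=get_query_embedding(question), top_k=5, include_metadata=True)
relevant_texts = [m.metadata["text"] for m in results.matches]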

Again, I encourage you to take a look around the Forum and read up on the other extensive discussions around RAG including optimization techniques. Additionally, the OpenAI cookbook also provides more specific RAG-related workbooks including the use of vector databases. I encourage you to explore those as well.

Best of luck with your project! Let us know how it goes.


Thanks!

Hello, I quickly reviewed the content you forwarded and I’m not sure I understood it correctly. I think what you’re saying is that fine-tuning a large model on a dataset teaches it specific domain knowledge and expression styles, which may involve emotion, word choice, tone, etc.; those are its main functions. RAG, on the other hand, combines a large model with external components, such as a large model plus a knowledge base. In our example, that would mean vectorized data storage + semantic retrieval + GPT polishing?

So my current plan is to retrieve content from the knowledge base for the large model. If the user’s question matches an answer, that answer is retrieved and the user can choose whether to have GPT polish it before output. If no answer is matched, the fine-tuned model (referenced by its model ID) answers instead.
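In rough code (reusing find_relevant_chunks, create_context_window and ask_gpt from the script posted above), what I mean is something like this; the 0.4 threshold and the fine-tuned model id are placeholders I would still need to tune.

# Sketch of the routing I describe: use retrieved knowledge when the best match is good enough,
# otherwise fall back to a fine-tuned model. Threshold and model id are placeholder assumptions.
def answer(question, df, threshold=0.4):
    top = find_relevant_chunks(question, df, top_n=3)
    if top["similarity"].max() >= threshold:
        context = create_context_window(top)
        return ask_gpt(question, context, model="gpt-4o")  # "GPT polishing" over the retrieved text
    # No good match in the knowledge base: let the fine-tuned model answer directly
    return ask_gpt(question, context="", model="ft:gpt-3.5-turbo:my-org::placeholder")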

Actually, it’s as if fine-tuning gives the large model a speaking style, like a mouth. If that’s the case, I need to use system prompts to help the fine-tuned base model understand the purpose of the dataset, which makes it easier for the base model to follow.

The focus of RAG is on setting matching thresholds, while the focus of fine-tuning is on shaping the dialogue and setting the temperature.

Please help me correct this. I am a beginner and I may have misunderstood fine-tuning before. I have always believed that as long as I train it on Q&A pairs, it will give me the corresponding answer when I input a question, with variation introduced by the temperature.

Man, spectacular response. The patience and work behind your comment are admirable. I swear, society needs more people like you. I don’t have such patience most of the time and end up just ignoring. I didn’t learn a thing, but it’s cool to see that some people are willing to spend their time deeply explaining something.
