Make answers align more closely with dataset completions in fine-tuned models?

Probably a better way to frame this question, but here goes:

I fine-tuned a model following the documentation instructions with a basic prompt-completion JSONL file. It’s basically a Q&A dataset: every prompt is a question, every completion is the answer. I processed the same dataset on ada and davinci.

One of my prompts was: “What is an online training system?”.

When I tested in the playground, neither ada nor davinci gave me anything near the completion answer in the dataset (although davinci was the closest), even with the temperature turned all the way down to 0. Here is what is in the dataset:

prompt: What is an online training system?

completion: An online training system is a system of training that takes place in an online environment. Instructors, educators and trainers create online courses using platforms, like a content management system (CMS), and make them accessible via the Internet on PCs, laptops, tablets, smartphones and other devices.

Here is what I got back from the playground (response not shown):

I know it’s probably not possible to get a word for word response (which I really don’t want anyway), but are there any techniques I can use to get responses more closely aligned to the dataset completion I provided when a question matches a dataset prompt?


Yes, if you want exact responses you should consider a rule-based approach in front of your GPT3 model.

Fine-tuning is useful but it is suboptimal (and more expensive) to query a model when you already have a question and an answer.

You might consider using embeddings to search for matches.

However, it is suboptimal to try to fine-tune a GPT model to respond like an expert system.

A more optimal system architecture is to have a rule-based engine “in front” of your GPT-3 model, where GPT-3 is only queried if no rule matches (based on your matching criteria).
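For illustration, that architecture can be sketched in a few lines of Python. The `rules` dict and `call_gpt3` wrapper here are made-up names, not anything from the OpenAI API:

```python
# Minimal sketch of a rule-based front end: answer from known Q&A pairs
# when possible, and only fall back to the model when no rule matches.

rules = {
    "what is an online training system?": (
        "An online training system is a system of training that takes "
        "place in an online environment."
    ),
}

def answer(question, call_gpt3=None):
    """Return a canned answer if the question matches a rule,
    otherwise fall back to the language model (if one is wired in)."""
    key = question.strip().lower()
    if key in rules:
        return rules[key]           # exact match: no API call needed
    if call_gpt3 is not None:
        return call_gpt3(question)  # no rule matched: query the model
    return "Sorry, I don't know."

print(answer("What is an online training system?"))
```

The point is that the canned answers are returned verbatim (and for free), while anything outside the rule set still gets a model-generated reply.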



Thank you for the reply. Makes sense. Right now I’m playing around with creating a Q&A knowledgebase about a particular subject. Doesn’t have to be precise (like an expert system), just in the ballpark.

I have looked at the documentation on embeddings but still don’t really understand how to use them for fine-tuning and/or searching. I see the example of how to retrieve one, but I have no idea what to do with it (“extract, save and use”) once I get it. Can someone point me to a tutorial on this?

In a nutshell,

You would assign all your questions and all the answers an embedding vector.

Then, you use some math (like the dot product of two vectors) to score the combinations (questions / replies) and use this ranking to choose the optimal reply.

It’s like a rules-based system but it uses embedding vectors to assist in the search for answers part.
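A toy sketch of that scoring step in Python, with short made-up vectors standing in for real embeddings (real ones have hundreds or thousands of dimensions):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up embedding for the incoming question.
question_vec = [0.9, 0.1, 0.0]

# Made-up embeddings for the stored answers.
answers = {
    "Answer about training systems": [0.8, 0.2, 0.1],
    "Answer about something else":   [0.0, 0.1, 0.9],
}

# Rank candidate answers by cosine similarity and take the best one.
best = max(answers, key=lambda text: cosine(question_vec, answers[text]))
print(best)
```

The answer whose vector points most nearly the same way as the question vector wins the ranking.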

I’m not sure if using GPT-3 embedding vectors is optimal. I think it depends on the use case, TBH, and requires some testing and evaluation.

Also, there is ongoing research into various methods to remove bias from a set of embedding vectors in order to optimize results. Again, this depends on the dataset and the use case.

Does that make sense?


Yes, makes sense. Just above my current skillset to embrace and execute. My background is in coding, databases and language. Absolute beginner in NLP. I saw a video by David Shapiro where he appears to have used the same approach, but he wrote this (python?) program to create and process the vectors, and eventually lost me there as well.

I should mention that some of the answers I got back from the model were pretty good. They were usually the ones about a subject particular to my knowledge base (like, for example, my personal company experience). It appears that the more general subjects, like “online training system”, are where the models went off on their own, despite the low temperature.

Perhaps what I am trying to do is not possible. I just want to try. Can you point me to something similar to “Embeddings for Dummies” to see if this is something that I a) I can do and b) I even want to attempt?

Everything is possible. This is software 🙂 Some tasks are easy and some take a long time and can cost a small (or very large) fortune.

I have no idea about “Embeddings for Dummies”. I gained my knowledge by reading and participating in embeddings-related topics here, chatting with ChatGPT on the topic of embeddings, reading based on Google search results, and coding experiments.

Plus, I have a background in engineering and have worked with AI and expert-systems before, going back a long time.

To be honest, I learned the most about this topic by coding and working with the OpenAI API, as I always learn better “by doing” and then when I get stuck, I search for some on-line text or video help to get past the “stuck part”.

Two days ago, I coded an app which takes text, grabs the vectors (the embeddings, via the API), and stores that info along with the model name, etc. in a DB. I run queries against the DB and do some linear algebra in code to look at the results for various “prompts” and “answers”, etc.

I coded this in a day, and I’m not the smartest guy in the world (but I’ve been doing this kind of IT for many decades, so it’s pretty natural to have fun with tech like this), so you can maybe do it faster than me!

Coding with APIs is fun. Enjoy! Learn Coding By Coding.

OK. At the very least I should be able to write a program to go out and grab the embeddings. Guess I can figure out what to do with them eventually.

Last question:

curl https://api.openai.com/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{"input": "Your text string goes here",
       "model": "text-embedding-ada-002"}'

Do I use my custom model, or a specific “text-embedding-(openai model)-002” embedding model?

You can try both. Embedding API calls are very cheap.

Regarding curl, that’s OK if you like it, I do all API calls with Ruby.

Thanks! Been using php a lot for the past few years, so comfortable trying to hack it out with php curl.

At least I’ve got the connection working now. Next, saving the vectors to a database with the text.

Looks like for some of my longer completion paragraphs, those vectors are going to contain a LOT of numbers!
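A minimal sketch of that storage step in Python, using SQLite and a made-up three-number vector as stand-ins for the real thing (the API returns one fixed-length vector per input, 1536 numbers for text-embedding-ada-002, regardless of how long the paragraph is):

```python
import json
import sqlite3

# Stand-in for the embedding returned by the API for one text.
fake_embedding = [0.12, -0.05, 0.33]

conn = sqlite3.connect(":memory:")  # use a file path for a real DB
conn.execute(
    "CREATE TABLE embeddings (id INTEGER PRIMARY KEY, text TEXT, vector TEXT)"
)
# Store the vector as a JSON string alongside its source text.
conn.execute(
    "INSERT INTO embeddings (text, vector) VALUES (?, ?)",
    ("What is an online training system?", json.dumps(fake_embedding)),
)

# Reading it back: decode the JSON column into a list of floats.
text, vector_json = conn.execute(
    "SELECT text, vector FROM embeddings"
).fetchone()
vector = json.loads(vector_json)
print(text, len(vector))
```

The same pattern (text column plus a serialized vector column) translates directly to MySQL/PHP.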

Thanks for all your help!

Send me a private message, and I will give you beta access to a system that may help you. Don’t post your email publicly; send it to me as a private message.

Well, I didn’t give up. At least I think I understand the embedding process now.


Good flowchart.

However, you don’t need to only use the cosine similarity function. There are many different ways to process the vectors including:

  • Cosine Similarity (same as the dot product for unit vectors like these embeddings)
  • Dot Product
  • Euclidean Distance
  • Manhattan Distance
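The four options side by side on two toy unit vectors, sketched in Python:

```python
import math

a = [1.0, 0.0]
b = [0.6, 0.8]  # both a and b have length 1 (unit vectors)

dot = sum(x * y for x, y in zip(a, b))
cosine = dot / (math.hypot(*a) * math.hypot(*b))
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
manhattan = sum(abs(x - y) for x, y in zip(a, b))

print(dot, cosine)       # equal (up to rounding) for unit vectors
print(euclidean, manhattan)
```

Dot product and cosine similarity rank higher-is-better; the two distance measures rank lower-is-better, so pick one convention and sort accordingly.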

Also, you are missing an (optional but important) block in your flowchart where you remove (hidden) bias from the vectors. There are many ways to do this; the easiest is to average all the vectors in your set (sum all vectors and divide by the count) and subtract this average from each vector before you process them (your cosine-similarity block).

Many experts believe the “removing the bias” process is perhaps one of the more critical steps in the process, so it’s important to show this step in your flowchart.
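A sketch of that averaging step in Python, with toy two-dimensional vectors:

```python
# Mean-centering: subtract the average vector from every vector in the
# set before computing similarities.

vectors = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]

n = len(vectors)
# Average each dimension across all vectors.
mean = [sum(col) / n for col in zip(*vectors)]
# Subtract the mean from each vector, dimension by dimension.
centered = [[x - m for x, m in zip(v, mean)] for v in vectors]

print(mean)      # [3.0, 4.0]
print(centered)  # [[-2.0, -2.0], [0.0, 0.0], [2.0, 2.0]]
```

After centering, the shared component that every vector carries (the “hidden bias” of the set) no longer dominates the similarity scores.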

Hope this helps.

Yikes! Just when I think I’ve got a hold on this thing.

I was going to ask here “What is hidden bias? How do I find it and why do I want to get rid of it?”. Then I thought, “Hey, why not ask ChatGPT3?”

how do i remove hidden bias in my embeddings?

  1. Pre-processing data: Remove any gender, race, religion, or other biasing words or phrases that might impact your embeddings.
  2. Using diverse training data: Train your embeddings on diverse and inclusive data to ensure that it is representative of different groups of people.
  3. Post-processing embeddings: Use bias mitigation techniques like equalizing odds, demographic parity, or adversarial training to reduce the impact of bias in the embeddings.
  4. Monitoring and evaluating the embeddings: Regularly evaluate the embeddings for bias and monitor performance on diverse data sets to ensure that bias is being addressed and reduced over time.
  5. Engage in continuous improvement: Continuously evaluate and improve your bias mitigation techniques to ensure that they remain effective over time.

The data I will be embedding, at least initially, will be customer service and customer support related. After that, I’m looking to include physical (and dance) training data. I can’t see how there would be that type of bias in my data as it would be inclusive of everybody. But, perhaps I’m missing something here?

There is bias in all data. This is not “bias” as in “discrimination against others”; it is a mathematical concept.

Please (for your own sake) do not depend on ChatGPT for answers to these kinds of technical questions. ChatGPT (like all LLM GPTs) is a language-prediction model (similar to how text completion works on your smartphone), not an expert system. You must search and read “real human technical papers” to get accurate answers. ChatGPT will hallucinate answers, and it always sounds very confident even when it is simply wrong.

ChatGPT is a nice “assistant”, and its technical replies must be confirmed outside ChatGPT.

OK. One final question – at least for now:

Let’s say I am able to create a database / spreadsheet that contains my dataset consisting of text and their vectors.

What are my options for a search platform to put this together in some sort of usable way? Based on my flow chart, I need a search mechanism where:

  • a search term can be entered
  • the search vector is retrieved for the term
  • mathematical functions are performed against the database vectors using the search vector
  • results are returned

All the tutorials I’ve seen so far use python to create these. Do you know if there are any php (my preference) or other simple plug and play solutions?

Well, I’m off to the gym soon, so let me quickly reply that what you are asking is easy to code.

You basically have a DB full of vectors and a vector for a search term. Assuming you will not (yet) remove bias in the vectors (which is fine when you are just building your algorithm and testing), you simply take the dot product of your search-term vector with each vector in your DB and sort (rank) the output. Then you select the best matches and reply with the text from the DB which corresponds to the highest dot product (or one of the top results, based on your own criteria) with the search-term vector.

A developer with basic PHP skills should be able to code this in less than an hour, given the data is already in your DB.
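The whole ranking step can be sketched in a few lines of Python (the same logic ports directly to PHP); the stored texts and vectors here are made up:

```python
# Made-up rows standing in for your (text, vector) table.
db = [
    ("An online training system is ...", [0.9, 0.1, 0.0]),
    ("Our refund policy is ...",         [0.1, 0.9, 0.2]),
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query_vector, top_n=1):
    """Score every stored vector against the query vector,
    rank by dot product, and return the best-matching texts."""
    ranked = sorted(db, key=lambda row: dot(query_vector, row[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_n]]

print(search([0.8, 0.2, 0.1]))  # ['An online training system is ...']
```

In production you would fetch the query vector from the embeddings API first, then run exactly this loop over your DB rows.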

Sorry, gotta run …


BTW, thank you for the caution. I’ve seen the “hallucinations” for myself. I’ve been working with ChatGPT a lot lately to solve a variety of coding issues. I’ve seen it confidently write code that is either lacking in its task or flat-out wrong. Being pretty good at coding myself, I’ve been able to spot these issues immediately and correct the AI.

Being that I know so little about this particular issue of embeddings, I may have been a bit overreliant on its answers. I have gotten this far from reading a number of tutorials and watching several videos.

I’ll be a little bit more careful moving forward.

OK. Thanks much!

WITH search_vector AS (
  SELECT <search_term_vector_column> AS search_vector
  FROM <search_term_table>
),
vectors AS (
  SELECT <vector_column> AS vector
  FROM <vectors_table>
)
SELECT dot_product(search_vector.search_vector, vectors.vector)
FROM search_vector, vectors;

Compliments of ChatGPT3. I couldn’t resist!

Yeah, and ChatGPT writes code better (more accurately) than it explains concepts.

Don’t get me wrong, ChatGPT is very helpful but ChatGPT is very often confidently-wrong due to not having a perfect dataset.

On the other hand, when ChatGPT makes up parameters when coding and other strange “non-existent” technical details, it is often instructive to review these hallucinations with an open, creative mind.

Think of it like when DALL-E hallucinates an image based on a prompt: the image is creative. When ChatGPT hallucinates a technical solution or explanation, it is often also very creative (even if not feasible to implement with the actual parameters, libs, etc.).

Yes, I often go to ChatGPT for first drafts of methods I wish to code.

Also, if you use Visual Studio Code (VS Code) with the Copilot GPT-based completion extension, it is very useful and sometimes “code completes” methods very fast, with no syntax or spelling errors.

Developing code on VSC with Copilot saves me so much time. It’s awesome!