When using embedding models, why ask normal models instead of embedding ones?

Hi, I’m trying to use an embedding model in an isolated fashion, because I want to provide sensitive data that I don’t want stored anywhere. My idea is:

  • Generate an embedding based on a prompt containing business rules
  • Store the embedding in my database
  • Ask the embedding model a question by providing the previously generated embedding plus my question (about the data contained in the embedding)

I was looking at platform.openai.com/docs/guides/embeddings, went through the use cases, and built my code based on the one about answering questions:


There I can see how to generate the embedding, but at the end they just ask a normal GPT model (like 3.5 or 4) the question, and I can’t see at which point the embedding is used or injected into the GPT model.

def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    # Build a prompt that splices the most relevant texts from df into the question.
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
        {"role": "user", "content": message},
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    response_message = response.choices[0].message.content
    return response_message

I thought it would be about asking the question of the embedding model, not a normal GPT model. Is there a way to achieve the result I want (querying an embedding model that doesn’t retain data, based on a previously generated embedding of sensitive data, with questions about that data)?

It seems you are using client.chat.completions.create

To use the embedding model, you need to call the embeddings endpoint instead, as follows:
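A minimal sketch of that call, assuming the v1+ `openai` Python SDK with an `OPENAI_API_KEY` set in the environment (`text-embedding-3-small` is just one current embedding model name; substitute whichever you use):

```python
# Sketch of calling the embeddings endpoint with the openai SDK (v1+).
# Model name and environment-variable setup are assumptions, not from
# the original post.
def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Send text to the embeddings endpoint and return its vector."""
    from openai import OpenAI  # imported here so the sketch stays self-contained

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding  # a plain list of floats
```

Note that the response is only a vector of numbers, not an answer to any question.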



To get embeddings vectors, you have to send the full text to OpenAI.

They don’t train models on text, but they do retain it.

And that is the result of an embeddings model: a vector – a list of numbers.

This vector and the numbers inside it carry some meaning, but it is only useful for comparing against other vectors returned by the same embeddings model; that comparison tells you how similar the language inputs were in meaning and topic, as judged by the AI’s natural language understanding backend.
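The comparison itself is simple arithmetic. A toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
# Toy cosine-similarity comparison; the vectors are invented for
# illustration, not real embedding output.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v_cat = [0.9, 0.1, 0.2]      # pretend embedding of "cat"
v_kitten = [0.85, 0.15, 0.25]  # pretend embedding of "kitten"
v_invoice = [0.1, 0.9, 0.3]    # pretend embedding of "invoice"

print(cosine_similarity(v_cat, v_kitten))   # close to 1.0: similar meaning
print(cosine_similarity(v_cat, v_invoice))  # noticeably lower: unrelated
```

Two inputs with similar meaning land close together in vector space, so their similarity score is high.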

Example Use:

I have a database of 100 tech support questions and answers.

I send the text of each Q&A entity to an embeddings model, obtain a vector result, and store those results in a vector database. That will be my knowledge base.

Then, I have another new question. I can run embeddings on that new input (which also requires sending it to OpenAI). The resulting vector can be compared against all the others in the database (using a dot product).

The knowledge base search can thus return the highest-quality text matches back to me – or feed them into an AI model that is supposed to answer the question.
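The search step above can be sketched in a few lines, with made-up 3-dimensional vectors and texts standing in for a real vector database of pre-embedded Q&A entries:

```python
# Sketch of ranking a knowledge base by dot-product similarity.
# All vectors and texts here are invented for illustration.

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Pretend vector database: (text, precomputed embedding) pairs.
knowledge_base = [
    ("How do I reset my password?", [0.9, 0.1, 0.0]),
    ("Why is my invoice wrong?",    [0.1, 0.9, 0.1]),
    ("The app crashes on startup.", [0.0, 0.2, 0.9]),
]

# Pretend embedding of the new incoming question.
query_vector = [0.85, 0.2, 0.05]

# Rank every stored entry by similarity to the query; best match first.
ranked = sorted(knowledge_base, key=lambda entry: dot(query_vector, entry[1]), reverse=True)
best_text, _ = ranked[0]
print(best_text)  # the closest stored Q&A entry
```

The top-ranked texts are what you would then paste into the chat model’s prompt – which is exactly what `query_message` does in the cookbook’s `ask` function.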

I’m pretty sure there’s no way to do “secret” work unless you spend a few thousand on a top-of-the-line NVIDIA GPU (or similar) and do it all locally. All the cloud services are going to monitor your queries, and so will all the TLA government agencies.

When it comes to cloud services, whether data is retained or not makes a big difference.

Have you ever calculated how much it would cost to invest in top-of-the-line NVIDIA GPUs (at least the minimum needed) and do all the work locally?

I’m not experienced enough to know the answer to that, but I do know things are painfully slow if you don’t have a good GPU. It obviously depends on the number of parameters in the model, the model architecture, etc., but the main place I’d go to learn about that is Hugging Face.