Need Guidance about Custom Data Training. [Thanks]

My goal:

  • Upload text data such as articles, product descriptions, and company rules to OpenAI (API)
  • Have a custom chatbot reply based on the uploaded documents (API)

I have studied this forum and the OpenAI documentation.
I am not sure I have the right idea, which is:

  1. Upload the text documents as embeddings
    I have no idea how the prompt and completion refer to my embeddings (which are lists of numbers).
    In the API documentation, there is no Completion parameter that accepts document embeddings.

Thanks in advance.

This is not correct.

Embeddings are not used in this manner to my knowledge.

Training data used to fine-tune a model should be actual text, not embedding vectors.

I have not seen any example from OpenAI where fine-tuning is accomplished by using embedding vectors as the values in the prompt/completion key-value pairs, @zhihong0321.
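To illustrate the point about text-only training data, here is a minimal sketch that builds prompt/completion pairs as plain text and serializes them one JSON object per line. The example rows and the helper name `to_jsonl` are made up for illustration, and the exact file layout your fine-tuning endpoint expects is an assumption you should check against the current documentation.

```python
import json

# Hypothetical training rows: both values are ordinary text,
# not embedding vectors.
examples = [
    {
        "prompt": "What is the return window?",
        "completion": " Returns are accepted within 30 days.",
    },
    {
        "prompt": "Do you ship internationally?",
        "completion": " Yes, to most countries.",
    },
]


def to_jsonl(rows):
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


jsonl_text = to_jsonl(examples)
print(jsonl_text)
```

Each line of the result is a standalone JSON object whose `prompt` and `completion` values are strings.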

If you have an OpenAI reference which demonstrates this approach, please share the reference.

Thanks so much.


Thanks for the reminder.

I tried both the embeddings and fine-tuning APIs,
but I am not quite sure of the technical difference.

Fine-tuning is obviously much easier, needing just a few API calls,
whereas embeddings are much more complex because they involve indexing.

I am currently continuing to test my data via fine-tuning to see whether I can achieve my desired result.

In a nutshell,

  • (Text) Embeddings are numerical representations of text, represented as a (unit) vector (using the OpenAI API). This vector can be tested against another vector using linear algebra, commonly the “dot product” (among other methods), and ranked numerically. Embedding vectors are used for search, classification, etc.

  • OpenAI Fine-Tuning is the process of training an OpenAI model to change its generative output text based on the input text.
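To make the embeddings bullet concrete, here is a small sketch of comparing vectors with the dot product. The two-dimensional vectors are made up and stand in for real embeddings, which have many more dimensions. Once vectors are scaled to unit length, the dot product behaves like cosine similarity: close to 1 for the same direction, close to -1 for the opposite direction.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Made-up vectors standing in for real embeddings.
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])    # points in a similar direction to a
c = normalize([-3.0, -4.0])  # points in the opposite direction to a

same = dot(a, a)
similar = dot(a, b)
opposite = dot(a, c)
print(same, similar, opposite)
```

Ranking is then just sorting by these scores, highest first.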

If you want to build a custom chatbot to reply to a prompt with your company data, you might need fine-tuning.

If you want to just search a DB and return the required text based on a semantic match, you can use embeddings.

Many well-developed applications will use both: they search their DB for a good match using either a full-text DB search (for short phrases and keywords) or a vector-based semantic search (for larger strings and text), and if no high-scoring match is found, they query a GPT model for a reply.

Then they take the GPT reply and, if it meets certain criteria, add it to the DB (generating a vector for it) so that in the future the reply can be matched via the same search process as above.
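The search-then-fallback flow described above can be sketched roughly as follows. Everything here is hypothetical: `answer`, `fake_embed`, `fake_model`, the 0.8 threshold, and the hand-written vectors are stand-ins, and a real application would call an embedding endpoint and a GPT model where the fakes are used. The acceptance check on the model reply is only a placeholder.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def answer(query, db, embed, ask_model, threshold=0.8):
    """Return (reply, source). db is a list of {"text", "vector"} rows."""
    q = embed(query)
    best = max(db, key=lambda row: dot(q, row["vector"]), default=None)
    if best is not None and dot(q, best["vector"]) >= threshold:
        return best["text"], "db"
    # No high-scoring match: fall back to the model.
    reply = ask_model(query)
    if reply:  # placeholder for "meets certain criteria"
        db.append({"text": reply, "vector": embed(reply)})
    return reply, "model"


# Hand-written unit vectors standing in for real embeddings.
FAKE_VECTORS = {
    "What are your opening hours?": [0.8, 0.6],
    "We are open 9 to 5 on weekdays.": [0.6, 0.8],
}
model_calls = []


def fake_embed(text):
    return FAKE_VECTORS[text]


def fake_model(query):
    model_calls.append(query)
    return "We are open 9 to 5 on weekdays."


db = []
first = answer("What are your opening hours?", db, fake_embed, fake_model)
second = answer("What are your opening hours?", db, fake_embed, fake_model)
print(first, second, len(model_calls))
```

The first call misses the empty DB and goes to the model. The second call finds the stored reply, so the model is not queried again.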

The key is to have a DB of text and for each row of text in the DB to have an embedding vector. In that way, the DB can be searched by vectorizing the search text and then taking the “dot product” (for example) between the vectors in the DB and the search term vector, then ranking the results and picking the highest ranked match (if that is what you want).
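As a rough sketch of that idea, the snippet below keeps a vector next to each row of text, vectorizes the search text, and ranks rows by dot product. `toy_embed` is a made-up stand-in (word counts over a tiny fixed vocabulary) so the example runs without any API. In practice you would replace it with a call to a real embedding model, and the DB would be a real database, not a Python list.

```python
import math
import re

# Tiny made-up vocabulary for the stand-in embedding.
VOCAB = ["refund", "return", "shipping", "delivery", "hours", "open"]


def toy_embed(text):
    """Stand-in for a real embedding call: unit-length word counts over VOCAB."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = [float(words.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else counts


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(db, query, top_k=1):
    """Rank DB rows by dot product with the query vector, highest first."""
    q = toy_embed(query)
    scored = [(dot(q, row["vector"]), row["text"]) for row in db]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


texts = [
    "Refund requests are accepted within 30 days of the return.",
    "Shipping and delivery take three to five days.",
    "We are open during normal business hours.",
]
# Each row stores the text together with its precomputed vector.
db = [{"text": t, "vector": toy_embed(t)} for t in texts]

results = search(db, "How long does shipping take?")
print(results)
```

With larger data you would typically hand the ranking step to a vector index; the brute-force loop here is only to show the mechanics.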

Hope this helps.

