Hi all, I’m Graham and I’m new on here and new to GPT-3.
I have a pretty ambitious project in mind, which is to create a teaching assistant for trainee pilots, using embeddings created from my pilot training books.
The trouble is I just can’t get past preprocessing 101!
I’ve read the GitHub notebook over and over again and studied the 2020 Olympics example, but I still can’t grasp the basic concepts of how to do the preprocessing.
I’ve manually created a .csv from a chapter in one of my books and organised it much like the GitHub examples, and now I want to calculate the embeddings. So what do I do next?
As far as I understand the notebook, the preprocessing steps are:
Upload the file.
Run commands to create embeddings.
Store the new file with the embeddings online (1 GB limit).
The examples seem to use a .csv file, but elsewhere I’ve read that only JSONL files can be uploaded.
So, do I have to convert this to a JSONL document before uploading the file or can I upload the .csv itself? Or have I completely misunderstood the whole process?
Hi Raymond, thank you for your most helpful response. So, I’m stuck at your bullet point 2 step: "use a program to pass each block to the embedding API (one at a time)".
Looking at the notebook, it seems to suggest that, for anything up to about 10,000 entries, it’s fine to use OpenAI’s embedding functionality. But the notebook only shows the code for doing this to a single text string.
Can I programmatically send an entire file of entries to the embeddings endpoint, or do I have to use an entirely separate program like Word2Vec?
This code, included in the notebook, suggests that OpenAI can process a .csv file:
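From memory it’s something along these lines (I’ve swapped in my own file and column names, so treat these as placeholders):

```python
import openai
import pandas as pd
from openai.embeddings_utils import get_embedding

openai.api_key = "YOUR_API_KEY"

# My csv has one row per block of text, in a column called "text"
df = pd.read_csv("pilot_training_chapter1.csv")

# Call the embeddings endpoint once per row and keep the vector alongside the text
df["embedding"] = df["text"].apply(
    lambda x: get_embedding(x, engine="text-embedding-ada-002")
)

df.to_csv("pilot_training_chapter1_with_embeddings.csv", index=False)
```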
If you use Word2Vec, then you would also use it when you do the semantic part of the search. You would not use GPT embeddings at all.
If you use GPT, then you need to call it for each block of text (as a one-off exercise). It is possible to batch requests - but you need to stay within the roughly 8,000-token limit.
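As a rough sketch of what a batched call can look like (using the openai Python library, with some made-up text blocks):

```python
import openai

openai.api_key = "YOUR_API_KEY"

# A few blocks of text taken from your csv (made-up examples here)
texts = [
    "Chapter 3: The four forces acting on an aircraft in flight are...",
    "Chapter 3: Minimum safe altitudes over congested areas are defined as...",
]

# One request can carry a list of inputs, as long as each block stays
# within the model's token limit (roughly 8,000 tokens for ada-002)
response = openai.Embedding.create(input=texts, engine="text-embedding-ada-002")

# The embeddings come back in the same order as the input list
vectors = [item["embedding"] for item in response["data"]]
```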
The vectors from ada-002 are 1,536 dimensions long and live in their own space. Word2Vec uses a different method.
With that said, if you are happy with Word2Vec, you could do the entire embedding task using that model, and then only use GPT to ask a question and give an answer once you have found the text using Word2Vec.
You can take the text you find with Word2Vec as the inline context in a final question to GPT. You would pass the text you find as part of the prompt.
e.g.
With reference to the context shown below, answer the question
Context:
Q:
A:
The prompt could be better worded - but you get the idea.
Thanks Raymond. You’re being extremely helpful and patient! So, just so I’m clear on this: if I want to preprocess embeddings using GPT, the procedure is:
1. Offline, prepare a .csv file.
2. Offline, read the .csv file and select a line.
3. Connect to the GPT embeddings endpoint.
4. Send the text string to GPT.
5. Obtain the embedding vector for the entry.
6. Offline, append the embedding to the .csv file.
7. Repeat until all rows are done.
8. Upload the completed .csv to my own server, ready to be used when constructing a query.
Step 8: Put the csv somewhere your app can see it.
Step 9: The user inputs a query.
Step 10: You connect to the GPT embeddings endpoint.
Step 11: You send the user’s query string or question to GPT.
Step 12: Obtain the embedding vector for the query string.
Step 13: Now you go through every record in your csv file and take the cosine_similarity or dot product of the vector you just got in step 12 with the vector for each piece of text you got in step 5 (one at a time). Dot product and cosine_similarity give the same result in the case of GPT embeddings (the vectors are normalised to unit length), and the dot product is fractionally faster and easier to code if you need to make your own version. There’s a rough sketch of this whole search-and-answer loop at the end of this post.
Step 14: Keep track of the highest value(s) returned from that calculation. The highest value will come from the vector (and related text) that is the closest match.
Step 15: If you want GPT to finally answer the user’s question, you build a prompt like:
Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say “I don’t know.”
Context:
{text from the record you found in the semantic search (step 14) goes here}
Q: {your original query goes here}
A:
Step 16: Then you pass the prompt to GPT’s completion endpoint (or use a library to do this). The completion should be your answer, and the answer should use the text you found as its source of information.
The results will depend on the temperature (keep this low-ish) and the top_p value (leave it at the default for now).
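If it helps, here is a very rough sketch of steps 9 to 16 in Python. The file name, column names, example question and model choices are just placeholders, so adapt them to your own setup:

```python
import ast

import numpy as np
import openai
import pandas as pd

openai.api_key = "YOUR_API_KEY"

# Step 8: load the csv you prepared earlier (the embedding column is stored as a string)
df = pd.read_csv("pilot_training_with_embeddings.csv")
df["embedding"] = df["embedding"].apply(ast.literal_eval)

# Steps 9-12: embed the user's question (made-up example question)
query = "What is the minimum safe altitude over a congested area?"
query_vec = openai.Embedding.create(
    input=query, engine="text-embedding-ada-002"
)["data"][0]["embedding"]

# Steps 13-14: dot product against every stored vector, keep the closest match
df["score"] = df["embedding"].apply(lambda v: np.dot(v, query_vec))
best = df.sort_values("score", ascending=False).iloc[0]

# Step 15: build the prompt with the retrieved text as the context
prompt = (
    "Answer the question as truthfully as possible using the provided context, "
    'and if the answer is not contained within the text below, say "I don\'t know."\n\n'
    f"Context:\n{best['text']}\n\n"
    f"Q: {query}\n"
    "A:"
)

# Step 16: pass the prompt to the completion endpoint
completion = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=300,
    temperature=0.2,  # keep this low-ish, as mentioned above
)
print(completion["choices"][0]["text"].strip())
```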
Thanks, I’ll go and read up
I do have a related question: does it still make sense (or is it even needed) to fine-tune a model when you use embeddings?
Or is fine-tuning more related to creating certain kinds of text, like product descriptions or summaries - essentially creating a template/model for the expected replies? I’m still a bit confused about embeddings vs fine-tunes.
Embeddings are good for large blocks of text (even though you have to break them down into chunks).
They’re especially good for semantic search and clustering (finding patterns in text).
In some cases, they are also a better way to do classification.
Fine-tuning is good for producing outputs based on pattern matching, extracting entities, and most completion tasks where you can provide a known question and answer. Based on my understanding, it’s not so good for large blocks of writing such as fiction, and I’ve also had mixed results training chatbots.