Preprocessing - I just don’t get it!

Hi all, I’m Graham and I’m new on here and new to GPT-3.

I have a pretty ambitious project in mind, which is to create a teaching assistant for trainee pilots, using embeddings created from my pilot training books.

The trouble is I just can’t get past preprocessing 101!

I’ve read the Github notebook over and over again and studied the 2020 Olympics example but I still can’t grasp the basic concepts of how to do preprocessing.

I’ve manually created a .csv from a chapter in one of my books and organised it much like the Github examples and now I want to calculate the embeddings. So what do I do next?

As far as I understand the notebook, the preprocessing steps are:

  1. Upload the file.
  2. Run commands to create embeddings.
  3. Store the new file with the embeddings online (1 GB limit)

The examples seem to use a .csv file, but elsewhere I’ve read that only JSONL files can be uploaded.

So, do I have to convert this to a JSONL document before uploading the file or can I upload the .csv itself? Or have I completely misunderstood the whole process?

Hi Graham,

Just to help you understand the process:

  • You break the document down into manageable blocks of 1 to 2 thousand tokens
  • You use a program to pass each block to the embedding API (one at a time)
  • It returns a long list of about 1,500 numbers for each one (this is the embedding vector; ada-002 returns 1,536 numbers)
  • You store these locally with your data (in the csv file maybe)
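The steps above can be sketched roughly as follows. This is only an illustration: `split_into_blocks` and `embed_blocks` are made-up names, `embed_fn` stands in for whatever function calls the embedding API for you, and counting words is just a rough substitute for counting tokens (assumption: about 750 words is near 1,000 tokens).

```python
from typing import Callable, List, Tuple


def split_into_blocks(text: str, max_words: int = 750) -> List[str]:
    """Split text into blocks of at most max_words words.

    Word count is only a rough proxy for token count (an assumption);
    a real tokenizer would give exact numbers.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def embed_blocks(
    blocks: List[str],
    embed_fn: Callable[[str], List[float]],
) -> List[Tuple[str, List[float]]]:
    """Call embed_fn once per block and pair each block with its vector.

    embed_fn is a placeholder for your own wrapper around the
    embedding endpoint; it is not part of any library.
    """
    return [(block, embed_fn(block)) for block in blocks]
```

Passing the embedding call in as a function means you can try the rest of your code with a fake one before spending any API credits.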

Then when you want to do a search you do the following:

  • take the text you are searching for
  • call the embedding endpoint with the text
  • It returns a long list of about 1,500 numbers (this is the embedding vector for the search text)
  • You use this with a function called cosine similarity (or dot product) against all your locally saved values
  • You take the ones that are the closest match (normally 1 - but could be top 2 or 3)
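As a minimal sketch of the search steps above, assuming the saved values are already loaded as (text, vector) pairs. The function names are my own, and this is plain Python rather than any particular library.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction).

    Raises ZeroDivisionError if either vector is all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(
    query_vec: Sequence[float],
    stored: List[Tuple[str, Sequence[float]]],
    k: int = 1,
) -> List[Tuple[float, str]]:
    """Score every stored (text, vector) pair against the query vector
    and return the k best as (score, text), best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

With a few thousand rows a simple loop like this is usually fast enough; a vector database only becomes interesting at much larger sizes.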


You make a prompt that says something like this (but not exactly this):

With reference to the following text

##whatever you found goes here##

Question: ##whatever the user typed in here##


That’s the rough version. I can clarify further if you need it.

You do need to convert those strings of numbers into vectors (in your program) to make the calculation work.
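For example, if each embedding was saved in the csv as a JSON-style string such as "[0.1, -0.2, ...]" (that storage format is an assumption, yours may differ), the conversion could look like this. The helper names are hypothetical.

```python
import json
from typing import List


def vector_to_text(vector: List[float]) -> str:
    """Turn a list of numbers into a string that fits in one csv cell."""
    return json.dumps(vector)


def text_to_vector(text: str) -> List[float]:
    """Turn the stored string back into a list of floats for the maths."""
    return [float(x) for x in json.loads(text)]
```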


Hi Raymond, thank you for your most helpful response. So, I’m stuck at your second bullet point: “use a program to pass each block to the embedding API (one at a time)”.

Looking at the notebook it seems to suggest that, for anything up to about 10,000 entries, it’s fine to use OpenAI’s embedding functionality. But the notebook only shows the code for doing this to a single text string.

Can I programmatically send an entire file of entries up to the embeddings endpoint or do I have to use an entirely separate programme like Word2Vec?

This code, included in the Notebook, suggests that OpenAI can process a .csv file:

import pandas as pd
df = pd.read_csv('')
df = df.set_index(["title", "heading"])
print(f"{len(df)} rows in the data.")

But there doesn’t seem to be a suitable OpenAI Files function to allow me to upload a csv.

Out of interest, which word’s corresponding vector is closest to the origin? {0, 0, 0, …, 0, 0}

If you use Word2Vec, then you would also use it when you do the semantic part of the search. You would not use GPT Embeddings at all.

If you use GPT, then you need to call it for each block of text (as a one-off exercise). It is possible to batch requests, but you need to stay within the roughly 8,000-token limit.

The vectors in ADA-002 are 1,536 numbers long and live in their own space, so they can’t be compared with Word2Vec vectors. Word2Vec uses a different method.
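If you do want to batch, one way to group the blocks is sketched below. It assumes the limit applies to the combined size of a batch, and it estimates tokens as roughly 4 characters each, which is only a rule of thumb; a real tokenizer would give exact counts. `estimate_tokens` and `make_batches` are names I made up.

```python
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def make_batches(
    blocks: List[str],
    token_limit: int = 8000,
    count_tokens: Callable[[str], int] = estimate_tokens,
) -> List[List[str]]:
    """Greedily group blocks so each batch stays within token_limit."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for block in blocks:
        cost = count_tokens(block)
        if cost > token_limit:
            raise ValueError("block is larger than the token limit; split it first")
        # Start a new batch when this block would push the current one over.
        if current and used + cost > token_limit:
            batches.append(current)
            current, used = [], 0
        current.append(block)
        used += cost
    if current:
        batches.append(current)
    return batches
```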

With that said, if you are happy with Word2vec, you could do the entire embedding task using that model, and then only use GPT to ask a question and give an answer once you have found the text using word2vec.

You can take the text you find with word2vec as the inline context in a final question to GPT. You would pass the text you find as part of the prompt.


With reference to the context shown below answer the question



The prompt could be better worded - but you get the idea

I hope that helps

Thanks Raymond. You’re being extremely helpful and patient! So, just so I’m clear on this. If I want to preprocess embeddings using GPT the procedure is:

  1. Offline, prepare a .csv file
  2. Offline, read the .csv file, select a line
  3. Connect to the GPT embeddings endpoint
  4. Send the text string to GPT
  5. Obtain the vector embeddings for the entry
  6. Offline append the embedding to the .csv file
  7. Repeat until all rows are done
  8. Upload the completed .csv to my own server ready to be called by GPT when constructing a query.

Is that right?

Not sure about Step 8.

Steps 1 to 7 are correct. Step 8 is half correct.
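Steps 1 to 7 could be sketched like this. It is a rough outline, not the notebook’s code: `add_embeddings` is a made-up name, `embed_fn` stands in for your own call to the embeddings endpoint, and the text column name "content" is an assumption. It writes a new csv rather than editing the original in place, so the source file is never left half-processed.

```python
import csv
import json
from typing import Callable, List


def add_embeddings(
    in_path: str,
    out_path: str,
    embed_fn: Callable[[str], List[float]],
    text_column: str = "content",
) -> int:
    """Read in_path row by row, embed the text column, and write every row
    plus an "embedding" column (stored as a JSON string) to out_path.

    Returns the number of rows processed.
    """
    count = 0
    with open(in_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or []) + ["embedding"]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            # One call per row, as in steps 2 to 6.
            row["embedding"] = json.dumps(embed_fn(row[text_column]))
            writer.writerow(row)
            count += 1
    return count
```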

Step 8. Put the csv somewhere your app can see it
Step 9. User inputs a query
Step 10. You connect to the GPT embeddings endpoint
Step 11. You send the user’s query string or question to GPT
Step 12. Obtain the embedding vector for the query string

Step 13. Now you go through every record in your csv file and calculate the cosine_similarity or dot product of the vector you just got in step 12 with the vector for each piece of text you got in step 5 (one at a time). Dot product and cosine_similarity give the same result in the case of GPT, because its embedding vectors are normalised to length 1, and dot product is fractionally faster and easier to code if you need to make your own version.

Step 14: Keep track of the highest value(s) returned from that calculation. The highest value will come from the vector (and related text) that is the closest match.

Step 15: If you want GPT to finally answer the user’s question, you build a prompt like:

Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say “I don’t know.”

{text from the record you found in Semantic search - step 14 - goes here}

Q: {your original query goes here}

Step 16: Then you pass the prompt to GPT’s completion endpoint (or use a library to do this). The completion should be your answer. The answer should use the text you found as the source of information
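Steps 9 to 16 can be sketched end to end as below. `embed_fn` and `complete_fn` are placeholders for your own wrappers around the embeddings and completion endpoints (they are not real library functions), and `records` is assumed to be a list of (text, vector) pairs already loaded from the csv.

```python
from typing import Callable, List, Sequence, Tuple

PROMPT_TEMPLATE = (
    "Answer the question as truthfully as possible using the provided context, "
    "and if the answer is not contained within the text below, say \"I don't know.\"\n\n"
    "{context}\n\n"
    "Q: {question}"
)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def best_record(
    query_vec: Sequence[float],
    records: List[Tuple[str, Sequence[float]]],
) -> str:
    """Steps 13 and 14: return the text whose vector scores highest."""
    return max(records, key=lambda rec: dot(query_vec, rec[1]))[0]


def answer(
    question: str,
    records: List[Tuple[str, Sequence[float]]],
    embed_fn: Callable[[str], Sequence[float]],
    complete_fn: Callable[[str], str],
) -> str:
    query_vec = embed_fn(question)              # steps 10 to 12
    context = best_record(query_vec, records)   # steps 13 and 14
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)  # step 15
    return complete_fn(prompt)                  # step 16
```

To include the top 2 or 3 matches instead of one, sort the records by score and join their texts into the context.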

The results will depend on the temperature (keep this low-ish) and the top_p values (leave default for now)

Let me know if you need any more help with this.

edited: to clarify a few things


Thanks Raymond, I finally understand it!

If all goes well, I’m releasing a set of 13 videos in 1 or 2 days on this exact method.

They have been recorded and are being compiled now. I’ll post a link when they are ready.


I’ll be looking for it!

Me too! Sounds wonderful!

That was very nicely laid out, @raymonddavey

Love to get the videolink as well as it will most likely be very helpful, too.

@manyapps Hi Paul, Here are the videos

Thanks, I’ll go and read up :)
I do have a related question: does it make sense/is it really needed to still fine-tune a model when you use embeddings?

Or, is fine-tuning more related to e.g. creating certain texts like product descriptions or summaries etc. Like, creating a template/model for the expected replies. A bit confused still about embedding vs fine-tunes.

Thanks again!

Embedding is good for large blocks of text (even though you have to break them down).

It’s especially good for semantic search and clustering (finding patterns in text).

In some cases, it is a better way to do classification.

Fine-tuning is good for producing outputs based on pattern matching, extracting entities, and most completion tasks where you can provide a known question and answer. Based on my understanding, it’s not so good for large blocks of fiction writing etc. I’ve also had mixed results on training chat bots.