I’ve worked out a method to scrape PDFs paragraph by paragraph and convert them to OpenAI embeddings. The chunks of text are great, and I’m blending them together in case any context is lost (each blended chunk also includes the 3 preceding chunks). However, some of the scraped text contains typos… meaning I might scrape a paragraph that comes out as “The d og is bl ue”.
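For anyone curious what I mean by blending, here’s a rough sketch of the idea (the function name and the window size of 3 are just how I’ve set it up, not anything standard):

```python
def blend_chunks(paragraphs, window=3):
    """Prepend up to `window` preceding paragraphs to each chunk
    so that context isn't lost at chunk boundaries."""
    blended = []
    for i, para in enumerate(paragraphs):
        # Take up to `window` paragraphs immediately before this one
        context = paragraphs[max(0, i - window):i]
        blended.append(" ".join(context + [para]))
    return blended
```

Each blended string then gets sent to the embeddings endpoint instead of the raw paragraph.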
I was going to experiment to see what issues this causes, but has anyone had experience implementing a spellchecker in Python for individual strings? Before even doing that, though, does anyone know whether these typos even matter? I watched a few videos, and one of them mentioned that since there are a lot of typos on the internet, and these models were trained on internet text, OpenAI found a way to handle them… wondering whether that also applies to contextual Q&A and semantic search.
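In case it helps the discussion: since my typos are mostly words split by stray spaces (a common PDF extraction artifact) rather than misspellings, a dictionary-based merge might be enough, without a full spellchecker. A minimal sketch, assuming you have a vocabulary set to check against (the tiny `vocab` here is just for illustration):

```python
def repair_split_words(text, vocab):
    """Greedily merge adjacent tokens when the merge forms a known word,
    e.g. 'd og' -> 'dog'. `vocab` is a set of lowercase known words."""
    tokens = text.split()
    repaired = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # Keep absorbing the next fragment while the current token is
        # unknown and merging it produces a word we recognize.
        while (i + 1 < len(tokens)
               and tok.lower() not in vocab
               and (tok + tokens[i + 1]).lower() in vocab):
            tok += tokens[i + 1]
            i += 1
        repaired.append(tok)
        i += 1
    return " ".join(repaired)

vocab = {"the", "dog", "is", "blue"}
repair_split_words("The d og is bl ue", vocab)  # -> "The dog is blue"
```

In practice you’d load a real word list instead of a handful of words, and this won’t catch genuine misspellings, only split words, so a library like pyspellchecker might still be worth testing alongside it.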