Typos in Embeddings: To Fix or Not?

I’ve worked out a method to scrape PDFs paragraph by paragraph and convert them to OpenAI embeddings. The chunks of text are great, and I’m blending them together so that no context is lost (each chunk is combined with the 3 previous chunks). However, there are cases where the extracted text contains typos, meaning I might scrape a paragraph that comes out as “The d og is bl ue”.
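
For anyone wanting to try the same blending, here is a minimal sketch of one way to read it, assuming each blended chunk is the current paragraph plus up to 3 paragraphs before it. The function name `blend_chunks` and its parameters are made up for illustration and are not the original poster's code.

```python
def blend_chunks(paragraphs, n_previous=3, sep="\n\n"):
    """Return one blended chunk per paragraph.

    Assumption: a blended chunk is the paragraph itself preceded by up to
    `n_previous` earlier paragraphs, joined with `sep`.
    """
    blended = []
    for i in range(len(paragraphs)):
        # Clamp at 0 so the first few chunks simply have less context.
        start = max(0, i - n_previous)
        blended.append(sep.join(paragraphs[start:i + 1]))
    return blended


if __name__ == "__main__":
    demo = ["First paragraph.", "Second paragraph.", "Third paragraph.",
            "Fourth paragraph.", "Fifth paragraph."]
    for chunk in blend_chunks(demo, sep=" "):
        print(repr(chunk))
```

Each blended string would then be sent to the embedding endpoint instead of the bare paragraph. The window size is a trade-off: more context per chunk, but also more repeated text across embeddings.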

I was going to experiment to see what issues this causes, but has anyone had experience implementing a spellchecker in Python for individual strings? Before even doing that, though, does anyone know if these typos even matter? I was watching a few videos, and one of them mentioned that since there are a lot of typos on the internet and these models were trained on data from the internet, OpenAI found a way to handle those typos. I’m wondering whether that also applies to contextual Q&A and semantic search.

A space inside a word produces extra tokens, so the model sees it as a different “word”. I would remove all the typos if you can.
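
For the stray-space case specifically (“The d og is bl ue”), a generic spellchecker may not help much, since it usually checks one word at a time. A rough sketch of an alternative, assuming you have some word list to check against: merge neighbouring fragments when their concatenation is a known word and at least one fragment is not a word by itself. `rejoin_split_words`, `_key`, and the tiny vocabulary below are hypothetical, just to show the idea.

```python
import string


def _key(token):
    """Normalize a token for vocabulary lookup (lowercase, no edge punctuation)."""
    return token.strip(string.punctuation).lower()


def rejoin_split_words(text, vocabulary, max_parts=3):
    """Merge fragments like 'd og' into 'dog' when the result is a known word.

    Assumption: `vocabulary` is a set of lowercase words. Fragments are only
    merged if at least one of them is not a word itself, so 'a part' is not
    turned into 'apart'.
    """
    tokens = text.split()
    result = []
    i = 0
    while i < len(tokens):
        merged = False
        # Try the longest merge first, down to two fragments.
        for n in range(min(max_parts, len(tokens) - i), 1, -1):
            parts = tokens[i:i + n]
            candidate = "".join(parts)
            all_valid = all(_key(p) in vocabulary for p in parts)
            if _key(candidate) in vocabulary and not all_valid:
                result.append(candidate)
                i += n
                merged = True
                break
        if not merged:
            result.append(tokens[i])
            i += 1
    return " ".join(result)


if __name__ == "__main__":
    vocab = {"the", "dog", "is", "blue"}  # toy word list for the demo
    print(rejoin_split_words("The d og is bl ue", vocab))
```

This is greedy and will miss cases (for example, a fragment that happens to be a real word), so it is worth spot-checking the output on a sample of your PDFs before embedding everything.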

Thank you! Did not know that, but makes sense! Off to find Python spellcheck lol

Hey @charlieevert, I am facing a similar issue. What if the words are not present in the spellcheck library? In my use case there are a lot of out-of-domain words. Any help is appreciated.
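
One possible approach for out-of-domain words, as a sketch only: keep your own list of domain terms, never "correct" a word that is on it, and leave a word unchanged when nothing in the combined word list is close enough. The example below uses `difflib.get_close_matches` from the Python standard library. The function name `correct_word`, the cutoff of 0.8, and the sample words are assumptions for illustration.

```python
import difflib


def correct_word(word, vocabulary, domain_terms, cutoff=0.8):
    """Return a corrected word, or the original if no safe correction exists.

    Assumption: `vocabulary` and `domain_terms` are sets of lowercase words.
    Domain terms are treated as correct and are also valid correction targets.
    """
    lower = word.lower()
    if lower in vocabulary or lower in domain_terms:
        return word  # already a known word, keep it as written
    candidates = sorted(vocabulary | domain_terms)
    matches = difflib.get_close_matches(lower, candidates, n=1, cutoff=cutoff)
    # With no close match, leave the word alone rather than guess.
    return matches[0] if matches else word


if __name__ == "__main__":
    vocab = {"embedding", "search"}
    domain = {"xylofrob"}  # made-up domain term
    for w in ["embeding", "xylofrob", "qzzt"]:
        print(w, "->", correct_word(w, vocab, domain))
```

A higher cutoff makes the checker more conservative, which is usually what you want when a lot of the text is jargon: a missed typo tends to be less harmful than a domain term silently rewritten into a common word.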