I’ve worked out a method to scrape PDFs paragraph by paragraph and convert them to OpenAI embeddings. The chunks of text are great, and I’m blending them together in case any context is lost (each blended chunk also includes the 3 preceding chunks). However, some of the scraped text contains typos… meaning I might scrape a paragraph that comes out as “The d og is bl ue”.
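For anyone curious what I mean by blending, here’s a rough sketch of the idea (the function name and the window size of 3 are just how I’ve set it up, not anything standard):

```python
def blend_chunks(paragraphs, window=3):
    """Prepend up to `window` preceding paragraphs to each chunk
    so that context isn't lost at chunk boundaries."""
    blended = []
    for i, para in enumerate(paragraphs):
        # Take up to `window` paragraphs immediately before this one
        context = paragraphs[max(0, i - window):i]
        blended.append(" ".join(context + [para]))
    return blended
```

Each blended string then gets sent to the embeddings endpoint instead of the raw paragraph.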
I was going to experiment to see what issues this causes, but has anyone had experience implementing a spellchecker in Python for individual strings? Before even doing that, though, does anyone know whether these typos even matter? I watched a few videos, and one of them mentioned that since there are a lot of typos on the internet, and these models were trained on internet text, OpenAI found a way to handle them… wondering whether that also applies to contextual Q&A and semantic search.
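In case it helps the discussion: since my typos are mostly words split by stray spaces (a common PDF extraction artifact) rather than misspellings, a dictionary-based merge might be enough, without a full spellchecker. A minimal sketch, assuming you have a vocabulary set to check against (the tiny `vocab` here is just for illustration):

```python
def repair_split_words(text, vocab):
    """Greedily merge adjacent tokens when the merge forms a known word,
    e.g. 'd og' -> 'dog'. `vocab` is a set of lowercase known words."""
    tokens = text.split()
    repaired = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # Keep absorbing the next fragment while the current token is
        # unknown and merging it produces a word we recognize.
        while (i + 1 < len(tokens)
               and tok.lower() not in vocab
               and (tok + tokens[i + 1]).lower() in vocab):
            tok += tokens[i + 1]
            i += 1
        repaired.append(tok)
        i += 1
    return " ".join(repaired)

vocab = {"the", "dog", "is", "blue"}
repair_split_words("The d og is bl ue", vocab)  # -> "The dog is blue"
```

In practice you’d load a real word list instead of a handful of words, and this won’t catch genuine misspellings, only split words, so a library like pyspellchecker might still be worth testing alongside it.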