Say you’re doing an embedding project on large text with ada-002. Are there any guidelines on how the way you preprocess text chunks affects the resulting vector store?
More on my use case:
I’m training a QA model on a variety of research papers and documents. Many of the chunks are normal text, but many others mix experimental procedures and results (numbers) in with the text. I can’t find guidelines on how different preprocessing techniques affect outcomes. For example, should I remove chunks that contain numbers, delete the line breaks that separate paragraphs, make all characters lowercase, etc.?
I imagine the answer would be fairly uniform across a wide variety of use cases, so if anyone has resources on ideal input formats for ada-002 embeddings, please share.
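
To make the options above concrete, here is a rough sketch of the preprocessing variants I’m weighing. The helper names (`preprocess`, `has_numbers`, `variants`) are my own and purely hypothetical, and nothing here calls the embedding API. The idea would be to embed each variant separately and compare retrieval quality; I’m not claiming any of them is the right one.

```python
import re


def preprocess(chunk, lowercase=False, flatten_paragraphs=False):
    """Return one preprocessing variant of a text chunk (hypothetical helper)."""
    # Collapse runs of spaces/tabs within each line and trim the line ends.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in chunk.splitlines()]
    text = "\n".join(lines)
    # Normalize any run of blank lines to a single paragraph break.
    text = re.sub(r"\n{2,}", "\n\n", text).strip()
    if flatten_paragraphs:
        # Replace every line break (including paragraph breaks) with a space.
        text = re.sub(r"\s*\n\s*", " ", text)
    if lowercase:
        text = text.lower()
    return text


def has_numbers(chunk):
    """True if the chunk contains at least one digit (candidate for filtering)."""
    return re.search(r"\d", chunk) is not None


def variants(chunk):
    """Build the variants to embed and compare side by side."""
    return {
        "as_is": preprocess(chunk),
        "flattened": preprocess(chunk, flatten_paragraphs=True),
        "lowercased": preprocess(chunk, lowercase=True),
    }


if __name__ == "__main__":
    sample = "The yield was  42.5%\n\n\nafter 3 h.\nNext   Step"
    for name, text in variants(sample).items():
        print(name, repr(text), "has numbers:", has_numbers(text))
```

Each entry returned by `variants` would go through the same embedding call, so any difference in retrieval results could be attributed to the preprocessing step alone.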