Hi everyone, I hope you are all doing well.
I just saw the text-embedding-ada-002 model being used for text search and data preparation.
My question is about how to prepare the data.
In the example at the mentioned link, there is a text column and a summary column, and the two are combined in a dictionary-like format such as title: summary, content: text. I understand that, but it means I must have a summary for every text.
If I have whole documents, how should I prepare the dataset? Summarizing each document is difficult, so please tell me how to tackle this problem.
- Should I embed the document text directly, or must I follow the same structure as the example when creating embeddings?
- How do I get embeddings for my documents, given that writing a summary for each one is difficult, and formatting the text and summary into the structure from the example would also be difficult?
- Please give me a proper way to solve this semantic text search use case with text-embedding-ada-002.
Ideally, each embedding would have enough semantic uniqueness. Treat the example use cases as inspiration, not as step-by-step guides, unless one of them matches your goal exactly. The structure ultimately depends on the purpose of the text. You are the director of your documents: what separates them? Which parts are important? What benefit does a semantic search have over any other sort of search engine?
- The structure should create unique documents that highlight their semantic differences. Typically this means including everything in the document.
- You don’t extract document embeddings. You are condensing/converting your documents into a format that a computer can compare. You don’t need to summarize them.
- The proper way is to first understand embeddings and ask yourself: is this the best solution? If you understand embeddings well, you will see that it’s a very straightforward process, and your questions will answer themselves.
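To make the "no summary needed" point concrete, here is a minimal sketch of embedding the raw document text directly and ranking documents against a query by cosine similarity. It is only an illustration under stated assumptions: `toy_embed` is a hypothetical stand-in (a hashed bag-of-words vector) so the example runs offline, and in a real application you would replace it with a call to text-embedding-ada-002. The file names and document texts are made up.

```python
import math
import re
import zlib

DIM = 1536  # chosen to match the vector size of text-embedding-ada-002


def toy_embed(text):
    """Hypothetical stand-in for an embedding API call.

    Builds a hashed bag-of-words vector and scales it to unit length.
    It only captures word overlap, not real semantics.
    """
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def search(query, index, top_k=3):
    """Return the top_k (score, doc_id) pairs, best match first."""
    q = toy_embed(query)
    scored = [(cosine(q, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return scored[:top_k]


# Made-up documents: the full text is embedded as-is, with no summary column.
documents = {
    "invoice.txt": "Invoice for laptop purchase, payment due in thirty days.",
    "recipe.txt": "Mix flour, sugar and eggs, then bake the cake for forty minutes.",
    "manual.txt": "Press the power button to start the laptop and open the settings menu.",
}

index = {doc_id: toy_embed(text) for doc_id, text in documents.items()}
results = search("how do I bake a cake", index, top_k=1)
print(results[0][1])
```

The only structure that matters here is "one piece of text in, one vector out"; the title/content layout in the example is one way to build that text, not a requirement.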
Thank you for your reply. OK, one more question.
If I put each document's text in one row of a CSV file, encode each row as an embedding, and then search those embeddings with a query string, will the result be the whole document text or only the semantically matching sentences from that document?
We are trying to create an application where we upload a text file, the back end automatically encodes it as an embedding, and we then run queries against those embeddings. Will that approach work?
If we create embeddings for hundreds of documents, will the search get better?
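As a rough illustration of the whole-document-versus-sentence question: an embedding search returns whatever unit of text was embedded. With one row per document, the best match is a whole document; to get back a specific passage, the document has to be split into chunks before embedding. The sketch below assumes a naive sentence splitter and a hypothetical `toy_embed` placeholder (hashed bag-of-words) standing in for text-embedding-ada-002, with made-up documents.

```python
import math
import re
import zlib

DIM = 1536  # chosen to match the vector size of text-embedding-ada-002


def toy_embed(text):
    """Hypothetical stand-in for an embedding API call (word overlap only)."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def split_into_chunks(text):
    """Naive sentence splitter; an assumption made for this illustration."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_index(documents):
    """One row per chunk, remembering which document each chunk came from."""
    rows = []
    for doc_id, text in documents.items():
        for chunk in split_into_chunks(text):
            rows.append({"doc_id": doc_id, "text": chunk, "vector": toy_embed(chunk)})
    return rows


def search(query, rows, top_k=1):
    """Return the top_k rows ranked by cosine similarity to the query."""
    q = toy_embed(query)

    def score(row):
        return sum(a * b for a, b in zip(q, row["vector"]))

    return sorted(rows, key=score, reverse=True)[:top_k]


# Made-up uploaded files.
documents = {
    "handbook.txt": (
        "Employees receive twenty days of paid vacation. "
        "Expense reports are due every Friday. "
        "The office closes at six in the evening."
    ),
    "faq.txt": (
        "Passwords must be reset every ninety days. "
        "Contact the help desk to unlock an account."
    ),
}

rows = build_index(documents)
hit = search("when are expense reports due", rows)[0]
print(hit["doc_id"], "->", hit["text"])
```

Keeping `doc_id` next to each chunk lets the application show the matching passage and still link back to the full uploaded file.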