Data without meaning gets the highest similarity score

Knordy · March 19, 2024, 8:56am

Unfortunately, I do not have a solution yet for the raw chunks. There are some techniques for improving the chunks: for example, when chunking your documents, generate automated summaries for individual chunks or for sets of chunks. These summaries will probably drop the irrelevant data. There are also techniques that cluster chunks with similar data from different documents and summarize each cluster. Using both the summaries and the raw chunks might increase the relevant context for the answer. Perhaps with more relevant data available, the irrelevant chunks will be ranked lower.
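To make the "summaries plus raw chunks" idea concrete, here is a minimal sketch of one way to combine the two signals at ranking time. It is not a tested solution: the function `rank_chunks`, the `summary_weight` parameter, the dictionary keys, and the 3-dimensional vectors are all made up for illustration, and in practice the vectors would come from your embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunks(query_vec, chunks, summary_weight=0.5):
    """Rank chunk ids by a blend of raw-chunk and summary similarity.

    summary_weight is a hypothetical tuning knob: 0.0 uses only the raw
    chunk embedding, 1.0 uses only the summary embedding.
    """
    scored = []
    for chunk in chunks:
        raw_score = cosine(query_vec, chunk["raw_vec"])
        summary_score = cosine(query_vec, chunk["summary_vec"])
        blended = (1 - summary_weight) * raw_score + summary_weight * summary_score
        scored.append((blended, chunk["id"]))
    # Highest blended score first.
    return [chunk_id for _, chunk_id in sorted(scored, reverse=True)]


# Toy vectors (assumed, not real embeddings). The "noise" chunk has a raw
# embedding close to the query, but its summary points somewhere else.
chunks = [
    {"id": "noise", "raw_vec": [0.9, 0.1, 0.0], "summary_vec": [0.0, 0.1, 0.9]},
    {"id": "relevant", "raw_vec": [0.7, 0.7, 0.0], "summary_vec": [0.8, 0.6, 0.0]},
]
query = [1.0, 0.2, 0.0]

print(rank_chunks(query, chunks, summary_weight=0.0))  # raw chunks only
print(rank_chunks(query, chunks, summary_weight=0.5))  # raw + summary blend
```

With these toy numbers, the raw-only ranking puts the noise chunk first, while the blended ranking moves the relevant chunk to the top. The raw chunks stay in the index, so nothing is thrown away; the summary only influences the ordering.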

I’m not a fan of pre-processing the data, as it could potentially remove relevant data.

Do you have any other ideas?
