Data without meaning gets the highest similarity score

Unfortunately, I do not have a solution yet for the raw chunks. There are some techniques that try to improve the chunks. For example, when chunking your documents, you can generate automated summaries for individual chunks or for sets of chunks. These summaries will probably leave out the irrelevant data. There are also techniques that cluster chunks with similar content from different documents and summarize each cluster. Using both the summaries and the raw chunks might increase the amount of relevant context available for the answer. Perhaps with more relevant data in the index, the irrelevant chunks will be ranked lower.
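To make the "summaries plus raw chunks" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, `summarize` is a placeholder that just keeps the first sentence (in practice this would be an LLM call), and the names `Entry`, `build_index` and `search` are hypothetical. It only shows the shape of the approach: index a summary next to each raw chunk, search over both, and map every hit back to its raw chunk.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Entry:
    text: str      # the text that gets embedded and searched
    kind: str      # "raw" or "summary"
    chunk_id: int  # index of the raw chunk this entry belongs to


def embed(text):
    # Toy stand-in for an embedding model: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(chunk):
    # Placeholder summarizer (assumption): keep only the first sentence.
    return re.split(r"(?<=[.!?])\s+", chunk.strip())[0]


def build_index(chunks):
    # Store every raw chunk and its summary as separate searchable entries.
    entries = []
    for i, chunk in enumerate(chunks):
        entries.append(Entry(chunk, "raw", i))
        entries.append(Entry(summarize(chunk), "summary", i))
    return entries


def search(entries, query, top_k=3):
    # Score all entries, then keep the best-scoring hit per raw chunk.
    q = embed(query)
    scored = [(cosine(q, embed(e.text)), e) for e in entries]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    seen, results = set(), []
    for score, entry in scored:
        if entry.chunk_id in seen:
            continue
        seen.add(entry.chunk_id)
        results.append((score, entry.chunk_id))
        if len(results) == top_k:
            break
    return results


chunks = [
    "Vector search ranks chunks by similarity. Page 3 of 10. Footer text.",
    "Bananas are yellow. They grow in warm climates.",
]
index = build_index(chunks)
results = search(index, "how does vector search rank chunks", top_k=2)
```

Because a hit on a summary is mapped back to its `chunk_id`, the raw chunk can still be handed to the model as context, so nothing is thrown away from the original data. Whether this actually pushes the meaningless chunks down the ranking depends on the real embedding model and summarizer, which this sketch does not try to imitate.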

I’m not a fan of pre-processing the data, though, as it could remove relevant data along with the irrelevant parts.

Do you have any other ideas?