How do you handle text embedding ranking?

I’m currently studying laws, and as part of this I need to understand and query references. To do this, I break each document down into units called “sections” and store each law in a separate YAML file. I then use ChatGPT to summarize each section and generate a title for it. The title, the summarized text, and the original text are each embedded, and the embeddings are stored in the YAML file alongside them.
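For reference, my indexing step looks roughly like the sketch below. The model names, the prompt for summarization, and the YAML layout are just placeholders for illustration, not my exact code:

```python
# Rough sketch of my indexing step (model names and field names are placeholders).
from openai import OpenAI
import yaml

client = OpenAI()

def embed(text: str) -> list[float]:
    # Any embedding model could be used here; this one is only an example.
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def summarize_and_title(section_text: str) -> tuple[str, str]:
    # Placeholder: one ChatGPT call that returns a one-line title and a summary.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Give a one-line title, then a short summary, "
                       "separated by a newline:\n\n" + section_text,
        }],
    )
    title, _, summary = resp.choices[0].message.content.partition("\n")
    return title.strip(), summary.strip()

def build_section_record(section_text: str) -> dict:
    title, summary = summarize_and_title(section_text)
    return {
        "title": title,
        "summary": summary,
        "text": section_text,
        "title_embedding": embed(title),
        "summary_embedding": embed(summary),
        "text_embedding": embed(section_text),
    }

# One law -> one YAML file containing all of its section records.
sections = ["Section 1 ...", "Section 2 ..."]
records = [build_section_record(s) for s in sections]
with open("some_law.yaml", "w") as f:
    yaml.safe_dump({"sections": records}, f)
```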

Whenever I enter a prompt, its text is embedded and compared for similarity against the “title”, “summarized text”, and “original text” embeddings of each section. I then take the mean of those three similarities, which gives me a single average similarity score per section.
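At query time, the scoring step is essentially the following (cosine similarity, then the mean over the three fields; `embed()` is the same placeholder helper as in the indexing sketch above):

```python
# Sketch of my query step: embed the prompt, compare it against each section's
# three stored vectors, and average the similarities.
import numpy as np

def cosine(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_sections(prompt: str, records: list[dict]) -> list[tuple[float, dict]]:
    q = embed(prompt)  # same placeholder embedding helper as in the indexing sketch
    scored = []
    for rec in records:
        sims = [
            cosine(q, rec["title_embedding"]),
            cosine(q, rec["summary_embedding"]),
            cosine(q, rec["text_embedding"]),
        ]
        scored.append((sum(sims) / len(sims), rec))  # mean of the three similarities
    # Highest average similarity first; I pick the top-ranked chunks from here.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```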

I then select text chunks based on their ranking by this average similarity. Could you suggest any improvements to this process?
