Can we improve the embedded data?

Hi There,

I have a requirement to give my business users a way to validate and correct the embedded data through a user interface.

Use case:
As an admin of the system, I ask the model a question that is answered from embedded data (similarity search). If I find that the answer is wrong, I would like an option to correct that answer and save it.

Expected result:

  1. The corrected answer and its question should be saved back to the embeddings.
  2. Whenever the same question is asked again, the saved answer takes priority over the embedded material.

I know this can be done with fine-tuning, but since I am using embeddings here, I am looking for a solution that works with embeddings only.

Looking forward to hearing from all the experts here.


Maybe you can do some manipulation of the embedded data and then embed it again with a fresh model. What kind of data is it?

You can train another neural network on your embeddings to map them to the correct answer … but that is a lot of work.

Otherwise, look into keyword-based correlation algorithms. Embeddings capture meaning, and if there are a bunch of keywords without meaning, you need to use a keyword-based algorithm instead.

Also, if the user input is super vague, you might need a fine-tune to intercept this and ask the user to be more specific so your embedding (or keyword) search is meaningful.

A hybrid of embedding and keyword search is also an option. Just need more details on what you are searching over. Also chunk sizes, etc.
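As a rough sketch of what a hybrid could look like: score each chunk by a weighted mix of embedding similarity and keyword overlap, then rank. Everything named here is made up for illustration. `toy_embed` is only a placeholder so the example runs without an API call (you would pass in your real embedding function), and `alpha` is an arbitrary weight you would need to tune on your own data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_embed(text):
    # ASSUMPTION: placeholder for a real embedding model.
    # It is just a bag-of-words count vector, so it carries no real "meaning".
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dict-likes."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def keyword_score(query, chunk):
    """Fraction of the query's distinct tokens that appear in the chunk."""
    q, c = set(tokenize(query)), set(tokenize(chunk))
    return len(q & c) / len(q) if q else 0.0


def hybrid_search(query, chunks, embed=toy_embed, alpha=0.5):
    """Rank chunks by alpha * embedding score + (1 - alpha) * keyword score."""
    q_vec = embed(query)
    scored = [
        (alpha * cosine(q_vec, embed(c)) + (1 - alpha) * keyword_score(query, c), c)
        for c in chunks
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


chunks = [
    "Refunds are processed within 5 days",
    "Error code E1234 means the disk is full",
    "Our office is open on weekdays",
]
ranked = hybrid_search("what does E1234 mean", chunks)
print(ranked[0][1])
```

With a real embedding model, the keyword part mostly helps with things like error codes or product IDs that carry little meaning on their own.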

Maybe also add some filters on top after you retrieve the data, like a bad-word filter. Not the best solution though.


Thanks for responding.
This is a chatbot that uses embeddings of PDFs and FAQs as its source. The issue is that the embedding search is not always accurate, and the bot sometimes responds with random answers.

I would like to keep an option open where we can add accurate answers for the most common questions, and these new answers take priority over the actual text stored in the embeddings.

Hope this clarifies.
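
To make the priority idea concrete, here is a minimal sketch of the behaviour described above: corrected question/answer pairs live in a separate store that is checked first, and the normal PDF + FAQ similarity search runs only when no saved question is close enough. All names (`CorrectionStore`, `answer`, `toy_embed`) and the 0.8 threshold are assumptions for illustration, not part of any library. `toy_embed` is a placeholder for the real embedding call, and `fallback` stands in for the existing retrieval pipeline.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    # ASSUMPTION: placeholder for a real embedding model (bag-of-words counts).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dict-likes."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class CorrectionStore:
    """Holds admin-corrected answers, keyed by the question they answer."""

    def __init__(self, embed=toy_embed, threshold=0.8):
        self.embed = embed
        self.threshold = threshold  # ASSUMPTION: arbitrary cut-off, tune it
        self.entries = {}  # question text -> (question vector, answer)

    def save(self, question, answer):
        # Saving the same question again overwrites the earlier correction.
        self.entries[question] = (self.embed(question), answer)

    def lookup(self, question):
        """Return the saved answer for the closest question, or None."""
        q_vec = self.embed(question)
        best_score, best_answer = 0.0, None
        for vec, saved_answer in self.entries.values():
            score = cosine(q_vec, vec)
            if score > best_score:
                best_score, best_answer = score, saved_answer
        return best_answer if best_score >= self.threshold else None


def answer(question, store, fallback):
    """Corrected answers win; otherwise use the normal embedding search."""
    corrected = store.lookup(question)
    return corrected if corrected is not None else fallback(question)


store = CorrectionStore()
store.save("How do I reset my password?", "Use the Reset link on the login page.")

# Hypothetical stand-in for the existing PDF + FAQ similarity search.
search_documents = lambda q: "answer from PDF/FAQ embeddings"

print(answer("how do i reset my password", store, search_documents))
print(answer("What are your opening hours?", store, search_documents))
```

The threshold decides how close a new question must be to a saved one before the correction is used. Too low and corrections leak into unrelated questions; too high and rephrased questions miss them.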