I am trying to implement semantic/vector search for images.
To do that, I am using gpt-4o-mini to analyze each image and generate structured data from it with this prompt:
Your job is to generate json data from a given image.
Return your output in the following format:
{
description: "A description of the image. Only use relevant keywords.",
text: "If the image contains text, include that here, otherwise remove this field",
keywords: "Keywords that describe the image",
artstyle: "The art style of the image",
text_language: "The language of the text in the image, otherwise remove this field",
design_theme: "If the image has a theme (hobby, interest, occupation etc.), include that here, otherwise remove this field",
}
The data I am getting back is pretty accurate (in my eyes). I then embed the JSON with the “text-embedding-3-small” model.
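Roughly, the string I embed is built like this (a simplified sketch; the helper name is made up and the actual call to text-embedding-3-small is omitted):

```python
import json

def embedding_input(item: dict) -> str:
    """Flatten the model's JSON output into one string to embed.

    Optional fields (text, text_language, design_theme) may be absent,
    so only present, non-empty values are kept.
    """
    fields = ["description", "text", "keywords", "artstyle",
              "text_language", "design_theme"]
    parts = [item[f] for f in fields if item.get(f)]
    return " ".join(parts)

item = json.loads(
    '{"description": "white text on black background",'
    ' "text": "straight outta valhalla",'
    ' "keywords": "valhalla, vikings, norse",'
    ' "artstyle": "plain typography"}'
)
# The returned string is what gets sent to text-embedding-3-small.
print(embedding_input(item))
```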
The problem is that the search results are pretty bad.
For example: I have two images containing only text. One says “straight outta knee surgery” and one says “straight outta valhalla”.
When I search for “straight outta”, I have to turn the similarity threshold down to 0.15 to get both results.
This is my postgres search function:
CREATE OR REPLACE FUNCTION search_design_items (
  query_embedding vector(1536),
  match_threshold FLOAT,
  match_count INT
) RETURNS TABLE (id BIGINT) AS $$
BEGIN
  RETURN QUERY
  SELECT design_management_items.id  -- qualified to avoid ambiguity with the output column "id"
  FROM public.design_management_items
  WHERE 1 - (design_management_items.description_vector <=> query_embedding) > match_threshold
  ORDER BY design_management_items.description_vector <=> query_embedding ASC
  LIMIT match_count;
END;
$$ LANGUAGE plpgsql;
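For clarity, here is a pure-Python sketch of what that function computes, using toy vectors instead of 1536-dimensional embeddings (`<=>` is pgvector's cosine distance operator, so similarity is 1 minus that distance):

```python
import math

def cosine_distance(a, b):
    # pgvector's <=> operator: 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def search(items, query_vec, match_threshold, match_count):
    # items: list of (id, embedding) pairs, standing in for design_management_items
    scored = [(item_id, cosine_distance(vec, query_vec)) for item_id, vec in items]
    # WHERE 1 - distance > match_threshold
    kept = [(i, d) for i, d in scored if 1.0 - d > match_threshold]
    # ORDER BY distance ASC, LIMIT match_count
    kept.sort(key=lambda t: t[1])
    return [i for i, _ in kept[:match_count]]

items = [(1, [1.0, 0.0]), (2, [0.0, 1.0]), (3, [1.0, 1.0])]
print(search(items, [1.0, 0.0], 0.5, 10))  # → [1, 3]
```

With a threshold of 0.5, only items whose similarity to the query exceeds 0.5 survive, ordered from closest to farthest.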
With higher thresholds (0.5 and up) there are pretty much no results at all. This seems wrong, because every tutorial I have seen uses a threshold of 0.7 or higher.
What do I need to change in order to improve the accuracy of my search results?