Improving Semantic Search Engine Accuracy Using OpenAI Embeddings and Llama VectorStoreIndex

hitensharma710 · May 17, 2024, 7:53pm

Hi,

I am currently working on a semantic search engine that uses OpenAI embeddings together with LlamaIndex’s VectorStoreIndex (for testing purposes; I might switch to Pinecone for scale) to return relevant results for a query.

I am using the text-embedding-ada-002 model for my embeddings.

I am having trouble getting accurate results. The similarity scores are also always very close to each other, regardless of the query I put in. I’m not sure whether the problem is in my prompting, the data I am passing in to be embedded, or my method itself.

For context, I am using product review data, and my end goal is to return the top k relevant products for a given query. Here is a sample product review in JSON format:

{
    "product_name": "SuperWidget 3000",
    "reviewer_email": "reviewer@example.com",
    "review_id": "1234-ABCD",
    "evaluation": {
        "overall_score": 4.5,
        "comments": [
            {
                "text": "Excellent build quality and performance.",
                "sentiment": 1
            },
            {
                "text": "A bit pricey for the features offered.",
                "sentiment": -1
            }
        ],
        "tags": ["Build Quality", "Performance", "Price"],
        "highlights": [
            "Excellent build quality.",
            "Great performance."
        ],
        "lowlights": [
            "Expensive.",
            "Limited feature set."
        ],
        "summary": "Overall, a great product with excellent build quality and performance, but a bit expensive for the features it offers."
    },
    "review_details": [
        {
            "question_id": "971847C7",
            "question": "What do you think about the build quality?",
            "evaluation_criteria": "Looking for comments on durability and materials used.",
            "score_of_1": null,
            "score_of_5": 5,
            "evaluation_score": 5,
            "evaluation_details": "The build quality is excellent, very sturdy and durable.",
            "review_text": "The build quality is excellent, very sturdy and durable."
        },
        {
            "question_id": "1B7AD751",
            "question": "How would you rate the performance?",
            "evaluation_criteria": "Looking for smoothness of operation and speed.",
            "score_of_1": null,
            "score_of_5": 4,
            "evaluation_score": 4,
            "evaluation_details": "The performance is top-notch, handles all tasks smoothly.",
            "review_text": "The performance is top-notch, handles all tasks smoothly."
        },
        {
            "question_id": "812CDFE0",
            "question": "Is the product worth its price?",
            "evaluation_criteria": "Checking for value for money.",
            "score_of_1": null,
            "score_of_5": 3,
            "evaluation_score": 3,
            "evaluation_details": "A bit pricey for the features offered.",
            "review_text": "A bit pricey for the features offered."
        }
    ],
    "review_complete": true,
    "tags": [],
    "timestamp": "2024-05-15T15:27:16.794000",
    "is_new": true,
    "request_feedback": true,
    "feedback_score": null
}

For the input data, my current setup takes each JSON review document and flattens it into one long string, which is the text that gets embedded. Here is the format used for one review:

concatenated_text = (
    f"Product Name: {product_name} | Reviewer Email: {reviewer_email} | Review ID: {review_id} | "
    f"Review Complete: {review_complete} | Tags: {tags} | Timestamp: {timestamp} | Is New: {is_new} | "
    f"Request Feedback: {request_feedback} | Feedback Score: {feedback_score} | Overall Score: {overall_score} | "
    f"Comments: {comments} | Highlights: {highlights} | Lowlights: {lowlights} | Summary: {summary} | "
    f"Review Details: {review_details}"
)

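This is my main code for now:

# Imports assume the llama-index >= 0.10 package layout.
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI as LlamaOpenAI

Settings.llm = LlamaOpenAI(model="gpt-4-turbo-preview", temperature=0, max_tokens=1024)
Settings.embed_model = OpenAIEmbedding()
llm = Settings.llm
embed_model = Settings.embed_model

# EmbeddingService, data, create_vector_document and prompt are my own helpers (not shown).
embedding_service = EmbeddingService()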

docs = data()
vector_documents = []
for doc in docs:
    vector_doc = create_vector_document(doc, embedding_service)
    vector_documents.append(vector_doc)

index = VectorStoreIndex.from_documents(vector_documents, embed_model=embed_model, llm=llm)

def query_index(index, prompt, top_k=5):
    query_engine = index.as_query_engine(similarity_top_k=top_k)
    response = query_engine.query(prompt)
    
    relevant_documents = []
    if response.source_nodes:
        for node in response.source_nodes:
            relevant_documents.append({
                "id": node.node.metadata['id'],
                "name": node.node.metadata['name'],
                "text": node.node.text,
                "score": node.score,
                "match_score": str(node.node.metadata.get('match_score', 0))
            })

    return relevant_documents

res = query_index(index, prompt())
for doc in res:
    print(f"Score: {doc['score']}, Reviewer ID: {doc['id']}")
    print('\n')

These are some example queries I have been testing with:

positive tags
reliable
good customer service

I just need direction on where to go, whether it’s:

further fine-tuning
cleaning my data (removing unnecessary noise)
technical implementation issues
others

If you got to here, thanks a lot!

RonaldGRuckus · May 17, 2024, 9:23pm

You need to reduce the noise.

Instead of grouping everything together, embed only the review text and then link each embedding to its product name.

Separate the concerns.
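A rough sketch of what that could look like, assuming the JSON layout from the sample review above. The function name `review_to_chunks` and the `kind` field are made up for illustration; the point is that only opinion text ends up in `text`, while the product name and review ID travel along as metadata instead of being embedded.

```python
from typing import Any


def review_to_chunks(review: dict[str, Any]) -> list[dict[str, Any]]:
    """Split one review document into small text chunks. The product name
    and review ID are attached as metadata, not mixed into the text."""
    metadata = {
        "product_name": review["product_name"],
        "review_id": review["review_id"],
    }
    evaluation = review.get("evaluation", {})
    chunks: list[dict[str, Any]] = []

    # The summary is one self-contained piece of opinion text.
    if evaluation.get("summary"):
        chunks.append({"text": evaluation["summary"], "kind": "summary", **metadata})

    # Each answer to a review question becomes its own chunk.
    for detail in review.get("review_details", []):
        if detail.get("review_text"):
            chunks.append({
                "text": detail["review_text"],
                "kind": "answer",
                "question": detail.get("question"),
                **metadata,
            })
    return chunks


# Trimmed-down version of the sample review from the question.
sample = {
    "product_name": "SuperWidget 3000",
    "review_id": "1234-ABCD",
    "evaluation": {"summary": "A great product, but a bit expensive."},
    "review_details": [
        {
            "question": "What do you think about the build quality?",
            "review_text": "The build quality is excellent, very sturdy and durable.",
        },
        {
            "question": "Is the product worth its price?",
            "review_text": "A bit pricey for the features offered.",
        },
    ],
}

for chunk in review_to_chunks(sample):
    print(chunk["kind"], "|", chunk["product_name"], "|", chunk["text"])
```

Each `text` value would then be embedded on its own, and the top k hits mapped back to products through `product_name`. Fields such as emails, timestamps and flags never reach the embedding model, so they cannot pull every document toward the same region of the vector space.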

You can combine embeddings in many ways, so it’s better to create groupings of single-concern embeddings. Product names are meaningless for this, but you could do some fun things like see how the embedding engine “feels” about the product names.

I’d recommend Weaviate. They offer a database that accepts your schema and can also embed items individually and in groups.
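One simple way to combine single-concern embeddings is a weighted average that is rescaled back to unit length. This is only a sketch of that one option, not the only way to do it: the helper names (`normalize`, `combine`, `cosine`) and the tiny 3-dimensional vectors are invented for illustration, and real embeddings from the API would be much longer lists of floats.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combine(vectors, weights=None):
    """Weighted average of equal-length vectors, rescaled to unit length."""
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    dim = len(vectors[0])
    mean = [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]
    return normalize(mean)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Toy stand-ins for the embeddings of two single-concern review chunks.
build_quality_vec = [1.0, 0.0, 0.0]
price_vec = [0.0, 1.0, 0.0]

# A per-product vector that leans three times harder on build quality.
product_vec = combine([build_quality_vec, price_vec], weights=[3.0, 1.0])

print("vs build quality:", cosine(product_vec, build_quality_vec))
print("vs price:", cosine(product_vec, price_vec))
```

Because each chunk was embedded separately, the weights can be changed later (or the chunks grouped differently) without re-embedding anything.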
