Improving Semantic Search Engine Accuracy Using OpenAI Embeddings and LlamaIndex VectorStoreIndex

Hi,

I am currently working on a semantic search engine that embeds documents with OpenAI embeddings and then uses LlamaIndex’s VectorStoreIndex to return relevant results for a query (this is for testing; I might switch to Pinecone for scale).

I am using the text-embedding-ada-002 model for my embeddings.

I am having trouble getting accurate results. The similarity scores are always very close to each other regardless of the query I put in, which makes me think my embeddings are all very similar. I’m not sure whether the problem is in my prompting, the data I am passing in to be embedded, or the method itself.

For context, I am using product reviews data, and my end goal is to be able to return top k relevant products based on the specified query. I’ve attached a sample product review data structure in JSON format:

{
    "product_name": "SuperWidget 3000",
    "reviewer_email": "reviewer@example.com",
    "review_id": "1234-ABCD",
    "evaluation": {
        "overall_score": 4.5,
        "comments": [
            {
                "text": "Excellent build quality and performance.",
                "sentiment": 1
            },
            {
                "text": "A bit pricey for the features offered.",
                "sentiment": -1
            }
        ],
        "tags": ["Build Quality", "Performance", "Price"],
        "highlights": [
            "Excellent build quality.",
            "Great performance."
        ],
        "lowlights": [
            "Expensive.",
            "Limited feature set."
        ],
        "summary": "Overall, a great product with excellent build quality and performance, but a bit expensive for the features it offers."
    },
    "review_details": [
        {
            "question_id": "971847C7",
            "question": "What do you think about the build quality?",
            "evaluation_criteria": "Looking for comments on durability and materials used.",
            "score_of_1": null,
            "score_of_5": 5,
            "evaluation_score": 5,
            "evaluation_details": "The build quality is excellent, very sturdy and durable.",
            "review_text": "The build quality is excellent, very sturdy and durable."
        },
        {
            "question_id": "1B7AD751",
            "question": "How would you rate the performance?",
            "evaluation_criteria": "Looking for smoothness of operation and speed.",
            "score_of_1": null,
            "score_of_5": 4,
            "evaluation_score": 4,
            "evaluation_details": "The performance is top-notch, handles all tasks smoothly.",
            "review_text": "The performance is top-notch, handles all tasks smoothly."
        },
        {
            "question_id": "812CDFE0",
            "question": "Is the product worth its price?",
            "evaluation_criteria": "Checking for value for money.",
            "score_of_1": null,
            "score_of_5": 3,
            "evaluation_score": 3,
            "evaluation_details": "A bit pricey for the features offered.",
            "review_text": "A bit pricey for the features offered."
        }
    ],
    "review_complete": true,
    "tags": [],
    "timestamp": "2024-05-15T15:27:16.794000",
    "is_new": true,
    "request_feedback": true,
    "feedback_score": null
}

For the input data, my current setup takes each long JSON review document and flattens it into a single long string, which is the text that gets embedded. Here is the format for one review:

concatenated_text = (
    f"Product Name: {product_name} | Reviewer Email: {reviewer_email} | Review ID: {review_id} | "
    f"Review Complete: {review_complete} | Tags: {tags} | Timestamp: {timestamp} | Is New: {is_new} | "
    f"Request Feedback: {request_feedback} | Feedback Score: {feedback_score} | Overall Score: {overall_score} | "
    f"Comments: {comments} | Highlights: {highlights} | Lowlights: {lowlights} | Summary: {summary} | "
    f"Review Details: {review_details}"
)

This is my main code for now:

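# Imports assume llama-index >= 0.10 (the llama_index.core package layout).
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI as LlamaOpenAI

# EmbeddingService, data, create_vector_document and prompt are my own helpers (not shown here).
Settings.llm = LlamaOpenAI(model="gpt-4-turbo-preview", temperature=0, max_tokens=1024)
Settings.embed_model = OpenAIEmbedding()
llm = Settings.llm
embed_model = Settings.embed_model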

embedding_service = EmbeddingService()

docs = data()
vector_documents = []
for doc in docs:
    vector_doc = create_vector_document(doc, embedding_service)
    vector_documents.append(vector_doc)

index = VectorStoreIndex.from_documents(vector_documents, embed_model=embed_model, llm=llm)

def query_index(index, prompt, top_k=5):
    query_engine = index.as_query_engine(similarity_top_k=top_k)
    response = query_engine.query(prompt)
    
    relevant_documents = []
    if response.source_nodes:
        for node in response.source_nodes:
            relevant_documents.append({
                "id": node.node.metadata['id'],
                "name": node.node.metadata['name'],
                "text": node.node.text,
                "score": node.score,
                "match_score": str(node.node.metadata.get('match_score', 0))
            })

    return relevant_documents

res = query_index(index, prompt())
for doc in res:
    print(f"Score: {doc['score']}, Reviewer ID: {doc['id']}")
    print('\n')

These are some example queries I have been testing with:

  • positive tags
  • reliable
  • good customer service

I just need some direction on where to go next, whether it’s:

  • further fine-tuning
  • cleaning my data (removing unnecessary noise)
  • technical implementation issues
  • others

If you made it this far, thanks a lot!

You need to reduce the noise.

Instead of grouping everything together, simply embed the review text and then link each embedding to the product name.

Separate the concerns.
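A minimal sketch of that separation, assuming the field names from the sample review in the question. The helper name `split_review` and the `kind` label are made up for illustration, and dropping repeated text is an extra assumption, not something required by the advice. Only free-text fields become embeddable text; the product name and review ID travel as metadata.

```python
from typing import Any, Dict, List, Optional


def split_review(review: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Split one review document into small, single-concern text records.

    Hypothetical helper: field names follow the sample JSON in the question.
    Emails, timestamps and flags are left out of the text on purpose.
    """
    evaluation = review.get("evaluation") or {}
    base_metadata = {
        "product_name": review.get("product_name"),
        "review_id": review.get("review_id"),
    }
    records: List[Dict[str, Any]] = []
    seen = set()  # the sample repeats some sentences across fields

    def add(kind: str, text: Optional[str]) -> None:
        cleaned = (text or "").strip()
        if not cleaned or cleaned in seen:
            return  # skip empty or already-seen text
        seen.add(cleaned)
        records.append({"text": cleaned, "metadata": {**base_metadata, "kind": kind}})

    add("summary", evaluation.get("summary"))
    for comment in evaluation.get("comments") or []:
        add("comment", comment.get("text"))
    for text in evaluation.get("highlights") or []:
        add("highlight", text)
    for text in evaluation.get("lowlights") or []:
        add("lowlight", text)
    for detail in review.get("review_details") or []:
        add("answer", detail.get("review_text"))
    return records
```

Each record's `text` would then be embedded on its own, with `metadata` stored next to the vector so a hit can be traced back to its product.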

You can combine embeddings in many ways, so it’s better to create groupings of single-concern embeddings. Product names are meaningless for this, but you could do some fun things, like seeing how the embedding engine “feels” about the product names.
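One of the simplest ways to combine per-review embeddings is to average them per product and rank products by cosine similarity to the query vector. The sketch below is only an illustration of that idea: the function names are hypothetical, averaging is one choice among many (max or weighted pooling are others), and the vectors you pass in would come from your embedding model.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def product_centroids(
    records: Sequence[Tuple[str, Sequence[float]]]
) -> Dict[str, List[float]]:
    """Average all review vectors that belong to the same product name."""
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for product_name, vector in records:
        grouped[product_name].append(vector)
    return {
        name: [sum(column) / len(vectors) for column in zip(*vectors)]
        for name, vectors in grouped.items()
    }


def rank_products(
    query_vector: Sequence[float],
    centroids: Dict[str, Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top_k (product_name, score) pairs, best match first."""
    scored = [
        (name, cosine_similarity(query_vector, centroid))
        for name, centroid in centroids.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Because each review vector is stored separately, you can swap the averaging step for another pooling strategy later without re-embedding anything.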

I’d recommend Weaviate. They offer a database that accepts your schema and can also embed items individually and in groups.
