Hi,
I am currently working on a semantic search engine that uses OpenAI embeddings together with LlamaIndex's VectorStoreIndex (for testing purposes; I might switch to Pinecone at scale) to retrieve relevant results for a given query.
I am using the text-embedding-ada-002 model for my embeddings.
I'm having trouble getting accurate results. My embeddings also seem to be very similar regardless of the query I put in: the similarity scores are always very close to each other. I'm not sure whether the problem is in my prompting, in the data I'm passing in to be embedded, or in my method itself.
For context, I am working with product review data, and my end goal is to return the top-k relevant products for a given query. Here is a sample product review structure in JSON format:
{
"product_name": "SuperWidget 3000",
"reviewer_email": "reviewer@example.com",
"review_id": "1234-ABCD",
"evaluation": {
"overall_score": 4.5,
"comments": [
{
"text": "Excellent build quality and performance.",
"sentiment": 1
},
{
"text": "A bit pricey for the features offered.",
"sentiment": -1
}
],
"tags": ["Build Quality", "Performance", "Price"],
"highlights": [
"Excellent build quality.",
"Great performance."
],
"lowlights": [
"Expensive.",
"Limited feature set."
],
"summary": "Overall, a great product with excellent build quality and performance, but a bit expensive for the features it offers."
},
"review_details": [
{
"question_id": "971847C7",
"question": "What do you think about the build quality?",
"evaluation_criteria": "Looking for comments on durability and materials used.",
"score_of_1": null,
"score_of_5": 5,
"evaluation_score": 5,
"evaluation_details": "The build quality is excellent, very sturdy and durable.",
"review_text": "The build quality is excellent, very sturdy and durable."
},
{
"question_id": "1B7AD751",
"question": "How would you rate the performance?",
"evaluation_criteria": "Looking for smoothness of operation and speed.",
"score_of_1": null,
"score_of_5": 4,
"evaluation_score": 4,
"evaluation_details": "The performance is top-notch, handles all tasks smoothly.",
"review_text": "The performance is top-notch, handles all tasks smoothly."
},
{
"question_id": "812CDFE0",
"question": "Is the product worth its price?",
"evaluation_criteria": "Checking for value for money.",
"score_of_1": null,
"score_of_5": 3,
"evaluation_score": 3,
"evaluation_details": "A bit pricey for the features offered.",
"review_text": "A bit pricey for the features offered."
}
],
"review_complete": true,
"tags": [],
"timestamp": "2024-05-15T15:27:16.794000",
"is_new": true,
"request_feedback": true,
"feedback_score": null
}
For the input data, my current setup takes the long JSON document and formats each review into a single long string, which is what gets embedded. Here is the format for one review:
concatenated_text = (
f"Product Name: {product_name} | Reviewer Email: {reviewer_email} | Review ID: {review_id} | "
f"Review Complete: {review_complete} | Tags: {tags} | Timestamp: {timestamp} | Is New: {is_new} | "
f"Request Feedback: {request_feedback} | Feedback Score: {feedback_score} | Overall Score: {overall_score} | "
f"Comments: {comments} | Highlights: {highlights} | Lowlights: {lowlights} | Summary: {summary} | "
f"Review Details: {review_details}"
)
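For comparison, I've been considering a stripped-down builder that embeds only the fields with actual semantic content and drops the metadata (the helper name and field choices below are placeholders, not my real code):

```python
def build_embedding_text(review: dict) -> str:
    """Build the string to embed from only the fields that carry
    semantic signal. Emails, IDs, timestamps, and booleans are
    dropped: they add near-identical boilerplate to every document,
    which could be why all my embeddings look alike.
    (Helper name and field choices are my own placeholders.)
    """
    ev = review.get("evaluation", {})
    parts = [
        f"Product: {review.get('product_name', '')}",
        f"Summary: {ev.get('summary', '')}",
        "Highlights: " + "; ".join(ev.get("highlights", [])),
        "Lowlights: " + "; ".join(ev.get("lowlights", [])),
        "Comments: " + "; ".join(c["text"] for c in ev.get("comments", [])),
        "Tags: " + ", ".join(ev.get("tags", [])),
    ]
    # Append the free-text answers from each per-question entry.
    parts += [d["review_text"] for d in review.get("review_details", [])]
    return " | ".join(p for p in parts if p)
```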
This is my main code for now (EmbeddingService, data, create_vector_document, and prompt are my own helpers):

from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.openai import OpenAI as LlamaOpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Models registered globally via Settings, so the index picks them up.
Settings.llm = LlamaOpenAI(model="gpt-4-turbo-preview", temperature=0, max_tokens=1024)
Settings.embed_model = OpenAIEmbedding()

embedding_service = EmbeddingService()
docs = data()
vector_documents = [create_vector_document(doc, embedding_service) for doc in docs]

index = VectorStoreIndex.from_documents(vector_documents)

def query_index(index, prompt, top_k=5):
    query_engine = index.as_query_engine(similarity_top_k=top_k)
    response = query_engine.query(prompt)
    relevant_documents = []
    for node in response.source_nodes:
        relevant_documents.append({
            "id": node.node.metadata["id"],
            "name": node.node.metadata["name"],
            "text": node.node.text,
            "score": node.score,
            "match_score": str(node.node.metadata.get("match_score", 0)),
        })
    return relevant_documents

res = query_index(index, prompt())
for doc in res:
    print(f"Score: {doc['score']}, Reviewer ID: {doc['id']}")
These are some example queries I have been testing with:
- positive tags
- reliable
- good customer service
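As a sanity check on the "scores are always close" symptom, I've also been thinking of measuring the pairwise similarity of the stored document embeddings themselves, independent of any query (toy vectors below; with the real index I'd pass the ada-002 vectors):

```python
import numpy as np

def pairwise_cosine_stats(vectors):
    """Mean/min/max pairwise cosine similarity across a set of
    embeddings. If the mean is very high and the spread tiny, the
    embedded texts are dominated by shared boilerplate (IDs, field
    labels, timestamps) rather than distinguishing content.
    """
    X = np.asarray(vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize
    sims = X @ X.T
    iu = np.triu_indices(len(X), k=1)  # upper triangle: distinct pairs
    vals = sims[iu]
    return float(vals.mean()), float(vals.min()), float(vals.max())

# Toy 3-d vectors just to show the shape of the check.
mean, lo, hi = pairwise_cosine_stats([[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0]])
print(f"mean={mean:.3f} min={lo:.3f} max={hi:.3f}")
```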
I just need direction on where to go, whether it's:
- further fine-tuning
- cleaning my data (removing unnecessary noise)
- technical implementation issues
- others
If you got to here, thanks a lot!