Duplicate comparison score

I am working on a project with my team. We have a use case which I am planning to solve using AI.
I wanted your advice on whether you can guide me.

Scenario - We have a repository where we store the API specifications for every API that we build.
We suspect that many of these API specifications are duplicates of each other in terms of functionality.
We are trying to detect that.

Approach - We were planning to send 2 API specifications at a time via an API call to GPT-4 and get back a comparison score.
Do you think this would be a wise approach?

Considerations - We have 15000 API specifications and need to compare each one of them against the rest to find the duplicates.
The API specifications can have different names for attributes, and the descriptions can vary as well.

Can you advise how to approach this?

Welcome to the community!

I believe that’s ~112,500,000 comparisons (15000 × 14999 / 2)? :thinking:

have you considered maybe

  1. using the LLM to generate keywords to build an index
  2. using the LLM to generate homogeneous descriptions and then using embeddings to build an index

and using either of these indices to narrow down your comparison choices? :thinking:
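
For option 2, a minimal sketch of the normalization step (untested; the model name and prompt are placeholders you’d adapt):

```python
# Sketch: normalize each spec into a short canonical description before
# indexing, so that attribute names and wording stop mattering.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def homogenize(spec_text: str) -> str:
    """Ask the LLM for a terse, normalized summary of one API spec."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Summarize this API spec's functionality in 3-5 "
                        "sentences. Ignore attribute names and phrasing; "
                        "describe only what the API does."},
            {"role": "user", "content": spec_text},
        ],
    )
    return response.choices[0].message.content
```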

Yes, I have thought of classifying and grouping them. But is that it? Will that be the best and most efficient way? I am not sure whether creating embeddings for all 15000 specifications, storing them in a vector database, and then sending them for comparison is a feasible idea. I am just thinking through all this and really want to make sure I am not taking an inefficient approach here.

what do you define as inefficient? If I understand your OP, you’d consider making ~100 million LLM calls to be more efficient than making around 30000? :thinking:

I guess it’s a matter of balancing engineering cost vs API cost.

Inefficient as in

  1. It should require as few iterations as possible, since the payloads are large.
  2. A specification that has already been converted to an embedding for comparison should not be embedded again in later iterations.
  3. The matching score should be consistent, and the results should not be unpredictable.

There’s no real free lunch here, but

  1. It should require as few iterations as possible, since the payloads are large.

you’d need at least 15000 completion calls, imo, to prepare the API specs to be embedded

  2. A specification that has already been converted to an embedding for comparison should not be embedded again in later iterations.

yeah, you only need to embed each one once. 15000 embedding calls. your vectordb handles the rest, you don’t need to embed stuff multiple times.
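
Something like this keeps it to one embedding per spec (sketch only; the embedding model name is an assumption, and the JSON file is just a stand-in for your vector DB’s own persistence):

```python
# Sketch of "embed once": cache vectors on disk keyed by spec id so a
# re-run never re-embeds anything it has already seen.
import json
import os

from openai import OpenAI

client = OpenAI()
CACHE_PATH = "embeddings.json"

def embed_all(descriptions: dict[str, str]) -> dict[str, list[float]]:
    cache = json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}
    for spec_id, text in descriptions.items():
        if spec_id in cache:
            continue  # already embedded in an earlier run, skip it
        result = client.embeddings.create(
            model="text-embedding-3-small",  # placeholder model name
            input=text,
        )
        cache[spec_id] = result.data[0].embedding
    with open(CACHE_PATH, "w") as f:
        json.dump(cache, f)
    return cache
```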

  3. The matching score should be consistent, and the results should not be unpredictable.

yeah, you need to do some engineering to figure out your thresholds for retrieval, but that’s where you can do a post processing step where you’re comparing the most likely candidates with each other. That would take no more than 15000 completion calls.
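
The post-processing step could look roughly like this (hypothetical helper; the threshold and prompt are things you’d tune):

```python
# Sketch: only pairs whose embedding similarity clears a threshold get
# sent to the LLM for a final yes/no duplicate verdict.
from openai import OpenAI

client = OpenAI()

def confirm_duplicates(candidate_pairs, specs, threshold=0.90):
    """candidate_pairs: (id_a, id_b, cosine_score) triples from the index."""
    duplicates = []
    for id_a, id_b, score in candidate_pairs:
        if score < threshold:
            continue  # similarity too low to be worth a completion call
        verdict = client.chat.completions.create(
            model="gpt-4",  # placeholder model name
            messages=[{
                "role": "user",
                "content": ("Do these two API specs describe the same "
                            "functionality? Answer YES or NO.\n\n"
                            f"Spec A:\n{specs[id_a]}\n\nSpec B:\n{specs[id_b]}"),
            }],
        ).choices[0].message.content
        if verdict.strip().upper().startswith("YES"):
            duplicates.append((id_a, id_b))
    return duplicates
```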

so: 15000 - 30000 completions, 15000 embedding calls, and some vector index calls (basically free, just run the 100 million dot products on your laptop during lunch break).
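
The dot-product part really is cheap; with 15000 vectors it’s one matrix multiply in numpy (sketch, assuming the embeddings fit in memory, which at this size they comfortably do):

```python
# Sketch: all-pairs cosine similarity for ~15000 vectors, then each
# spec's top-k nearest neighbours as candidate duplicate pairs.
import numpy as np

def top_candidate_pairs(vectors: np.ndarray, k: int = 5):
    """vectors: (n_specs, dim) array of embeddings."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T                      # all pairwise dot products at once
    np.fill_diagonal(sims, -1.0)              # never match a spec with itself
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k best matches
    return [(i, int(j), float(sims[i, j]))
            for i in range(len(sims))
            for j in topk[i]]
```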

Thank you @Diet. This looks like a promising approach. Let me try that.
