Duplicate comparison score

I am working on a project with my team. We have a use case which I am planning to solve using AI.
I wanted your advice on whether you can guide me.

Scenario - We have a repository where we store the API specifications for every API that we build.
We suspect that many of these API specifications are duplicates of each other in terms of functionality.
We are trying to detect that.

Approach - We were planning to send 2 API specifications at a time via an API call to GPT-4 and get back a comparison score.
Do you think this would be a wise approach?

Considerations - We have 15000 API specifications and need to compare each one of them against the rest to find the duplicates.
The API specifications can have different names for attributes, and the descriptions can vary as well.

Can you advise how to approach this?

Welcome to the community!

I believe that’s ~112,500,000 comparisons (15000 × 14999 / 2)? :thinking:

have you considered maybe

  1. using the LLM to generate keywords to build an index
  2. using the LLM to generate homogeneous descriptions and then using embeddings to build an index

and using either of these indices to narrow down your comparison choices? :thinking:
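
For option 2, a minimal sketch of the normalization step (untested; the model name and prompt are placeholders you’d adapt):

```python
# Sketch: normalize each spec into a short canonical description before
# indexing, so that attribute names and wording stop mattering.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def homogenize(spec_text: str) -> str:
    """Ask the LLM for a terse, normalized summary of one API spec."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Summarize this API spec's functionality in 3-5 "
                        "sentences. Ignore attribute names and phrasing; "
                        "describe only what the API does."},
            {"role": "user", "content": spec_text},
        ],
    )
    return response.choices[0].message.content
```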

Yes, I have thought of classifying and grouping them. But is that it? Will that be the best and most efficient way? I am not sure whether creating embeddings for all 15000 specifications, storing them in a vector database, and then sending them for comparison is a feasible idea. I am just thinking through all this and really want to make sure I am not taking an inefficient approach here.

what do you define as inefficient? If I understand your OP, you’d consider making ~100 million LLM calls to be more efficient than making around 30000? :thinking:

I guess it’s a matter of balancing engineering cost vs API cost.

Inefficient as in

  1. It should require as few iterations as possible, since the payloads are large.
  2. A specification that has already been converted to an embedding for comparison should not be embedded again in later iterations.
  3. The matching score should be consistent, and the results should not be unpredictable.

There’s no real free lunch here, but

  1. It should require as few iterations as possible, since the payloads are large.

you’d need at least 15000 completion calls, imo, to prepare the API specs to be embedded

  2. A specification that has already been converted to an embedding for comparison should not be embedded again in later iterations.

yeah, you only need to embed each one once. 15000 embedding calls. your vectordb handles the rest, you don’t need to embed stuff multiple times.
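
Something like this keeps it to one embedding per spec (sketch only; the embedding model name is an assumption, and the JSON file is just a stand-in for your vector DB’s own persistence):

```python
# Sketch of "embed once": cache vectors on disk keyed by spec id so a
# re-run never re-embeds anything it has already seen.
import json
import os

from openai import OpenAI

client = OpenAI()
CACHE_PATH = "embeddings.json"

def embed_all(descriptions: dict[str, str]) -> dict[str, list[float]]:
    cache = json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}
    for spec_id, text in descriptions.items():
        if spec_id in cache:
            continue  # already embedded in an earlier run, skip it
        result = client.embeddings.create(
            model="text-embedding-3-small",  # placeholder model name
            input=text,
        )
        cache[spec_id] = result.data[0].embedding
    with open(CACHE_PATH, "w") as f:
        json.dump(cache, f)
    return cache
```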

  3. The matching score should be consistent, and the results should not be unpredictable.

yeah, you need to do some engineering to figure out your thresholds for retrieval, but that’s where you can do a post processing step where you’re comparing the most likely candidates with each other. That would take no more than 15000 completion calls.
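
The post-processing step could look roughly like this (hypothetical helper; the threshold and prompt are things you’d tune):

```python
# Sketch: only pairs whose embedding similarity clears a threshold get
# sent to the LLM for a final yes/no duplicate verdict.
from openai import OpenAI

client = OpenAI()

def confirm_duplicates(candidate_pairs, specs, threshold=0.90):
    """candidate_pairs: (id_a, id_b, cosine_score) triples from the index."""
    duplicates = []
    for id_a, id_b, score in candidate_pairs:
        if score < threshold:
            continue  # similarity too low to be worth a completion call
        verdict = client.chat.completions.create(
            model="gpt-4",  # placeholder model name
            messages=[{
                "role": "user",
                "content": ("Do these two API specs describe the same "
                            "functionality? Answer YES or NO.\n\n"
                            f"Spec A:\n{specs[id_a]}\n\nSpec B:\n{specs[id_b]}"),
            }],
        ).choices[0].message.content
        if verdict.strip().upper().startswith("YES"):
            duplicates.append((id_a, id_b))
    return duplicates
```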

so: 15000 - 30000 completions, 15000 embedding calls, and some vector index calls (basically free, just run the 100 million dot products on your laptop during lunch break).
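
The dot-product part really is cheap; with 15000 vectors it’s one matrix multiply in numpy (sketch, assuming the embeddings fit in memory, which at this size they comfortably do):

```python
# Sketch: all-pairs cosine similarity for ~15000 vectors, then each
# spec's top-k nearest neighbours as candidate duplicate pairs.
import numpy as np

def top_candidate_pairs(vectors: np.ndarray, k: int = 5):
    """vectors: (n_specs, dim) array of embeddings."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T                      # all pairwise dot products at once
    np.fill_diagonal(sims, -1.0)              # never match a spec with itself
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k best matches
    return [(i, int(j), float(sims[i, j]))
            for i in range(len(sims))
            for j in topk[i]]
```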

Thank you @Diet. This looks like a promising approach. Let me try that.
