Should I use Embedding to search for duplicate reports?

I’m still reading the OpenAI API documentation. But in the meantime, in case anyone is already familiar with it, I was wondering if Embeddings would be a good solution for a feature that detects possible duplicate bug reports.

In this case, for personal purposes, I would like to create a tool (for the Blender software page) that reads a user’s bug report and suggests confirmed reports that the user’s report may be a duplicate of.

I’ve already created this tool by using the Chat Completions API and the following prompt:

You will be provided with a report containing a title and description of a bug.
With the title and description of this bug report, check if it could be a duplicate of any of these last ${issues.length} confirmed reports:
${titlesString}

To clarify, a duplicate report refers to a bug that has already been reported with the same defect and source.
Specify the titles of existing reports that could potentially make this bug report a duplicate.

It works, but it’s a little flawed, and it uses a lot of tokens because the full titlesString list would exceed 500 report titles (I limited it to 300).

To avoid spending even more tokens, I didn’t even provide the report contents.

I don’t understand much about Embeddings, so I don’t know yet if it really makes sense to use this API for that. Is it worth investing time in it?

Hi and welcome to the developer forum!

I think this is certainly a task that Embeddings could be useful for. They turn the semantic meaning of a group of words into a vector that can be compared to other vectors stored in a database; roughly speaking, content with similar semantic meaning is “close” together, so you can pull back “similar” results for a given piece of input text.
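
To make that concrete, here is a minimal sketch of how two embedding vectors are usually compared with cosine similarity (the function name and the threshold idea are just placeholders, not anything specific from the API):

function cosineSimilarity(a, b) {
  // Accumulate the dot product and the squared magnitudes in one pass.
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Values near 1.0 mean very similar meaning; lower values mean less related.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// e.g. embed the new report once, then compare it against every stored
// embedding and keep the reports whose similarity is above some threshold.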

There is a section on embeddings over on the main OpenAI Platform site.

Happy to give any advice you might need to get up and running.

Thanks for the reply @Foxalabs!

I tried to use the Embeddings API.

I noticed, however, that I still won’t be able to include the descriptions of the reports in the search, because of this error:

This model's maximum context length is 8191 tokens, however you requested 124432 tokens (124432 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.

(If we consider that I only included 500 of the almost 6,000 reports, this is really unfeasible.)
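
One workaround I’m considering is to roughly cap each report’s text and send the reports in smaller batches, so no single input goes over the model’s limit. This is just a sketch under rough assumptions (the ~4 characters per token estimate and the batch size are guesses; a real tokenizer such as a JS port of tiktoken would be more accurate), using the getEmbedTexts function I show below:

// Very rough truncation: ~4 characters per token is only an estimate.
const MAX_INPUT_TOKENS = 8000;
function truncateForEmbedding(text) {
  return text.slice(0, MAX_INPUT_TOKENS * 4);
}

// Send the reports in batches so each request stays a manageable size.
async function embedAllReports(reportTexts, batchSize = 100) {
  const embeddings = [];
  for (let i = 0; i < reportTexts.length; i += batchSize) {
    const batch = reportTexts.slice(i, i + batchSize).map(truncateForEmbedding);
    embeddings.push(...await getEmbedTexts(batch));
  }
  return embeddings;
}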

However, since embeddings only need to be obtained once, this can apparently serve as a cache and may save some tokens at the end of the day.

This is what I did to use the API:

async function getEmbedTexts(texts) {
  // Read the API key saved earlier by the extension.
  const stored = await new Promise(resolve => chrome.storage.local.get('openai_secret_key', resolve));
  const OPENAI_API_KEY = stored.openai_secret_key;
  if (!OPENAI_API_KEY) {
    console.error('No OpenAI Key.');
    return 'Please enter and save the OpenAI Key first.';
  }

  const apiUrl = 'https://api.openai.com/v1/embeddings';
  const requestOptions = {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    // "input" can be a single string or an array of strings;
    // the API returns one embedding per input, in the same order.
    body: JSON.stringify({
      input: texts,
      model: "text-embedding-ada-002"
    }),
  };

  try {
    const response = await fetch(apiUrl, requestOptions);
    const result = await response.json();
    // result.data is an array of { index, embedding, ... } objects.
    return result.data;
  } catch (error) {
    console.error('Error making request to OpenAI EmbedTexts:', error);
    return 'Sorry, something went wrong. Please try again later.';
  }
}
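
And this is roughly how I intend to rank the confirmed reports against a new one (findClosestReports and confirmedReports are just my own placeholder names, and cosineSimilarity is the kind of helper sketched earlier in the thread):

async function findClosestReports(newReportText, confirmedReports, topN = 5) {
  // Embed the new report together with the confirmed report titles in one call.
  const texts = [newReportText, ...confirmedReports.map(r => r.title)];
  const embeddings = await getEmbedTexts(texts);
  const newVector = embeddings[0].embedding;

  // Score each confirmed report by similarity to the new report.
  const scored = confirmedReports.map((report, i) => ({
    report,
    score: cosineSimilarity(newVector, embeddings[i + 1].embedding),
  }));

  // The highest-scoring reports are the duplicate candidates to suggest.
  return scored.sort((a, b) => b.score - a.score).slice(0, topN);
}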

I had made a mistake by bundling more than just the description into a report’s title. That error has been fixed now.
However, the number of tokens is still considerable. (I will have to improve the cache.)
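
For the cache, I’m thinking of something along these lines: store each report’s embedding in chrome.storage.local keyed by report ID, so a report only ever gets embedded once (the embedding_cache key and the id field are just my own placeholder names):

// Look up cached embeddings and only request the ones that are missing.
async function getEmbeddingsWithCache(reports) {
  const stored = await new Promise(resolve => chrome.storage.local.get('embedding_cache', resolve));
  const cache = stored.embedding_cache || {};

  const missing = reports.filter(r => !cache[r.id]);
  if (missing.length > 0) {
    const embeddings = await getEmbedTexts(missing.map(r => r.title));
    missing.forEach((report, i) => { cache[report.id] = embeddings[i].embedding; });
    // Persist the updated cache for next time.
    await new Promise(resolve => chrome.storage.local.set({ embedding_cache: cache }, resolve));
  }

  return reports.map(r => cache[r.id]);
}

With thousands of reports and 1,536-dimensional vectors this could outgrow chrome.storage.local’s quota, so I may end up needing IndexedDB or a proper vector database instead.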