Should I use Embedding to search for duplicate reports?

I’m still reading the OpenAI API documentation. But in the meantime, in case anyone is already familiar with it, I was wondering if Embeddings would be a good solution for a feature that detects possible duplicate bug reports.

In this case, for personal purposes, I would like to create a tool (for the Blender software page) that reads a user’s bug report and suggests confirmed reports that the user’s report may be a duplicate of.

I’ve already created this tool by using the Chat Completions API and the following prompt:

You will be provided with a report containing a title and description of a bug.
With the title and description of this bug report, check if it could be a duplicate of any of these last ${issues.length} confirmed reports:
${titlesString}

To clarify, a duplicate report refers to a bug that has already been reported with the same defect and source.
Specify the titles of existing reports that could potentially make this bug report a duplicate.

It works, but it’s a little flawed, and it uses a lot of tokens because the full titlesString list would exceed 500 report titles (I limited it to 300).

To avoid spending even more tokens, I didn’t even provide the report contents.

I don’t understand much about Embeddings, so I don’t know yet if it really makes sense to use this API for that. Is it worth investing time in it?

Hi and welcome to the developer forum!

I think this is certainly a task that Embeddings could be useful for. They turn the semantic meaning of a group of words into a vector that can be compared to other vectors stored in a database; roughly speaking, content with similar semantic meaning is “close” together, so you can pull back “similar” results for a given piece of input text.
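
To make that concrete, here is a minimal sketch of how two embedding vectors are usually compared with cosine similarity (the function name and the threshold idea are just placeholders, not anything specific from the API):

function cosineSimilarity(a, b) {
  // Accumulate the dot product and the squared magnitudes in one pass.
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Values near 1.0 mean very similar meaning; lower values mean less related.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// e.g. embed the new report once, then compare it against every stored
// embedding and keep the reports whose similarity is above some threshold.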

There is a section on embeddings over on the main OpenAI Platform site.

Happy to give any advice you might need to get up and running.

Thanks for the reply @Foxalabs!

I tried to use the Embeddings API.

I noticed, however, that I still won’t be able to include the descriptions of the reports in the search, because of this error:

This model's maximum context length is 8191 tokens, however you requested 124432 tokens (124432 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.

(If we consider that I only included 500 of the almost 6,000 reports, this is really unfeasible.)
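
One workaround I’m considering is to roughly cap each report’s text and send the reports in smaller batches, so no single input goes over the model’s limit. This is just a sketch under rough assumptions (the ~4 characters per token estimate and the batch size are guesses; a real tokenizer such as a JS port of tiktoken would be more accurate), using the getEmbedTexts function I show below:

// Very rough truncation: ~4 characters per token is only an estimate.
const MAX_INPUT_TOKENS = 8000;
function truncateForEmbedding(text) {
  return text.slice(0, MAX_INPUT_TOKENS * 4);
}

// Send the reports in batches so each request stays a manageable size.
async function embedAllReports(reportTexts, batchSize = 100) {
  const embeddings = [];
  for (let i = 0; i < reportTexts.length; i += batchSize) {
    const batch = reportTexts.slice(i, i + batchSize).map(truncateForEmbedding);
    embeddings.push(...await getEmbedTexts(batch));
  }
  return embeddings;
}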

However, since embeddings only need to be obtained once, this can apparently serve as a cache and may save some tokens at the end of the day.

This is what I did to use the API:

async function getEmbedTexts(texts) {
  // Read the API key saved earlier by the extension.
  const stored = await new Promise(resolve => chrome.storage.local.get('openai_secret_key', resolve));
  const OPENAI_API_KEY = stored.openai_secret_key;
  if (!OPENAI_API_KEY) {
    console.error('No OpenAI Key.');
    return 'Please enter and save the OpenAI Key first.';
  }

  const apiUrl = 'https://api.openai.com/v1/embeddings';
  const requestOptions = {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    // "input" can be a single string or an array of strings;
    // the API returns one embedding per input, in the same order.
    body: JSON.stringify({
      input: texts,
      model: "text-embedding-ada-002"
    }),
  };

  try {
    const response = await fetch(apiUrl, requestOptions);
    const result = await response.json();
    // result.data is an array of { index, embedding, ... } objects.
    return result.data;
  } catch (error) {
    console.error('Error making request to OpenAI EmbedTexts:', error);
    return 'Sorry, something went wrong. Please try again later.';
  }
}
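
And this is roughly how I intend to rank the confirmed reports against a new one (findClosestReports and confirmedReports are just my own placeholder names, and cosineSimilarity is the kind of helper sketched earlier in the thread):

async function findClosestReports(newReportText, confirmedReports, topN = 5) {
  // Embed the new report together with the confirmed report titles in one call.
  const texts = [newReportText, ...confirmedReports.map(r => r.title)];
  const embeddings = await getEmbedTexts(texts);
  const newVector = embeddings[0].embedding;

  // Score each confirmed report by similarity to the new report.
  const scored = confirmedReports.map((report, i) => ({
    report,
    score: cosineSimilarity(newVector, embeddings[i + 1].embedding),
  }));

  // The highest-scoring reports are the duplicate candidates to suggest.
  return scored.sort((a, b) => b.score - a.score).slice(0, topN);
}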

I had made a mistake by bundling more than just the description into a report’s title. That error has been fixed now.
However, the number of tokens is still considerable. (I will have to improve the cache.)
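
For the cache, I’m thinking of something along these lines: store each report’s embedding in chrome.storage.local keyed by report ID, so a report only ever gets embedded once (the embedding_cache key and the id field are just my own placeholder names):

// Look up cached embeddings and only request the ones that are missing.
async function getEmbeddingsWithCache(reports) {
  const stored = await new Promise(resolve => chrome.storage.local.get('embedding_cache', resolve));
  const cache = stored.embedding_cache || {};

  const missing = reports.filter(r => !cache[r.id]);
  if (missing.length > 0) {
    const embeddings = await getEmbedTexts(missing.map(r => r.title));
    missing.forEach((report, i) => { cache[report.id] = embeddings[i].embedding; });
    // Persist the updated cache for next time.
    await new Promise(resolve => chrome.storage.local.set({ embedding_cache: cache }, resolve));
  }

  return reports.map(r => cache[r.id]);
}

With thousands of reports and 1,536-dimensional vectors this could outgrow chrome.storage.local’s quota, so I may end up needing IndexedDB or a proper vector database instead.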