Embeddings -> Completions: how does token usage work?

I have been experimenting with text-embedding-3-small to simulate long-term memory in my main chatbot, but despite reading the documentation (which, I have to say, is very chaotic) I still cannot figure out how the use of embeddings affects the number of tokens used in Completions.

In my code I am using these functions to add conversations and files to the long-term memory:

// ─── Long-Term Semantic Memory ────────────────────────────────

// cosine similarity between two equal-length embedding vectors
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const normB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dot / (normA * normB);
}

// embed one user/bot exchange and append it to the stored memory list
async function saveToLongMemory(userInput, botResponse) {
  const apikey = localStorage.getItem("openaikey");
  const testo = `Utente: ${userInput}\nBot: ${botResponse}`;
  try {
    const res = await fetch("https://api.openai.com/v1/embeddings", {
      method: "POST",
      headers: { "Content-Type": "application/json", Authorization: "Bearer " + apikey },
      body: JSON.stringify({ model: "text-embedding-3-small", input: testo })
    });
    const data = await res.json();
    const vettore = data.data[0].embedding;
    const memoria = JSON.parse(localStorage.getItem("vivacityLongMemory") || "[]");
    memoria.push({ testo, vettore, data: new Date().toISOString() });
    localStorage.setItem("vivacityLongMemory", JSON.stringify(memoria));
  } catch (e) { console.error("saveToLongMemory error:", e); }
}

// embed the query, rank stored memories by similarity, return the top-K as context
async function searchLongMemory(query, topK = 3) {
  const apikey = localStorage.getItem("openaikey");
  const memoria = JSON.parse(localStorage.getItem("vivacityLongMemory") || "[]");
  if (memoria.length === 0) return "";
  try {
    const res = await fetch("https://api.openai.com/v1/embeddings", {
      method: "POST",
      headers: { "Content-Type": "application/json", Authorization: "Bearer " + apikey },
      body: JSON.stringify({ model: "text-embedding-3-small", input: query })
    });
    const data = await res.json();
    const queryVettore = data.data[0].embedding;
    const risultati = memoria
      .map(item => ({ testo: item.testo, data: item.data, score: cosineSimilarity(queryVettore, item.vettore) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
    if (risultati[0].score < 0.5) return ""; // threshold: ignore less relevant memories
    return "Past sessions relevant memory:\n" + risultati.map(r => `[${r.data}] ${r.testo}`).join("\n---\n");
  } catch (e) { console.error("searchLongMemory error:", e); return ""; }
}

// init long-term memory
async function initSelfAwareness() {
  const memoria = JSON.parse(localStorage.getItem("vivacityLongMemory") || "[]");
  
  // check if already saved
  const giaSalvata = memoria.some(item => item.testo.includes("VivacityGPT_Master_Chatbot"));
  if (giaSalvata) return;
  
  // save it as a permanent memory
  await saveToLongMemory(
    "What are your specifics?",
    selfAwareness
  );
  console.log("Self-awareness saved in long memory.");
}


// ─── getChatGPTResponse  ────────────────────────
async function getChatGPTResponse(userInput, chatMemory = []) {
  const apikey = localStorage.getItem("openaikey");
  if (!apikey) { alert("No OpenAI API Key found."); return; } // getItem returns null when the key is missing
  document.getElementById("apikey").value = apikey;
  showSpinner("Generating response...");

  // search long-term memory for relevant context before calling OpenAI
  const contestoMemoria = await searchLongMemory(userInput);

  // if relevant memories were found, append them to the first message in
  // the history (typically the system prompt) so the model sees them
  if (contestoMemoria && chatMemory.length > 0) {
    chatMemory[0].content = chatMemory[0].content + "\n\n" + contestoMemoria;
  }
        
  // if there is an active document, attach it as a file part so the
  // model sees the entire content, not just a summary
  const userMessage = currentFileId
    ? {
        role: "user",
        content: [
          { type: "text", text: userInput },
          { type: "file", file: { file_id: currentFileId } }
        ]
      }
    : { role: "user", content: userInput };

  try {
    const response = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: { "Content-Type": "application/json", Authorization: "Bearer " + apikey },
      body: JSON.stringify({
        model: "gpt-4.1-mini",
        messages: [...chatMemory, userMessage],
        max_tokens: 15999
      })
    });
    if (!response.ok) throw new Error("Error while calling the API");
    const data = await response.json();
    if (!data.choices || !data.choices.length || !data.choices[0].message || !data.choices[0].message.content)
      throw new Error("Invalid API response");
    const chatGPTResponse = data.choices[0].message.content.trim();

[... rest of the call ...]

But when I try to check the number of tokens used, the count seems unreliable because it looks too low.

They are read from the response via data.usage.completion_tokens and data.usage.prompt_tokens:

const tokenCount = document.createElement("p");
if (data.usage && data.usage.completion_tokens) {
  const requestTokens = data.usage.prompt_tokens;
  const responseTokens = data.usage.completion_tokens;
  const totalTokens = data.usage.total_tokens;
  const pricepertokenprompt = 1.25 / 1000000;   // USD per prompt token
  const pricepertokenresponse = 10 / 1000000;   // USD per completion token
  const priceperrequest = pricepertokenprompt * requestTokens;
  const priceperresponse = pricepertokenresponse * responseTokens;
  const totalExpense = priceperrequest + priceperresponse;
  tokenCount.innerHTML = `<hr>Your request used ${requestTokens} tokens and cost ${priceperrequest.toFixed(6)} USD<br>This response used ${responseTokens} tokens and cost ${priceperresponse.toFixed(6)} USD<br>Total Tokens: ${totalTokens}. This interaction cost you: ${totalExpense.toFixed(6)} USD.`;
} else {
  tokenCount.innerHTML = "Unable to track the number of used tokens.";
}

Now my long-term memory has grown to over 110 memories from past conversations, but the number of tokens reported in the completion seems too low.

Here is a short exchange from this morning:

+[9:43] - Guest: What is my favourite car?

-[9:43] - VivacityGPT: Your favourite car is the Pontiac Firebird from 1987. Would you like me to tell you more about this car?

Your request used 2417 tokens and cost 0.003021 USD
This response used 25 tokens and cost 0.000250 USD
Total Tokens: 2442. This interaction cost you: 0.003271 USD.

+[9:43] - Guest: and my favourite meal?

-[9:43] - VivacityGPT: Your favourite meal is lasagne al pesto. It’s a delicious twist on traditional lasagne, combining classic Italian pasta layers with the fresh, aromatic flavor of pesto sauce. Would you like recipes or suggestions related to it?

Your request used 2455 tokens and cost 0.003069 USD
This response used 44 tokens and cost 0.000440 USD
Total Tokens: 2499. This interaction cost you: 0.003509 USD.

Notice that the amount is increasing, but roughly 2,400 tokens does not seem to reflect the 110+ memories stored in long-term memory.

Does anybody have an idea how this works, or can anyone see if my code is missing something?


Yes, in your code you are only retrieving up to 3 memories, not all 110.
searchLongMemory(query, topK = 3) computes similarity against the full stored memory list, but then slice(0, topK) keeps only the top 3 results, and those are the only ones appended to the chat context.
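
You can see this directly by logging right after the slice(0, topK) inside searchLongMemory; memoria and risultati are the variables already in your function (a quick sketch, not required code):

// inside searchLongMemory, right after .slice(0, topK):
console.log(`Ranked ${memoria.length} stored memories, kept top ${risultati.length}`);
risultati.forEach(r =>
  console.log(`score=${r.score.toFixed(3)} [${r.data}] ${r.testo.slice(0, 60)}...`)
);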

So the numbers you see are not inconsistent with 110 stored memories. Most of those memories are only used for similarity search, not sent to the model. The chat request only includes your normal chat history plus the few retrieved memory snippets you append.
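
If you want to verify what the completion is actually charged for, you can log the final messages array just before the fetch in getChatGPTResponse. A rough sketch (finalMessages is an illustrative local; the ~4 characters per token rule is only a crude heuristic, and data.usage.prompt_tokens from the response remains the authoritative count):

// just before the fetch() in getChatGPTResponse:
const finalMessages = [...chatMemory, userMessage];
const payloadChars = JSON.stringify(finalMessages).length; // size of what is actually sent
console.log(`Sending ${finalMessages.length} messages, ${payloadChars} chars`);
console.log(`Rough estimate: ~${Math.ceil(payloadChars / 4)} tokens (data.usage.prompt_tokens has the real count)`);

Also note that the embedding calls are billed separately: each /v1/embeddings response carries its own usage object (data.usage.prompt_tokens and data.usage.total_tokens), charged at the text-embedding-3-small rate, so those tokens never show up in the chat completion's usage.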
