Help decoding base64 embeddings in NodeJS

TLDR: I can’t figure out what the encoding format is for base64 embeddings returned by the create embedding API.

  • We’re using the create embedding API (called via a Node script) to embed text and then store the result in a database.
  • We’re specifying the base64 encoding format to play more nicely with our database.
  • We’re failing to decode the embedding, both when we pull it out of our database and directly when it is returned by the API.

Would love help from anyone who has restored base64 embeddings successfully!

Embedding generation

const embeddingInput = ... // this is a text field
const embeddingResponse = await openai.embeddings.create({
  model: 'text-embedding-ada-002',
  input: embeddingInput,
  encoding_format: 'base64'
});
const [{ embedding }] = embeddingResponse.data;
return embedding // note that typing for this is broken when specifying 'base64' but that doesn't really matter here!

Sample embedding

Here is a sample embedding that was returned by the API that has failed our decoding attempts so far (you can try with this link):



Answering my own question! Looks like it’s stored as a buffer of floats. Here is working code:

const embeddingBuffer = Buffer.from(embedding, 'base64');
const decodedEmbedding = new Float32Array(embeddingBuffer);

Curious: why’d you go with b64 encoding?


OK, just kidding: this is incredibly odd. That code returns a 6144-element array (suspiciously 3 × 2048) of integers:

Float32Array(6144) [
  212, 162, 131, 188, 247, 228, 239, 187, 246,  84, 214, 188,
  235,  80, 163, 188,  55,  98,  18, 189, 138, 188, 151,  60,
   86,  33,  85, 188,  39, 253, 136, 187, 179,  50,  55, 188,
  122, 121, 225,  59, 118, 246,  55,  61, 146,   5, 174, 187,
  160, 207,  77, 188, 168,  24, 100,  60, 198,  93, 173, 187,
  184, 136,  61, 188, 234, 192,   9,  61,  37, 154,  18, 188,
  116,  80, 126,  59,  20, 188, 242, 188,   5,  53,  22, 188,
  169, 190, 157,  59, 157,   9, 225, 186,  66,  91, 245,  59,
  216, 237,  57, 187,
  ... 6044 more items
]

I’ll dig in more and post the fix for deserializing this as a float array.

More stream of thought: the following Python code works with the same base64-encoded value:

import base64
import numpy as np
buffer = base64.b64decode(embedding)
base64_embedding = np.frombuffer(buffer, dtype=np.float32)

So the issue appears to be just how to deserialize an array of float32 values.
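
To illustrate what "an array of float32 values" means for the dump above, here is a minimal sketch that reads its first four integers as the bytes of one 32-bit float. It assumes the bytes are little-endian, which is what NumPy's native float32 gives on common x86/ARM machines; the variable names are only for illustration.

```javascript
// First four "integers" from the dump above. They are really raw bytes.
const firstFourBytes = Buffer.from([212, 162, 131, 188]);

// Assumption: little-endian byte order. Read the four bytes as one float32.
const firstValue = firstFourBytes.readFloatLE(0);

// Each float32 occupies 4 bytes, so 6144 bytes hold 6144 / 4 values.
const byteCount = 6144;
const floatCount = byteCount / Float32Array.BYTES_PER_ELEMENT;

console.log(firstValue); // a small negative number, roughly -0.016
console.log(floatCount); // 1536
```

1536 is the dimension of a text-embedding-ada-002 vector, which suggests the 6144 "integers" are the embedding's bytes, one array element per byte, rather than 6144 separate values.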

OK, confirmed: the issue is that I needed to access the “buffer” property of the Buffer, i.e. its underlying ArrayBuffer.

new Float32Array(
      Buffer.from(cacheItem.embedding, 'base64').buffer
);
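
One caveat with reading “.buffer” directly: Node may allocate small Buffers out of a shared pool, in which case “.buffer” is a larger ArrayBuffer and the data starts at a non-zero byteOffset. A 6144-byte embedding is probably large enough to get its own allocation, but a slightly more defensive sketch copies exactly the Buffer's bytes first. The helper name decodeEmbedding is hypothetical, and the code assumes the platform's native byte order matches the encoded data (little-endian on typical hardware).

```javascript
// Hypothetical helper: decode a base64 string of packed float32 values.
function decodeEmbedding(base64Embedding) {
  const bytes = Buffer.from(base64Embedding, 'base64');

  // Copy only this Buffer's bytes into a fresh ArrayBuffer. For small
  // Buffers, bytes.buffer can be Node's shared pool, so slicing by
  // byteOffset and byteLength avoids picking up unrelated data.
  const exactBytes = bytes.buffer.slice(
    bytes.byteOffset,
    bytes.byteOffset + bytes.byteLength
  );

  // Reinterpret every 4 bytes as one float32 (native byte order assumed).
  // Throws a RangeError if the byte length is not a multiple of 4.
  return new Float32Array(exactBytes);
}

// Round trip with made-up values to check the helper.
const original = new Float32Array([0.5, -1.25, 3]);
const encoded = Buffer.from(original.buffer).toString('base64');
const decoded = decodeEmbedding(encoded);

console.log(Array.from(decoded)); // [ 0.5, -1.25, 3 ]
```

The copy costs one extra allocation per embedding; passing bytes.buffer, bytes.byteOffset and the element count to the Float32Array constructor would avoid it, at the price of requiring a 4-byte-aligned offset.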