What languages does the retrieval embedding support for gpt-4-1106-preview?

Does it support, for example, Norwegian? Or does it translate the text to English and search that way? Is it an updated version of the OpenAI embeddings API, or is it the same?

From what I understand, an embedding model just groups similar words and meanings together. Since it’s an LLM, it should work in basically “any” language.

From what I know, you must do the retrieval yourself, as in: run a vector search over your embeddings, give the results to GPT-3.5 or GPT-4, and ask it to answer based on the snippets “in the user’s language”, or similar.
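Something like this minimal sketch of the manual flow (assuming the v1.x openai Python client; the chunks and the Norwegian question are invented for illustration):

```python
# Embed chunks, vector-search, then ask the chat model to answer
# from the best snippet "in the user's language".
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

chunks = [
    "Returpolicyen gir 30 dagers åpent kjøp.",  # Norwegian product info
    "Shipping takes 3-5 business days.",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

chunk_vecs = embed(chunks)
question = "Hvor lenge har jeg på meg å returnere varen?"
q_vec = embed([question])[0]

# ada-002 vectors come back unit-normalized, so dot product = cosine similarity.
best = chunks[int(np.argmax(chunk_vecs @ q_vec))]

answer = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {"role": "system",
         "content": "Answer only from the snippets, in the user's language."},
        {"role": "user", "content": f"Snippets:\n{best}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```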

You have to pre-train an embedding model for it to capture the meaning of a group of words. So my question is which languages it was trained on.

You should augment your knowledge about what you know - it all changed on Monday.

https://platform.openai.com/docs/assistants/tools/knowledge-retrieval

Knowledge Retrieval

Retrieval augments the Assistant with knowledge from outside its model, such as proprietary product information or documents provided by your users. Once a file is uploaded and passed to the Assistant, OpenAI will automatically chunk your documents, index and store the embeddings, and implement vector search to retrieve relevant content to answer user queries.

The embedding engine is likely the same text-embedding-ada-002 we already know. It does semantic similarity search and can match topics across languages. Its fluency comes from training similar to ChatGPT’s, but it is a smaller model.
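You can sanity-check the cross-language matching directly (a sketch, again assuming the v1.x openai client; the sentences are invented):

```python
# A Norwegian sentence should score closer to its English counterpart
# than to an unrelated English sentence.
import numpy as np
from openai import OpenAI

client = OpenAI()

texts = [
    "Hunden løper i parken.",                # "The dog runs in the park."
    "The dog is running in the park.",
    "Quarterly revenue grew by 12 percent.",
]
resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
vecs = np.array([d.embedding for d in resp.data])

# Unit-normalized vectors: dot product is cosine similarity.
print("NO vs EN, same topic: ", vecs[0] @ vecs[1])
print("NO vs EN, other topic:", vecs[0] @ vecs[2])
# Expect the first score to be noticeably higher.
```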

When using the model directly instead of through the new Assistants feature, it doesn’t have its own document system: you have to add the knowledge to the API call yourself.
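One way to do that (a sketch, v1.x openai client; the Japanese snippet is a placeholder) is to put the retrieved text into an assistant-role message with a “relevant knowledge:” prefix:

```python
from openai import OpenAI

client = OpenAI()
retrieved = "返品は30日以内であれば無料です。"  # "Returns are free within 30 days."

resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {"role": "system",
         "content": "Answer from the provided knowledge, in the user's language."},
        {"role": "assistant", "content": f"relevant knowledge: {retrieved}"},
        {"role": "user", "content": "Is there a fee for returning an item?"},
    ],
)
print(resp.choices[0].message.content)
```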


Oh, this is neat.
However, it still seems to use the same method I know. The only thing that changed is the barrier to entry for embeddings: you used to have to do everything yourself or use libraries such as LangChain. Now OpenAI does “everything” for you.

OpenAI will automatically chunk your documents, index and store the embeddings, and implement vector search to retrieve relevant content to answer user queries.

So my point still seems valid. It should work in “any” language the smaller ada model has been trained on, which is probably a lot of them.

But this also means it eliminates a lot of the possibilities for custom prompts. In which language will it answer?
There’s just one way to find out, I guess.

Embeddings will retrieve semantically relevant material. The AI will answer so that you can understand it.

Retrieval augmentation from a Japanese website, in the assistant role with a “relevant knowledge:” preprompt:

The gpt-4-1106 model from the Assistants API itself describes how it does RAG:

Certainly! When searching within a document using the myfiles_browser tool, I utilize full-text search capabilities that rely on keywords and phrases to locate relevant sections within the text. This tool can recognize variations and context-related instances of the search terms, aiding in finding pertinent excerpts of the document.
In addition to the search function, I have the ability to open and peruse the document from beginning to end. I can navigate through different sections and extract relevant information using commands such as click(), back(), quote(), and scroll(). This means that I can either read through the entire document or search for specific sections of text using the search feature.
For the current query, I conducted searches using the term “vanhustenhuolto” to find candidates’ opinions on elderly care. If a more comprehensive review of the document is required, I can open it from the beginning and scroll through it.

The “full-text search” is obviously based on embeddings. The documentation says that the model can also read complete shorter files into the context. I have no confirmation of the capabilities it describes above, except that it has offered to read “the beginning” of a file, or maybe even the entire file, after it didn’t find relevant content with embeddings.

I haven’t approved, because my file is ≈500 kB of JSON. I have also gotten rate limit errors; I don’t quite know whether the model has quietly tried to skim through the entire file or part of it, or whether someone else has used my quota. (It is shared by several projects.)

(Oh yeah, that ‘vanhustenhuolto’ is Finnish for ‘elderly care’, which it derived from the user’s question.)


Please elaborate more on the storage part. Earlier, I had to use FAISS to store my embeddings. Has this changed? Can I keep my vector DB with OpenAI?

For context: I’ll be using the API, and my data is about 40 million tokens.

Your method should still work fine. They only added a new retrieval feature, afaik.
Removing basic features like Embeddings (the manual way) would break A LOT of workflows.
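A minimal sketch of keeping the FAISS workflow (assuming faiss-cpu and the v1.x openai client; the chunks are placeholders, and 1536 is ada-002’s output dimension):

```python
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

docs = ["first chunk of my data ...", "second chunk ..."]
index = faiss.IndexFlatIP(1536)  # inner product = cosine on unit vectors
index.add(embed(docs))

scores, ids = index.search(embed(["my query"]), k=1)
print(docs[ids[0][0]], scores[0][0])
```

With ~40 million tokens you’d want to batch the embedding calls and persist the index (e.g. with faiss.write_index), but the principle is unchanged.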