It is not gpt-4o-mini that has the 8k-token limitation; that limit applies to the input of the embeddings models, and therefore to the search feature powered by them. gpt-4o-mini itself has a context window of 128,000 tokens.
“Handle the error” while you “don’t want to truncate the input”? If you send more than the model can accept, you will get an API error, so “handle” can only mean “don’t crash outright”: unless you change your technique, the request simply cannot succeed.
You will need a strategy for this. What I would suggest is a token counter: if the text would approach or exceed the embeddings model’s input limit, but not that of a language model such as gpt-4o-mini itself, make a summarizing API call first, as sketched below.
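Here is a minimal sketch of that strategy, assuming the text-embedding-3 models’ input limit of 8191 tokens, the tiktoken library, and the current openai Python SDK; the function name and the summarizing prompt are only illustrative:

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer of the text-embedding-3 models

EMBED_LIMIT = 8191  # input limit of text-embedding-3-small/-large

def prepare_for_embedding(text: str) -> str:
    """Return text that fits the embeddings input, summarizing if needed."""
    if len(enc.encode(text)) <= EMBED_LIMIT:
        return text
    # Too long to embed directly, but far below gpt-4o-mini's 128k context:
    # compress with a summarizing call instead of truncating.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize the user's text densely, preserving key facts, names, and terminology."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content
```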
This does not have to be a plain “summarize this text”. You can also use hypothetical answering, where you prompt the AI to write a new text that looks like the kind of answer, and the kind of document, that would contain such an answer. Embedding that text increases your matches further.
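A hypothetical-answering version of the same call might look like this; the prompt wording is only an example:

```python
from openai import OpenAI

client = OpenAI()

def hypothetical_document(question: str) -> str:
    """Write a passage that reads like a document answering the question."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Write a short passage, in the style of a documentation excerpt, "
                "that would plausibly contain the answer to the user's question. "
                "Do not address the user; just write the passage."
            )},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```

You then embed the returned passage instead of the raw question, so the query vector lands closer to the documents that actually contain the answer.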
If you are not using OpenAI’s built-in product but your own embeddings database, you can split texts and make multiple embeddings calls. Then either add those vectors and renormalize to get a single query vector, or combine the multiple results into a new weighted ranking.
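A sketch of the add-and-renormalize approach, assuming text-embedding-3-small, numpy, and a chunk size kept safely under the input limit:

```python
import numpy as np
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")
CHUNK_TOKENS = 8000  # stay under the 8191-token embeddings input limit

def embed_long_text(text: str) -> np.ndarray:
    # Split on token boundaries so every chunk is individually embeddable.
    tokens = enc.encode(text)
    chunks = [enc.decode(tokens[i:i + CHUNK_TOKENS])
              for i in range(0, len(tokens), CHUNK_TOKENS)]
    response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    vectors = np.array([d.embedding for d in response.data])
    combined = vectors.sum(axis=0)               # add the chunk vectors
    return combined / np.linalg.norm(combined)   # renormalize to unit length
```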
Token counting is done with tiktoken, OpenAI’s tokenizer library for Python: encode a string, and the length of the resulting token list is the count. If you are coding in a different language or platform without Python, you can set up a token-counting API worker yourself.
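Such a worker can be a tiny HTTP service. A sketch using FastAPI (the endpoint name is arbitrary):

```python
from fastapi import FastAPI
from pydantic import BaseModel
import tiktoken

app = FastAPI()
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer of the embeddings models

class CountRequest(BaseModel):
    text: str

@app.post("/count_tokens")
def count_tokens(req: CountRequest):
    # len() of the encoded list is the token count
    return {"tokens": len(enc.encode(req.text))}
```

Save it as, say, worker.py, run it with `uvicorn worker:app`, and POST JSON to it from whatever language your application uses.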