Does OpenAI do continuous number encoding?

Basically, LLMs are not good at math, largely because of how the tokenizer and sampling system handle numbers. There are a number of papers about this, proposing things like continuous number encodings.

So my question is: does OpenAI do this now, or plan to in the future?
I don't think they do, because their tokenizer and sampling code are open source, and I don't see any special process for encoding or decoding numbers in them.

:thinking:

Well, you can just take a look at the tokenizer https://platform.openai.com/tokenizer

And the answer is: up to a point, yes. Single digits (0–9) each get their own token (maybe longer digit runs do too with the newer tokenizers).

But generally no :confused:

(Token ID 220 is a space.)
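
You can check the same thing programmatically with the tiktoken library (cl100k_base here; the newer models use o200k_base):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Single digits each map to exactly one token...
for d in "0123456789":
    print(d, enc.encode(d))

# ...but a longer number is split into arbitrary digit chunks,
# so there is no single continuous representation of its value.
print(enc.encode("3.14159265"))

# Decoding an individual ID shows what each token is;
# as noted above, 220 decodes to a plain space.
print(repr(enc.decode([220])))
```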

But I don't think they will. IMO this is such a niche application that the LLM would be better served by just calling a tool for the comparison where needed.
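
To illustrate the tool route, here's a minimal sketch of delegating the arithmetic to a function call via the chat completions API (the model name and the `evaluate` tool are just placeholders for this example):

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical calculator tool the model can call instead of
# doing digit-by-digit arithmetic in its own tokens.
tools = [{
    "type": "function",
    "function": {
        "name": "evaluate",
        "description": "Evaluate an arithmetic expression exactly.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any tool-capable model
    messages=[{"role": "user", "content": "What is 12345.6 * 789.01?"}],
    tools=tools,
)

# Assuming the model chose to call the tool, the exact math
# then runs in your own code rather than in the LLM.
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```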

It basically means retraining the whole model, which is not economical at all. And adding a separate number pathway (a dedicated vector space rather than the token embedding space) is still not economical. What about separately adding [img], [video], and every other application the same way? I believe those would make each specific application more robust, but the extra computation is heavy, and it would be more efficient if OpenAI (and the other major LLM providers) adopted it themselves.
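
For concreteness, what's being proposed looks roughly like this: a separate learned projection that maps a raw scalar straight into the model's embedding space, bypassing BPE. A toy PyTorch sketch of the general idea (not any provider's actual design; dimensions are arbitrary):

```python
import torch
import torch.nn as nn

class ContinuousNumberEncoder(nn.Module):
    """Toy sketch: embed a raw numeric value directly,
    instead of splitting its digits into BPE tokens."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        # Small MLP from a scalar into the token-embedding space.
        self.proj = nn.Sequential(
            nn.Linear(1, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, values: torch.Tensor) -> torch.Tensor:
        # values: (batch,) raw numbers -> (batch, d_model) embeddings
        return self.proj(values.unsqueeze(-1))

enc = ContinuousNumberEncoder()
print(enc(torch.tensor([3.14, 1000.0])).shape)  # torch.Size([2, 768])
```

The catch is that the rest of the network has never seen vectors arriving from this pathway, which is why it amounts to retraining the whole model.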

Does this mean that there is no demand?

For example, are low-level answers like the coordinates of objects in an image, or the specific distances between objects, not in high demand?

For example, when doing category classification, would predicting the probability for each category be a similar case?
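For instance, something like reading the probability the model assigns to each label via the logprobs option (the model and labels here are just for illustration):

```python
import math
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder model
    messages=[{
        "role": "user",
        "content": "Classify the sentiment as positive, negative, or neutral. "
                   "Answer with one word.\n\nText: I loved it.",
    }],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,  # candidate tokens for the answer, with log-probabilities
)

# Convert the log-probabilities of the top candidates into probabilities.
for cand in resp.choices[0].logprobs.content[0].top_logprobs:
    print(cand.token, math.exp(cand.logprob))
```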

We have a thread on this here (GPT-4o Model: Image Coordinate Recognition); it sorta works already, kinda. I expect it to get a lot better over time.

There’s already an approach for this, generally: embedding models. You can remap embeddings (Customizing embeddings | OpenAI Cookbook) if you want, in theory. (I don’t know if anyone still does this though)
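
The cookbook's core idea, in sketch form: learn a small matrix that remaps the embedding space so similarity better matches your labels (the pair data, model name, and hyperparameters below are made up for illustration):

```python
import torch
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return torch.tensor([d.embedding for d in resp.data])

# Hypothetical labeled pairs: 1.0 = should be similar, 0.0 = should not.
pairs = [("cheap flight", "budget airfare", 1.0),
         ("cheap flight", "gourmet recipe", 0.0)]

a = embed([p[0] for p in pairs])
b = embed([p[1] for p in pairs])
labels = torch.tensor([p[2] for p in pairs])

# Learn a remapping matrix W so that cosine similarity in the
# remapped space tracks the labels.
W = torch.eye(a.shape[1], requires_grad=True)
opt = torch.optim.Adam([W], lr=0.01)
for _ in range(200):
    sim = torch.nn.functional.cosine_similarity(a @ W, b @ W)
    loss = ((sim - labels) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# New texts are then compared as (embedding @ W) instead of raw embeddings.
```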