Research topic: How to handle coordinate information for robot manipulation

I want to use ChatGPT for robot manipulation. The robot can pick up and place objects on a table.
I want GPT to understand the location of each object and output its inference results as coordinates for the robot action.
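As a minimal sketch of this intended setup (the object names, coordinate values, and the `call_llm` helper below are hypothetical placeholders, not from this note), the scene could be serialized into the prompt as plain text and the model asked to reply with pick/place coordinates in a structured format:

```python
import json

# Hypothetical scene: object names and their (x, y) positions on the table, in metres.
scene = {
    "red_cube": (0.12, 0.34),
    "blue_cup": (0.40, 0.10),
}

# Serialize the scene into plain text so the LLM sees the coordinates as part of the prompt.
scene_text = "\n".join(f"{name}: x={x:.2f}, y={y:.2f}" for name, (x, y) in scene.items())

instruction = "Pick up the red cube and place it 10 cm to the right of the blue cup."

prompt = (
    "Objects on the table (coordinates in metres):\n"
    f"{scene_text}\n\n"
    f"Task: {instruction}\n"
    'Reply ONLY with JSON of the form {"pick": [x, y], "place": [x, y]}.'
)

def call_llm(prompt: str) -> str:
    """Placeholder for a ChatGPT API call; returns a canned answer here."""
    return '{"pick": [0.12, 0.34], "place": [0.50, 0.10]}'

# Parse the model's text reply back into numbers the robot controller can use.
action = json.loads(call_llm(prompt))
pick_xy, place_xy = action["pick"], action["place"]
print("pick at", pick_xy, "place at", place_xy)
```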

There is one problem: how can GPT output coordinates without fine-tuning? (I think it is possible with fine-tuning; there are some papers on that.)
Let's also assume we cannot refer to a position by naming a specific object; that case would be easy, because the LLM could simply output object names instead of coordinates.

The question is: is there a good way to handle location information in an LLM? Everything in an LLM is handled as text, so it is hard to work with numerical information such as coordinates on the table.
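One possible direction, offered only as a sketch and not something claimed in this note, is to discretize the table into a coarse labeled grid so the LLM reads and writes short cell labels instead of raw floating-point numbers, and to map those labels back to continuous coordinates for the robot. The table size, grid resolution, and helper names below are assumptions.

```python
# Hypothetical table: 0.6 m x 0.4 m, split into a coarse grid so positions become short
# labels like "C2" that are easier for an LLM to read and emit than raw floats.
TABLE_W, TABLE_H = 0.6, 0.4   # table size in metres (assumed)
COLS, ROWS = 6, 4             # grid resolution (assumed)

def xy_to_cell(x: float, y: float) -> str:
    """Map a continuous (x, y) position to a grid-cell label such as 'C2'."""
    col = min(int(x / TABLE_W * COLS), COLS - 1)
    row = min(int(y / TABLE_H * ROWS), ROWS - 1)
    return f"{chr(ord('A') + col)}{row + 1}"

def cell_to_xy(cell: str) -> tuple[float, float]:
    """Map a grid-cell label back to the centre of that cell in table coordinates."""
    col = ord(cell[0]) - ord("A")
    row = int(cell[1:]) - 1
    x = (col + 0.5) * TABLE_W / COLS
    y = (row + 0.5) * TABLE_H / ROWS
    return (x, y)

# Example: describe an object position to the LLM as a cell label, and convert the cell
# label it outputs back into coordinates for the robot controller.
print(xy_to_cell(0.12, 0.34))   # -> 'B4'
print(cell_to_xy("D2"))         # -> (0.35, 0.15)
```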