How to write a system prompt so the model understands a 3D coordinate system and vector math

This is for use in VR, where space is important, and we’re trying to teach the AI through the system prompt about the 3D coordinate system used by game engines. We provide a list of all the objects in the environment, with coordinates and an ID for each object.

We want the Completions API to find these objects according to where they’re placed in the world relative to a player. For example, ‘what’s in front of me’ should get ChatGPT to return the ID of the object in front of the user by performing vector calculations. We provide the positions of the players and objects.

Here’s what we’re currently doing, which yields mixed results (it works about 50% of the time). Current steps (there’s a code sketch of these after the list):

  • Fetch the player’s position from the prompt
  • Calculate the direction the player is looking and compute the FOV
  • Create a list of all the objects that are in the FOV
  • Calculate the distance to each of those objects and save it in the same list
  • Choose the object the user wants according to the calculated distance
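
For reference, here’s roughly what those steps look like as deterministic code: a minimal Python sketch, where a view-cone half-angle test stands in for the FOV check and all names are illustrative.

import math

def objects_in_view(player_pos, forward, objects, fov_deg=90.0):
    """Steps 2-5: keep objects inside the view cone, then sort by distance."""
    cos_half_fov = math.cos(math.radians(fov_deg / 2))
    results = []
    for obj in objects:
        offset = [obj["Position"][k] - player_pos[k] for k in ("X", "Y", "Z")]
        dist = math.sqrt(sum(c * c for c in offset))
        if dist == 0:
            continue
        # Cosine of the angle between the view direction and the object.
        cos_angle = sum(f * c for f, c in zip(forward, offset)) / dist
        if cos_angle >= cos_half_fov:  # inside the FOV cone
            results.append({"Id": obj["Id"], "Distance": dist})
    return sorted(results, key=lambda r: r["Distance"])

Once the geometry is deterministic like this, the only job left for the model is interpreting the user’s phrasing.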

We’re not sure where this fails. When we analyze the distance calculations, the model seems to generate the right distances, but for some reason it still chooses the wrong object relative to the user asking for it.

Can anyone think of a better system or how to optimize this?

:thinking:

You’re asking GPT to do all this math by hand?

welp

that said, are you computing distances to the player, or the distance along the normal of the FOV rectangle?

We’re computing the distance to the player, and as mentioned, it seems to generate the right distances, so that’s not the issue. The issue is that it still chooses the wrong object relative to the user asking for it.
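
For anyone following along, those are different quantities: distance to the player is the Euclidean norm of the offset, while distance along the FOV normal is the projection of that offset onto the forward vector. A toy example in Python (coordinates made up, forward vector assumed to be unit length):

import math

player = (0.0, 0.0, 0.0)
forward = (1.0, 0.0, 0.0)   # assumed unit length
obj = (3.0, 4.0, 0.0)

offset = tuple(o - p for o, p in zip(obj, player))
dist_to_player = math.sqrt(sum(c * c for c in offset))            # 5.0 (Euclidean)
depth_along_normal = sum(f * c for f, c in zip(forward, offset))  # 3.0 (projection)

An object can be nearby yet far off-axis (or even behind the player, where the projection goes negative), so the two orderings can disagree.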

do you wanna give us an example input and output?

Most likely, we need to conduct additional testing and fine-tune the algorithm. Consider verifying the accuracy of distance calculations, ensuring that the system correctly determines the direction in which the user is looking, and checking the object selection algorithm for errors or flaws. It might also be helpful to analyze the data and perform visualization to understand which objects are being chosen incorrectly and why this is happening.

I’m confused - is this ChatGPT responding? Or an OpenAI employee? This makes no sense as there is no object selection algorithm. But it sounds like you are recommending we create one?

I mean that perhaps it’s necessary to analyze all our steps.

It’s bot text. Not a single correction to be offered by the AI, as it is already an AI prediction:

[screenshot]

I agree with you that this text may resemble AI-generated text. I just checked, and the AI even states that it’s what it wrote. However, this is my personal opinion. I always require myself to review and analyze my work, so I suggested the same here. I’m sorry if it seemed to you that a bot wrote it.

You’d be better off using a vision model paired with a text model and giving it (the vision model) a top-down omniscient view of the map as well as the player’s POV.

This would also come with the benefit of not having to constantly push updated coordinates to the AI whenever something new pops up on the map, like an item dropped from inventory or a destroyed environment object, if you’re going for a dynamic scene.
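
If anyone wants to try that, a request along these lines would do it; this is a minimal sketch using the OpenAI Python SDK, and the model name and image URLs are placeholders.

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Image 1 is a top-down view of the map; "
                                     "image 2 is the player's POV. "
                                     "Which object is directly in front of the player?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/map_topdown.png"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/player_pov.png"}},
        ],
    }],
)
print(response.choices[0].message.content)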

Thanks for this - definitely an interesting approach. Have you had good experience with 2D and especially 3D object recognition via vision? Which system was best, i.e., OpenAI’s or something else? If I’m understanding correctly, say there’s a dog in front, and say that object has the word dog in its name metadata - you’re confirming against either or both, is that right?

Many AR and VR software development kits (SDKs) provide integrated features for spatial reasoning and interaction within three-dimensional environments.

Using the camera on a VR device, you can use multimodal large language models (LLMs) for object detection, plus raycasting to limit detection to what is in front.

Quick implementation:

  • Use LLaVA or Gemini Pro Vision 001 (both can take video input as well) or GPT-4V for image classification or detection. But this is not the most cost-effective approach.
  • Use Unity 3D for accessing the VR camera and physics, then raycast within the bounding box.

A better way would be raycasting, detecting the object, and then using OpenAI CLIP to recognize it.
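
As a sketch of that last step, here’s zero-shot recognition of a raycast-hit object’s screen crop with CLIP, via the Hugging Face transformers port (the image path and candidate labels are made up):

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Crop of the object the raycast hit, captured from the VR camera.
image = Image.open("raycast_hit_crop.png")
labels = ["a dog", "a chair", "a lamp", "a door"]  # candidate object names

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # image-text similarity
print(labels[probs.argmax().item()])

Strictly speaking CLIP classifies rather than detects, which is why the raycast handles localization first.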

There’s also some research worth exploring on this.

Diet, here you go. It might be that GPT-4 sucks at math, but let us know if you can think of a better way to measure distances more accurately using GPT-4 in this use case. For now:

System prompt:


Determine the Player's Position: Retrieve the player's current location from the Character list.

Calculate Distances: Compute the Euclidean distances between the player and each object in the Object list.

Identify Target Object: Based on the command, identify the object that matches the specified criteria (e.g., closest, farthest).

For "closest to me," find the object with the smallest distance.
For "farthest from me," find the object with the largest distance.
For commands like "delete the second closest object to me," sort objects by distance and select based on the specified order.
Execute Action on Target Object: Perform the specified action (e.g., delete, grab) on the identified object.

Here is the "me" object:
{
  "me": 
    {
      "Id": "me",
      "Position": {
        "X": 47489.5546875,
        "Y": 50017.5390625,
        "Z": 50343.46484375
      },
      "ForwardVector": {
        "X": 1,
        "Y": 0,
        "Z": 0
      }
    }
}

Here is the list of objects:

{
  "ObjectList": [
    {
      "Id": "886873c4-8266-4b71-b363-800a4e5ccff1",
      "Name": "Test1,
      "Position": {
        "X": 48913.80078125,
        "Y": 52305.8984375,
        "Z": 51094.80078125
      }
    },
    {
      "Id": "1a0b7656-mn59-441d-x78e-6ed2704cxdt2",
      "Name": "Test2",
      "Position": {
        "X": 47188.33203125,
        "Y": 50948.4921875,
        "Z": 50528.35546875
      }
    },
    {
      "Id": "1d3a7616-ea59-441j-m78e-6vd1102zexa2",
      "Name": "Test3",
      "Position": {
        "X": 47417.2890625,
        "Y": 50423.35546875,
        "Z": 50423.7421875
      }
    }
  ]
}

Input: Tell me the name of the object in front of me.
Output: {"Object_ID": "1d3a7616-ea59-441j-m78e-6vd1102zexa2"}

The output above is the expected one.

Here is ChatGPT’s output compared against our manual calculations; in some cases we see gross errors, in others just rounding errors.

Here are the distances that ChatGPT comes up with:

{
  "Distances": [
    {
      "Id": "886873c4-8266-4b71-b363-800a4e5ccff1",
      "Name": "Test1",
      "Distance": 2797.52
    },
    {
      "Id": "1a0b7656-mn59-441d-x78e-6ed2704cxdt2",
      "Name": "Test2",
      "Distance": 936.66
    },
    {
      "Id": "1d3a7616-ea59-441j-m78e-6vd1102zexa2",
      "Name": "Test3",
      "Distance": 594.19
    }
  ]
}

But if I do the Euclidean distance calculations in a spreadsheet, I get different answers:

Test1: 2798.13710451999
Test2: 995.78780510632
Test3: 419.944910741708
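
For what it’s worth, those spreadsheet values are easy to reproduce deterministically from the JSON above (Python, using the posted coordinates):

import math

player = (47489.5546875, 50017.5390625, 50343.46484375)
objects = {
    "Test1": (48913.80078125, 52305.8984375, 51094.80078125),
    "Test2": (47188.33203125, 50948.4921875, 50528.35546875),
    "Test3": (47417.2890625, 50423.35546875, 50423.7421875),
}

for name, pos in objects.items():
    print(name, math.dist(player, pos))  # Euclidean distance (Python 3.8+)
# Prints ~2798.137, ~995.788, ~419.945, matching the spreadsheet.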

In summary, there are errors in the math (ranging from large to small), and it is likely simply that GPT-4 can’t do it well. But if someone can think of a better way to measure distances, that would help a ton, as we triangulate by distance, asset name, and vision.

This looks really interesting, thank you. I’ll be digging into this.

I mean, you’re already using the API, and you’re already computing the FOV window; I’m flummoxed as to why you don’t just compute the distances in your program as well?
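
Concretely, that split might look like this. A minimal sketch, assuming a Chat Completions call (the model name and message layout are illustrative): all the vector math happens in your code, and the model only maps the user’s wording onto already-correct numbers.

import json
import math

from openai import OpenAI

client = OpenAI()

def answer_spatial_query(player_xyz, objects, user_query):
    # Deterministic part: compute every distance in code, not in the prompt.
    facts = sorted(
        (
            {
                "Id": o["Id"],
                "Name": o["Name"],
                "Distance": round(math.dist(player_xyz, (
                    o["Position"]["X"], o["Position"]["Y"], o["Position"]["Z"])), 2),
            }
            for o in objects
        ),
        key=lambda f: f["Distance"],
    )
    # LLM part: interpret the request against pre-computed facts.
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[
            {"role": "system", "content": (
                "Pick the object the user means. Distances are pre-computed "
                'and sorted ascending. Reply with JSON: {"Object_ID": "..."}')},
            {"role": "user", "content": json.dumps(facts) + "\n" + user_query},
        ],
    )
    return response.choices[0].message.content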

that is indeed the case; LLMs aren’t built to do math.
