This works with gpt-4o:
Row 0:
[0,0] Alice
[0,1] Jim
[0,2] Stuart
[0,3] William
[0,4] Angela
[0,5] June
[0,6] Wendy
[0,7] Tim
Row 1:
[1,0] Rick
[1,1] Laura
[1,2] George
[1,3] Rowan
[1,4] Isla
[1,5] Helen
[1,6] Henry
[1,7] Calum
Row 2:
[2,0] Fred
[2,1] Arthur
[2,2] Pamela
[2,3] Ben
[2,4] Kate
[2,5] Amy
[2,6] Philip
[2,7] Paul
Row 3:
[3,0] Mary
[3,1] Pat
[3,2] Kelly
[3,3] Alan
[3,4] Lily
[3,5] Dan
[3,6] Steve
[3,7] Mike
Row 4:
[4,0] Mat
[4,1] Cameron
[4,2] Duncan
[4,3] James
[4,4] Oliver
[4,5] John
[4,6] Aulay
[4,7] Connor
The model sees this list as:
Row 0: [0,0] Alice [0,1] Jim [0,2] Stuart [0,3] William [0,4] Angela [0,5] June [0,6] Wendy [0,7] Tim Row 1: [1,0] Rick [1,1] Laura [1,2] George [1,3] Rowan [1,4] Isla [1,5] Helen [1,6] Henry [1,7] Calum etc…
The model may see in 1D, but it actually does a decent job of mapping that information spatially. You can help it out by giving it anchors. The “Row n:” headers give the model anchor points so it knows where clusters of tokens start, and clusters is probably the best way to think about it. The token “Row” puts the model in the right frame of mind to think spatially: it knows that rows can sit above and below each other. The cell coordinates give each name an anchor that can be reasoned over (or at least fake-reasoned over, because these models can’t truly reason). Through its training the model has learned that 2 comes before 3, 4 comes after 3, and so on.
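If your names already live in a 2D list, you don’t have to type the anchors by hand. Here’s a minimal sketch of how the layout above could be generated. The function name `format_grid` and the small sample grid are just placeholders I made up for illustration, not anything from a library.

```python
def format_grid(grid):
    """Render a 2D list of names with 'Row n:' headers and [row,col] anchors."""
    lines = []
    for r, row in enumerate(grid):
        # Row header: tells the model where a new cluster of tokens starts.
        lines.append(f"Row {r}:")
        for c, name in enumerate(row):
            # Coordinate anchor placed right next to the value it labels.
            lines.append(f"[{r},{c}] {name}")
    return "\n".join(lines)


# Hypothetical sample: the first three columns of the first two rows.
sample_grid = [
    ["Alice", "Jim", "Stuart"],
    ["Rick", "Laura", "George"],
]

prompt_text = format_grid(sample_grid)
print(prompt_text)
```

Paste the printed text into the prompt ahead of your question; every name ends up one token-hop away from its own coordinate.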
You can’t just ask who’s diagonally below Lily, because that’s not specific enough. You have to ask who’s below and to the right. Everything is really a function of how far a value is from its label/anchor. The closer a value is to an anchor that has semantic meaning, the more likely you are to get an accurate answer. When you have just a bunch of names separated by pipes (|), there’s nothing for the model to latch onto.
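If you want to check the model’s answers, you can work out the expected name yourself. This is only a rough sketch: `find` and `neighbor` are helper names I invented, and it assumes “below” means row + 1 and “to the right” means column + 1, which matches the coordinates in the list above.

```python
GRID = [
    ["Alice", "Jim", "Stuart", "William", "Angela", "June", "Wendy", "Tim"],
    ["Rick", "Laura", "George", "Rowan", "Isla", "Helen", "Henry", "Calum"],
    ["Fred", "Arthur", "Pamela", "Ben", "Kate", "Amy", "Philip", "Paul"],
    ["Mary", "Pat", "Kelly", "Alan", "Lily", "Dan", "Steve", "Mike"],
    ["Mat", "Cameron", "Duncan", "James", "Oliver", "John", "Aulay", "Connor"],
]


def find(grid, name):
    """Return (row, col) of the first cell holding name, or None if absent."""
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value == name:
                return r, c
    return None


def neighbor(grid, name, d_row, d_col):
    """Return the name offset by (d_row, d_col) from name, or None if off-grid."""
    pos = find(grid, name)
    if pos is None:
        return None
    r, c = pos[0] + d_row, pos[1] + d_col
    if 0 <= r < len(grid) and 0 <= c < len(grid[r]):
        return grid[r][c]
    return None


# "Below and to the right of Lily" -> one row down, one column right.
expected = neighbor(GRID, "Lily", 1, 1)
print(expected)  # John
```

Lily sits at [3,4], so below and to the right is [4,5]. Spelling out both directions in the question maps to exactly one cell, which “diagonally below” does not.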
Hope that helps…