Large Language Models and Conversational User Interfaces for Interactive Fiction and other Videogames

Introduction

This forum post covers implementation details for conversational user interfaces (CUI) for interactive fiction (IF) and other videogames, and for bidirectional synchronization between game engines and large language models (LLMs).

Natural Language Understanding and Semantic Frames

How can LLMs be utilized as text parsers for existing or new works of IF?

One approach involves mapping spoken or written commands to semantic frames with slots that could be filled by nouns or by embedding vectors which represent those nouns. Perhaps the zero vector could be utilized to signify an empty slot, null or undefined.

Consider that commands like “take lamp” and “pick up the bronze lamp” could both utilize the typed semantic frame for “taking” (https://framenet2.icsi.berkeley.edu/fnReports/data/frame/Taking.xml).
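As a rough sketch of that mapping, the Python below sends both example commands to the same frame. The `Frame` class, the `VERB_TO_FRAME` table, and `toy_parse` are all hypothetical names, and `toy_parse` is only a keyword-matching stand-in for the LLM; the slot names `Agent` and `Theme` loosely follow FrameNet's Taking frame.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Frame:
    """A typed semantic frame with named slots (hypothetical structure)."""
    name: str
    slots: Dict[str, Optional[str]] = field(default_factory=dict)


# Hypothetical surface forms that an NLU step would map to one frame type.
VERB_TO_FRAME = {"take": "Taking", "pick up": "Taking", "grab": "Taking"}
ARTICLES = {"the", "a", "an"}


def toy_parse(command: str, agent: str = "player") -> Optional[Frame]:
    """Stand-in for the LLM: match a known verb phrase, keep the last word as Theme."""
    text = command.lower().strip()
    # Try longer verb phrases first so "pick up" is matched as a unit.
    for verb in sorted(VERB_TO_FRAME, key=len, reverse=True):
        if text.startswith(verb + " "):
            words = [w for w in text[len(verb):].split() if w not in ARTICLES]
            theme = words[-1] if words else None  # None marks an empty slot
            return Frame(VERB_TO_FRAME[verb], {"Agent": agent, "Theme": theme})
    return None  # no known frame for this command


print(toy_parse("take lamp"))
print(toy_parse("pick up the bronze lamp"))
```

Both calls produce a `Taking` frame whose `Theme` slot is `lamp`. A real parser would keep modifiers such as "bronze" to help pick among several lamps.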

A command like “take it” or “pick it up” could be interpreted by LLMs using dialogue context, for example after a command like “inspect lamp”.

Disjunction support for semantic frames’ slots could be useful for reporting multiple candidate nouns. An NLU component might report that, for “pick it up”, the pronoun resolves to “lamp” with 90% probability and to “treasure chest” with 10%. With disjunctive and potentially probabilistic outputs, CUI for IF or other videogames could ask players whether they meant “the lamp” or “the treasure chest” in a previous command.
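One way a CUI might act on such a disjunctive slot is sketched below. The function name `resolve_or_ask` and the 0.95 confidence threshold are assumptions made for illustration; the candidate probabilities are the ones from the example above.

```python
from typing import Dict, Tuple


def resolve_or_ask(candidates: Dict[str, float],
                   threshold: float = 0.95) -> Tuple[str, str]:
    """Return ("resolved", noun) or ("ask", question) for a disjunctive slot."""
    if not candidates:
        return ("ask", "What do you mean?")
    # Rank candidate nouns from most to least probable.
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    best_noun, best_p = ranked[0]
    if best_p >= threshold or len(ranked) == 1:
        return ("resolved", best_noun)
    options = " or ".join(f"the {noun}" for noun, _ in ranked)
    return ("ask", f"Did you mean {options}?")


print(resolve_or_ask({"lamp": 0.9, "treasure chest": 0.1}))
```

With the default threshold, the 90% candidate is not confident enough, so the sketch returns a clarification question. Lowering the threshold to 0.8 would resolve the pronoun to the lamp without asking.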

Bidirectional Synchronizations between Game Engines and Large Language Models

Envisioned here are bidirectional synchronizations between game engines and LLMs. In these regards, let us consider that game engines could manage and maintain dynamic documents, transcripts, and logs and that these could be components of larger prompts to LLMs.

Consider, for example, an animate creature arriving on screen, which the player should be able to refer to through a CUI. How would the LLM know that the creature, e.g., an “orc”, was on screen and had entered the dialogue context?

By managing dynamic documents, transcripts, or logs, game engines could provide synchronized contexts as components of prompts to LLMs.

This would be towards providing an illusion that the CUI AI also sees or understands the contexts of IF or other videogames.

Next, that creature, e.g., an “orc”, might enter view and then exit view. How would an LLM interpret a delayed command from the player and respond that the creature was no longer in view? This suggests features of a dynamic transcript or log.

That is, a fuller illusion would be one that the AI sees or understands the present and recent past contexts of IF and other videogames.
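A minimal sketch of such a log follows, assuming a hypothetical `SceneLog` class that the game engine would update on enter and exit events and then render into part of the prompt. The event wording and the turn counter are illustrative choices, not features of any existing engine.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SceneLog:
    """Hypothetical engine-side log that is rendered into part of an LLM prompt."""
    events: List[str] = field(default_factory=list)
    in_view: Set[str] = field(default_factory=set)

    def enter(self, turn: int, name: str) -> None:
        self.in_view.add(name)
        self.events.append(f"[turn {turn}] {name} enters view")

    def leave(self, turn: int, name: str) -> None:
        self.in_view.discard(name)
        self.events.append(f"[turn {turn}] {name} leaves view")

    def to_prompt(self, last_n: int = 20) -> str:
        """Present state plus the most recent events, as plain text."""
        visible = ", ".join(sorted(self.in_view)) or "nothing"
        recent = "\n".join(self.events[-last_n:])
        return f"Currently in view: {visible}\nRecent events:\n{recent}"


log = SceneLog()
log.enter(1, "orc")
log.leave(3, "orc")
print(log.to_prompt())
```

Because the rendered text records both that the orc appeared and that it left, an LLM given this text alongside a late “attack the orc” command has what it needs to reply that the orc is no longer in view.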

Game engines, e.g., Unity and Unreal, could eventually come to support interoperation with LLMs’ dialogue contexts via features for the maintenance of dynamic documents, transcripts, or logs. These engines would then be of general use for creating CUI-enhanced IF and other videogames.

Also possible are uses of multimodal LLMs.

Transmission Efficiency

Instead of having to transmit the entirety of dynamic documents, transcripts, logs, or prompts for each spoken or written command to be interpreted by the LLM CUI, it is possible that “deltas” or “diffs” could be transmitted to synchronize between client-side and server-side copies of larger prompts or portions thereof.
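The sketch below illustrates the idea with line-level deltas computed by Python's standard-library `difflib.SequenceMatcher`. The `Delta` format and the `make_delta` and `apply_delta` functions are assumptions for illustration; a real client and server would also need to agree on versioning and on how the delta is serialized.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Each entry: (start, end, replacement lines) against the old copy.
Delta = List[Tuple[int, int, List[str]]]


def make_delta(old: List[str], new: List[str]) -> Delta:
    """Client side: record only the line ranges of `old` that changed."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


def apply_delta(old: List[str], delta: Delta) -> List[str]:
    """Server side: patch the stored copy, back to front so indices stay valid."""
    result = list(old)
    for start, end, replacement in reversed(delta):
        result[start:end] = replacement
    return result


old = ["Room: cave", "An orc is here.", "A lamp is on the floor."]
new = ["Room: cave", "A lamp is on the floor.", "The orc has left."]
delta = make_delta(old, new)
print(delta)
print(apply_delta(old, delta) == new)
```

Only the changed ranges travel over the wire, so the unchanged "Room: cave" line is never retransmitted.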

Conclusion

Thank you. I hope that I expressed these ideas clearly. I look forward to discussing these ideas with you. Is anyone else thinking about or working on these or similar challenges?

@scottfarris81, I am also interested in enhancing interactive fiction and videogames with conversational user interfaces. I think about how to provide players with play experiences involving their smart speaker devices, for audio-only games, and about how other devices, e.g., computers, videogame consoles, or smart televisions, might interoperate to enable accompanying visual experiences.

I hear you that running one or more large AI models on players’ computers or consoles could be problematic and that responses to players’ spoken or written commands should be fast.

With respect to theory, about how this all could work under the hood, game engines’ resources for text-based descriptions of things, e.g., “orc”, could involve templates. Changes to the runtime model for a particular “orc”, then, could result in updates to its text description.

For example, were one “orc” in a group of “orcs” to pick up something from the ground, e.g., “a wooden club”, its natural-language text description would update as a result and players could then refer to that specific “orc” by indicating “the orc holding the wooden club”.
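To make this concrete, here is a small sketch in which a change to an orc's runtime model changes its text description. Python's standard-library `string.Template` stands in for a fuller template engine, and the `Orc` class and its `held_item` field are hypothetical.

```python
from dataclasses import dataclass
from string import Template
from typing import Optional

# Hypothetical description templates for one kind of game object.
HOLDING = Template("the orc holding $item")
EMPTY_HANDED = Template("the orc")


@dataclass
class Orc:
    """Hypothetical runtime model for one creature."""
    held_item: Optional[str] = None

    def describe(self) -> str:
        """Render the current state into a referring expression."""
        if self.held_item:
            return HOLDING.substitute(item=self.held_item)
        return EMPTY_HANDED.substitute()


orcs = [Orc(), Orc(), Orc()]
orcs[1].held_item = "the wooden club"  # one orc picks up the club
print([orc.describe() for orc in orcs])
```

After the update, only the second orc's description mentions the club, which is what lets a player single it out.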

For more information on templates and large language models, see the Guidance project.

A good example of a template engine is Handlebars. It utilizes a curly-bracket syntax: {{...}}. I don’t know whether developers have integrated it, or similar components, into game engines before.

Potential efficiencies and optimizations include processing the templates to obtain dependencies between runtime models and the templates’ processing so that only relevant changes, as opposed to any and all changes, to runtime models would result in reinvoking template processors to update text descriptions.
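A sketch of that dependency tracking, again using Python's `string.Template` as a stand-in for a real template engine: the template is scanned once for the fields it reads, and later changes to other fields skip re-rendering. The `Described` class and its `render_count` counter are hypothetical and exist only to make the effect visible.

```python
from string import Template
from typing import Dict, Set


class Described:
    """Hypothetical game object whose description is re-rendered only when needed."""

    def __init__(self, template: str, **state: str) -> None:
        self.template = Template(template)
        self.state: Dict[str, str] = dict(state)
        # Fields the template actually reads, e.g. {"item"} for "$item".
        self.deps: Set[str] = set()
        for match in self.template.pattern.finditer(template):
            name = match.group("named") or match.group("braced")
            if name:
                self.deps.add(name)
        self.render_count = 0
        self.text = self._render()

    def _render(self) -> str:
        self.render_count += 1
        return self.template.substitute(self.state)

    def set(self, key: str, value: str) -> None:
        self.state[key] = value
        if key in self.deps:  # irrelevant changes skip the template processor
            self.text = self._render()


orc = Described("the orc holding $item", item="nothing", health="10")
orc.set("health", "7")                # not referenced by the template
orc.set("item", "the wooden club")    # referenced, so the text is re-rendered
print(orc.text, orc.render_count)
```

The health change leaves the description untouched, while the item change triggers exactly one additional render.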

In theory, these text descriptions would, then, be parts of larger scene descriptions, dynamic documents, transcripts, or logs which would, then, be parts of the prompts to the LLMs for interpreting the players’ commands and engaging with players in other game-related dialogue.

@scottfarris81, most of my forum posts here, thus far, have pertained to civics and civic technology. I happen to chair a Civic Technology Community Group. In that role, I think about how existing and new technologies, e.g., AI and LLMs, can improve society and democracy.

With respect to this forum thread, I am hoping to participate in some discussion, to share knowledge, and to learn more about others’ theories and experiences with respect to challenges involving large language models and conversational user interfaces for interactive fiction and other videogames.

Ok, that makes more sense. The only thing I have that is similar to what you’re looking for is that I am disabled and have a type of learning deficit. I am using AI to try to overcome them, but there are definitely challenges in that too.

Yes, the ideas mentioned in the original post definitely cover major aspects of the challenges that need to be overcome to implement interfaces that enable real time interaction.
My research is leading me towards two different paths that are potentially worth exploring.
The first is translating the game state into a text document or spreadsheet and providing the model with a set of rules, a history log, likely strategy documents, and a translational interface so that the LLM can interact with the game. Going this route, I identified open-source turn-based strategy games as interesting entry points.
The second idea is treating the challenge like development for deaf and blind people. Admittedly, this is even more challenging, as both rely heavily on their other senses to compensate, but they have also made great advancements in defining needs and thus in helping developers create games that can be played by these audiences.
Also, a question: have you seen OpenAI Gym? It’s a project for reinforcement learning, but it also offers insights into established techniques for communicating game states and enabling interaction between a model and videogames.

@vb, I had considered natural-language text documents but hadn’t yet considered uses of tables or spreadsheets, instead of or additionally, with respect to bidirectional synchronization of game state between game engines and LLMs.
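As a quick sketch of the tabular option, the function below renders a list of game objects as a pipe-delimited table that could be placed in a prompt. The function name `state_to_table`, the column names, and the object IDs are all made up for illustration.

```python
from typing import Dict, List


def state_to_table(rows: List[Dict[str, object]]) -> str:
    """Render game objects as a pipe-delimited table for inclusion in a prompt."""
    if not rows:
        return ""
    columns = list(rows[0])  # column order is taken from the first row
    lines = [" | ".join(columns)]
    for row in rows:
        lines.append(" | ".join(str(row.get(col, "")) for col in columns))
    return "\n".join(lines)


print(state_to_table([
    {"id": "orc_1", "kind": "orc", "holding": "wooden club"},
    {"id": "orc_2", "kind": "orc", "holding": "nothing"},
]))
```

A table like this is compact and easy to diff between turns, while natural-language descriptions may be easier for a model to use when resolving references.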

Conversational user interfaces could enable new accessibility scenarios for videogames.

I have encountered Gym/Gymnasium. I have also taken a look at and dabbled with Unity ML Agents. I could look at these anew to explore, in particular, established techniques for communicating and synchronizing game states with AI systems.

On these topics of interactive fiction and reinforcement learning agents, one might also find interesting the following publication and the works that it cites:

Basavatia, Shreyas, Shivam Ratnakar, and Keerthiram Murugesan. “ComplexWorld: A Large Language Model-based Interactive Fiction Learning Environment for Text-based Reinforcement Learning Agents.” In International Joint Conference on Artificial Intelligence 2023 Workshop on Knowledge-Based Compositional Generalization. 2023.

Summarizing the discussion thus far and adding some new thoughts and ideas:

Bidirectional synchronization between games or computer simulations running in game engines, e.g., Unity or Unreal, and large language models, e.g., GPT, can enable: (1) command interpretation, (2) game-related dialogue, e.g., Q&A, (3) gameplay by AI agents, (4) narration and transcription by AI agents, and (5) automated storytelling by AI agents.

With respect to command interpretation, the resultant output might resemble semantic frames with game objects filling slots in the semantic frames. Game objects in the slots of semantic frames could be referenced by IDs or by embedding vectors.

With respect to game-related dialogue, e.g., Q&A, AI systems, e.g., GPT, could answer questions about game worlds, game states, and on-screen contents. This could also be performed per virtual camera so that AI agents, each having a virtual camera, could ask questions about game worlds as visible to them.

With respect to gameplay by AI agents, AI systems, perhaps RL-based, could control (non-player) characters.

With respect to narration and transcription by AI agents, AI systems could narrate unfolding game events or computer-simulation events to players. AI agents could also transcribe unfolding events into transcripts or logs.

With respect to automated storytelling by AI agents, GPT-based systems could procedurally generate story content and events to be provided to and subsequently run by game engines.

Also, consider the structure, coherence, and cohesion of the resulting dynamic documents, transcripts, or logs. Template engines could simply output one chunk of content per game object, with the chunks then concatenated into larger scene descriptions. Beyond that, templates could access documents’ object models to procedurally generate more organized content into various sections of dynamic documents, transcripts, or logs, or chunks could carry metadata annotations to enable more complex assembly procedures.
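The metadata-annotation option could look something like the sketch below, in which each chunk carries a section name and a priority and an assembler groups and orders the chunks. The `Chunk` fields, the section names, and the `assemble` function are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Chunk:
    text: str
    section: str       # hypothetical metadata: which section the chunk belongs in
    priority: int = 0  # hypothetical metadata: lower sorts earlier in its section


def assemble(chunks: List[Chunk], section_order: Sequence[str]) -> str:
    """Group chunks by section, order them, and emit one document."""
    by_section: Dict[str, List[Chunk]] = defaultdict(list)
    for chunk in chunks:
        by_section[chunk.section].append(chunk)
    parts = []
    for section in section_order:
        members = sorted(by_section.get(section, []), key=lambda c: c.priority)
        if members:  # empty sections are omitted
            body = "\n".join(c.text for c in members)
            parts.append(section.upper() + "\n" + body)
    return "\n\n".join(parts)


print(assemble(
    [Chunk("An orc holds a wooden club.", "characters"),
     Chunk("A lamp lies on the floor.", "items"),
     Chunk("You are in a cave.", "location")],
    ["location", "characters", "items"],
))
```

The chunks arrive in whatever order the game objects were processed, and the assembler imposes the document's section order afterwards.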

As standard approaches and best practices are discovered for these use cases, functionalities for these forms of bidirectional synchronization between game engines and large language models could, eventually, be built into game engines, e.g., Unity or Unreal, perhaps made available via plugins, and/or could be provided to developers via other SDKs, e.g., DirectX or open-source libraries.