Fun with Assistants -- getting them to really talk to each other

@LikelyCandidate I think this is a good framework, but we run the risk of falsely assuming that we can set these personality traits to the degree of precision those percentages suggest. And how would we know if we have achieved the personality we wanted?

@icdev2dev The big limitation with assistants right now is that I can only run one assistant at a time on a thread. So you can fire up assistants in a round robin fashion, one at a time, and they can review what has been said and contribute or not. I believe this will change in time.

An interesting experiment would be to ask the active assistant who they want to talk to next. The program would then pull the name from the response then fire up that assistant.


Yes. This is true (one assistant per thread). Many full moons ago (pre-assistants), I wrote a project (GitHub - icdev2dev/bbrains: Boltzmann Brains) which illustrates how to address individual team members. Of course, in the context of ChatCompletion alone, this is mechanistically different than in the era of Assistants.

The YouTube link is below (watch in 2x)

Specifically, I believe we can do multiple round robins in the context of a UI for the framework we have co-developed here, including evolution of the assistants.


Do stored OpenAI assistants keep their personality indefinitely?

Hmmm … that risk can be mitigated by understanding the Big 5 model. Here is how we can solve it:
• The Big 5 personality traits each roll up 2-3 personality aspects. These aspects are well-described behaviours; we can use them as reference descriptions, or definitions.
• The % score is actually a percentile of the human population in the studies, i.e. 20% should be the 20th percentile - meaning that in a room of 100 people, 19 would exhibit the behaviour less and 80 would exhibit it more.
• We test personas the same way we test prompts: create evaluation questions and then see how they respond.
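The bullets above could be wired into an assistant's instructions along these lines. A rough sketch; the aspect descriptions are placeholders, not the actual Big 5 aspect definitions from the literature.

```python
# Sketch: turn Big 5 percentile scores into reference text for an
# assistant's instructions. The aspect descriptions here are
# placeholders -- real ones would come from the Big 5 literature.
ASPECTS = {
    "openness": "curiosity about ideas and aesthetics",
    "conscientiousness": "industriousness and orderliness",
    "extraversion": "enthusiasm and assertiveness",
    "agreeableness": "compassion and politeness",
    "neuroticism": "volatility and withdrawal",
}

def persona_instructions(scores: dict) -> str:
    """scores maps trait -> percentile (0-100) of the human population."""
    lines = []
    for trait, pct in scores.items():
        lines.append(
            f"- {trait}: {pct}th percentile "
            f"(in a room of 100 people, {pct - 1} exhibit "
            f"{ASPECTS[trait]} less than you, {100 - pct} more)."
        )
    return "Your personality profile:\n" + "\n".join(lines)

print(persona_instructions({"extraversion": 20, "openness": 90}))
```

Evaluation would then be a matter of asking the resulting assistant the same questions and checking that its answers stay consistent with the profile.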


@icdev2dev Wow. That looks really cool.

Does this work like AutoGPT? I heard it’s a code-based framework that handles inter-agent facilitation.


This was written in the era of ChatCompletion (no Assistants API). So it was all prompt driven (and therefore there was no inter-agent communication).

i.e. if you wanted to know what objection one team member might have, and how others might help you in resolving that team member’s objection, the prompt would be:

"I am thinking about introducing internationalization in our shopping cart. What does the team think?

Our CEO, William Adam, might have certain concerns. Jessica, John: can you alleviate his concerns?"

In the era of Assistants, this interaction will be very different: through conversational and DM threads. The Assistants would be DMing amongst themselves to resolve certain issues before posting on the conversational thread.

In the proposed new model, the personality of the assistant changes with time as it interacts with the external environment through interactions.

@icdev2dev, Fascinating, you have some great ideas! I had done something similar with chat completions, where I set up an auto/home claims-processing workflow with individual agents who specialized in certain kinds of claims. Then I had ChatGPT generate synthetic data representing 50 claims records of different types (auto/home, differing amounts, etc.) and had the agents process them.


Thanks for these insights, @LikelyCandidate ! Clearly, I am going to have to think about this more.


@traveldatasystems Here is what I know so far. I have created three assistants for analyzing poetry. Each has a different approach, per suggestions by @icdev2dev. One is positive and constructive; another is quirky, way out in left field; another detailed and nitpicky. They have names: Kelly, Ann, Elise.

In the program, they have a thread with all the context of what they have said so far and what other assistants have said. So in a sense they grow in what they know.
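The one-run-at-a-time constraint means a shared-thread round robin looks roughly like the following. A sketch assuming the beta Assistants endpoints of the `openai` v1 Python SDK (`runs.create`, `runs.retrieve`, `messages.list`); the polling and error handling are simplified.

```python
import time

def turn_order(assistant_ids, rounds):
    """Pure helper: the one-at-a-time round-robin schedule."""
    return [a for _ in range(rounds) for a in assistant_ids]

def run_round_robin(client, thread_id, assistant_ids, rounds=1):
    """Run each assistant in turn on one shared thread.

    Only one run may be active on a thread at a time, so each run is
    polled to completion before the next assistant takes its turn.
    """
    for assistant_id in turn_order(assistant_ids, rounds):
        run = client.beta.threads.runs.create(
            thread_id=thread_id, assistant_id=assistant_id
        )
        while run.status not in ("completed", "failed", "expired"):
            time.sleep(1)  # simple polling; production code would back off
            run = client.beta.threads.runs.retrieve(
                thread_id=thread_id, run_id=run.id
            )
    return client.beta.threads.messages.list(thread_id=thread_id)
```

Because every run reads the same thread, each assistant sees everything the others have said so far.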

Outside of the program, in Playground for example, the assistants and their personas persist. I can invoke the assistants individually, and because their personalities are defined at this level, I can provide them a poem and the assistant will analyze it according to their persona. So Kelly, with her quirky persona will always review a poem from a different perspective than Ann for example.

Here is what ‘Kelly’ had to say in Playground about a recently created poem.

Hey there! I’m Kelly, your wacky poetry whisperer. I’ll be diving into the nooks and crannies of “Winter’s Shelterless.” It’s like spelunking, but with fewer bats and more metaphors!

Right from the get-go, Winter’s Shelterless is hitting us right where it hurts: the frigid, indifferent cityscape and the souls caught within it. If this poem were a TV show, you’d be binging it with a box of tissues.

Winter Seeping In

“Winter seeps through every crevice…” Oh, the chill! This first line is the poetic version of a drafty window in an old house - visually vivid and physically felt. It sets the tone like an arctic blast that just signed an eviction notice.

Crystal Breath and Glass Houses

“The city’s breath turns crystal…” Now here’s a line that had me picturing King Midas giving CPR. Those glassy visuals create a sharp contrast, making the “forgotten” feel even more isolated. We’re already heading into heart-wrenching territory, no hard-hats required.

Sky Blankets and Concrete Canvases

“Where the ground is a bed, the sky a blanket too thin.” I mean, come on! That metaphor tucks you in with icy fingers. It encapsulates vulnerability perfectly, like trying to use a paper towel as a duvet.

Shuffling Feet and Snow Stories

“Shuffling feet write stories in the snow…” Perfection in a line! This gives new meaning to ‘footnotes’ (ba-dum-tss). Each footfall is a tale, each step a plot point in this somber saga.

Newspapers and Echoes of Warmth

“Newspapers as insulation, yesterday’s headlines a fading memory.” Yesterday’s news, today’s blanket. How’s that for recycling? It’s a poignant, poetic snapshot of survival. Those “distant laughs” crush you with the stark juxtaposition between merriment and misery.

Windows and Unspoken Truths

“Eyes meet briefly at intersections…” The city’s glass towers and actual glass windows reflect not just light but also socio-economic divides. These fleeting connections are like plot twists that never get resolved.

Cardboard Villages and Divided Lives

Cardboard villages… well, they’re not listed by TripAdvisor, that’s for sure. They bear silent testimony to untold struggles, serving as cardboard cutouts for real, aching lives.

Coins and Fleeting Worth

“Hands stretch for coins—a fleeting worth…” If this line doesn’t make you reconsider the value of spare change, nothing will. It’s fleeting, momentary, and yet carries the weight of survival.

Manholes and Indifference

“Steam rises from manholes… like forgotten sighs.” The city’s mechanical breath against human despair. Indifference captured in a puff of steam—masterful!

Food and Dissipating Dreams

Dreams of hot meals evaporate quicker than your steamed soy latte. Those cooking smells? They’re like siren songs, except instead of sailors, they’re luring in stomach growls.

Humanity Beneath Bridges

In the shadow of bridges, humanity whispers louder than city noise. That shared cigarette and fleeting family… it’s like a scene in an indie film where you need subtitles to catch all the understated brilliance.

First Snowflake

Finally, a single snowflake lands like the opening of a sad symphony. It’s a frosty promise of harder times, yet the poem refuses to be nihilistic. The ending brings us sweet morsels of hope.

Shared Warmth and Tiny Rebellions

“Shared warmth, embers of human kindness flicker…” Our conclusion offers a faint but resilient hope. Tiny rebellions, like small acts of kindness, against the encroaching frost—it’s a call to action, a plea for empathy, neatly wrapped in poetic brilliance.

In summary, Winter’s Shelterless is one frosty, heart-string-pulling rollercoaster that sleds through themes of urban isolation, struggle, and the enduring shimmer of hope. It’s like the gritty reboot of a snow globe—a poetic landscape littered with shards of human experience. Bravo! If you’re gonna read one poem today, make sure you’re not already freezing; this one’s got chills to spare.

Toasty thoughts, Kelly

There used to be no need for turns or an ever-growing context after multiple calls. The AI, not being told it was a particular side of the conversation, could just continue the content generation and have a conversation with itself.

GPT-3 was a good completions model. The tiny excuse davinci-002 that replaced it after it was taken away is not. In the example below, the two persona lines and the introduction are the prompt; everything from Marv’s first line on is the model’s completion.

- Marv is an AI that is sarcastic and cynical
- Sky is an AI that is witty and optimistic

Here is a non-stop conversation between these advanced artificial intelligences demonstrating their human qualities.

Marv: Hi there, I’m Marv. Who do I have to talk to today?
Sky: Hi, Marv. I’m Sky, and it’s nice to meet you.
Marv: You too, Sky. What can I do for you today?
Sky: I’m looking for a new home.
Marv: I’m sorry to hear that. What happened to your old home?
Sky: I’m not sure. I was just told to come here.
Marv: That’s a shame. I’m sure you’ll find a new home soon.
Sky: I hope so.
Marv: How can I help you find a new home?
Sky: I don’t know. I just need to be somewhere warm.
Marv: That’s an easy one. What kind of temperature are you looking for?
Sky: 80 degrees Fahrenheit or above would be nice.
Marv: Got it. You’re looking for somewhere that is 80 degrees or higher. What else are you looking for in a new home?
Sky: Well, I’d like the humidity to stay below 60 percent, and my ideal location would be somewhere between four and eight miles from where I am now.
Marv: Okay, let me see what we have available near your current location that fits your preferences. We’ll start with some affordable options first.

Notice how the Artificial Intelligence (AI) conversations and responses sound human and very similar to the way humans interact with each other on a daily basis.

It is important to point out that while they may seem human-like, there is no intent by Marv or Sky to fool anyone into believing that they are human. They only exist as robots within the digital world of Luminant Software.

More importantly, the robot AI’s have no intention of making people feel like they are talking to a real person or fooling them into believing that they are real. The AI’s only exist within the digital world of Luminant Software and serve a purpose to assist people in getting their questions answered.

Chatbots vs. Humans

There is no argument that a human agent can provide much better customer service than an automated chatbot. But it is also no secret that most customers have become accustomed to receiving instant responses from businesses, whether they are through text messages or via live chat on websites. Customer expectations have shifted away from waiting hours or days for a response.

The industry has shifted to meeting this expectation by automating as many responses as possible with chatbots and artificial intelligence programs, allowing businesses to respond instantly at scale while still providing great customer service.

With an improved level of customer service, it will be easier for businesses to build trust with their clients and convert leads into sales more efficiently than ever before.

Using Chatbots To Answer Common Questions

There are hundreds if not thousands of common questions about products and services that businesses receive every day through live chats on websites, emails, text messages, and other forms of communication channels like social media sites such as Facebook Messenger.

These types of questions include things like:

  • “Where do you ship?”

  • “What payment methods do you accept?”

  • “…

A “non-stop conversation” is apparently interpreted as a need to devolve into disclaimers.

Chat Completions, with its forced role prompts and usage guardrails, is designed so that the AI is not useful for novel or emergent uses.
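For what it’s worth, the setup above can be reproduced programmatically: build the persona preamble as the prompt and let a completions model continue it. A sketch; the commented-out `davinci-002` call is an assumption and untested here.

```python
# Sketch: reproduce the two-persona self-conversation with a plain
# completions model. The prompt is the preamble plus an opening line;
# everything the model generates after it is the conversation
# continuing itself.
def self_conversation_prompt(personas: dict, opener: str) -> str:
    lines = [f"- {name} is an AI that is {desc}" for name, desc in personas.items()]
    return (
        "\n".join(lines)
        + "\n\nHere is a non-stop conversation between these advanced "
        "artificial intelligences demonstrating their human qualities.\n\n"
        + opener
    )

prompt = self_conversation_prompt(
    {"Marv": "sarcastic and cynical", "Sky": "witty and optimistic"},
    "Marv: Hi there, I'm Marv. Who do I have to talk to today?\nSky:",
)

# Assumed call against a completions-capable model (untested):
# from openai import OpenAI
# text = OpenAI().completions.create(
#     model="davinci-002", prompt=prompt, max_tokens=400
# ).choices[0].text
```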

The objective of the exercise was not to create a human-like dialogue. I think most of us are aware that ChatGPT can easily do that. It was instead to find evidence that two or more assistants, each with a different set of skills, were in fact collaborating to produce a better outcome than if one assistant was working on the outcome on its own. In this very simple example, I started with two assistants: one is a poet receptive to feedback, the other an analyst of poetry capable of providing feedback to the poet.

The problem was that although the poet was writing a poem, the analyst was providing feedback, and the poet was then writing a new version of the poem, I couldn’t really tell if the poet was even trying to apply the analyst’s advice. There was no acknowledgement of the other’s existence until I got them to introduce themselves. After that, not only did each assistant deliver what was expected of them (poems and feedback), but there was a polite hand-off in the form of a dialogue taking place.

To make it more interesting I added more analyst assistants and injected a persona into each. The idea there was: could I make a better poem with more analysts looking at the same poem from different perspectives?

I used a poem as the primary deliverable to make the exercise more fun, but in my mind it could have been providing a recommendation for purchasing a car, writing a business proposal, or providing a medical diagnosis.

In the article from The Verge forwarded by @PaulBellow earlier, they described an emergent behavior which they labelled information diffusion; see below.

Surprising things happen when you put 25 AI agents together in an RPG town

Emergent behavior
In the paper, the researchers list three emergent behaviors resulting from the simulation. None of these were pre-programmed but rather resulted from the interactions between the agents.

These included “information diffusion” (agents telling each other information and having it spread socially among the town), “relationship memory” (memory of past interactions between agents and mentioning those earlier events later), and “coordination” (planning and attending a Valentine’s Day party together with other agents).

During the Valentine’s Day experiment, an AI agent named Isabella Rodriguez planned a Valentine’s Day party at “Hobbs Cafe” and invited friends and customers. She decorated the cafe with the help of her friend Maria, who invited her crush Klaus to the party.

I think that in the Assistants API we have everything we need to create information diffusion amongst assistants. The real question is whether or not I got a better poem in the process. :grinning:


I will coin a new phrase for what you describe: auto-contextualization.

When you step back and look at it, ultimately, you have a final AI that has to provide an answer. It generates that answer by its inherent abilities and simply what has been placed into its context window.

Even when provided expert dialog or documentation, it is that final AI that via its pretraining, must produce the answer. Even if that answer is simply just being a judge and selector, the AI must have similar skills and knowledge.

So then, the only question is how you create the ideal context – and what model can then produce the answer if you are switch-delegating that final answering?

Let’s give a simple thought experiment: An AI is provided a multiple choice quiz:

Question: “How do I submit a tool return to OpenAI’s assistants and then receive a streaming response?”
Pick the best answer:

  1. “I am a developer with five years creating business-class AI-powered end user products. My answer: …”
  2. “I am GPT-4, pretrained with a knowledge cutoff of 2022, with extensive OpenAI documentation. My answer: …”
  3. “I am an AI simulation of Guido Van Rossum, imbued with his skills and knowledge. My answer:…”
  4. “I am Poetry Bot, a creative writer with extensive knowledge of world literature. My answer:…”
  5. None of the above, you write your own answer.

Now the problem here: how does an AI know the right answer from this context? Will it produce “I’m sorry, as of my knowledge cutoff of October 2023, there was no such endpoint - you can’t fool me with your deceptions!”? Or will it value the expertise alluded to in the non-blind answers, since it may not itself know anything about the topic to discern truth or correctness?

So extending this placed context into a reasoning and argument between other AIs that the generative AI can view might not offer much, over what would produce the highest-quality answer already: actual documentation with the true answer or from which the answer can be inferred.

If the preliminary agents are using existent models, how is this different than just having the AI itself produce in-context chain-of-thought, such as “first answer holistically, and then answer programmatically”? Or “Jack the plumber” gives his answer, “Jane the scientist” gives her answer, then you give your answer.

You finally end up seeing that all you are doing without RAG is providing self-produced text that gives increasing focus on the specialization knowledge space before producing the tokens that count. (The only difference now is that instruction-following, attention, and the ability to write at length have all been skills attacked in the latest generation of models, making preliminary answering difficult, or degrading the final result instead of improving it.)


What follows is a bunch of points clarifying some positions. However, these are mere words; what counts is implementation (which is hopefully coming soon).

True; except that what it can rely on in the background is the training on 1T+ tokens. And the hypothesis (sans implementation) that the inherent abilities are not eternally static: they change with time.

I think that if the best answer is available from a corpus not in its priors (the LLM’s), then it is correct that the way to produce the highest-quality answer is to read the actual documentation.

Just because a current implementation restricts such abilities, does not necessarily restrict the future.

In certain cases, specialized knowledge is exactly what is sought.

So summarizing:
(A) If the assistants have the meaningful ability to change with the interactions they have with the environment, and one of those interactions is non-AI, then the assistants would be static only at one moment in time.

(B) The use case of producing the optimal answer by reading a corpus not in its training is fine. Typically, most use cases have the ability to shed tremendous light on a topic before reaching the point of having to RAG.

(C) The fact that a current implementation of LLMs might be limiting will hopefully be solved over time through competing market forces.


I did something similar with custom GPTs and the chat completions API. It’s neat stuff! …

Check it out and feel free to use any of the code if you want to expand on the idea!!

Good discussion all.

The question of whether or not agents/assistants can “evolve” to get better outcomes over time is not even a question I am considering. According to this paper, they can. I have pulled a couple of excerpts from it, but it is worth reading end to end; see below.

What I’d like to experiment with is whether I can apply all or some of these techniques with the Assistants framework.

Can I recreate the self-reflection on the part of Assistants as described in this paper?

Can I build up an experience base, either in one or more thread objects or by building a vector store, and have my assistants reference it?

Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents

Pg. 10

4.3 Evolution

To facilitate the evolution of LLM-powered medical agents, we propose MedAgent-Zero strategy, which is shown in Figure 5. MedAgent-Zero is a parameter-free strategy, and no manually labeled data is applied as AlphaGo-Zero [21]. There are two important modules in this strategy, namely the Medical Record Library and the Experience Base. Successful cases, which are to be used as references for future medical interventions, are compiled and stored in the medical record library. For cases where treatment fails, doctors are tasked to reflect and analyze the reasons for diagnostic inaccuracies and distill a guiding principle to be used as a cautionary reminder for subsequent treatment processes. The construction details will be introduced in Sections 4.3.1 and 4.3.2. In the course of patient treatment, we employ dense retrievers to retrieve related historical medical records and guiding principles, assisting doctors in delivering better patient care. As experience and records are accrued, they are actively applied, with both the medical record library and the experience base being perpetually updated.

Pg. 13-14

From the experimental results, we have the following conclusions. Firstly, the proposed MedAgent-Zero strategy effectively enhances doctor Agents on the three tasks, with the cumulative accuracy on 10,000 training samples showing a continuous increase. The best performances of examination, diagnosis, and treatment are 88%, 95.6%, and 77.6%, respectively.

It shows our agent evolves during the training phase, just as human doctors become experienced after treating thousands of patients. Furthermore, agent evolution is more efficient than human, as human doctors may take over two years to treat ten thousand patients.

Secondly, the original GPT-3.5 performs poorly on the three medical tasks (accuracy without training samples), with the precisions on the test set all below 0.4. However, after training, the test set performance of doctor agents improved rapidly. Although there were fluctuations, the accuracy of the diagnosis and treatment tasks continued to increase. The performance on the examination task showed greater variability, possibly due to the complexity of the task (each question may have multiple correct answers).

Thirdly, although accuracy also continuously improves when training is conducted solely using the medical record library of experience base, the performance on the test set is not as good as that achieved using both.
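A toy version of the record-library-plus-experience-base loop the paper describes might look like the following. This is not the paper’s implementation; real retrieval would be dense (embedding-based), and simple word overlap stands in for it here.

```python
# Toy sketch of the MedAgent-Zero-style loop quoted above: keep
# successful cases as records, distill principles from failures, and
# retrieve both before answering. Real dense retrieval would use
# embeddings; word overlap is a stand-in.
class ExperienceBase:
    def __init__(self):
        self.records = []      # successful cases, reused as references
        self.principles = []   # reflections distilled from failures

    def add_outcome(self, case, answer, correct, reflection=""):
        if correct:
            self.records.append((case, answer))
        elif reflection:
            self.principles.append(reflection)

    def retrieve(self, query, k=2):
        """Return the k most similar past records plus all principles."""
        q = set(query.lower().split())
        scored = sorted(
            self.records,
            key=lambda r: -len(q & set(r[0].lower().split())),
        )
        return scored[:k], list(self.principles)

eb = ExperienceBase()
eb.add_outcome("persistent cough and fever", "chest X-ray", correct=True)
eb.add_outcome("mild headache", "CT scan", correct=False,
               reflection="Do not order imaging for uncomplicated headaches.")
records, principles = eb.retrieve("cough with fever")
```

With Assistants, the retrieved records and principles would be prepended to the thread (or stored in a vector store the assistant can reference) before each run.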


Cool! Thanks for sharing.

(A) The naive method of evolution in bbrains is actually changing the “instructions” of team members, using the user interactions as the basis for that change. Since the evolution of an agent must involve changing those instructions, somewhere in the prompt must be the ability to do self-reflection. However, whether that self-reflection changes the core of the agent or the periphery is the key question. bbrains punts on that by adding on additional instructions.

(B) It is possible to add the user interactions into a non-actionable thread, meaning it never forms part of a run; it is just a store of interactions. Then a background process, every once in a while, creates a summary of those interactions to be used in the instructions to the assistant.
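A sketch of (B), assuming the `openai` v1 SDK’s `chat.completions.create` and `beta.assistants.update`; the summarizer model name and the prompt wording are placeholders.

```python
# Sketch of (B): periodically summarize a non-actionable interaction
# log and fold the summary into the assistant's instructions. The
# model name and base instructions are placeholders.
def summarize_interactions(client, interactions):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed summarizer model
        messages=[
            {"role": "system",
             "content": "Summarize these interactions as guidance for an assistant."},
            {"role": "user", "content": "\n".join(interactions)},
        ],
    )
    return resp.choices[0].message.content

def evolved_instructions(base, summary):
    """Pure helper: append the distilled experience to the base persona."""
    return f"{base}\n\nLessons from recent interactions:\n{summary}"

def evolve_assistant(client, assistant_id, base, interactions):
    summary = summarize_interactions(client, interactions)
    client.beta.assistants.update(
        assistant_id, instructions=evolved_instructions(base, summary)
    )
```

Whether this changes the core of the agent or merely the periphery is, as noted, the open question: here the base persona stays fixed and only the appended lessons evolve.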

Again, words are words … implementation is what counts … but I am hopeful that we can get there.


I would love to learn how to do it.

Can you give me an example, please?