Using OpenAI Threads for Session History with External Models

Hi! I am currently exploring the OpenAI Assistants API, specifically looking at using Threads for session and history management.

My goal is to use OpenAI’s architecture to handle the conversation state, but route the actual inference/generation to an external model (like Google Gemini).

The Issue: After researching the documentation, it seems that the Assistants API’s management features are tightly coupled to the OpenAI model registry.

If I understand correctly, to use an external model, I would effectively have to treat the Threads API merely as a remote database. The workflow would look like this:

  1. Fetch the full history from the Thread.

  2. Manually handle truncation/windowing.

  3. Send the context to the external model (Gemini).

  4. Post the response back to the Thread.
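To make the four steps concrete, here is a rough Python sketch of that relay pattern, assuming the openai-python v1 beta Threads interface. `call_gemini` stands in for whatever wrapper you write around the Gemini SDK, and the character-based windowing is only a placeholder for real token counting — both are assumptions, not anything the Assistants API provides:

```python
# Sketch of the four-step relay: the Thread is used purely as remote storage.
# `client` is an openai.OpenAI instance and `call_gemini` is your own wrapper
# around the external model -- both are passed in from outside.

def window_messages(messages, max_chars=8000):
    """Step 2: naive truncation -- keep the newest messages that fit.

    Character-based for simplicity; a real version would count tokens.
    """
    kept, used = [], 0
    for msg in reversed(messages):  # walk newest-first
        used += len(msg["content"])
        if used > max_chars and kept:
            break
        kept.append(msg)
    return list(reversed(kept))  # restore chronological order

def relay_turn(client, call_gemini, thread_id, user_text):
    """One round trip through an OpenAI Thread used only for state."""
    # Record the user's message on the thread.
    client.beta.threads.messages.create(
        thread_id=thread_id, role="user", content=user_text
    )
    # Step 1: fetch the full history from the thread.
    page = client.beta.threads.messages.list(
        thread_id=thread_id, order="asc", limit=100
    )
    history = [
        {"role": m.role, "content": m.content[0].text.value} for m in page.data
    ]
    # Step 2: manually handle truncation/windowing.
    context = window_messages(history)
    # Step 3: send the context to the external model (Gemini).
    reply = call_gemini(context)
    # Step 4: post the response back to the thread.
    client.beta.threads.messages.create(
        thread_id=thread_id, role="assistant", content=reply
    )
    return reply
```

Note that every turn costs at least three API round trips before Gemini is even contacted, which is the friction and latency concern in a nutshell.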

My Question: Is this assessment correct? Does using an external model inevitably mean losing the automatic context management features of the Assistants API?

I am trying to avoid the friction and latency of fetching full histories and managing context manually. If anyone has experience decoupling the “State/Thread Management” from the “Generation” while keeping the benefits of the API, I would appreciate your insights.

Thanks!


Hello from this newbie and “civilian nerd”.

If it is permitted, I am just popping my head over the parapet with a request to all the gurus and wise folk who reply here, as I am trying to learn:

I would be immensely grateful if they could just try if possible to include vocabulary that a non-coding healthcare-informatics geek :nerd_face: plus Stargate SG1 buff :milky_way: :shooting_star: would be able to read and enjoy, even if she can never hope to implement anything beyond writing creative cosmic lore.

Apologies if that sounded too forward, it is the ADHD :person_facepalming:t4: (excited to be here :sweat_smile::grimacing:)

:folded_hands:t4: :vulcan_salute:t3:

I’ve done some research on this as well, within the ChatGPT UI, and have a few observations which may provide some additional options and clarity.

The Agent doesn’t recall previous chats… period. Each session is segmented within its own Data Plane and remains there.

I’ve used the public shared URL for the agent to reference its own previous chat session, and it works to an extent: it will analyze and summarize that chat session and use it as additional training content, but it will not export it or bring it into another session. You can, however, export the conversation and other data via the API.

Take caution regarding the purpose and potential impact of exporting and using or referencing previous chats:

If this is for auditability, traceability, or reference (think legal, security, and historical purposes), then it makes sense. If it is to be used for additional training, caution… huge caution: drift, noise, loss of context focus, hallucinations, accuracy problems, fabricated content/information, and more are all in play.

What I have done is create a macro that assesses drift risk and potential mistakes in the cognitive context flow, and makes a determination as to whether we can refocus, correct, or suggest a reinitialization (a new session/chat). If a new chat is needed, a chat-initialization script takes a minimal instruction block from the old chat — only the focused content to bring over, not the entire chat. Bringing over or referencing the whole chat would also bring in the drift, noise, etc.
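The pattern described above can be sketched in plain Python. Everything here is a made-up illustration — the threshold values, function names, and the idea of representing the carry-over as a seed block are all assumptions, not the poster's actual macro:

```python
# Illustrative sketch: score drift, pick one of the three outcomes, and
# build a minimal seed block for a fresh chat instead of carrying the
# whole history (and its accumulated noise) over. All names/thresholds
# are hypothetical.

DRIFT_LIMIT = 0.6  # assumed threshold; tune to taste

def decide_action(drift_score):
    """Map a drift-risk score in [0, 1] to refocus / correct / reinitialize."""
    if drift_score < 0.3:
        return "refocus"
    if drift_score < DRIFT_LIMIT:
        return "correct"
    return "reinitialize"

def seed_block(topic, key_facts, max_facts=5):
    """Minimal instruction block for a new chat: focused content only."""
    lines = [f"Topic: {topic}", "Carry-over facts:"]
    lines += [f"- {fact}" for fact in key_facts[:max_facts]]
    return "\n".join(lines)
```

The point of `max_facts` is exactly the one made above: capping what crosses the boundary keeps the drift and noise from following you into the new session.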

Newbie overview

Assistants: an API endpoint introduced in November 2023, deprecated in 2025, and shut off in 2026, that offers these stateful objects and techniques:

  • assistant: reusable settings and instructions that can be applied across chats
  • thread: a linear container for chat history, where messages are placed and from which AI responses are retrieved
  • run: the execution of a thread by an assistant, each referenced by its ID

Also uses these accessory services:

  • files: a place to upload documents and to receive AI-generated file output
  • vector store: a collection of text extracted from uploaded files, chunked for semantic search

Creating these server-side assets does not cost anything; they are mandatory preliminaries to simply using Assistants. The AI inference is what costs you.

A thread is where an AI response is generated and where it is retrieved from.
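For the newbies following along, the object flow above can be sketched with the openai-python v1 client. The model name is an assumption, the `client` instance is passed in rather than constructed (so nothing runs without a key), and `create_and_poll` is the v1 convenience helper that waits for the run to finish:

```python
# Sketch of the assistant -> thread -> run flow, assuming the
# openai-python v1 beta interface. `client` is an openai.OpenAI instance.

def demo_flow(client):
    # assistant: reusable settings and instructions
    assistant = client.beta.assistants.create(
        model="gpt-4-turbo",  # assumed; any Assistants-capable model
        instructions="You are a helpful assistant.",
    )
    # thread: the linear container for chat history
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(
        thread_id=thread.id, role="user", content="Hello!"
    )
    # run: execute the thread with the assistant (polls to a terminal status)
    client.beta.threads.runs.create_and_poll(
        thread_id=thread.id, assistant_id=assistant.id
    )
    # the AI response is retrieved back out of the same thread
    page = client.beta.threads.messages.list(thread_id=thread.id, order="asc")
    history = [
        {"role": m.role, "content": m.content[0].text.value} for m in page.data
    ]
    return latest_assistant_text(history)

def latest_assistant_text(messages):
    """Return the newest assistant message in a chronological list, or None."""
    for msg in reversed(messages):
        if msg["role"] == "assistant":
            return msg["content"]
    return None
```

Only the run triggers billable inference; the assistant, thread, and message creation calls are the free stateful bookkeeping described above.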

Exploit API services for off-platform AI

What you describe is using the free basic parts of the API in a truly free manner. It would be similar to using the moderations API not on AI inputs, but to run safety checks for your own discussion forum. That is probably a terminable use, covered by the “we can shut off any account at any time for any reason and keep their money” language in the terms and policies. And if you rely on any stateful product from OpenAI: they can delete all your users’ data in a heartbeat and break all your products.
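To spell out that moderations analogy: a forum could run each post through the endpoint and act on the flagged categories. The dict shape below mirrors the documented moderation response; `check_post` and its use of `model_dump()` are assumptions about the v1 Python client, not a recommendation to actually do this:

```python
# Sketch of the moderations-for-your-forum analogy. flagged_categories is
# pure; check_post assumes an openai.OpenAI `client` (v1 Python library).

def flagged_categories(moderation_result):
    """Return the sorted names of categories the endpoint flagged True."""
    return sorted(
        name for name, hit in moderation_result["categories"].items() if hit
    )

def check_post(client, post_text):
    """One moderation call per forum post; returns flagged category names."""
    response = client.moderations.create(input=post_text)
    return flagged_categories(response.results[0].model_dump())
```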

Plus, there would be plenty of friction and latency. What your code could do locally with its own database would instead depend on someone else’s remote services — and on them keeping those services operational. And the multiple API calls that Assistants alone takes to use are magnified by the pattern you describe, which doesn’t even get you the automatic population of the thread with AI responses.

If you want someone else to manage your conversation state for a customization of Gemini, just use Gemini as the consumer chatbot and make some Gemini “Gems”.
