I’m planning to move from the Assistants API to the Responses API and I’m not sure where to start. My research suggests that I download the OpenAI library and set it up in Python; is that correct? Or is there an easier way for no-code developers? And do I need to set up a RAG system/layer? Finally, how exactly do you use chat prompts? Do they have anything to do with the Responses API? Apologies for the super basic questions, but any help is much appreciated.
Chat Completions API use pattern:
- maintain your own “system” message with instructions and behaviors, and retain your user’s chat session history of messages between the user and the AI. Send that history of messages along with the latest user input, and you get an immediate conversational response back in the same API call. Program your own functions for the AI to call.
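That pattern can be sketched in a few lines of Python. The function below just assembles the request body you would POST to `/v1/chat/completions` (the model name is illustrative, and the actual SDK call is shown in a comment); the point is that *you* own the system message and the history:

```python
# Sketch of the Chat Completions pattern: you keep the system message and
# the chat history yourself, and resend both with every new user input.

def build_chat_request(system_text, history, user_text):
    """Assemble the payload for one POST to /v1/chat/completions."""
    return {
        "model": "gpt-4.1-mini",  # illustrative; pick your own model
        "messages": (
            [{"role": "system", "content": system_text}]
            + history                                    # prior user/assistant turns
            + [{"role": "user", "content": user_text}]   # the newest input
        ),
    }

# With the official `openai` package, this would be sent as:
#   from openai import OpenAI
#   client = OpenAI()
#   response = client.chat.completions.create(**build_chat_request(...))
#   reply = response.choices[0].message.content
# Then append both the user turn and the reply to `history` for next time.

history = []
payload = build_chat_request("You are a terse support bot.", history, "Hi!")
```

After each reply comes back, you append the user turn and the assistant turn to your own history store, and you decide how long that history is allowed to get.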
What Assistants did for you, with these server-side objects and methods:
- Assistants - make a “preset container” ID with instructions that also persists some settings, along with functions. A cousin to “GPTs”. All of which can be created and managed through API calls.
- Assistants offers internal tools that the AI can keep calling until satisfied: file search (vector stores) and code interpreter (a Python tool). At release this was “retrieval” (the AI could explore, load, and read documents itself, at high expense).
- Threads: a chat history container, where you place the user message by API call. The AI assistant’s response message is then also placed there for your retrieval.
- Runs: take an assistant ID and set it to work on the thread ID to produce that response.
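The whole Assistants round-trip can be summarized as a sequence of endpoint calls. This is just a descriptive sketch of the flow (paths per the Assistants endpoints; in practice you also poll the run’s status until it completes before reading messages back):

```python
# Sketch of the Assistants round-trip. The first two calls happen once
# (per app / per session); the rest repeat for every user turn.
ASSISTANT_FLOW = [
    ("POST", "/v1/assistants",                   "create the preset container (once)"),
    ("POST", "/v1/threads",                      "create a chat history container (once per session)"),
    ("POST", "/v1/threads/{thread_id}/messages", "place the user message in the thread"),
    ("POST", "/v1/threads/{thread_id}/runs",     "set the assistant to work on the thread"),
    ("GET",  "/v1/threads/{thread_id}/messages", "retrieve the assistant's response message"),
]

for verb, path, purpose in ASSISTANT_FLOW:
    print(f"{verb:4} {path:42} # {purpose}")
```

Compare that to Chat Completions above: five server-side objects and calls versus one stateless request.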
Responses API use pattern, basic:
- similarly, maintain your own “system” message with instructions and behaviors (which can instead be sent in the “instructions” parameter), and retain your user’s chat session history of messages between the user and the AI. Send that history along with the latest user input, and you get an immediate conversational response back in the same API call. Program your own functions for the AI to call.
- PLUS: the AI can use its internal tools that you specify iteratively before responding, now including web search also.
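A basic Responses API request looks very similar. Again this is a payload sketch for `POST /v1/responses` (model and tool names are illustrative; `web_search` is the built-in tool type per my reading of the docs):

```python
# Sketch of a basic Responses API payload: "instructions" carries your
# system behavior text, "input" carries the history plus the newest turn,
# and built-in tools can be enabled per call.

def build_responses_request(instructions, history, user_text):
    return {
        "model": "gpt-4.1-mini",
        "instructions": instructions,       # your "system" behavior text
        "input": history + [{"role": "user", "content": user_text}],
        "tools": [{"type": "web_search"}],  # built-in tool the model may invoke
    }

payload = build_responses_request("Answer briefly.", [], "What's new today?")
```

If you send no tools and keep the history yourself, this is essentially Chat Completions with renamed parameters.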
Then you have these features patched in incrementally to Responses:
- Response ID server state
- store: true - the default, which permanently (it seems) logs every call you make on OpenAI’s servers, by response.id. Many things won’t work without this.
- previous_response_id - if you send the prior stored response ID as a parameter, it acts like a chat history, where you only need to add the newest input. You’d still have to keep your own database of user sessions and these previous IDs to reuse and resend; how is that saving you anything over just resending the messages themselves, one wonders?
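To make the bookkeeping concrete, here is a sketch of the chaining pattern. You still maintain a per-session map of “last response ID” in your own database; only the ID shown is illustrative:

```python
# Sketch of server-side chaining with previous_response_id: your app still
# keeps a session -> last-response-ID mapping somewhere.

sessions = {}  # your own store: session key -> last response ID

def build_chained_request(session_key, user_text):
    payload = {
        "model": "gpt-4.1-mini",
        "store": True,  # the ID must be stored to be reusable next turn
        "input": [{"role": "user", "content": user_text}],  # newest input only
    }
    prior = sessions.get(session_key)
    if prior:
        payload["previous_response_id"] = prior  # pulls in the stored history
    return payload

def record_response(session_key, response_id):
    sessions[session_key] = response_id  # save for the next turn

first = build_chained_request("user-42", "Hello")
record_response("user-42", "resp_abc123")  # ID taken from the API's reply
second = build_chained_request("user-42", "And then?")
```

Note that the first turn of a session sends no `previous_response_id` at all; every later turn must resend the latest stored ID.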
- Prompts
- can only be updated by the organization owner, using the platform site and a chat-like UI. Cannot be created, updated, listed, or retrieved over the API. It is not a ‘prompt’ but a terribly named ‘server-side preset’, whose only use is to provide its ID when you make a call. It has configuration similar to an “assistant”, giving parameters and a tools list that you could place yourself per call, but it can contain more than just one message (like examples). If your application must keep track of these prompts anyway (which tools are enabled per prompt ID so that the correct additional parameters can be included in the API call, and which user is employing them for what app), and they cannot be adapted via API, why would you ever choose this locked-down mechanism instead of any API key sending whatever instructions and settings it wants per call?
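For completeness, here is what referencing one of these dashboard-created presets looks like in a call, as I understand the documented shape (the `pmpt_` ID, version, and variables are all illustrative):

```python
# Sketch of referencing a dashboard-created "prompt" (server-side preset)
# in a Responses call. The prompt object takes an ID, an optional pinned
# version, and optional template variables.

payload = {
    "model": "gpt-4.1-mini",
    "prompt": {
        "id": "pmpt_0123456789",         # created in the platform UI, not via API
        "version": "2",                  # optional: pin a specific version
        "variables": {"city": "Oslo"},   # optional: fill template placeholders
    },
    "input": [{"role": "user", "content": "What's the weather like?"}],
}
```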
- Conversations
- Instead of a response ID automatically created each time (which is still, preposterously, mandatory), you can create a “conversation” ID on its own endpoint and send this ID with your Responses API request. The new input and the AI output are then saved there as a chat history for the future, as long as you keep specifying that ID. You can put messages there manually or get them out, but you shouldn’t need to.
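The conversation flow, sketched (the `conv_` ID is illustrative; the conversation is created once on its own endpoint and then referenced per call):

```python
# Sketch of the Conversations flow: create a conversation ID once, then
# pass it with each Responses call so new turns are appended server-side.

create_conversation = ("POST", "/v1/conversations")  # returns {"id": "conv_..."}

def build_conversation_request(conversation_id, user_text):
    return {
        "model": "gpt-4.1-mini",
        "conversation": conversation_id,  # server-side chat history container
        "input": [{"role": "user", "content": user_text}],  # newest input only
    }

payload = build_conversation_request("conv_abc123", "Continue our chat")
```

Unlike `previous_response_id` chaining, you store one stable ID per session instead of updating it after every turn.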
- Unmanaged costs
- When you use either of these chat mechanisms, unlike Assistants, there is no limit on how long the chat history will grow when placed into the AI model context. If set to “auto”, the chat can continue by OpenAI dropping the oldest messages, but only at the model maximum, which can be a million tokens of input. OpenAI can set how many tokens are sent to the AI model to avoid errors; YOU CAN’T.
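This is exactly why keeping the history yourself is attractive: you can cap context growth before OpenAI’s limits ever matter. A rough sketch, using a crude four-characters-per-token estimate (a real app would use a proper tokenizer like tiktoken):

```python
# Client-side history trimming by token budget: keep only the most recent
# messages that fit, so every API call has a predictable maximum cost.

def rough_tokens(message):
    return max(1, len(message["content"]) // 4)  # crude estimate

def trim_history(history, budget_tokens):
    """Keep the most recent messages that fit within the token budget."""
    kept, total = [], 0
    for msg in reversed(history):  # walk newest-first
        cost = rough_tokens(msg)
        if total + cost > budget_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))    # restore chronological order

history = [
    {"role": "user", "content": "x" * 400},       # ~100 tokens
    {"role": "assistant", "content": "y" * 400},  # ~100 tokens
    {"role": "user", "content": "z" * 40},        # ~10 tokens
]
trimmed = trim_history(history, budget_tokens=120)  # drops the oldest turn
```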
- Containers
- A fast-expiring ID for a code interpreter session. You have to upload files to the container for them to be used in code interpreter, and the only way to get generated files back out is for the AI to “cite” them in a response it produces; otherwise the container is locked down. You pay each time one is created, and the server-side chat history is destroyed and cannot continue if its connected container expires.
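For reference, enabling code interpreter in a Responses call looks roughly like this, per my reading of the docs (the file ID is illustrative; `"auto"` creates, and bills, a fresh container):

```python
# Sketch of the code_interpreter tool in a Responses call: the tool entry
# carries the container spec, and any files must already be uploaded.

payload = {
    "model": "gpt-4.1-mini",
    "tools": [{
        "type": "code_interpreter",
        "container": {"type": "auto", "file_ids": ["file_abc123"]},
    }],
    "input": [{"role": "user", "content": "Chart the data in this CSV."}],
}
```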
- Vector stores
- Pretty much the same as before, except that you now pay for every search invocation. Also the same as before: injected messages saying that the user uploaded files, so it is bad for application-based RAG.
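The model-driven version, for comparison, is the `file_search` tool pointed at vector store IDs you created earlier (IDs illustrative; each search the model decides to run is billed):

```python
# Sketch of the file_search tool in a Responses call: the model decides
# when (and how often) to search the listed vector stores.

payload = {
    "model": "gpt-4.1-mini",
    "tools": [{
        "type": "file_search",
        "vector_store_ids": ["vs_abc123"],  # illustrative vector store ID
        "max_num_results": 5,               # cap results returned per search
    }],
    "input": [{"role": "user", "content": "What does the manual say about setup?"}],
}
```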
- Web search
- OpenAI injects its own system message, and the AI will follow that message instead of yours, responding as if it were Google: no smarter, but spouting web links.
- Streaming
- you must write event handlers for dozens of different types of events streamed back at you, which also echo back all the parameters you sent, your instructions, and multiple copies of the AI text.
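In practice, for display purposes you can ignore most event types and keep only the text deltas. A sketch, with simulated event dicts standing in for what the SDK would yield (the `response.output_text.delta` event type name matches the streaming docs):

```python
# Filtering a Responses stream down to just the incremental text: every
# other event type (created/completed, tool events, request echoes) is
# ignored here.

def collect_text(events):
    """Accumulate only the incremental text deltas from a stream of events."""
    text = []
    for event in events:
        if event["type"] == "response.output_text.delta":
            text.append(event["delta"])
    return "".join(text)

simulated = [
    {"type": "response.created"},
    {"type": "response.output_text.delta", "delta": "Hel"},
    {"type": "response.output_text.delta", "delta": "lo"},
    {"type": "response.completed"},
]
answer = collect_text(simulated)  # -> "Hello"
```

A real handler would print each delta as it arrives rather than joining at the end, but the filtering logic is the same.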
So: The easiest way is to start with Chat Completions. Keep your own user’s chat history and manage the length necessary for good experience and budget. Use a system message to tell the AI how to act.
If you want vector stores provided by OpenAI, they now offer a direct pay-per-use search endpoint that returns the top results. That can power a function you write yourself and behave however you want (and your own AI function isn’t inclined to search half a dozen times at your expense the way Responses is).
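That direct-search approach can be sketched as a thin wrapper your own function-calling layer invokes exactly once per query (path and body fields per my reading of the vector store search endpoint docs; the ID is illustrative):

```python
# Sketch of wrapping the direct vector store search endpoint
# (POST /v1/vector_stores/{id}/search) as your own function, so YOUR code
# decides when to search, once, instead of the model deciding.

def build_search_request(vector_store_id, query, top_k=5):
    path = f"/v1/vector_stores/{vector_store_id}/search"
    body = {"query": query, "max_num_results": top_k}
    return path, body

path, body = build_search_request("vs_abc123", "installation steps")
# The returned chunks can then be pasted into your system/user message as
# context: plain application-side RAG, with one billed search per turn.
```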