Edge Pass 4096 Token Limit. Systematic approach to sent API call

zhihong0321 · March 6, 2023, 6:47am

i believe the Stream = True, allow us to send multiple call, as 1 prompt.
currently GPT 3.5Turbo only allowed Context in prompt. with 4096 single limit,
we quickly run into limit as the conversation continue ( if you want GPT to aware the conversation history )

My idea :

1st call = context
2nd call = bot instruction, limitation
3rd and Last call = summarize of previous converstation. ( limited to last 2 hours )

and get response from OpenAI.

But i have a problem.
I m not sure how to use data:[DONE] in my call.

Hopefully any senior coder here could give this stream = true a try. and share with me how to end the stream. Thanks in advance.

zhihong0321 · March 6, 2023, 7:23am

I found the more detail doc from githut

github.com

openai/openai-cookbook/blob/main/examples/How_to_stream_completions.ipynb

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How to stream completions\n",
    "\n",
    "By default, when you request a completion from the OpenAI, the entire completion is generated before being sent back in a single response.\n",
    "\n",
    "If you're generating long completions, waiting for the response can take many seconds.\n",
    "\n",
    "To get responses sooner, you can 'stream' the completion as it's being generated. This allows you to start printing or processing the beginning of the completion before the full completion is finished.\n",
    "\n",
    "To stream completions, set `stream=True` when calling the chat completions or completions endpoints. This will return an object that streams back the response as [data-only server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#event_stream_format). Extract chunks from the `delta` field rather than the `message` field.\n",
    "\n",
    "## Downsides\n",
    "\n",
    "Note that using `stream=True` in a production application makes it more difficult to moderate the content of the completions, as partial completions may be more difficult to evaluate. which has implications for [approved usage](https://beta.openai.com/docs/usage-guidelines).\n",

This file has been truncated. show original

I misunderstood. the stream is actual reverse direction

meaning, if the completion is “stream” to your server,
so you can display the result like " ChatGPT " ( typewriter effects )

ruby_coder · March 6, 2023, 7:49am

As I tried to explain to you earlier

Output steaming, not input.

zhihong0321 · March 6, 2023, 7:51am

ya…

the 4096 was like the barrier of Normal Goku between SSJ Goku…

ruby_coder · March 6, 2023, 8:03am

You are learning very fast for a novice coder.

Keep it up!

zhihong0321 · March 6, 2023, 8:14am

thanks for the compliment. it boost me!!

would like to seek your advice on summarize conversation, and compile into next API call.

call GPT3.5-turbo, pass only the conversation history to summarize it.
save the summary into my DB - chat_session > summary_so_far
next call, include the summary as context.

– only send summarize request on every 5th step ( count the user input entry )

So far in my test, I creating a sales agent, with context also include most FAQ.
even without knowing previous chat, the agent still handle well.
so, I guess do the summary call every 5th step would be enough. and save some token.

What do you think?

Topic		Replies	Views
How can I write Python code such that both input prompt and output results are in the same conversation thread? API gpt-35-turbo , api	4	1102	April 26, 2024
How to maintain context with gpt-3.5-turbo API? API	20	21879	December 13, 2023
Need more than a 4097 token call from chat gpt api API	7	3372	November 28, 2023
A conversation using the API API	6	3105	December 16, 2023
Maintain the context within the 4096 max tokens API	2	2402	February 16, 2024

Edge Pass 4096 Token Limit. Systematic approach to sent API call

Related topics