Gpt-4 32k vs GPT-4 Turbo api + Legal advice on using

We want to use gpt-4 to generate Question & Answers from a book series owned entirely by our Department to use as FAQ in our ChatBot , it has roughly 6 million tokens , and after prompt engineering our prompt is roughly 1000 tokens at bare minimum .

the limits for these gpt4-32k & gpt4-turbo are very unclear for some reason , i want to know what is the input limit for either so i can pas exactly enough context to generate 4096 token output from context without having it compramise in quality of Q/A . We want to optimize the token cost as our prompt has to include domain specific terminologies not recognized by gpt .

my prompt is roughly like this atm which generate just a bit more fluff than we need & makes question answers from terminology which is wasting output limit.
max_new_token=1536 #(each context of 512 should generate 768 token) temperature=0.3
[Instrunction: Provide answers in text format based solely on the context provided from ********** by Teacher : ‘******’ below. Do not include any external information or assumptions beyond what is contained within the given context. Focus exclusively on the question presented, ensuring your answer is derived directly and entirely from the provided Context, you can use the provided terminology as reference for words in context .
Sample Term:Sample domain specific meaning

Context1:sample context chunk of 512 tokens each
Context2:sample context chunk of 512 tokens each

Question:Craft detailed and varied questions & answers based on the segment of text from ********** provided below that a student can ask . Each question should explore a different aspect or detail of the entire content, encouraging a complete comprehensive understanding. Aim for depth and insight in your inquiries, ensuring they provoke thoughtful analysis of the entire material.

Follow this format [‘Context1:\n\nQuestion1: Sample question?\nAnswer1:Sample answer\n\n Question2:Sample question?\nAnswer2:sample answer\n\nContext2:\n\nQuestion1: Sample question?\nAnswer1:Sample answer\n\n Question2:Sample question?\nAnswer2:sample answer\n\n’]

TLDR: input&output limit for Gpt-4 32k & 4-Turbo , Using FAQ generated by gpt for out published book in Chatbot Database .

Hi - is this the information you are looking for, i.e. the size of the context window by model? Apols if I am getting your question wrong.


isn’t context window the for chat histroy retentions of a session ID ? that’s different from the maximum token input limit i can provide to the api from my knowledge ? , What i need is max input token length .

I am new to openAi Api so maybe i am thinking in the wrong way , but i can’t use the chat histroy to only retain the instrunction not the conversation because the conversation will quickly fill up the context window and bear un necessary token cost

for a given API request, the context window represents the maximum tokens that are shared between the input tokens and the output tokens. so you could technically send up to 124k token as input in any API request for the GPT-4-turbo model and roughly 28k for the GPT-4-32k model to still achieve maxium output of 4096 tokens. However, in practice your output tokens will typically be much lower than 4096 tokens due to some model constraints.

What you include as part of your input tokens in an API requeset entirely depends on your use case, i.e. whether this is knowledge, or a chat history or a mix of the two.


that makes sense so i can send aprox 5 context chunk with instrunction in input and recieve output of decent quality .

can you take look at my prompt if possible and the legality of using generated FAQ

1 Like

I can’t speak to the legality aspect I’m afraid. I suggest you seek advice from your organization’s Legal Department in that regard.

As regards your prompt, I would more clearly distinguish between the system message and the user message. In the system message I would include the more general instructions including the model’s role, the task that you are expecting it to complete and the desired output format while in the user message I would include the actual question plus the question specific context.

1 Like

Just another observation: If I am interpreting it correctly, you might be using the max_tokens parameters differently than intended.

max_tokens refers to the maximum number output tokens that can be generated and can be set up to the value of 4096 (although as said in practice you won’t yield that many output tokens). While the measure in combination with the right prompt can help to somewhat control the output length, it is in practice not possible to ask the model to return a specific number of tokens (there some exceptional cases but for longer output it is not possible).

1 Like

the max i was able to get an output was 1k tokens which is a bit jarring since if i want proper Q/A i would have to run each chunk individually that will increase the input token cost 3-4x , any way to force more detailed answers since in my local llm’s i could always get it to generate around max output capacity

The conversation centers on generating Question & Answers from a book series for a ChatBot using OpenAi’s GPT-4 models. TEMPY initially asks about the input limits for GPT-4 32k & GPT-4 Turbo, with the aim to optimize token cost. A prompt example is given, highlighting the attempt to focus the AI on the text content and specific terminologies. TEMPY is particularly interested in generating 4096 tokens of output without quality loss.

jr.2509 shares a link about the context window’s size for different models. However, TEMPY clarifies that they are not interested in the context window but the maximum token input limit. jr.2509 explains that the context window represents the maximum tokens shared between input and output tokens. For GPT-4 Turbo, up to 124k tokens can be sent as input to achieve maximum output of 4096 tokens, while GPT-4 32k model allows approximately 28k tokens.

TEMPY appreciates the clarification and wonders about their prompt’s structure and the legality of the produced FAQs. jr.2509 advises to consult with a legal department concerning legality, and proposes changes to the prompt for clarity. A further observation by jr.2509 points out that the max_tokens parameter might be misused by TEMPY, as it sets a limit rather than a specific aim for output tokens.

In all: despite you specifying 4k output tokens, the AI doesn’t see this specification which would just truncate output if set too low.

The AI has been trained to output short responses, and after a certain length, cutting off the response becomes a mandatory imperative in the AI generation. Cost efficiency for limited compute.

1 Like

so there is no way to increase output size without running the questions in parts which will increase token cost , damn this sucks , our estimates are going to go way off with this change

Try to add something like this:

We need the following output format:

… 8-10 times block output … {[
… 9-10 times paragraph … {[
… 7-10 times sentence output, each sentence 5-15 words


just a guess - better use yaml, but I made this on my phone :sweat_smile:

1 Like