Max tokens is 4,096 for gpt-3.5-turbo — does it have to fit both the messages sent and the answer generated?

First question: Does the max tokens limit have to cover both the messages sent and the answer generated?

Second question: Does each message in the array sent to the "chat" endpoint consume a part of the tokens?

Third question: If each message in the array consumes a part of the tokens and I send an array of messages that exceeds the max tokens, what will happen?

Fourth question: How can I make the model automatically cut off the part at the beginning of the conversation that exceeds the max tokens?

  1. The model’s token limit (documented here) applies to everything, prompt and response. But the max_tokens API parameter only applies to the response.
  2. Yes.
  3. You’ll get an error back from the API saying you exceeded the token limit.
  4. By specifying max_tokens in the API call you can limit the maximum length of the response; it will abruptly cut off the response at the specified limit. See the docs and the sketch below.
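
A minimal sketch of such a request (assuming axios and a placeholder API key; the value 150 is just an illustration). The cap applies only to the generated reply, while the model’s overall context limit still has to fit prompt plus completion:

// Sketch: max_tokens caps only the completion, not the prompt.
const axios = require('axios');
const apiKey = 'your-openai-api-key';

async function askWithCap() {
    const response = await axios.post(
        'https://api.openai.com/v1/chat/completions',
        {
            model: 'gpt-3.5-turbo',
            messages: [{ role: 'user', content: 'Explain tokens in one short paragraph.' }],
            max_tokens: 150 // the reply is cut off after 150 tokens; the prompt is unaffected
        },
        { headers: { 'Authorization': `Bearer ${apiKey}` } }
    );
    console.log(response.data.choices[0].message.content);
}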

For the fourth question, I don’t want to limit the maximum length of the response and cut it off at a specified limit. Instead, I want to cut off the array of messages from its beginning: I want the model to take only the messages that fit within the max tokens and drop the rest. Is this possible? Should I do some math, estimate the expected response tokens, and calculate how many messages I can send per call to avoid getting an error back?
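
Something like this is what I mean, just as a sketch (using a very rough 4-characters-per-token estimate rather than a real tokenizer):

// Rough idea: drop the oldest messages until the estimated prompt size,
// plus some room reserved for the answer, fits under the model limit.
// The 4-characters-per-token estimate is only an approximation.
const MODEL_LIMIT = 4096;     // gpt-3.5-turbo context size
const RESPONSE_BUDGET = 1000; // tokens kept free for the answer

const estimateTokens = (text) => Math.ceil(text.length / 4);

function trimHistory(messages) {
    const trimmed = [...messages];
    let total = trimmed.reduce((sum, m) => sum + estimateTokens(m.content), 0);
    while (trimmed.length > 1 && total + RESPONSE_BUDGET > MODEL_LIMIT) {
        total -= estimateTokens(trimmed[0].content);
        trimmed.shift(); // drop the oldest message first
    }
    return trimmed;
}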

The Node.js script below saves every request and completion to a JSON file. It then sends this data as conversation history with each new prompt. The script also records the number of tokens used. However, the code to calculate how much history can be sent is still in development.

// Import required modules
const fs = require('fs');
const axios = require('axios');

// Your OpenAI API key
const apiKey = 'your-openai-api-key';

// Function to interact with OpenAI API
async function interactWithAI(userPrompt) {
    try {
        // Define the message data structure
        let messageData = { 'messages': [] };

        // If requests.json exists, read and parse the file
        if (fs.existsSync('requests.json')) {
            let raw = fs.readFileSync('requests.json');
            messageData = JSON.parse(raw);
        }

        // Format the conversation history and the new user request
        let systemMessage = "Conversation history:\n" + messageData['messages'].map(m => `${m.role} [${m.timestamp}]: ${m.content}`).join("\n");
        let userMessage = "New request: " + userPrompt;

        // Make a POST request to OpenAI's chat API
        let response = await axios({
            method: 'post',
            url: 'https://api.openai.com/v1/chat/completions',
            headers: { 'Authorization': `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
            data: { 'model': 'gpt-4', 'messages': [ { "role": "system", "content": systemMessage }, { "role": "user", "content": userMessage } ] }
        });

        // Log the AI's response
        console.log(response.data['choices'][0]['message']['content']);

        // Get the current timestamp
        let timestamp = new Date().toISOString();

        // Add the new user request and the AI's response to the message history
        messageData['messages'].push({ 
            "role": "user", 
            "content": userPrompt, 
            "timestamp": timestamp, 
            "tokens": response.data['usage']['prompt_tokens'] // Include prompt tokens
        });

        messageData['messages'].push({ 
            "role": "assistant", 
            "content": response.data['choices'][0]['message']['content'], 
            "timestamp": timestamp, 
            "tokens": response.data['usage']['completion_tokens'] // Include completion tokens
        });

        // Write the updated message history to requests.json
        fs.writeFileSync('requests.json', JSON.stringify(messageData, null, 2));

        // Return the AI's response
        return response.data['choices'][0]['message']['content'];
    } catch (e) {
        // If an error occurred, log it to the console and return an error message
        console.error('An error occurred:', e);
        return 'An error occurred while interacting with the OpenAI API. Please check the console for more details.';
    }
}
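
The part that is still in development could look roughly like this sketch, which uses the per-message "tokens" values the script already records to pick how much history fits under a budget (the 3,000-token figure is an arbitrary assumption that leaves room for the reply):

// Sketch of the in-development history selection: walk backwards from the
// newest message and keep adding messages while the recorded token counts
// stay under the budget.
function selectHistory(messages, budget = 3000) {
    const selected = [];
    let total = 0;
    for (let i = messages.length - 1; i >= 0; i--) {
        total += messages[i].tokens || 0;
        if (total > budget) break;
        selected.unshift(messages[i]); // keep chronological order
    }
    return selected;
}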


Ah, yes, this is the way to do it. Drop messages as needed to stay within token limits. Alternatively, you can make a separate request to ask GPT to summarize the history and use that summary in place of the actual messages.
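
A rough sketch of the summarization approach (assuming axios and apiKey are set up as in the script above; the summarization instruction is just an illustration):

// Sketch: compress older messages into a short summary with a separate
// request, then send that summary instead of the full history.
async function summarizeHistory(messages) {
    const transcript = messages.map(m => `${m.role}: ${m.content}`).join('\n');
    const response = await axios.post(
        'https://api.openai.com/v1/chat/completions',
        {
            model: 'gpt-3.5-turbo',
            messages: [
                { role: 'system', content: 'Summarize the following conversation in a few sentences, keeping any facts the assistant may need later.' },
                { role: 'user', content: transcript }
            ]
        },
        { headers: { 'Authorization': `Bearer ${apiKey}` } }
    );
    // The returned summary can be sent as a single system message in place of the old messages.
    return response.data.choices[0].message.content;
}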


Ok, now if I set the max_tokens parameter, will the model try to hit that max in every response it generates? Or will its responses range from one token up to the max?
I mean, how do I estimate the number of tokens the response will need?

Actually I use bubble.io, which is a no-code platform, so now I am trying to figure out how to predict the tokens needed for the response, to be able to do the math to determine how much history I can include per call.


The max_tokens parameter does not inform the AI how long its output should be; it is only a hard cap on how many tokens can be returned.

If you don’t want to limit the response output at all, want to potentially use all the available space to generate a response without a premature cutoff, and will simply manage the input size so there is enough context length remaining for that response, you can omit the optional max_tokens parameter when using chat completions. (The AI can unexpectedly use all of that space if it gets caught in a loop, repeating words.)

You can tell the model what size of text you would like, and it will be loosely taken into account: if you ask for 50 words you might get anywhere from 25 to 100, and if you ask for 100 words you might get from 50 to 200. This assumes there is a valid response of approximately that length; asking “is the red ball red?” will clearly not usually yield a long answer.
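
For example, the length hint goes in the prompt itself (the exact wording and the 50-word target are just an illustration):

// Sketch: request an approximate length in the prompt; the model follows
// it only loosely, so expect the reply to vary around the target.
const messages = [
    { role: 'system', content: 'Answer in roughly 50 words.' },
    { role: 'user', content: 'Why does the sky look blue?' }
];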