Assistant ignores the last message on a run

Using the API via cURL, I have set up an assistant, then created and attached a thread and a vectorised file.

I attach 41 pre-defined messages to the thread (actually questions about the contents of the file). When I execute a run, the assistant responds appropriately to the first 40 messages in turn (answering the questions pretty well), but it completely ignores the 41st message; it's like I didn't send it!

However, when I look at the response JSON object from the run, I can see that the 41st message was definitely attached before I ran the thread.

I have not set any explicit output limit, and the responses to all the messages together total only 1,000 words or so.

However, if I repeat the run with fewer messages (removing, say, messages 20 and 21), the final message now gets the expected response. This seems to imply that there is a length limit set somewhere, but I can't see anything. What am I doing wrong?

Thanks very much for your extremely helpful reply.

I realised last night that I must have hit the context-window limit. We're using 4o, so the token limit is high, but we do need all the messages and the file, as we're extracting data from a report (and the 41st question asks for a summary of the previous 40 responses). The questions can be a bit long-winded, as we need to explain exactly what we want.

So, splitting it into two separate runs seems to be the best solution for us; the overall process takes a little longer to complete, but only by a few seconds.

At the moment I've set an arbitrary limit on the number of questions attached (25), but I may look at using a token counter to set the number on the fly, although the time taken to do this may offset its usefulness in our particular case, where we won't be changing the questions very often.
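In case it helps anyone, here is a rough sketch of that token-counter idea: packing questions into batches that stay under a per-run token budget using tiktoken. The budget figure and the placeholder question list are illustrative assumptions, not values taken from the API.

```python
import tiktoken

# Rough sketch: split questions into batches under an assumed token budget.
try:
    encoder = tiktoken.encoding_for_model("gpt-4o")
except KeyError:
    encoder = tiktoken.get_encoding("cl100k_base")  # fallback for older tiktoken versions

TOKEN_BUDGET_PER_RUN = 8000  # assumed budget; leave headroom for file chunks and responses
questions = [f"Question {i}" for i in range(1, 42)]  # placeholder questions

batches, current, current_tokens = [], [], 0
for question in questions:
    n = len(encoder.encode(question))
    if current and current_tokens + n > TOKEN_BUDGET_PER_RUN:
        batches.append(current)
        current, current_tokens = [], 0
    current.append(question)
    current_tokens += n
if current:
    batches.append(current)

print(f"{len(batches)} batches of sizes {[len(b) for b in batches]}")
```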

Thanks once again for taking the time to give such a clear answer.

The reply doesn’t seem to appear in this thread, so I’ll re-post it here for other people to benefit from:

bluesfingers1, November 15:

This issue is likely caused by a limit on the number of tokens or messages the assistant can process in a single thread run. While you mention not having set an explicit output limit, many API systems like the OpenAI API have inherent token limits for both input and output. Let’s break down the problem and possible solutions:


Understanding the Problem

  1. Token Limit Per Run:
  • Most APIs have a maximum number of tokens (including both input and output) that can be processed in a single call. For example, if the token limit is 4,096 tokens, the sum of all tokens in the attached messages (input) plus the generated responses (output) must not exceed this limit.
  • If you are attaching 41 messages and their content or vectorized embeddings are large, it’s possible the total token count exceeds the limit.
  2. Behavior When the Limit Is Exceeded:
  • When the token limit is reached, the assistant may ignore the overflow (e.g., the 41st message) even if it’s attached to the thread. This is because the API processes only up to the allowed number of tokens.
  3. The Case of Fewer Messages:
  • When you remove messages, the total number of tokens is reduced, allowing the 41st message to fall within the limit and receive a response.
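To make the arithmetic concrete, here is a purely illustrative back-of-the-envelope estimate; every figure below is an assumption, not a measurement of the actual thread:

```python
# Illustrative estimate of how input + output tokens accumulate over one run.
questions = 41
avg_question_tokens = 150      # assumed average length of each question
avg_answer_tokens = 120        # assumed average length of each answer
retrieved_file_tokens = 4000   # assumed tokens pulled in from the vectorised file

input_tokens = questions * avg_question_tokens + retrieved_file_tokens
output_tokens = questions * avg_answer_tokens
total_tokens = input_tokens + output_tokens

print(f"Estimated input: {input_tokens}, output: {output_tokens}, total: {total_tokens}")
# With these assumptions the run already needs roughly 15,000 tokens,
# comfortably past a 4k or 8k context window.
```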

Diagnosing the Issue

  1. Check the Token Limit:
  • Consult the API documentation to determine the exact token limit for the endpoint you are using.
  • If you’re using the OpenAI API, typical token limits for models are:
    • GPT-3.5: 4,096 tokens.
    • GPT-4: 8,192 or 32,768 tokens, depending on the version.
  2. Count Tokens in Input and Output:
  • Use a token counter tool to calculate the number of tokens used by the 41 messages combined. You can find token counting tools in OpenAI’s documentation or libraries like tiktoken in Python.
  3. Inspect API Logs:
  • Some APIs provide detailed logs or debugging tools that specify why certain messages are ignored. Look for indications of token limits being exceeded (see the sketch after this list).
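For the OpenAI Assistants API specifically, one way to check is to retrieve the run object after it finishes and inspect its status and usage fields. A minimal sketch, assuming the v2 REST "retrieve run" endpoint and that the environment variables OPENAI_API_KEY, THREAD_ID and RUN_ID are placeholders you set yourself; field names may differ in other API versions:

```python
import os
import requests

# Assumed: OpenAI Assistants API v2 "retrieve run" endpoint; adjust if your version differs.
headers = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "assistants=v2",
}
thread_id = os.environ["THREAD_ID"]  # hypothetical placeholder for your thread ID
run_id = os.environ["RUN_ID"]        # hypothetical placeholder for your run ID

resp = requests.get(
    f"https://api.openai.com/v1/threads/{thread_id}/runs/{run_id}",
    headers=headers,
)
run = resp.json()

# status may be e.g. "completed" or "incomplete"; usage reports prompt/completion token counts.
print(run.get("status"), run.get("incomplete_details"), run.get("usage"))
```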

Possible Solutions

  1. Reduce Token Usage in Input:
  • Summarize Messages: Pre-process your messages to reduce verbosity while retaining critical information. For example, condense long questions into shorter, concise forms.
  • Reduce Vectorized Content: If your attached file or vectorized embeddings contribute significantly to token usage, reduce their size or focus on key sections relevant to the questions.
  2. Batch the Requests:
  • Instead of attaching all 41 messages at once, divide them into smaller batches (e.g., 20 messages per batch). Run each batch sequentially and compile the results.
  3. Use a Model with a Higher Token Limit:
  • If you’re hitting the limit on a smaller model (e.g., GPT-3.5), switch to a version with a higher token capacity (e.g., GPT-4 with 32k tokens).
  4. Explicitly Handle Long Conversations:
  • Some APIs allow you to control which parts of the conversation history are kept. Consider selectively including only the most relevant messages or truncating older, less important messages (see the sketch after this list).
  5. Check Thread-Specific Limits:
  • If the API has thread-specific constraints (e.g., max messages per thread), review the documentation or settings to ensure you’re not exceeding such limits.
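In the OpenAI Assistants API, run creation accepts parameters such as truncation_strategy, max_prompt_tokens and max_completion_tokens that control how much of the thread is kept in the prompt. A minimal sketch, assuming the v2 REST "create run" endpoint; the numeric values and environment-variable names are illustrative assumptions, so check the current documentation before relying on them:

```python
import os
import requests

headers = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "Content-Type": "application/json",
    "OpenAI-Beta": "assistants=v2",
}
thread_id = os.environ["THREAD_ID"]        # hypothetical placeholder for your thread ID
assistant_id = os.environ["ASSISTANT_ID"]  # hypothetical placeholder for your assistant ID

body = {
    "assistant_id": assistant_id,
    # Keep only the most recent messages in the prompt; the count is an illustrative choice.
    "truncation_strategy": {"type": "last_messages", "last_messages": 20},
    # Optional caps on prompt/completion tokens for the whole run (illustrative values).
    "max_prompt_tokens": 20000,
    "max_completion_tokens": 4000,
}

resp = requests.post(
    f"https://api.openai.com/v1/threads/{thread_id}/runs",
    headers=headers,
    json=body,
)
run = resp.json()
print(run.get("id"), run.get("status"))
```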

Example Fix

Here’s how you might modify your cURL-based API request to address token limits:

1. Batch Processing

Divide the messages into smaller groups and run them sequentially:

```bash
curl -X POST "https://api.example.com/assistant/v1/run" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "assistant_id": "your_assistant_id",
  "thread_id": "your_thread_id",
  "messages": [
    {"role": "user", "content": "Question 1"},
    {"role": "user", "content": "Question 2"},
    ...
    {"role": "user", "content": "Question 20"}
  ]
}'
```

Repeat for the next batch.

2. Token Counter (Python with tiktoken)

To estimate token usage before sending requests:

```python
import tiktoken

# Use the tokenizer that matches your model (encoding_for_model maps a model name to its encoding).
encoder = tiktoken.encoding_for_model("gpt-3.5-turbo")
messages = ["Message 1", "Message 2", ..., "Message 41"]  # "..." stands for the remaining messages

# Calculate total tokens
total_tokens = sum(len(encoder.encode(message)) for message in messages)
print(f"Total tokens: {total_tokens}")
```

3. Using a Higher Token Model

Request a model upgrade in your API configuration:

```bash
curl -X POST "https://api.example.com/assistant/v1/run" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "gpt-4-32k",
  "assistant_id": "your_assistant_id",
  ...
}'
```

Conclusion

The behavior you’re observing is consistent with token or message limits. By reducing input size, batching requests, or upgrading the model, you should be able to process all 41 messages successfully. If issues persist, consult the API documentation or support team for additional insights.

This is an engineered, GPT-4o-generated answer, so take it in that context.