Dear all, I have a question about the chunk size and chunk overlap of the OpenAI assistant file search tool


In the assistant playground, the default parameters for file search’s chunk size and chunk overlap are 800 and 400, respectively.

If I upload guideline files (instructions for the AI on how to format reports and special focus items) with content sizes of 1500 and 400 tokens, what should I set the chunk size and chunk overlap parameters to?

Or should I just use the default settings?

Thank you.

Instructions for the AI should always be present.

The file_search is just that: a function that the AI must invoke with query terms to search upon.

The AI is not going to search “do you have any instructions for how I should format reports” on its own. Forcing it to do so through exhaustive system-message instructions and multiple turns of tool calls would be just as wasteful as simply keeping those operational documents in the instructions at all times.

To answer the question though, if you pick a large chunk size for a file, larger than the number of language tokens within, you get one chunk per file. Then the file content is either returned in whole by a search, or other relevant chunks from other files displace it (and the search doesn’t have a threshold, it always returns results, regardless of how irrelevant). At least one huge chunk has no wasteful overlap if you are curating document sizes and contents yourself.
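To make the one-chunk-per-file point concrete, here is a rough Python sketch. It assumes a simple sliding-window chunker, where each chunk starts `chunk size - overlap` tokens after the previous one; the actual chunker may differ in detail. The helper names `estimate_chunks` and `static_chunking_strategy` are made up for illustration. The dictionary shape and the limits checked (chunk size between 100 and 4096, overlap at most half the chunk size) reflect my understanding of the static `chunking_strategy` option for vector store files, so verify them against the current API reference.

```python
import math


def estimate_chunks(file_tokens: int, chunk_size: int, overlap: int) -> int:
    """Estimate the chunk count for a sliding-window chunker (an assumption)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if file_tokens <= chunk_size:
        return 1  # the whole file fits in one chunk, so overlap never applies
    stride = chunk_size - overlap  # new tokens covered by each further chunk
    return 1 + math.ceil((file_tokens - chunk_size) / stride)


def static_chunking_strategy(chunk_size: int, overlap: int) -> dict:
    """Build a static chunking_strategy dict after checking the assumed limits."""
    if not 100 <= chunk_size <= 4096:
        raise ValueError("max_chunk_size_tokens must be between 100 and 4096")
    if overlap < 0 or overlap > chunk_size // 2:
        raise ValueError("chunk_overlap_tokens must not exceed half the chunk size")
    return {
        "type": "static",
        "static": {
            "max_chunk_size_tokens": chunk_size,
            "chunk_overlap_tokens": overlap,
        },
    }


# File sizes from the question (1500 and 400 tokens):
# default settings (800 / 400) versus one large chunk with no overlap (4096 / 0).
for tokens in (1500, 400):
    print(tokens, estimate_chunks(tokens, 800, 400), estimate_chunks(tokens, 4096, 0))
```

Under that assumption, the defaults split the 1500-token file into three overlapping chunks while the 400-token file stays whole. A 4096 / 0 setting keeps each file as a single chunk.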


Thank you for your response. So the tokens for instructions are counted towards the 4096 tokens limit for each conversation with the assistant?

If I have records of conversations between a social worker and a client, and I want the assistant to generate a case report based on our social worker’s report format (including important points and content) and the mental status examination guideline, do the assistant instructions need to be this detailed, consuming many tokens?

Thank you


The user will provide you with records of conversations between a social worker and a client. “Q:” represents the social worker’s questions; “A:” represents the client’s answers. You need to write the client’s case report based on the conversation record. Before writing, you must read and understand the conversation record completely.

Refer to the following materials to write the case report:

  • Summary Guideline: file-1
  • Mental Status Examination Guideline: file-2
  • Use Vector Store: vs_(file-1 + file-2 automatically produced by assistant)

When writing the case report, pay special attention to:

  1. Follow the format and evaluation criteria outlined in the summary guideline.
  2. Refer to the mental status examination guideline for the mental status examination section.
  3. The summary should not contain unnecessary footnotes or markers, only plain text.

The limit of 4096 that you might be referring to is the maximum response length that you will get back from the model.

gpt-3.5-turbo models have a context length of 16k tokens, and gpt-4-turbo, 128k, giving you lots of room to pay for more input for each call. A typical response reservation setting subtracted from that context length is max_tokens: 2000.

Thank you very much for your response. So, you mean that the 4096 token limit is the upper limit for the OpenAI assistant’s response, and it does not include the tokens from my question and instructions???

Could you please explain what this sentence means? I don’t quite understand it:
“A typical response reservation setting subtracted from that context length is max_tokens: 2000”

Does it mean that the default upper limit for the assistant’s response is 2000 tokens? How can I set it so that the assistant answers as fully as possible?

Thank you.

First, some clarifying background: The context window length is a shared memory space for both loading the AI with prompt and context input, and for the AI to follow that with its own language generation. It is measured in encoded tokens of AI language.

max_tokens is a chat completion parameter for setting the maximum size of the response that you will get from the AI before it is cut off or truncated, with finish_reason reported as “length” instead of “stop”. The maximum the parameter can be set to with new models is 4096 (tokens) due to an artificial limitation by OpenAI, but the AI will rarely write that much due to its training, anyway, so setting it lower can reduce the expense if the AI goes nuts writing looping nonsense.
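A small sketch of how one might detect that cutoff in code. It only inspects an object shaped like a chat completion response (`choices[0].finish_reason`). The helper name `was_truncated` and the stand-in responses built with `SimpleNamespace` are illustrative assumptions, so nothing here calls the API.

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """Return True when the first choice ended because it hit the token limit."""
    return response.choices[0].finish_reason == "length"


# Stand-in objects that mimic the shape of a chat completion response.
cut_off = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
finished = SimpleNamespace(choices=[SimpleNamespace(finish_reason="stop")])

print(was_truncated(cut_off), was_truncated(finished))
```

With a real response object, a truncated report could then be retried with a higher max_tokens or continued in a follow-up call.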

The AI model doesn’t know how this parameter is set. The length of output response will be by innate behavior (to curtail output length) and the prompting that you do.

If you do not set a max_tokens parameter, the maximum response is the entire remaining context length or that 4k limit, whichever is smaller. Without it, no space is reserved solely for forming a response, and no error is reported if you use nearly all the context on input. You can send 15.75k tokens of input to gpt-3.5-turbo’s 16k and have only 250 tokens remaining for a response.
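The arithmetic above can be sketched as a few lines of Python. The function name `response_budget` is hypothetical, 16k is treated as a round 16,000 tokens to match the figures in this post (the real model limit may differ slightly), and the 4096 output cap is the limit discussed above rather than something the code looks up.

```python
from typing import Optional


def response_budget(
    context_length: int,
    input_tokens: int,
    max_tokens: Optional[int] = None,
    output_cap: int = 4096,
) -> int:
    """Return the largest response (in tokens) that can still be generated."""
    remaining = context_length - input_tokens
    if remaining <= 0:
        raise ValueError("input alone fills the context window")
    if max_tokens is None:
        # Nothing reserved: the response gets whatever is left, up to the cap.
        return min(remaining, output_cap)
    if max_tokens > remaining:
        # With max_tokens set, input plus the reservation must fit the context.
        raise ValueError("input plus max_tokens exceeds the context length")
    return min(max_tokens, output_cap)


print(response_budget(16_000, 15_750))                  # nearly full context
print(response_budget(16_000, 2_000))                   # plenty of room, capped
print(response_budget(16_000, 2_000, max_tokens=2000))  # explicit reservation
```

The first call leaves only 250 tokens for the answer, the second is limited by the 4096 cap, and the third returns the 2000 tokens that were explicitly reserved.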

In assistants, you have little control of tokens input by tools or output - it is tuned for maximum expense. The one control now available will simply abort and produce an error after you’ve already been paying for the agent to make internal calls that grew to excess.

Thank you for your response.

To be honest, I might still not fully understand it. However, regarding the assistant file search function, I have two markdown formatted knowledge files to give the assistant for something like RAG.

One is the primary case report format file (about 1500 tokens).
The other is the detailed guideline for the mental status examination section within the case report (about 4000 tokens).

As shown in my original attached image, if I can set the chunk size and chunk overlap parameters, what would be the best settings for my situation? Should I stick with the default settings of 800 and 400?

Thank you.