Is there any way to increase chatbot output token limits? Working from a custom knowledge base (and deeply ignorant!)

Hi folks - I’ll start by apologising: I’m not a programmer or familiar with coding, just fascinated by the possibilities of ChatGPT and the API. My capabilities just about start and end with “Hello World!”

I’ve been using an online guide from beebom.com to create a custom knowledge-base chatbot with Python, using Gradio as the chatbot UI.

As a test, I built myself a (very small) custom knowledge-base chatbot for fantasy recipes (I recommend the Stormy Sky Pie!). It took me all weekend and a lot of swearing, so please ELI5: is there any way to increase the token limit of the chatbot’s outputs? I’m not sure if it’s an in-built restriction or just the way I’ve set up this particular chatbot.

Any assistance or pointers in the right direction very much appreciated - up to and including rebuilding the approach from the ground up or using a different method than a chatbot; I’m here to (very slowly and painstakingly) learn!

Oh lookit - my account has gone to Level 1 and I can post links. This is the technique I’m currently using:

https://beebom.com/how-train-ai-chatbot-custom-knowledge-base-chatgpt-api/

Welcome to the community @smithsonian

Every model has a fixed context (token) length that it can process (prompt + generation).

Here’s a list of models with their max context length.
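For reference, the two models most relevant to this thread share one context window between prompt and completion:

    # Context windows (prompt + completion combined), in tokens
    CONTEXT_WINDOW = {
        "gpt-3.5-turbo": 4096,
        "gpt-3.5-turbo-16k": 16384,
    }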

gpt-3.5-turbo should suffice for your use case, or gpt-3.5-turbo-16k if more context length is needed.

You’d also have to use embeddings to retrieve semantically relevant context from your knowledge base, which LangChain is doing for you.
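If you’re curious what that looks like under the hood, here’s a minimal sketch of embedding-based retrieval using the openai package and numpy (the function and variable names are just illustrative, not from your script):

    import numpy as np
    import openai

    def embed(text):
        # Turn a piece of text into a vector using OpenAI's embeddings endpoint
        result = openai.Embedding.create(model="text-embedding-ada-002", input=text)
        return np.array(result["data"][0]["embedding"])

    def most_relevant_chunks(question, chunks, top_k=3):
        # Rank knowledge-base chunks by cosine similarity to the question
        q = embed(question)
        scored = []
        for chunk in chunks:
            c = embed(chunk)
            score = float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c)))
            scored.append((score, chunk))
        scored.sort(reverse=True)
        return [chunk for _, chunk in scored[:top_k]]

The top-scoring chunks get pasted into the prompt ahead of the user’s question, so only the relevant part of the knowledge base spends tokens.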

Can you share what model(s) you’re using currently?


There are two possibilities:

  1. The output is cut off: you’ve set the maximum number of tokens that can be generated too low, or the model ran out of context space to form an answer (you can check which case applies; see the snippet after this list).

  2. The composition is much shorter than you wished or instructed: OpenAI’s chat models have increasingly been trained to generate shorter outputs no matter your prompting technique, likely as part of reducing the computational resources used.
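To tell which of the two is happening, the API reports why generation stopped. A small sketch, assuming a non-streaming call with the openai package:

    import openai

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Write a long fantasy recipe."}],
        max_tokens=512,
    )
    # "length" = the answer was cut off by max_tokens or the context limit;
    # "stop"   = the model simply decided it was finished.
    print(response["choices"][0]["finish_reason"])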


Thanks for the speedy reply!

Looks like gpt-3.5-turbo.

But - and please forgive my very limited knowledge of code, I’m basically just copying and pasting - I’m wondering if this is a pre-set limit in the code:

    def construct_index(directory_path):
        max_input_size = 4096
        num_outputs = 512
        max_chunk_overlap = 20
        chunk_size_limit = 600

And then further down:

    max_tokens=num_outputs

This is where actually knowing Python might come in handy…

With a quick browse of the code I’m copying and pasting, I have a feeling that might be the case. Time to start tweaking numbers!

What you’re pasting isn’t direct interaction with the OpenAI API - it looks like a sample of using LangChain from an article.
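For reference, in scripts like that one, num_outputs is usually what ends up as max_tokens on the underlying model call, roughly like this (a sketch assuming the older llama_index and langchain releases the guide was written against; the class names and signatures here are assumptions and have changed in later versions):

    from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
    from langchain.chat_models import ChatOpenAI

    def construct_index(directory_path):
        max_input_size = 4096
        num_outputs = 512
        max_chunk_overlap = 20
        chunk_size_limit = 600

        prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap,
                                     chunk_size_limit=chunk_size_limit)
        # num_outputs is handed to the model as max_tokens here
        llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.7,
                                                    model_name="gpt-3.5-turbo",
                                                    max_tokens=num_outputs))

        documents = SimpleDirectoryReader(directory_path).load_data()
        index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor,
                                     prompt_helper=prompt_helper)
        index.save_to_disk("index.json")
        return index

So bumping num_outputs up is the first thing to try if your answers are getting cut off.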

For chat completions with the gpt-3.5-turbo model, your parameters would look more like:

    # call the chat API using the openai package and model parameters
    response = openai.ChatCompletion.create(
        model       = model,
        messages    = system + chat[-turns*2:] + user,  # concatenate lists
        temperature = temperature,
        top_p       = top_p,
        n           = 1,
        stream      = True,
        max_tokens  = max_tokens,
        presence_penalty = 0.0,
        frequency_penalty = 0.1,
        # logit_bias  = {"100066": -1},  # example, '~\n\n' token
        user        = "my_service_user-id",
    )

max_tokens both sets the maximum output that can be produced and reserves space in the context that then can’t be used for the input prompt to the AI.
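As a rough worked example with the numbers from that script:

    # gpt-3.5-turbo has a 4,096-token context window; max_tokens carves out the
    # completion's share, and everything you send (system message, chat history,
    # retrieved knowledge-base text, user question) must fit in what's left.
    context_window = 4096
    max_tokens = 512
    prompt_budget = context_window - max_tokens
    print(prompt_budget)  # 3584 tokens left for the prompt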

Yes, you can try tinkering with that if you’re getting truncated completions.

You can also remove it; the model will then allot the rest of the context to the completion.

However, if you still get incomplete completions, you’ll have to use the gpt-3.5-turbo-16k model or reduce the number of tokens you’re sending to the model.
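If you do switch, it’s just the model string that changes in a direct chat call - a minimal sketch:

    import openai

    # Same call shape as before, just pointing at the larger-context model
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",    # 16,384-token context window
        messages=[{"role": "user", "content": "..."}],
        max_tokens=1024,              # now there's room for a longer answer
    )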