Why is OpenAI API gpt-4o slow to respond?

I’m currently building a chatbot with FastAPI using the OpenAI API.

The chatbot reviews business plans, and it takes almost 20-30 seconds to respond.

I tried reducing and increasing max_tokens, but there was no significant change. When I switched down to a gpt-3 model, the answer quality was too low.

I know the input prompt is long, but I can’t reduce it further. Has the API response speed always been like this, or is it a recent issue?

Below is the code I’m currently working on, is there a way to improve the speed?

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import requests
import openai
import os
import asyncio
import time
from dotenv import load_dotenv

load_dotenv()

app = FastAPI()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
client = openai.OpenAI(api_key=OPENAI_API_KEY)
ACCESS_TOKEN = os.getenv("ACCESS_TOKEN")

class ChatRequest(BaseModel):
    message: str

async def get_business_plan_detail(parse_url, access_token=ACCESS_TOKEN, member_no=1000):
    parse_split = parse_url.split("/")
    bbi_no = parse_split[-1]

    url = "my_api_which_get_business_plan_document"
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    data = {
        "bbiNo": int(bbi_no),
        "memberNo": member_no
    }

    start_time = time.time()
    # Run the blocking requests call off the event loop
    response = await asyncio.to_thread(requests.post, url, headers=headers, json=data)
    end_time = time.time()
    print(f"⏱️ Business plan fetch time: {end_time - start_time:.2f}sec")

    if response.status_code == 200:
        return response.json()
    else:
        return {"error": True, "status_code": response.status_code, "message": response.text}

async def analyze_business_plan(text):
    start_time = time.time() 

    response = await asyncio.to_thread(
        client.chat.completions.create,
        model="gpt-4o",
        response_format={"type": "text"},
        messages=[
            {"role": "system", "content": "You are an expert business consultant with deep market analysis expertise. Please provide a **highly detailed** and thorough review of the following business plan."},
            {"role": "user", "content": f"""
                You are an AI-based business consultant.
                Please review the market response to the following business plan and provide feedback in Korean, in **as much detail as possible**, from a virtual customer's point of view.
                Seeing it from the customer's point of view is the most important thing.

                📌 **Request for detailed analysis**:
                1️⃣ **Product power and differentiators** → Deeply analyze strengths and points to improve, compare them with the existing market, and explain functional differentiators
                2️⃣ **Customer experience and marketing** → Detailed customer persona analysis, key considerations for the customer purchase journey
                3️⃣ **Market competitiveness analysis** → Competitive advantage compared to similar products in the current market
                4️⃣ **Sustainability of the business model** → Revenue model, B2C vs B2B strategy, long-term scalability
                5️⃣ **Market expansion potential and global expansion strategy** → Regional characteristics to consider when expanding overseas, localization strategy

                📄 **Contents of the business plan:**
                {text}
                """},
        ],
        max_tokens=4096,
        temperature=0.7,
    )


    end_time = time.time()  
    print(f"⏱️ Time to answer OpenAI Response: {end_time - start_time:.2f}sec") 

    return response.choices[0].message.content

@app.post("/chat")
async def chat(request: ChatRequest):
    total_start_time = time.time()

    user_message = request.message

    if not user_message:
        raise HTTPException(status_code=400, detail="Enter the message to chat.")

    # Pre-processing data if a particular URL pattern is included
    if "something-specific-url-pattern" in user_message:
        result = await get_business_plan_detail(user_message)

        if isinstance(result, dict) and "sections" in result:
            business_document = f"""
            Project Name : {result['sections'][0]['details']['bbiProjectName']}
            Team Name : {result['sections'][0]['details']['bbiTeamName']}
            Item Overview : {result['sections'][0]['details']['bbiItemOverview']}
            
            1. Item Overview
            Item Name : {result['sections'][1]['details']['bitItemName']}
            Key Features : {result['sections'][1]['details']['bitKeyFeatures']}
            Relevant Technologies : {result['sections'][1]['details'].get('bitRelevantTechnologies', 'N/A')}
            
            2. Problem Recognition
            Background Motivation : {result['sections'][2]['details']['bprBackgroundMotivation']}
            Purpose Needed : {result['sections'][2]['details']['bprPurposeNeed']}
            
            3. Feasibility
            Commercialization Strategy : {result['sections'][3]['details']['bfCommercializationStrategy']}
            Market Analysis : {result['sections'][3]['details']['bfMarketAnalysis']}
            
            4. Growth Strategy
            Growth Strategy : {result['sections'][4]['details']['bgsGrowthStrategy']}
            Market strategy : {result['sections'][4]['details'].get('bgsMarketStrategy', 'N/A')}
            """

            print("Pre-processed Document:\n", business_document)
            
            analysis = await analyze_business_plan(business_document)

            total_end_time = time.time()
            print(f"⏱️ Total Answer Response: {total_end_time - total_start_time:.2f}sec")

            return {
                "message": "📄 Finish Answer.",
                "analysis": analysis
            }

        else:
            return {"error": "Failed to get data of business plan document.", "details": result}

    return {"message": "🔍 Give me the Link of the Business Plan Document."}

# Run FastAPI
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=5000)
```

There is no “gpt-3” model for you to use.

You also seem to be needlessly sending a response_format parameter. {"type": "text"} is already the default, and it is strict structured-output formats that make the API spend several seconds setting them up. I would omit the parameter, and you may also get higher-quality responses.

You can rotate through the dated snapshots gpt-4o-2024-11-20, gpt-4o-2024-08-06, and gpt-4o-2024-05-13 and see whether one provides faster responses at a particular time.

Sending no temperature parameter can be faster, and max_tokens is unnecessary here; you have set it higher than the model’s typical response length anyway.
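
Putting those three suggestions together, the trimmed call might look something like this (a sketch based on your code above; the snapshot name is just an example to rotate through, and `prompt` stands in for your existing long prompt string):

```python
# Trimmed request: no response_format, no temperature, no max_tokens,
# and a pinned dated snapshot you can swap out to compare latency.
response = await asyncio.to_thread(
    client.chat.completions.create,
    model="gpt-4o-2024-08-06",  # example snapshot; try the others too
    messages=[
        {"role": "system", "content": "You are an expert business consultant."},
        {"role": "user", "content": prompt},  # placeholder for your long prompt
    ],
)
```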

You can also eliminate the OpenAI SDK entirely to cut down on loading someone else’s platform code: just make RESTful requests to the API with a preinstalled library such as `requests`.
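
For example, a direct call to the chat completions endpoint looks something like this (a sketch; the model and prompt are placeholders, and OPENAI_API_KEY is read from the environment as in your code):

```python
import os
import requests

# Call the Chat Completions REST endpoint directly, no SDK involved.
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
        "Content-Type": "application/json",
    },
    json={
        "model": "gpt-4o-2024-08-06",  # placeholder snapshot
        "messages": [{"role": "user", "content": "Review this business plan: ..."}],
    },
    timeout=120,  # long generations can take a while
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```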


OMG, when I modified the code following your advice, the response time dropped drastically, from 20-30s down to 10-20s. THANK YOU SO MUCH… You’re a genius and my savior… really…

But is there any way to improve the response speed further? I was wondering if it’s possible to get responses under 10 seconds.


AI language models use a lot of computation for every “token” generated in a response. The longer it writes, the longer it takes.

You can use streaming, so you at least see the AI response as it is generated token by token instead of waiting for it to complete; it makes for a much better user experience.
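
With the Python SDK it is just `stream=True` and iterating the chunks; a minimal sketch, assuming the `client` object from your code and a placeholder prompt:

```python
# Stream the completion and print tokens as they arrive.
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Review this business plan: ..."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (e.g. the final one) carry no text
        print(delta, end="", flush=True)
```

In FastAPI you would wrap a generator like this in a StreamingResponse so the client receives tokens as they are produced.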

Another caveat is that writing in Korean requires many more tokens than Latin scripts to form a response of similar content, just because of the way text is internally encoded into tokens.

If it is an automated task, you can send off several API calls at once.
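
For example, with the async analyze_business_plan() from your code you can fan several documents out concurrently with asyncio.gather (a sketch; doc1 through doc3 are placeholder document strings):

```python
import asyncio

# Run several analyses concurrently; each one already offloads its
# blocking API call with asyncio.to_thread, so the requests overlap.
async def analyze_many(documents):
    return await asyncio.gather(
        *(analyze_business_plan(doc) for doc in documents)
    )

# doc1, doc2, doc3: placeholder pre-processed documents
results = asyncio.run(analyze_many([doc1, doc2, doc3]))
```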

Below is a quick summary table of each model’s average token rate, as measured right now with a 256-token output and two trials per model:

| Model | Avg Token Rate (tokens/s) |
| --- | --- |
| gpt-4o-2024-08-06 | 63.80 |
| gpt-4o-2024-05-13 | 64.55 |
| gpt-4o-2024-11-20 | 52.45 |
| gpt-4o-mini | 139.50 |
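
If you want to reproduce numbers like these, the rate is just completion tokens divided by wall-clock time. A rough sketch, reusing the client object from the code above; the filler prompt and trial count are arbitrary:

```python
import time

def token_rate(model, trials=2):
    # Average completion tokens per second over a few trials.
    rates = []
    for _ in range(trials):
        start = time.time()
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Write a 400-word story."}],
            max_tokens=256,  # fixed 256-token output, as in the table
        )
        rates.append(resp.usage.completion_tokens / (time.time() - start))
    return sum(rates) / len(rates)

for m in ["gpt-4o-2024-08-06", "gpt-4o-2024-05-13",
          "gpt-4o-2024-11-20", "gpt-4o-mini"]:
    print(f"{m}: {token_rate(m):.2f} tokens/s")
```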