Big data analysis using GPT-4o function calling

I’m working on a project where I analyze search intent based on a given keyword and the titles of its top 10 search results. The goal is to group keywords with similar intent and generate a representative subtopic for each group. In my use case there are 300 keywords; together with their search result titles, the total input comes to around 100k tokens.

I’m using GPT-4o with function calling so that the model performs the grouping and returns structured output. With 200 keywords, it returns a complete JSON object. With 300 keywords, the output starts off normally but then degenerates into repetitive text until it hits the output token limit. The result is an incomplete JSON object, which causes my code to throw an error when parsing it.
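To make the failure easier to diagnose, the stop reason can be checked before the arguments are parsed. The following is a minimal sketch, assuming the Chat Completions response shape, where `finish_reason` is `"length"` when the output token limit was reached. `parse_tool_arguments` is a hypothetical helper name, not part of any SDK.

```python
import json


def parse_tool_arguments(finish_reason, arguments):
    """Return (result, error) for a tool call's raw JSON argument string.

    finish_reason: assumed to come from response.choices[0].finish_reason.
    arguments: assumed to come from the tool call's function.arguments field.
    """
    # "length" means generation was cut off by the output token limit,
    # so the JSON is almost certainly incomplete.
    if finish_reason == "length":
        return None, "truncated"
    try:
        payload = json.loads(arguments)
    except json.JSONDecodeError:
        return None, "invalid_json"
    if not isinstance(payload, dict) or "result" not in payload:
        return None, "missing_result"
    return payload["result"], None
```

With this in place, a truncated response can be retried or split instead of crashing the pipeline, and the "truncated" case can be logged separately from genuinely malformed JSON.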

I’m wondering whether this is caused by the large input size or by an incorrect function calling setup. Since the total input stays under GPT-4o’s 128k-token context window, this behavior is quite confusing to me.

Is my only option to split the input into smaller chunks? That would prevent the LLM from seeing the full dataset at once, which I believe would negatively impact the accuracy of the grouping.
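For reference, this is roughly what the splitting approach would look like. It is only a sketch under assumptions: `make_batches` and `merge_pairs` are hypothetical helper names, and each batch is assumed to return the same `[[keyword, subtopic], ...]` list as the single-call version.

```python
from collections import defaultdict


def make_batches(items, batch_size):
    """Split the keyword records into consecutive batches of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def merge_pairs(batch_results):
    """Merge per-batch [[keyword, subtopic], ...] lists into {subtopic: [keywords]}.

    Keywords are kept in first-seen order and not repeated within a subtopic.
    """
    grouped = defaultdict(list)
    for pairs in batch_results:
        for keyword, subtopic in pairs:
            if keyword not in grouped[subtopic]:
                grouped[subtopic].append(keyword)
    return dict(grouped)
```

The merge only combines subtopics whose labels match exactly. Since each batch is labeled independently, the labels would probably need a second consolidation pass over the (much shorter) list of subtopic names, which is the accuracy cost I am worried about.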

Here is my function calling structure:

generate_subtopic_tools = {
    "type": "function",
    "function": {
        "name": "grouping_keyword",
        "description": """
            You are an experienced SEO expert. Your task is to analyze the search intent behind each keyword based on its associated search result titles and organize them into well-defined subtopics.
            
            #Constraints:
            Please strictly follow these steps:
            1. Analyze each keyword's search result titles one by one and identify one or more specific content aspects that reflect the underlying themes or intents.
            2. Group similar content aspects into fine-grained subtopics. Ensure the subtopics are specific and capture subtle differences between keywords.
            3. Ensure clear distinctions between subtopics.
            4. Output format: [[keyword, subtopic], ...]. A keyword may appear in multiple entries.
            5. If the keyword list is large, you may handle them in segments internally, but the final output must be complete and consistent.
            
            Notes:
            - A single keyword may correspond to multiple subtopics, especially if its titles cover different themes.
            - Make sure all keywords are processed and included in the output. Avoid missing or duplicating entries.
            - Before finalizing the output, check that all input keywords are accounted for.

            #Example:
            Input:
            [
                {"keyword": "digital marketing", "titles": ["What is digital marketing?", "Digital marketing strategies", "Social media in digital marketing"]},
                {"keyword": "seo", "titles": ["SEO basics", "SEO tools", "Local SEO tips"]},
                {"keyword": "ai tools", "titles": ["Top AI tools for business", "How AI tools are changing marketing", "Best AI tools for content creation"]}
            ]
            Output:
            [
                ["digital marketing", "Digital Marketing Introduction"],
                ["digital marketing", "Digital Marketing Strategies"],
                ["digital marketing", "Social Media Integration"],
                ["seo", "SEO Fundamentals"],
                ["seo", "SEO Tools and Techniques"],
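                ["seo", "Local SEO Strategies"],
                ["ai tools", "AI Tools for Business"],
                ["ai tools", "AI in Marketing"],
                ["ai tools", "AI Tools for Content Creation"]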
            ]
            In this example, "digital marketing" is assigned to three subtopics: "Digital Marketing Introduction", "Digital Marketing Strategies", and "Social Media Integration".
        """,
        "parameters": {
            "type": "object",
            "properties": {
                "result": {
                    "type": "array",
                    "items": {
                        "type": "array",
                        "items": {"type": "string"},
                    },
                    "description": "The grouped results for each keyword, formatted as [[keyword, subtopic], ...]",
                }
            },
            "required": ["result"],
        },
    },
}
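The schema above does not force each inner array to contain exactly two strings, and the "all keywords accounted for" check is left to the model. A small client-side check can verify both after parsing. This is a sketch only; `check_coverage` is a hypothetical helper name.

```python
def check_coverage(input_keywords, result):
    """Return (missing, malformed) for a parsed [[keyword, subtopic], ...] result.

    missing: input keywords that never appear as the first element of a pair.
    malformed: entries that are not two-element lists.
    """
    well_formed = [e for e in result if isinstance(e, list) and len(e) == 2]
    malformed = [e for e in result if not (isinstance(e, list) and len(e) == 2)]
    seen = {keyword for keyword, _subtopic in well_formed}
    missing = [kw for kw in input_keywords if kw not in seen]
    return missing, malformed
```

If `missing` is non-empty, only those keywords need to be resubmitted rather than the whole set.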