RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)

lucas.py · January 31, 2025, 1:08pm

Hello, I’m reposting an issue I opened yesterday on the Github repository openai-python (github[dot]com/openai/openai-python/issues/2065). I’ve been asked to post here instead, since the issue seemed more related to the API than to the SDK.

I’ve been encountering an issue recently, without changing my codebase:

When chatting with the model, after a few turns I often get an error that I never had before, which crashes my app (see video in the GitHub issue, the crash occurs at ~1m20s).

I feel like it happens mostly during long responses, hence the tasks shown in the video, but I may be wrong about this.

Did something change recently on the OpenAI API side?

valenvivaldi · January 31, 2025, 6:39pm

Hi! I’m sure something changed, I use langchain in my projects and I had to change a method where I instantiate the OpenAI agent.

andrea.bovinelli · February 3, 2025, 8:51am

Hi, which method do you changed? We’re also using the agent with langchain and have this issue yet. Thanks a lot

valenvivaldi · February 3, 2025, 11:53am

Hi!
In my app, we change the llm instance from

       _llm = ChatOpenAI(temperature=temperature, streaming=False, model_name=open_ai_model)

to

 _llm = init_chat_model(temperature=temperature, model_provider='openai',
                               model=open_ai_model, streaming=False)

This change and updating the langchain-openai and langchain-core library to the last version.

The frequency of this error was reduced considerably, but not completely, from occurring in half of the executions to approximately 10%.

The function init_chat_model allows to use it with several providers, besides OpenAI, and I have not encountered the error with others. I am waiting to see if langchain-openai releases an update with something related to this problem.

valenvivaldi · February 3, 2025, 12:16pm

I have updated the openai library to 1.61.0 version and I can no longer reproduce the error (I can’t say for sure that it is solved, but that it is less frequent).

andrea.bovinelli · February 3, 2025, 2:39pm

Can I ask you which version you use of the HTTPX library? Thanks

lucas.py · February 3, 2025, 2:49pm

I ran a few tests and could not re-produce the bug anymore (even without updating the python package to 1.62.0).

I guess something was broken then fixed on the API side and we’ll never know the why and the how!

aleksandar.kirilov · February 3, 2025, 3:33pm

I too have this issue and tried both 1.59 and 1.61 versions

Any update is really appreciated !

aleksandar.kirilov · February 3, 2025, 4:16pm

I have changed the configuration of my client to use the model “gpt-4o-2024-11-20” which is the latest gtp-4o, at the time of my post gtp-4o points to gpt-4o-2024-08-06.
Now it works without issue

lucas.py · February 3, 2025, 6:01pm

Good catch, I had not thought about testing other models!

I also seem to have less failures using gpt-4o-2024-11-20, but I still got one during my testing… It is also worth noting that the output style changes quite much between these two versions.

Some explanation and advice from the OpenAI team would be welcome.

madina.bekbergenova · February 4, 2025, 3:04pm

I also have the same issue, updating the package to 1.62.0 did not help to eliminate the issue.

andrea.bovinelli · February 4, 2025, 3:50pm

From our side, the model update resolves the issue (model=‘gpt-4o-2024-11-20’). No errors occurred after multiple tests.

spector · February 4, 2025, 5:49pm

Has anyone found a solution while using this model version gpt-4o-2024-08-06?

valenvivaldi · February 4, 2025, 7:34pm

Hello! in my case, the error occurred again (I do not understand why), but I solved it (for now and I hope it stays that way), using the model gpt-4o-2024-11-20 . I don’t know why gpt-4o keeps pointing to the other one which is older.

gaurav10 · February 6, 2025, 12:28pm

Hello everyone,
I’m working on a PDF RAG app and I’m also getting an error like mentioned in the title of the post.

In my PDF RAG app, I use gpt-4o-mini as the LLM . I use LLM to process pdfs ( I create summaries of images and text chunks , then I store those summaries to Pinecone vector store and the original text chunks and images to MongoDB doc store ) and to facilitate QA .

See attached screenshots below :

I get this error very few times right now but why am I getting it ?

Relevant Package Versions :

langchain-openai==0.1.14
openai==1.35.10
langchain==0.2.17
langchain-community==0.2.19
langchain-core==0.2.43
aiohttp==3.9.5
httpcore==1.0.5
httplib2==0.22.0
httptools==0.6.1
httpx==0.27.2
httpx-sse==0.4.0

gaurav10 · February 6, 2025, 12:58pm

A chat interface shows an error message about creating a vector database, followed by a user's question and a response about PDF document information. (Captioned by AI)677×350 46.7 KB

My app is a Streamlit app

shanth · March 18, 2025, 9:56pm

Hello! I’m an eng at OpenAI that came across this report, tried to repro it using code + prompts in RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read) · Issue #2065 · openai/openai-python · GitHub, but not been able to (with openai 1.60.2 as reported there).

In case anyone is able to repro it now, love to get a request-id of the failing request. Or, if you’re able to run the script below, and report the request-id (req_ string) that gets printed, that would be awesome too. Thank you!

import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY"), max_retries=3)


PROMPTS = [
    "Hello",
    "how to cross a river in a boat with a wolf, a goat and a cabbage? Explain your thought process",
    "now explain with steps reversed",
    "invent a new complex problem like this one",
]

async def main():
   system_prompt = {
       "role": "system",
       "content": "You're a polite assistant and you always explain your reasoning.",
   }
   messages = [system_prompt]
   for user_input in PROMPTS:
       messages.append({"role": "user", "content": user_input})
       stream = await client.chat.completions.create(
           model="gpt-4o-2024-08-06",
           messages=messages,
           stream=True,
           timeout=10,
       )
       print(stream.response.headers.get("x-request-id"))
       answer = ""
       async for chunk in stream:
           token = chunk.choices[0].delta.content
           answer += token or ""
           print(".", end="", flush=True)
       print("\n", answer)
       messages.append({"role": "assistant", "content": answer})


asyncio.run(main())

Keen_Hon · May 17, 2025, 12:05am

@shanth
Not sure if it is related or a bug. I started having issues when I enabled logprobs
logprobs=true
top_logprobs=1

Does it count towards max token (16384)?

Here are my experiments and request ids, getting this error “peer closed connection without sending complete message body (incomplete chunked read)”

Different finetuned models same prompt
OpenAI Request ID: req_cd79d67111c6231ece4da879c6120181
OpenAI Request ID: req_bdd68774e518adfa6983ba73425c26cb
Upgraded to latest packages: httpcore-1.0.9 openai-1.79.0 pydantic-2.11.4
OpenAI Request ID: req_94bbc34a05590b9a9654683b0ccc0593
tried openai_client = OpenAI(timeout=600), no effect
OpenAI Request ID: req_c1c6dbbc213df52df00abb3302ca405f

-use standard model gpt-4o-mini-2024-07-18
OpenAI Request ID: req_4b93428d151f2ce8091e69ab470128c5

When I disabled logprobs, no issue
OpenAI Request ID: req_320590488da2052e1318f0148cef0220 (Okay - 3764 chunks)

Keen_Hon · May 17, 2025, 1:06am

@shanth
Here’s a sample code to reproduce the issue.

#!/usr/bin/env python
import os
import time
import json
import logging
from typing import Dict, Any, List, Optional, Tuple
import sys
from openai import OpenAI

# Setup basic logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Models to test
OPENAI_MODEL = "gpt-4o-mini-2024-07-18"  # Primary model

# Initialize client
openai_client = OpenAI()  # Uses OPENAI_API_KEY from env vars by default

def cooldown(seconds=3):
    """Apply cooldown between API calls to avoid rate limits"""
    print(f"Cooling down for {seconds} seconds...")
    time.sleep(seconds)

def test_openai_api_call(
    prompt: str,
    model: str = OPENAI_MODEL,
    enable_logprobs: bool = True
) -> Tuple[str, Optional[Dict]]:
    """
    Test OpenAI API call with streaming and logprobs
    
    Args:
        prompt: The user prompt to send
        model: The model to use
        enable_logprobs: Whether to enable log probabilities
        
    Returns:
        Tuple of (response text, confidence_data)
    """
    # Prepare messages
    messages = [{"role": "user", "content": prompt}]
    
    # Define log directory for error/output files
    log_dir = "test_logs"
    os.makedirs(log_dir, exist_ok=True)
    
    # Create timestamp for log files
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    
    print(f"\n=== TEST OPENAI API CALL ===")
    print(f"Model: {model}")
    print(f"Logprobs enabled: {enable_logprobs}")
    print(f"Prompt length: {len(prompt)} chars")
    print("===========================\n")
    
    # Dictionary to hold logprob data
    confidence_data = None
    if enable_logprobs:
        confidence_data = {"raw_logprobs": []}
    
    try:
        cooldown()
        print(f"Calling OpenAI API ({model}) with streaming...")
        
        # Reset response chunks for this attempt
        response_chunks = []
        chunk_count = 0
        last_update_time = time.time()
        
        # Collect logprobs if enabled
        collected_logprobs = [] if enable_logprobs else None
        
        # Create API call parameters
        api_params = {
            "model": model,
            "messages": messages,
            "temperature": 0,
            "top_p": 1,
            "frequency_penalty": 0, 
            "presence_penalty": 0,
            "max_tokens": 16384,
            "stream": True,
            "logprobs": enable_logprobs,
            "top_logprobs": 1 if enable_logprobs else None
        }
        
        # Make the API call
        stream = openai_client.chat.completions.create(**api_params)
        
        print("  Receiving stream...", end="", flush=True)
        
        # Process the stream
        for chunk in stream:
            if chunk.choices and len(chunk.choices) > 0:
                # Extract content delta
                delta = chunk.choices[0].delta
                if hasattr(delta, 'content') and delta.content is not None:
                    response_chunks.append(delta.content)
                    chunk_count += 1
                    
                    # Collect logprobs if available
                    if enable_logprobs and hasattr(chunk.choices[0], 'logprobs') and chunk.choices[0].logprobs:
                        collected_logprobs.append(chunk.choices[0].logprobs)
                    
                    # Update progress periodically
                    current_time = time.time()
                    if current_time - last_update_time > 1:
                        print(f"\r  Receiving stream... {chunk_count} chunks", end="", flush=True)
                        last_update_time = current_time
        
        print("... stream finished.", flush=True)
        
        # Combine chunks
        response_text = "".join(response_chunks)
        
        if response_text:
            print(f"API response received ({len(response_text)} chars)")
            
            # Save response to file
            response_file = os.path.join(log_dir, f"response_{model.split(':')[-1]}_{timestamp}.txt")
            with open(response_file, 'w', encoding='utf-8') as f:
                f.write("=== OPENAI API RESPONSE ===\n")
                f.write(f"Model: {model}\n")
                f.write(f"Logprobs enabled: {enable_logprobs}\n")
                f.write(f"Response length: {len(response_text)} chars\n")
                f.write("="*30 + "\n\n")
                f.write(response_text)
            print(f"Saved response to {response_file}")
            
            # Process logprobs if available
            if enable_logprobs and collected_logprobs:
                print(f"Collected {len(collected_logprobs)} logprob chunks")
                
                # Process and store token-level logprobs
                confidence_data["raw_logprobs"] = [
                    {
                        "content": [
                            {
                                "token": token_info.token,
                                "logprob": token_info.logprob
                            }
                            for token_info in logprobs.content
                            if hasattr(logprobs, 'content')
                        ]
                    }
                    for logprobs in collected_logprobs
                    if hasattr(logprobs, 'content')
                ]
                
                # Save logprobs to file
                logprobs_file = os.path.join(log_dir, f"logprobs_{model.split(':')[-1]}_{timestamp}.json")
                with open(logprobs_file, 'w', encoding='utf-8') as f:
                    json.dump(confidence_data, f, indent=2)
                print(f"Saved logprobs to {logprobs_file}")
            
            return response_text, confidence_data
        else:
            print("Warning: Empty response received")
            return "ERROR: Empty response received", None
    
    except Exception as e:
        error_message = str(e)
        print(f"Error calling API: {error_message}")
        
        # Save error information
        error_file = os.path.join(log_dir, f"error_{model.split(':')[-1]}_{timestamp}.txt")
        with open(error_file, 'w', encoding='utf-8') as f:
            f.write(f"Error: {error_message}\n")
            f.write(f"Model: {model}\n")
            f.write(f"Timestamp: {timestamp}\n")
        return f"ERROR: {error_message}", None

def main():
    """Main test function."""
    # Simple prompt for testing
    test_prompt = """
Repeat the below text as is:
Search
Write
Sign up

Sign in



Unlocking Ultra-Long Text Generation: A Deep Dive into LongWriter and AgentWrite
Hitesh Hinduja
Hitesh Hinduja

Follow
20 min read
·
Aug 21, 2024
71





Hello Folks,

It’s been quite some time (almost 3 months) since my last blog post. But finally, I’m back, and let’s get started! Moving forward, my blogs will primarily focus on interesting research papers in the LLM and GenAI space. I’ll be discussing problem statements that I encounter in my day-to-day life, in what we like to call “story time,” as many of you might remember from my past blogs. This will be followed by a deep dive into the technical aspects of those problem statements. In addition to explaining the research papers, I’ll share experiences and practical examples, and I’ll also elaborate on technical details that the papers might skip, assuming the reader already knows them. So, let’s dive in!

Just a few days ago, one of my family friends visited our place. They have a lovely 8-year-old daughter. It was August 15th, India’s Independence Day, and her school had given her an assignment to write an essay on Independence Day with a strict requirement of “at least 10000 words”. Now that’s really a lot! I really don’t know if I should call this an essay or a mini-book for an 8-year-old child! As usual, the parents started drafting it on behalf of their child. The first thing that comes to everyone’s mind is ChatGPT or something similar. At first, the parents were very relaxed and thought, “Let’s start drafting this on August 14th, just a day before, since it’s just a matter of ‘prompting the LLM model’ and getting the output.” On the night of August 14th, they did just that, but any guesses what happened? The model, though it gave a good output, struggled to maintain the following: Relevance, Accuracy, Coherence, Clarity, Breadth and Depth, and Reading Experience. Additionally, when the model is asked to output strictly 10k words, it repeats the context and significantly goes out of context.

Now, you all might be wondering, what are these six dimensions? For that, let’s continue with the further reading and dive into the problem statement of “limitations of current long-context large language models (LLMs) in generating ultra-long outputs.” In this blog, we’ll explore an interesting research paper titled “LONGWRITER: UNLEASHING 10,000+ WORD GENERATION FROM LONG CONTEXT LLMS.” Even though these models can process inputs up to 100,000 tokens, they typically struggle to produce outputs longer than 2,000 words. The primary reason for this limitation is attributed to the supervised fine-tuning (SFT) datasets, which lack examples of long outputs, capping the models’ ability to generate extended text. So in this blog, let’s understand the intriguing technique the authors have used to improve long output responses and make sure parents’ lives become easier in the future! And what about the kids? These days, I leave that up to their destiny with the advancements in AI and the way life has become easier for them with limited use of their mental capabilities! Anyway, let’s get started.

Introduction
Now, let’s get into the borderline understanding of the paper. It kicks off by highlighting an interesting challenge with long context LLMs. These models, which can process over 100,000 tokens of input, still struggle to generate outputs longer than 2,000 words. This is a significant issue because, in some cases, more than 1% of user requests actually need longer responses.

The core problem? The supervised fine-tuning (SFT) datasets that train these models just don’t include enough examples of long outputs. So, even though the models are capable of handling long inputs, they haven’t been trained to produce long outputs effectively. This limitation has stuck around because many LLMs rely on these same datasets.

To tackle this, the authors introduce AgentWrite — a new approach that helps these models generate longer texts by breaking down the task into smaller parts. This method can push output lengths up to 20,000 words, far beyond what’s usually possible.

The paper also brings in LongWriter-6k and LongBench-Write, a dataset and benchmark created to train and test models on their ability to generate these ultra-long texts. The idea is to push the boundaries of what LLMs can do, making them more capable of handling tasks that require extended output.

Now let’s understand what is Agentwrite and how it works:
Step I: Plan
First things first, AgentWrite starts with a plan — just like how you’d outline an article before diving into writing. The model creates a detailed outline based on the given instructions, laying out the main content and specifying word counts for each section. Think of it as the model’s roadmap. For instance, if tasked with writing a 30,000-word piece on the Roman Empire, the plan might look something like this:

Paragraph 1: Introduction to the origins of the Roman Empire (700 words)

Paragraph 2: Founding of the Roman Empire (800 words)

…

Paragraph 15: Summary of the Roman Empire’s history (500 words)

This structured approach ensures the model knows exactly where it’s headed, making it easier to manage the task of generating lengthy outputs. Here you can look below how the author’s structure the input:


This detailed outline ensures that the model has a clear structure to follow, making it easier to manage the generation of long outputs.


Step II: Write
Next up, the model gets down to writing. Following the outline from Step I, the model generates the text in a serial manner — paragraph by paragraph. This sequential method ensures that each paragraph builds on the previous one, maintaining coherence throughout. By using earlier paragraphs as context, the model avoids common pitfalls like repetition or incoherence, which can be a real issue when generating longer texts. The results? Outputs extending over 20,000 words, all while keeping the quality intact. Here you see below how they motivate the input text with clear standing instructions. Imagine a pitch-perfect training to someone in their early stages of life so that they learn better for the future!


Hope you all are clear with what exactly AgentWrite method does. We will soon see the validations and results of incorporating AgentWrite method in the LongWriter-6k and LongBench-Write. But before that, let’s see how these datasets are created and how are the models fine-tuned with this dataset that has AgentWrite methodology.

LONGWRITER: TEACHING MODELS TO GENERATE ULTRA-LONG OUTPUTS
After understanding how AgentWrite works, the next big question is — can we actually teach large language models (LLMs) to generate ultra-long outputs consistently? This paper dives into just that by constructing specialized datasets, fine-tuning models, and using Direct Preference Optimization (DPO) to fine-tune their performance. Don’t worry, we will be going into each and everything in detail! Let’s start with the first step “DATA”, “DATA”, “DATA”!

4.1 Data Construction
Selection of Instructions:
The first step in this process was to handpick instructions that would naturally require long outputs. The team selected 6,000 such instructions — 3,000 from GLM-4’s SFT data (in Chinese) and another 3,000 from WildChat-1M (in English). GPT-4o automated this selection process, with additional rule-based filtering to weed out toxic or irrelevant instructions. After some manual verification, over 95% of these instructions were confirmed to genuinely need long outputs.

Generating Responses:
Once the instructions were set, it was time to generate responses using the AgentWrite pipeline powered by GPT-4o. The outputs underwent a strict post-processing phase, where any responses that were too short or that resulted from failed planning steps were filtered out. To ensure clarity, irrelevant identifiers like “paragraph 1,” “paragraph 2,” etc., were removed. The outcome was a dataset named LongWriter-6k, which offers a broad range of output lengths between 2,000 and 10,000 words — just what’s needed to supplement the general SFT data.

Combining Data:
The next move was to combine LongWriter-6k with the general SFT data, which includes 180k chat entries from GLM-4’s SFT dataset. This combination effectively addresses the scarcity of long outputs in the existing datasets, ensuring the models have ample examples to learn from.

Now once we have understood how data is getting constructed, lets understand how is the data being used in the model training.

4.2 Model Training
Supervised Fine-Tuning

Models Used:
The team focused on fine-tuning two open-source models: GLM-4–9B and Llama-3.1–8B. Both of these models are capable of handling a context window of up to 128k tokens — perfect for the ultra-long outputs we’re aiming for.

Training Process:
To fine-tune these models, a loss weighting strategy was employed. Instead of averaging the loss by sequence, the loss was averaged by token. This approach ensures that longer outputs contribute more effectively to the training process. The training setup included 8xH800 80G GPUs using DeepSpeed+ZeRO3+CPU offloading, with a batch size of 8, a learning rate of 1e-5, and a packing length of 32k. The training spanned 4 epochs, totaling approximately 2,500–3,000 steps.

Are you confused about the loss weighting strategy? If yes, let me explain it to you in detail. Else you may skip to the next section.

Token-Level Loss Averaging
How It Works:
Instead of averaging the loss at the sequence level, the model averages the loss across all tokens within the batch. This ensures that every token, regardless of its position in a short or long sequence, contributes equally to the learning process.
Example:
Assume Sequence 1 has 9 tokens and Sequence 2 has 100 tokens.
If Sequence 1 has a loss of 0.2 per token and Sequence 2 has a loss of 0.3 per token:

This approach means that the model’s learning process is more influenced by the longer sequence because it contains more tokens contributing to the loss.
Why This Matters for Long Outputs:
In the context of fine-tuning models for generating ultra-long outputs, token-level loss averaging ensures that the model pays more attention to longer sequences during training. Each token in these longer outputs contributes to the loss, encouraging the model to improve its performance on generating long, coherent text.
Resulting Models:
The fine-tuning process led to the creation of two models: LongWriter-9B (derived from GLM-4–9B) and LongWriter-8B (from Llama-3.1–8B).

Alignment (DPO)

Objective:
To further refine the quality and adherence to length constraints, the team applied Direct Preference Optimization (DPO) to the LongWriter-9B model. I will explain this to you in detail shortly. Its important for us to first understand DPO and then dive into how it is used in the paper!

Understanding Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO) is a technique used to fine-tune language models by optimizing them based on preference data. The core idea is to improve the model’s ability to generate outputs that align with human preferences or specific criteria, such as following instructions closely or producing high-quality content.

1. Practical Example of DPO — For simplistic understanding
Imagine you’re building an AI that generates short stories. You want the AI to produce stories that are not only creative but also coherent and engaging.

Step 1: Generate Multiple Outputs: The AI generates several versions of a story based on the same prompt.

Version A: A highly creative but slightly confusing story.

Version B: A less creative but very clear and engaging story.

Version C: A story that is both creative and coherent but slightly lacks depth.

Step 2: Rank the Outputs: Human reviewers or another evaluation system rank these outputs based on how well they meet the desired criteria (creativity, coherence, engagement).

Ranking: Version C > Version B > Version A

Step 3: Optimize Based on Preferences: The AI is fine-tuned to generate outputs more like Version C, which best balances creativity, coherence, and engagement. This process helps the AI learn what makes a “good” story according to the given criteria.

2. Technical Explanation of DPO — Not specific to paper but in real how DPO works.
DPO involves several steps:

Data Collection: Collect pairs of outputs generated by the model. For each pair, a preference label is assigned, indicating which output is better according to specific criteria (e.g., clarity, relevance, creativity).
Preference Data: The model is trained on this preference data, where the goal is to maximize the likelihood that the model will generate the preferred output in future generations.
Objective Function: The model’s objective function is adjusted to give higher probabilities to preferred outputs. This is typically done using techniques like pairwise ranking loss or a reward model that scores outputs based on how well they align with the preferences(Please feel free to reach out to me for understanding more about pairwise ranking loss understanding)
Fine-Tuning: The model is then fine-tuned using this adjusted objective function, making it more likely to generate outputs that align with the preferences in similar future scenarios.
3. How DPO is Used in the Paper
So, we’ve talked about what Direct Preference Optimization (DPO) is, but how exactly is it applied in this research? In the context of the paper, DPO comes into play after the initial fine-tuning with the LongWriter-6k dataset. The goal here is to sharpen the model’s ability to generate long-form content that aligns more closely with user instructions.

Here’s a breakdown of how DPO is used:

Generating Outputs:
Once the model has been fine-tuned with the LongWriter-6k dataset, it’s time to see what it can produce. The model generates multiple outputs for prompts that require lengthy responses. For instance, if the prompt asks for a detailed 10,000-word article, the model will produce several versions of that output.

Creating Preference Data:
Next, these outputs are evaluated and ranked based on how well they meet the desired criteria — think of metrics like length adherence, coherence, and overall quality. For example, if the prompt asks for a 10,000-word piece, outputs that come closest to this length, while also maintaining coherence and quality, are ranked higher.

Optimizing the Model:
Now, using these rankings, the model is further fine-tuned. This step is where the magic of DPO really kicks in — the model’s parameters are adjusted so that it learns to favor outputs similar to the ones ranked highest by the preference model. Essentially, the model is being trained to understand, “These are the kinds of outputs you should aim for.” But please don’t confuse here with the objective function. Here its making sure that the model learns among the best of the best outputs when it gets fine-tuned the second time! So its not like any objective function or loss function getting optimised in this paper for DPO. I am sure you might be wondering why there is so much of manual intervention? — Bdw my boss used to tell me “Sometimes hardwork is better than smart work”

Final Evaluation:
After applying DPO, the model goes through another round of evaluation using benchmarks like LongBench-Write. This final step is crucial to ensure the model isn’t just generating long texts, but high-quality, length-appropriate content.

4. How AgentWrite Uses DPO
Now, let’s tie this back to AgentWrite and how it leverages DPO to enhance its performance.

Improving Length Adherence:
One of the main challenges AgentWrite addresses is getting the model to stick to specific length requirements — say, producing that elusive 10,000-word article without falling short. DPO helps by optimizing the model to prefer outputs that meet these length requirements more accurately. So, after going through DPO, the model is more likely to hit the length targets dead-on.

Enhancing Content Quality:
But it’s not just about hitting the word count. AgentWrite, with the help of DPO, also refines the model’s ability to produce content that is coherent, relevant, and clear. By focusing on the highest-quality outputs during DPO, AgentWrite ensures that the model isn’t just verbose but also produces text that’s worth reading — long, but also strong.

So finally, we come to an end where we understand how DPO is used additionally to improve the quality of outputs and have better version of next fine-tuned model. In the article till here, we see that there are six dimensions that have been talked about very often. Now before moving to the next section where I explain you about these 6 metrics and how are they evaluated, I would like to give you a brief understanding of those. So let’s understand this in a funny way:

Picture this: Your boss has just asked you to draft a 1,000-word report on “The Impact of Remote Work on Team Productivity.” Naturally, you turn to your trusty AI to help you out. Now, let’s see how you’d evaluate whether the AI’s output is actually going to impress the boss or get you a stern “we need to talk” email.

Relevance:
First off, the report needs to stay on point. You don’t want the AI wandering off into unrelated territory, like the benefits of beach vacations or why cats make great office companions (unless, of course, your boss is a cat enthusiast).

Accuracy:
Next, the report has to be factually correct. You’re hoping to present solid insights on how remote work affects productivity, not accidental claims like “working from home increases productivity by 200% — especially if you have Netflix on in the background.”

Coherence:
The report should be logically structured. It should flow like a well-organized meeting agenda, not like a chaotic brainstorming session where everyone’s talking at once. Each section should lead smoothly into the next, making it easy for your boss to follow along without needing a coffee break halfway through.

Clarity:
The language should be crystal clear — no jargon that requires a decoder ring or overly complex sentences that sound like they belong in a legal contract. Your boss should be able to skim through it and instantly get the main points, without scratching their head or wondering if you’ve secretly hired a lawyer to write it.

Breadth and Depth:
The report should cover all the critical angles — like how remote work affects team collaboration, productivity metrics, and maybe even the impact on employee morale. You want to make sure the report is thorough enough to avoid that dreaded “Could you add more detail here?” feedback.

Reading Experience:
Finally, the report should be engaging and easy to read. You don’t want your boss yawning halfway through or worse, feeling the need to rewrite half of it. The writing should be smooth and professional, giving your boss the impression that you’ve really put in the effort (even if you had a little help from AI on the side).

But now, you might be wondering — how do we actually evaluate these metrics? After all, your AI doesn’t come with a built-in “Boss Approval Gauge” or a “Clarity-O-Meter.” So, let’s dive into how this all works:

How Are These Metrics Calculated? Enter LLM-as-a-Judge
So, after understanding the six key dimensions for evaluating text quality, the next question that naturally comes up is, “How on earth does the AI figure all this out?” After all, the model itself doesn’t inherently “know” these metrics. That’s where the LLM-as-a-Judge method comes into play.

LLM-as-a-Judge Overview
Instead of relying on human evaluators for every single piece of content, the researchers cleverly used the LLM itself to assess the quality of the generated outputs. By leveraging the model’s extensive knowledge and contextual understanding, it can provide objective evaluations across the six dimensions we’ve discussed. Bdw my boss also used to tell me “Sometimes smart work is better than hard work” 😂

Scoring Process: How It Works
The way it works is pretty straightforward. The researchers designed specific prompts to instruct GPT-4o to evaluate the outputs. For example, the model might be asked to rate how relevant the output is to the original prompt or assess the clarity of the writing. The prompts might look something like this:

“You are an expert in evaluating text quality. Please rate the AI assistant’s response across six dimensions: Relevance, Accuracy, Coherence, Clarity, Breadth and Depth, and Reading Experience. Provide a score from 1 to 5 for each dimension.”

The model then analyzes the text based on these prompts and provides a score, usually outputting its evaluation in a structured format like JSON, which includes both the scores and brief explanations.

Use of Pre-Defined Metrics
For specific metrics like output length, the model uses something a bit more technical — a piecewise linear function. This function adjusts the score based on how closely the actual length of the output matches the required length. For example:

If the output is exactly 1,000 words when the prompt asks for 1,000, the score might be a perfect 100.

If it’s a bit off — say, 10% longer or shorter — the score will decrease slightly.

And if it’s way off, like 50% longer or shorter, the score drops significantly.

This method ensures that the model is incentivized to produce text that fits the expected length, penalizing outputs that deviate too much.

No Built-in Metrics During Training
It’s important to note that these metrics aren’t baked into the model during its initial training. The model doesn’t inherently “know” it’s being judged on Relevance or Coherence — it’s just generating text based on its training data. The evaluation happens externally, using a separate model (in this case, GPT-4o) that’s prompted to act as a judge.

Tools and Techniques
While some models can be fine-tuned on datasets labeled by human annotators, this paper primarily relies on GPT-4o’s existing capabilities to evaluate the text without additional fine-tuning. The idea is to mimic how a human would assess the quality of the writing, using the model’s vast knowledge and understanding of language.

Experiment Results: Model Evaluation on LongBench-Write
The research team evaluated a set of models — 4 proprietary and 5 open-source — using the LongBench-Write benchmark. This evaluation also included the newly trained LongWriter models. Among the existing models, the only other one specifically aligned for long-form text generation is Suri-IORPO, which is based on Mistral-7B-Instruct-v0.2 and fine-tuned using LoRA.

For the evaluation, the team configured each model with a temperature setting of 0.5 and set the maximum tokens parameter to the highest allowed by the model’s API. For open-source models, this was set at 32,768 tokens.

The key results, summarized in Table 3, include average and median response lengths, while Figure 6 visualizes how well each model’s output length aligns with the required length across 120 different instructions.

Key Findings from the Experiment Results
Let’s take a look at this table carefully!


Output Length Performance:
Previous Models’ Limitations: Most existing models struggle to meet the length requirement of over 2,000 words. For instance, in the [2k, 4k) word range, the majority scored below 70, with only Claude 3.5 Sonnet performing decently (see Table 3 above).
Severe Shortcomings in Longer Outputs: For prompts requiring 4,000 to 20,000 words, almost all previous models failed to reach the target output length, with many scoring 0 (indicating output lengths were less than one-third of the required length) (refer to Figure 6 below).

LongWriter Models’ Success: In contrast, the LongWriter models, enhanced with training data from LongWriter-6k, consistently met the required output lengths while maintaining good quality, as evidenced by the Si (output length score) and Sq (quality score) in the [2k, 20k) range (refer to Table 3 and Figure 6).
Coherence of Long Outputs:
NLL Testing for Coherence: The cumulative average negative log-likelihood (NLL) test was used to ensure that the long outputs from the LongWriter models were coherent and logically connected (see Figure 7).

Scratching your head or confused on what exactly is NLL? Don’t worry, let me explain this to you in a funny way!
    """
    
    # Test 1: No logprobs, no schema
    print("\n=== TEST 1: Basic API Call (No logprobs) ===")
    response, _ = test_openai_api_call(test_prompt, enable_logprobs=False)
    print(f"Test 1 complete: {'SUCCESS' if not response.startswith('ERROR') else 'FAILED'}")
    
    # Test 2: With logprobs, no schema
    print("\n=== TEST 2: API Call with Logprobs ===")
    response, confidence_data = test_openai_api_call(test_prompt, enable_logprobs=True)
    print(f"Test 2 complete: {'SUCCESS' if not response.startswith('ERROR') else 'FAILED'}")
    
    print("\nAll tests completed. Check logs for details.")

if __name__ == "__main__":
    main()

Topic		Replies	Views
GPT-4.5 is live in the API! Announcements	153	18038	July 17, 2025
Could "biological sleep" be an interesting inspiration for Large Language Models research? Community agi	107	983	January 12, 2025
How to Improve GPT-4 API Output Length and Structure? API gpt-4 , api	13	1787	July 4, 2025
Train (fine-tune) a model with text from books or articles API	62	28122	November 30, 2023
Validating Middle of Context in GPT-4-128K API gpt-4	16	4600	December 14, 2023

RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)

Related topics