Debugging response failure for batch action


import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List


def rate_page_relevancy_llm3(df, question):
    api_key = X  # placeholder for the actual API key
    client = instructor.patch(OpenAI(api_key=api_key))

    class PageScore(BaseModel):
        score: int = Field(..., description="Numerical rating of the page's relevance to the question (0-10)")
        reasoning: str = Field(..., description="Up to 20 words")

    class BatchPageScore(BaseModel):
        scores: List[PageScore]

    urls = df['URL'].tolist()
    messages = [{
        "role": "system",
        "content": f"""
        You will be given a list of URLs. For each URL, assign a score (0-10) based on the likelihood that the page will contain the answer to the question: {question}.
        Also provide a brief reasoning (up to 20 words) for each score. Respond with a list of scores and reasonings for all URLs."""
    }, {"role": "user", "content": f"URLs: {urls}"}]

    response = client.chat.completions.create(
        model="gpt-4o",  # change to 4o-mini? test!
        response_model=BatchPageScore,
        messages=messages
    )
    
    # Unpack the structured response into parallel lists
    scores = [page_score.score for page_score in response.scores]
    reasonings = [page_score.reasoning for page_score in response.scores]

    # pandas will raise a length-mismatch error here if fewer scores come back than there are rows
    df['Score'] = scores
    df['Reasoning'] = reasonings

    return df

As you can see above, I am asking gpt-4o to rate a list of pages (contained in the variable urls) on how relevant each URL is to a particular input question.

In some cases, particularly when I analyse a list with lots of pages at once, the total number of scores returned is less than the number of pages in urls. My guess is that, for whatever reason, the LLM is failing to rate some of the pages and is therefore returning no results for them.

How can I debug this further, given that I am making a single API call (rather than multiple calls, each with its own response)? How can I figure out which pages the LLM is failing on, and why?

I tried adding try/except logic, but since the overall API call is not failing, this doesn't give me any error messages.
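
For reference, the only signal I get is a count mismatch, roughly like this (using urls and response from the function above):

    print(len(urls))             # number of pages sent
    print(len(response.scores))  # comes back smaller when some pages are skipped

    # Nothing in the response identifies which URLs were dropped,
    # so the score list can't be mapped back to rows in df reliably.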

Okay, it sounds like when you send a list of multiple URLs, it’s not rating all of them?

I might try sending a user/assistant pair with an example of what you want, i.e. a score for each URL…
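
Something along these lines, as a rough sketch (the example URLs and wording are just placeholders; the pair goes in between the system message and the real user message):

    few_shot = [
        {
            "role": "user",
            "content": "URLs: ['https://example.com/pricing', 'https://example.com/blog/announcement']"
        },
        {
            "role": "assistant",
            "content": (
                '{"scores": ['
                '{"score": 8, "reasoning": "Pricing page likely answers a cost question"}, '
                '{"score": 2, "reasoning": "Announcement post unlikely to contain the answer"}'
                ']}'
            )
        }
    ]

    # Insert the example pair between the system message and the real user message
    messages = [messages[0]] + few_shot + [messages[-1]]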


There is no way to debug what is happening with a single call.

More than a couple of URLs in a single call may become unreliable.

LLMs are not deterministic, nor are they able to look at and prioritise large amounts of information at one time. This comes down to "attention", which is best directed at a single task per API call.

If you get it working with a single URL, you might try 2 or 3, but much past that and you are in the land of chance.
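
Looping one URL per call also tells you exactly which page fails, since each URL gets its own response. A rough sketch, reusing the PageScore model, client, urls and question from your function (names are illustrative):

    results = []
    for url in urls:
        try:
            single = client.chat.completions.create(
                model="gpt-4o",
                response_model=PageScore,
                messages=[
                    {"role": "system", "content": f"Score (0-10) how likely this page is to answer: {question}. Give reasoning in up to 20 words."},
                    {"role": "user", "content": f"URL: {url}"},
                ],
            )
            results.append((url, single.score, single.reasoning))
        except Exception as e:
            # A failure is now tied to a specific URL
            print(f"Failed on {url}: {e}")
            results.append((url, None, None))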