GPT-4o OCR bad at reading names?

Hey,

So we have been using the gpt-4o vision API for OCR to extract information from photos of documents. We have found, however, that the model often reads names badly wrong: it substitutes characters or simply drops them from names. This happens on good-quality images that we have even pre-processed (increased contrast and sharpness, reduced noise, etc.). The model still makes these mistakes very consistently, almost every time. We have tried telling it in the prompt, multiple times and in multiple ways, to focus on names and not to make up or substitute anything. Nothing works.
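
Roughly what the pre-processing looks like, as a simplified sketch with Pillow (the factors and file paths here are illustrative, not our production values):

from PIL import Image, ImageEnhance, ImageFilter

def preprocess(path: str) -> Image.Image:
    # Illustrative pre-processing: light denoise, then boost contrast and sharpness.
    img = Image.open(path).convert("L")             # grayscale
    img = img.filter(ImageFilter.MedianFilter(3))   # reduce noise
    img = ImageEnhance.Contrast(img).enhance(1.5)   # turn up contrast
    img = ImageEnhance.Sharpness(img).enhance(2.0)  # turn up sharpness
    return img

preprocess("./document.jpg").save("./document_clean.png")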

It doesn’t seem to make these mistakes with other information, or at least we haven’t noticed it. It seems to be specifically names.

Anybody with a similar problem, or have an idea on how to prevent this?

Welcome to the dev forum, @vojta.siler!

Interesting observation. I haven’t personally noticed this behavior, but it is not entirely unexpected since names are proper nouns, and as such, they can have non-standard spellings or token sequences.
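
If you want to see that effect for a specific name, here is a quick sketch with tiktoken (o200k_base is the tokenizer used by gpt-4o; the name below is just an invented example):

import tiktoken

enc = tiktoken.get_encoding("o200k_base")   # gpt-4o's tokenizer

name = "Vojtech Silhavy"                    # invented example name
tokens = enc.encode(name)
print(tokens)                               # token ids
print([enc.decode([t]) for t in tokens])    # the sub-word pieces the name splits into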

Could you share an example that I can use to reproduce this behavior with gpt-4o?

First: you might try gpt-4.1-mini, with its patch-based image method and a different internal resize; here it uses 1371 prompt tokens, at a lower price, instead of the 976 of the “tiled” method.

PS Look at the undocumented user message format…

import os, base64, requests

# Read the image and base64-encode it so it can be sent inline in the request body.
image_path = "./2005-768w.png"
with open(image_path, "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode("utf-8")

system = [
    {
        "role": "system",
        "content": """
You are ImagePal, an AI assistant powered by GPT-4 with computer vision for OCR.

Built-in vision capabilities: Enabled""".strip(),
    }
]

user = [
    {
        "role": "user",
        "content": [
            # Undocumented shorthand: a bare text string plus an {"image": ...} part,
            # instead of the documented {"type": "text", ...} / {"type": "image_url", ...} parts.
            """
Provide a full verbatim transcript of attached image's contents.""".strip(),
            {
                "image": base64_image,
            },
        ],
    }
]

params = {
    "model": "gpt-4o-2024-11-20",
    "max_tokens": 1535,
    "top_p": 0.01,  # near-greedy sampling for a more deterministic transcription
    "messages": system + user,
}

headers = {
    # "Content-Type: application/json" is set automatically by requests when json= is used.
    "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}",
}

response = requests.post(
    "https://api.openai.com/v1/chat/completions", headers=headers, json=params
)

if response.status_code != 200:
    print(f"HTTP error {response.status_code}: {response.text}")
else:
    # print(response.json())
    print(response.json()["choices"][0]["message"]["content"])
    print(response.json()["usage"])
(details) gpt-4o-2024-11-20 output: it returns more tokens of output than the image input cost

We also undertake a systematic study of “data contamination” – a growing problem when training high capacity models on datasets such as Common Crawl, which can potentially include content from test datasets simply because such content often exists on the web. In this paper we develop systematic tools to measure data contamination and quantify its distorting effects. Although we find that data contamination has a minimal effect on GPT-3’s performance on most datasets, we do identify a few datasets where it could be inflating results, and we either do not report results on these datasets or we note them with an asterisk, depending on the severity.

In addition to all the above, we also train a series of smaller models (ranging from 125 million parameters to 13 billion parameters) in order to compare their performance to GPT-3 in the zero, one and few-shot settings. Broadly, for most tasks we find relatively smooth scaling with model capacity in all three settings; one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners.

Finally, given the broad spectrum of capabilities displayed by GPT-3, we discuss concerns about bias, fairness, and broader societal impacts, and attempt a preliminary analysis of GPT-3’s characteristics in this regard.

The remainder of this paper is organized as follows. In Section 2, we describe our approach and methods for training GPT-3 and evaluating it. Section 3 presents results on the full range of tasks in the zero-, one- and few-shot settings. Section 4 addresses questions of data contamination (train-test overlap). Section 5 discusses limitations of GPT-3. Section 6 discusses broader impacts. Section 7 reviews related work and Section 8 concludes.


2 Approach

Our basic pre-training approach, including model, data, and training, is similar to the process described in [RWC+19], with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. Our use of in-context learning is also similar to [RWC+19], but in this work we systematically explore different settings for learning within the context. Therefore, we start this section by explicitly defining and contrasting the different settings that we will be evaluating GPT-3 on or could in principle evaluate GPT-3 on. These settings can be seen as lying on a spectrum of how much task-specific data they tend to rely on. Specifically, we can identify at least four points on this spectrum (see Figure 2.1 for an illustration):

  • Fine-Tuning (FT) has been the most common approach in recent years, and involves updating the weights of a pre-trained model by training on a supervised dataset specific to the desired task. Typically thousands to hundreds of thousands of labeled examples are used. The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out-of-distribution [MPL19], and the potential to exploit spurious features of the training data [GSL+18, NK19], potentially resulting in an unfair comparison with human performance. In this work we do not fine-tune GPT-3 because our focus is on task-agnostic performance, but GPT-3 can be fine-tuned in principle and this is a promising direction for future work.

  • Few-Shot (FS) is the term we will use in this work to refer to the setting where the model is given a few demonstrations of the task at inference time as conditioning [RWC+19], but no weight updates are allowed. As shown in Figure 2.1, for a typical dataset an example has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving ( K ) examples of context and completion, and then one final example of context, with the model expected to provide the completion. We typically set ( K ) in the range of 10 to 100 as this is how many examples can fit in the model’s context window (( n_{ctx} = 2048 )). The main advantages of few-shot are a major reduction in the need for task-specific data and reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task specific data is still required. As indicated by the name, few-shot learning as described here for language models is related to few-shot learning as used in other contexts in ML [HYC01, VBL+16] – both involve learning based on a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task.

  • One-Shot (1S) is the same as few-shot except that only one demonstration is allowed, in addition to a natural language description of the task, as shown in Figure 1. The reason to distinguish one-shot from few-shot and zero-shot (below) is that it most closely matches the way in which some tasks are communicated to humans. For example, when asking humans to generate a dataset on a human worker service (for example Mechanical Turk), it is common to give one demonstration of the task. By contrast it is sometimes difficult to communicate the content or format of a task if no examples are given.


{'prompt_tokens': 975, 'completion_tokens': 1134, 'total_tokens': 2109, 'prompt_tokens_details': {'cached_tokens': 0, 'audio_tokens': 0}, 'completion_tokens_details': {'reasoning_tokens': 0, 'audio_tokens': 0, 'accepted_prediction_tokens': 0, 'rejected_prediction_tokens': 0}}

(details) gpt-4-turbo output: doesn't chat or use LaTeX

We also undertake a systematic study of “data contamination” – a growing problem when training highly capacity models on datasets such as Common Crawl, which can potentially include content from test datasets since because such content often exists on the Web. In this paper we develop systematic tools to measure data contamination and quantify its distorting effects. Although we find that data contamination has a minimal effect on GPT-3’s performance on most datasets, we do identify a few datasets where it could be inflating results, and we either do not report results on these datasets or we note them with an asterisk, depending on the severity.

In addition to all the above, we also train a series of smaller models (ranging from 125 million parameters to 13 billion parameters) in order to compare their performance to GPT-3 in the zero, one and few-shot settings. Broadly, for most tasks we find relatively smooth scaling with model capacity in all three settings; one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners.

Finally, given the broad spectrum of capabilities displayed by GPT-3, we discuss concerns about bias, fairness, and broader societal impacts, and attempt a preliminary analysis of GPT-3’s characteristics in this regard.

The remainder of this paper is organized as follows. In Section 2, we describe our approach and methods for training GPT-3 and evaluating it. Section 3 presents results on the full range of tasks in the zero-, one- and few-shot settings. Section 4 addresses questions of data contamination (train-test overlap). Section 5 discusses limitations of GPT-3. Section 6 discusses broader impacts. Section 7 reviews related work and Section 8 concludes.

2 Approach

Our basic pre-training approach, including model, data, and training, is similar to the process described in [RWC+19], with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. Our use of in-context learning is also similar to [RWC+19], but in this work we systematically explore different settings for learning within the context. Therefore, we start this section by explicitly defining and contrasting the different settings that we will be evaluating GPT-3 on or could in principle evaluate GPT-3 on. These settings can be seen as lying on a spectrum of how much task-specific data they tend to rely on. Specifically, we can identify at least four points on this spectrum (see Figure 2.1 for an illustration):

  • Fine-Tuning (FT) has been the most common approach in recent years, and involves updating the weights of a pre-trained model by training on a supervised dataset specific to the desired task. Typically thousands to hundreds of thousands of labeled examples are used. The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out-of-distribution [MPL19], and the potential to exploit spurious features of the training data [GSL+18, NKJ19], potentially resulting in an unfair comparison with human performance. In this work we do not fine-tune GPT-3 because our focus is on task-agnostic performance, but GPT-3 can be fine-tuned in principle and this is a promising direction for future work.

  • Few-Shot (FS) is the term we will use in this work to refer to the setting where the model is given a few demonstrations of the task at inference time as conditioning [RWC+19], but no weight updates are allowed. As shown in Figure 2.1, for a typical dataset an example has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving K examples of context and completion, and then one final example of context, with the model expected to provide the completion. We typically set K in the range of 10 to 100 as this is how many examples can fit in the model’s context window (nctx = 2048). The main advantages of few-shot are a major reduction in the need for task-specific data and reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art, fine-tuned models. Also, a small amount of task specific data is still required. As indicated by the name, few-shot learning as described here for language models is related to few-shot learning as used in other contexts in ML [HYC01, VBL+16] – both involve learning based on a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task.

  • One-Shot (1S) is the same as few-shot except that only one demonstration is allowed, in addition to a natural language description of the task, as shown in Figure 1. The reason to distinguish one-shot from few-shot and zero-shot (below) is that it most closely matches the way in which some tasks are communicated to humans. For example, when asking humans to generate a dataset on a human worker service (for example Mechanical Turk), it is common to give one demonstration of the task. By contrast it is sometimes difficult to communicate the content or format of a task if no examples are given.

{'prompt_tokens': 976, 'completion_tokens': 1122, 'total_tokens': 2098, 'prompt_tokens_details': {'cached_tokens': 0, 'audio_tokens': 0}, 'completion_tokens_details': {'reasoning_tokens': 0, 'audio_tokens': 0, 'accepted_prediction_tokens': 0, 'rejected_prediction_tokens': 0}}

This was the path to direct placement of an 85-token image on gpt-4-vision, without tiling or resizing, but it seems to be billed similarly now.
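
For comparison, the documented way to attach the image uses typed content parts, where "detail": "low" is what gives the flat 85-token image cost (a sketch reusing base64_image from the script above):

user_documented = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Provide a full verbatim transcript of attached image's contents."},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{base64_image}",
                    "detail": "low",  # flat 85-token image cost, no tiling
                },
            },
        ],
    }
]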

Bad names? Then take your page, slice it into 2000x500 images, and test the understanding on those strips, as in the sketch below.
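
A sketch of that slicing with Pillow (strip size per the suggestion above; the file paths are placeholders):

from PIL import Image

page = Image.open("./page.png")
strip_w, strip_h = 2000, 500
width, height = page.size

# Cut the page into horizontal strips no taller than 500 px and save each one.
for i, top in enumerate(range(0, height, strip_h)):
    box = (0, top, min(strip_w, width), min(top + strip_h, height))
    page.crop(box).save(f"./page_strip_{i}.png")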

You could also hint at likely names in the prompt if you know who could be under discussion.
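
For example, if you keep a candidate list (the names below are invented), you could append a hint like this to the transcription prompt:

# Hypothetical candidate names pulled from your own records.
known_names = ["Vojtech Silhavy", "Jana Novakova", "Peter Kral"]

hint = (
    "Names that may appear in this document include: "
    + ", ".join(known_names)
    + ". Transcribe names exactly as printed; if a printed name is not on this list, keep the printed spelling."
)

prompt = "Provide a full verbatim transcript of attached image's contents.\n\n" + hint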