Can an assistant help me with OCR?

I have been reading about OCR capabilities with images, and all I have found that is a little helpful is a post about using Spectre.Console.dll. What Chat returns is a synopsis of what the image says, but it doesn't return the exact text as it was read. Can someone point me to an example of how that is done?

The endpoint I am using is: https://api.openai.com/v1/chat/completions

Here is the code:

private async Task readfile(string filePath)
{
    string completionsEndpoint = "https://api.openai.com/v1/chat/completions";
    JsonSerializerOptions jsonOptions = new()
    {
        WriteIndented = true,
        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull
    };

    // LOGIC
    await AnsiConsole.Status()
        .StartAsync("Analyzing image…", async ctx =>
        {
            try
            {
                byte[] imageBytes = await File.ReadAllBytesAsync(filePath);
                string imageAsBase64String = Convert.ToBase64String(imageBytes);
                // Path.GetExtension returns ".png" with the dot; trim it so the data URI is valid
                string fileExtension = Path.GetExtension(filePath).TrimStart('.');

                HttpClient client = new();
                client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", token);

                VisionRequest visionRequest = new(new List<ChatMessage>
                {
                    new(new List<MessageContent>
                    {
                        new("text", "What's in this image?", null),
                        new("image_url", null, new ImageUrl($"data:image/{fileExtension};base64,{imageAsBase64String}"))
                    })
                });

                string json = JsonSerializer.Serialize(visionRequest, jsonOptions);

                // POST the request and read the first choice's message content
                HttpResponseMessage response = await client.PostAsync(completionsEndpoint, new StringContent(json, Encoding.UTF8, "application/json"));
                VisionResponse? visionResponse = await response.Content.ReadFromJsonAsync<VisionResponse>();
                string? content = visionResponse?.Choices?.FirstOrDefault()?.Message?.Content;

                if (!string.IsNullOrEmpty(content))
                {
                    txtdocument.Text = content;
                    Cursor.Current = Cursors.Default;
                    AnsiConsole.MarkupLine($"Here is a description of your provided image: [yellow]{content}[/]");
                    AnsiConsole.WriteLine();
                }
                else
                {
                    AnsiConsole.MarkupLine("Unfortunately there is no content available to display.");
                }
            }
            catch (Exception ex)
            {
                AnsiConsole.MarkupLine($"Something went wrong: [red]{ex.Message}[/]");
            }
        });
}

Welcome back :slight_smile:

The vision models aren't really built for OCR. You might have more success with a dedicated OCR tool like Tesseract, and then deal with the text from there. The vision models can read or extract some things, but the issue is that they tend to hallucinate - infer things that could logically be there, but aren't.

You can think of it as being shown a picture for a second and then being asked to describe in high detail what you just saw. That's why, at the moment, it doesn't really work the way you're expecting.
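If you want to try the dedicated-OCR route from C#, a rough sketch using the Tesseract NuGet wrapper looks something like this (the package, the ./tessdata folder, and the English traineddata file are my assumptions, not something from your project):

using Tesseract; // NuGet package "Tesseract" (assumed to be installed)

static string OcrImage(string filePath)
{
    // Assumes eng.traineddata has been downloaded into ./tessdata
    using var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default);
    using var img = Pix.LoadFromFile(filePath);
    using var page = engine.Process(img);

    // GetText returns the raw recognized text; GetMeanConfidence is a 0..1 score
    Console.WriteLine($"OCR confidence: {page.GetMeanConfidence():P0}");
    return page.GetText();
}

The output is deterministic, so it won't invent words, but you may still need to clean it up - scanned pages, low contrast, and unusual fonts all hurt accuracy.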


I have found that, as context windows increase, more and more Document Chat applications are turning to the same pipeline: PDF → split into pages → pages converted into images, which are then provided to the AI one by one to extract the text.

Basically the same idea as Tesseract. However, I have tried it, and even using Tesseract again the Assistant still isn't capable of getting 100% of the text; it relies heavily on supervision and manual revisions.
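As a rough sketch of the first half of that pipeline in C# (rasterizing PDF pages to images with Magick.NET, which is my library choice here and needs Ghostscript installed for PDF input; the 300 DPI setting is also just an assumption):

using ImageMagick; // NuGet "Magick.NET" (assumed); reading PDFs also requires Ghostscript

static IEnumerable<string> PdfToPageImages(string pdfPath, string outputDir)
{
    // 300 DPI keeps small print readable for the model or for Tesseract
    var settings = new MagickReadSettings { Density = new Density(300) };

    using var pages = new MagickImageCollection();
    pages.Read(pdfPath, settings);

    int pageNumber = 1;
    foreach (var page in pages)
    {
        string imagePath = Path.Combine(outputDir, $"page-{pageNumber++}.png");
        page.Write(imagePath); // one PNG per page
        yield return imagePath;
    }
}

Each returned path can then go through the same base64/data-URI step as in the code at the top of the thread, one page per request.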

In the OpenAI Cookbook I found a similar application; it was named "Parse PDF for GPT-4o" or something close to that. Sorry, I haven't had a chance to find it again.

It's interesting to me that I can get Chat to tell me about what was written in the image, but it has no way of returning what it has interpreted from the image. If it can create a generalization of the document, I would think it could just return everything it "sees".

What interpretation is it that you think is missing that cannot be instructed?
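For example, the code at the top of the thread literally asks "What's in this image?", which invites a summary. Keeping your same MessageContent shape and only changing the text part to an explicit transcription instruction (the wording below is just a suggestion) pushes the model toward a verbatim read:

new(new List<MessageContent>
{
    // Ask for a transcript instead of a description
    new("text",
        "Transcribe every word of text visible in this image, exactly as written, " +
        "preserving line breaks. Do not summarize, describe, or add anything.",
        null),
    new("image_url", null, new ImageUrl($"data:image/{fileExtension};base64,{imageAsBase64String}"))
})

Depending on how VisionRequest is defined, you may also need to set max_tokens high enough that a full page of text isn't cut off.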

I am testing against a scanned document of a Telescope Installation Instructions. When I gave the exact instructions you listed above, it did a little better but not much. It still will not give me a verbatim transcript.

Last week, I conducted a study using photographs of Brazilian ID documents.

I also considered using GPT for OCR, but the request takes an average of 4 to 6 seconds.

I tested Tesseract with Node.js, Golang, and Python, as well as OpenAI GPT-4 and AWS Textract.

Results


Tesseract

  • Medium efficiency:
    • Python + Tesseract: average of 600 ms
    • Golang + Tesseract: average of 1000 ms
    • Node.js + Tesseract: average of 3000 ms

AWS Textract

  • Very efficient:
    • Python + AWS Textract: average of 4000 ms
    • Golang + AWS Textract: average of 4500 ms
    • Node.js + AWS Textract: average of 5000 ms

OpenAI GPT-4

  • Not very efficient:
    • Python + OpenAI GPT-4: average of 5500 ms
    • Golang + OpenAI GPT-4: average of 5700 ms
    • Node.js + OpenAI GPT-4: average of 6000 ms

It is worth noting that Tesseract is quite efficient with PDFs and Word documents, as long as they are converted to images.

Conclusion

For speed and cost reasons, I ended up preferring Tesseract (it’s free).

If you want greater efficiency, albeit with a cost, use AWS Textract or a similar service.

OpenAI GPT-4 was good, but it is not ideal for OCR.


I also had a system message giving the AI the ability to repeat back verbatim the contents of any images without alteration.
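A minimal sketch of what that can look like on the wire, writing the request body by hand rather than through the VisionRequest classes earlier in the thread (the field names follow the chat/completions schema; the model name, the wording, and the base64Image variable are assumptions for illustration):

string base64Image = Convert.ToBase64String(File.ReadAllBytes("scan.png")); // assumed input file

string body = $$"""
{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "system",
      "content": "You transcribe images. Return only the exact text you can read, preserving line breaks. Never summarize or invent text."
    },
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Transcribe this scanned page word for word." },
        { "type": "image_url", "image_url": { "url": "data:image/png;base64,{{base64Image}}" } }
      ]
    }
  ],
  "max_tokens": 2000
}
""";

Even with that, the verbatim guarantee only goes as far as the model's reading of the pixels; anything it cannot resolve it may still guess at, which is the hallucination risk mentioned above.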

Something to consider: an image is 85 tokens (or less with gpt-4o). It would be an ultimate feat of compression if that could contain more than 85 tokens of language to be extracted reliably.