JP OCR Language not working!

Issue with OCR for Japanese Language in ChatGPT

Dear OpenAI Support Team,

I hope this message finds you well. I am currently using ChatGPT 4o and have encountered an issue when trying to use Optical Character Recognition (OCR) to extract text from images containing Japanese characters.

Previously, I was able to successfully extract text from Japanese images using the OCR feature. However, in the current environment, it seems that the Japanese language model is no longer available or supported for OCR, which is crucial for my usage. I receive an error indicating that the Japanese language is not loaded or supported in the current setup.

As I am a paying subscriber to ChatGPT and rely on the OCR functionality, particularly for Japanese text, I would appreciate any guidance or assistance you can provide. I would like to know if there is a way to enable or access the Japanese language model for OCR in my current environment, or if there are any future plans to support this feature.

Thank you for your time and support. I look forward to your response.

Best regards,

1 Like

I see the same issue (I fed it a screengrab from asahi.com in Japanese):

It seems there was an issue with the Japanese language data for OCR processing. Let me fix the issue and try again using a general OCR engine.

However, I then re-tried it without OCR and it worked fine. This was my prompt:

List all the headlines (in English) in the enclosed image. Do not use OCR.

Maybe their OCR tool is not working anymore?

There is no specific “OCR Tool”

The ChatGPT AI has either:

  • built-in computer vision;
  • a Python environment where it can execute scripts it writes.

I assume the latter might be employed for the language you are using, and we can confirm the issue: Japanese fails while English succeeds. Ultimately:

The OCR extraction for the English text image was successful using pytesseract, and the text was correctly identified.
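
For reference, a minimal sketch of roughly what that pytesseract call looks like in the notebook (the image filename is a placeholder, not anything from this thread):

    from PIL import Image
    import pytesseract

    # English ("eng") ships with the default Tesseract install, so this works
    # out of the box; "english_sample.png" is a placeholder filename.
    image = Image.open("english_sample.png")
    text = pytesseract.image_to_string(image, lang="eng")
    print(text)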

Summary of Findings:

  1. Python Environment:
    • Python version: 3.11.8.
    • Platform: Linux-4.4.0-x86_64-with-glibc2.36.
  2. Tesseract version (via pytesseract):
    • Tesseract 5.3.0 (leptonica-1.82.0).
  3. OCR on English Text:
    • Successful, indicating that the current environment supports basic OCR for English text.

Root Cause for Japanese OCR Issues:

  • The problem seems specific to Japanese language OCR (lang="jpn"), which may point to one of the following:
    • Missing Language Data: The Japanese language pack might not be installed in the Tesseract setup.
    • Font Rendering Issues: Japanese text might have been hard to recognize due to font rendering or image quality.

Next Steps:

  • Install Japanese Language Pack: Ensure that the Japanese language data for Tesseract (jpn.traineddata) is available.
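
A quick way to confirm whether missing language data is the problem is to ask Tesseract which language packs it can actually see. A minimal sketch (get_languages requires pytesseract 0.3.8 or newer):

    import pytesseract

    # Lists the language codes Tesseract can load from its tessdata directory;
    # "jpn" will be missing if jpn.traineddata is not installed.
    available = pytesseract.get_languages(config="")
    print(available)
    print("Japanese available:", "jpn" in available)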

(The ChatGPT platform is currently near-nonfunctional, not completing responses and loading slowly in general. When I attempted OCR without specifying a language parameter, the extracted text was garbled nonsense.)

1 Like

Sorry - when we say “OCR tool” in this context, we do actually mean the Python interpreter invoking pytesseract :wink:

1 Like

There is OCR functionality available in the current environment, but with a limitation for Japanese: OCR for languages such as English works, while Japanese (lang="jpn") is not supported because the language data is missing.

  • OCR in English: successful extraction using pytesseract, confirming that the environment supports basic OCR for English text.
  • Japanese OCR issues: the Japanese language pack might not be installed in the Tesseract setup, causing the extraction to fail.

  1. Download the 30 MB file: tessdata/jpn.traineddata at main · tesseract-ocr/tessdata · GitHub
  2. Attach it to your message along with your images
  3. Give instructions for using the language file

Prompt

OCR Task Instruction for AI:

You’ve received an uploaded Japanese language data file (jpn.traineddata) for pytesseract and image files from a user. Perform OCR on the images using the following steps in your Python notebook environment to enable Japanese:

  1. Set the TESSDATA_PREFIX environment variable to the mount point path containing the uploaded jpn.traineddata file to ensure Tesseract recognizes the custom language data.
  2. Use the pytesseract library to perform OCR on the uploaded image, specifying ‘jpn’ as the language parameter.
  3. Return the extracted text from the image.
  4. Use your own computer vision to extract the text and see if you have understanding. Synthesize your results with those of the pytesseract output to produce a high-quality image transcription.
  5. Upgrade to native Japanese OCR software if the results are still poor.
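
For anyone curious what the interpreter actually has to run once the file is attached, here is a minimal sketch, assuming the attachment is mounted at /mnt/data (the path ChatGPT typically uses for uploads) and using a placeholder image filename:

    import os
    import pytesseract
    from PIL import Image

    # Point Tesseract at the directory containing the uploaded jpn.traineddata.
    # /mnt/data is an assumption about where the attachment ends up.
    os.environ["TESSDATA_PREFIX"] = "/mnt/data"

    # Run OCR on the uploaded image with the Japanese model.
    image = Image.open("/mnt/data/your_image.png")  # placeholder filename
    print(pytesseract.image_to_string(image, lang="jpn"))

Passing config="--tessdata-dir /mnt/data" to image_to_string should also work as an alternative to setting the environment variable.
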
1 Like

I experience the same for Danish.