JP OCR Language not working!

Issue with OCR for Japanese Language in ChatGPT

Dear OpenAI Support Team,

I hope this message finds you well. I am currently using ChatGPT 4o and have encountered an issue when trying to use Optical Character Recognition (OCR) to extract text from images containing Japanese characters.

Previously, I was able to successfully extract text from Japanese images using the OCR feature. However, in the current environment, it seems that the Japanese language model is no longer available or supported for OCR, which is crucial for my usage. I receive an error indicating that the Japanese language is not loaded or supported in the current setup.

As I am a paying subscriber to ChatGPT and rely on the OCR functionality, particularly for Japanese text, I would appreciate any guidance or assistance you can provide. I would like to know if there is a way to enable or access the Japanese language model for OCR in my current environment, or if there are any future plans to support this feature.

Thank you for your time and support. I look forward to your response.

Best regards,

1 Like

I see the same issue (I fed it a screengrab from asahi.com in Japanese):

It seems there was an issue with the Japanese language data for OCR processing. Let me fix the issue and try again using a general OCR engine.

However, I then re-tried it without OCR and it worked fine. This was my prompt:

List all the headlines (in English) in the enclosed image. Do not use OCR.

Maybe their OCR tool is not working anymore?

There is no specific “OCR Tool”

The ChatGPT AI has either:

  • built-in computer vision;
  • a Python environment where it can execute scripts it writes.

I assume the latter might be employed for the language you are using, and we can confirm the issue: Japanese fails while English succeeds. Ultimately:

The OCR extraction for the English text image was successful using pytesseract, and the text was correctly identified.
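
For reference, a minimal sketch of roughly what that pytesseract call looks like in the notebook (the image filename is a placeholder, not anything from this thread):

    from PIL import Image
    import pytesseract

    # English ("eng") ships with the default Tesseract install, so this works
    # out of the box; "english_sample.png" is a placeholder filename.
    image = Image.open("english_sample.png")
    text = pytesseract.image_to_string(image, lang="eng")
    print(text)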

Summary of Findings:

  1. Python Environment:
    • Python version: 3.11.8.
    • Platform: Linux-4.4.0-x86_64-with-glibc2.36.
  2. Tesseract version (via pytesseract):
    • Tesseract 5.3.0 (leptonica-1.82.0).
  3. OCR on English Text:
    • Successful, indicating that the current environment supports basic OCR for English text.

Root Cause for Japanese OCR Issues:

  • The problem seems specific to Japanese language OCR (lang="jpn"), which may point to one of the following:
    • Missing Language Data: The Japanese language pack might not be installed in the Tesseract setup.
    • Font Rendering Issues: Japanese text might have been hard to recognize due to font rendering or image quality.

Next Steps:

  • Install Japanese Language Pack: Ensure that the Japanese language data for Tesseract (jpn.traineddata) is available.
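
A quick way to confirm whether missing language data is the problem is to ask Tesseract which language packs it can actually see. A minimal sketch (get_languages requires pytesseract 0.3.8 or newer):

    import pytesseract

    # Lists the language codes Tesseract can load from its tessdata directory;
    # "jpn" will be missing if jpn.traineddata is not installed.
    available = pytesseract.get_languages(config="")
    print(available)
    print("Japanese available:", "jpn" in available)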

(The ChatGPT platform is currently near-nonfunctional, not completing responses and loading slowly in general. When I attempted OCR without specifying a language parameter, the extracted text was garbled nonsense.)

1 Like

Sorry - when we say “OCR tool” in this context, we do actually mean the Python interpreter invoking pytesseract :wink:

1 Like

There is OCR functionality available in the current environment, but with a limitation for Japanese: OCR for languages such as English works, while Japanese (lang="jpn") is not supported because the language data is missing.

  • OCR in English: successful extraction using pytesseract, confirming that the environment supports basic OCR for English text.
  • Japanese OCR issues: the Japanese language pack might not be installed in the Tesseract setup, causing the extraction to fail.

  1. Download the 30 MB file: tessdata/jpn.traineddata at main · tesseract-ocr/tessdata · GitHub
  2. Attach it to your message along with your images
  3. Give instructions for using the language file

Prompt

OCR Task Instruction for AI:

You’ve received an uploaded Japanese language data file (jpn.traineddata) for pytesseract and image files from a user. Perform OCR on the images using the following steps in your Python notebook environment to enable Japanese:

  1. Set the TESSDATA_PREFIX environment variable to the mount point path containing the uploaded jpn.traineddata file to ensure Tesseract recognizes the custom language data.
  2. Use the pytesseract library to perform OCR on the uploaded image, specifying ‘jpn’ as the language parameter.
  3. Return the extracted text from the image.
  4. Use your own computer vision to extract the text and see if you have understanding. Synthesize your results with those of the pytesseract output to produce a high-quality image transcription.
  5. Upgrade to native Japanese OCR software if the results are still poor.
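
For anyone curious what the interpreter actually has to run once the file is attached, here is a minimal sketch, assuming the attachment is mounted at /mnt/data (the path ChatGPT typically uses for uploads) and using a placeholder image filename:

    import os
    import pytesseract
    from PIL import Image

    # Point Tesseract at the directory containing the uploaded jpn.traineddata.
    # /mnt/data is an assumption about where the attachment ends up.
    os.environ["TESSDATA_PREFIX"] = "/mnt/data"

    # Run OCR on the uploaded image with the Japanese model.
    image = Image.open("/mnt/data/your_image.png")  # placeholder filename
    print(pytesseract.image_to_string(image, lang="jpn"))

Passing config="--tessdata-dir /mnt/data" to image_to_string should also work as an alternative to setting the environment variable.
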
1 Like

I experience the same for Danish.