Hello everyone,
We are using DOCUMENT_TEXT_DETECTION, the OCR feature of the Vision API. Since around 9:00 AM (JST) on March 8, 2024, we have confirmed that some Japanese (JA) text recognized by the API includes old-form (kyūjitai) Japanese characters.
For example, the modern character "内" (the standard form in Japan) is now returned as the old form "內" in the inference results. We have also confirmed other cases, such as "検" (standard) being returned as "檢". It is highly likely that other characters are being returned in old forms as well.
This problem never occurred in the past, and we have confirmed that it continues to occur after March 8, 2024.
We also checked the locale in the response. At first we expected the problem to occur only when the locale was detected as "und", but we have confirmed that it also occurs when the locale is detected as "ja".
Has there been a change in the model or algorithm? Or is there a problem with the way we are using it?
If there is a solution, we would appreciate it if you could let us know.
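While waiting for a fix, one possible stopgap is to normalize the affected characters in post-processing. A minimal sketch, covering only the two pairs confirmed above (the mapping would need to be extended as more affected characters are found):

```python
# Stopgap sketch: map the old-form (kyujitai) characters confirmed so far
# back to their modern standard forms before downstream processing.
OLD_TO_NEW = {
    "內": "内",  # confirmed: old form returned instead of standard "内"
    "檢": "検",  # confirmed: old form returned instead of standard "検"
}

def normalize_ocr_text(text: str) -> str:
    """Replace known old-form characters in OCR output with modern forms."""
    return text.translate(str.maketrans(OLD_TO_NEW))

print(normalize_ocr_text("內容を檢索"))  # -> "内容を検索"
```

This only patches known cases, so it is not a real solution, but it may reduce the immediate impact.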
UPDATE:
The situation has changed.
We changed the DOCUMENT_TEXT_DETECTION model from "builtin/stable" to "builtin/weekly", and the problem was resolved.
We consider this problem quite critical for Japanese OCR.
Does Google have any plans to reflect this fix in builtin/stable?
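For reference, a minimal sketch of how the model can be pinned in the `images:annotate` request body via the feature's `model` field (the image URI is a placeholder):

```python
import json

def build_annotate_request(image_uri: str, model: str = "builtin/weekly") -> dict:
    """Build an images:annotate request body that pins the OCR model.

    The "model" field on the feature selects between "builtin/stable"
    and "builtin/weekly" for DOCUMENT_TEXT_DETECTION.
    """
    return {
        "requests": [
            {
                "image": {"source": {"imageUri": image_uri}},
                "features": [
                    {"type": "DOCUMENT_TEXT_DETECTION", "model": model}
                ],
            }
        ]
    }

# Placeholder URI for illustration only.
body = build_annotate_request("gs://my-bucket/sample.png")
print(json.dumps(body, indent=2))
```

Switching the default argument back to "builtin/stable" reproduces the problem on our side.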