Hi,
We're using DOCUMENT_TEXT_DETECTION in production to perform OCR on documents. We've found that OCR quality on PDF documents is much worse than on the exact same page supplied as a TIFF (missing characters, extra whitespace, etc.).
I've attached an example test image in both PDF and TIFF formats. You can see the text is very legible and the OCR from the TIFF is 100% correct. The OCR from the PDF has multiple missing characters.
This leads me to believe that the internal rendering of PDFs performed by the Cloud Vision API is buggy.
Can anyone shed any light?
Correct OCR results from TIFF:
STANDING ORDER PAYMENT
THANK YOU
RETAIL PURCHASE
INTEREST - PURCHASES
STANDING ORDER PAYMENT
THANK YOU
RETAIL PURCHASE
STANDING ORDER PAYMENT
THANK YOU
############
############
############
############
############
############
Santander UK plc. Registered Office: 2 Triton Square, Regent's Place, London NW1 3AN, United Kingdom. Registered Number 2294747. Registered in England. www.santander.co.uk. Telephone 0800
389 7000. Calls may be recorded or monitored. Authorised by the Prudential Regulation Authority and regulated by the Financial Conduct Authority and the Prudential Regulation Authority. Our Financial
Services Register number is 106054. Santander UK plc is also licensed by the Financial Supervision Commission of the Isle of Man for its branch in the Isle of Man. Deposits held with the Isle of Man
branch are covered by the Isle of Man Depositors' Compensation Scheme as set out in the Isle of Man Depositors' Compensation Scheme Regulations 2010. In the Isle of Man, Santander UK plc's
principal place of business is at 19/21 Prospect Hill, Douglas, Isle of Man, IM1 1ET. Santander and flame logo are registered trademarks.
Page 14 of 19
Poor read from PDF:
STANDING ORDER PAYMENT
THANK YOU
RETAIL PURCHASE
INTEREST PURCHASES
STANDING ORDER PAYMENT
THANK YOU
RETAIL PURCHASE
STANDING ORDER PAY NT
THANK YOU
############
####
###
###
###
####
####
############
Santander UK plc. Registered Office: 2 Triton Square, Regent's Place, London NW1 3AN, United Kingdom. Registered Number 2294747. Registered in England. www.santander.co.uk. Telephone 0800
389 7000. Calls may be recorded or monitored. Authorised by the Prudential Regulation Authority and regulated by the Financial Conduct Authority and the Prudential Regulation Authority. Our Financial
Services Register number is 106054. Santander UK plc is also licensed by the Financial Supervision Commission of the Isle of Man for its branch in the Isle of Man. Deposits held with the Isle of Man
branch are covered by the Isle of Man Depositors' Compensation Scheme as set out in the Isle of Man Depositors' Compensation Scheme Regulations 2010. In the Isle of Man, Santander UK plc's
principal place of business is at 19/21 Prospect Hill, Douglas, Isle of Man, IM1 1ET. Santander and flame logo are registered trademarks.
Page 14 of 19
Note the missing hyphen, the missing 'ME' from 'PAYMENT', and several lost hash/pound characters along with extra newlines.
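For anyone who wants to reproduce the comparison, here is a minimal sketch that lists the lines differing between two OCR results, using only Python's standard difflib module. The function name ocr_line_diff and the sample strings (short excerpts of the output above) are illustrative assumptions and not part of the Vision API.

```python
import difflib


def ocr_line_diff(expected, actual):
    """Return (tag, expected_lines, actual_lines) for every non-matching block."""
    exp_lines = expected.splitlines()
    act_lines = actual.splitlines()
    # autojunk=False keeps repeated lines (e.g. "THANK YOU") from being treated as junk.
    matcher = difflib.SequenceMatcher(a=exp_lines, b=act_lines, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, exp_lines[i1:i2], act_lines[j1:j2]))
    return changes


# Short excerpts of the TIFF and PDF results, for illustration only.
tiff_text = "RETAIL PURCHASE\nINTEREST - PURCHASES\nSTANDING ORDER PAYMENT\nTHANK YOU"
pdf_text = "RETAIL PURCHASE\nINTEREST PURCHASES\nSTANDING ORDER PAY NT\nTHANK YOU"

for tag, expected_lines, actual_lines in ocr_line_diff(tiff_text, pdf_text):
    print(tag, expected_lines, "->", actual_lines)
```

Running this on the full text of both responses should make it easier to see exactly which lines the PDF path drops or splits.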
The PDF and TIFF can be found in this shared Google Drive folder: https://drive.google.com/drive/folders/1M4VZ3cT3YDoEn5o565fdWP6_47Y_KISL?usp=sharing
Here's a screenshot of the PDF for ease:
Hi, phildrip,
Could you try TEXT_DETECTION instead of DOCUMENT_TEXT_DETECTION and share your results?
To update your model, set the feature's 'model' value to "builtin/latest", for example:
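from google.cloud import vision  # assumes google-cloud-vision < 2.0, which exposes vision.types and vision.enums

client = vision.ImageAnnotatorClient()
feature = vision.types.Feature(
    type=vision.enums.Feature.Type.TEXT_DETECTION, model="builtin/latest")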
I will be awaiting your response.