Python: Google Vision doesn't (cannot) read and convert photocopied books, from PDF into TXT

Sony34 · 06-18-2023 12:12 PM

This code runs without errors and finishes.

The problem is that the converted .txt files are 0 bytes. It seems that Google Cloud Vision cannot read and convert photocopied books from PDF into TXT.

import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "d:/doc/doc/MY-KEY.json"

from google.cloud import vision
from google.cloud.vision_v1 import types
from google.oauth2.service_account import Credentials

# from google.cloud import storage
# client library
# storage_client = storage.Client()

# Set up the Google Cloud Vision client with service account credentials
# credentials = Credentials.from_service_account_file('d:/doc/doc/bebe-1084-992b240528be.json')
# client = vision.ImageAnnotatorClient(credentials=credentials)

# pip install google-cloud-vision

# Set up the Google Cloud Vision client
client = vision.ImageAnnotatorClient()

# Directory containing the PDF files
pdf_directory = "d:/doc/doc"

# Output directory for the TXT files
output_directory = "d:/doc/doc"

# Get a list of PDF files in the directory
pdf_files = [file for file in os.listdir(pdf_directory) if file.endswith(".pdf")]

# Process each PDF file
for pdf_file in pdf_files:
    pdf_path = os.path.join(pdf_directory, pdf_file)

    # Create the output TXT file path
    txt_file = os.path.splitext(pdf_file)[0] + ".txt"
    txt_path = os.path.join(output_directory, txt_file)

    # Read the PDF file as bytes
    with open(pdf_path, 'rb') as file:
        content = file.read()

    # Convert PDF to image using Google Cloud Vision API
    input_image = types.Image(content=content)
    response = client.document_text_detection(image=input_image)

    # Extract text from the response and save it as TXT
    text = response.full_text_annotation.text
    with open(txt_path, 'w', encoding='utf-8') as file:
        file.write(text)

    print(f"Converted {pdf_file} to {txt_file}")

kvandres

Good day @Sony34,

Welcome to Google Cloud Community!

You are seeing this behavior because the PDF files you are trying to process are stored in your local storage. Currently, PDF/TIFF document detection is only available if the files are stored in Google Cloud Storage buckets.

Please note that the response you will get is in JSON format, not in text format. The JSON files created by a PDF/TIFF request are stored in the Cloud Storage bucket you specify.

Additionally, you need to use the files:asyncBatchAnnotate function to perform an asynchronous request; it reports its status through the operations request. The account used for authentication must have one of the roles roles/editor or roles/storage.objectCreator, or above, in order to save the JSON file in your bucket.

For more information, see: https://cloud.google.com/vision/docs/pdf
Here is sample code that you can use as a guide for PDF/TIFF document detection: https://cloud.google.com/vision/docs/pdf#document_text_detection_requests
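To make the steps above concrete, here is a minimal sketch of the asynchronous flow, loosely following the shape of the linked sample. It assumes the PDF has already been uploaded to a bucket and that the google-cloud-vision and google-cloud-storage packages are installed. The helper names (split_gcs_uri, first_page, extract_text, pdf_in_gcs_to_text), the batch size, and the timeout are illustrative choices, not part of the API. The sort key also assumes the output files are named like output-1-to-2.json; adjust it if your bucket shows a different pattern.

```python
import json
import re


def split_gcs_uri(uri):
    """Split 'gs://bucket/prefix' into (bucket, prefix)."""
    match = re.match(r"gs://([^/]+)/?(.*)", uri)
    if match is None:
        raise ValueError(f"Not a gs:// URI: {uri}")
    return match.group(1), match.group(2)


def first_page(blob_name):
    """Sort key: first page number in names like 'output-1-to-2.json' (assumed naming)."""
    match = re.search(r"output-(\d+)-to-\d+\.json$", blob_name)
    return int(match.group(1)) if match else 0


def extract_text(json_string):
    """Join the text of every page found in one output JSON file."""
    data = json.loads(json_string)
    pages = []
    for page in data.get("responses", []):
        # Blank pages may carry no fullTextAnnotation at all.
        pages.append(page.get("fullTextAnnotation", {}).get("text", ""))
    return "\n".join(pages)


def pdf_in_gcs_to_text(gcs_source_uri, gcs_destination_uri, timeout=420):
    """Run async document text detection on a PDF in GCS and return its text."""
    # Imported here so the helpers above work without the client libraries.
    from google.cloud import storage, vision

    client = vision.ImageAnnotatorClient()
    request = vision.AsyncAnnotateFileRequest(
        features=[vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)],
        input_config=vision.InputConfig(
            gcs_source=vision.GcsSource(uri=gcs_source_uri),
            mime_type="application/pdf",
        ),
        output_config=vision.OutputConfig(
            gcs_destination=vision.GcsDestination(uri=gcs_destination_uri),
            batch_size=2,  # pages per output JSON file (illustrative value)
        ),
    )

    # Start the long-running operation and block until it completes.
    operation = client.async_batch_annotate_files(requests=[request])
    operation.result(timeout=timeout)

    # Read the JSON results back from the destination prefix, in page order.
    bucket_name, prefix = split_gcs_uri(gcs_destination_uri)
    bucket = storage.Client().get_bucket(bucket_name)
    blobs = [b for b in bucket.list_blobs(prefix=prefix) if b.name.endswith(".json")]
    blobs.sort(key=lambda b: first_page(b.name))
    return "\n".join(extract_text(b.download_as_bytes().decode("utf-8")) for b in blobs)
```

With a hypothetical bucket, a call would look like pdf_in_gcs_to_text("gs://my-bucket/book.pdf", "gs://my-bucket/ocr/book/"), and the returned string can then be written to a local .txt file the same way the original script does. The JSON parsing is kept in its own function so it can be checked without credentials or network access.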

Hope this helps!
