How to export a dataset from Document AI in C#

konradpsiuk · 08-16-2024 02:54 PM

I'm trying to export the dataset from Document AI for validation but can't find how to do it in C#. Searching pointed me to DocumentProcessorServiceClient, but it doesn't seem to have any relevant method.

CincyAI513

-My answer was not correct-

konradpsiuk

Thanks, but DocumentProcessorServiceClient doesn't have a ListProcessorDocuments method.

CincyAI513

1. ⚠️ Google Cloud Console: Use the console interface to view processed documents.

✱ Ease of Use: The console provides a user-friendly interface to explore and manage your Document AI resources. It's often the simplest way to quickly view processed documents, especially for visual inspection or spot-checking.

✱ Visualization and Exploration: The console might offer features like visual document representations, highlighting of extracted entities, and easy navigation through processed documents.

2. ⚠️ Process Response and Output Location: Retrieve processed document locations from API responses.

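✱ Core Functionality: When you batch-process documents with the Document AI API, the operation's metadata reports where each processed output was written (a Cloud Storage URI under the output location you specified in the request). Online processing returns the processed document directly in the response instead. Either way, this is fundamental to the workflow, as you need to know where to find the results.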

✱ Flexibility: This approach gives you flexibility in how you handle the processed documents. You can directly access them in their storage location, download them, or integrate them into your own applications.

3. ⚠️ Metadata-based Querying: Use custom metadata to query and filter documents.

✱ Advanced Filtering: If you add custom metadata (key-value pairs) to your documents during processing, you can use that metadata to search and filter documents in your storage. This is useful for finding specific documents based on their content or processing attributes.

✱ Integration with Storage Services: This approach relies on the capabilities of your storage service (e.g., listing Cloud Storage objects and checking the custom metadata attached to each one).

4. ⚠️ Custom Solution with a Database: Build a system to track and retrieve processed documents.

✱ Processor-Specific Tracking: If you need to specifically track which documents were processed by a particular processor, you'll need a custom solution. This involves storing document and processor information in a database and using it to retrieve documents.

✱ Complex Requirements: This approach is suitable for more complex scenarios where you need fine-grained control over document tracking, history, and retrieval.
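Options 2 and 3 can be sketched together in C#. This is a minimal sketch, not a definitive implementation: it assumes batch output was written as JSON files under a Cloud Storage prefix, and that your own pipeline attached a custom metadata entry to each object. The bucket, prefix, and metadata key/value below are hypothetical placeholders. It uses the Google.Cloud.Storage.V1, Google.Cloud.DocumentAI.V1, and Google.Protobuf packages.

```csharp
using System;
using System.IO;
using System.Text;
using Google.Cloud.DocumentAI.V1;
using Google.Cloud.Storage.V1;
using Google.Protobuf;

class ReadProcessedOutput
{
    static void Main()
    {
        // Hypothetical values: use the bucket and prefix from your own batch request.
        const string bucket = "my-output-bucket";
        const string prefix = "docai-output/";
        // Hypothetical custom metadata that your own pipeline attached to each object.
        const string metaKey = "processor-id";
        const string metaValue = "my-processor-id";

        StorageClient storage = StorageClient.Create();
        // Ignore JSON fields that the installed library version does not know about.
        var parser = new JsonParser(JsonParser.Settings.Default.WithIgnoreUnknownFields(true));

        // The listing is narrowed by name prefix only; the metadata check runs client-side.
        foreach (var obj in storage.ListObjects(bucket, prefix))
        {
            if (!obj.Name.EndsWith(".json", StringComparison.OrdinalIgnoreCase))
                continue;

            bool matches = obj.Metadata != null
                && obj.Metadata.TryGetValue(metaKey, out string value)
                && value == metaValue;
            if (!matches)
                continue;

            // Download the object into memory and parse it as a Document message.
            using var stream = new MemoryStream();
            storage.DownloadObject(bucket, obj.Name, stream);
            string json = Encoding.UTF8.GetString(stream.ToArray());
            Document doc = parser.Parse<Document>(json);

            Console.WriteLine($"{obj.Name}: {doc.Pages.Count} page(s), {doc.Entities.Count} entities");
        }
    }
}
```

If you don't tag objects with metadata, drop the `matches` check and rely on the prefix alone. For large outputs you may prefer streaming to disk over buffering each file in memory.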

✳️⚠️ Why no direct listing method? ⚠️✳️

✱ Scalability and Performance: Document AI is designed to handle large volumes of documents. A direct API method to list all documents processed by a processor could be computationally expensive and impact performance, especially for processors that have handled a massive number of documents.

✱ Storage Diversity: Processed documents can end up in different places (returned inline in the API response, written to Cloud Storage, or copied wherever your application puts them). A single listing method might not be able to accommodate all of these efficiently.

✱ Focus on Processing: Document AI's primary focus is on document processing and extraction. Providing a comprehensive document management and listing system might be outside its core scope.
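The points above are about documents a processor has handled. The documents in a processor's dataset (the labeled training/test set) are a separate case. The v1beta3 API surface defines a DocumentService with ListDocuments and GetDocument calls for datasets, so a rough C# sketch could look like the one below. Treat it as an assumption to verify: it presumes the Google.Cloud.DocumentAI.V1Beta3 package exposes these as DocumentServiceClient.ListDocuments and DocumentServiceClient.GetDocument in the usual generated-client style, and the project, location, and processor ID are hypothetical placeholders.

```csharp
using System;
using Google.Cloud.DocumentAI.V1Beta3;

class ListDatasetDocuments
{
    static void Main()
    {
        // Hypothetical identifiers: replace with your own project, location, and processor ID.
        const string dataset =
            "projects/my-project/locations/us/processors/my-processor-id/dataset";

        // Assumption: the client must target the regional endpoint matching the location.
        DocumentServiceClient client = new DocumentServiceClientBuilder
        {
            Endpoint = "us-documentai.googleapis.com"
        }.Build();

        // Assumption: ListDocuments yields one metadata entry per dataset document.
        var listRequest = new ListDocumentsRequest { Dataset = dataset, PageSize = 50 };
        foreach (DocumentMetadata meta in client.ListDocuments(listRequest))
        {
            // Fetch the full document, including its labeled entities.
            GetDocumentResponse response = client.GetDocument(new GetDocumentRequest
            {
                Dataset = dataset,
                DocumentId = meta.DocumentId
            });

            Document doc = response.Document;
            Console.WriteLine($"{meta.DisplayName}: {doc.Entities.Count} labeled entities");
        }
    }
}
```

Since v1beta3 is a beta surface, names and fields may change between package versions, so check the package reference before wiring this into a pipeline.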

konradpsiuk

When I export a dataset from the Google Cloud console (via the browser), I get a collection of JSON files that include label information. Can I do the same programmatically? I'm trying to include label validation in the pipeline.