Programmatically import documents into a dataset using DocumentServiceClient.ImportDocuments

I am trying to import documents into a dataset using DocumentServiceClient.ImportDocuments(). There are many examples of how to make the actual call (see the code snippet below). What I don't understand is where I specify the actual documents to be imported.

My understanding is that the dataset specified is the dataset I am importing the documents INTO.  But how does it know WHICH documents to import? 

I would expect to point it to another dataset containing the documents to be imported, to provide a collection or list of some sort containing the documents, or maybe to point it to a location on my hard drive.

Am I missing something here?

Thanks!

 

DocumentServiceClient dsClient = new DocumentServiceClientBuilder()
{
    Endpoint = $"{locationId}-documentai.googleapis.com"
}.Build();
ImportDocumentsRequest importDocumentsRequest = new ImportDocumentsRequest()
{
    DatasetAsDatasetName = DatasetName.FromProjectLocationProcessor(projectId, locationId, processorId),
    BatchDocumentsImportConfigs =
    {
        new ImportDocumentsRequest.Types.BatchDocumentsImportConfig()
    }
};
// Make the request
Operation<ImportDocumentsResponse, ImportDocumentsMetadata> importDocumentsResponse = dsClient.ImportDocuments(importDocumentsRequest);
// Poll until the returned long-running operation is complete
Operation<ImportDocumentsResponse, ImportDocumentsMetadata> completedResponse = importDocumentsResponse.PollUntilCompleted();
// Retrieve the operation result
ImportDocumentsResponse result = completedResponse.Result;
// Or get the name of the operation
string operationName = importDocumentsResponse.Name;
// This name can be stored, then the long-running operation retrieved later by name
Operation<ImportDocumentsResponse, ImportDocumentsMetadata> retrievedResponse = dsClient.PollOnceImportDocuments(operationName);
// Check if the retrieved long-running operation has completed
if (retrievedResponse.IsCompleted)
{
    // If it has completed, then access the result
    ImportDocumentsResponse retrievedResult = retrievedResponse.Result;
}

 

2 REPLIES

Hello @StephenElmer1,

When using Document AI's ImportDocuments(), you specify the source documents inside each BatchDocumentsImportConfig in your request, not in the main call itself. In the v1beta3 client, each import config carries a BatchInputConfig (a BatchDocumentsInputConfig) that points at Cloud Storage: set GcsPrefix to import everything under a prefix (e.g., gs://your-bucket/documents/), or set GcsDocuments to list individual files with their MIME types. The target dataset is defined separately via DatasetAsDatasetName. The source has to be in Cloud Storage, so files on a local drive need to be uploaded to a bucket first. Make sure Document AI can read the bucket and that the files are in a supported format.

Some key steps you may try:

  • Set BatchInputConfig.GcsPrefix for bulk Cloud Storage imports
  • Set BatchInputConfig.GcsDocuments for individual file control
  • Verify storage permissions and file formats
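
To make the steps above concrete, here is a minimal sketch of a filled-in request, assuming the Google.Cloud.DocumentAI.V1Beta3 package that the question's snippet uses. As far as I can tell from the v1beta3 API, the source goes in the BatchInputConfig property of each import config, as either a GcsPrefix or a GcsDocuments list. The bucket paths, project, location, and processor IDs are hypothetical placeholders, and the train/test split assignment is an illustrative assumption rather than something from the original post.

```csharp
using System;
using Google.Cloud.DocumentAI.V1Beta3;
using Google.LongRunning;

// Hypothetical placeholder values; substitute your own.
string projectId = "my-project";
string locationId = "us";
string processorId = "my-processor-id";

DocumentServiceClient dsClient = new DocumentServiceClientBuilder
{
    Endpoint = $"{locationId}-documentai.googleapis.com"
}.Build();

// Option 1: import every file under a Cloud Storage prefix.
var prefixConfig = new ImportDocumentsRequest.Types.BatchDocumentsImportConfig
{
    BatchInputConfig = new BatchDocumentsInputConfig
    {
        GcsPrefix = new GcsPrefix { GcsUriPrefix = "gs://your-bucket/documents/" }
    },
    // Assumption: these documents go to the training split.
    DatasetSplit = DatasetSplitType.DatasetSplitTrain
};

// Option 2: name individual files, each with its MIME type.
var listConfig = new ImportDocumentsRequest.Types.BatchDocumentsImportConfig
{
    BatchInputConfig = new BatchDocumentsInputConfig
    {
        GcsDocuments = new GcsDocuments
        {
            Documents =
            {
                new GcsDocument
                {
                    GcsUri = "gs://your-bucket/extra/invoice-001.pdf",
                    MimeType = "application/pdf"
                }
            }
        }
    },
    // Assumption: these documents go to the test split.
    DatasetSplit = DatasetSplitType.DatasetSplitTest
};

var request = new ImportDocumentsRequest
{
    // Target: the dataset the documents are imported INTO.
    DatasetAsDatasetName = DatasetName.FromProjectLocationProcessor(projectId, locationId, processorId),
    // Sources: one or more import configs, each pointing at Cloud Storage.
    BatchDocumentsImportConfigs = { prefixConfig, listConfig }
};

// Start the import and block until the long-running operation finishes.
Operation<ImportDocumentsResponse, ImportDocumentsMetadata> operation = dsClient.ImportDocuments(request);
Operation<ImportDocumentsResponse, ImportDocumentsMetadata> completed = operation.PollUntilCompleted();
Console.WriteLine($"Completed: {completed.IsCompleted}, faulted: {completed.IsFaulted}");
```

The two configs are independent, so a single config is enough if you only need one source. Running this requires application default credentials and a bucket that Document AI is allowed to read.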

Best regards,

Suwarna

Hi @StephenElmer1,

Welcome to Google Cloud Community!

In addition to @SuwarnaKale's insights, you can also refer to the documentation on importing documents using the console or the import documents RPC, which might be helpful as an additional reference for importing documents into a dataset.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.