I am trying to import a PDF file into a Vertex AI Search data store (formerly Gen AI App Builder) using a Python script. However, the import fails with the error: "Custom Document Id (`_id`) was not found in document."
Here is the full error message I received:
error_samples {
  code: 3
  message: "Custom Document Id (`_id`) was not found in document."
  details {
    type_url: "type.googleapis.com/google.rpc.ResourceInfo"
    value: "\022!gs://text-feed2/data_file.jsonl:1"
  }
}
error_config {
  gcs_prefix: "gs://81742227817_eu_import_custom/errors4157856705844396544"
}
create_time {
  seconds: 1698827350
  nanos: 74880000
}
update_time {
  seconds: 1698827352
  nanos: 773565000
}
failure_count: 1
I have checked the [Google Cloud documentation](https://cloud.google.com/generative-ai-app-builder/docs/prepare-data#unstructured) but I am still unclear on how to correctly import the PDF file into the datastore.
Here is the script I am using:
import os
from typing import Optional

from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "application_credentials.json"


def import_documents_sample(
    project_id: str, location: str, data_store_id: str, gcs_uri: Optional[str] = None
) -> str:
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )
    client = discoveryengine.DocumentServiceClient(client_options=client_options)
    parent = client.branch_path(
        project=project_id, location=location, data_store=data_store_id, branch="default_branch"
    )
    if gcs_uri:
        request = discoveryengine.ImportDocumentsRequest(
            parent=parent,
            gcs_source=discoveryengine.GcsSource(input_uris=[gcs_uri], data_schema="custom"),
            reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.FULL,
        )
    operation = client.import_documents(request=request)
    response = operation.result()
    metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)
    print(response)
    print(metadata)
    return operation.operation.name


import_documents_sample(
    project_id='project-id',
    location='global',
    data_store_id='<<datatore-id>>',
    gcs_uri='gs://text-feed2/data_file.jsonl'
)
And this is the content of `data_file.jsonl`:
{ "id": "d001", "content": {"mimeType": "application/pdf", "uri": "gs://text-feed2/NOVA_EA.pdf"}, "jsonData": "{\\\"title\\\": \\\"First Document\\\", \\\"url\\\": \\\"https://internal.example.com/documents/first_doc.pdf\\\"}"}
Could anyone provide guidance on how to correctly format the data for import? Any help would be greatly appreciated.
Hi @l-cesaro,
Thank you for reaching out to our community.
Your situation is somewhat related to another inquiry in our community (Gen app builder - unable to import unstructured data through SDK/REST API). I noticed that you also used "custom" as the value for data_schema. With "custom", the importer appears to look for an `_id` field, which your file does not contain (it uses `id`). Since you are working with a PDF file, try setting data_schema to "document" instead.
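As a rough illustration of that change, here is a minimal sketch. The helper names `make_document_line` and `build_import_request` are hypothetical and not part of the library; the bucket path, document ID, and metadata are taken from the question. The first helper builds one JSONL line in the `id` / `content` / `jsonData` layout, and the second builds the same import request as the original script with only `data_schema` changed to "document". It assumes the `google-cloud-discoveryengine` package is installed for the second helper.

```python
import json


def make_document_line(doc_id: str, pdf_uri: str, metadata: dict) -> str:
    """Return one JSONL line describing a PDF stored in Cloud Storage."""
    record = {
        "id": doc_id,
        "content": {"mimeType": "application/pdf", "uri": pdf_uri},
        # jsonData is a JSON *string*, so the metadata is serialized first;
        # the outer json.dumps then escapes its inner quotes exactly once.
        "jsonData": json.dumps(metadata),
    }
    return json.dumps(record)


def build_import_request(parent: str, gcs_uri: str):
    """Build the import request with data_schema="document"."""
    # Imported lazily so make_document_line can be used without the package.
    from google.cloud import discoveryengine

    return discoveryengine.ImportDocumentsRequest(
        parent=parent,
        gcs_source=discoveryengine.GcsSource(
            input_uris=[gcs_uri],
            data_schema="document",  # was "custom" in the original script
        ),
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.FULL,
    )


line = make_document_line(
    "d001",
    "gs://text-feed2/NOVA_EA.pdf",
    {
        "title": "First Document",
        "url": "https://internal.example.com/documents/first_doc.pdf",
    },
)
print(line)  # write this line to data_file.jsonl and upload it to the bucket
```

The request returned by `build_import_request` can be passed to `client.import_documents(request=...)` exactly as in the original script.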
Hope this helps.
This helped - thank you