Hello!
I am a new user of the GCP Dataproc API and I have been trying to test serverless batch processing to solve a business problem: moving data out of a MySQL database via the Dataproc JDBCTOGCS template.
I have no issues creating and authenticating the requests, but every config I try returns this exception:
ClassNotFoundException: com.mysql.cj.jdbc.Driver
I have provided a snippet below, with the sensitive details obfuscated to xxx.
This code is not intended to be used in production; I'm just trying to get my head around how to make the requests at all.
From reading around the error, I think it is something to do with how I'm pointing to the MySQL connector .jar file (the second jarFileUris entry in the snippet below), but nothing I have tried gets me past this error. As far as I can tell, I'm doing everything the quick-start guide asks of me.
I'm clearly doing something wrong, so any pointers would be greatly appreciated.
import os

import google.auth
import google.auth.transport.requests
import requests as rq

# Authenticate with a service-account key and mint an OAuth2 access token.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "xxx.json"
credentials, project_id = google.auth.default(
    scopes=[
        "https://www.googleapis.com/auth/cloud-platform",
        "https://www.googleapis.com/auth/cloud-platform.read-only",
    ]
)
credentials.refresh(google.auth.transport.requests.Request())
token = credentials.token

# Dataproc Serverless batches endpoint for the target project/region.
url = "https://dataproc.googleapis.com/v1/projects/xxx/locations/europe-west2/batches"
headers = {"Authorization": f"Bearer {token}"}

# Batch spec: run the JDBCTOGCS template as a Spark batch workload.
body = {
    "environmentConfig": {
        "executionConfig": {
            "subnetworkUri": "projects/xxx/regions/europe-west2/subnetworks/xxx",
            "serviceAccount": "xxx@appspot.gserviceaccount.com",
        }
    },
    "runtimeConfig": {"version": "1.1"},
    "sparkBatch": {
        "mainClass": "com.google.cloud.dataproc.templates.main.DataProcTemplate",
        "args": [
            "--template=JDBCTOGCS",
            "--templateProperty", "log.level=DEBUG",
            "--templateProperty", "project.id=xxx",
            "--templateProperty", "jdbctogcs.jdbc.url=jdbc:mysql://xxx.xxxx.xxxx.xxx:3306/xxx?user=xxx&password=xxx",
            "--templateProperty", "jdbctogcs.jdbc.driver.class.name=com.mysql.cj.jdbc.Driver",
            "--templateProperty", "jdbctogcs.output.location=gs://bbt-test-bucket/",
            "--templateProperty", "jdbctogcs.write.mode=overwrite",
            "--templateProperty", "jdbctogcs.output.format=json",
            "--templateProperty", "jdbctogcs.jdbc.fetchsize=10",
            "--templateProperty", "jdbctogcs.sql=xxx",
        ],
        # Template binary plus the MySQL JDBC driver the job should load.
        "jarFileUris": [
            "gs://dataproc-templates-binaries/latest/java/dataproc-templates.jar",
            "gs://shareable_files/drivers/mysql/mysql-connector-java-5.1.30-bin.jar",
        ],
    },
}

res = rq.post(url=url, json=body, headers=headers)
print(res.json())
Hi @Tomalbon,
You need to make sure that the JDBC driver .jar file is downloaded and hosted inside a GCS bucket that the batch job can read; one way to stage it is sketched below. Here is a helpful article to guide you, which also covers the prerequisites for importing data from databases into GCS (via JDBC) using Dataproc Serverless.
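For example, a minimal sketch of staging the driver jar into GCS with the google-cloud-storage client. The bucket name, object path, and local filename here are placeholders, not values from your setup:

from google.cloud import storage

# Upload a locally downloaded MySQL Connector/J jar into a GCS bucket
# so Dataproc Serverless can pull it in via jarFileUris.
client = storage.Client(project="xxx")  # placeholder project
bucket = client.bucket("shareable_files")  # placeholder bucket
blob = bucket.blob("drivers/mysql/mysql-connector-java-5.1.30-bin.jar")
blob.upload_from_filename("mysql-connector-java-5.1.30-bin.jar")  # local path
print(f"Uploaded to gs://{bucket.name}/{blob.name}")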
Hope this helps.
Thanks, @anjelisa
Sorry for the long delay. I had to put this project down for a while.
Unfortunately, I'm still having no luck. I have hosted the .jar in a GCS bucket in the same region the job runs in, and used the quickstart guide to submit a job over the API, and I still get the error:
ClassNotFoundException: com.mysql.cj.jdbc.Driver
The job is created and submitted to Dataproc Batches and attempts to run, but the class error fails the job each time.
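For reference, this is roughly how I'm confirming the failure, by fetching the batch resource over the REST API. It's just a sketch: it reuses the headers from my snippet above, and the batch ID at the end of the URL is obfuscated like everything else:

import requests as rq

# `headers` is the same Authorization header used for the create call.
batch_url = ("https://dataproc.googleapis.com/v1/projects/xxx/"
             "locations/europe-west2/batches/xxx")  # xxx = batch ID
batch = rq.get(batch_url, headers=headers).json()
print(batch.get("state"), batch.get("stateMessage"))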