I'm trying to build a Google Data Fusion pipeline to load data from MS SQL Server into BigQuery. The source (MS SQL Server 2016 Standard) is running on a GCP VM. I can connect to the SQL instance over its public IP without any issue. I'm using the BigQuery connector as the sink.
As the JDBC driver I'm using the Microsoft SQL Server JDBC Driver v6.0.jre7, which is available on the Data Fusion Hub.
Here are the Data Fusion instance details:
I gave the "Cloud Data Fusion API Service Agent" role to the Data Fusion Service account.
When I run the pipeline, it fails with the following error messages. Can anyone please help me resolve this? What am I missing in the configuration?
04/07/2023 0:42:40
INFO
Launch main class org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main([--driver-url, spark://CoarseGrainedScheduler@cdap-mypipe-98803685-d4b2-11ed-a81e-26b0641a83bd-w-1.us-central1-f.c.marine-fusion-270120.internal:35779, --executor-id, 1, --hostname, cdap-mypipe-98803685-d4b2-11ed-a81e-bd-w-0.us-central1-f.c.marine-fusion-2.internal, --cores, 1, --app-id, application_1680809963358_0002, --resourceProfileId, 0, --user-class-path, file:/hadoop/yarn/nm-local-dir/usercache/yarn/appcache/application_1680809963358_0002/container_1680809963358_0002_01_000002/__app__.jar])
04/07/2023 0:42:46
WARN
Cannot load filesystem: java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.web.HftpFileSystem not found
04/07/2023 0:42:46
WARN
Cannot load filesystem: java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.web.HsftpFileSystem not found
04/07/2023 0:42:48
ERROR
Aborting task
04/07/2023 0:42:48
ERROR
Task attempt_202304061942308162483986510619117_0003_r_000000_0 aborted.
04/07/2023 0:42:48
ERROR
Exception in task 0.0 in stage 0.0 (TID 0)
04/07/2023 0:42:48
WARN
Lost task 0.0 in stage 0.0 (TID 0) (cdap-mypipe-98803685-d4b2-11ed-a81e-26b0641a83bd-w-1.us-central1-f.c.marine-fusion-270120.internal executor 2): org.apache.spark.SparkException: Task failed while writing rows
  at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:162)
  at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:88)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:131)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:505)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:508)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.avro.file.DataFileWriter$AppendWriteException: org.apache.avro.UnresolvedUnionException: Not in union ["long","null"]: 1
  at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:308)
  at io.cdap.plugin.gcp.bigquery.sink.AvroRecordWriter.write(AvroRecordWriter.java:90)
  at io.cdap.plugin.gcp.bigquery.sink.AvroRecordWriter.write(AvroRecordWriter.java:37)
  at io.cdap.plugin.gcp.bigquery.sink.BigQueryRecordWriter.write(BigQueryRecordWriter.java:58)
  at io.cdap.plugin.gcp.bigquery.sink.BigQueryRecordWriter.write(BigQueryRecordWriter.java:32)
  at io.cdap.cdap.etl.spark.io.TrackingRecordWriter.write(TrackingRecordWriter.java:41)
  at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.write(SparkHadoopWriter.scala:367)
  at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$executeTask$1(SparkHadoopWriter.scala:137)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
  at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:134)
  ... 9 more
Caused by: org.apache.avro.UnresolvedUnionException: Not in union ["long","null"]: 1
  at org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
  at org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
  at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
  at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
  at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:159)
  at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
  at org.apache.avro.specific.SpecificDatumWriter.writeField(SpecificDatumWriter.java:90)
  at org.apache.avro.reflect.ReflectDatumWriter.writeField(ReflectDatumWriter.java:191)
  at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
  at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
  at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
  at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:159)
  at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
  at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:302)
  ... 18 more
04/07/2023 0:42:49
ERROR
Aborting task
04/07/2023 0:42:49
ERROR
Task attempt_202304061942308162483986510619117_0003_r_000000_1 aborted.
04/07/2023 0:42:49
ERROR
Exception in task 0.1 in stage 0.0 (TID 1)
04/07/2023 0:42:51
WARN
Cannot load filesystem: java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.web.HftpFileSystem not found
04/07/2023 0:42:51
WARN
Cannot load filesystem: java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.web.HsftpFileSystem not found
04/07/2023 0:42:53
ERROR
Aborting task
Hi @binrip
Welcome back to Google Cloud Community.
There are a few possible reasons why these error messages occur.
The version of the JDBC driver you are using may not be compatible with your Data Fusion version, or there is a data type mismatch between the source and the target.
You may try the following checks to narrow down the issue:
1. Check that the data types of the source columns are compatible with BigQuery.
2. Check that the JDBC driver version is compatible with your Data Fusion version. Try updating the JDBC driver to the latest version and see if the error still occurs.
3. Check that the Avro schema used by the pipeline is correct and matches the data being written to BigQuery, and update it to match if necessary (see the sketch after this list).
4. Check that the pipeline is properly configured.
5. Try running a smaller dataset through the pipeline to see if the error still occurs.
6. Check the logs for more information about the error and the configuration.
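Regarding check 3: the key line in your log is "org.apache.avro.UnresolvedUnionException: Not in union ["long","null"]: 1". Avro raises this when the Java type of a value does not match any branch of the union declared in the schema, for example when an Integer coming from a SQL Server INT column is written into a field declared as a nullable long. Here is a minimal, self-contained Java sketch that reproduces the same exception with plain Avro (the record name "Row" and field name "id" are made up for illustration, not taken from your pipeline):

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class UnionMismatchDemo {
    public static void main(String[] args) throws Exception {
        // Field "id" is declared as a nullable long, like the field in the failing pipeline's schema.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Row\",\"fields\":"
          + "[{\"name\":\"id\",\"type\":[\"long\",\"null\"]}]}");

        GenericRecord row = new GenericData.Record(schema);
        // An Integer (as a SQL Server INT arrives through JDBC) matches neither union branch,
        // so Avro throws: UnresolvedUnionException: Not in union ["long","null"]: 1
        row.put("id", 1);
        // row.put("id", 1L);  // a real long would be accepted

        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get()
            .binaryEncoder(new ByteArrayOutputStream(), null);
        writer.write(row, encoder);  // fails here with UnresolvedUnionException
    }
}

So the fix is usually either to change the declared output schema so the field's type matches what the database actually returns, or to cast the value in a transform step before the BigQuery sink.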
Here are some references that might help you:
https://cloud.google.com/bigquery/docs/quickstarts
https://cloud.google.com/bigquery/docs/reference/odbc-jdbc-drivers
https://cloud.google.com/data-fusion/docs/release-notes
https://cloud.google.com/pubsub/docs/samples/pubsub-create-avro-schema
https://cloud.google.com/bigquery/docs/datasets-intro?_ga=2.179257060.-1392753435.1676655686
Thank you @Aris_O for pointing me in the right direction.
As you mentioned, the issue was related to a data type mismatch between the source and the destination (BigQuery).
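For reference, a minimal sketch of one common way to handle this kind of mismatch (not necessarily the exact fix used here) is to coerce the value to the Java type the schema declares before it reaches the sink, so an Integer from JDBC becomes a Long for a ["long","null"] field. The helper below is purely illustrative:

// Hypothetical helper: normalize numeric values to Long so they match a ["long","null"] union.
public final class ValueCoercion {
    private ValueCoercion() {}

    public static Object coerceToLongIfNumeric(Object value) {
        if (value == null || value instanceof Long) {
            return value;                         // already matches the declared union
        }
        if (value instanceof Number) {
            return ((Number) value).longValue();  // e.g. Integer 1 -> Long 1L
        }
        return value;                             // leave non-numeric values untouched
    }
}

In Data Fusion itself this is usually done declaratively rather than with custom Java: either edit the output schema of the source so the field type matches what the database returns, or add a transform (for example Wrangler) that casts the column before the BigQuery sink.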
@binrip could you maybe elaborate on your solution? I am trying to get a Redshift -> BigQuery pipeline running, but I also run into these errors. Did you just use Wrangler to match the data types?