I got this error when running a job in Dataflow using the MongoDB to BigQuery batch template:
"Error message from worker: java.lang.IllegalArgumentException: Unable to encode element 'Document{{_id=1234567890, created_by=org.bson.BsonUndefined@0, created_date=Tue Dec 10 10:20:30 UTC 2010, last_modified_date=Tue Dec 10 12:50:30 UTC 2010, is_deleted=false}}' with coder 'SerializableCoder(org.bson.Document)'.
org.apache.beam.sdk.coders.Coder.getEncodedElementByteSize(Coder.java:300) org.apache.beam.sdk.coders.Coder.registerByteSizeObserver(Coder.java:291) org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.registerByteSizeObserver(WindowedValue.java:642) org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.registerByteSizeObserver(WindowedValue.java:558) org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory$ElementByteSizeObservableCoder.registerByteSizeObserver(IntrinsicMapTaskExecutorFactory.java:384) org.apache.beam.runners.dataflow.worker.util.common.worker.OutputObjectAndByteCounter.update(OutputObjectAndByteCounter.java:128) org.apache.beam.runners.dataflow.worker.DataflowOutputCounter.update(DataflowOutputCounter.java:67) org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:43) org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:218) org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:169) org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:83) org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:319) org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:291) org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:221) org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:147) org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:127) org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:114) java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) org.apache.beam.sdk.util.UnboundedScheduledExecutorService$ScheduledFutureTask.run(UnboundedScheduledExecutorService.java:163) java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.io.NotSerializableException: org.bson.BsonUndefined
java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1185) java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) java.base/java.util.LinkedHashMap.internalWriteEntries(LinkedHashMap.java:333) java.base/java.util.HashMap.writeObject(HashMap.java:1411) java.base/jdk.internal.reflect.GeneratedMethodAccessor68.invoke(Unknown Source) java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) java.base/java.lang.reflect.Method.invoke(Method.java:566) java.base/java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1145) java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1497) java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433) java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1553) java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1510) java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433) java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179) java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349) org.apache.beam.sdk.coders.SerializableCoder.encode(SerializableCoder.java:192) org.apache.beam.sdk.coders.SerializableCoder.encode(SerializableCoder.java:57) org.apache.beam.sdk.coders.Coder.getEncodedElementByteSize(Coder.java:297) ... 21 more
What could be the cause?
The error is caused by the fact that `org.bson.BsonUndefined` does not implement `java.io.Serializable`. The template encodes each `org.bson.Document` with `SerializableCoder`, which relies on Java serialization, so any document containing an undefined value (here, the `created_by` field) fails to encode when Dataflow passes it between workers.
To fix this error, you can either:

* replace the `org.bson.BsonUndefined` values with something serializable (for example, `null`), or
* remove the `org.bson.BsonUndefined` values from the documents that you are trying to pass.

Note that `org.bson.BsonUndefined` ships with the MongoDB Java driver, so making the class itself serializable is not practical; if you are unsure which BSON value types are affected, you can consult the MongoDB Java driver documentation for the `Bson` value classes.
If you cannot change the source data, you can instead strip the `org.bson.BsonUndefined` values from each document before it reaches the coder. You can do this with a filter. For example, a filter like the one sketched below removes all `org.bson.BsonUndefined` values from a document.
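A minimal sketch in Java, assuming you are assembling the pipeline yourself rather than launching the stock template (the template does not expose a hook for a custom transform); the `DocumentCleaner` class and `removeUndefined` method are illustrative names, not part of any library:

```java
import java.util.Map;
import org.bson.BsonUndefined;
import org.bson.Document;

public final class DocumentCleaner {

  private DocumentCleaner() {}

  // Return a copy of the document with every BsonUndefined value dropped,
  // recursing into embedded documents, so the result survives Java
  // serialization (and therefore SerializableCoder).
  public static Document removeUndefined(Document doc) {
    Document cleaned = new Document();
    for (Map.Entry<String, Object> entry : doc.entrySet()) {
      Object value = entry.getValue();
      if (value instanceof BsonUndefined) {
        continue; // drop undefined fields entirely
      }
      if (value instanceof Document) {
        value = removeUndefined((Document) value);
      }
      cleaned.append(entry.getKey(), value);
    }
    return cleaned;
  }
}
```

Applied right after the read from MongoDB, it would look something like this (assuming the usual Beam imports: `org.apache.beam.sdk.transforms.MapElements`, `org.apache.beam.sdk.values.PCollection`, and `org.apache.beam.sdk.values.TypeDescriptor`):

```java
PCollection<Document> cleaned =
    documents.apply(
        "RemoveUndefined",
        MapElements.into(TypeDescriptor.of(Document.class))
            .via(DocumentCleaner::removeUndefined));
```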
Alternatively, if you are launching the stock template and cannot insert a transform into the pipeline, you can remove the undefined values in MongoDB itself before starting the job, as sketched after this paragraph. Once the undefined values are gone, you should be able to run your Dataflow job without any problems.
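A sketch of that cleanup with the MongoDB Java driver, unsetting the `created_by` field (the one named in the error) wherever its BSON type is `undefined`; the connection string, database, and collection names are placeholders you would replace with your own:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.BsonType;
import org.bson.Document;

public final class UnsetUndefinedFields {

  public static void main(String[] args) {
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> collection =
          client.getDatabase("mydb").getCollection("mycollection");

      // $unset created_by wherever its BSON type is "undefined", so the
      // exported documents no longer contain BsonUndefined values.
      collection.updateMany(
          Filters.type("created_by", BsonType.UNDEFINED),
          Updates.unset("created_by"));
    }
  }
}
```

Run this once against the source collection (after taking a backup, since it modifies the data), then re-run the template.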