Hello!
I'm working with Google Dataflow in Apache Beam Python 3.8 SDK 2.37.0. The issue I'm facing is that one of my DoFns is partially executing. Or at least, it seems that way since, according to the lifecycle of DoFns, they go through the following states:
Out of these methods, only __init__, setup, and process are being executed. Any idea?
More context: I'm processing only 1 document from MongoDB and each of the previous steps executes successfully. I can even see my log.info for the process method in the logs. But not for the teardown where my insert to Elasticsearch takes place..
This is my DAG:
I've tested with the following machine types:
n1-standard-1: a machine too slow so I can see the results before the teardown method finishes. Or at least, that was what I thought.
n1-highcpu-96: was it a machine too fast for 1 document?
UPDATE: I tried the same on COLAB and it worked!
The teardown
method in a DoFn
is meant to be used for cleaning up any resources that were set up in the setup
method. It's not guaranteed to be called in all situations, especially in case of pipeline termination or in case the worker processing the DoFn
crashes.
Here are some things you could consider:
Environment: The Python SDK for Apache Beam has two types of environments: the classic Dataflow runner (DataflowRunner) and the newer Portable Runner (PortableRunner). If you are using DataflowRunner, you could try switching to PortableRunner which is based on the Beam model and should have better support for lifecycle methods.
Asynchronous operations: If the teardown
operation is asynchronous, it might not complete before the worker shuts down. This is especially true for high-performance machines. Try to ensure that any operations in the teardown
method are completed synchronously.
Pipeline termination: If the pipeline is terminated prematurely, the teardown
method may not be called. Ensure that the pipeline is allowed to complete normally.
Error handling: If there's an error in your teardown
method, it might fail silently. Ensure that you have proper error handling and logging in place to capture any issues.
Version: There might be a bug or issue in the Apache Beam SDK you're using. You could consider upgrading to a newer version if available, or check the issue tracker for the Apache Beam project to see if this is a known issue.
Alternative methods: If you're trying to perform an action at the end of processing (like inserting a document into Elasticsearch), you might want to consider alternative methods. You could use a separate DoFn
to perform this action, or use a stateful DoFn
with a timer that triggers after a certain amount of processing.
Debugging: To help debug the issue, you could try running the pipeline locally using the DirectRunner. This can help you to see if the teardown
method is being called as expected.