
Migrating on-premises Apache Airflow workflow scripts to Cloud Composer

I have an on-premises environment running Airflow v2.2.0 and wish to migrate all of its workflows to a Cloud Composer instance. During this migration, some of the operators used in the on-premises environment fail when the same workflow runs on Cloud Composer 1 (Airflow 2.1.4).

Below is a single task from such a workflow:

from airflow.providers.apache.hive.operators.hive import HiveOperator

hive_task = HiveOperator(
    hql="./scripts/hive-script.sql",
    task_id="survey_data_aggregator",
    hive_cli_conn_id="new_hive_conn",
    dag=dag_data_summarizer,
    # ts is an Airflow macro, so it is rendered inside the template string
    hiveconfs={
        "input_path": "{{ params['input_path'] }}",
        "output_path": "{{ params['output_path'] }}{{ ts }}",
    },
)

When the workflow is executed, it fails at this task; the logs show the following error:
[Errno 2] No such file or directory: 'beeline'

I am aware of the cause: the worker nodes do not have the beeline binary on their PATH, and I do not want to SSH into every worker instance to set the variable by hand.

Upon searching, I found that when I replace HiveOperator with DataProcHiveOperator and update the arguments accordingly, the workflow works as expected.
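
For context, here is a minimal sketch of what such a replacement can look like with the current google provider, where DataprocSubmitJobOperator supersedes the deprecated DataProcHiveOperator. The project, region, cluster, and bucket names are placeholders, and the HQL script has to be staged in GCS rather than read from the worker's local filesystem:

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

hive_task = DataprocSubmitJobOperator(
    task_id="survey_data_aggregator",
    project_id="my-gcp-project",  # placeholder
    region="us-central1",         # placeholder
    job={
        "reference": {"project_id": "my-gcp-project"},
        "placement": {"cluster_name": "my-dataproc-cluster"},  # placeholder
        "hive_job": {
            # The script is read from GCS, not from the worker's disk.
            "query_file_uri": "gs://my-bucket/scripts/hive-script.sql",
            # script_variables plays the role of hiveconfs: each entry is
            # issued as SET name=value; before the script runs.
            "script_variables": {
                "input_path": "{{ params['input_path'] }}",
                "output_path": "{{ params['output_path'] }}{{ ts }}",
            },
        },
    },
    dag=dag_data_summarizer,
)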

However, this approach does not scale: manually editing each workflow script is not practical for us, and we have no idea how many other operators in our scripts require this kind of manual intervention.

What is the optimal way of handling situations like these without amending workflow scripts manually? Furthermore, is there official documentation describing a Google-recommended way to migrate Apache Airflow workflows from on-premises instances to Cloud Composer? I have been unable to find one.

2 Replies

Hello,

I see that you have the same post on Stack Overflow [0]. As one of our GCP Support Engineers asked there: have you tried updating the PATH variable through the Composer environment's environment variables? [1]

[0]https://stackoverflow.com/questions/70868968/migrating-on-premise-apache-airflow-workflow-scripts-to...

[1]https://cloud.google.com/composer/docs/how-to/managing/environment-variables
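
For example, environment variables can be set on the whole environment with gcloud, without SSHing into any worker. The environment name, location, and binary path below are placeholders, and Composer rejects certain reserved variable names, so check the documentation linked above before overriding PATH:

gcloud composer environments update my-composer-env \
    --location us-central1 \
    --update-env-variables=PATH=/home/airflow/gcs/data/apache-hive/bin:/usr/local/bin:/usr/bin:/bin

Note that files uploaded to the environment bucket's data/ folder are mounted on the workers under /home/airflow/gcs/data/, which is one way to make a binary visible to every worker at a stable path.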

 

Hi,

I SSHed into the worker instance and downloaded the binary. Given that I know the location of the binary on the worker node, how do I update the PATH from the Google Cloud Console UI?

For example, if I download the binary to "/home/mygmailaccount/binaries/apache-hive" and set this path as $HIVE_HOME, can I update the PATH variable like this:

[screenshot: unnamed.png, showing the proposed PATH entry in the Composer environment variables UI]

Apologies for the delayed response.