
Escaping a comma in a Dataproc workflow template parameter

bda

I have a workflow template in Dataproc that must have a parameter NAME containing a comma, e.g. "john,doe". I instantiate the template with one of the following commands:

gcloud dataproc workflow-templates instantiate my_template --parameters=NAME=john,doe

or

gcloud dataproc workflow-templates instantiate my_template --parameters=NAME=john\,doe

or

gcloud dataproc workflow-templates instantiate my_template --parameters=NAME="john\,doe"

In all cases I get the following error:

ERROR: (gcloud.dataproc.workflow-templates.instantiate) argument --parameters: Bad syntax for dict arg: [doe]

How can I solve this issue?

 

Solved

1 ACCEPTED SOLUTION

@bda 
Can you try to create the workflow template using the REST API?

1) Go to this link: https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.locations.workflowTemplates/create

2) Add the parent as "projects/project-name/locations/asia-south1"

3) Add the body as:

 

 {
        "id": "data_load",
        "name": "",
        "labels": {},
        "placement": {
            "managedCluster": {
                "clusterName": "cluster-fbbf",
                "config": {
                    "configBucket": "bucket_name",
                    "gceClusterConfig": {
                        "serviceAccountScopes": [
                            "https://www.googleapis.com/auth/cloud-platform"
                        ],
                        "networkUri": "",
                        "subnetworkUri": "your subnet",
                        "internalIpOnly": false,
                        "zoneUri": "asia-south1-a",
                        "metadata": {
                            "PIP_PACKAGES": "pandas pytz"
                        },
                        "tags": [],
                        "shieldedInstanceConfig": {
                            "enableSecureBoot": false,
                            "enableVtpm": false,
                            "enableIntegrityMonitoring": false
                        }
                    },
                    "masterConfig": {
                        "numInstances": 1,
                        "machineTypeUri": "n1-standard-4",
                        "diskConfig": {
                            "bootDiskType": "pd-standard",
                            "bootDiskSizeGb": "150",
                            "numLocalSsds": 0,
                            "localSsdInterface": "SCSI"
                        },
                        "minCpuPlatform": "",
                        "imageUri": ""
                    },
                    "softwareConfig": {
                        "imageVersion": "2.0-ubuntu18",
                        "properties": {
                            "dataproc:dataproc.allow.zero.workers": "true"
                        },
                        "optionalComponents": []
                    },
                    "initializationActions": [
                        {
                            "executableFile": "gs://goog-dataproc-initialization-actions-asia-south1/python/pip-install.sh"
                        }
                    ]
                },
                "labels": {}
            }
        },
        "jobs": [
            {
                "pysparkJob": {
                    "mainPythonFileUri": "gs://bucket/demo/scripts/start_job.py",
                    "pythonFileUris": [],
                    "jarFileUris": [],
                    "fileUris": [],
                    "archiveUris": [],
                    "properties": {},
                    "args": [
                        "arg1",
                        "arg2",
                        "demo3"
                    ]
                },
                "stepId": "step_id_name",
                "labels": {},
                "prerequisiteStepIds": []
            }
        ],
        "parameters": [],
        "dagTimeout": "1800s"
    }
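
If you prefer calling the endpoint from the command line instead of the "Try this method" panel on that page, a rough sketch is below. The project, location, and template.json file name are placeholders; template.json is assumed to contain the body above.

# Sketch: create the workflow template via the REST API with curl.
# Assumes the JSON body above is saved locally as template.json and that
# gcloud is authenticated against the target project.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @template.json \
  "https://dataproc.googleapis.com/v1/projects/project-name/locations/asia-south1/workflowTemplates"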

 

Always try to store all config/parameter details in a JSON file in a bucket, pass the file path to the Spark job using args: [], and read that file in Spark to get the parameters.
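
A rough illustration of that advice is below; the params.json file, the gs:// path, and the CONFIG_PATH template parameter are all hypothetical, not something defined in the template above.

# Sketch: keep the comma-heavy JSON out of --parameters entirely.

# 1) Upload the config file to the bucket.
gsutil cp params.json gs://bucket_name/demo/config/params.json

# 2) Instantiate the template, passing only the GCS path. The path contains
#    no commas, so the dict parsing of --parameters is no longer a problem.
#    (Assumes the template defines a CONFIG_PATH parameter bound to a job arg.)
gcloud dataproc workflow-templates instantiate data_load \
    --region=asia-south1 \
    --parameters=CONFIG_PATH=gs://bucket_name/demo/config/params.json

# 3) start_job.py reads that gs:// path (the GCS connector is available on
#    Dataproc clusters) and JSON-parses it to get NAME and the other values.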


7 REPLIES

Based on the documentation on how to instantiate workflow templates (the link shared previously), you can see that the parameters are sent to Dataproc in the format shared below.

gcloud dataproc workflow-templates instantiate (TEMPLATE : --region=REGION) [--async] [--parameters=[PARAM=VALUE,…]] [GCLOUD_WIDE_FLAG …]

As you can see, when the --parameters flag is sent, it is passed as [--parameters=[PARAM=VALUE,…]].
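
One thing that may be worth trying against that syntax is gcloud's alternative-delimiter escaping, documented under "gcloud topic escaping"; the thread doesn't confirm whether --parameters honors it, so treat the following as an untested sketch.

# Untested sketch: start the flag value with ^;^ to switch the dict separator
# from "," to ";", so the comma inside the value stays part of NAME.
gcloud dataproc workflow-templates instantiate my_template --parameters=^;^NAME=john,doe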

bda

It doesn't solve my problem, because I have a VALUE that contains a comma. Parsing the parameters then fails, because the comma is treated as a separator before the next PARAM...

Is the value being returned as a string?

bda

The value of my parameter is a JSON structure with several values separated by commas. I can't change it, because it is the business format expected by the Spark job running inside the workflow. It works well when I submit the Spark job manually with the "gcloud dataproc jobs submit" command, but it no longer works when I use a workflow template, because "gcloud dataproc workflow-templates instantiate" misinterprets the --parameters option. Even when I escape the commas in my JSON, it treats them as separators between multiple PARAMs of the job, which is not the case...
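
For contrast, the direct submission that works looks roughly like the sketch below (the script path, cluster name, and JSON value are only placeholders): everything after the bare "--" is handed to the job verbatim, so the commas never pass through gcloud's dict parsing.

# Sketch: with jobs submit, arguments after "--" reach the PySpark job as-is,
# commas included.
gcloud dataproc jobs submit pyspark gs://bucket/demo/scripts/start_job.py \
    --cluster=cluster-fbbf \
    --region=asia-south1 \
    -- '{"first_name":"john","last_name":"doe"}'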

Did you try to use instantiate-from-file, instead of just instantiate?

bda

"instanciate  from file" doesn't support --parameter option. Parametrization is not possible with such command...
