
Escaping a comma in a Dataproc workflow template parameter

bda

I have a workflow template in Dataproc that must have a parameter NAME containing a comma, e.g. "john,doe". I instantiate the template with one of the following commands:

gcloud dataproc workflow-templates instantiate my_template --parameters=NAME=john,doe

or

gcloud dataproc workflow-templates instantiate my_template --parameters=NAME=john\,doe

or

gcloud dataproc workflow-templates instantiate my_template --parameters=NAME="john\,doe"

In all cases I get the following error:

ERROR: (gcloud.dataproc.workflow-templates.instantiate) argument --parameters: Bad syntax for dict arg: [doe]

How can I solve this issue?

 

Solved

1 ACCEPTED SOLUTION

@bda 
Can you try to create the workflow template using the REST API?

1) Go to this link: https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.locations.workflowTemplates/create

2) Add the parent as "projects/project-name/locations/asia-south1"

3) Add the body as:

 

 {
        "id": "data_load",
        "name": "",
        "labels": {},
        "placement": {
            "managedCluster": {
                "clusterName": "cluster-fbbf",
                "config": {
                    "configBucket": "bucket_name",
                    "gceClusterConfig": {
                        "serviceAccountScopes": [
                            "https://www.googleapis.com/auth/cloud-platform"
                        ],
                        "networkUri": "",
                        "subnetworkUri": "your subnet",
                        "internalIpOnly": false,
                        "zoneUri": "asia-south1-a",
                        "metadata": {
                            "PIP_PACKAGES": "pandas pytz"
                        },
                        "tags": [],
                        "shieldedInstanceConfig": {
                            "enableSecureBoot": false,
                            "enableVtpm": false,
                            "enableIntegrityMonitoring": false
                        }
                    },
                    "masterConfig": {
                        "numInstances": 1,
                        "machineTypeUri": "n1-standard-4",
                        "diskConfig": {
                            "bootDiskType": "pd-standard",
                            "bootDiskSizeGb": "150",
                            "numLocalSsds": 0,
                            "localSsdInterface": "SCSI"
                        },
                        "minCpuPlatform": "",
                        "imageUri": ""
                    },
                    "softwareConfig": {
                        "imageVersion": "2.0-ubuntu18",
                        "properties": {
                            "dataproc:dataproc.allow.zero.workers": "true"
                        },
                        "optionalComponents": []
                    },
                    "initializationActions": [
                        {
                            "executableFile": "gs://goog-dataproc-initialization-actions-asia-south1/python/pip-install.sh"
                        }
                    ]
                },
                "labels": {}
            }
        },
        "jobs": [
            {
                "pysparkJob": {
                    "mainPythonFileUri": "gs://bucket/demo/scripts/start_job.py",
                    "pythonFileUris": [],
                    "jarFileUris": [],
                    "fileUris": [],
                    "archiveUris": [],
                    "properties": {},
                    "args": [
                        "arg1",
                        "arg2",
                        "demo3"
                    ]
                },
                "stepId": "step_id_name",
                "labels": {},
                "prerequisiteStepIds": []
            }
        ],
        "parameters": [],
        "dagTimeout": "1800s"
    }
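
If you prefer calling the endpoint from the command line instead of the "Try this method" panel on that page, a rough sketch is below. The project, location, and template.json file name are placeholders; template.json is assumed to contain the body above.

# Sketch: create the workflow template via the REST API with curl.
# Assumes the JSON body above is saved locally as template.json and that
# gcloud is authenticated against the target project.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @template.json \
  "https://dataproc.googleapis.com/v1/projects/project-name/locations/asia-south1/workflowTemplates"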

 

Always try to store all config/parameter details in a JSON file in a bucket, pass the file path to the Spark job using args: [], and read that file in Spark to get the parameters.
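
A rough illustration of that advice is below; the params.json file, the gs:// path, and the CONFIG_PATH template parameter are all hypothetical, not something defined in the template above.

# Sketch: keep the comma-heavy JSON out of --parameters entirely.

# 1) Upload the config file to the bucket.
gsutil cp params.json gs://bucket_name/demo/config/params.json

# 2) Instantiate the template, passing only the GCS path. The path contains
#    no commas, so the dict parsing of --parameters is no longer a problem.
#    (Assumes the template defines a CONFIG_PATH parameter bound to a job arg.)
gcloud dataproc workflow-templates instantiate data_load \
    --region=asia-south1 \
    --parameters=CONFIG_PATH=gs://bucket_name/demo/config/params.json

# 3) start_job.py reads that gs:// path (the GCS connector is available on
#    Dataproc clusters) and JSON-parses it to get NAME and the other values.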


7 REPLIES

Based on the documentation on how to instantiate workflow templates (the link shared previously), you can see that the parameters are sent to Dataproc in the format shared below.

gcloud dataproc workflow-templates instantiate (TEMPLATE : --region=REGION) [--async] [--parameters=[PARAM=VALUE,…]] [GCLOUD_WIDE_FLAG …]

As you can see, when the --parameters flag is sent, it is passed as [--parameters=[PARAM=VALUE,…]].
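
One thing that may be worth trying against that syntax is gcloud's alternative-delimiter escaping, documented under "gcloud topic escaping"; the thread doesn't confirm whether --parameters honors it, so treat the following as an untested sketch.

# Untested sketch: start the flag value with ^;^ to switch the dict separator
# from "," to ";", so the comma inside the value stays part of NAME.
gcloud dataproc workflow-templates instantiate my_template --parameters=^;^NAME=john,doe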

bda

It doesn't solve my problem, because I have a VALUE that contains a comma. Parsing the parameters then fails, because the comma is treated as a separator before the next PARAM...

Is the value being returned as a string?

bda

The value of my parameter is a JSON structure with several values separated by commas. I can't change it, because it is the business format expected by the Spark job running inside the workflow. It works well when I submit the Spark job manually with the "gcloud dataproc jobs submit" command, but it no longer works when I use a workflow template, because "gcloud dataproc workflow-templates instantiate" misinterprets the --parameters option. Even when I escape the commas in my JSON, it treats them as separators between multiple PARAMs of the job, which is not the case...
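
For contrast, the direct submission that works looks roughly like the sketch below (the script path, cluster name, and JSON value are only placeholders): everything after the bare "--" is handed to the job verbatim, so the commas never pass through gcloud's dict parsing.

# Sketch: with jobs submit, arguments after "--" reach the PySpark job as-is,
# commas included.
gcloud dataproc jobs submit pyspark gs://bucket/demo/scripts/start_job.py \
    --cluster=cluster-fbbf \
    --region=asia-south1 \
    -- '{"first_name":"john","last_name":"doe"}'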

Did you try to use instantiate-from-file, instead of just instantiate?

bda

"instanciate  from file" doesn't support --parameter option. Parametrization is not possible with such command...
