Dataproc job states returned by the API do not match gcloud output or the API specification

@jangemmar

Per the docs (https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs#State), the job states are:

  • 0: STATE_UNSPECIFIED - The job state is unknown.
  • 1: PENDING - The job is pending; it has been submitted, but is not yet running.
  • 2: SETUP_DONE - Job has been received by the service and completed initial setup; it will soon be submitted to the cluster.
  • 3: RUNNING - The job is running on the cluster.
  • 4: CANCEL_PENDING - A jobs.cancel request has been received, but is pending.
  • 5: CANCEL_STARTED - Transient in-flight resources have been canceled, and the request to cancel the running job has been issued to the cluster.
  • 6: CANCELLED - The job cancellation was successful.
  • 7: DONE - The job has completed successfully.
  • 8: ERROR - The job has completed, but encountered an error.
  • 9: ATTEMPT_FAILURE - Job attempt has failed. The detail field contains failure details for this attempt.

However, the states shown by the gcloud command do not match the numeric state values returned by the API.

$ gcloud dataproc jobs describe job-d15fe16c --region=us-central1

...
jobUuid: 7b008734-46bf-416b-878b-2872672f18b7
reference:
  jobId: job-d15fe16c
  projectId: unravel-dataproc
status:
  state: DONE (expected 7, but got 5 instead in API)
  stateStartTime: '2024-07-12T12:40:02.037079Z'
statusHistory:
- state: PENDING (1,  matched)
  stateStartTime: '2024-07-12T12:36:19.533775Z'
- state: SETUP_DONE (expected 2, but got 8 instead) 
  stateStartTime: '2024-07-12T12:36:19.568446Z'
- details: Agent reported job success
  state: RUNNING (expected 3, but got 2 instead)
  stateStartTime: '2024-07-12T12:36:19.922179Z'
  
JSON via API:
    {
        "reference": {
            "projectId": "unravel-dataproc",
            "jobId": "job-d15fe16c"
        },
        "status": {
            "state": 5,
            "stateStartTime": "2024-07-12T12:40:02.037079Z",
            "details": "",
            "substate": 0
        },
        "statusHistory": [
            {
                "state": 1,
                "stateStartTime": "2024-07-12T12:36:19.533775Z",
                "details": "",
                "substate": 0
            },
            {
                "state": 8,
                "stateStartTime": "2024-07-12T12:36:19.568446Z",
                "details": "",
                "substate": 0
            },
            {
                "state": 2,
                "details": "Agent reported job success",
                "stateStartTime": "2024-07-12T12:36:19.922179Z",
                "substate": 0
            }
        ]
    }

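Here is my code for listing jobs:

import json

import proto
from google.cloud import dataproc_v1


def list_jobs(client, project_id, region, filter):
    """Return all jobs matching the filter as a list of dicts."""
    jobs_list = []
    request = dataproc_v1.ListJobsRequest(
        project_id=project_id,
        region=region,
        filter=filter,
    )

    # list_jobs returns a pager that fetches further pages on iteration.
    jobs = client.list_jobs(request=request)

    for job in jobs:
        # Enum fields such as status.state are serialized as integers here.
        job_dict = json.loads(proto.Message.to_json(job))
        jobs_list.append(job_dict)
    return jobs_list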
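
For reference, one way to compare the integers in the JSON with the names gcloud prints is to decode them with a local enum. The sketch below is only an illustration: `JobState`, `state_name`, and `decode_job_states` are hypothetical helpers, not part of the Dataproc client. The numeric values are an assumption based on the `JobStatus.State` enum in the public Dataproc proto definition, where the numbers do not follow the order in which the REST page lists the states. The four values seen in this post (1, 8, 2, 5) agree with that mapping, but the rest should be checked against the proto before relying on them.

```python
from enum import IntEnum


class JobState(IntEnum):
    """Hypothetical local mirror of Dataproc's JobStatus.State enum.

    Assumption: these are the enum's wire values, which differ from the
    order in which the states appear on the REST reference page.
    """
    STATE_UNSPECIFIED = 0
    PENDING = 1
    RUNNING = 2
    CANCEL_PENDING = 3
    CANCELLED = 4
    DONE = 5
    ERROR = 6
    CANCEL_STARTED = 7
    SETUP_DONE = 8
    ATTEMPT_FAILURE = 9


def state_name(value):
    """Return the state name for an integer, or the integer as text if unknown."""
    try:
        return JobState(value).name
    except ValueError:
        return str(value)


def decode_job_states(job_dict):
    """Return (current_state, [history_states]) as names for one job dict."""
    current = state_name(job_dict["status"]["state"])
    history = [state_name(h["state"]) for h in job_dict.get("statusHistory", [])]
    return current, history


# Trimmed copy of the JSON shown in this post.
job = {
    "status": {"state": 5},
    "statusHistory": [{"state": 1}, {"state": 8}, {"state": 2}],
}
print(decode_job_states(job))
```

Under that assumed mapping, the decoded names line up with the gcloud output above (PENDING, SETUP_DONE, RUNNING, then DONE). With the real client, it may also be possible to get names directly, for example by passing `use_integers_for_enums=False` to `proto.Message.to_json`; that is worth verifying against the installed proto-plus version.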




Hi @waynez,

Welcome to Google Cloud Community!

The gcloud command-line tool relies on local definitions of API resources and states. If your gcloud SDK is outdated, it might not reflect the most recent states added or changed in the Dataproc API. 

You can try updating all installed components to the latest version by running the following command in your Cloud Shell:

gcloud components update

I hope this helps.