
Batch: Defining ComputeResource

Hello,

We're trying to set up GCP Batch processing for jobs that require large amounts of RAM.

The problem is that when setting the ComputeResource, it's not entirely clear how the resources get distributed between the parallel tasks. Is there any additional information on how exactly this works?

For example, I would like the current jobs' tasks to each get about 60 GB of RAM, but when I change those values the job fails even earlier, without a clear error log. Any ideas why this might happen and how to tackle it?

Current setup:

taskGroups:
  - taskSpec:
      computeResource:
        cpuMilli: 2000
        memoryMib: 120000
        bootDiskMib: 500000
      runnables:
        - container:
            imageUri: ${imageUri}
            entrypoint: 'python3'
            commands: ['etl/main.py']
    # Run x tasks on y VMs
    taskCount: ${taskCount}
    parallelism: 4

 

ACCEPTED SOLUTION

@jcraps, my understanding is that the `computeResource` field specifies the resources to provide to each task within the job. This means that for a job with, say, a task count of `4` and parallelism of `4`, four concurrent tasks will be created, each with `memoryMib` of RAM. Batch will attempt to create VMs and place the tasks on them so that each task gets the requested resources.
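
For example, to get to the ~60 GB per task mentioned in the question, a minimal sketch (the `61440` MiB figure and the machine type are my assumptions; `allocationPolicy` is optional, but pinning a machine type makes placement predictable):

taskGroups:
  - taskSpec:
      computeResource:
        cpuMilli: 2000
        memoryMib: 61440  # ~60 GiB per task
      runnables:
        - container:
            imageUri: ${imageUri}
            entrypoint: 'python3'
            commands: ['etl/main.py']
    taskCount: ${taskCount}
    parallelism: 4
allocationPolicy:
  instances:
    - policy:
        machineType: e2-highmem-8  # 64 GiB VM, enough for one 60 GiB task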

This may mean that each task gets its own VM, or that a VM with enough resources for multiple tasks is created and several tasks run on it, each getting its specified share of the total.
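
To make the packing concrete with assumed machine types and a 60 GiB-per-task request: an e2-highmem-16 (128 GiB) could hold two such tasks, so 4 tasks at parallelism 4 would need 2 VMs; an e2-highmem-8 (64 GiB) holds only one, so the same job would need 4 VMs. Pinning a machine type in the allocation policy removes the guesswork:

allocationPolicy:
  instances:
    - policy:
        machineType: e2-highmem-16  # assumption: 128 GiB, fits two 60 GiB tasks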

There could also be the situation where a task requests more RAM than any available VM can provide, which I would assume results in an unschedulable task.
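
That may be what's happening here: `memoryMib: 120000` is roughly 117 GiB, which is more than an e2-highmem-8 (64 GiB) offers and only just fits on an e2-highmem-16 (128 GiB). One way to check what a zone can actually offer (the zone is a placeholder; this relies on the `memoryMb` field that `gcloud compute machine-types list` exposes):

gcloud compute machine-types list \
    --filter="zone:us-central1-a AND memoryMb>=120000" \
    --sort-by=memoryMb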

It can be rather difficult to get debugging information in these cases. You can look at the `batch_agent` logs, and sometimes log information is captured in the job/task state-transition fields shown by `gcloud batch jobs describe ...`.
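
Concretely, the commands I'd start with (the job name and location are placeholders, and `batch_agent_logs` / `batch_task_logs` are the Cloud Logging log IDs that Batch writes to, as far as I can tell):

# Job and task state, including statusEvents with failure descriptions
gcloud batch jobs describe my-etl-job --location=us-central1

# Pull the agent and task logs from Cloud Logging
gcloud logging read 'log_id("batch_agent_logs") OR log_id("batch_task_logs")' --limit=50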

This is only what I've gleaned from using the service, so perhaps staff will chime in with more sound information.
