What happens when a batch task is preempted on a spot machine?

If the task is still running when the machine is preempted, will this task be scheduled to run again later? 


Tasks are not rescheduled by default, but you can control that behavior with TaskSpec.maxRetryCount.

If maxRetryCount is greater than zero then a Task that fails due to machine preemption (or for any other reason) will be rescheduled at most maxRetryCount times. Retries are not tied to a specific machine, so if the first machine is stopped it just means that the retries will happen on a different machine in the TaskGroup's MIG.
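As a rough illustration of where this setting lives, the sketch below builds a job body in Python and prints it as JSON. The field layout (taskGroups, taskSpec, runnables, and an allocationPolicy with provisioningModel set to SPOT) reflects my understanding of the Batch v1 Job resource and should be checked against the API reference. The script text, the machine type, and the helper name build_job are placeholders invented for the example.

```python
import json


def build_job(max_retry_count: int) -> dict:
    """Return a minimal Batch job body (assumed layout, see the note above)."""
    return {
        "taskGroups": [
            {
                "taskCount": 1,
                "taskSpec": {
                    # Placeholder workload; replace with the real runnable.
                    "runnables": [{"script": {"text": "echo hello"}}],
                    # Retries apply to any failure, including preemption.
                    "maxRetryCount": max_retry_count,
                },
            }
        ],
        "allocationPolicy": {
            "instances": [
                {
                    "policy": {
                        "machineType": "e2-standard-4",  # placeholder
                        "provisioningModel": "SPOT",
                    }
                }
            ]
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_job(3), indent=2))
```

With maxRetryCount left at 0, a preempted task in this sketch would simply be marked as failed.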

This sounds like undesirable behavior to me. I think a great feature would be maxPreemptionCount as well. It would be like retries, but only on preemption. It’s often the case that you don’t want to retry jobs when they fail (for example on user error), but you very much want to retry jobs that are preempted.

I agree with Jimmy, there is a better way of dealing with preemption. In fact, rather than a "maxPreemptionCount", a boolean value of "retryOnPreempt" would be better for my use case. In my batch tasks, I save checkpoints and job logs when a machine is preempted so the task can be resumed easily (even from a different instance). The behavior I want is that no matter how many times the task is preempted, it tries to finish, resuming from where it left off. Of course, this functionality can be assisted by the timeout configuration for a task to ensure it ends eventually (one thing to consider is whether we would want to cumulatively keep track of the time it takes for a task to complete across multiple preempted tries, but this isn't of particular importance to me).

Batch is considering a conditional retry capability, in which you would be able to specify which failure scenarios to retry on. You would then be able to specify to retry on preemption specifically. This sounds like it would address both of your use cases, but let us know your thoughts.

Yes I believe that would work for my use case. I can condition on everything except user error.

 

James

Yes, that would be great. 

+1 this would be really helpful

Hi @ryanmarten, @jacksonwb, @JimmyPinks - I'm interested in getting your perspective on the following hypothetical scenario. If there is a policy that specifies that a Batch job should be retried for Exit Code 1, but fail for Exit Code 2, what would be the preferred default behavior for all other Exit Codes (e.g. 3, 4, 5, etc.) that are not explicitly specified? In this scenario, is it acceptable to fail all other Exit Codes?

Is this assuming that `maxRetryCount` is non-zero? 
If so I would probably expect it to retry. 

I.e., if I execute a program and tell it to retry, I can be pretty sure it will retry without having to track down the specific exit code emitted by said program when it failed and ensure it is 1, UNLESS I specifically issue exit code 2, which I'm interpreting to mean quit and explicitly don't retry.

Am I interpreting this situation correctly?

In this case, exit codes would be specific failure reasons such as preemption, out of memory, or another infrastructure error. Suppose maxRetryCount is non-zero and the policy specifies only 2 behaviors: 1) retry on preemption, and 2) fail on out of memory. In this scenario, we are curious whether there is a preference for the default action (fail or retry) for all other failures that may occur while the job runs.

I would imagine other infrastructure errors mirror the OOM behaviour and all other user entrypoint output codes would retry on maxRetryCount!=0 and fail otherwise.

But perhaps I'm not fully understanding how infrastructure error codes are implemented and distinguished from user entrypoint error codes.

Dear @Shamel, is there any movement on this capability (a separate counter for preemption retries)? This is highly desirable in our use case, where we run hundreds of jobs in parallel with Batch, which we expect to pass. The "real" retry count is no retry, as any failure is critical. However, we are forced to allow 2-3 retries to adjust for preemption. This means that in case of a real issue, we are running the whole set of jobs 2-3 times, with all the cost that involves. What's worse, 2-3 retries might not be enough to compensate completely for preemption, which can easily happen 3-4 times in a row on one job out of hundreds.

This capability is now available, and official documentation will be published soon. To use it, add the snippet below to your job's JSON; the task will then be retried only for the specified exitCode, 50001, which indicates preemption. More details are in this link on LifecyclePolicy. Let us know if this helps with this use case.

    "maxRetryCount": "5",
    "lifecyclePolicies": [
      {
        "action": "RETRY_TASK",
        "actionCondition": {
          "exitCodes": [50001]
        }
      }
    ]
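For anyone generating the job body programmatically, here is a minimal sketch of attaching that policy to a task spec. It assumes, based on my reading of the LifecyclePolicy reference, that lifecyclePolicies sits next to maxRetryCount on the taskSpec; the helper name with_preemption_retry and the placeholder runnable are made up for the example.

```python
import copy
import json

PREEMPTION_EXIT_CODE = 50001  # exit code the reply above gives for preemption


def with_preemption_retry(task_spec: dict, max_retry_count: int) -> dict:
    """Return a copy of task_spec that retries only on preemption."""
    spec = copy.deepcopy(task_spec)  # leave the caller's dict untouched
    spec["maxRetryCount"] = max_retry_count
    spec["lifecyclePolicies"] = [
        {
            "action": "RETRY_TASK",
            "actionCondition": {"exitCodes": [PREEMPTION_EXIT_CODE]},
        }
    ]
    return spec


if __name__ == "__main__":
    # Placeholder task spec; only the retry-related fields matter here.
    base = {"runnables": [{"script": {"text": "echo hello"}}]}
    print(json.dumps(with_preemption_retry(base, 5), indent=2))
```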

Thank you, I can confirm this works for us.

@Shamel Thanks for this! This is a critical improvement.

So we can now have different actions for different `exitCodes`.
Am I correct in understanding that there is still a single `maxRetryCount`, so there is not a way to allow X number of retries for a normal failure and Y number for preemption?

Hi @jacksonwb - Currently, a single job can have only one type of action: it specifies either a RETRY_TASK action or a FAIL_TASK action. There is no retry count at the exitCode level, so if maxRetryCount = 5 and there is a RETRY_TASK policy for 4 different exitCodes, failures from all of them count toward the same limit of 5 retries.

Two other behaviors to be aware of:

  1. RETRY_TASK is specified with 3 different exitCodes and the task then fails with another exitCode that you did not specify. The task is marked as failed and is not retried.
  2. FAIL_TASK is specified with 3 different exitCodes and the task then fails with another exitCode. Any exitCode that is not listed results in the task being retried.
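
To make these rules concrete, here is a small Python model of the decision as described in this reply. It is only a sketch of the stated behavior, not Batch's actual implementation; the function name should_retry and its parameters are invented for the example, and exit code 1 stands in for any unlisted failure.

```python
from typing import Iterable

RETRY_TASK = "RETRY_TASK"
FAIL_TASK = "FAIL_TASK"


def should_retry(exit_code: int, action: str, listed_codes: Iterable[int],
                 retries_used: int, max_retry_count: int) -> bool:
    """Model of the retry decision described above (not Batch's real code)."""
    # One shared budget: every retry, whatever its exit code, counts here.
    if retries_used >= max_retry_count:
        return False
    listed = exit_code in set(listed_codes)
    if action == RETRY_TASK:
        # Only listed exit codes are retried; anything else fails the task.
        return listed
    if action == FAIL_TASK:
        # Listed exit codes fail the task; anything else is retried.
        return not listed
    raise ValueError(f"unknown action: {action}")


if __name__ == "__main__":
    # Retry only on preemption (50001), as in the snippet earlier in the thread.
    print(should_retry(50001, RETRY_TASK, [50001], 0, 5))  # True
    print(should_retry(1, RETRY_TASK, [50001], 0, 5))      # False
    print(should_retry(50001, RETRY_TASK, [50001], 5, 5))  # False
```

The second call is the case to watch for: with a RETRY_TASK policy, any failure whose exit code is not in the list ends the task immediately, even when retries remain.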

Let us know if you have other questions.

Hi Shamel,

Feel free to redirect me to the support teams with this, but I have a feeling this is related to the functionality we are discussing.

We have implemented the experimental feature you have suggested (we translate YAML config to JSON internally):

 

    maxRetryCount: 10
    lifecyclePolicies:
    - action: RETRY_TASK
      actionCondition:
        exitCodes: [ 50001 ]

 

but now we sometimes see the batch job fail just because one of the jobs failed on the first try, without that job being retried first. Looking at the logs, the reason for failure is given by msg="error waiting for container: unexpected EOF". The job does not download any containers, apart from the one it is running on, so I am a bit puzzled by the error message; my working assumption is that there is some other way in which a job can be terminated, with an exit code other than 50001.

Would you have any suggestions on this?