Hi,
I've noticed that since October 18, the tasks I run on Batch regularly fail with network errors, while the same tasks run on GKE without any problems. The failures are errors connecting to MongoDB. I don't know what is causing this. Was there an upgrade to the Batch service?
MongoDB is deployed on GKE, Batch reaches MongoDB over the VPC, and firewall rules have been configured.
Job UIDs: ifr-etl-2572084247-f7e92d19-a6c1-4a1d0, ifr-etl-2572734806-3e59fe2b-fbb7-4b0a0
Hi @JonYu,
I see your job is using both instance template and network fields.
For Batch, if you provide an instance template, Batch uses the network settings in the instance template instead of the settings in the network fields. Could you share more detail about your instance template's network settings?
Also, you mentioned that the job started to fail on 10/18/2023. Did a job with exactly the same job request succeed before that date?
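As a rough illustration of the precedence rule described above, the following Python sketch inspects a job request and reports when both an instance template and network fields are present. The field paths (`allocationPolicy.instances[].instanceTemplate` and `allocationPolicy.network.networkInterfaces`) follow the Batch job JSON layout as I understand it; the helper name `find_network_conflict` and all sample values are hypothetical.

```python
def find_network_conflict(job_request: dict) -> bool:
    """Return True if the request sets both an instance template and network fields.

    Per the reply above, Batch then uses the network settings from the
    instance template, and the network fields do not take effect.
    """
    policy = job_request.get("allocationPolicy", {})
    has_template = any(
        inst.get("instanceTemplate") for inst in policy.get("instances", [])
    )
    has_network_fields = bool(
        policy.get("network", {}).get("networkInterfaces")
    )
    return has_template and has_network_fields


# Hypothetical request; the template, network, and subnetwork names are placeholders.
sample_request = {
    "allocationPolicy": {
        "instances": [{"instanceTemplate": "my-instance-template"}],
        "network": {
            "networkInterfaces": [
                {
                    "network": "projects/my-project/global/networks/my-vpc",
                    "subnetwork": "projects/my-project/regions/us-central1/subnetworks/my-subnet",
                }
            ]
        },
    }
}

if find_network_conflict(sample_request):
    print("Both set: the instance template's network settings take precedence.")
```

If this reports a conflict, the instance template itself is the place to check the network, subnetwork, and external IP settings.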
Thanks,
Wenyan
Hi, @wenyhu
The network configuration in the instance template is the same as in the network fields, so you can refer to the network fields for the instance template's network settings.
It always succeeded before that date, and occasionally after it.
Thanks
Hi @JonYu,
There are no known Batch issues or upgrades related to networking since 10/18/2023.
From your latest reply, when you say "occasionally", do you mean that since 10/18/2023 the Batch tasks always fail with a network issue, or that they only fail randomly?
If they fail randomly, can you try adding the maxRetryCount field to your job request to temporarily mitigate the task failures?
If they always fail, could you double-check your instance template's network settings? From your network field settings, I can see you have not disabled external network access with the noExternalIpAddress field. But if your instance template is internal-network only, the behavior would be different.
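A minimal sketch of the retry suggestion, assuming the job is submitted as JSON. `maxRetryCount` sits under `taskGroups[].taskSpec`, and I believe the accepted range is 0 to 10. The container image, command, and retry count below are placeholders, not values from your job.

```python
import json

# Placeholder values: the image URI, command, and retry count are illustrative only.
job_request = {
    "taskGroups": [
        {
            "taskCount": 1,
            "taskSpec": {
                "runnables": [
                    {
                        "container": {
                            "imageUri": "gcr.io/my-project/my-etl-image",
                            "commands": ["python", "etl.py"],
                        }
                    }
                ],
                # Retry a failed task up to 3 more times before marking it failed.
                "maxRetryCount": 3,
            },
        }
    ],
    "logsPolicy": {"destination": "CLOUD_LOGGING"},
}

print(json.dumps(job_request, indent=2))
```

Retries only mask intermittent failures; they do not explain them, so it is still worth collecting logs from the failed attempts.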
Thanks,
Wenyan
Hi @wenyhu
These tasks fail randomly, and every day a different task fails for this reason. Setting retries increases the success rate, but there is no way to guarantee success without a large number of retries.
If these tasks are not run on Batch, the problem does not occur. So we suspect that either a network issue or a Batch upgrade or similar change is causing it.
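To put rough numbers on the retry point above: if each attempt failed independently with the same probability p, a task allowed n retries would fail only when all n + 1 attempts fail, which happens with probability p^(n+1). Both the independence assumption and the 20% failure rate below are illustrative, not measured from these jobs. Correlated failures (for example, a bad node) would make retries less effective than this suggests.

```python
def task_failure_probability(p_attempt_fail: float, max_retries: int) -> float:
    """Probability that a task fails every attempt, assuming independent attempts."""
    if not 0.0 <= p_attempt_fail <= 1.0:
        raise ValueError("p_attempt_fail must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    # One initial attempt plus max_retries retries.
    return p_attempt_fail ** (max_retries + 1)


# Illustrative per-attempt failure rate; not taken from the jobs in this thread.
for retries in (0, 1, 3, 5):
    print(retries, task_failure_probability(0.2, retries))
```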
Thanks
Hi @JonYu,
There have been no network changes on Batch for months. Unfortunately, that makes it hard for Batch to triage the issue without additional information.
Could you collect more network-related logs on your side when running Batch jobs?
Or, as you said, if GKE always works, could you compare resources and settings between GKE and Batch for your case? E.g., does Batch use different VM machine types or sizes from GKE?
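One way to collect the network logs requested above is to run a small TCP probe against MongoDB from inside the Batch task, run the same probe from a GKE pod, and compare the results. This is only a sketch: the function names, attempt count, and timeout are assumptions to adjust, and it tests TCP reachability only, not MongoDB authentication or queries.

```python
import socket
import time


def probe_tcp(host: str, port: int, timeout: float = 5.0):
    """Try one TCP connection; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False, time.monotonic() - start, f"{type(exc).__name__}: {exc}"


def run_probes(host: str, port: int, attempts: int = 5, delay: float = 1.0):
    """Probe repeatedly, print one log line per attempt, and return the results."""
    results = []
    for i in range(attempts):
        ok, elapsed, err = probe_tcp(host, port)
        print(f"attempt={i + 1} ok={ok} elapsed={elapsed:.3f}s error={err}")
        results.append(ok)
        if i < attempts - 1:
            time.sleep(delay)
    return results
```

For example, `run_probes("10.0.0.10", 27017)` would probe a hypothetical internal MongoDB address on MongoDB's default port; replace the address with your own service endpoint. Timeouts versus refused connections in the error text can help distinguish routing or firewall problems from the service being unavailable.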
Thanks!
Wenyan