
Always getting error 50002 on Google Batch with specific Nextflow process


process betaMerge {
    label 'python'
    label 'biggermem'
    maxRetries 20
    // Stop retrying (and ignore the failure) on the last attempt or on exit codes
    // 1 and 137 (usually an out-of-memory kill); otherwise retry with more disk.
    errorStrategy {
        if (task.attempt == 20 || task.exitStatus == 1 || task.exitStatus == 137) {
            'ignore'
        } else {
            println "Retrying ${task.name} with more disk size"
            'retry'
        }
    }
    // Disk scales with the total input size; the multiplier is capped at 3,
    // so attempts beyond the third request the same amount of disk again.
    disk {
        def base = 8.B * csvs*.size().sum()
        def mult = task.attempt > 3 ? 3 : task.attempt
        def requested = mult * base
        println "Disk used by ${task.name}: ${requested}"
        requested
    }
    publishDir "$OUTPUT_ROOT/$sample_group/merged"

input:
    path csvs, arity:'1..*', stageAs: "*/*" //comma separated
    val csv_names
    path samplesheet //tab separated
    val sample_group
    val output_fname // extension should be included
    val large_mem // 1

output:
    path "*.*", includeInputs:false


shell:
'''
output_fname=!{output_fname}
sample_group=!{sample_group}
large_mem=!{large_mem}
csvs="!{csvs}"
csv_names="!{csv_names}"
csvs_array=( $csvs )
csv_names_array=( $csv_names )
mkdir -p input_csvs
# we need this as the connection with the machine is not to be trusted
cd input_csvs &&
for i in "${!csvs_array[@]}"; do
    csv="${csvs_array[i]}" &&
    fname=$(basename ${csv}) &&
    csv_name="${csv_names_array[i]}" &&
    echo "Copying $csv to input_csvs/${csv_name}.${fname} .." && {
        # prefer a symlink; fall back to a real copy if linking fails
        ln -sf ../$csv ${csv_name}.${fname} || cp ../$csv ${csv_name}.${fname}
    }
done
cd .. &&
args=
if [[ $large_mem -eq 1 ]]
then
    args="$args -l -u 50000"
fi

samplesheet="!{samplesheet}"
if [[ $samplesheet == "undefined" ]]
then
    beta_merge.py -i input_csvs -o $output_fname $args
else
    beta_merge.py -i input_csvs -o $output_fname -s "!{samplesheet}" $args
fi
'''


}


Hi all. The Nextflow process above is supposed to merge multiple CSV beta files into one large file. However, it keeps failing on Google Batch with the VM reporting timeout error 50002. The suggested solution on the Google troubleshooting page ("To resolve this issue, retry the task either by using automated task retries or manually re-running the job.") makes no sense here, as I have already run the process almost 30 times and get the exact same error every time. Have any of you observed this problem, and how did you mitigate it? Huge thanks in advance!


13 REPLIES

Hi @vaslem,

Would you mind sharing one of your failed job examples, with the job UID, region, and any helpful log info, so that the Batch team can better help with your case?

Thanks!

Wenyan

Hi @wenyhu ,

Thanks for reaching out. Certainly, there are plenty of examples; one of the most recent is nf-6fefc5ef-172107-fd527bb4-2075-4a970, in europe-west1. Unfortunately, the logs are about as uninformative as they could be: almost all of them are iterations of this pair of entries:
1st:
report agent state: metadata:{parent:"projects/750927779528/locations/europe-west1" zone:"europe-west1-d" instance:"nf-6fefc5ef-172107-fd527bb4-2075-4a970-group0-0-brqp" instance_id:374284833611328490 creation_time:{seconds:1721079062 nanos:563528652} creator:"projects/750927779528/regions/europe-west1/instanceGroupManagers/nf-6fefc5ef-172107-fd527bb4-2075-4a970-group0-0" version:"cloud-batch-agent_20240703.00_p00" os_release:{key:"BUILD_ID" value:"18244.85.49"} os_release:{key:"ID" value:"cos"} os_release:{key:"NAME" value:"Container-Optimized OS"} os_release:{key:"VERSION" value:"113"} os_release:{key:"VERSION_ID" value:"113"} image_version:"batch-cos-stable-20240703-00-p00" machine_type:"n1-highmem-8"} agent_info:{state:AGENT_RUNNING job_id:"nf-6fefc5ef-172107-fd527bb4-2075-4a970" user_project_num:750927779528 tasks:{task_id:"task/nf-6fefc5ef-172107-fd527bb4-2075-4a970-group0-0/0/0" task_status:{state:RUNNING status_events:{type:"ASSIGNED" description:"task task/nf-6fefc5ef-172107-fd527bb4-2075-4a970-group0-0/0/0 ASSIGNED" event_time:{seconds:1721079314 nanos:583605011} task_state:ASSIGNED} status_events:{type:"RUNNING" description:"task task/nf-6fefc5ef-172107-fd527bb4-2075-4a970-group0-0/0/0 RUNNING" event_time:{seconds:1721079314 nanos:583607678} task_state:RUNNING}}} tasks:{task_id:"action/STARTUP/0/0/group0" task_status:{state:SUCCEEDED status_events:{type:"ASSIGNED" description:"task action/STARTUP/0/0/group0 ASSIGNED" event_time:{seconds:1721079062 nanos:786688189} task_state:ASSIGNED} status_events:{type:"RUNNING" description:"task action/STARTUP/0/0/group0 RUNNING" event_time:{seconds:1721079062 nanos:786697743} task_state:RUNNING} status_events:{type:"SUCCEEDED" description:"succeeded" event_time:{seconds:1721079314 nanos:541728459} task_state:SUCCEEDED}}} report_time:{seconds:1721082044 nanos:620897021} task_group_id:"group0"} agent_timing_info:{boot_time:{seconds:1721079054 nanos:518190499} script_startup_time:{seconds:1721079062 nanos:138190499} agent_startup_time:{seconds:1721079062 nanos:563528652}}

2nd:
Server response for instance 374284833611328490: tasks:{task:"action/STARTUP/0/0/group0" status:{state:SUCCEEDED status_events:{type:"ASSIGNED" description:"task action/STARTUP/0/0/group0 ASSIGNED" event_time:{seconds:1721079062 nanos:786688189} task_state:ASSIGNED} status_events:{type:"RUNNING" description:"task action/STARTUP/0/0/group0 RUNNING" event_time:{seconds:1721079062 nanos:786697743} task_state:RUNNING} status_events:{type:"SUCCEEDED" description:"succeeded" event_time:{seconds:1721079314 nanos:541728459} task_state:SUCCEEDED}} intended_state:ASSIGNED job_uid:"nf-6fefc5ef-172107-fd527bb4-2075-4a970" task_group_id:"group0"} tasks:{task:"task/nf-6fefc5ef-172107-fd527bb4-2075-4a970-group0-0/0/0" status:{state:RUNNING status_events:{type:"ASSIGNED" description:"task task/nf-6fefc5ef-172107-fd527bb4-2075-4a970-group0-0/0/0 ASSIGNED" event_time:{seconds:1721079314 nanos:583605011} task_state:ASSIGNED} status_events:{type:"RUNNING" description:"task task/nf-6fefc5ef-172107-fd527bb4-2075-4a970-group0-0/0/0 RUNNING" event_time:{seconds:1721079314 nanos:583607678} task_state:RUNNING}} intended_state:ASSIGNED job_uid:"nf-6fefc5ef-172107-fd527bb4-2075-4a970" task_group_id:"group0"} use_batch_monitored_resource:true.

Hi @vaslem ,

Could you please share more info about the files you used for the job?

1. size of each file

2. are these files in the same volume folder?

3. size of the output file

We are using gcsfuse under the hood; here is more info about its performance and best practices: https://cloud.google.com/storage/docs/gcsfuse-performance-and-best-practices.

Thanks,

Wen
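One practical reading of that gcsfuse guide, sketched as a variant of the original betaMerge process: copy the inputs off the gcsfuse-backed work directory onto the VM's local disk before the merge, so the Python script does not issue many small reads through FUSE. This is an untested sketch under heavy assumptions (that the boot disk is large enough and that FUSE read patterns are actually the bottleneck); betaMergeLocal is a made-up name, beta_merge.py and the labels come from the original process, and the csv_names prefixing from the original is omitted for brevity.

process betaMergeLocal {
    label 'python'
    label 'biggermem'

    input:
    path csvs, stageAs: "*/*"

    output:
    path "merged.csv"

    shell:
    '''
    # Copy the staged inputs from the gcsfuse-mounted work dir to local disk,
    # then merge locally and write the single large output back once.
    localdir=$(mktemp -d)
    cp -L !{csvs} "$localdir"/
    beta_merge.py -i "$localdir" -o merged.csv
    rm -rf "$localdir"
    '''
}

Whether this actually helps depends on how beta_merge.py reads its inputs; it is offered only as one way to apply the kind of I/O advice on the linked page.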

Hi @Wen_gcp 

Thanks for getting in touch. There were 34 files provided, with sizes ranging from roughly 100 MB to 4 GB. The input files were residing in the same mount, so only a single bucket was used. The expected size of the output file is approximately 35 GB. Thank you for supplying this link, I will go through it. I managed to circumvent the issue by splitting the operation further across multiple machines, exposing operations done within my Python file as separate Nextflow processes (see the sketch below). Still, I believe it would be nice to make the error more descriptive if possible, or at least to update the documentation to explain that the job probably failed because of insufficient resources, rather than telling users they can simply rerun the job as is.
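For readers who want a concrete picture of that kind of split, here is a minimal sketch. The process names, the pairwise chunking via collate(2), the params.csvs glob, and the merge_pair.py / concat_betas.py helpers are all illustrative placeholders, not the poster's actual pipeline.

// Hypothetical sketch: merge the CSVs pairwise on small machines,
// then concatenate the partial results on a single larger machine.
process mergePair {
    label 'python'

    input:
    path csv_pair, stageAs: "*/*"

    output:
    path "partial_*.csv"

    shell:
    '''
    # merge_pair.py is a hypothetical helper that merges two beta CSVs
    merge_pair.py -i !{csv_pair} -o partial_!{task.index}.csv
    '''
}

process concatPartials {
    label 'python'
    label 'biggermem'

    input:
    path partials

    output:
    path "merged.csv"

    shell:
    '''
    # concat_betas.py is a hypothetical helper that concatenates the partial merges
    concat_betas.py -i !{partials} -o merged.csv
    '''
}

workflow {
    csv_ch = Channel.fromPath(params.csvs).collate(2)   // emit the inputs in pairs
    concatPartials(mergePair(csv_ch).collect())
}

The point of the split is that the heavy per-file work is spread across many small tasks, and only the comparatively simple final concatenation runs on one larger machine.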

Thanks @vaslem for the information. Good suggestions; we will update the doc and keep you posted.

Hi @vaslem, I tried to reproduce the issue with the following commands:

# create 30 files of ~100 MB each
for (( i=1; i<=30; i++ ))
do
    filename="/path/100M_$i.txt"
    dd if=/dev/urandom of="$filename" bs=1M count=100
    sed -i 's/1/2/g' "$filename"
    echo "Created $filename"
done
# create 8 files of ~4 GB each
for (( i=1; i<=8; i++ ))
do
    filename="/path/4G_$i.txt"
    dd if=/dev/urandom of="$filename" bs=1M count=4000
    sed -i 's/1/2/g' "$filename"
    echo "Created $filename"
done
# merge everything into one large file
cat /path/*.txt > combined_file.txt

It did not trigger the 50002 exit code issue, but it did cause a high CPU usage peak (screenshot: CPU.png).

May I know the metrics of your job? You can find them by clicking into the VM -> OBSERVABILITY -> METRICS -> OVERVIEW.

Thanks!

Hi @Wen_gcp, unfortunately Nextflow removes the VMs upon task completion/termination, so I have no access to these metrics through that path. Is there any other way, perhaps through the logs?

Thanks @vaslem for the prompt reply! I did not see a way to fetch this through the logs after the VMs have been deleted. If this issue happens again, it would be very helpful for us to have the metrics if possible.

Unfortunately I won't be able to see this if I get the error again, as Nextflow will always delete the VM... That's why I believe the error needs to be more descriptive; we are left in the dark when it is raised, unless there is something in the logs to help us out.

Hi @vaslem, agreed. I did not see any error exposed on our end when the job failed; it seems the VM just crashed. I have involved the gcsfuse team to gain more insights internally.

At the same time, we will update our documentation to at least offer a hint that using larger machines could potentially bypass the issue.
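For anyone landing here from search, "use a larger machine" in a Nextflow pipeline usually comes down to raising the resource directives for the failing process in the config. A minimal sketch follows; withName, cpus, memory, disk and machineType are standard Nextflow options, but the sizes below are purely illustrative guesses, not recommendations:

process {
    withName: 'betaMerge' {
        cpus          = 8
        memory        = { 52.GB * task.attempt }    // ask for more RAM on every retry
        disk          = { 200.GB * task.attempt }
        // Alternatively, pin an explicit machine type for the Google Batch executor:
        // machineType = 'n1-highmem-16'
        errorStrategy = 'retry'
        maxRetries    = 3
    }
}

Because task.attempt appears in the closures, each retry requests a bigger allocation instead of re-running the identical configuration.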

=========

For posterity, @vaslem managed to circumvent the issue by splitting the operation further across multiple machines, exposing operations done within the Python file as separate Nextflow processes (sketched earlier in this thread).

Thank you @Wen_gcp, anything you do in that direction would be much appreciated!