Always getting error 50002 on Google Batch with specific Nextflow process

process betaMerge {
    label 'python'
    label 'biggermem'
    maxRetries 20
    // Give up ('ignore') after the last attempt or on non-recoverable exit
    // codes; otherwise retry with a larger disk request (see 'disk' below).
    errorStrategy {
        if (task.attempt == 20 || task.exitStatus == 1 || task.exitStatus == 137) {
            return 'ignore'
        }
        println "Retrying ${task.name} with more disk size"
        return 'retry'
    }
    // Scale the disk request with the total size of the input CSVs,
    // growing on each retry attempt up to a factor of three.
    disk {
        def base = 8.B * csvs*.size().sum()
        def mult = task.attempt > 3 ? 3 : task.attempt
        def used = mult * base
        println "Disk used by ${task.name}: ${used}"
        return used
    }
    publishDir "$OUTPUT_ROOT/$sample_group/merged"

    input:
    path csvs, arity: '1..*', stageAs: "*/*" // comma separated
    val csv_names
    path samplesheet // tab separated
    val sample_group
    val output_fname // extension should be included
    val large_mem // 1 enables the large-memory arguments

    output:
    path "*.*", includeInputs: false

    shell:
    '''
    output_fname=!{output_fname}
    sample_group=!{sample_group}
    large_mem=!{large_mem}
    csvs="!{csvs}"
    csv_names="!{csv_names}"
    csvs_array=( $csvs )
    csv_names_array=( $csv_names )
    mkdir -p input_csvs
    # Stage local links (or copies) of the inputs, as the connection with
    # the machine is not to be trusted.
    cd input_csvs &&
    for i in "${!csvs_array[@]}"; do
        csv="${csvs_array[i]}" &&
        fname=$(basename "${csv}") &&
        csv_name="${csv_names_array[i]}" &&
        echo "Copying $csv to input_csvs/${csv_name}.$fname .." && {
            # Fall back to a real copy if symlinking fails; note that we are
            # already inside input_csvs at this point.
            ln -sf ../$csv ${csv_name}.${fname} || cp ../$csv ${csv_name}.${fname}
        }
    done
    cd .. &&
    args=
    if [[ $large_mem -eq 1 ]]
    then
        args="$args -l -u 50000"
    fi

    samplesheet="!{samplesheet}"
    if [[ $samplesheet == "undefined" ]]
    then
        beta_merge.py -i input_csvs -o $output_fname $args
    else
        beta_merge.py -i input_csvs -o $output_fname -s "!{samplesheet}" $args
    fi
    '''
}

Hi all. The Nextflow process above is supposed to merge multiple CSV beta files into one large file. However, it keeps failing with the VM reporting timeout error 50002. The suggested solution on the Google Troubleshooting page ("To resolve this issue, retry the task either by using automated task retries or manually re-running the job.") makes no sense, as I have run the process almost 30 times and always get the exact same error. Have any of you observed this problem, and how did you mitigate it? Huge thanks in advance!

Solved
1 ACCEPTED SOLUTION

Hi @vaslem, agreed, I did not see any error exposed on our end when the job failed. It seems the VM just crashed. I have involved the gcsfuse team to gain more insight internally.

At the same time, we will update our documentation to at least offer a hint that using larger machines could potentially bypass the issue.
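
For reference, one way to act on that hint is to pin the failing step to a larger machine through a process selector in nextflow.config; the Google Batch executor honors the machineType directive. The resource values and machine type below are illustrative assumptions, not settings from this thread:

// nextflow.config -- sketch only; sizes are assumptions, tune to your workload
process {
    withName: 'betaMerge' {
        cpus        = 8
        memory      = '64 GB'
        machineType = 'n2-highmem-8' // assumed example; pick a type that fits your data
    }
}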

=========

For posterity: @vaslem managed to circumvent the issue by splitting the work across even more machines, exposing operations previously done inside the Python script as separate Nextflow processes.
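
A rough sketch of that pattern is below. It is only illustrative: the process names, the chunk size, and the assumption that beta_merge.py can also merge a directory of previously merged partials are all hypothetical, not taken from the original pipeline:

// Hypothetical chunked merge: each chunk of CSVs is merged on its own
// (smaller) VM, and the partial results are combined in a final step.
process mergeChunk {
    input:
    path csvs, stageAs: "input_csvs/*"

    output:
    path "partial_*.csv"

    shell:
    '''
    beta_merge.py -i input_csvs -o partial_!{task.index}.csv
    '''
}

process mergeFinal {
    input:
    path partials, stageAs: "input_csvs/*"

    output:
    path "merged.csv"

    shell:
    '''
    beta_merge.py -i input_csvs -o merged.csv
    '''
}

workflow {
    // params.csvs is assumed to be a glob such as 'data/*.csv'
    chunks   = Channel.fromPath(params.csvs).buffer(size: 50, remainder: true)
    partials = mergeChunk(chunks)
    mergeFinal(partials.collect())
}

The point of the split is that each task touches less data and finishes sooner, so no single VM has to survive the entire merge.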
