Confusing, "false" error message

When running jobs on GCP Batch, especially jobs that use GPUs, we noticed that many log messages are marked as "error" even though they are just normal GPU driver installation logs. The same thing happens with Docker image layer downloads. Can Google Cloud fix this? These "false" error logs are confusing and often make it hard to find the real errors.

 

[Screenshots: layer pulling; driver installation]


Hi @r4ruixi,

Thanks for your feedback!

The Batch team has also noticed this, and we are working on an improvement now.

In the short term, can you try filtering the logs by batch_task_logs to exclude the GPU- or Docker-related "false error" logs? It may not remove every "false error" log your job produces, but it should help.
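As a rough illustration of that workaround, here is a minimal Python sketch that builds a Cloud Logging filter string restricted to the batch_task_logs log. The helper name `batch_task_log_filter` and its parameters are hypothetical, and the `labels.job_uid` label is an assumption about how Batch tags its entries, so check it against your own log entries before relying on it.

```python
from typing import Optional


def batch_task_log_filter(
    project_id: str,
    job_uid: Optional[str] = None,
    min_severity: Optional[str] = None,
) -> str:
    """Build a Cloud Logging filter that keeps only Batch task logs.

    Hypothetical helper: it only assembles a filter string and does not
    call any Google Cloud API.
    """
    # Restrict results to the task log, which leaves out agent-side
    # output such as driver installation and image pulling.
    clauses = [f'logName="projects/{project_id}/logs/batch_task_logs"']
    if job_uid:
        # Assumption: Batch attaches the job UID as the label "job_uid".
        clauses.append(f'labels.job_uid="{job_uid}"')
    if min_severity:
        # Standard Logging severity comparison, e.g. severity>=ERROR.
        clauses.append(f"severity>={min_severity}")
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(batch_task_log_filter("my-project", min_severity="ERROR"))
```

The resulting string can be pasted into the Logs Explorer query box or passed to `gcloud logging read`.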

Thanks!

Wenyan

@wenyhu Hey, thanks for your response. However, it would be helpful to fix this on Google's side. It is fairly difficult to write a filter on Stackdriver logs that excludes these driver installation and image pulling messages, as they don't follow a consistent pattern.

Hi @r4ruixi,

Currently we mark all messages that external scripts write to stderr with error severity. However, the container pulling script and the GPU driver installation script also print some warning/info messages to stderr. So the question becomes a little tricky: Batch needs some way to identify which stderr messages are real errors.
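To make the difficulty concrete, here is a small Python sketch of one possible heuristic: look for a level keyword in each stderr line and fall back to ERROR when none is found. This is only an illustration and not how Batch classifies logs; the function name, the keyword list, and the sample lines are all made up for the example.

```python
import re

# Level keywords that tools commonly embed in their own log lines.
_LEVEL_PATTERN = re.compile(
    r"\b(DEBUG|INFO|NOTICE|WARN(?:ING)?|ERROR|FATAL|CRITICAL)\b",
    re.IGNORECASE,
)

# Map each keyword to a Cloud Logging severity name.
_SEVERITY = {
    "DEBUG": "DEBUG",
    "INFO": "INFO",
    "NOTICE": "NOTICE",
    "WARN": "WARNING",
    "WARNING": "WARNING",
    "ERROR": "ERROR",
    "FATAL": "CRITICAL",
    "CRITICAL": "CRITICAL",
}


def classify_stderr_line(line: str, default: str = "ERROR") -> str:
    """Guess a severity for one stderr line (hypothetical heuristic).

    The first level keyword found wins. Lines without any keyword keep
    the default, which mirrors the "everything on stderr is an error"
    behavior described above.
    """
    match = _LEVEL_PATTERN.search(line)
    if match is None:
        return default
    return _SEVERITY[match.group(1).upper()]


if __name__ == "__main__":
    # Made-up sample lines, not real Batch output.
    for sample in ('level=info msg="pulling layer"', "WARN: retrying", "Segmentation fault"):
        print(classify_stderr_line(sample), "<-", sample)
```

A keyword match like this is easy to fool, for example by a message that merely mentions the word "error", which is part of why a consistent fix is hard.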

To find a better solution, could you share your use case? Are you setting filters in the Cloud Console UI to debug errors in a Batch job? Or do you want a programmatic way to surface real errors automatically and redirect them to other platforms?

Thanks!

Bob

@bobtian My use case is simple: setting some filters in the Cloud Console UI to expose real errors quickly, rather than searching through hundreds of log lines that were falsely marked as "Error".

Btw, it seems that the GPU driver installation stage is still exposing `info`-level logs as Error in Stackdriver. You can easily reproduce this by launching a Batch job with a GPU. (I used g2 machines to launch these jobs.)

image.png

This should now be fixed for the batch_agent_logs category. Batch will make a best effort to classify the severity of entries in Cloud Logging.