Load files from Storage to BigQuery

Hello

I have log files in a Cloud Storage directory and I want to load them into BigQuery. In Python I execute:

from google.cloud import bigquery, storage

storage_client = storage.Client()
client = bigquery.Client()
bucket = storage_client.bucket('my-bucket')   # placeholder bucket name

# Only start a load job if there is at least one file under the prefix.
blobs_all = list(bucket.list_blobs(prefix='mydir/'))
if blobs_all:
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        ignore_unknown_values=True,
    )
    print("start_load")
    load_job = client.load_table_from_uri(
        'gs://path/mydir/*',
        'mytable',                 # destination table ID
        location="EU",
        job_config=job_config,
        job_id=JOB_NAME,           # JOB_NAME is defined elsewhere
    )
    load_job.result()              # wait for the load to complete

But if there is a format error in one file, the load job fails:

  • Is there a way to tell BigQuery to skip the error and continue loading the other files?
  • Or do I have to load the files one by one and check each job's errors? Since I potentially have hundreds of files to load, that would mean running a lot of jobs (a rough sketch of this approach is after this list).
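
To illustrate that second option, here is a rough sketch of what loading file by file would look like (it reuses the client, bucket and job_config from the code above; the table name is a placeholder):

# Fallback sketch: run one load job per file and record which files fail.
failed_files = []
for blob in bucket.list_blobs(prefix='mydir/'):
    uri = f'gs://{bucket.name}/{blob.name}'
    load_job = client.load_table_from_uri(
        uri,
        'mytable',                 # placeholder destination table ID
        location="EU",
        job_config=job_config,
    )
    try:
        load_job.result()          # raises if this file's load job failed
    except Exception:
        failed_files.append((uri, load_job.errors))

print("files that failed to load:", failed_files)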

Thanks for your support

4 REPLIES

Please reply with the following.

1. Have you tried following this guide?

2. How are the files created?

3. Do they all have the same schema?

4. Do their schema match the destination table's schema?

5. Are you loading them into the same table?

6. The full error message, without any sensitive or personal information.

This will help find the root cause of the issue.

Also, consider these limitations regarding newline delimited JSON data.

Hello

  1. Yes, I have read this topic.
  2. The files are log files, with the logs formatted as JSON.
  3. The files, and the logs inside them, all have the same schema.
  4. The logs respect the table schema.
  5. Yes, I load all the files into that same table.

My issue concerns the case where a file contains a "bad" log that does not match the schema. In that case the whole job stops with this error:

400 Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1251; errors: 1. Please look into the errors[] collection for more details.
[{'reason': 'invalid', 'location': 'gs://maxime-analytics-test/batch-elk-logs/all/tutu.log', 'message': 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1251; errors: 1. Please look into the errors[] collection for more details.'}, {'reason': 'invalid', 'message': 'Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 1251; errors: 1; max bad: 0; error percent: 0'}, {'reason': 'invalid', 'location': 'gs://maxime-analytics-test/batch-elk-logs/all/tutu.log', 'message': 'Error while reading data, error message: JSON parsing error in row starting at position 0: Could not parse value'}]

Is there a way to configure the job to skip this bad file?

Regards

Thanks for the error message. 
It says

max bad: 0;

which, according to this guide, is set to 0 by default.

Please set the max_bad_records option (the '--max_bad_records' flag of the bq CLI, or LoadJobConfig.max_bad_records in the Python client) to the desired value.
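
For the Python client used above, a minimal sketch would look like this (the URI and table name are placeholders, and the threshold of 100 is just an example):

from google.cloud import bigquery

client = bigquery.Client()

# Skip up to 100 invalid records instead of failing the whole load job.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    ignore_unknown_values=True,
    max_bad_records=100,
)

load_job = client.load_table_from_uri(
    'gs://path/mydir/*',       # placeholder source URI
    'mytable',                 # placeholder destination table ID
    location="EU",
    job_config=job_config,
)
load_job.result()              # errors beyond the threshold still fail the job

Note that the limit applies to the whole load job, not per file, so the job can still fail if the total number of bad records across all files exceeds it.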

Ok

Thanks for the support