Load files from Storage to BigQuery

Hello

I have log files in a Cloud Storage directory and I want to load them into BigQuery. In Python I execute:

from google.cloud import bigquery, storage

storage_client = storage.Client()
client = bigquery.Client()
bucket = storage_client.bucket('my-bucket')   # placeholder bucket name

# Only start a load job if there is at least one file under the prefix.
blobs_all = list(bucket.list_blobs(prefix='mydir/'))
if blobs_all:
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        ignore_unknown_values=True,
    )
    print("start_load")
    load_job = client.load_table_from_uri(
        'gs://path/mydir/*',
        'mytable',                 # destination table ID
        location="EU",
        job_config=job_config,
        job_id=JOB_NAME,           # JOB_NAME is defined elsewhere
    )
    load_job.result()              # wait for the load to complete

But if there is a format error in one file, the load job fails:

  • Is there a way to tell BigQuery to skip the error and continue loading the other files?
  • Or do I have to load the files one by one and check each job's errors? Since I potentially have hundreds of files to load, that would mean running a lot of jobs (a rough sketch of this approach is after this list).
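
To illustrate that second option, here is a rough sketch of what loading file by file would look like (it reuses the client, bucket and job_config from the code above; the table name is a placeholder):

# Fallback sketch: run one load job per file and record which files fail.
failed_files = []
for blob in bucket.list_blobs(prefix='mydir/'):
    uri = f'gs://{bucket.name}/{blob.name}'
    load_job = client.load_table_from_uri(
        uri,
        'mytable',                 # placeholder destination table ID
        location="EU",
        job_config=job_config,
    )
    try:
        load_job.result()          # raises if this file's load job failed
    except Exception:
        failed_files.append((uri, load_job.errors))

print("files that failed to load:", failed_files)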

Thanks for your support

4 REPLIES

Please reply with the following.

1. Have you tried following this guide?

2. How are the files created?

3. Do they all have the same schema?

4. Do their schema match the destination table's schema?

5. Are you loading them into the same table?

6. The full error message, without any sensitive or personal information.

This will help find the root cause of the issue.

Also, consider these limitations regarding newline delimited JSON data.

Hello

  1. Yes, I have read this topic.
  2. The files are log files, with the logs formatted as JSON.
  3. The files, and the logs inside them, all have the same schema.
  4. The logs respect the table schema.
  5. Yes, I load all the files into that same table.

My issue concerns the case where a file contains a "bad" log that does not match the schema. In that case the whole job stops with this error:

400 Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1251; errors: 1. Please look into the errors[] collection for more details.
[{'reason': 'invalid', 'location': 'gs://maxime-analytics-test/batch-elk-logs/all/tutu.log', 'message': 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1251; errors: 1. Please look into the errors[] collection for more details.'}, {'reason': 'invalid', 'message': 'Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 1251; errors: 1; max bad: 0; error percent: 0'}, {'reason': 'invalid', 'location': 'gs://maxime-analytics-test/batch-elk-logs/all/tutu.log', 'message': 'Error while reading data, error message: JSON parsing error in row starting at position 0: Could not parse value'}]

Is there a way to configure the job to skip this bad file?

Regards

Thanks for the error message. 
It says

max bad: 0;

which, according to this guide, is set to 0 by default.

Please set the max_bad_records option (the '--max_bad_records' flag of the bq CLI, or LoadJobConfig.max_bad_records in the Python client) to the desired value.
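
For the Python client used above, a minimal sketch would look like this (the URI and table name are placeholders, and the threshold of 100 is just an example):

from google.cloud import bigquery

client = bigquery.Client()

# Skip up to 100 invalid records instead of failing the whole load job.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    ignore_unknown_values=True,
    max_bad_records=100,
)

load_job = client.load_table_from_uri(
    'gs://path/mydir/*',       # placeholder source URI
    'mytable',                 # placeholder destination table ID
    location="EU",
    job_config=job_config,
)
load_job.result()              # errors beyond the threshold still fail the job

Note that the limit applies to the whole load job, not per file, so the job can still fail if the total number of bad records across all files exceeds it.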

Ok

Thanks for the support