I have a 3 GB Parquet file in GCS that I am trying to load into a BigQuery table. The console shows two errors for the load job:
```
Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 101% of limit. Top memory consumer(s): input table/file scan: 100%

Error while reading data, error message: Failed to read a column from Parquet file gs://<redacted>data.parquet: row_group_index = 0, column = 6. Exception message: Unknown error: CANCELLED: . Detail: CANCELLED: File: gs://<redacted>/data.parquet
```
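For context, there is nothing exotic about the load itself; it is roughly the equivalent of this google-cloud-bigquery sketch (the project, dataset, and table names here are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Plain Parquet load from GCS, no schema overrides or partitioning options.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
)

load_job = client.load_table_from_uri(
    "gs://<redacted>/data.parquet",
    "my_project.my_dataset.observations",  # placeholder table id
    job_config=job_config,
)
load_job.result()  # fails with the errors shown above
```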
The file is written from a pyarrow table. I have tried lowering `row_group_size` to 10,000 rows, but that has not helped.
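The write path is essentially the following (a minimal sketch; the real table is built from ~3 GB of observation data, so the values below are just placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Schema mirrors the Parquet schema shown below; "required" fields are non-nullable.
schema = pa.schema([
    pa.field("observation_id", pa.string(), nullable=False),
    pa.field("exposure_id", pa.string(), nullable=False),
    pa.field("mjd", pa.float64(), nullable=False),
    pa.field("ra", pa.float64(), nullable=False),
    pa.field("ra_sigma", pa.float64()),
    pa.field("dec", pa.float64(), nullable=False),
    pa.field("dec_sigma", pa.float64()),
    pa.field("mag", pa.float64()),
    pa.field("mag_sigma", pa.float64()),
    pa.field("observatory_code", pa.string(), nullable=False),
])

# Placeholder rows; the real data is built elsewhere.
table = pa.table(
    {
        "observation_id": ["obs-0001", "obs-0002"],
        "exposure_id": ["exp-01", "exp-01"],
        "mjd": [60000.1, 60000.2],
        "ra": [10.5, 11.2],
        "ra_sigma": [0.01, None],
        "dec": [-5.3, -5.4],
        "dec_sigma": [0.01, None],
        "mag": [21.3, None],
        "mag_sigma": [0.05, None],
        "observatory_code": ["I41", "I41"],
    },
    schema=schema,
)

# Lowering row_group_size to 10,000 rows did not change the outcome.
pq.write_table(table, "data.parquet", row_group_size=10_000)
```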
The Parquet schema of the file is:

```
required group field_id=-1 schema {
  required binary field_id=-1 observation_id (String);
  required binary field_id=-1 exposure_id (String);
  required double field_id=-1 mjd;
  required double field_id=-1 ra;
  optional double field_id=-1 ra_sigma;
  required double field_id=-1 dec;
  optional double field_id=-1 dec_sigma;
  optional double field_id=-1 mag;
  optional double field_id=-1 mag_sigma;
  required binary field_id=-1 observatory_code (String);
}
```
The docs only state a maximum row size of 50 MB (https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet). There is no way a single row, or even a 10,000-row row group, is anywhere near that size.
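This is the kind of check I mean when I say the row groups are nowhere near that limit (a sketch run against a local copy of the file):

```python
import pyarrow.parquet as pq

# Inspect per-row-group sizes from the Parquet footer metadata.
md = pq.ParquetFile("data.parquet").metadata
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    # total_byte_size is the uncompressed size of the row group's column data.
    print(f"row group {i}: rows={rg.num_rows}, "
          f"uncompressed={rg.total_byte_size / 1e6:.1f} MB")
```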
Any guidance would be appreciated.