I want to export BigQuery data to GCS using the bq extract command.
As per the documentation, it supports the Parquet format, but when I try to set --destination_format PARQUET, it throws this error:
--destination_format=PARQUET: value should be one of <CSV|NEWLINE_DELIMITED_JSON|AVRO|SAVED_MODEL>
Does that mean Parquet is not supported by bq extract yet, or do we need to set additional parameters?
Hi @hamzasarwar,
Welcome to Google Cloud Community!
BigQuery currently supports exporting data in Parquet format to Cloud Storage. I tried to replicate the bq extract command and successfully exported a Parquet file. You can refer to the command below:
bq extract --destination_format=PARQUET 'your_dataset.your_table' gs://your-bucket/your-file.parquet
The error you're seeing usually means that your installed bq command-line tool is too old to recognize PARQUET as a destination format, even though the BigQuery service itself supports Parquet exports. Try updating the Google Cloud CLI (for example with gcloud components update, or by reinstalling the SDK), confirm your version with bq version, and then rerun the extract command.
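If updating the CLI isn't convenient, the same export can also be run through the BigQuery Python client library, which accepts PARQUET as a destination format. Here is a minimal sketch, assuming the google-cloud-bigquery package is installed and that your_project, your_dataset, your_table, and your-bucket are placeholders for your own names:

from google.cloud import bigquery

client = bigquery.Client(project="your_project")

# Configure the extract job to write Parquet
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.PARQUET
)

extract_job = client.extract_table(
    "your_project.your_dataset.your_table",   # source table
    "gs://your-bucket/your-file.parquet",     # destination URI in Cloud Storage
    job_config=job_config,
)
extract_job.result()  # block until the export job finishes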
If the issue persists, I recommend reaching out to Google Cloud Support for further assistance, as they can provide insights into whether this behavior is specific to your project.
I hope the above information is helpful.
Yes, you are correct: if your version of the bq extract command doesn't offer Parquet directly, you need a workaround. The response you provided outlines a valid approach, but let's refine it with some extra considerations and best practices:
Workaround to Export to Parquet
Export to a Supported Format First: As the response suggests, start by exporting your data to an intermediate format like CSV or AVRO using bq extract. AVRO is generally preferred over CSV, especially for large datasets or complex schemas, due to its efficiency and schema evolution capabilities.
bq extract --destination_format=AVRO your_project:your_dataset.your_table gs://your-bucket/your-output.avro
Convert to Parquet: Now, you'll need to convert this intermediate file to Parquet. Here are the conversion options:
PySpark/pandas: Excellent choice! Both PySpark and pandas provide efficient ways to read AVRO/CSV and write Parquet. The pandas version is below, and a PySpark sketch follows it.
import pandas as pd
import pandavro as pdx  # pandas has no built-in AVRO reader; pandavro (pip install pandavro) wraps fastavro
import fsspec           # with gcsfs installed, fsspec and pandas can read and write gs:// paths

# Read the AVRO export (for a CSV export, use pd.read_csv('gs://your-bucket/your-output.csv') instead)
with fsspec.open('gs://your-bucket/your-output.avro', 'rb') as f:
    df = pdx.read_avro(f)

# Write to Parquet (requires pyarrow or fastparquet)
df.to_parquet('gs://your-bucket/your_output.parquet', index=False)
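For larger exports, here is a rough PySpark equivalent of the same conversion. It's a minimal sketch, assuming you run it somewhere with the spark-avro package and the GCS connector available (for example, a Dataproc cluster); the bucket paths are placeholders.

from pyspark.sql import SparkSession

# Assumes spark-avro and the GCS connector are on the classpath, e.g. on Dataproc,
# or via: spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.1 your_script.py
spark = SparkSession.builder.appName("avro-to-parquet").getOrCreate()

# Read the intermediate AVRO export from Cloud Storage
df = spark.read.format("avro").load("gs://your-bucket/your-output.avro")

# Write it out as Parquet (Spark writes a directory of part files, not a single file)
df.write.mode("overwrite").parquet("gs://your-bucket/parquet-output/")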