
Most efficient way of loading a large pandas DataFrame into BigQuery

Browsing the web, I found that many people have different ways of pushing or pulling data to or from BigQuery. So the question is the following: if I have a very large pandas DataFrame (more than 10 GB) and I need to push the entire dataset into a BigQuery table, what is the most efficient way of doing that? At the moment, my approach is to save this large DataFrame into multiple Parquet files, also following these guidelines:

  • Keep row sizes to 50 MB or less.
  • If your input data contains more than 100 columns, consider reducing the page size to be smaller than the default page size (1 * 1024 * 1024 bytes). This is especially helpful if you are using significant compression.
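
For reference, here is a minimal sketch of how I understand that page-size guideline can be applied when writing with pyarrow (the 512 KB value and the file path are only illustrative; pandas forwards extra keyword arguments to pyarrow.parquet.write_table):

import pandas as pd

# Minimal sketch: row_group_size is a ROW count, data_page_size is a BYTE target.
df_example = pd.DataFrame({"a": range(1_000_000), "b": range(1_000_000)})
df_example.to_parquet(
    "example_part.parquet",        # illustrative output path
    engine="pyarrow",
    compression="snappy",
    row_group_size=100_000,        # rows per row group
    data_page_size=512 * 1024,     # bytes per data page, below the 1 MiB default
)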

Then I push all these Parquet files to Cloud Storage and have them loaded into BigQuery in batches (a sketch of that load step follows the list below). However:

1. I'm not sure I'm correctly following the guidelines for the best Parquet chunk sizes.
2. I don't know if this is the best approach.
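
For reference, the Cloud Storage to BigQuery load step mentioned above looks roughly like the sketch below (a minimal sketch assuming the google-cloud-bigquery client; the bucket, dataset, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Load every Parquet chunk under the prefix in one batch load job.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

uri = "gs://my-bucket/my-prefix/*.parquet"  # wildcard picks up all chunk files

load_job = client.load_table_from_uri(
    uri,
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish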

I attach my way of splitting the DataFrame into multiple Parquet files, which at the moment saves a huge number of tiny Parquet files:

import os


def save_dataframe_to_parquet_chunks(df, output_dir, prefix, max_file_size_mb=50, max_rows_per_file=None,
                                     page_size=256 * 1024, compression='SNAPPY'):
    """Split a DataFrame into chunks and write each chunk to its own Parquet file."""
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # Estimate the number of rows per file from the target file size if not provided
    if max_rows_per_file is None:
        row_size = df.memory_usage(deep=True).sum() / len(df)  # average in-memory size of a row in bytes
        max_rows_per_file = int((max_file_size_mb * 1024 * 1024) / row_size)

    # Split the DataFrame into chunks and save each as a separate Parquet file
    for i, chunk in enumerate(range(0, len(df), max_rows_per_file)):
        df_chunk = df.iloc[chunk:chunk + max_rows_per_file]
        output_file = os.path.join(output_dir, f"{prefix}_part_{i + 1}.parquet")

        # Save the chunk to a Parquet file.
        # NOTE: pyarrow's row_group_size is a ROW count, so the byte-oriented
        # page_size value (256 * 1024) is being used here as "rows per row group".
        df_chunk.to_parquet(
            output_file,
            engine='pyarrow',
            compression=compression,
            row_group_size=page_size
        )
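
A minimal sketch of how I then call this function and upload the resulting files to Cloud Storage, assuming the google-cloud-storage client (bucket name and prefix are placeholders):

import os

from google.cloud import storage

# df is the large DataFrame from above; write the chunks locally first.
save_dataframe_to_parquet_chunks(df, "parquet_chunks", "my_table")

storage_client = storage.Client()
bucket = storage_client.bucket("my-bucket")  # placeholder bucket name

# Upload each chunk file under a common prefix in the bucket.
for file_name in os.listdir("parquet_chunks"):
    blob = bucket.blob(f"my-prefix/{file_name}")
    blob.upload_from_filename(os.path.join("parquet_chunks", file_name))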

 

 
2 REPLIES

Hello,

Thank you for your engagement regarding this issue. We haven't heard back from you for some time now, so I'm going to close this issue and it will no longer be monitored. However, if you have any new issues, please don't hesitate to create a new one. We will be happy to assist you.

Regards,

Sushma