Browsing the web, I found that people push and pull data to and from BigQuery in many different ways, so my question is the following: if I have a very large pandas DataFrame (more than 10 GB) and I need to push the entire dataset into a BigQuery table, what is the most efficient way of doing that? At the moment, my approach is to save this large DataFrame into multiple Parquet files, following the guidelines, using the function below:
import os


def save_dataframe_to_parquet_chunks(df, output_dir, prefix, max_file_size_mb=50,
                                     max_rows_per_file=None, page_size=256 * 1024,
                                     compression='SNAPPY'):
    """Split df into chunks of roughly max_file_size_mb and write each chunk as a Parquet file."""
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # Estimate the number of rows per file from the size limit if not provided
    if max_rows_per_file is None:
        row_size = df.memory_usage(deep=True).sum() / len(df)  # average in-memory size of a row in bytes
        max_rows_per_file = int((max_file_size_mb * 1024 * 1024) / row_size)

    # Split the DataFrame into chunks and save each as a separate Parquet file
    for i, start in enumerate(range(0, len(df), max_rows_per_file)):
        df_chunk = df.iloc[start:start + max_rows_per_file]
        output_file = os.path.join(output_dir, f"{prefix}_part_{i + 1}.parquet")

        # Save the chunk to a Parquet file; page_size is forwarded to pyarrow as the
        # number of rows per row group
        df_chunk.to_parquet(
            output_file,
            engine='pyarrow',
            compression=compression,
            row_group_size=page_size,
        )
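For context, this is a rough sketch of one way the resulting chunks could then be loaded into the table. The table ID, output directory, and prefix are placeholders, and it assumes the google-cloud-bigquery client and its load-job API; I am not sure whether this is the most efficient route, which is exactly my question.

import glob
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder destination table
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Write the chunks, then load them one by one as separate load jobs
save_dataframe_to_parquet_chunks(df, "parquet_chunks", "mydata")  # placeholder paths
for path in sorted(glob.glob("parquet_chunks/mydata_part_*.parquet")):
    with open(path, "rb") as source_file:
        load_job = client.load_table_from_file(source_file, table_id, job_config=job_config)
        load_job.result()  # wait for each load job before starting the next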
Hi,
Please refer to documents [1] and [2]; I hope they are helpful. A minimal sketch based on [1] is included below the links.
[1] https://cloud.google.com/bigquery/docs/samples/bigquery-pandas-gbq-to-gbq-simple
[2] https://stackoverflow.com/questions/48886761/efficiently-write-a-pandas-dataframe-to-google-bigquery
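For reference, the approach in [1] boils down to a single pandas-gbq call. The project and table IDs below are placeholders, and this is only a minimal sketch, not a claim about which method is fastest for 10 GB of data:

import pandas_gbq

pandas_gbq.to_gbq(
    df,
    destination_table="my_dataset.my_table",  # placeholder "dataset.table" name
    project_id="my-project",                  # placeholder project ID
    if_exists="append",                       # append instead of failing if the table exists
)

The discussion in [2] also covers alternatives such as the google-cloud-bigquery client, which may be worth benchmarking on your data.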
Hello,
Thank you for your engagement regarding this issue. We haven't heard back from you for some time now, so I'm going to close this issue, which will no longer be monitored. However, if you have any new issues, please don't hesitate to create a new issue. We will be happy to assist you on the same.
Regards,
Sushma