
Dataproc cluster job gets killed on pandas dataframe append

Dear Experts,

I want to append three pandas data frames, each with about 120 columns and about 5 million rows, but the kernel gets killed. It fails even when I increase the number of primary and secondary workers to 30 each.

Increasing the number of workers doesn't seem to improve performance either; the speed stays almost the same.

Help please

 

Antony

Solved
1 ACCEPTED SOLUTION

Here are a few things you can try. Keep in mind that pandas itself runs on a single node: the append happens entirely in the driver's memory, which is why adding primary or secondary workers doesn't change anything.

  1. Reduce the number of columns. If you can, keep only the columns you actually need in the data frames. A smaller frame makes the append cheaper and less likely to get the kernel killed.
  2. Use a different data format. Consider storing the data as Parquet or ORC. These are columnar formats that are more compact and let you read only the columns you need (see the pandas sketch right after this list).
  3. Use a distributed computing framework. Since the job already runs on a Dataproc cluster, you can let Apache Spark (or Dask) do the append as a distributed operation, so the data is spread across the workers instead of sitting on the driver (see the Spark sketch further down).
  4. Use a cloud-based solution. Managed warehouses such as Google Cloud BigQuery or Amazon Redshift are scalable and fault-tolerant and can append tables of this size without you managing memory yourself.
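
To make points 1 and 2 concrete, here is a minimal pandas sketch. It assumes the three frames are already stored as Parquet in a GCS bucket and that only a subset of the 120 columns is actually needed; all paths and column names below are placeholders.

```python
import pandas as pd

# Placeholder inputs -- swap in your real paths and column names.
parts = [
    "gs://my-bucket/frame1.parquet",
    "gs://my-bucket/frame2.parquet",
    "gs://my-bucket/frame3.parquet",
]
needed_columns = ["id", "event_time", "value"]  # the subset you actually use

# Parquet is columnar, so pandas can read just the listed columns instead
# of all ~120 (reading gs:// paths needs gcsfs installed on the driver).
frames = [pd.read_parquet(p, columns=needed_columns) for p in parts]

# DataFrame.append was removed in pandas 2.0; a single pd.concat builds
# the combined frame once instead of copying it on every append.
combined = pd.concat(frames, ignore_index=True)

# Keep the result in Parquet rather than holding it all in memory afterwards.
combined.to_parquet("gs://my-bucket/combined.parquet", index=False)
```

Even with column pruning this still runs on one machine, so if the combined frame does not fit in the driver's memory, the Spark route below is the better fit.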

If you are still having trouble, please provide more information about your environment, such as the version of Python and Pandas that you are using, as well as the operating system and hardware that you are running on.
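
For point 3, Spark is already installed on the Dataproc cluster, so the same append can run as a distributed job (for example submitted with `gcloud dataproc jobs submit pyspark`). A rough sketch, again assuming the frames are available as Parquet in GCS and that the bucket paths are placeholders:

```python
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("append-three-frames").getOrCreate()

# Placeholder Parquet locations for the three frames.
paths = [
    "gs://my-bucket/frame1/",
    "gs://my-bucket/frame2/",
    "gs://my-bucket/frame3/",
]

# Each read returns a distributed DataFrame; the rows stay spread across
# the executors, so nothing has to fit in the driver's memory.
frames = [spark.read.parquet(p) for p in paths]

# unionByName stacks the frames row-wise, matching columns by name --
# the Spark counterpart of a pandas append/concat.
combined = reduce(lambda a, b: a.unionByName(b), frames)

# The write is parallelised across the workers, so adding workers
# actually shortens the job here.
combined.write.mode("overwrite").parquet("gs://my-bucket/combined/")
```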


Thank you very much, I reduced the number of columns.