Great articles, Dinesh.

Dec 15, 2022

Great articles, Dinesh. However, in both optimisation blogs, I felt that you left out the most important point, i.e., choosing the right storage file format for the use case.

Storing intermediate output is another important aspect that can come in handy (though it is not always required) when dealing with complex aggregations on a heavy dataset, since it avoids running the same computations again.
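To make the two points above concrete, here is a minimal sketch that runs a heavy aggregation once, stores the result as Parquet (a columnar format, picked here only as an example), and reads it back for later steps. The function name, the paths, and the columns `event_date`, `user_id` and `amount` are all hypothetical, and the input is assumed to already be Parquet.

```python
def build_daily_totals(spark, events_path, checkpoint_path):
    """Aggregate once, store the result as Parquet, and return it re-read from disk."""
    # Local import so that defining this helper does not require Spark.
    from pyspark.sql import functions as F

    # Hypothetical input: one row per event with event_date, user_id, amount.
    events = spark.read.parquet(events_path)

    # The expensive step we do not want to repeat for every downstream query.
    daily = (
        events.groupBy("event_date", "user_id")
        .agg(F.sum("amount").alias("total_amount"))
    )

    # Store the intermediate output; later jobs can start from this file.
    daily.write.mode("overwrite").parquet(checkpoint_path)

    # Reading it back cuts the lineage, so the aggregation is not recomputed.
    return spark.read.parquet(checkpoint_path)
```

If the result is only needed within a single job, `DataFrame.persist()` is a lighter alternative to writing a file.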

Also, the pandas API on Spark (pyspark.pandas) has been part of Apache Spark since 3.2, so pandas-style code can now scale out. That is a big boost for all the Python developers out there. In addition, pandas UDFs (vectorized UDFs) are faster than normal Python UDFs because they don't process records one by one; each batch of records is passed to the function as a pandas Series. Below is a sample Databricks link. Similar gains can be observed in the latest Apache Spark versions.

https://www.databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
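As a rough illustration of the difference, the sketch below applies the same formula through a normal Python UDF and through a pandas UDF. The function names and the Celsius example are my own, not taken from the linked post, and it assumes Spark 3.x with pandas and PyArrow installed.

```python
def celsius_to_fahrenheit(celsius):
    """Works on a single float or element-wise on a whole pandas Series."""
    return celsius * 9.0 / 5.0 + 32.0


def compare_udfs(spark):
    """Apply the same formula through a row-at-a-time UDF and a pandas UDF."""
    # Local imports so the helper above stays usable without Spark installed.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, udf

    @udf("double")
    def row_udf(c):
        # Called once per record with a plain Python float.
        return celsius_to_fahrenheit(c)

    @pandas_udf("double")
    def vectorized_udf(c: pd.Series) -> pd.Series:
        # Called once per batch with a pandas Series holding many records.
        return celsius_to_fahrenheit(c)

    df = spark.createDataFrame([(0.0,), (37.0,), (100.0,)], ["celsius"])
    return df.select(
        "celsius",
        row_udf("celsius").alias("f_row"),
        vectorized_udf("celsius").alias("f_vectorized"),
    )
```

With an active SparkSession, `compare_udfs(spark).show()` should give matching values in both columns; the speed difference only becomes visible on much larger inputs.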

Written by Siddhesh K
