Great article, Jas. However, I felt a few points were misleading.
You can load a 14GB file on a machine with 8GB of memory. A simple show() or count() action won't lead to memory issues: Spark only has to collect a few rows, or the per-partition counts, from the different partitions and return them. Even if some processing were required, data would be spilled to disk depending on the transformations, shuffling of data, etc.
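To illustrate the point, here is a pure-Python sketch (not actual Spark code, just the idea): each partition is a lazy iterator, and a count is computed partition by partition and summed, so the full dataset is never materialized in memory at once.

```python
def partition(start, end):
    """Lazily yield rows for one partition (a stand-in for one file split)."""
    for i in range(start, end):
        yield {"id": i}

def count_rows(parts):
    # Count each partition independently, then sum the per-partition counts.
    # At no point is more than one row per partition held in memory.
    return sum(sum(1 for _ in p) for p in parts)

parts = [partition(0, 1000), partition(1000, 2000), partition(2000, 3000)]
print(count_rows(parts))  # 3000
```

Spark's real count() works the same way at a high level: a per-partition count on each executor, then a small reduce of the partial counts on the driver.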
Also, every cloud provider offers disk storage attached to the cluster. I have never worked with Databricks, but I am fairly sure it has a similar option.
Spark's LRU cache policy comes into the picture here.
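For readers unfamiliar with the term: when cached partitions no longer fit in memory, Spark evicts the least-recently-used ones first. Below is a generic LRU cache sketch in plain Python to show the eviction behaviour; it is illustrative only, not Spark's actual implementation.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: the least-recently-used entry is evicted first,
    similar in spirit to how Spark evicts old cached partitions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("p1", "partition 1")
cache.put("p2", "partition 2")
cache.get("p1")                  # touch p1, so p2 is now least recently used
cache.put("p3", "partition 3")   # capacity exceeded: p2 is evicted
print(list(cache.data))          # ['p1', 'p3']
```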
I would also like to talk about the Kafka consumer architecture shown. I understand that on a local machine you may not be able to run more consumers, but the drawbacks of running fewer consumers than partitions could have been listed here.
Ideally, each partition should be processed by a single consumer. If one consumer reads from multiple partitions, records from different partitions get interleaved; Kafka only guarantees ordering within a partition, so any cross-partition ordering has to be handled explicitly.
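To make the ordering issue concrete, here is a small simulation (plain Python, not the Kafka client API) of one consumer polling two partitions round-robin: each partition's offsets stay in order, but the global stream interleaves the two.

```python
# Records are (partition, offset) pairs; each partition is ordered internally.
partition_0 = [("p0", 0), ("p0", 1), ("p0", 2)]
partition_1 = [("p1", 0), ("p1", 1), ("p1", 2)]

def single_consumer_poll(parts):
    """Simulate one consumer polling several partitions round-robin:
    per-partition order is preserved, but records interleave globally."""
    iters = [iter(p) for p in parts]
    out = []
    while iters:
        for it in list(iters):
            try:
                out.append(next(it))
            except StopIteration:
                iters.remove(it)  # this partition is exhausted
    return out

records = single_consumer_poll([partition_0, partition_1])
# Per-partition order survives...
assert [o for p, o in records if p == "p0"] == [0, 1, 2]
# ...but globally the two partitions are interleaved:
print(records)  # [('p0', 0), ('p1', 0), ('p0', 1), ('p1', 1), ('p0', 2), ('p1', 2)]
```

With one consumer per partition (the ideal setup), each consumer sees exactly one totally-ordered stream and this problem disappears.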
A discussion of offsets and a Git repo link with the end-to-end flow would also have been helpful for readers of this article.
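The offset discussion matters because committed offsets are what let a consumer resume where it left off. A toy illustration of the idea (again, not the real Kafka client API; names here are made up):

```python
# Toy model of offset commits: the consumer tracks the next offset to
# read per partition and "commits" it, so a restarted consumer resumes
# from the committed position instead of re-reading everything.

committed = {}  # partition -> next offset to consume

def consume(log, partition, batch_size):
    """Read up to batch_size records from the committed offset onward,
    then commit the new position after processing."""
    start = committed.get(partition, 0)
    batch = log[start:start + batch_size]
    committed[partition] = start + len(batch)  # commit after processing
    return batch

log_p0 = ["msg-%d" % i for i in range(5)]

first = consume(log_p0, "p0", 3)   # reads msg-0..msg-2
second = consume(log_p0, "p0", 3)  # resumes at offset 3: msg-3, msg-4
print(first, second)
```

Committing after processing (as above) gives at-least-once delivery; committing before processing would give at-most-once. That trade-off alone would have been worth a paragraph in the article.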