Jun 29, 2022
Great article, James. However, I would like to discuss scalability here. Suppose I have a huge dataset, say 1 billion records. Assuming the API can accept 0.1 M (100,000) records per request, I would send the data in batches of 0.1 M, but how do you distribute these requests across the cluster? I didn't see repartition() in your .withColumn code. If the data is skewed, a few partitions end up doing most of the work, so your code will run slower without a repartition().
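To make the question concrete, here is a rough sketch of the batching I have in mind, in plain Python so it runs without a cluster. The names `requests_needed` and `batched` are my own illustrations, not anything from the article, and the Spark usage in the trailing comment is only a guess at how it might be wired in (`call_api` and the partition count of 200 are hypothetical).

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

TOTAL_RECORDS = 1_000_000_000  # 1 billion records, as in the question
BATCH_SIZE = 100_000           # 0.1 M records per API request


def requests_needed(total: int, batch_size: int) -> int:
    """Number of API requests needed to send `total` records (ceiling division)."""
    return (total + batch_size - 1) // batch_size


def batched(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical Spark usage (not run here; call_api and 200 are assumptions):
#   df.repartition(200).rdd.mapPartitions(
#       lambda rows: (call_api(b) for b in batched(rows, BATCH_SIZE)))
```

With these numbers, `requests_needed(TOTAL_RECORDS, BATCH_SIZE)` comes to 10,000 requests, so how evenly those requests are spread over partitions matters a lot for the total run time.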