
5 Things I Learned About Spark (That More People Should Know)

In my day job, I build data products for a large consumer goods company. (It’s actually quite likely that you have one of our products in your home.) For one of our analysis products, we were transitioning a set of carefully crafted batch ETL pipelines from a legacy environment to Spark.

We are a small team and I learned many things along the way. Having never worked with Spark before, there was a learning curve and a lot of hands-on learning for me. Thanks to PySpark and environments like Databricks, getting started is relatively easy for people with experience in Python + Jupyter notebooks.

1. The Global Temp View is your Friend #

Our batch ETL job is a DAG of notebooks, and selected dataframes need to be passed on to the next notebook. This did not work out of the box in Databricks, and the documentation on it is surprisingly sparse.

The solution turned out to be global temp views. They are simple and work great, but several of us were confused for days about global vs. local Spark scope before we found this.

2. Lazy Evaluation and Performance Bottlenecks #

Lazy evaluation is a key feature of Spark and works great in most cases. However, in complex ETL jobs it can cause problems once the logical/analytical plans grow very large.

It’s easy to write code that performs poorly, and lazy evaluation makes it hard for people with little Spark experience to identify where a bottleneck comes from: the slow step is typically not the line where the poor performance becomes visible, but somewhere earlier in the plan.

3. Caching, Persisting, Checkpoints 🤔 #

In complex ETL jobs, it’s easy to create scenarios where the logical plan gets very complex, especially when branching off of the main data object and then joining back to it. This can lead to scenarios where multiple objects have huge overlap in the logical/analytical plan and this plan gets executed multiple times.

Caching can help avoid this issue and reduce processing times in such scenarios, but caching in Spark is completely manual. There are multiple options, and it takes a fair bit of trial and error to navigate .cache(), .persist() and .checkpoint().

To this day, I do not understand why the common idiom for caching/persisting uses assignment (e.g. df = df.cache()) while unpersisting does not (e.g. df.unpersist()) — strictly speaking, .cache() returns the same dataframe, so the assignment is convention rather than a requirement. Moreover, it’s not very easy to get an overview of all cached objects and their respective sizes in PySpark.

4. The Problem of Small Files #

Spark is a great solution for large datasets. However, it’s very easy to write data transformations that split your data into many small files. This is not a problem per se, but it heavily degrades Spark’s performance. We had scenarios where workflows ran for 3-4 hours, often stuck on seemingly minor tasks for 10 minutes…

In our case, a key reason for this was that Spark had created many small partitions of our dataset. At one point, we had a dataframe with 2000 partitions of only a few bytes each; the whole dataframe would have fit comfortably on a single Spark worker node. We also had scenarios where transformations created more than 10000 partitions on a single dataframe. This meant extreme overhead for the Spark driver in tracking and scheduling the individual partitions of our RDD/dataframe across the worker nodes.

In my experience, the number of resulting partitions is not easy to predict. Running diagnostics as well as .coalesce()/.repartition() became necessary to ensure reasonable performance.

5. Memory Management… 🤔 #

In our batch ETL job, we noticed that Spark memory usage kept increasing. The reason was not entirely clear, and we suspected that many unused objects (e.g. objects created in previous notebooks) were still being kept in memory.

It is relatively tricky to confirm this behaviour and work around it. We ended up manually deleting objects and triggering garbage collection, but the results were somewhat unpredictable and the documentation on this is sparse. We are still trying to find a better way to do this.
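For completeness, this is roughly the cleanup routine we settled on (a sketch of the approach, not a recommendation; as noted, the results were mixed):

```python
import gc

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("mem-demo").getOrCreate()

df = spark.range(100_000).cache()
df.count()  # materialize the cache

# Release this dataframe's cached blocks explicitly; blocking=True
# waits until the blocks are actually freed.
df.unpersist(blocking=True)

# Or drop *every* cached table/dataframe in the session at once,
# e.g. at the end of a notebook in the DAG.
spark.catalog.clearCache()

# Finally, drop the Python reference and force garbage collection so
# the driver-side object (and its JVM handle) can be reclaimed.
del df
gc.collect()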