Limitations of Spark RDD

Although the RDD is the fundamental abstraction of Apache Spark, it has several limitations, and these motivated the development of the DataFrame and Dataset APIs.

1. No built-in optimization engine

RDDs have no provision for automatic optimization. They cannot take advantage of Spark's advanced optimizers, such as the Catalyst optimizer and the Tungsten execution engine. Each RDD can still be optimized by hand, but that work falls entirely on the developer.

Dataset and DataFrame help to overcome this limitation. Both use Catalyst, which generates optimized logical and physical query plans. The same optimizer serves the R, Java, Scala, and Python Dataset/DataFrame APIs. As a result, they offer better space and speed efficiency.
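The difference can be seen by writing the same filter twice, once against an RDD and once against a DataFrame. The following is a minimal sketch, not a benchmark: the object name, the sample data, and the local master setting are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CatalystPlanSketch {
  // Hypothetical sample data: (name, age) pairs.
  val people: Seq[(String, Int)] = Seq(("Ann", 34), ("Bob", 19), ("Cid", 52))

  // RDD version: the lambdas are opaque to Spark, so they run as written.
  def namesWithRdd(spark: SparkSession): Seq[String] =
    spark.sparkContext.parallelize(people)
      .filter { case (_, age) => age > 30 }
      .map { case (name, _) => name }
      .collect()
      .toSeq

  // DataFrame version: the same logic as column expressions,
  // which Catalyst can inspect and rewrite.
  def namesWithDataFrame(spark: SparkSession): DataFrame = {
    import spark.implicits._
    people.toDF("name", "age").filter($"age" > 30).select("name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("catalyst-plan-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    println(namesWithRdd(spark))

    val df = namesWithDataFrame(spark)
    df.explain(true) // prints the logical plans and the physical plan
    df.show()

    spark.stop()
  }
}
```

Calling explain(true) on the DataFrame prints the plans that Catalyst produced. The RDD version has no equivalent plan to inspect.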

2. Runtime type safety

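RDDs carry no schema, so Spark cannot check field names or column types before a job runs. When records are loosely typed (for example, tuples or generic rows parsed from text), such mistakes surface only at run time.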

The Dataset API provides compile-time type safety, which helps when building complex data workflows. Compile-time type safety means that if the code refers to a field that does not exist, or treats a value as the wrong type, the compiler reports an error before the job runs. Catching such errors at compile time makes the code safer.
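A short sketch of what this looks like with a typed Dataset in Scala. The Person case class, the object name, and the sample values are hypothetical. The commented-out lines show the two failure modes rather than running them.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Int)

object TypedDatasetSketch {
  // The compiler checks that `age` exists on Person and is numeric.
  def adults(people: Dataset[Person]): Dataset[Person] =
    people.filter(p => p.age >= 18)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("typed-dataset-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Person("Ann", 34), Person("Bob", 12)).toDS()
    adults(ds).show()

    // Typed API: this line would be rejected by the compiler,
    // because Person has no member named `salary`.
    // ds.map(p => p.salary)

    // Untyped API: this compiles, and fails only at run time
    // with an AnalysisException when the column is resolved.
    // ds.toDF().select("salary")

    spark.stop()
  }
}
```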

3. Degradation when there is not enough memory

An RDD degrades when there is not enough memory to hold it. Partitions that do not fit in RAM can be spilled to disk, but reading them back from disk is slower than reading from memory, so performance drops. Increasing the available RAM and disk space mitigates this issue.
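Whether overflow partitions go to disk depends on the storage level chosen when the RDD is persisted. RDD.cache() uses MEMORY_ONLY, under which partitions that do not fit are recomputed when needed rather than stored. The sketch here asks for MEMORY_AND_DISK instead. The object name and the data are assumptions for illustration, and a data set this small will not actually spill.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SpillToDiskSketch {
  // Builds a hypothetical RDD and marks it for memory-first caching,
  // with partitions that do not fit in memory written to disk.
  def cachedSquares(spark: SparkSession): RDD[Long] = {
    val squares = spark.sparkContext
      .parallelize(1 to 1000000)
      .map(n => n.toLong * n)
    squares.persist(StorageLevel.MEMORY_AND_DISK)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spill-to-disk-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    val squares = cachedSquares(spark)
    println(squares.count())                 // first action materializes the cache
    println(squares.getStorageLevel.useDisk) // confirms disk is allowed

    squares.unpersist()
    spark.stop()
  }
}
```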

4. Performance limitations and the overhead of serialization and garbage collection

Because RDDs are in-memory JVM objects, they incur garbage collection and Java serialization overhead. This becomes expensive as the data grows.

The cost of garbage collection is proportional to the number of Java objects. Using data structures with fewer objects therefore lowers that cost, as does persisting the objects in serialized form.
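Persisting in serialized form can be sketched as follows. With MEMORY_ONLY_SER, Spark stores each cached partition as a single byte array, so the garbage collector sees far fewer objects, at the price of extra CPU work to deserialize on access. Switching to the Kryo serializer is an optional extra shown here as an assumption, not something the text above requires. The object name and data are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SerializedCacheSketch {
  // Session configured with Kryo instead of default Java serialization.
  def buildSession(): SparkSession = {
    val conf = new SparkConf()
      .setAppName("serialized-cache-sketch")
      .setMaster("local[*]") // assumption: local run for illustration
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    SparkSession.builder().config(conf).getOrCreate()
  }

  // Caches a hypothetical RDD as serialized bytes rather than live objects.
  def cacheSerialized(spark: SparkSession): RDD[String] = {
    val words = spark.sparkContext.parallelize(Seq("spark", "rdd", "dataset"))
    words.persist(StorageLevel.MEMORY_ONLY_SER)
  }

  def main(args: Array[String]): Unit = {
    val spark = buildSession()
    val words = cacheSerialized(spark)
    println(words.count())
    println(words.getStorageLevel.deserialized) // false for a serialized level
    spark.stop()
  }
}
```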

5. Handling structured data

RDDs offer no schema view of the data and no provision for handling structured data.

Dataset and DataFrame both provide a schema view of the data. A DataFrame is a distributed collection of data organized into named columns.
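The schema view is easy to inspect in code. In this minimal sketch the column names, sample rows, and object name are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SchemaViewSketch {
  // Builds a small DataFrame with two named columns (hypothetical data).
  def peopleFrame(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("Ann", 34), ("Bob", 19)).toDF("name", "age")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("schema-view-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    val df = peopleFrame(spark)
    df.printSchema()                             // column names and types as a tree
    println(df.schema.fieldNames.mkString(", ")) // the named columns

    // The underlying RDD holds generic Row objects and exposes
    // no column-level operations of its own.
    println(df.rdd.first())

    spark.stop()
  }
}
```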
