Features of RDD

Apache Spark is a widely used framework for big data processing. It supports many kinds of data, including graph, structured, and unstructured data. The main features of RDDs (Resilient Distributed Datasets) in Apache Spark are:

1. In-memory Computation

Spark keeps the data in an RDD in memory for as long as possible. This improves performance significantly, because the data is readily available and does not have to be read from disk each time.

2. Lazy Evaluation

Transformations on an RDD are not evaluated immediately. Each transformation only defines a new RDD, and the computation runs only when an action is called. This reduces the number of writes to disk and so improves the efficiency of the system.
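The idea can be illustrated with a small toy model in plain Python. This is a sketch only, not the Spark API: the class `LazyDataset` and the helper `double` are hypothetical names invented for this example. Transformations only record what should be done, and the action `collect` is what triggers the work.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not a Spark class)."""

    def __init__(self, source, steps=()):
        self._source = source   # the underlying data
        self._steps = steps     # recorded transformations, not yet applied

    def map(self, fn):
        # Transformation: returns a new dataset and does no work yet
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now are the recorded steps evaluated
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []

def double(x):
    calls.append(x)   # record each call to see when work actually happens
    return x * 2

ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # nothing has run at this point
result = ds.collect()              # the action triggers the computation
```

In this sketch `double` is not called until `collect` runs, which mirrors how a transformation waits for an action.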

3. Fault Tolerance

An RDD is resilient to failure: Spark can rebuild a lost RDD using its lineage graph, which records the transformations that produced it.
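A rough way to picture lineage-based recovery is the following toy model in plain Python. It is not how Spark is implemented internally; the class `LineageNode` is a hypothetical name, and setting `data` to `None` merely simulates a lost result.

```python
class LineageNode:
    """Toy model: a dataset remembers its parent and the function that built it."""

    def __init__(self, parent=None, fn=None, source=None):
        self.parent = parent   # upstream dataset in the lineage, if any
        self.fn = fn           # function applied to the parent's records
        self.source = source   # original input, only set on the root
        self.data = None       # materialized result, which may be lost

    def map(self, fn):
        return LineageNode(parent=self, fn=fn)

    def compute(self):
        # Recompute from the lineage whenever the result is missing
        if self.data is None:
            if self.parent is None:
                self.data = list(self.source)
            else:
                self.data = [self.fn(x) for x in self.parent.compute()]
        return self.data


base = LineageNode(source=[1, 2, 3])
squared = base.map(lambda x: x * x)
first = squared.compute()

squared.data = None            # simulate losing the computed result
recovered = squared.compute()  # rebuilt from the parent via the lineage
```

Because each node knows how it was derived, the lost result can be recomputed instead of being restored from a replica.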

4. Immutability

RDDs are immutable: once an RDD is created, it cannot be changed. A transformation always creates a new RDD, and the changes appear in that new RDD.
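The same behaviour can be mimicked in plain Python with a frozen dataclass. This is only an analogy, not Spark code, and `FrozenDataset` is a hypothetical name used for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class FrozenDataset:
    """Toy immutable dataset (hypothetical, not a Spark class)."""
    items: tuple

    def map(self, fn):
        # A transformation builds a new object; the original is untouched
        return FrozenDataset(tuple(fn(x) for x in self.items))


original = FrozenDataset((1, 2, 3))
doubled = original.map(lambda x: x * 2)

try:
    original.items = (9, 9, 9)   # attempting to change it in place
    mutated = True
except FrozenInstanceError:
    mutated = False              # the assignment is rejected
```

After the transformation, `original` still holds its initial values and `doubled` holds the new ones.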

5. Persistence

RDDs that are needed frequently can be persisted or cached, so that there is no delay in fetching them when they are required. By default, persisted RDDs are kept in memory rather than on disk.
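The effect of persisting can be sketched with a toy class in plain Python. The class `CachedDataset` and its `compute_count` attribute are hypothetical and exist only to make the recomputation visible; the real Spark behaviour involves storage levels and executors that this sketch ignores.

```python
class CachedDataset:
    """Toy dataset that can keep its computed result (not a Spark class)."""

    def __init__(self, source, fn):
        self._source = source
        self._fn = fn
        self._persist = False
        self._cache = None
        self.compute_count = 0   # how many times the work was actually done

    def persist(self):
        # Mark the dataset so its result is kept after the first computation
        self._persist = True
        return self

    def collect(self):
        if self._cache is not None:
            return list(self._cache)   # served from memory, no recomputation
        self.compute_count += 1
        result = [self._fn(x) for x in self._source]
        if self._persist:
            self._cache = result
        return list(result)


plain = CachedDataset(range(5), lambda x: x + 1)
plain.collect()
plain.collect()   # recomputed, because nothing was persisted

kept = CachedDataset(range(5), lambda x: x + 1).persist()
kept.collect()
kept.collect()    # second call reuses the stored result
```

Here the unpersisted dataset does its work twice, while the persisted one does it once.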

6. Location Stickiness

RDDs can define placement preferences for computing partitions. A placement preference is information about where the data of an RDD is located. The DAGScheduler uses it to place each task as close to its data as possible, which speeds up the computation.

7. Partitioning

The records of an RDD are partitioned logically, and the partitions are distributed across the nodes of the cluster. The division is logical and exists only for processing; the data itself is not physically divided. Partitioning is what provides parallelism.
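As a minimal sketch of logical partitioning, the plain-Python function below assigns records to partitions by hashing them. Hash partitioning is only one possible scheme and is assumed here for illustration; the function `partition` is a hypothetical helper, not a Spark call.

```python
def partition(records, num_partitions):
    """Assign each record to a logical partition by hashing it."""
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        # The same record always lands in the same partition
        parts[hash(rec) % num_partitions].append(rec)
    return parts


parts = partition(range(10), 3)
# Every record belongs to exactly one partition
all_records = sorted(x for part in parts for x in part)
```

Each partition can then be handed to a different worker, while together the partitions still make up the complete dataset.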

8. Parallel

Data is processed in parallel across the cluster, with each partition handled independently.
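The following plain-Python sketch shows the general pattern of processing partitions in parallel and combining the partial results. Threads on one machine stand in for workers on cluster nodes, which is a simplifying assumption; `process_partition` and the sample partitions are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_partition(part):
    """Work done independently on one partition: sum of squares."""
    return sum(x * x for x in part)


# Three logical partitions of the numbers 0..9
partitions = [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]

# One worker per partition; pool.map keeps the results in partition order
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(process_partition, partitions))

# Combine the per-partition results into the final answer
total = sum(partial_sums)
```

The final result is the same as processing the whole dataset in one place, but the work on each partition can proceed at the same time.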

9. Coarse-grained Operation

All transformations on an RDD are coarse-grained. This means that an operation applies to the whole dataset, not to an individual element of the RDD.

10. Typed

RDDs are typed: every RDD has an element type. For example, RDD[Int], RDD[Long], RDD[String].

11. No-limitation

There is no limit on the number of RDDs that can be created. The only limit is the size of the available disk and memory.
