Features of RDD
Apache Spark is a widely used framework for big data processing. It supports many kinds of data, including graph, structured, and unstructured data. Its core abstraction is the RDD (Resilient Distributed Dataset). Some of the features of RDDs in Apache Spark are:
1. In-memory Computation
Spark keeps the data of an RDD in memory for as long as possible. This improves performance considerably, because the data can be reused without being read back from disk.
2. Lazy Evaluation
Transformations on an RDD are not evaluated immediately. Each transformation only defines a new RDD and records how to compute it. The computation runs only when an action is called. This reduces the number of writes to disk and so increases the efficiency of the system.
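The idea of deferring work until an action can be sketched with a small toy model in plain Python. This is not Spark's implementation: the class `LazyDataset`, the helper `traced_double`, and the `calls` list are hypothetical names used only for illustration.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source      # the original data
        self._steps = steps        # recorded transformations, not yet applied

    def map(self, fn):
        # Transformation: returns a new dataset with one more recorded step.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded, never run here.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded chain applied to each element.
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []

def traced_double(x):
    calls.append(x)                # record that the function really ran
    return x * 2

doubled = LazyDataset([1, 2, 3, 4]).map(traced_double)
big = doubled.filter(lambda x: x > 4)
calls_before_action = len(calls)   # nothing has been evaluated yet
result = big.collect()             # the action triggers evaluation
```

In this sketch `calls` stays empty until `collect()` runs, which mirrors how a transformation only describes work while an action performs it.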
3. Fault Tolerance
RDDs tolerate failures: if part of an RDD is lost, Spark can recompute it from the lineage graph, which records the transformations used to build that RDD.
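Recovery from lineage can be illustrated with a toy model, assuming each dataset remembers only its parent and the function that derives it. The class `LineageNode` and its `lose` method are hypothetical and exist only to simulate a failure; Spark's real recovery logic is far more involved.

```python
class LineageNode:
    """Toy dataset that can rebuild any partition from the recipe it was given."""

    def __init__(self, num_partitions, compute_partition):
        self.num_partitions = num_partitions
        self._compute = compute_partition      # how to rebuild partition i
        self._store = {}                       # partitions currently held

    def map(self, fn):
        # The child keeps only a reference to its parent and fn: its lineage.
        return LineageNode(self.num_partitions,
                           lambda i: [fn(x) for x in self.partition(i)])

    def partition(self, i):
        if i not in self._store:               # missing or lost: recompute
            self._store[i] = self._compute(i)
        return self._store[i]

    def lose(self, i):
        # Simulate a node failure that drops one partition.
        self._store.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = [[1, 2], [3, 4], [5, 6]]              # three partitions of input data
base = LineageNode(len(source), lambda i: list(source[i]))
squared = base.map(lambda x: x * x)

before = squared.collect()
squared.lose(1)                                # partition 1 is lost
after = squared.collect()                      # rebuilt from base via the lineage
```

Only the lost partition is recomputed here; the other partitions are served from what is still held.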
4. Immutability
RDDs are immutable: once created, an RDD cannot be changed. A transformation always creates a new RDD, and the changes appear in that new RDD while the original stays as it was.
5. Persistence
RDDs that are needed frequently can be persisted (cached), so that later actions reuse them without the delay of recomputing them. By default, persisted RDDs are kept in memory rather than on disk.
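The effect of persisting can be shown with a toy model that counts how often an expensive computation runs. The class `Dataset`, the function `expensive`, and the `runs` counter are hypothetical names for this sketch and do not correspond to Spark's API.

```python
class Dataset:
    """Toy dataset that recomputes on every action unless persisted."""

    def __init__(self, compute):
        self._compute = compute
        self._persist = False
        self._cached = None

    def persist(self):
        self._persist = True       # only marks the dataset; nothing runs yet
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached    # served from memory
        data = self._compute()
        if self._persist:
            self._cached = data    # keep the result for later actions
        return data


runs = {"count": 0}

def expensive():
    runs["count"] += 1             # track how many times the work is done
    return [x * x for x in range(5)]

plain = Dataset(expensive)
plain.collect()
plain.collect()
runs_without_persist = runs["count"]   # computed on both actions

runs["count"] = 0
kept = Dataset(expensive).persist()
kept.collect()
kept.collect()
runs_with_persist = runs["count"]      # computed on the first action only
```

Note that `persist()` in this sketch does no work itself; the result is stored the first time an action computes it.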
6. Location Stickiness
An RDD can define placement preferences for computing its partitions. A placement preference is information about where the data of a partition is located. The DAGScheduler uses it to place each task as close to its data as possible, which speeds up the computation.
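A much simplified picture of locality-aware placement is sketched below. It is an assumption-laden toy, not the DAGScheduler's actual algorithm: the function `place_tasks`, the host names, and the slot counts are all hypothetical.

```python
def place_tasks(preferred_hosts, free_slots):
    """Assign each partition's task to a preferred host when it has a free slot.

    preferred_hosts: partition id -> list of hosts holding that partition's data
    free_slots: host -> number of tasks it can still accept
    Returns partition id -> (host, "local" or "remote").
    """
    slots = dict(free_slots)           # copy so the caller's dict is untouched
    placement = {}
    for part, hosts in sorted(preferred_hosts.items()):
        local = next((h for h in hosts if slots.get(h, 0) > 0), None)
        if local is not None:
            host, kind = local, "local"
        else:
            # No preferred host is free: fall back to the host with most free slots.
            host, kind = max(slots, key=lambda h: slots[h]), "remote"
        slots[host] -= 1
        placement[part] = (host, kind)
    return placement


placement = place_tasks(
    preferred_hosts={0: ["node-a"], 1: ["node-a", "node-b"], 2: ["node-a"]},
    free_slots={"node-a": 1, "node-b": 1, "node-c": 2},
)
```

With these inputs, partitions 0 and 1 run on hosts that hold their data, while partition 2 has to run remotely because its only preferred host is already busy.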
7. Partitioning
An RDD divides its records into logical partitions and distributes them across the nodes of the cluster. The division is logical and exists only for processing; internally the data itself is not split. Partitioning is what provides parallelism.
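Logical partitioning of keyed records can be sketched as follows, assuming integer keys and a simple modulo rule in the spirit of hash partitioning. The function `partition_by_key` is a hypothetical helper, not a Spark call.

```python
def partition_by_key(records, num_partitions):
    """Split (key, value) records into logical partitions by key modulo."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # Records with the same key always land in the same partition.
        partitions[key % num_partitions].append((key, value))
    return partitions


records = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
parts = partition_by_key(records, 2)
```

Each inner list stands for one partition that could be handed to a different node for processing.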
8. Parallel
An RDD processes its data in parallel across the cluster.
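As a rough single-machine analogy, the sketch below processes each partition independently and then combines the partial results. Threads stand in for cluster executors here, which is an assumption of the example; `sum_partition` and the sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def sum_partition(partition):
    # The work done independently on one partition.
    return sum(partition)

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

with ThreadPoolExecutor(max_workers=3) as pool:
    # pool.map returns results in the same order as the input partitions.
    partial_sums = list(pool.map(sum_partition, partitions))

total = sum(partial_sums)          # combine the per-partition results
```

The final step of merging per-partition results into one value is similar in shape to what a reduce-style action does.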
9. Coarse-grained Operation
All transformations on an RDD are coarse-grained. This means an operation applies to the whole dataset, not to an individual element of the RDD's dataset.
10. Typed
RDDs are typed by the records they hold. For example, RDD[Int], RDD[Long], RDD[String].
11. No-limitation
There is no limit on the number of RDDs that can be created. The only limits are the available disk space and memory.