Apache Spark

Apache Spark is a fast, general-purpose cluster computing platform for Big Data. It offers expressive development APIs that let data workers run streaming, machine learning, and SQL workloads that need fast, repeated access to datasets. Spark handles both batch processing and stream processing, and it combines several Big Data tools in a single framework. Spark can access any Hadoop data source and can run on a Hadoop cluster. It takes the Hadoop MapReduce model to the next level by adding iterative queries and stream processing. In Hadoop, MapReduce provides scalability across the servers in a cluster; Spark is scalable in the same way. It also provides simple APIs in Python, Java, Scala, and R.

Spark is often described as an extension of Hadoop, but this is not accurate. Spark and Hadoop are independent projects, although Spark can run on HDFS. Several features make Apache Spark an appealing framework. One of them is in-memory computation: Spark can keep data in RAM instead of re-reading it from disk, which increases processing speed.
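The effect of keeping data in RAM can be illustrated with a minimal sketch in plain Python. This is not Spark code: the class names `DiskBackedDataset` and `CachedDataset` and the function `iterative_job` are hypothetical, invented for this illustration. The sketch only shows why an iterative job that makes several passes over the same data benefits when the data is read from disk once and then served from memory.

```python
import os
import tempfile


def write_dataset(path, n):
    """Write n integers, one per line, to stand in for a dataset on disk."""
    with open(path, "w") as f:
        for i in range(n):
            f.write(f"{i}\n")


class DiskBackedDataset:
    """Hypothetical dataset that re-reads and re-parses the file on every pass."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0  # how many times the file was opened and parsed

    def records(self):
        self.disk_reads += 1
        with open(self.path) as f:
            return [int(line) for line in f]


class CachedDataset(DiskBackedDataset):
    """Hypothetical dataset that reads the file once, then serves passes from RAM."""

    def __init__(self, path):
        super().__init__(path)
        self._cache = None

    def records(self):
        if self._cache is None:
            self._cache = super().records()  # the only disk read
        return self._cache


def iterative_job(dataset, passes):
    """Make several passes over the same data, as an iterative algorithm would."""
    total = 0
    for p in range(passes):
        total += sum(x * (p + 1) for x in dataset.records())
    return total


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    write_dataset(path, 1000)

    disk = DiskBackedDataset(path)
    cached = CachedDataset(path)
    disk_result = iterative_job(disk, passes=5)
    cached_result = iterative_job(cached, passes=5)

print("disk-based reads:", disk.disk_reads)
print("in-memory reads:", cached.disk_reads)
print("same answer:", disk_result == cached_result)
```

Both variants compute the same result, but the cached one touches the disk once instead of once per pass. In Spark itself, the comparable step is asking the engine to keep a dataset in memory (for example with `cache()` on an RDD or DataFrame) before running repeated computations over it.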

Evolution of Apache Spark

At present, Spark is one of the most active projects of the Apache Software Foundation.

  • 2009: Spark began as a research project, developed by Matei Zaharia at UC Berkeley's AMPLab.
  • 2010: It was open sourced under a BSD license.
  • 2013: Spark became a part of the Apache Software Foundation.

Why Spark?

Spark was created to overcome the limitations of Hadoop MapReduce. Some of the drawbacks of Hadoop are:

  • Applications are built primarily in Java.
  • Because most of the framework is written in Java, there are some security concerns. Java is heavily targeted by cyber criminals, which may result in security breaches.
  • It is suited only to batch processing and does not support stream processing.
  • Hadoop uses disk-based processing.

Audience of Spark

The two most important audiences of Apache Spark are data scientists and engineers. Data scientists analyze and model data, using their skills to explore it and discover patterns. Because Spark is simple and fast, it is very popular among data scientists. One feature that makes it their choice is that they can explore data interactively using the SQL shell. As a result, data scientists can handle problems with large datasets efficiently.

The second important audience is engineers. Using Spark, engineers can develop data processing applications. Spark provides an efficient way to parallelize applications across clusters and hides the complexity of distributed systems programming, network communication, and fault tolerance.
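The idea of parallelizing an application across partitions of data can be sketched in plain Python. This is only an analogy and not Spark code: the helper names `split_into_partitions`, `count_words`, and `word_count` are hypothetical, and threads on a single machine stand in for the worker nodes of a cluster. The sketch shows the pattern an engine like Spark automates: split the data, process each partition independently, then merge the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions chunks."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Per-partition work: count the words in this partition's lines only."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def word_count(lines, num_partitions=3):
    """Count words by processing partitions in parallel and merging the results."""
    partitions = split_into_partitions(lines, num_partitions)

    # Each partition is processed independently; threads stand in for workers.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_words, partitions))

    # Merge the partial counts into one final result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = [
    "spark runs on a cluster",
    "spark keeps data in memory",
    "hadoop mapreduce writes to disk",
    "spark can read from hdfs",
]
result = word_count(lines)
print(result.most_common(3))
```

In this sketch the splitting, scheduling, and merging are written by hand. A cluster framework takes over those steps, along with moving data between machines and recovering from failed workers, so the application code can stay close to the `count_words` logic alone.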
