Big Data refers to data sets so large that conventional tools cannot handle them. The data can be structured or unstructured. Gartner, a leading industry research firm, describes big data as high-volume, high-velocity, and high-variety information assets that demand innovative forms of processing for enhanced insight and decision making. Traditional processing systems cannot cope with data at this scale. In the past, data was not generated in such large amounts, so a relational database management system (RDBMS) was enough to store it.

Today we generate many terabytes of data. By common estimates, roughly 80% of it is unstructured. Traditional methods were not sufficient for storing petabytes of data, so a new technology was needed that could store such huge amounts reliably. Hadoop emerged to meet that need.

Hadoop is an open source framework from the Apache Software Foundation (ASF). It provides massive storage for any kind of data and the means to process it. The Hadoop framework is written in Java and can efficiently process large volumes of data on a cluster of commodity hardware. Hadoop is scalable, fault tolerant, and flexible. Despite these advantages, it has certain disadvantages: it handles small files poorly, its processing speed is slow, and it supports only batch processing. Apache Spark, now also an ASF project, was developed to address these limitations.

Apache Spark is a unified framework for building, managing, and running big data processing workloads. Spark supports both batch processing and stream processing. For some workloads it can process data up to 100 times faster than Hadoop MapReduce, largely because it keeps intermediate data in memory. This book discusses Spark in detail.
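To make the difference between batch and stream processing concrete, here is a minimal sketch in plain Python. It does not use Spark or any Spark API. The function names `batch_word_count` and `streaming_word_count` and the sample lines are made up for illustration only. The batch version sees the whole data set before it starts, while the streaming version receives one record at a time and emits updated totals after each record.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def batch_word_count(lines: List[str]) -> Counter:
    """Batch mode: the full data set is available before processing starts."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def streaming_word_count(lines: Iterable[str]) -> Iterator[Counter]:
    """Streaming mode: records arrive one by one; yield running totals after each."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
        # Yield a copy so earlier snapshots are not changed by later records.
        yield Counter(counts)


# Hypothetical sample input, used only for this illustration.
sample = ["Spark runs batch jobs", "Spark runs streaming jobs"]

batch_result = batch_word_count(sample)
stream_snapshots = list(streaming_word_count(iter(sample)))
```

Once the stream has delivered every record, the last snapshot holds the same totals as the batch result. The practical difference is that the streaming version has a usable answer after each record instead of only at the end.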
