Spark Overview

What is Apache Spark?

Architecture of Apache Spark

Spark Framework:

This section looks at how Spark is built: the features it offers, the ecosystem around it, and the interfaces it exposes.

Spark Engine (Spark Core): The core of Spark is the Spark engine, which actually executes any code we write using Spark. The main job of Spark Core is to take the data, split it across the multiple nodes of a cluster, perform the requested operations, and return the desired results. A set of libraries and interfaces is built on top of Spark Core.
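The split, process, and combine flow described above can be sketched in plain Python. This is only an illustration of the idea, not Spark's API: the helper names (split_into_partitions, count_words, run_job) are made up for this example, and a local thread pool stands in for the worker nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split a list into chunks, one per simulated worker node."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """The operation each worker runs on its own chunk of the data."""
    return sum(len(line.split()) for line in partition)


def run_job(records, num_partitions=4):
    """Split the data, process each chunk in parallel, combine the results."""
    partitions = split_into_partitions(records, num_partitions)
    # Threads stand in for cluster nodes here (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    return sum(partial_counts)


lines = ["spark splits data", "across many nodes", "and combines results"]
total_words = run_job(lines, num_partitions=2)
print(total_words)
```

In a real cluster the chunks would live on different machines and the engine would also handle scheduling and failures, which this sketch leaves out.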

Management: To manage Spark, especially when it runs on a cluster, we can use Hadoop YARN, Apache Mesos, or Spark's own standalone scheduler.

Library: The libraries available in Spark to manipulate and manage data:

Spark SQL: A SQL-like interface for working with data in Spark.

MLlib: A machine learning library that supports a range of ML algorithms.

GraphX: Used to perform graph analysis.

Streaming: Used to analyze real-time data. For example, take a Twitter data stream, apply an ML algorithm, make predictions, and act on them in real time.

Storage: Since Spark is used to manipulate data, it needs somewhere to read the data from and write the results to. Spark can work with the local file system, HDFS, S3 (Amazon's cloud storage), an RDBMS, or NoSQL stores (Cassandra, MongoDB, etc.).

Programming: Spark programs can be written in Scala, Python, R, or Java.

Key part of Spark

1. Resilient Distributed Datasets (RDD):

a. Spark is built around RDDs. You can create, transform, analyze, and store RDDs in a Spark program. In other words, when you read data from any source in Spark, it arrives in the form of an RDD, and when you transform that data, the result is another RDD.

Example:

Reading a file that has 100 lines from any storage creates an RDD, say RDD01.

Transforming the records of RDD01 (for example, filtering them) creates another RDD, say RDD02. Counting the records, by contrast, returns a number rather than a new RDD.

Note: A dataset here means a collection of elements of any type, such as strings, lines, rows, or objects.

b. The dataset can be partitioned and distributed across multiple nodes.
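To make the RDD01 and RDD02 example concrete, here is a small sketch in plain Python. ToyRDD is a made-up class for illustration only, not Spark's real RDD; in PySpark the corresponding real calls would be SparkContext.textFile to read the file, RDD.filter to transform it, and RDD.count to count the records. The 100 lines are generated in memory instead of being read from storage, and the four partitions are an arbitrary choice.

```python
class ToyRDD:
    """A toy, in-memory stand-in for an RDD (not Spark's real class)."""

    def __init__(self, partitions):
        # Each inner tuple plays the role of one partition held by one node.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def from_lines(cls, lines, num_partitions):
        """Split the lines into contiguous chunks, one per partition."""
        lines = list(lines)
        size = max(1, -(-len(lines) // num_partitions))  # ceiling division
        return cls(lines[i:i + size] for i in range(0, len(lines), size))

    def filter(self, predicate):
        """Transformation: returns a NEW ToyRDD, the original is unchanged."""
        return ToyRDD([x for x in p if predicate(x)] for p in self._partitions)

    def count(self):
        """Action: returns a plain number, not another ToyRDD."""
        return sum(len(p) for p in self._partitions)

    def num_partitions(self):
        return len(self._partitions)


# "Read" 100 lines into RDD01, spread over 4 partitions.
lines = [f"line {n}" for n in range(1, 101)]
rdd01 = ToyRDD.from_lines(lines, num_partitions=4)

# Filtering RDD01 produces a second dataset, RDD02.
rdd02 = rdd01.filter(lambda line: line.endswith("0"))

print(rdd01.count(), rdd02.count(), rdd01.num_partitions())
```

The point of the sketch is that the filter leaves RDD01 untouched and hands back a separate RDD02, while each partition is processed independently of the others.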
