Preparation Tips for CCA175

CCA175 (CCA Spark and Hadoop Developer) Certification



Pre Plan for CCA175 Exam

The key to any certification preparation is a proper plan; as the old saying goes, 'failing to plan is planning to fail'. In this blog I am going to show you one possible way to prepare for and obtain the CCA175 certification. My goal is to accomplish two things in this post:


1. Identify the technologies to learn in order to accomplish the certification goals.
2. Create a realistic schedule that covers learning and practicing before appearing for the certification.

The table below maps each required skill to the technologies you need to learn in order to solve problems during the certification exam. Remember, CCA175 is a hands-on exam; it is open book, but the only content you can access during the exam is the API and official framework documentation. Hence, it is very important to gain a good level of comfort using this set of Hadoop ecosystem technologies, generic or specific frameworks, and programming/query languages.

Exam Curriculum to Technology Mapping Table


Skill Description | Technology To Be Used

Data Ingest
  • Import data from a MySQL database into HDFS using Sqoop | Sqoop
  • Export data to a MySQL database from HDFS using Sqoop | Sqoop
  • Change the delimiter and file format of data during import using Sqoop | Sqoop
  • Ingest real-time and near-real-time streaming data into HDFS | Flume or Spark Streaming
  • Process streaming data as it is loaded onto the cluster | Flume or Spark Streaming
  • Load data into and out of HDFS using the Hadoop File System commands | HDFS command line

Transform, Stage and Store
  • Load RDD data from HDFS for use in Spark applications | Spark RDD and Spark DF (DataFrame)
  • Write the results from an RDD back into HDFS using Spark | Spark RDD and Spark DF
  • Read and write files in a variety of file formats | Spark RDD and Spark DF
  • Perform standard extract, transform, load (ETL) processes on data | Spark RDD, Spark DF and Hive

Data Analysis
  • Use metastore tables as an input source or an output sink for Spark applications | Spark RDD, Spark DF, Spark SQL, Hive, Impala
  • Understand the fundamentals of querying datasets in Spark | Spark RDD, Spark DF, Spark SQL
  • Filter data using Spark | Spark RDD, Spark DF, Spark SQL
  • Write queries that calculate aggregate statistics | Spark DF, Spark SQL, Hive and Impala
  • Join disparate datasets using Spark | Spark RDD, Spark DF and Spark SQL
  • Produce ranked or sorted data | Spark RDD, Spark DF, Spark SQL, Hive and Impala

Configuration
  • Supply command-line options to change your application configuration, such as increasing available memory | spark-submit and the options that can be used along with it
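To make the Data Ingest and Configuration rows concrete, here is a sketch of the kind of commands involved. The hostname, database, user, table and file names are all hypothetical placeholders; adjust them to your quickstart VM setup.

```
# Sqoop import: change the field delimiter and file format during import
# (connection details below are placeholders, not real credentials)
sqoop import \
  --connect jdbc:mysql://quickstart:3306/retail_db \
  --username retail_user -P \
  --table orders \
  --target-dir /user/cloudera/orders \
  --fields-terminated-by '\t' \
  --as-textfile

# spark-submit: supply command-line options such as increased memory
spark-submit --master yarn --executor-memory 2G --num-executors 4 my_app.py
```

These are cluster-dependent commands, so treat them as a template to adapt rather than something to copy verbatim.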

This essentially boils down to learning the tools, frameworks, libraries and technologies below. Here are the prerequisites before you start your learning journey and before practicing these technologies.
  • Basic knowledge of any programming language. A Scala or Python background makes it much easier.
  • A good understanding of what data and databases mean. Some knowledge of SQL querying also helps.
  • Finally, the most important aspect of this practical learning is to have an environment. It may take hours or days to build a Hadoop environment with this combination of technologies. Cloudera makes it easier by providing a quickstart VM that you can install on your machine. Please read the instructions carefully and watch some YouTube videos on how to set up the quickstart VM for your practice. You can download the quickstart VM here [Click HERE]
Technology to Language Mapping Table
Sr. No | Technology | Languages to be used | Description

1. HDFS | Unix-like commands
   The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines.
2. Sqoop | Unix-like commands with some SQL
   Framework for bulk data transfer between HDFS and structured datastores such as RDBMS.
3. Spark | Scala or Python
   A data analytics cluster computing framework that builds on top of HDFS. Spark provides an easier-to-use alternative to Hadoop MapReduce, solving similar problems with a fast in-memory approach and a clean functional-style API, and offers performance up to 10 times faster than previous-generation systems like MapReduce for certain applications. For the certification exam, the emphasis is on Spark, not on traditional MapReduce.
4. Spark RDD | Scala or Python
   RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark: an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
5. Spark DF | Scala or Python
   A DataFrame is a distributed collection of data organized into named columns. A DataFrame can be constructed from an array of different sources such as Hive tables, structured data files, external databases, or existing RDDs.
6. Spark Streaming | Scala or Python
   An extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.
7. Spark SQL | Scala, Python and SQL
   A Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.
8. Spark Submit | Unix-like commands
   A mechanism to run Spark programs as applications by supplying configurable parameters that optimize Spark code execution.
9. Flume | Unix-like commands
   A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple, flexible architecture based on streaming data flows, is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms, and uses a simple extensible data model that allows for online analytic applications.
10. Hive | SQL
    Hive provides a SQL-like interface to data stored in HDFS.
11. Impala | SQL
    Cloudera's own Hive-like SQL interface, which uses its own execution engine instead of relying on MapReduce or Spark for processing.
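Before opening the Spark shell, it helps to see what RDD-style transformations like map and reduceByKey actually compute. The sketch below uses plain Python lists and built-ins, not a real SparkContext, purely to preview the word-count pattern you will later write against the Spark RDD API.

```python
from collections import defaultdict
from functools import reduce

data = ["spark", "hadoop", "spark", "hive"]

# map: pair each word with a count of 1 (like rdd.map(lambda w: (w, 1)))
pairs = [(w, 1) for w in data]

# reduceByKey: group values by key, then combine them with a function
# (Spark does this per key across partitions; here it is a single list)
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)
counts = {k: reduce(lambda a, b: a + b, vs) for k, vs in grouped.items()}

print(counts)  # {'spark': 2, 'hadoop': 1, 'hive': 1}
```

In the real API the same idea is a one-liner along the lines of `rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`, but the mechanics are exactly what the plain-Python version shows.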

Now that we have the skill-to-technology mapping, let's translate it into a work schedule. I am assuming you can spend 2 hours a day, 5 to 6 days a week. Given that Big Data is a mesmerizing world, I will not be surprised if you spend more than 2 hours a day purely out of interest and curiosity to learn. Hence, a 6-week preparation should be good enough to crack the certification. I personally know people who did this in 2 weeks, so nothing is impossible. There are just 6 weeks between the current "you" (possibly a nobody in the Big Data context) and a certified Hadoop and Spark developer. Are you up to the challenge? If the technologist and curious learner inside you is urging you to shout 'YES, I AM UP TO THE CHALLENGE', then I recommend that you either spend the next few weeks equipping yourself with these technologies and come back to this blog when trying to accomplish the 15th task in the schedule below, or use the videos in this blog to gain understanding in a more real-time learning environment where you learn concepts on the fly and in a hands-on fashion.


Study plan for each Technology
Sr # | Task | Hrs of Study | Hrs of Practice
1 | Setup Cloudera quickstart VM | 3 Hours | NA
2 | Introduction to Hadoop and a basic understanding of Big Data in general | 3 Hours | NA
3 | HDFS | 1 Hour | 2 Hours
4 | Sqoop | 2 Hours | 2 Hours
5 | Scala | 3 Hours | 3 Hours
6 | Python | 3 Hours | 3 Hours
7 | Spark RDD | 3 Hours | 6 Hours
8 | Spark DF | 1 Hour | 2 Hours
9 | Spark SQL | 1 Hour | 2 Hours
10 | Spark Submit | 1 Hour | 1 Hour
11 | Flume | 1 Hour | 3 Hours
12 | Spark Streaming | 1 Hour | 1 Hour
13 | Hive | 3 Hours | 5 Hours
14 | Impala | 1 Hour | 2 Hours
15 | Scenarios | NA | 4 Hours
Total | 27 Hours | 36 Hours
Grand Total | 63 Hours
Total weeks of prep at 2 hours a day, 5 days a week | Around 6 Weeks


Note: The exam course content may change in the future. You can find the latest course content on the Cloudera website; the official documentation is also available online during the examination.

*************************************  Some Important Points to Remember  **********************************


1. Number of Questions: You will get a total of 10 to 12 questions from the above topics.
2. Pass Mark: You need to score 70% to clear the certification.
3. Code Snippet: A snippet will be provided for both PySpark and Scala. You need to edit the snippet as per your problem statement.
4. Real Exam Environment: 





All the very Best !!!













Map Reduce Overview


MapReduce

To take advantage of Hadoop's parallel processing, the query must be expressed in MapReduce form. MapReduce is a paradigm with two phases: the mapper phase and the reducer phase. The mapper receives its input as key-value pairs. The output of the mapper is fed to the reducer as input, and the reducer runs only after the mapper is finished. The reducer also takes input in key-value format, and the output of the reducer is the final output.

Steps in Map Reduce

  • Map takes data in the form of pairs and returns a list of <key, value> pairs. The keys need not be unique at this stage.
  • Using the output of Map, the Hadoop framework applies sort and shuffle. Sort and shuffle act on this list of <key, value> pairs and emit each unique key together with the list of values associated with it: <key, list(values)>.
  • The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined function on the list of values for each unique key, and the final <key, value> output is stored or displayed.
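The three steps above can be simulated in a few lines of plain Python. This is a sketch, not real Hadoop: the input lines and the "max temperature per year" reduce function are made-up examples, and the whole pipeline runs in one process instead of across a cluster.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: "year,temperature" records
lines = ["2019,31", "2020,35", "2019,28", "2020,33"]

# Map phase: emit (key, value) pairs; keys are not unique here
mapped = []
for line in lines:
    year, temp = line.split(",")
    mapped.append((year, int(temp)))

# Sort and shuffle: sort by key, then collect the value list per unique key
mapped.sort(key=itemgetter(0))
shuffled = {k: [v for _, v in grp] for k, grp in groupby(mapped, key=itemgetter(0))}
# shuffled == {'2019': [31, 28], '2020': [35, 33]}

# Reduce phase: apply a defined function (here max) to each key's value list
result = {k: max(vs) for k, vs in shuffled.items()}
print(result)  # {'2019': 31, '2020': 35}
```

The shape of the data at each stage matches the steps above: <key, value> pairs out of map, <key, list(values)> out of sort and shuffle, and a final <key, value> per unique key out of reduce.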

How Many Maps

The size of the data to be processed decides the number of maps required. For example, if we have 1000 MB of data and the block size is 64 MB, we need ceil(1000 / 64) = 16 mappers.
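The mapper count is just a ceiling division of data size by block size, as this small check confirms:

```python
import math

data_size_mb = 1000
block_size_mb = 64

# Number of input splits (and hence map tasks) = ceiling of data size / block size
num_mappers = math.ceil(data_size_mb / block_size_mb)
print(num_mappers)  # 16
```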

Sort and Shuffle

The sort and shuffle occur on the output of the mapper, before the reducer. When the mapper task is complete, the results are sorted by key, partitioned if there are multiple reducers, and then written to disk. Using the input from each mapper, <k2, v2>, we collect all the values for each unique key k2. This output of the shuffle phase, in the form <k2, list(v2)>, is sent as input to the reducer phase.
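The "partitioned if there are multiple reducers" step is done by a partitioner that assigns each key to a reducer. Hadoop's default HashPartitioner does roughly the following; this is a plain-Python sketch of the idea, not the actual Java implementation:

```python
def default_partition(key: str, num_reducers: int) -> int:
    # Mirrors the shape of Hadoop's HashPartitioner:
    # (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# Every occurrence of the same key maps to the same reducer,
# which is what guarantees all values for a key meet in one reduce call
r1 = default_partition("2019", 4)
r2 = default_partition("2019", 4)
print(r1 == r2)  # True
```

The important property is determinism per key, not which reducer a key lands on.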



The Algorithm

  • Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
  • MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce stage.
    • Map stage : The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
    • Reduce stage : This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.
  • During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
  • The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
  • Most of the computing takes place on nodes with data on local disks that reduces the network traffic.
  • After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.


    MapReduce Algorithm

    Inputs and Outputs (Java Perspective)

    The MapReduce framework operates on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
    The key and value classes must be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. Input and output types of a MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).

             Input            Output
    Map      <k1, v1>         list(<k2, v2>)
    Reduce   <k2, list(v2)>   list(<k3, v3>)

    Terminology

  • PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.
  • Mapper - Mapper maps the input key/value pairs to a set of intermediate key/value pair.
  • NameNode - Node that manages the Hadoop Distributed File System (HDFS).
  • DataNode - Node where data is presented in advance before any processing takes place.
  • MasterNode - Node where JobTracker runs and which accepts job requests from clients.
  • SlaveNode - Node where Map and Reduce program runs.
  • JobTracker - Schedules jobs and tracks the assigned jobs for the Task Tracker.
  • Task Tracker - Tracks the task and reports status to JobTracker.
  • Job - An execution of a Mapper and Reducer across a dataset.
  • Task - An execution of a Mapper or a Reducer on a slice of data.
  • Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.






HDFS Architecture


HDFS Architecture

This architecture gives you a complete picture of HDFS. There is a single namenode, which stores metadata, and there are multiple datanodes, which do the actual storage work. Nodes are arranged in racks, and replicas of data blocks are stored on different racks in the cluster to provide fault tolerance. In the remaining sections of this tutorial we will see how read and write operations are performed in HDFS. To read or write a file in HDFS, the client needs to interact with the namenode. HDFS applications need a write-once-read-many access model for files: a file once created and written cannot be edited.




The namenode stores metadata, and the datanodes store the actual data. The client interacts with the namenode for any task to be performed, as the namenode is the centerpiece of the cluster.
There are several datanodes in the cluster, which store HDFS data on local disk. Each datanode sends a periodic heartbeat message to the namenode to indicate that it is alive, and replicates data to other datanodes as per the replication factor.
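The block and replication model has a simple storage cost you can work out by hand. The numbers below (a 1 GB file, 128 MB blocks, replication factor 3) are illustrative defaults, not values fixed by HDFS itself:

```python
import math

# Hypothetical file and cluster settings
file_size_mb = 1024   # 1 GB file
block_size_mb = 128   # common HDFS block size
replication = 3       # default replication factor

# The file is split into blocks; the last block may be smaller than block_size
num_blocks = math.ceil(file_size_mb / block_size_mb)

# Each byte of the file is stored `replication` times across datanodes
cluster_storage_mb = file_size_mb * replication

print(num_blocks, cluster_storage_mb)  # 8 3072
```

So a 1 GB file costs 3 GB of raw cluster storage, with its 8 blocks (and their replicas) spread across datanodes and racks for fault tolerance.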


Copyright © Become a Big Data - Hadoop Professional Distributed By ITGetup Team & Design by Hadoop Specialist Team