Preparation Tips for CCA175

CCA175 (CCA Spark and Hadoop Developer) Certification



Pre Plan for CCA175 Exam

The key to any certification preparation is a proper plan; as the old saying goes, 'failing to plan is planning to fail'. In this blog I am going to show you one possible way to prepare for and obtain the CCA175 certification. My goal is to accomplish two things in this post:


1. Identify the technologies to learn in order to accomplish the certification goals.
2. Create a realistic schedule that covers learning and practicing before appearing for the certification.

The table below maps each required skill to the technologies you need to learn in order to solve problems during the certification exam. Remember, CCA175 is a hands-on exam; it is open book, but the only content you can access during the exam is the API and official framework documentation. Hence, it is very important to gain a good level of comfort using this set of Hadoop ecosystem technologies, generic or specific frameworks, and programming/query languages.

Exam Curriculum to Technology Mapping Table


Skill Description | Technology To Be Used

Data Ingest
  • Import data from a MySQL database into HDFS using Sqoop | Sqoop
  • Export data to a MySQL database from HDFS using Sqoop | Sqoop
  • Change the delimiter and file format of data during import using Sqoop | Sqoop
  • Ingest real-time and near-real-time streaming data into HDFS | Flume or Spark Streaming
  • Process streaming data as it is loaded onto the cluster | Flume or Spark Streaming
  • Load data into and out of HDFS using the Hadoop File System commands | HDFS command line

Transform, Stage and Store
  • Load RDD data from HDFS for use in Spark applications | Spark RDD and Spark DF (DataFrame)
  • Write the results from an RDD back into HDFS using Spark | Spark RDD and Spark DF
  • Read and write files in a variety of file formats | Spark RDD and Spark DF
  • Perform standard extract, transform, load (ETL) processes on data | Spark RDD, Spark DF and Hive

Data Analysis
  • Use metastore tables as an input source or an output sink for Spark applications | Spark RDD, Spark DF, Spark SQL, Hive, Impala
  • Understand the fundamentals of querying datasets in Spark | Spark RDD, Spark DF, Spark SQL
  • Filter data using Spark | Spark RDD, Spark DF, Spark SQL
  • Write queries that calculate aggregate statistics | Spark DF, Spark SQL, Hive and Impala
  • Join disparate datasets using Spark | Spark RDD, Spark DF and Spark SQL
  • Produce ranked or sorted data | Spark RDD, Spark DF, Spark SQL, Hive and Impala

Configuration
  • Supply command-line options to change your application configuration, such as increasing available memory | spark-submit and the options that can be used along with it
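To make the Data Ingest and Configuration rows concrete, here is a sketch of the kind of commands involved. The hostname, database, user, table and file names are all hypothetical placeholders; adjust them to your quickstart VM setup.

```
# Sqoop import: change the field delimiter and file format during import
# (connection details below are placeholders, not real credentials)
sqoop import \
  --connect jdbc:mysql://quickstart:3306/retail_db \
  --username retail_user -P \
  --table orders \
  --target-dir /user/cloudera/orders \
  --fields-terminated-by '\t' \
  --as-textfile

# spark-submit: supply command-line options such as increased memory
spark-submit --master yarn --executor-memory 2G --num-executors 4 my_app.py
```

These are cluster-dependent commands, so treat them as a template to adapt rather than something to copy verbatim.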

This essentially boils down to learning the tools, frameworks, libraries and technologies below. Here are the prerequisites before you start your learning journey and before practicing these technologies.
  • Basic knowledge of any programming language. A Scala or Python background makes it much easier.
  • A good understanding of what data and databases mean. Some knowledge of SQL querying also helps.
  • Finally, the most important aspect of this practical learning is to have an environment. It may take hours or days to build a Hadoop environment with this combination of technologies. Cloudera makes it easier by providing a quickstart VM that you can install on your machine. Please read the instructions carefully and watch some YouTube videos on how to set up the quickstart VM for your practice. You can download the quickstart VM here [Click HERE]
Technology to Language Mapping Table
Sr. No | Technology | Languages to be used | Description

1. HDFS | Unix-like commands
   The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines.
2. Sqoop | Unix-like commands with some SQL
   Framework for bulk data transfer between HDFS and structured datastores such as RDBMS.
3. Spark | Scala or Python
   A data analytics cluster computing framework that builds on top of HDFS. Spark provides an easier-to-use alternative to Hadoop MapReduce, solving similar problems with a fast in-memory approach and a clean functional-style API, and offers performance up to 10 times faster than previous-generation systems like MapReduce for certain applications. For the certification exam, the emphasis is on Spark, not on traditional MapReduce.
4. Spark RDD | Scala or Python
   RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark: an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
5. Spark DF | Scala or Python
   A DataFrame is a distributed collection of data organized into named columns. A DataFrame can be constructed from an array of different sources such as Hive tables, structured data files, external databases, or existing RDDs.
6. Spark Streaming | Scala or Python
   An extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.
7. Spark SQL | Scala, Python and SQL
   A Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.
8. Spark Submit | Unix-like commands
   A mechanism to run Spark programs as applications by supplying configurable parameters that optimize Spark code execution.
9. Flume | Unix-like commands
   A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple, flexible architecture based on streaming data flows, is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms, and uses a simple extensible data model that allows for online analytic applications.
10. Hive | SQL
    Hive provides a SQL-like interface to data stored in HDFS.
11. Impala | SQL
    Cloudera's own Hive-like SQL interface, which uses its own execution engine instead of relying on MapReduce or Spark for processing.
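Before opening the Spark shell, it helps to see what RDD-style transformations like map and reduceByKey actually compute. The sketch below uses plain Python lists and built-ins, not a real SparkContext, purely to preview the word-count pattern you will later write against the Spark RDD API.

```python
from collections import defaultdict
from functools import reduce

data = ["spark", "hadoop", "spark", "hive"]

# map: pair each word with a count of 1 (like rdd.map(lambda w: (w, 1)))
pairs = [(w, 1) for w in data]

# reduceByKey: group values by key, then combine them with a function
# (Spark does this per key across partitions; here it is a single list)
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)
counts = {k: reduce(lambda a, b: a + b, vs) for k, vs in grouped.items()}

print(counts)  # {'spark': 2, 'hadoop': 1, 'hive': 1}
```

In the real API the same idea is a one-liner along the lines of `rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`, but the mechanics are exactly what the plain-Python version shows.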

Now that we have the skill-to-technology mapping, let's translate it into a work schedule. I am assuming you can spend 2 hours a day, 5 to 6 days a week. Given that Big Data is a mesmerizing world, I will not be surprised if you spend more than 2 hours a day purely out of interest and curiosity to learn. Hence, a 6-week preparation should be good enough to crack the certification. I personally know people who did this in 2 weeks, so nothing is impossible. There are just 6 weeks between the current "you" (possibly a nobody in the Big Data context) and a certified Hadoop and Spark developer. Are you up to the challenge? If the technologist and curious learner inside you is urging you to shout 'YES, I AM UP TO THE CHALLENGE', then I recommend that you either spend the next few weeks equipping yourself with these technologies and come back to this blog when trying to accomplish the 15th task in the schedule below, or use the videos in this blog to gain understanding in a more real-time learning environment where you learn concepts on the fly and in a hands-on fashion.


Study plan for each Technology
Sr # | Task | Hrs of Study | Hrs of Practice
1 | Setup Cloudera quickstart VM | 3 Hours | NA
2 | Introduction to Hadoop and a basic understanding of Big Data in general | 3 Hours | NA
3 | HDFS | 1 Hour | 2 Hours
4 | Sqoop | 2 Hours | 2 Hours
5 | Scala | 3 Hours | 3 Hours
6 | Python | 3 Hours | 3 Hours
7 | Spark RDD | 3 Hours | 6 Hours
8 | Spark DF | 1 Hour | 2 Hours
9 | Spark SQL | 1 Hour | 2 Hours
10 | Spark Submit | 1 Hour | 1 Hour
11 | Flume | 1 Hour | 3 Hours
12 | Spark Streaming | 1 Hour | 1 Hour
13 | Hive | 3 Hours | 5 Hours
14 | Impala | 1 Hour | 2 Hours
15 | Scenarios | NA | 4 Hours
Total | 27 Hours | 36 Hours
Grand Total | 63 Hours
Total weeks of prep at 2 hours a day, 5 days a week | Around 6 Weeks


Note: The exam course content may change in the future. You can find the latest course content on the Cloudera website; the official documentation is also available online during the examination.

*************************************  Some Important Points to Remember  **********************************


1. Number of Questions: You will get a total of 10 to 12 questions from the above topics.
2. Pass Mark: You need to score 70% to clear the certification.
3. Code Snippet: A snippet will be provided for both PySpark and Scala. You need to edit the snippet as per your problem statement.
4. Real Exam Environment: 





All the very Best !!!













Map Reduce Overview


MapReduce

To take advantage of Hadoop's parallel processing, the query must be expressed in MapReduce form. MapReduce is a paradigm with two phases: the mapper phase and the reducer phase. The mapper receives its input as key-value pairs. The output of the mapper is fed to the reducer as input, and the reducer runs only after the mapper is finished. The reducer also takes input in key-value format, and the output of the reducer is the final output.

Steps in Map Reduce

  • Map takes data in the form of pairs and returns a list of <key, value> pairs. The keys need not be unique at this stage.
  • Using the output of Map, the Hadoop framework applies sort and shuffle. Sort and shuffle act on this list of <key, value> pairs and emit each unique key together with the list of values associated with it: <key, list(values)>.
  • The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined function on the list of values for each unique key, and the final <key, value> output is stored or displayed.
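The three steps above can be simulated in a few lines of plain Python. This is a sketch, not real Hadoop: the input lines and the "max temperature per year" reduce function are made-up examples, and the whole pipeline runs in one process instead of across a cluster.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: "year,temperature" records
lines = ["2019,31", "2020,35", "2019,28", "2020,33"]

# Map phase: emit (key, value) pairs; keys are not unique here
mapped = []
for line in lines:
    year, temp = line.split(",")
    mapped.append((year, int(temp)))

# Sort and shuffle: sort by key, then collect the value list per unique key
mapped.sort(key=itemgetter(0))
shuffled = {k: [v for _, v in grp] for k, grp in groupby(mapped, key=itemgetter(0))}
# shuffled == {'2019': [31, 28], '2020': [35, 33]}

# Reduce phase: apply a defined function (here max) to each key's value list
result = {k: max(vs) for k, vs in shuffled.items()}
print(result)  # {'2019': 31, '2020': 35}
```

The shape of the data at each stage matches the steps above: <key, value> pairs out of map, <key, list(values)> out of sort and shuffle, and a final <key, value> per unique key out of reduce.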

How Many Maps

The size of the data to be processed decides the number of maps required. For example, if we have 1000 MB of data and the block size is 64 MB, we need ceil(1000 / 64) = 16 mappers.
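The mapper count is just a ceiling division of data size by block size, as this small check confirms:

```python
import math

data_size_mb = 1000
block_size_mb = 64

# Number of input splits (and hence map tasks) = ceiling of data size / block size
num_mappers = math.ceil(data_size_mb / block_size_mb)
print(num_mappers)  # 16
```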

Sort and Shuffle

The sort and shuffle occur on the output of the mapper, before the reducer. When the mapper task is complete, the results are sorted by key, partitioned if there are multiple reducers, and then written to disk. Using the input from each mapper, <k2, v2>, we collect all the values for each unique key k2. This output of the shuffle phase, in the form <k2, list(v2)>, is sent as input to the reducer phase.
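The "partitioned if there are multiple reducers" step is done by a partitioner that assigns each key to a reducer. Hadoop's default HashPartitioner does roughly the following; this is a plain-Python sketch of the idea, not the actual Java implementation:

```python
def default_partition(key: str, num_reducers: int) -> int:
    # Mirrors the shape of Hadoop's HashPartitioner:
    # (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# Every occurrence of the same key maps to the same reducer,
# which is what guarantees all values for a key meet in one reduce call
r1 = default_partition("2019", 4)
r2 = default_partition("2019", 4)
print(r1 == r2)  # True
```

The important property is determinism per key, not which reducer a key lands on.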



The Algorithm

  • Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
  • MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce stage.
    • Map stage : The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
    • Reduce stage : This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.
  • During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
  • The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
  • Most of the computing takes place on nodes with data on local disks that reduces the network traffic.
  • After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.


    MapReduce Algorithm

    Inputs and Outputs (Java Perspective)

    The MapReduce framework operates on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
    The key and value classes must be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. Input and output types of a MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).

             Input            Output
    Map      <k1, v1>         list(<k2, v2>)
    Reduce   <k2, list(v2)>   list(<k3, v3>)

    Terminology

  • PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.
  • Mapper - Mapper maps the input key/value pairs to a set of intermediate key/value pair.
  • NameNode - Node that manages the Hadoop Distributed File System (HDFS).
  • DataNode - Node where data is presented in advance before any processing takes place.
  • MasterNode - Node where JobTracker runs and which accepts job requests from clients.
  • SlaveNode - Node where Map and Reduce program runs.
  • JobTracker - Schedules jobs and tracks the assigned jobs for the Task Tracker.
  • Task Tracker - Tracks the task and reports status to JobTracker.
  • Job - An execution of a Mapper and Reducer across a dataset.
  • Task - An execution of a Mapper or a Reducer on a slice of data.
  • Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.






HDFS Architecture


HDFS Architecture

This architecture gives you a complete picture of HDFS. There is a single namenode, which stores metadata, and there are multiple datanodes, which do the actual storage work. Nodes are arranged in racks, and replicas of data blocks are stored on different racks in the cluster to provide fault tolerance. In the remaining sections of this tutorial we will see how read and write operations are performed in HDFS. To read or write a file in HDFS, the client needs to interact with the namenode. HDFS applications need a write-once-read-many access model for files: a file once created and written cannot be edited.




The namenode stores metadata, and the datanodes store the actual data. The client interacts with the namenode for any task to be performed, as the namenode is the centerpiece of the cluster.
There are several datanodes in the cluster, which store HDFS data on local disk. Each datanode sends a periodic heartbeat message to the namenode to indicate that it is alive, and replicates data to other datanodes as per the replication factor.
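The block and replication model has a simple storage cost you can work out by hand. The numbers below (a 1 GB file, 128 MB blocks, replication factor 3) are illustrative defaults, not values fixed by HDFS itself:

```python
import math

# Hypothetical file and cluster settings
file_size_mb = 1024   # 1 GB file
block_size_mb = 128   # common HDFS block size
replication = 3       # default replication factor

# The file is split into blocks; the last block may be smaller than block_size
num_blocks = math.ceil(file_size_mb / block_size_mb)

# Each byte of the file is stored `replication` times across datanodes
cluster_storage_mb = file_size_mb * replication

print(num_blocks, cluster_storage_mb)  # 8 3072
```

So a 1 GB file costs 3 GB of raw cluster storage, with its 8 blocks (and their replicas) spread across datanodes and racks for fault tolerance.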


Copyright © Become a Big Data - Hadoop Professional Distributed By ITGetup Team & Design by Hadoop Specialist Team