Preparation Tips for CCA175

CCA175 (CCA Spark and Hadoop Developer) Certification



Preparation Plan for the CCA175 Exam

The key to any certification preparation is to have a proper plan; as the old saying goes, 'failing to plan is planning to fail'. In this blog I am going to show you one possible way to prepare for and obtain the CCA175 certification. My goal is to accomplish two things in this blog.


1.     Identify the technologies to learn in order to accomplish the certification goals.
2.     Create a realistic schedule that covers learning and practicing before appearing for the certification.

The table below maps each required skill to the technologies one needs to learn in order to solve problems during the certification exam. Remember, CCA175 is a hands-on exam. It is an open-book exam, but the only content you can access during the exam is the API and official framework documentation. Hence, it is very important to gain a good level of comfort using a set of Hadoop ecosystem technologies, generic or specific frameworks, and programming/query languages.

Exam Curriculum to Technology Mapping Table


Data Ingest
  • Import data from a MySQL database into HDFS using Sqoop (Technology: Sqoop)
  • Export data to a MySQL database from HDFS using Sqoop (Technology: Sqoop)
  • Change the delimiter and file format of data during import using Sqoop (Technology: Sqoop)
  • Ingest real-time and near-real-time streaming data into HDFS (Technology: Flume or Spark Streaming)
  • Process streaming data as it is loaded onto the cluster (Technology: Flume or Spark Streaming)
  • Load data into and out of HDFS using the Hadoop File System commands (Technology: HDFS Command Line)

Transform, Stage and Store
  • Load RDD data from HDFS for use in Spark applications (Technology: Spark RDD and Spark DF (DataFrame))
  • Write the results from an RDD back into HDFS using Spark (Technology: Spark RDD and Spark DF)
  • Read and write files in a variety of file formats (Technology: Spark RDD and Spark DF)
  • Perform standard extract, transform, load (ETL) processes on data (Technology: Spark RDD, Spark DF and Hive)

Data Analysis
  • Use metastore tables as an input source or an output sink for Spark applications (Technology: Spark RDD, Spark DF, Spark SQL, Hive and Impala)
  • Understand the fundamentals of querying datasets in Spark (Technology: Spark RDD, Spark DF and Spark SQL)
  • Filter data using Spark (Technology: Spark RDD, Spark DF and Spark SQL)
  • Write queries that calculate aggregate statistics (Technology: Spark DF, Spark SQL, Hive and Impala)
  • Join disparate datasets using Spark (Technology: Spark RDD, Spark DF and Spark SQL)
  • Produce ranked or sorted data (Technology: Spark RDD, Spark DF, Spark SQL, Hive and Impala)

Configuration
  • Supply command-line options to change your application configuration, such as increasing available memory (Technology: Spark Submit and its command-line options)

This essentially boils down to learning the tools, frameworks, libraries and technologies below. Here are the prerequisites before you start your learning journey and begin practicing these technologies.
  • Basic knowledge of any programming language. A Scala or Python background makes it much easier.
  • A good understanding of what data and databases mean. Some knowledge of SQL querying also helps.
  • Finally, the most important aspect of this practical learning is to have an environment. It may take hours or days for you to build a Hadoop environment with this combination of technologies. Cloudera makes it easier by providing a QuickStart VM that you can install on your machine. Please read the instructions carefully and watch some YouTube videos on how to set up the QuickStart VM for your practice. You can download the QuickStart VM here [Click HERE]
Technology to Language Mapping Table
1. HDFS (Unix-like commands): The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines.
2. Sqoop (Unix-like commands with some SQL): A framework for bulk data transfer between HDFS and structured datastores such as RDBMSs.
3. Spark (Scala or Python): A data analytics cluster computing framework for writing fast, distributed programs. Spark fits into the Hadoop open-source community, building on top of HDFS. It solves similar problems to Hadoop MapReduce, but with a fast in-memory approach and a clean, functional-style API, offering performance up to 10 times faster than previous-generation systems like MapReduce for certain applications. For the certification exam, the emphasis is on Spark, not traditional MapReduce.
4. Spark RDD (Scala or Python): A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable distributed collection of objects; each dataset is divided into logical partitions, which may be computed on different nodes of the cluster.
5. Spark DF (Scala or Python): A DataFrame is a distributed collection of data organized into named columns. A DataFrame can be constructed from a variety of sources such as Hive tables, structured data files, external databases, or existing RDDs.
6. Spark Streaming (Scala or Python): Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. It provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.
7. Spark SQL (Scala, Python and SQL): Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed; internally, Spark SQL uses this extra information to perform extra optimizations.
8. Spark Submit (Unix-like commands): A mechanism to run Spark programs as applications by supplying configurable parameters that tune Spark code execution.
9. Flume (Unix-like commands): Flume is a distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data. It has a simple, flexible architecture based on streaming data flows; it is robust and fault tolerant, with tunable reliability and many failover and recovery mechanisms, and it uses a simple extensible data model that allows for online analytic applications.
10. Hive (SQL): Hive provides a SQL-like interface to data stored in HDFS.
11. Impala (SQL): Impala is Cloudera's own Hive-like interface, but it uses its own engine instead of relying on the Spark or MapReduce engines for processing.

Now that we have the technology-to-skill mapping, let's translate it into a work schedule. I am assuming that you can spend 2 hours a day, 5 to 6 days a week. Given that Big Data is a mesmerizing world, I will not be surprised if you spend more than 2 hours a day purely out of interest and curiosity. Hence, six weeks of preparation should be good enough to crack the certification. I personally know people who did this in 2 weeks, so nothing is impossible. So, there are just 6 weeks between the current "you" (possibly a nobody in the big data context) and being somebody: a certified Hadoop and Spark developer. Are you up to the challenge? If the technologist and curious learner inside you is urging you to shout 'YES, I AM UP TO THE CHALLENGE', then I recommend that you either spend the next few weeks equipping yourself with these technologies and come back to this blog when trying to accomplish the 15th task in the schedule below, or use the videos in this blog to gain understanding in a more real-time learning environment where you learn concepts on the fly, in a hands-on fashion.


Study Plan for Each Technology
1. Set up the Cloudera QuickStart VM: 3 hours of study
2. Introduction to Hadoop and a basic understanding of Big Data: 3 hours of study
3. HDFS: 1 hour of study, 2 hours of practice
4. Sqoop: 2 hours of study, 2 hours of practice
5. Scala: 3 hours of study, 3 hours of practice
6. Python: 3 hours of study, 3 hours of practice
7. Spark RDD: 3 hours of study, 6 hours of practice
8. Spark DF: 1 hour of study, 2 hours of practice
9. Spark SQL: 1 hour of study, 2 hours of practice
10. Spark Submit: 1 hour of study, 1 hour of practice
11. Flume: 1 hour of study, 3 hours of practice
12. Spark Streaming: 1 hour of study, 1 hour of practice
13. Hive: 3 hours of study, 5 hours of practice
14. Impala: 1 hour of study, 2 hours of practice
15. Scenarios: 4 hours of practice

Total: 27 hours of study and 36 hours of practice
Grand total: 63 hours
Total weeks of prep at 2 hours a day, 5 days a week: around 6 weeks


Note: The exam course content may change in the future, so check the Cloudera website for the latest course content. The official documentation is available online during the examination.

*************************************  Some Important Points to Remember  **********************************


1. Number of Questions: You will get a total of 10 to 12 questions from the above topics.
2. Pass Mark: You need to score 70% to clear the certification.
3. Code Snippet: A snippet will be provided for PySpark and Scala. You need to edit the snippet as per your problem statement.
4. Real Exam Environment:





All the very Best !!!














Copyright © Become a Big Data - Hadoop Professional Distributed By ITGetup Team & Design by Hadoop Specialist Team