Spark Overview

What is Apache Spark?

What is the architecture of Apache Spark?

Spark Framework:

How is Spark built, what features does it offer, what does its ecosystem look like, and what interfaces does it provide?

Spark Engine (SPARK CORE): The core of Spark is the Spark Engine, which executes any code that we write using Spark. The main function of Spark Core is to take data, split it across the nodes of a cluster, perform the requested operations, and return the desired results/output. On top of Spark Core, a set of libraries and interfaces is built.

Management: To manage Spark, especially when it runs on a cluster, we can use Hadoop YARN, Apache Mesos, or Spark's built-in standalone scheduler.

Libraries: The libraries available in Spark to manipulate and manage data:
Spark SQL: A SQL-like interface for querying data in Spark (see the example after this list).
MLlib: A machine learning library that supports a wide range of ML algorithms.
GraphX: Used to perform graph analysis.
Spark Streaming: Used to analyze real-time data, for example taking a Twitter data stream, applying an ML algorithm, making predictions, and acting on them in real time.
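For instance, Spark SQL lets you expose data as a table or view and query it with SQL. The following is a minimal sketch in Scala, assuming Spark 2.x with a SparkSession named spark; the file name and column names are illustrative only.

val logs = spark.read.json("logs.json")          // load data as a DataFrame
logs.createOrReplaceTempView("logs")             // register it so SQL can see it
val errors = spark.sql("SELECT * FROM logs WHERE level = 'ERROR'")
errors.show()                                    // display a few matching rows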

Storage: Since Spark is used to manipulate data, it needs storage to read data from and to write results to. Storage options for Spark include the local file system, HDFS, S3 (Amazon's cloud storage), RDBMS, and NoSQL stores (Cassandra, MongoDB, etc.).


Programming: Scala, Python, R, Java




Key parts of Spark

1.       Resilient Distributed Datasets (RDD):
a.       Spark is built around RDDs. You can create, transform, analyze, and store RDDs in a Spark program. In other words, in Spark, data read from any source takes the form of an RDD, transforming that data produces another RDD, and so on.
Example:
Reading a file that has 100 lines from any storage creates an RDD, say RDD01.
Transforming the data (counting or filtering the records) from RDD01 creates another RDD, say RDD02 (see the sketch below).
Note: "Dataset" means a collection of elements of any type, such as strings, lines, rows, or objects.

b.      The dataset can be partitioned and distributed across multiple nodes.
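Here is a minimal sketch of the example above in Scala, assuming a SparkContext named sc and a text file reachable by the cluster; the path and the filter condition are illustrative only.

val rdd01 = sc.textFile("hdfs:///data/sample.txt")         // read the file -> RDD01
val rdd02 = rdd01.filter(line => line.contains("ERROR"))   // transform RDD01 -> RDD02
println(rdd02.count())                                     // action: count the filtered records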


HBase QA

Q. What is HBase?

Before we dive into HBase interview questions, here is an overview of what HBase is and its features -

HBase, commonly referred to as the “Hadoop Database”, is a column-oriented database based on the principles of Google's Bigtable. HBase does not directly use the capabilities of Hadoop MapReduce, but it can integrate with Hadoop to act as a source or destination for Hadoop MapReduce jobs. HBase provides real-time read and write access to data in HDFS. Data can be stored in HDFS directly or through HBase. Just as HDFS has a NameNode and slave nodes, and Hadoop MapReduce has a JobTracker and TaskTrackers, HBase has a Master node and Region Servers: the Master node manages the cluster, while Region Servers store portions of the HBase tables and perform data model operations.

An HBase system consists of tables with rows and columns, just like a traditional RDBMS. Every table must have a primary key (the row key), which is used to access the data in HBase tables. HBase columns define the attributes of an object. For instance, if your HBase table stores web server logs, then each row in HBase will be a log record, and the columns can be the name of the server the log originated from, the time the log was written, and so on. Several attributes can be grouped together in HBase to form column families; all the elements of a single column family are stored together. Column families must be specified when defining the table schema; however, HBase is flexible, and new columns can be added to a column family at any time based on application requirements (see the shell sketch below).
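As an illustration of the web server log example above, here is a hypothetical HBase shell sketch; the table name, column family, qualifiers, and row key are made up for illustration.

hbase> create 'weblogs', 'log'                                      # table with one column family
hbase> put 'weblogs', 'web01|2016-01-01T12:00:00', 'log:server', 'web01'
hbase> put 'weblogs', 'web01|2016-01-01T12:00:00', 'log:time', '2016-01-01T12:00:00'
hbase> get 'weblogs', 'web01|2016-01-01T12:00:00'                   # read the record back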

Hallmark Features of HBase

Schema Flexibility
Scalability
High Reliability
Advantages of Using HBase

Provides RDBMS-like stored procedures and triggers in the form of coprocessors. A coprocessor in HBase is a framework that helps users run their custom code on the Region Servers.
Strong record-level consistency.
In-built versioning.

Q. Compare RDBMS with HBase

Characteristic         : RDBMS                                         : HBase
Schema                 : Has a fixed schema                            : No fixed schema
Query Language         : Supports a powerful structured query language (SQL)  : Simple query interface (get, put, scan)
Transaction Processing : Supports ACID transactions.                           : Is eventually consistent and does not support ACID transactions.

Q. What do you understand by CAP theorem and which features of CAP theorem does HBase follow?

CAP stands for Consistency, Availability and Partition Tolerance.

Consistency – At a given point in time, all nodes in a cluster are able to see the same set of data.
Availability – Every request generates a response, regardless of whether it is a success or a failure.
Partition Tolerance – The system continues to work even if part of the system fails or there is intermittent message loss.
HBase is a column-oriented database that provides consistency and partition tolerance.
3) Name a few other popular column-oriented databases like HBase.

Cassandra, Hypertable, and Apache Accumulo are other popular column-oriented databases. (MongoDB and CouchDB, often mentioned here, are document-oriented stores rather than column-oriented databases.)

Q. When should you use HBase and what are the key components of HBase?

HBase should be used when the big data application has –

1) A variable schema

2) When data is stored in the form of collections

3) When the application demands key-based access to data while retrieving it.

Key components of HBase are –

Region – This component contains the in-memory store (MemStore) and the HFiles.

Region Server – This hosts and manages the regions.

HBase Master – It is responsible for monitoring the Region Servers.

ZooKeeper – It takes care of the coordination between the HBase Master component and the clients.

Catalog Tables – The two important catalog tables are -ROOT- and .META. The -ROOT- table tracks where the .META. table is, and the .META. table stores the locations of all the regions in the system.

2. What are the different operational commands in HBase at record level and table level?

Record-level operational commands in HBase are – put, get, increment, scan, and delete.

Table-level operational commands in HBase are – create, describe, list, disable, and drop (see the shell session below).
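A short illustrative HBase shell session exercising both levels; the table name, column family, and values are hypothetical.

hbase> create 'employee', 'personal'                      # table level
hbase> describe 'employee'                                # table level
hbase> put 'employee', 'row1', 'personal:name', 'Alice'   # record level
hbase> get 'employee', 'row1'                             # record level
hbase> scan 'employee'                                    # record level
hbase> disable 'employee'                                 # table level
hbase> drop 'employee'                                    # table level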

3. What is Row Key?

Every row in an HBase table has a unique identifier known as the RowKey. It is used for grouping cells logically, and it ensures that all cells that have the same RowKey are co-located on the same server. The RowKey is internally regarded as a byte array.

4. Explain the difference between RDBMS data model and HBase data model.

RDBMS is a schema-based database, whereas HBase has a schema-less data model.

RDBMS does not have support for in-built partitioning whereas in HBase there is automated partitioning.

RDBMS stores normalized data whereas HBase stores de-normalized data.

5. Explain about the different catalog tables in HBase?

The two important catalog tables in HBase are -ROOT- and .META. The -ROOT- table tracks where the .META. table is, and the .META. table stores the locations of all the regions in the system.

6. What are column families? What happens if you alter the block size of a column family on an already populated database?

The logical division of data is represented through a key known as the column family. Column families are the basic unit of physical storage, and features such as compression are applied at the column family level. In an already populated database, when the block size of a column family is altered, the old data remains within the old block size, whereas new data that comes in takes the new block size. When compaction takes place, the old data is rewritten with the new block size so that the existing data is read correctly (see the shell example below).
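For reference, the block size of an existing column family can be changed from the HBase shell roughly as follows; the table name, column family, and block size are illustrative only (older HBase versions require disabling the table before altering it).

hbase> disable 'employee'
hbase> alter 'employee', {NAME => 'personal', BLOCKSIZE => '131072'}   # switch to a 128 KB block size
hbase> enable 'employee'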

7. Explain the difference between HBase and Hive.

HBase and Hive are two quite different Hadoop-based technologies: Hive is a data warehouse infrastructure on top of Hadoop, whereas HBase is a NoSQL key-value store that runs on top of Hadoop. Hive helps SQL-savvy people run MapReduce jobs, whereas HBase supports four primary operations: put, get, scan, and delete. HBase is ideal for real-time querying of big data, whereas Hive is an ideal choice for analytical querying of data collected over a period of time.

8. Explain the process of row deletion in HBase.

On issuing a delete command in HBase through the HBase client, data is not actually deleted from the cells but rather the cells are made invisible by setting a tombstone marker. The deleted cells are removed at regular intervals during compaction.

9. What are the different types of tombstone markers in HBase for deletion?

There are 3 different types of tombstone markers in HBase for deletion-

1) Family Delete Marker – This marker marks all the columns of a column family for deletion.

2) Version Delete Marker – This marker marks a single version of a column.

3) Column Delete Marker – This marker marks all the versions of a column.

10. Explain about HLog and WAL in HBase.

All edits to the HStore are recorded in the HLog. Every Region Server has one HLog, and the HLog contains entries for the edits of all regions served by that Region Server. WAL stands for Write Ahead Log; all edits are written to the WAL (the HLog) immediately. In the case of deferred log flush, WAL edits remain in memory until the flush period.

4) What do you understand by Filters in HBase?

HBase filters enhance the effectiveness of working with large data stored in tables by allowing users to add limiting selectors to a query and eliminate the data that is not required. Filters have access to the complete row to which they are applied. HBase ships with a large set of built-in filters, including –

TimestampsFilter
PageFilter
MultipleColumnPrefixFilter
FamilyFilter
ColumnPaginationFilter
SingleColumnValueFilter
RowFilter
QualifierFilter
ColumnRangeFilter
ValueFilter
PrefixFilter
SingleColumnValueExcludeFilter
ColumnCountGetFilter
InclusiveStopFilter
DependentColumnFilter
FirstKeyOnlyFilter
KeyOnlyFilter

5) Explain about the data model operations in HBase.

Put Method – To store data in HBase

Get Method – To retrieve data stored in HBase.

Delete Method- To delete the data from HBase tables.

Scan Method – To iterate over the data across a range of keys or over the entire table.

6) How will you back up an HBase cluster?

HBase cluster backups are performed in 2 ways-

Live Cluster Backup

Full Shutdown Backup

In the live cluster backup strategy, the CopyTable utility is used to copy data from one table to another on the same cluster or on another cluster. The Export utility can also be used to dump the contents of a table onto HDFS on the same cluster (see the example commands below).

In the full shutdown backup approach, a periodic complete shutdown of the HBase cluster is performed so that the Master and Region Servers go down and there is little chance of losing any in-flight changes to metadata or StoreFiles. However, this kind of approach can be used only for back-end analytic capacity and not for applications that serve front-end web pages.
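For example, the live-cluster backup utilities mentioned above are typically invoked as shown below; the table names, ZooKeeper quorum, and paths are illustrative only.

# CopyTable: copy the 'orders' table to a peer cluster
$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=backup-zk1,backup-zk2,backup-zk3:2181:/hbase orders
# Export: dump the contents of 'orders' onto HDFS
$ hbase org.apache.hadoop.hbase.mapreduce.Export orders /backups/orders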

7) Does HBase support SQL like syntax?

HBase does not natively offer SQL-like syntax. However, with the use of Apache Phoenix, users can retrieve data from HBase through SQL queries (see the example below).
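A Phoenix query looks like ordinary SQL run from the Phoenix sqlline client; the table and column names below are hypothetical.

SELECT server, COUNT(*) AS hits
FROM weblogs
WHERE log_time > TO_TIMESTAMP('2016-01-01 00:00:00')
GROUP BY server;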

8) Is it possible to iterate through the rows of HBase table in reverse order?

No.

Column values are put on disk with the length of the value written first, followed by the actual value. To iterate through these values in reverse order, the bytes of the actual value would have to be written twice.

9) Should the region server be located on all DataNodes?

Yes. Region Servers run on the same servers as DataNodes.

10) Suppose that your data is stored in collections, for instance some binary data, message data or metadata is all keyed on the same value. Will you use HBase for this?

Yes, it is ideal to use HBase whenever key based access to data is required for storing and retrieving.

11)  Assume that an HBase table Student is disabled. Can you tell me how will I access the student table using Scan command once it is disabled?

Any HBase table that is disabled cannot be accessed using Scan command.

12) What do you understand by compaction?

During periods of heavy incoming writes, it is not possible to achieve optimal performance by having one file per store. HBase therefore combines these HFiles to reduce the number of disk seeks for every read. This process is referred to as compaction in HBase.

13) Explain about the various table design approaches in HBase.

Tall-Narrow and Flat-Wide are the two HBase table design approaches that can be used. Which approach should be used simply depends on what you want to achieve and how you want to use the data. The performance of HBase depends heavily on the RowKey design and hence directly on how the data is accessed.

On a high level the major difference between flat-wide and tall-narrow approach is similar to the difference between get and scan. Full scans are costly in HBase because of ordered RowKey storage policy. Tall-narrow approach can be used when there is a complex RowKey so that focused scans can be performed on logical group of entries.

Ideally, the tall-narrow approach is used when there are a large number of rows and a small number of columns, whereas the flat-wide approach is used when there are a large number of columns and a small number of rows.

14) Which one would you recommend for HBase table design approach – tall-narrow or flat wide?

There are several factors to be considered when deciding between flat-wide (millions of columns and limited keys) and tall-narrow (millions of keys with limited columns), however, a tall-narrow approach is often recommended because of the following reasons –

Under extreme scenarios, a flat-wide approach might end up with a single row per region, resulting in poor performance and scalability.
Table scans are often more efficient than many individual reads. Considering that only a subset of the row data will typically be required, the tall-narrow table design approach provides better performance than the flat-wide approach.
15) What is the best practice on deciding the number of column families for HBase table?

It is best to keep the number of column families per HBase table small (commonly no more than about 15, and in practice far fewer), because every column family in HBase is stored in its own set of files, so a large number of column families forces reads to open and merge many files.

16)  How will you implement joins in HBase?

HBase does not support joins directly but by using MapReduce jobs join queries can be implemented to retrieve data from various HBase tables.

17) What is the difference between HBase and HDFS?

HDFS is the distributed file system in Hadoop for storing large files, but it does not provide a tabular form of storage; in that respect it behaves like a plain file system (such as NTFS or FAT). Data in HDFS is accessed through MapReduce jobs and is well suited for high-latency batch processing operations.

HBase is a column-oriented database on Hadoop that runs on top of HDFS and stores data in tabular format. HBase is like a database management system that communicates with HDFS to write logical tabular data to the physical file system. HBase allows access to single rows out of the billions of records it holds and is well suited for low-latency operations. HBase puts data in indexed StoreFiles on HDFS for high-speed lookups.

HBase Interview Questions for Experienced

1) How will you design the HBase Schema for Twitter data?

2) You want to fetch data from HBase to create a REST API. Which is the best way to read HBase data using a Spark Job or a Java program?

3) Design a HBase table for many to many relationship between two entities, for example employee and department.

4) Explain an example that demonstrates good de-normalization in HBase with consistency.

5) Should your HBase and MapReduce cluster be the same, or should they be run on separate clusters?

If there are any other HBase interview questions that you have been asked in your Hadoop job interview, then feel free to share them in the comments below.


YARN QA

Q. What are the stable versions of Hadoop?

Release 2.7.1 (stable)

Release 2.4.1

Release 1.2.1 (stable)

Q. What is Apache Hadoop YARN?

YARN (Yet Another Resource Negotiator) is a powerful and efficient resource management layer rolled out as a part of Hadoop 2.0. YARN is a large-scale distributed system for running big data applications.

3) Is YARN a replacement of Hadoop MapReduce?

YARN is not a replacement of Hadoop; rather, it is a more powerful and efficient technology that supports MapReduce and is also referred to as part of Hadoop 2.0 or as MapReduce 2.

4) What are the additional benefits YARN brings in to Hadoop?

Effective utilization of resources, as multiple applications can run in YARN while sharing a common resource pool. In Hadoop MapReduce there are separate slots for Map and Reduce tasks, whereas in YARN there is no fixed slot: the same container can be used for Map and Reduce tasks, leading to better utilization.
YARN is backward compatible, so all existing MapReduce jobs can run on it.
Using YARN, one can even run applications that are not based on the MapReduce model.
5)  How can native libraries be included in YARN jobs?

There are two ways to include native libraries in YARN jobs-

1) By setting -Djava.library.path on the command line, but in this case there is a chance that the native libraries will not be loaded correctly and errors may occur.

2) The better option to include native libraries is to set the LD_LIBRARY_PATH in the .bashrc file (see the example below).
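For example, this usually amounts to adding a line like the following to ~/.bashrc; the library path shown is illustrative and depends on where the native libraries are installed.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/hadoop/lib/native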

6) Explain the differences between Hadoop 1.x and Hadoop 2.x

In Hadoop 1.x, MapReduce is responsible for both processing and cluster management whereas in Hadoop 2.x processing is taken care of by other processing models and YARN is responsible for cluster management.
Hadoop 2.x scales better when compared to Hadoop 1.x with close to 10000 nodes per cluster.
Hadoop 1.x has single point of failure problem and whenever the NameNode fails it has to be recovered manually. However, in case of Hadoop 2.x StandBy NameNode overcomes the SPOF problem and whenever the NameNode fails it is configured for automatic recovery.
Hadoop 1.x works on the concept of slots whereas Hadoop 2.x works on the concept of containers and can also run generic tasks.
7) What are the core changes in Hadoop 2.0?

Hadoop 2.x provides an upgrade to Hadoop 1.x in terms of resource management, scheduling, and the manner in which execution occurs. In Hadoop 2.x the cluster resource management capabilities work in isolation from the MapReduce-specific programming logic. This helps Hadoop share resources dynamically between multiple parallel processing frameworks, like Impala and the core MapReduce component. Hadoop 2.x allows workable and fine-grained resource configuration, leading to efficient and better cluster utilization, so that applications can scale to process a larger number of jobs.

8) Differentiate between NFS, Hadoop NameNode and JournalNode.

HDFS is a write-once file system, so a user cannot update a file once it exists; it can only be read. However, under certain scenarios in the enterprise environment, like file uploading, file downloading, file browsing, or data streaming, it is not possible to achieve all of this using standard HDFS alone. This is where a distributed file system protocol, Network File System (NFS), is used. NFS allows access to files on remote machines similar to how the local file system is accessed by applications.

Namenode is the heart of the HDFS file system that maintains the metadata and tracks where the file data is kept across the Hadoop cluster.

StandBy and Active NameNodes communicate with a group of lightweight nodes to keep their state synchronized. These are known as JournalNodes.

9) What are the modules that constitute the Apache Hadoop 2.0 framework?

Hadoop 2.0 contains four important modules of which 3 are inherited from Hadoop 1.0 and a new module YARN is added to it.

Hadoop Common – This module consists of all the basic utilities and libraries that are required by the other modules.
HDFS- Hadoop Distributed file system that stores huge volumes of data on commodity machines across the cluster.
MapReduce- Java based programming model for data processing.
YARN- This is a new module introduced in Hadoop 2.0 for cluster resource management and job scheduling.


10) How is the distance between two nodes defined in Hadoop?

Measuring bandwidth is difficult in Hadoop, so the network is represented as a tree. The distance between two nodes in the tree plays a vital role in forming a Hadoop cluster and is defined by the network topology and the Java interface DNSToSwitchMapping. The distance is equal to the sum of the distances from each node to their closest common ancestor. The method getDistance(Node node1, Node node2) is used to calculate the distance between two nodes, with the assumption that the distance from a node to its parent node is always 1.


ZooKeeper QA


Q. Can Apache Kafka be used without Zookeeper?

It is not possible to use Apache Kafka without ZooKeeper, because if ZooKeeper is down Kafka cannot serve client requests.

2) Name a few companies that use Zookeeper.

Yahoo, Solr, Helprace, Neo4j, Rackspace

3) What is the role of Zookeeper in HBase architecture?

In the HBase architecture, ZooKeeper is the monitoring server that provides different services, such as tracking server failures and network partitions, maintaining the configuration information, establishing communication between the clients and Region Servers, and providing ephemeral nodes to identify the available servers in the cluster.

4) Explain about ZooKeeper in Kafka

Apache Kafka uses ZooKeeper to be a highly distributed and scalable system. ZooKeeper is used by Kafka to store various configurations and use them across the cluster in a distributed manner. To achieve this, configurations are distributed and replicated throughout the leader and follower nodes in the ZooKeeper ensemble. We cannot connect to Kafka directly by bypassing ZooKeeper, because if ZooKeeper is down it will not be able to serve client requests.

5) Explain how Zookeeper works

ZooKeeper is referred to as the King of Coordination and distributed applications use ZooKeeper to store and facilitate important configuration information updates. ZooKeeper works by coordinating the processes of distributed applications. ZooKeeper is a robust replicated synchronization service with eventual consistency. A set of nodes is known as an ensemble and persisted data is distributed between multiple nodes.

Three or more independent servers collectively form a ZooKeeper ensemble and elect a leader. A client connects to any one of the servers and migrates to another if that node fails. The ensemble of ZooKeeper nodes stays alive as long as a majority of the nodes are working. The leader in ZooKeeper is dynamically selected by consensus within the ensemble, so if the leader fails, the leader role migrates to another dynamically selected node. Writes are linear and reads are concurrent in ZooKeeper.

6) List some examples of Zookeeper use cases.

Found by Elastic uses ZooKeeper comprehensively for resource allocation, leader election, high-priority notifications, and discovery. The entire Found service is built up of various systems that read from and write to ZooKeeper.
Apache Kafka, which depends on ZooKeeper, is used by LinkedIn.
Apache Storm, which relies on ZooKeeper, is used by popular companies like Groupon and Twitter.
7) How to use Apache Zookeeper command line interface?

ZooKeeper has a command line client support for interactive use. The command line interface of ZooKeeper is similar to the file and shell system of UNIX. Data in ZooKeeper is stored in a hierarchy of Znodes where each znode can contain data just similar to a file. Each znode can also have children just like directories in the UNIX file system.

The zkCli.sh script (called zookeeper-client in some distributions) is used to launch the command line client. If the initial prompt is hidden by the log messages that appear after entering the command, users can just hit ENTER to view the prompt (see the example session below).
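A short illustrative session; the znode name and data are made up.

$ zkCli.sh -server localhost:2181
[zk: localhost:2181(CONNECTED) 0] create /app_config "v1"      # create a znode holding some data
[zk: localhost:2181(CONNECTED) 1] ls /                         # list the children of the root znode
[zk: localhost:2181(CONNECTED) 2] get /app_config              # read the znode's data
[zk: localhost:2181(CONNECTED) 3] delete /app_config           # remove the znode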

8) What are the different types of Znodes?

There are two types of znodes, namely ephemeral and sequential znodes.

The znodes that get destroyed as soon as the client that created them disconnects are referred to as ephemeral znodes.
A sequential znode is one for which a sequence number is chosen by the ZooKeeper ensemble and appended to the name the client assigns to the znode.
9) What are watches?

Client disconnection can be a troublesome problem, especially when we need to keep track of the state of znodes at regular intervals. ZooKeeper has an event system referred to as a watch, which can be set on a znode to trigger an event whenever the znode is removed or altered, or any new children are created below it.

10) What problems can be addressed by using Zookeeper?

In the development of distributed systems, creating your own protocols for coordinating the cluster often results in failure and frustration for developers. The architecture of a distributed system can be prone to deadlocks, inconsistency, and race conditions. This leads to various difficulties in making the cluster fast, reliable, and scalable. To address all such problems, Apache ZooKeeper can be used as a coordination service to write correct distributed applications without having to reinvent the wheel from the beginning.


Flume QA


Q. Explain about the core components of Flume.

The core components of Flume are –

Event- The single log entry or unit of data that is transported.

Source- This is the component through which data enters Flume workflows.

Sink-It is responsible for transporting data to the desired destination.

Channel – It is the conduit between the Source and the Sink, where events are buffered until the Sink consumes them.

Agent- Any JVM that runs Flume.

Client – The component that transmits events to the Source that operates within the agent.

2) Does Flume provide 100% reliability to the data flow?

Yes, Apache Flume provides end to end reliability because of its transactional approach in data flow.

3) How can Flume be used with HBase?

Apache Flume can be used with HBase using one of the two HBase sinks –

HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters and also the novel HBase IPC that was introduced in the version HBase 0.96.
AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than HBase sink as it can easily make non-blocking calls to HBase.
Working of the HBaseSink –

In HBaseSink, a Flume event is converted into HBase Increments or Puts. The serializer implements the HbaseEventSerializer interface and is instantiated when the sink starts. For every event, the sink calls the initialize method on the serializer, which then translates the Flume event into HBase Increments and Puts to be sent to the HBase cluster.

Working of the AsyncHBaseSink-

AsyncHBaseSink uses a serializer that implements the AsyncHbaseEventSerializer interface. Its initialize method is called only once, when the sink starts. The sink invokes the setEvent method and then calls the getIncrements and getActions methods, similar to the HBase sink. When the sink stops, the cleanUp method of the serializer is called (a configuration sketch for the HBase sink follows below).
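Below is a minimal sketch of an agent configuration wiring up the HBase sink; the agent, channel, table, and column family names are illustrative, and the serializer shown is just one of the bundled options.

a1.sinks.k1.type = hbase
a1.sinks.k1.table = flume_events
a1.sinks.k1.columnFamily = cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.channel = c1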

4) Explain about the different channel types in Flume. Which channel type is faster?

The 3 different built in channel types available in Flume are-

MEMORY Channel – Events are read from the source into memory and passed to the sink.

JDBC Channel – JDBC Channel stores the events in an embedded Derby database.

FILE Channel –File Channel writes the contents to a file on the file system after reading the event from a source. The file is deleted only  after the contents are successfully delivered to the sink.

The MEMORY channel is the fastest channel among the three; however, it comes with the risk of data loss. The channel that you choose depends entirely on the nature of the big data application and the value of each event (see the configuration sketch below).
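As an illustration, the channel type is selected in the agent's properties file; the agent and channel names and the paths below are hypothetical.

# Memory channel: fastest, but events are lost if the agent dies
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# File channel: durable, events survive an agent restart
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /var/flume/checkpoint
a1.channels.c2.dataDirs = /var/flume/data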

5) Which is the reliable channel in Flume to ensure that there is no data loss?

FILE Channel is the most reliable channel among the 3 channels JDBC, FILE and MEMORY.

6) Explain about the replication and multiplexing selectors in Flume.

Channel Selectors are used to handle multiple channels. Based on the Flume header value, an event can be written just to a single channel or to multiple channels. If a channel selector is not specified to the source then by default it is the Replicating selector. Using the replicating selector, the same event is written to all the channels in the source’s channels list. Multiplexing channel selector is used when the application has to send different events to different channels.

7) How can a multi-hop agent be set up in Flume?

The Avro RPC bridge mechanism is used to set up a multi-hop agent in Apache Flume.

8) Does Apache Flume provide support for third party plug-ins?

Yes. Apache Flume has a plug-in-based architecture, so it can load data from external sources and transfer it to external destinations through third-party plug-ins.

9) Is it possible to leverage real time analysis on the big data collected by Flume directly? If yes, then explain how.

Data from Flume can be extracted, transformed and loaded in real-time into Apache Solr servers using MorphlineSolrSink

10) Differentiate between FileSink and FileRollSink

The major difference between HDFS FileSink and FileRollSink is that HDFS File Sink writes the events into the Hadoop Distributed File System (HDFS) whereas File Roll Sink stores the events into the local file system.


HDFS QA

Q. What is a block and block scanner in HDFS?

Block - The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default size of a block in HDFS is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x).


Block Scanner - Block Scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve disk bandwidth on the datanode.


Q. Explain the difference between NameNode, Backup Node and Checkpoint NameNode.


NameNode: The NameNode is at the heart of the HDFS file system and manages the metadata, i.e. the data of the files is not stored on the NameNode; rather, it holds the directory tree of all the files present in the HDFS file system on a Hadoop cluster. The NameNode uses two files for the namespace:


fsimage file- It keeps track of the latest checkpoint of the namespace.


edits file-It is a log of changes that have been made to the namespace since checkpoint.


Checkpoint Node-


Checkpoint Node keeps track of the latest checkpoint in a directory that has same structure as that of NameNode’s directory. Checkpoint node creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage file from the NameNode and merging it locally. The new image is then again updated back to the active NameNode.


BackupNode:


Backup Node also provides check pointing functionality like that of the checkpoint node but it also maintains its up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.


Q. What is commodity hardware?


Commodity hardware refers to inexpensive systems that do not have high availability or high-end quality. Commodity hardware does include RAM, because there are specific services that need to be executed in memory. Hadoop can be run on any commodity hardware and does not require supercomputers or a high-end hardware configuration to execute jobs.


Q. What is the port number for NameNode, Task Tracker and Job Tracker? 


NameNode 50070


Job Tracker 50030


Task Tracker 50060


Q. Explain about the process of inter cluster data copying.


HDFS provides a distributed data copying facility through DistCP, which copies data from a source to a destination. When this data copying takes place between two Hadoop clusters, it is referred to as inter-cluster data copying. DistCP requires both the source and the destination to have a compatible or identical version of Hadoop.


Q. How can you overwrite the replication factors in HDFS?


The replication factor in HDFS can be modified or overwritten in 2 ways-


1) Using the Hadoop FS Shell, the replication factor can be changed on a per-file basis using the below command-


$hadoop fs -setrep -w 2 /my/test_file (test_file is the filename whose replication factor will be set to 2)


2) Using the Hadoop FS Shell, the replication factor of all files under a given directory can be modified using the below command-


$hadoop fs -setrep -w 5 /my/test_dir (test_dir is the name of the directory; all the files in this directory will have a replication factor set to 5)


Q. Explain the difference between NAS and HDFS.


NAS runs on a single machine, and thus there is no data redundancy, whereas HDFS runs on a cluster of different machines and provides data redundancy through its replication protocol.


NAS stores data on a dedicated hardware whereas in HDFS all the data blocks are distributed across local drives of the machines.


In NAS data is stored independent of the computation and hence Hadoop MapReduce cannot be used for processing whereas HDFS works with Hadoop MapReduce as the computations in HDFS are moved to data.


Q. Explain what happens if during the PUT operation, HDFS block is assigned a replication factor 1 instead of the default value 3.


The replication factor is a property of HDFS that can be set for the entire cluster to adjust the number of times blocks are replicated, in order to ensure high data availability. For every block stored in HDFS, the cluster will have n-1 duplicated blocks. So, if the replication factor during the PUT operation is set to 1 instead of the default value 3, there will be only a single copy of the data. Under these circumstances, if the DataNode holding that copy crashes, the data will be lost.


Q. What is the process to change the files at arbitrary locations in HDFS?


HDFS does not support modifications at arbitrary offsets in the file or multiple writers but files are written by a single writer in append only format i.e. writes to a file in HDFS are always made at the end of the file.


Q. Can you Explain about the indexing process in HDFS?


Indexing process in HDFS depends on the block size. HDFS stores the last part of the data that further points to the address where the next part of data chunk is stored.


Q. What is a rack awareness and on what basis is data stored in a rack?


All the data nodes put together form a storage area i.e. the physical location of the data nodes is referred to as Rack in HDFS. The rack information i.e. the rack id of each data node is acquired by the NameNode. The process of selecting closer data nodes depending on the rack information is known as Rack Awareness.


The contents of the file are divided into data blocks as soon as the client is ready to load the file into the Hadoop cluster. After consulting with the NameNode, the client allocates 3 DataNodes for each data block. For each data block, two copies exist in one rack and the third copy is present in another rack. This is generally referred to as the Replica Placement Policy.


Q. What happens to a NameNode that has no data?


There does not exist any NameNode without data. If it is a NameNode then it should have some sort of data in it.


Q. What happens when a user submits a Hadoop job when the NameNode is down- does the job get in to hold or does it fail.


The Hadoop job fails when the NameNode is down.


Q. What happens when a user submits a Hadoop job when the Job Tracker is down- does the job get in to hold or does it fail.


The Hadoop job fails when the Job Tracker is down.


Q. Whenever a client submits a hadoop job, who receives it?


The JobTracker receives the Hadoop job submitted by the client and takes care of resource allocation for the job to ensure timely completion. The NameNode is consulted for the data requested by the client and provides the block information.


Q. How will you measure HDFS space consumed?

The two popular utilities or commands to measure HDFS space consumed are hdfs dfs -du and hdfs dfsadmin -report. HDFS provides reliable storage by copying data to multiple nodes. The number of copies it creates is usually referred to as the replication factor, which is greater than one.

hdfs dfs -du – This command shows the space consumed by data without replication.
hdfs dfsadmin -report – This command shows the real disk usage by taking data replication into account. Therefore, the output of hdfs dfsadmin -report will always be greater than the output of the hdfs dfs -du command.

Q. Is it a good practice to use HDFS for multiple small files?

It is not a good practice to use HDFS for multiple small files because NameNode is an expensive high performance system. Occupying the NameNode space with the unnecessary amount of metadata generated for each of the small multiple files is not sensible. If there is a large file with loads of data, then it is always a wise move to use HDFS because it will occupy less space for metadata and provide optimized performance.

Q. I have a file “Sample” on HDFS. How can I copy this file to the local file system?

This can be accomplished using the following command -

bin/hadoop fs -copyToLocal /hdfs/source/path /localfs/destination/path

Q. What do you understand by Inodes?

HDFS namespace consists of files and directories. Inodes are used to represent these files and directories on the NameNode. Inodes record various attributes like the namespace quota, disk space quota, permissions, modified time and access time.

Q. Replication causes data redundancy then why is it still preferred in HDFS?

As we know that Hadoop works on commodity hardware, so there is an increased probability of getting crashed. Thus to make the entire Hadoop system highly tolerant, replication factor is preferred even though it creates multiple copies of the same data at different locations. Data on HDFS is stored in at least 3 different locations. Whenever one copy of the data is corrupted and the other copy of the data is not available due to some technical glitches then the data can be accessed from the third location without any data loss.

Q. Data is replicated at least thrice on HDFS. Does it imply that any alterations or calculations done on one copy of the data will be reflected in the other two copies also?

Calculations or any transformations are performed on the original data and do not get reflected to all the copies of data. Master node identifies where the original data is located and performs the calculations. Only if the node is not responding or data is corrupted then it will perform the desired calculations on the second replica.

Q. How will you compare two HDFS files?

UNIX has a diff command to compare two files, but Hadoop does not provide a diff command for HDFS files. However, process substitution can be used in the shell together with the diff command as follows-

diff <(hadoop fs -cat /path/to/file1) <(hadoop fs -cat /path/to/file2)

If the goal is just to find whether the two files are similar or not without having to know the exact differences, then a checksum-based approach can also be followed to compare two files. Get the checksums for both files and compare them.

Q. How will you copy a huge file of 80 GB into HDFS in parallel?

Using the DistCp tool, huge files can be copied in parallel within or between Hadoop clusters (see the example below).
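DistCp runs as a MapReduce job that copies data in parallel; the cluster addresses and paths below are illustrative only.

$ hadoop distcp hdfs://nn1:8020/data/bigfile_80g.dat hdfs://nn2:8020/data/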

Q. Are Job Tracker and Task Tracker present on the same machine?

No, they are present on separate machines as Job Tracker is a single point of failure in Hadoop MapReduce and if the Job Tracker goes down all the running Hadoop jobs will halt.

Q. Can you create multiple files in HDFS with varying block sizes?

Yes, it is possible to create multiple files in HDFS with different block sizes using an API. The block size can be specified during the time of file creation. Below is the signature of the method that helps achieve this –

public FSDataOutputStream create(Path f, boolean overwrite, int bufferSize, short replication, long blockSize) throws IOException

Q. What happens if two clients try writing into the same HDFS file?

HDFS provides support only for exclusive writes, so when one client is already writing a file, another client cannot open the file in write mode. When a client requests the NameNode to open a file for writing, the NameNode grants the client a lease for writing to the file, so if another client requests a lease on the same file, the request will be rejected.

Q. What do you understand by Active and Passive NameNodes?

The NameNode that works and runs in the Hadoop cluster is often referred to as the Active NameNode. The Passive NameNode, also known as the Standby NameNode, is similar to an Active NameNode but comes into action only when the Active NameNode fails. Whenever the Active NameNode fails, the Passive or Standby NameNode replaces it, ensuring that the Hadoop cluster is never without a NameNode.

Q. How will you balance the disk space usage on a HDFS cluster?

The Balancer tool helps achieve this by taking a threshold value as an input parameter, which is always a fraction between 0 and 1. The HDFS cluster is said to be balanced if, for every DataNode, the ratio of used space on the node to the total capacity of the node differs from the ratio of used space in the cluster to the total capacity of the cluster by no more than the threshold value (see the command below).
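The balancer is typically started as shown below; here the threshold is given as a percentage (10 corresponds to the 0.1 fraction mentioned above).

$ hdfs balancer -threshold 10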

Q. If a DataNode is marked as decommissioned, can it be chosen for replica placement?

Whenever a DataNode is marked as decommissioned it cannot be considered for replication but it continues to serve read request until the node enters the decommissioned state completely i.e. till all the blocks on the decommissioning DataNode are replicated.

Q. How will you reduce the size of large cluster by removing a few nodes?

A set of existing nodes can be removed using the decommissioning feature to reduce the size of a large cluster. The nodes that have to be removed should be added to an exclude file, and the location of the exclude file should be specified with the configuration parameter dfs.hosts.exclude. The decommissioning process can be ended by editing the exclude file or the configuration file again (see the sketch below).
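A sketch of the relevant configuration and command, assuming a hypothetical exclude-file path:

<!-- hdfs-site.xml -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>

$ hdfs dfsadmin -refreshNodes    # make the NameNode re-read the exclude file and start decommissioning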

Q. What do you understand by Safe Mode in Hadoop?

The state in which NameNode does not perform replication or deletion of blocks is referred to as Safe Mode in Hadoop. In safe mode, NameNode only collects block reports information from the DataNodes.

Q. How will you manually enter and leave Safe Mode in Hadoop?

Below command is used to enter Safe Mode manually –

$ hdfs dfsadmin -safemode enter

Once the safe mode is entered manually, it should be removed manually.

Below command is used to leave Safe Mode manually –

$ hdfs dfsadmin -safemode leave

Q. What are the advantages of a block transfer?

The size of a file can be larger than the size of a single disk within the network. Blocks from a single file need not be stored on the same disk and can make use of different disks present in the Hadoop cluster. This simplifies the entire storage subsystem providing fault tolerance and high availability.

Q. How will you empty the trash in HDFS?

Just as many desktop operating systems move deleted files into a trash or recycle folder rather than removing them immediately, HDFS moves all deleted files into a trash folder stored at /user/<username>/.Trash (for example /user/hdfs/.Trash). The trash can be emptied by running the following command-

hdfs dfs -expunge

Q. What does the HDFS error “File could only be replicated to 0 nodes, instead of 1” mean?

This exception occurs when the DataNode is not available to the NameNode (i.e. the client is not able to communicate with the DataNode) due to one of the following reasons –

In hdfs-site.xml file, if the block size is a negative value.
If there are any network fluctuations between the DataNode and NameNode, as a result of which the primary DataNode goes down whilst write is in progress.
Disk of DataNode is full.
DataNode is eventful and occupied with block reporting and scanning.

