Q. What is a block and block scanner in HDFS?
Block - The minimum amount of data that HDFS can read or write is referred to as a "block". The default block size in HDFS is 64MB in Hadoop 1.x (128MB from Hadoop 2.x onwards).
Block Scanner - The Block Scanner tracks the list of blocks present on a DataNode and verifies them to detect checksum errors. Block scanners use a throttling mechanism to limit the disk bandwidth they consume on the DataNode.
Q. Explain the difference between NameNode, Backup Node and Checkpoint NameNode.
NameNode: NameNode is at the heart of the HDFS file system. It manages the metadata, i.e., the file data itself is not stored on the NameNode; rather, it holds the directory tree of all the files present in the HDFS file system on a Hadoop cluster. NameNode uses two files for the namespace-
fsimage file - It keeps track of the latest checkpoint of the namespace.
edits file - It is a log of the changes that have been made to the namespace since the last checkpoint.
Checkpoint Node:
The Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode's directory. The Checkpoint Node creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then uploaded back to the active NameNode.
Backup Node:
The Backup Node provides checkpointing functionality like the Checkpoint Node, but it also maintains an up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.
Q. What is commodity hardware?
Commodity hardware refers to inexpensive systems that do not offer high availability or high-end quality. Commodity hardware does include a reasonable amount of RAM, because there are specific services that need to execute in memory. Hadoop can be run on any commodity hardware and does not require supercomputers or high-end hardware configurations to execute jobs.
Q. What is the port number for NameNode, Task Tracker and Job Tracker?
NameNode - 50070
Job Tracker - 50030
Task Tracker - 50060
(These are the default web UI ports in Hadoop 1.x.)
Q. Explain about the process of inter cluster data copying.
HDFS provides a distributed data copying facility through DistCp, which copies data from a source to a destination. When this copying takes place between two different Hadoop clusters, it is referred to as inter-cluster data copying. DistCp requires both the source and the destination to run the same or compatible versions of Hadoop.
Q. How can you overwrite the replication factors in HDFS?
The replication factor in HDFS can be modified or overwritten in 2 ways-
1) Using the Hadoop FS shell, the replication factor can be changed on a per-file basis using the command below-
$ hadoop fs -setrep -w 2 /my/test_file (test_file is the file whose replication factor will be set to 2)
2) Using the Hadoop FS shell, the replication factor of all files under a given directory can be modified using the command below-
$ hadoop fs -setrep -w 5 /my/test_dir (test_dir is the directory; all the files under it will have their replication factor set to 5)
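The replication factor can also be changed programmatically through the Hadoop Java API. Below is a minimal sketch, assuming the Configuration on the classpath points at the cluster and reusing the illustrative /my/test_file path from the shell example above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Set the replication factor of an existing file to 2
        boolean changed = fs.setReplication(new Path("/my/test_file"), (short) 2);
        System.out.println("Replication changed: " + changed);

        fs.close();
    }
}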
Q. Explain the difference between NAS and HDFS.
NAS runs on a single machine, so there is no data redundancy, whereas HDFS runs on a cluster of machines and gains data redundancy through its replication protocol.
NAS stores data on dedicated hardware, whereas in HDFS all the data blocks are distributed across the local drives of the machines in the cluster.
In NAS, data is stored independently of the computation, and hence Hadoop MapReduce cannot be used for processing, whereas HDFS works with Hadoop MapReduce because the computation is moved to the data.
Q. Explain what happens if during the PUT operation, HDFS block is assigned a replication factor 1 instead of the default value 3.
Replication factor is a property of HDFS that can be set for the entire cluster to adjust the number of times blocks are replicated, so as to ensure high data availability. For every block stored in HDFS, the cluster holds n-1 duplicates in addition to the original, where n is the replication factor. So, if the replication factor during the PUT operation is set to 1 instead of the default value 3, only a single copy of the data will exist. Under these circumstances, if the DataNode holding that copy crashes, the data is lost, because no other replica is available to recover it from.
Q. What is the process to change the files at arbitrary locations in HDFS?
HDFS does not support modifications at arbitrary offsets in a file, nor multiple concurrent writers. Files are written by a single writer in append-only fashion, i.e., writes to a file in HDFS are always made at the end of the file.
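To illustrate the append-only model, here is a hedged sketch using the Java API's FileSystem.append(), which opens an existing file for writing at its end (the file path is an illustrative assumption, and append support must be enabled on very old releases):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // HDFS offers no seek-and-overwrite; the only way to modify an
        // existing file is to append new bytes at its end.
        FSDataOutputStream out = fs.append(new Path("/my/test_file"));
        out.writeBytes("new records go at the end of the file\n");
        out.close();

        fs.close();
    }
}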
Q. Can you Explain about the indexing process in HDFS?
The indexing process in HDFS depends on the block size. HDFS does not maintain conventional indexes; it stores the last part of the data, which further points to the address where the next part of the data chunk is stored.
Q. What is rack awareness and on what basis is data stored in a rack?
All the DataNodes put together form a storage area, and the physical location of the DataNodes is referred to as a rack in HDFS. The NameNode acquires the rack information, i.e., the rack ID of each DataNode. The process of selecting closer DataNodes based on the rack information is known as rack awareness.
The contents of a file are divided into data blocks as soon as the client is ready to load the file into the Hadoop cluster. After consulting with the NameNode, the client is allocated 3 DataNodes for each data block. For each data block, two copies exist in one rack and the third copy is placed in another rack. This is generally referred to as the Replica Placement Policy.
Q. What happens to a NameNode that has no data?
There does not exist any NameNode without data; if it is a NameNode, it will always hold some metadata about the file system.
Q. What happens when a user submits a Hadoop job when the NameNode is down - does the job get put on hold or does it fail?
The Hadoop job fails when the NameNode is down.
Q. What happens when a user submits a Hadoop job when the Job Tracker is down - does the job get put on hold or does it fail?
The Hadoop job fails when the Job Tracker is down.
Q. Whenever a client submits a hadoop job, who receives it?
The JobTracker receives the Hadoop job. It consults the NameNode, which looks up the data requested by the client and returns the block location information. The JobTracker then takes care of resource allocation and task scheduling for the Hadoop job to ensure timely completion.
Q. How will you measure HDFS space consumed?
The two popular utilities or commands to measure HDFS space consumed are hdfs dfs -du and hdfs dfsadmin -report. HDFS provides reliable storage by copying data to multiple nodes; the number of copies it creates is referred to as the replication factor, which is usually greater than one.
hdfs dfs -du - This command shows the space consumed by the data without replication.
hdfs dfsadmin -report - This command shows the real disk usage, taking data replication into account. Therefore, the output of hdfs dfsadmin -report will always be greater than the output of hdfs dfs -du.
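The same distinction is visible through the Java API: ContentSummary reports both the logical size (comparable to hdfs dfs -du) and the raw space consumed including replication. A minimal sketch, with an assumed path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SpaceUsageExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        ContentSummary summary = fs.getContentSummary(new Path("/my"));

        // Logical size of the data, without counting replicas
        System.out.println("Length (du-style):    " + summary.getLength());

        // Raw disk usage, i.e. length multiplied by each file's replication factor
        System.out.println("Space consumed (raw): " + summary.getSpaceConsumed());

        fs.close();
    }
}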
Q. Is it a good practice to use HDFS for multiple small files?
It is not a good practice to use HDFS for many small files, because the NameNode is an expensive, high-performance system and occupying its memory with the unnecessary metadata generated for each of the many small files is not sensible. If there is a large file with loads of data, it is always a wise move to use HDFS, because it will occupy less space for metadata and provide optimized performance.
Q. I have a file “Sample” on HDFS. How can I copy this file to the local file system?
This can be accomplished using the following command -
bin/hadoop fs -copyToLocal /hdfs/source/path /localfs/destination/path
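(hadoop fs -get performs the same copy.) Programmatically, the equivalent via the Java API looks like the following sketch, with the source and destination paths assumed for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToLocalExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Copy the HDFS file "Sample" down to the local file system
        fs.copyToLocalFile(new Path("/hdfs/source/path/Sample"),
                           new Path("/localfs/destination/path/Sample"));

        fs.close();
    }
}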
Q. What do you understand by Inodes?
The HDFS namespace consists of files and directories. Inodes are used to represent these files and directories on the NameNode. Inodes record attributes such as the namespace quota, disk space quota, permissions, modification time, and access time.
Q. Replication causes data redundancy then why is it still preferred in HDFS?
As Hadoop works on commodity hardware, there is an increased probability of node crashes. Thus, to make the entire Hadoop system highly fault-tolerant, replication is preferred even though it creates multiple copies of the same data at different locations. By default, data on HDFS is stored in 3 different locations. If one copy of the data is corrupted and another copy is unavailable due to some technical glitch, the data can still be accessed from the third location without any data loss.
Q. Data is replicated at least thrice on HDFS. Does it imply that any alterations or calculations done on one copy of the data will be reflected in the other two copies also?
Calculations or transformations are performed on the original data and are not reflected in all the copies of the data. The master node identifies where the original data is located and performs the calculations there. Only if that node is not responding or the data is corrupted will the desired calculations be performed on the second replica.
Q. How will you compare two HDFS files?
UNIX has a diff command to compare two files, but Hadoop provides no diff command for HDFS files. However, shell process substitution can be used with the diff command as follows-
diff <(hadoop fs -cat /path/to/file) <(hadoop fs -cat /path/to/file2)
If the goal is just to find out whether the two files are identical, without having to know the exact differences, then a checksum-based approach can be followed: get the checksums for both files (for example, with the hadoop fs -checksum command on recent Hadoop versions) and compare them.
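A hedged Java sketch of the checksum approach, using FileSystem.getFileChecksum(); note that HDFS checksums are only directly comparable when both files were written with the same block size and checksum configuration, and the paths below are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompareFilesExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // For HDFS this returns an MD5-of-MD5-of-CRC32 checksum;
        // some file systems may return null instead.
        FileChecksum c1 = fs.getFileChecksum(new Path("/path/to/file"));
        FileChecksum c2 = fs.getFileChecksum(new Path("/path/to/file2"));

        // Equal checksums (same algorithm and bytes) imply identical contents;
        // differing checksums prove the files differ.
        System.out.println(c1 != null && c1.equals(c2) ? "Files match" : "Files differ");

        fs.close();
    }
}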
Q. How will you copy a huge file of 80GB size into HDFS parallelly?
Huge files can be copied using the DistCp tool, within a single Hadoop cluster or between Hadoop clusters. DistCp runs as a MapReduce job, so the copy work is distributed across multiple mappers that transfer data in parallel.
Q. Are Job Tracker and Task Tracker present on the same machine?
No, they are present on separate machines. The JobTracker is a single point of failure in Hadoop MapReduce, and if the JobTracker goes down, all the running Hadoop jobs will halt.
Q. Can you create multiple files in HDFS with varying block sizes?
Yes, it is possible to create multiple files in HDFS with different block sizes using the Java API, because the block size can be specified at file-creation time. Below is the signature of the FileSystem method that helps achieve this -
public FSDataOutputStream create(Path f, boolean overwrite, int bufferSize, short replication, long blockSize) throws IOException
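For example, here is a hedged sketch that creates two files with different block sizes via this overload; the paths, buffer size, and block sizes are illustrative values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VaryingBlockSizeExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // File with a 32 MB block size, replication factor 3
        FSDataOutputStream small = fs.create(new Path("/my/file_32mb"),
                true, 4096, (short) 3, 32L * 1024 * 1024);
        small.writeBytes("file with 32 MB blocks\n");
        small.close();

        // File with a 128 MB block size, replication factor 3
        FSDataOutputStream large = fs.create(new Path("/my/file_128mb"),
                true, 4096, (short) 3, 128L * 1024 * 1024);
        large.writeBytes("file with 128 MB blocks\n");
        large.close();

        fs.close();
    }
}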
Q. What happens if two clients try writing into the same HDFS file?
HDFS supports only exclusive writes, so when one client is already writing to a file, another client cannot open the file in write mode. When a client asks the NameNode to open a file for writing, the NameNode grants the client a lease on the file. If another client then requests a lease on the same file, the request is rejected.
Q. What do you understand by Active and Passive NameNodes?
The NameNode that actively works and runs in the Hadoop cluster is referred to as the Active NameNode. The Passive NameNode, also known as the Standby NameNode, is similar to the Active NameNode but comes into action only when the Active NameNode fails. Whenever the Active NameNode fails, the Passive (Standby) NameNode replaces it, ensuring that the Hadoop cluster is never without a NameNode.
Q. How will you balance the disk space usage on a HDFS cluster?
The Balancer tool helps achieve this. It takes a threshold value as an input parameter, expressed as a percentage of disk capacity (10 by default). The HDFS cluster is said to be balanced if, for every DataNode, the difference between the ratio of used space on the node to the node's total capacity and the ratio of used space in the cluster to the cluster's total capacity is not greater than the threshold value.
Q. If a DataNode is marked as decommissioned, can it be chosen for replica placement?
Whenever a DataNode is marked as decommissioned, it cannot be considered for replica placement, but it continues to serve read requests until it enters the decommissioned state completely, i.e., until all the blocks on the decommissioning DataNode have been replicated elsewhere.
Q. How will you reduce the size of large cluster by removing a few nodes?
A set of existing nodes can be removed using the decommissioning feature to reduce the size of a large cluster. The nodes to be removed should be added to the exclude file, whose name is specified via the configuration parameter dfs.hosts.exclude; running hdfs dfsadmin -refreshNodes then starts the decommissioning. The process can be ended by editing the exclude file (or the configuration) and refreshing the nodes again.
Q. What do you understand by Safe Mode in Hadoop?
The state in which the NameNode does not perform replication or deletion of blocks is referred to as Safe Mode in Hadoop. In safe mode, the NameNode only collects block report information from the DataNodes, and the file system is effectively read-only.
Q. How will you manually enter and leave Safe Mode in Hadoop?
Below command is used to enter Safe Mode manually –
$ hdfs dfsadmin -safemode enter
Once safe mode is entered manually, it should also be left manually.
Below command is used to leave Safe Mode manually –
$ hdfs dfsadmin -safemode leave
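Safe mode can also be toggled programmatically through the DistributedFileSystem API. The sketch below assumes a Hadoop 2.x client whose default file system is HDFS (the SafeModeAction enum lives in org.apache.hadoop.hdfs.protocol.HdfsConstants):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class SafeModeExample {
    public static void main(String[] args) throws Exception {
        // The cast only succeeds when fs.defaultFS points at an HDFS cluster
        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());

        dfs.setSafeMode(SafeModeAction.SAFEMODE_ENTER);  // enter safe mode
        boolean inSafeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
        System.out.println("In safe mode: " + inSafeMode);
        dfs.setSafeMode(SafeModeAction.SAFEMODE_LEAVE);  // leave safe mode

        dfs.close();
    }
}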
Q. What are the advantages of a block transfer?
The size of a file can be larger than any single disk in the network. Blocks from a single file need not be stored on the same disk; they can make use of the different disks present across the Hadoop cluster. This simplifies the entire storage subsystem while providing fault tolerance and high availability.
Q. How will you empty the trash in HDFS?
Just as many desktop operating systems move deleted files to a recycle bin rather than erasing them immediately, HDFS moves deleted files into a trash folder located at /user/<username>/.Trash (e.g., /user/hdfs/.Trash for the hdfs user). The trash can be emptied by running the following command-
hdfs dfs -expunge
Q. What does the HDFS error “File could only be replicated to 0 nodes, instead of 1” mean?
This exception occurs when the DataNode is not available to the NameNode (i.e. the client is not able to communicate with the DataNode) due to one of the following reasons –
The block size in the hdfs-site.xml file is set to a negative value.
Network fluctuations between the DataNode and the NameNode cause the primary DataNode to go down whilst the write is in progress.
The disk of the DataNode is full.
The DataNode is busy with block reporting and scanning.
If these HDFS Interview Questions and Answers were helpful, please spend a minute of your valuable time to share this post and help the Hadoop community at large.
Help Others to Help Yourself.