HBase QA

Q. What is HBase?

Before we dive into HBase interview questions, here’s an overview of what HBase is and its key features -

HBase, commonly referred to as the “Hadoop database”, is a column-oriented database based on the principles of Google Bigtable. HBase does not directly use the capabilities of Hadoop MapReduce, but it can integrate with Hadoop to act as a source or destination for MapReduce jobs. HBase provides real-time read/write access to data in HDFS; data can be stored in HDFS directly or through HBase. Just as HDFS has a NameNode and DataNodes, and Hadoop MapReduce has a JobTracker and TaskTrackers, HBase has a Master node and Region Servers. The Master node manages the cluster, while Region Servers store portions of the HBase tables and perform the data model operations.

An HBase system consists of tables with rows and columns, just like a traditional RDBMS. Every table must have a row key, which acts as its primary key and is used to access the data in the table. HBase columns define the attributes of an object. For instance, if an HBase table stores web server logs, then each row is a log record, and the columns can be the name of the server the log originated from, the time when the log was written, and so on. Several related columns can be grouped together into column families, and all the elements of a single column family are stored together. Column families must be specified when defining the table schema; however, HBase is flexible in that new columns can be added to a column family at any time based on application requirements.
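The logical layout described above can be pictured as a nested map. Here is a minimal Python sketch of it (an illustration only, not HBase client code; the table and column names are hypothetical), using the web server log example:

```python
# Toy model of HBase's logical data model: row key -> column family -> qualifier -> value.
weblog_table = {
    "row-20240101-0001": {                                            # row key of one log record
        "server": {"name": "web01", "ip": "10.0.0.5"},                # 'server' column family
        "log": {"time": "2024-01-01T00:00:00Z", "level": "INFO"},     # 'log' column family
    },
}

# Schema flexibility: a new column can be added to an existing column family
# at any time, without altering any table-wide schema.
weblog_table["row-20240101-0001"]["log"]["message"] = "GET /index.html 200"

print(weblog_table["row-20240101-0001"]["log"]["message"])
```

Note how the column families ("server" and "log") group related qualifiers, while individual qualifiers can vary freely from row to row.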

Hallmark Features of HBase

Schema Flexibility
Scalability
High Reliability
Advantages of Using HBase

Provides RDBMS-like stored procedures and triggers in the form of coprocessors. A coprocessor in HBase is a framework that lets users run custom code on the Region Server.
Strong row-level consistency
In-built versioning

Q. Compare RDBMS with HBase

Characteristic         : RDBMS                                         : HBase
Schema                 : Has a fixed schema                            : No fixed schema
Query Language         : Supports a structured, powerful query language (SQL) : Simple query API
Transaction Processing : Supports ACID transactions.                   : Guarantees atomicity only at the row level; does not support multi-row ACID transactions.

Q. What do you understand by CAP theorem and which features of CAP theorem does HBase follow?

CAP stands for Consistency, Availability and Partition Tolerance.

Consistency – At any given point in time, all nodes in a cluster see the same data.
Availability – Every request receives a response, regardless of whether it is a success or a failure.
Partition Tolerance – The system continues to work even if part of it fails or messages are intermittently lost.
HBase is a column-oriented database that provides consistency and partition tolerance (the C and P of CAP).
Q. Name a few other popular column-oriented databases like HBase.

Cassandra, Hypertable and Accumulo. (CouchDB and MongoDB are often mentioned alongside HBase, but they are document stores, not column-oriented databases.)

Q. When should you use HBase and what are the key components of HBase?

HBase should be used when the big data application has –

1) A variable schema

2) Data stored in the form of collections

3) A need for key-based access to data while storing and retrieving

Key components of HBase are –

Region – contains the in-memory data store (MemStore) and HFiles.

Region Server – serves and monitors the regions assigned to it.

HBase Master – responsible for monitoring the Region Servers and assigning regions.

ZooKeeper – takes care of the coordination between the HBase Master and the clients.

Catalog Tables – the two important catalog tables are ROOT and META. The ROOT table tracks where the META table is, and the META table stores the locations of all the regions in the system.

Q. What are the different operational commands in HBase at record level and table level?

Record-level operational commands in HBase are – put, get, increment, scan and delete.

Table-level operational commands in HBase are – create, describe, list, disable and drop.

Q. What is a RowKey?

Every row in an HBase table has a unique identifier known as its RowKey. It is used for grouping cells logically, and it ensures that all cells with the same RowKey are co-located on the same server. Internally, a RowKey is just a byte array, and rows are stored in lexicographic order of their RowKeys.
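The lexicographic byte ordering has a practical consequence for RowKey design, which a short Python sketch can illustrate (the key names are hypothetical): numeric components must be fixed-width (e.g. zero-padded), or rows will not sort in numeric order.

```python
# RowKeys are byte arrays sorted lexicographically, so "row10" sorts
# before "row2" unless the numeric part is zero-padded.
unpadded = sorted([b"row9", b"row10", b"row2"])
padded = sorted([b"row009", b"row010", b"row002"])

print(unpadded)  # [b'row10', b'row2', b'row9'] - not numeric order
print(padded)    # [b'row002', b'row009', b'row010'] - intended order
```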

Q. Explain the difference between the RDBMS data model and the HBase data model.

An RDBMS is a schema-based database, whereas HBase has a schema-less data model.

An RDBMS has no built-in support for partitioning, whereas HBase provides automated partitioning.

An RDBMS stores normalized data, whereas HBase stores de-normalized data.

Q. Explain the different catalog tables in HBase.

The two important catalog tables in HBase are ROOT and META. The ROOT table tracks where the META table is, and the META table stores the locations of all the regions in the system.

Q. What are column families? What happens if you alter the block size of a column family on an already populated database?

The logical division of data is represented through a key known as the column family. Column families are the basic unit of physical storage, at which level features such as compression can be applied. In an already populated database, when the block size of a column family is altered, the old data remains within the old block size, whereas new data that comes in is written with the new block size. When compaction takes place, the old data is rewritten with the new block size so that the existing data is read correctly.

Q. Explain the difference between HBase and Hive.

HBase and Hive are quite different Hadoop-based technologies: Hive is a data warehouse infrastructure on top of Hadoop, whereas HBase is a NoSQL key-value store that runs on top of Hadoop. Hive lets SQL-savvy users run MapReduce jobs, whereas HBase supports four primary operations – put, get, scan and delete. HBase is ideal for real-time querying of big data, whereas Hive is an ideal choice for analytical querying of data collected over a period of time.

Q. Explain the process of row deletion in HBase.

On issuing a delete command through the HBase client, data is not actually deleted from the cells; rather, the cells are made invisible by setting a tombstone marker. The tombstoned cells are physically removed at regular intervals during compaction.
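The delete-via-tombstone behaviour can be sketched with a toy in-memory store (a conceptual illustration only, not HBase internals): a delete only marks the cell, a read hides marked cells, and compaction is what actually drops them.

```python
# Toy sketch of HBase-style deletion: mark with a tombstone, purge at compaction.
TOMBSTONE = object()

store = {("row1", "cf:col"): "v1", ("row2", "cf:col"): "v2"}

def delete(cell):
    store[cell] = TOMBSTONE                  # mark invisible; nothing is removed yet

def get(cell):
    v = store.get(cell)
    return None if v is TOMBSTONE else v     # tombstoned cells are hidden from reads

def compact():
    # only at compaction time are tombstoned cells physically dropped
    for cell in [c for c, v in store.items() if v is TOMBSTONE]:
        del store[cell]

delete(("row1", "cf:col"))
print(get(("row1", "cf:col")))               # None: hidden, though still stored
compact()
print(("row1", "cf:col") in store)           # False: removed during compaction
```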

Q. What are the different types of tombstone markers in HBase for deletion?

There are 3 different types of tombstone markers in HBase for deletion –

1) Family Delete Marker – marks all the columns of a column family.

2) Version Delete Marker – marks a single version of a column.

3) Column Delete Marker – marks all the versions of a column.

Q. Explain HLog and WAL in HBase.

All edits to the HStore are recorded in the HLog. Every Region Server has one HLog, which contains entries for the edits of all regions served by that Region Server. WAL stands for Write-Ahead Log: every edit is written to the HLog immediately, before it is applied to the in-memory store. With deferred log flush enabled, WAL edits remain in memory until the flush period.
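The write-ahead idea is easiest to see in miniature. Below is a hedged Python sketch (toy data structures, not HBase's actual implementation): every edit is appended to the log before touching the in-memory store, so the store can be rebuilt after a crash by replaying the log.

```python
# Toy write-ahead log: log first, apply second, replay on recovery.
wal = []        # durable log (one HLog per Region Server in real HBase)
memstore = {}   # in-memory store, lost on crash

def put(row, value):
    wal.append((row, value))   # 1. append the edit to the WAL first
    memstore[row] = value      # 2. then apply it to the memstore

put("row1", "a")
put("row2", "b")

memstore = {}                  # simulate a Region Server crash

for row, value in wal:         # recovery: replay the WAL in order
    memstore[row] = value

print(memstore)                # both edits recovered
```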

Q. What do you understand by filters in HBase?

HBase filters enhance the effectiveness of working with large data stored in tables by allowing users to add limiting selectors to a query and eliminate the data that is not required. Filters have access to the complete row to which they are applied. HBase ships with many built-in filters, including –

TimestampsFilter
PageFilter
MultipleColumnPrefixFilter
FamilyFilter
ColumnPaginationFilter
SingleColumnValueFilter
RowFilter
QualifierFilter
ColumnRangeFilter
ValueFilter
PrefixFilter
SingleColumnValueExcludeFilter
ColumnCountGetFilter
InclusiveStopFilter
DependentColumnFilter
FirstKeyOnlyFilter
KeyOnlyFilter
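Conceptually, each of these filters is a predicate the Region Server applies to rows during a scan. Here is a minimal Python sketch of that idea (toy data and function names; real HBase filters are Java classes from the client API), mimicking PrefixFilter and a value-equality ValueFilter:

```python
# Filters as server-side predicates over (row key, value) pairs.
rows = {
    b"user1": b"alice",
    b"user2": b"bob",
    b"item1": b"book",
}

def prefix_filter(prefix):            # analogue of PrefixFilter
    return lambda key, value: key.startswith(prefix)

def value_filter(wanted):             # analogue of ValueFilter with equality
    return lambda key, value: value == wanted

def scan(table, flt):
    # only rows passing the filter are returned to the client
    return {k: v for k, v in table.items() if flt(k, v)}

print(scan(rows, prefix_filter(b"user")))   # only the two user* rows
print(scan(rows, value_filter(b"book")))    # only the row whose value is b"book"
```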

Q. Explain the data model operations in HBase.

Put – stores data in HBase.

Get – retrieves data stored in HBase.

Delete – deletes data from HBase tables.

Scan – iterates over the data for larger key ranges or the entire table.
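The four operations can be sketched against a toy in-memory "table" (an illustration only; the real operations live in the HBase client API, e.g. the Java classes Put, Get, Delete and Scan):

```python
# Toy table exposing HBase's four data model operations.
class ToyTable:
    def __init__(self):
        self.rows = {}               # row key -> {qualifier: value}

    def put(self, key, qualifier, value):
        self.rows.setdefault(key, {})[qualifier] = value

    def get(self, key):
        return self.rows.get(key)

    def delete(self, key):
        self.rows.pop(key, None)

    def scan(self, start=None, stop=None):
        # iterate rows in key order, optionally bounded by [start, stop)
        for key in sorted(self.rows):
            if (start is None or key >= start) and (stop is None or key < stop):
                yield key, self.rows[key]

t = ToyTable()
t.put("r1", "cf:a", "1")
t.put("r2", "cf:a", "2")
t.put("r3", "cf:a", "3")
t.delete("r2")
print(list(t.scan(start="r1", stop="r4")))  # r1 and r3 only
```

As in HBase, get addresses a single row by key while scan walks a sorted key range.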

Q. How will you back up an HBase cluster?

HBase cluster backups are performed in 2 ways-

Live Cluster Backup

Full Shutdown Backup

In the live cluster backup strategy, the CopyTable utility is used to copy the data from one table to another on the same cluster or on another cluster. The Export utility can also be used to dump the contents of a table onto HDFS on the same cluster.

In the full shutdown backup approach, a periodic complete shutdown of the HBase cluster is performed, so that the Master and Region Servers go down and there is no chance of losing in-flight changes to metadata or StoreFiles. However, this kind of approach can be used only for back-end analytic capacity, not for applications that serve front-end web pages.

Q. Does HBase support SQL-like syntax?

HBase itself does not provide SQL-like syntax. However, with Apache Phoenix, an SQL layer on top of HBase, users can retrieve data from HBase through SQL queries.

Q. Is it possible to iterate through the rows of an HBase table in reverse order?

Historically, no. Column values are put on disk with the length of the value written first, followed by the actual value, so iterating through these values in reverse order would require the bytes of the actual value to be written twice. Note, however, that HBase 0.98 and later support reverse scans via Scan#setReversed(true).

Q. Should the Region Servers be located on all DataNodes?

Yes. Region Servers run on the same machines as DataNodes so that reads and writes benefit from data locality.

Q. Suppose your data is stored in collections – for instance, some binary data, message data or metadata all keyed on the same value. Would you use HBase for this?

Yes, HBase is ideal whenever key-based access to data is required for storing and retrieving.

Q. Assume that an HBase table Student is disabled. How will you access the Student table using the scan command once it is disabled?

A disabled HBase table cannot be accessed using the scan command; it must first be re-enabled using the enable command.

Q. What do you understand by compaction?

During periods of heavy incoming writes, many small HFiles accumulate in each store. To keep read performance optimal, HBase periodically combines these HFiles into fewer, larger ones, reducing the number of disk seeks needed for every read. This process is referred to as compaction in HBase.

Q. Explain the various table design approaches in HBase.

Tall-narrow and flat-wide are the two HBase table design approaches that can be used. Which approach to use depends on what you want to achieve and how you want to use the data: the performance of HBase depends on the RowKey, and hence directly on how data is accessed.

At a high level, the difference between the flat-wide and tall-narrow approaches is similar to the difference between get and scan. Full scans are costly in HBase because of its ordered RowKey storage policy. The tall-narrow approach can be used with a complex (composite) RowKey so that focused scans can be performed on a logical group of entries.

Ideally, the tall-narrow approach is used when there are a large number of rows and a small number of columns, whereas the flat-wide approach is used when there are a small number of rows and a large number of columns.

Q. Which HBase table design approach would you recommend – tall-narrow or flat-wide?

There are several factors to be considered when deciding between flat-wide (millions of columns and limited keys) and tall-narrow (millions of keys with limited columns), however, a tall-narrow approach is often recommended because of the following reasons –

Under extreme scenarios, a flat-wide approach might end up with a single row per region, resulting in poor performance and scalability.
A focused table scan is often more efficient than many individual reads. Considering that only a subset of the row data will be required, the tall-narrow table design will generally provide better performance than the flat-wide approach.
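The focused-scan advantage of tall-narrow keys can be sketched in a few lines of Python (toy data; the `"<user>|<date>"` composite key format is hypothetical): because rows are stored sorted by key, a prefix scan reads one contiguous group of entries and nothing else.

```python
# Tall-narrow design: composite row keys make one user's events a contiguous range.
rows = {
    "user123|20240101": "login",
    "user123|20240102": "purchase",
    "user456|20240101": "login",
}

def prefix_scan(table, prefix):
    # rows are kept sorted by key, so a prefix scan is a bounded range read
    return {k: table[k] for k in sorted(table) if k.startswith(prefix)}

print(prefix_scan(rows, "user123|"))  # only user123's two events
```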
Q. What is the best practice for deciding the number of column families for an HBase table?

Keep the number of column families per HBase table small – the HBase documentation recommends no more than two or three. Every column family is stored in its own set of files, so a large number of column families forces reads to open and merge many files, and flushes and compactions are triggered across all the families in a region.

Q. How will you implement joins in HBase?

HBase does not support joins directly, but join queries can be implemented using MapReduce jobs to retrieve data from multiple HBase tables.
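A MapReduce-style (reduce-side) join can be sketched in plain Python over toy data (a real job would read the two tables with the HBase MapReduce integration classes; the table contents here are hypothetical): the map phase emits each record under its join key with a source tag, and the reduce phase combines records sharing a key.

```python
from collections import defaultdict

# Toy contents of two "HBase tables"
employees = {"e1": {"name": "Ann", "dept": "d1"}, "e2": {"name": "Bob", "dept": "d2"}}
departments = {"d1": {"dept_name": "Sales"}, "d2": {"dept_name": "HR"}}

# Map phase: emit (join key, tagged record) from each table
emitted = defaultdict(list)
for emp in employees.values():
    emitted[emp["dept"]].append(("emp", emp["name"]))
for dept_id, dept in departments.items():
    emitted[dept_id].append(("dept", dept["dept_name"]))

# Reduce phase: for each join key, pair up records from the two sources
joined = []
for dept_id, records in emitted.items():
    names = [v for tag, v in records if tag == "emp"]
    dept_names = [v for tag, v in records if tag == "dept"]
    for n in names:
        for d in dept_names:
            joined.append((n, d))

print(sorted(joined))  # [('Ann', 'Sales'), ('Bob', 'HR')]
```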

Q. What is the difference between HBase and HDFS?

HDFS is the distributed file system in Hadoop for storing large files, but it does not provide a tabular form of storage. Data in HDFS is accessed through MapReduce jobs, and it is well suited for high-latency batch processing operations.

HBase is a column-oriented database on Hadoop that runs on top of HDFS and stores data in tabular format. HBase is like a database management system that communicates with HDFS to write logical tabular data to the physical file system. With HBase, one can access single rows out of billions of records, which makes it well suited for low-latency operations. HBase keeps data in indexed StoreFiles on HDFS for high-speed lookups.

HBase Interview Questions for Experienced

1) How will you design the HBase Schema for Twitter data?

2) You want to fetch data from HBase to create a REST API. Which is the best way to read HBase data using a Spark Job or a Java program?

3) Design an HBase table for a many-to-many relationship between two entities, for example employee and department.

4) Explain an example that demonstrates good de-normalization in HBase with consistency.

5) Should your HBase and MapReduce cluster be the same or they should be run on separate clusters?

If there are any other HBase interview questions that you have been asked in your Hadoop job interview, feel free to share them in the comments below.
