Q. What is the difference between Pig and Hive ?
Hive:
- Type of Data : Used for Structured Data
- Schema : Hive requires a well-defined Schema.
- Language : Follows SQL Dialect and is a declarative language.
- Purpose : It is mainly used for reporting.
- General Usage : Usually used on the server side of the hadoop cluster.
- Coding Style : More like SQL
- File Support : Hive does not support it.
- Mainly Used by: Data Analysts
Pig:
- Type of Data : Apache Pig is usually used for semi structured data.
- Schema : Schema is optional.
- Language : It is a procedural data flow language.
- Purpose : Mainly used for programming.
- General Usage : Usually used on the client side of the hadoop cluster.
- Coding Style : Verbose
- File Support : Pig supports Avro file format.
- Mainly Used by: Researchers and Programmers
Q. What is the difference between HBase and Hive ?
Hive:
- Hive allows execution of most SQL queries.
- Hive runs on top of Hadoop MapReduce.
- Hive is a datawarehouse framework.
- Does not support record level insert, update and delete.
HBase:
- HBase does not allow execution of SQL queries.
- HBase runs on top of HDFS.
- HBase is a NoSQL database.
- Supports record level insert, updated and delete operations.
Q. I do not need the index created in the first question anymore. How can I delete the above index named index_bonuspay?
DROP INDEX index_bonuspay ON employee;
Q. Can you list few commonly used Hive services?
Command Line Interface (cli)
Hive Web Interface (hwi)
HiveServer (hiveserver)
Printing the contents of an RC file using the tool rcfilecat.
Jar
Metastore
Q. Suppose that I want to monitor all the open and aborted transactions in the system along with the transaction id and the transaction state. Can this be achieved using Apache Hive?
Hive 0.13.0 and above version support SHOW TRANSACTIONS command that helps administrators monitor various hive transactions.
Q. What is the use of Hcatalog?
Hcatalog can be used to share data structures with external systems. Hcatalog provides access to hive metastore to users of other tools on Hadoop so that they can read and write data to hive’s data warehouse.
Q. Write a query to rename a table Student to Student_New.
Alter Table Student RENAME to Student_New
Q. Where is table data stored in Apache Hive by default?
hdfs: //namenode_server/user/hive/warehouse
Q. Explain the difference between partitioning and bucketing.
Partitioning and Bucketing of tables is done to improve the query performance. Partitioning helps execute queries faster, only if the partitioning scheme has some common range filtering i.e. either by timestamp ranges, by location, etc. Bucketing does not work by default.
Partitioning helps eliminate data when used in WHERE clause. Bucketing helps organize data inside the partition into multiple files so that same set of data will always be written in the same bucket. Bucketing helps in joining various columns.
In partitioning technique, a partition is created for every unique value of the column and there could be a situation where several tiny partitions may have to be created. However, with bucketing, one can limit it to a specific number and the data can then be decomposed in those buckets.
Basically, a bucket is a file in Hive whereas partition is a directory.
Q. Explain about the different types of partitioning in Hive?
Partitioning in Hive helps prune the data when executing the queries to speed up processing. Partitions are created when data is inserted into the table. In static partitions, the name of the partition is hardcoded into the insert statement whereas in a dynamic partition, Hive automatically identifies the partition based on the value of the partition field.
Based on how data is loaded into the table, requirements for data and the format in which data is produced at source- static or dynamic partition can be chosen. In dynamic partitions the complete data in the file is read and is partitioned through a MapReduce job based into the tables based on a particular field in the file. Dynamic partitions are usually helpful during ETL flows in the data pipeline.
When loading data from huge files, static partitions are preferred over dynamic partitions as they save time in loading data. The partition is added to the table and then the file is moved into the static partition. The partition column value can be obtained from the file name without having to read the complete file.
Q. When executing Hive queries in different directories, why is metastore_db created in all places from where Hive is launched?
When running Hive in embedded mode, it creates a local metastore. When you run the query, it first checks whether a metastore already exists or not. The property javax.jdo.option.ConnectionURL defined in the hive-site.xml has a default value jdbc: derby: databaseName=metastore_db; create=true.
The value implies that embedded derby will be used as the Hive metastore and the location of the metastore is metastore_db which will be created only if it does not exist already. The location metastore_db is a relative location so when you run queries from different directories it gets created at all places from wherever you launch hive. This property can be altered in the hive-site.xml file to an absolute path so that it can be used from that particular location instead of creating multiple metastore_db subdirectory multiple times.
Q. How will you read and write HDFS files in Hive?
i) TextInputFormat- This class is used to read data in plain text file format.
ii) HiveIgnoreKeyTextOutputFormat- This class is used to write data in plain text file format.
iii) SequenceFileInputFormat- This class is used to read data in hadoop SequenceFile format.
iv) SequenceFileOutputFormat- This class is used to write data in hadoop SequenceFile format.
Q. What are the components of a Hive query processor?
Query processor in Apache Hive converts the SQL to a graph of MapReduce jobs with the execution time framework so that the jobs can be executed in the order of dependencies. The various components of a query processor are-
Parser
Semantic Analyser
Type Checking
Logical Plan Generation
Optimizer
Physical Plan Generation
Execution Engine
Operators
UDF’s and UDAF’s.
Q. Differentiate between describe and describe extended.
Describe database/schema- This query displays the name of the database, the root location on the file system and comments if any.
Describe extended database/schema- Gives the details of the database or schema in a detailed manner.
Q. Is it possible to overwrite Hadoop MapReduce configuration in Hive?
Yes, hadoop MapReduce configuration can be overwritten by changing the hive conf settings file.
Q. I want to see the present working directory in UNIX from hive. Is it possible to run this command from hive?
Hive allows execution of UNIX commands with the use of exclamatory (!) symbol. Just use the ! Symbol before the command to be executed at the hive prompt. To see the present working directory in UNIX from hive run !pwd at the hive prompt.
Q. What is the use of explode in Hive?
Explode in Hive is used to convert complex data types into desired table formats. explode UDTF basically emits all the elements in an array into multiple rows.
Q. Explain about SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY in Hive.
SORT BY – Data is ordered at each of ‘N’ reducers where the reducers can have overlapping range of data.
ORDER BY- This is similar to the ORDER BY in SQL where total ordering of data takes place by passing it to a single reducer.
DISTRUBUTE BY – It is used to distribute the rows among the reducers. Rows that have the same distribute by columns will go to the same reducer.
CLUSTER BY- It is a combination of DISTRIBUTE BY and SORT BY where each of the N reducers gets non overlapping range of data which is then sorted by those ranges at the respective reducers.
Q. Difference between HBase and Hive.
HBase is a NoSQL database whereas Hive is a data warehouse framework to process Hadoop jobs.
HBase runs on top of HDFS whereas Hive runs on top of Hadoop MapReduce.
Q. Write a hive query to view all the databases whose name begins with “db”
SHOW DATABASES LIKE ‘db.*’
Q. How can you prevent a large job from running for a long time?
This can be achieved by setting the MapReduce jobs to execute in strict mode set hive.mapred.mode=strict;
The strict mode ensures that the queries on partitioned tables cannot execute without defining a WHERE clause.
Q. What is a Hive Metastore?
Hive Metastore is a central repository that stores metadata in external database.
Q. Are multiline comments supported in Hive?
No
Q. What is ObjectInspector functionality?
ObjectInspector is used to analyse the structure of individual columns and the internal structure of the row objects. ObjectInspector in Hive provides access to complex objects which can be stored in multiple formats.
Q. Explain about the different types of join in Hive.
HiveQL has 4 different types of joins –
JOIN- Similar to Outer Join in SQL
FULL OUTER JOIN – Combines the records of both the left and right outer tables that fulfil the join condition.
LEFT OUTER JOIN- All the rows from the left table are returned even if there are no matches in the right table.
RIGHT OUTER JOIN-All the rows from the right table are returned even if there are no matches in the left table.
Q. How can you configure remote metastore mode in Hive?
To configure metastore in Hive, hive-site.xml file has to be configured with the below property –
hive.metastore.uris
thrift: //node1 (or IP Address):9083
IP address and port of the metastore host
Q. Is it possible to change the default location of Managed Tables in Hive, if so how?
Yes, we can change the default location of Managed tables using the LOCATION keyword while creating the managed table. The user has to specify the storage path of the managed table as the value to the LOCATION keyword.
Q. How data transfer happens from HDFS to Hive?
If data is already present in HDFS then the user need not LOAD DATA that moves the files to the /user/hive/warehouse/. So the user just has to define the table using the keyword external that creates the table definition in the hive metastore.
Create external table table_name (
id int,
myfields string
)
location '/my/location/in/hdfs';
Q. In case of embedded Hive, can the same metastore be used by multiple users?
We cannot use metastore in sharing mode. It is suggested to use standalone real database like PostGreSQL and MySQL.
Q. The partition of hive table has been modified to point to a new directory location. Do I have to move the data to the new location or the data will be moved automatically to the new location?
Changing the point of partition will not move the data to the new location. It has to be moved manually to the new location from the old one.
Q. What will be the output of cast (‘XYZ’ as INT)?
It will return a NULL value.
Q. What are the different components of a Hive architecture?
Hive Architecture consists of a –
User Interface – UI component of the Hive architecture calls the execute interface to the driver.
Driver create a session handle to the query and sends the query to the compiler to generate an execution plan for it.
Metastore - Sends the metadata to the compiler for the execution of the query on receiving the sendMetaData request.
Compiler- Compiler generates the execution plan which is a DAG of stages where each stage is either a metadata operation, a map or reduce job or an operation on HDFS.
Execute Engine- Execution engine is responsible for submitting each of these stages to the relevant components by managing the dependencies between the various stages in the execution plan generated by the compiler.
Q. What happens on executing the below query? After executing the below query, if you modify the column –how will the changes be tracked?
Hive> CREATE INDEX index_bonuspay ON TABLE employee (bonus)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
The query creates an index named index_bonuspay which points to the bonus column in the employee table. Whenever the value of bonus is modified it will be stored using an index value.
Q. What is the default database provided by Hive for Metastore ?
Derby is the default database.
Q. Explain about the SMB Join in Hive.
In SMB join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second table and then a merge sort join is performed. Sort Merge Bucket (SMB) join in hive is mainly used as there is no limit on file or partition or table join. SMB join can best be used when the tables are large. In SMB join the columns are bucketed and sorted using the join columns. All tables should have the same number of buckets in SMB join.
Q. How can you connect an application, if you run Hive as a server?
When running Hive as a server, the application can be connected in one of the 3 ways-
ODBC Driver-This supports the ODBC protocol
JDBC Driver- This supports the JDBC protocol
Thrift Client- This client can be used to make calls to all hive commands using different programming language like PHP, Python, Java, C++ and Ruby.
Q. What does the overwrite keyword denote in Hive load statement?
Overwrite keyword in Hive load statement deletes the contents of the target table and replaces them with the files referred by the file path i.e. the files that are referred by the file path will be added to the table when using the overwrite keyword.
Q. What is SerDe in Hive? How can you write your own custom SerDe?
SerDe is a Serializer DeSerializer. Hive uses SerDe to read and write data from tables. Generally, users prefer to write a Deserializer instead of a SerDe as they want to read their own data format rather than writing to it. If the SerDe supports DDL i.e. basically SerDe with parameterized columns and different column types, the users can implement a Protocol based DynamicSerDe rather than writing the SerDe from scratch.
Scenario based or Real-Time Interview Questions on Hadoop Hive
How will you optimize Hive performance?
Will the reducer work or not if you use “Limit 1” in any HiveQL query?
Why you should choose Hive instead of Hadoop MapReduce?
I create a table which contains transaction details of customers for the year 2017.
CREATE TABLE transaction_details (cust_id INT, amount FLOAT, month STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ ;
I have inserted 60K tuples in this table and now want to know the total revenue that has been generated for each month. However, Hive takes too much time to process this query. List all the steps that you would follow to solve this problem.
Other Interview Questions on Hadoop Hive
Explain the difference between SQL and Apache Hive.
Why mapreduce will not run if you run select * from table in hive?
Please share your interview experience on HIVE questions asked in your interview in the comments below to help the big data community.
"HELP OTHERS TO HELP YOURSELF"
Really Good blog post.provided a helpful information.I hope that you will post more updates like this Big data hadoop online Course Hyderabad
ReplyDeleteThe article is so appealing. You should read this article before choosing the AWS big data consultant you want to learn.
ReplyDelete