What is the default replication factor and how will you change it at file level?
The default replication factor is 3. It can be changed with the -setrep command.
At file level:
hadoop fs -setrep -w 3 /my/file
At directory level:
hadoop fs -setrep -R -w 3 /my/dir
How many daemons are there in hadoop?
There are 5 daemon processes which run in a Hadoop 1.x cluster:
A) NameNode
B) Secondary NameNode
C) DataNode
D) TaskTracker
E) JobTracker
What is ls -lrt?
Lists files in long format, sorted by last modified time in reverse order (oldest first).
l - long listing format
r - reverse the sort order
t - sort by last modified time
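A small demonstration (the directory and file names are made up for illustration):

```shell
mkdir -p /tmp/lrt_demo
touch -t 202001010000 /tmp/lrt_demo/old.txt   # set an older modification time
touch -t 202306150000 /tmp/lrt_demo/new.txt   # set a newer modification time
ls -lrt /tmp/lrt_demo                         # long listing, oldest first: old.txt before new.txt
```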
How to copy file content to another file using cat?
cat filename >> newfilename
(>> appends to the target file; a single > would overwrite it.)
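A quick illustration of the append vs overwrite behaviour (file names here are hypothetical):

```shell
# create two sample files to demonstrate with
printf 'alpha\n' > /tmp/file1.txt
printf 'beta\n'  > /tmp/file2.txt
cat /tmp/file1.txt >> /tmp/file2.txt   # append: file2 now holds beta, then alpha
cat /tmp/file1.txt > /tmp/file3.txt    # overwrite/create: file3 holds only alpha
```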
What is the default block size in HDFS? Is it configurable?
The default block size in Apache Hadoop 1.x is 64 MB (128 MB in Hadoop 2.x), whereas the default block size in Cloudera/Hortonworks distributions is 128 MB.
Yes, it is configurable. The value is given in bytes, typically a multiple such as 64, 128, 512, or 1024 MB; for example 134217728 bytes = 128 MB.
Example: hadoop fs -D dfs.block.size=134217728 -put local_name remote_location
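The byte value passed to -D dfs.block.size follows from MB × 1024 × 1024; a quick shell check of the arithmetic:

```shell
# block sizes must be supplied in bytes
echo $((64 * 1024 * 1024))    # 67108864  (64 MB)
echo $((128 * 1024 * 1024))   # 134217728 (128 MB)
```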
How to load files recursively in hive?
SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;
How to handle fault tolerance in RDD?
RDDs are fault tolerant through lineage: each RDD records the chain of transformations that produced it, so if a partition is lost, Spark recomputes just that partition from the source data instead of relying on replication.
Difference between copy and move?
mv moves or renames a file: after the move, the original file no longer exists at its source location.
cp copies a file: unlike mv, it does not delete the original, so the source file remains as it is.
Select 2nd highest salary from table in HIVE?
SELECT max(salary) FROM emptable WHERE salary < (SELECT max(salary) FROM emptable);
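The query's logic — the maximum value strictly below the overall maximum — can be mimicked in the shell on a made-up list of salaries:

```shell
# hypothetical salary values; sort -nu sorts numerically and drops duplicates,
# so the second-to-last line is the largest value below the maximum
printf '%s\n' 3000 5000 4000 5000 4500 | sort -nu | tail -n 2 | head -n 1
```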
Find Max date using Group By in HIVE
select order_productid, ordername, max(order_date) from orders group by order_productid, ordername;
(order is a reserved word, so the table is assumed to be named orders; every non-aggregated column in the select list must also appear in the GROUP BY.)
How to get the Flume source file and where to use in code?
Ans:
What are the default HDFS/MR ports available in Hadoop?
Commonly: NameNode web UI 50070 (RPC 8020), DataNode 50010 (data transfer) and 50075 (web UI), Secondary NameNode 50090, JobTracker web UI 50030, and TaskTracker web UI 50060 in Hadoop 1.x. In Hadoop 2.x the ResourceManager web UI runs on 8088.
What are the channel types available in flume?
FILE Channel - the File Channel writes the contents to a file on the file system after reading the event from a source. The file is deleted only after the contents are successfully delivered to the sink.
MEMORY Channel - events are read from the source into memory and passed to the sink. It is faster, but events are lost if the agent crashes.
What is source and sink in flume?
A source is the component of an Agent which receives data from the data generators and transfers it to one or more channels in the form of Flume events.
A sink stores the data into centralized stores like HBase and HDFS. It consumes the data (events) from the channels and delivers it to the destination. The destination of the sink might be another agent or the central stores.
Difference between External table and Managed table?
There are two types of tables in Hive: managed tables and external tables. The difference is in what happens when you drop a table: for a managed table, Hive deletes both the data and the metadata; for an external table, Hive deletes only the metadata and leaves the data in place.
Explain Project flow/architecture?
Ans:
Write a Query to get Max and Min salary for each department?
select deptid, max(sal), min(sal) from emp group by deptid;
Write a Query to get 3 Max salary for each department?
select deptid, sal from (select deptid, sal, dense_rank() over (partition by deptid order by sal desc) as rnk from emp) t where rnk <= 3;
(uses the same emp table as above; dense_rank keeps ties within the top 3)
What is Hive UDF?
A UDF (user-defined function) lets you plug custom logic into HiveQL. You write a Java class that extends org.apache.hadoop.hive.ql.exec.UDF and implements an evaluate() method, add the jar to the session, and register it with CREATE TEMPORARY FUNCTION.
What is the Cluster size in your project?
20 nodes; each data node has 2 TB of disk and 16 GB of RAM, and the name node has 6 TB of disk and 16 GB of RAM.
What is difference between mongodb and hbase?
HBASE: HBase is an open source database created by Apache for storing billions of rows with millions of columns. It is basically a column-oriented database in which data is stored in key-value pairs. The column-based, key-value structure is very performant for reporting and dashboarding usages.
MONGODB: It is a document-oriented database. All data in MongoDB is treated in JSON/BSON format. It is a schema-less database that can scale over terabytes of data.
Difference between hadoop 1.0 and 2.0?
Hadoop 1.x
- Supports only the MapReduce (MR) processing model; does not support non-MR tools.
- MR does both processing and cluster resource management.
- Limited scaling of nodes: up to 4,000 nodes per cluster.
- Works on the concept of slots; a slot can run either a Map task or a Reduce task only.
- A single Namenode manages the entire namespace.
- Has a Single Point of Failure (SPOF) because of the single Namenode; a Namenode failure needs manual intervention to overcome.
- The MR API is compatible with Hadoop 1.x; a program written for Hadoop 1 executes in Hadoop 1.x without any additional files.
- Limited as a platform for event processing, streaming, and real-time operations.
Hadoop 2.x
- Allows MR as well as other distributed computing models like Spark, Hama, Giraph, Message Passing Interface (MPI), and HBase coprocessors.
- YARN (Yet Another Resource Negotiator) does cluster resource management, and processing is done using different processing models.
- Better scalability: up to 10,000 nodes per cluster.
- Works on the concept of containers; containers can run generic tasks.
- Multiple Namenode servers manage multiple namespaces.
- Overcomes SPOF with a standby Namenode; in case of Namenode failure it can be configured for automatic recovery.
- A program written for Hadoop 1.x requires additional files to execute in Hadoop 2.x.
- Can serve as a platform for a wide variety of data analytics: event processing, streaming, and real-time operations are all possible.
What is spark and explain its architecture?
Apache Spark is a powerful open source processing engine built around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009. Its architecture consists of a driver program that creates the SparkContext, a cluster manager (standalone, YARN, or Mesos) that allocates resources, and executors on the worker nodes that run tasks and cache data.
Word count program in SPARK using Scala?
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.collect().foreach(println)
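The same split → group → count pipeline can be sketched with standard Unix tools (the sample text is made up), which is a handy way to see what each Spark step produces on a small input:

```shell
echo "to be or not to be" |
  tr ' ' '\n' |   # flatMap: one word per line
  sort |          # shuffle: bring equal keys together
  uniq -c         # reduceByKey: count occurrences per word
```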
Difference between HDFS and HIVE?
Hadoop is a framework that helps process large data sets across multiple computers. It includes Map/Reduce (parallel processing) and HDFS (a distributed file system).
Hive is a data warehouse built on top of HDFS and Map/Reduce. It provides a SQL-like query engine that converts queries into Map/Reduce jobs and runs them on the cluster.
Performance tuning in hive?
Partition and bucket large tables, use columnar formats (ORC/Parquet) with compression, prefer map-side joins for small tables, enable vectorized execution, run on Tez instead of plain MapReduce, and collect table statistics for the cost-based optimizer.
What is narrow and wide dependencies?
In Spark, a narrow dependency means each partition of the parent RDD is used by at most one partition of the child (e.g. map, filter), so no shuffle is needed. A wide dependency means a child partition depends on many parent partitions (e.g. groupByKey, join), which requires a shuffle and defines a stage boundary.
How you schedule the jobs?
By using Oozie (workflow and coordinator jobs).
How many Mappers and Reducers are being called in "Select * from table"?
None: no mapper or reducer is launched, because Hive serves a plain SELECT * as a simple fetch of the underlying files, much like viewing a file with the cat command in HDFS.
Example: hdfs dfs -cat <file name>
What are the File formats in hive?
- Text File.
- SequenceFile.
- RCFile.
- Avro Files.
- ORC Files.
- Parquet.
- Custom INPUTFORMAT and OUTPUTFORMAT
What is the difference between ORC & RC file format?
Both are columnar storage formats for Hive. RCFile (Record Columnar File) stores data in row groups with columns laid out separately. ORC (Optimized Row Columnar) is its successor: it adds lightweight indexes, per-stripe min/max statistics, better compression, and support for complex types, so it is generally faster and smaller than RCFile.
Difference between partition and bucketing?
Partitioning splits a table into separate HDFS directories based on the value of one or more columns (e.g. one directory per date), which lets queries skip irrelevant directories. Bucketing hashes a column's value into a fixed number of files (buckets) within each partition, which helps with sampling and with efficient joins on the bucketed column.
Difference between static & dynamic partition?
With a static partition, the partition value is specified explicitly in the insert/load statement, e.g. PARTITION (dt='2020-01-01'). With a dynamic partition, Hive determines the partition values from the data itself during the insert (enabled with SET hive.exec.dynamic.partition=true; and SET hive.exec.dynamic.partition.mode=nonstrict;).
How internally copy cmd work while transferring data from local to hdfs?
The client reads the local file, splits it into blocks, and asks the NameNode which DataNodes should hold each block. Each block is streamed to the first DataNode, which pipelines it to the next DataNode until the replication factor is met; the NameNode records only the metadata.
What are the file format in flume?
For the HDFS sink, hdfs.fileType can be SequenceFile (the default), DataStream (plain text with no serialization), or CompressedStream.
Command to copy local file to hdfs
hdfs dfs -copyFromLocal <source> <destination>
(hdfs dfs -put works the same way.)
Reg_Ex function in hive?
Hive provides regexp_extract(str, pattern, index) to pull out a matching group and regexp_replace(str, pattern, replacement) to substitute matches; RLIKE/REGEXP can be used in WHERE clauses for pattern matching.
How do you do indexing in hive?
CREATE INDEX idx_name ON TABLE tablename (colname) AS 'COMPACT' WITH DEFERRED REBUILD; followed by ALTER INDEX idx_name ON tablename REBUILD; (BITMAP indexes are also supported. Indexing was removed in Hive 3, where ORC's built-in indexes and materialized views are preferred.)
What is partition and types?
· Hive partitioning divides a large amount of data into a number of folders based on the values of table columns.
· A Hive partition is often used for distributing load horizontally; this has a performance benefit and helps in organizing data in a logical fashion.
Types of Partition in HIVE: static and dynamic.
Joins in hive? and map-side join and reduce side join?
Ans: Hive supports inner, left outer, right outer, full outer, and cross joins. In a map-side join, the smaller table is loaded into memory and joined in the mappers with no shuffle (Hive does this automatically via hive.auto.convert.join when one table is small enough). In a reduce-side join, rows from both tables are shuffled on the join key and joined in the reducers; it works for any table size but is slower.
(Note: Hive is schema on read.)
Hive performance tuning? (Partitioning, bucketing)
Ans: Partition and bucket large tables, store data in ORC/Parquet with compression, use map-side joins for small dimension tables, and enable vectorized execution.
What is SerDe in HIVE?
Ans: SerDe stands for Serializer/Deserializer. It is the interface Hive uses to read rows from HDFS files into column values and to write them back, e.g. LazySimpleSerDe for delimited text, or the CSV and JSON SerDes.
What is co-related subquery?
Ans: A subquery that references a column from the outer query, so it is logically evaluated once per row of the outer query, e.g. SELECT e.name FROM emp e WHERE e.sal > (SELECT avg(sal) FROM emp d WHERE d.deptid = e.deptid);
Difference between managed table and external table? what you will use in your project?
Ans: Dropping a managed table deletes both data and metadata; dropping an external table deletes only the metadata and the files remain in HDFS. External tables are typically used in projects when the data is shared with other tools or must survive a table drop.
What are the different joins in HIVE?
Ans: Inner join, left outer join, right outer join, full outer join, and cross join, plus the optimized variants: map join (small table held in memory) and sort-merge-bucket (SMB) join for bucketed tables.
How do you do parallel processing in sqoop?
Ans: Sqoop runs the import as a map-only MapReduce job; the number of parallel mappers is set with -m/--num-mappers (default 4), and --split-by chooses the column used to divide the rows among the mappers.
Why to use sqoop?
Ans: To transfer bulk data between relational databases and Hadoop (HDFS, Hive, HBase) efficiently and in parallel, in both directions (import and export), without writing custom JDBC code.
What problem you faced in initial stage in sqoop?
Ans:
How to handle incremental data in sqoop?
Ans: Use --incremental append (new rows only, tracked via --check-column and --last-value) or --incremental lastmodified (updated rows, tracked via a timestamp column). A saved Sqoop job stores the last value automatically between runs.
What are the Limitations of sqoop export?
Ans: Export is not atomic: each mapper commits separately, so a failed export can leave the target table partially loaded (mitigated with --staging-table). The target table must already exist, and column types must map cleanly between Hadoop and the database.
Does sqoop have a reduce function?
NO. Sqoop imports and exports run as map-only jobs; there is no aggregation, so no reducer is needed.
How did you import into a specific directory in sqoop?
Ans: Use --target-dir <hdfs path> for a single table, or --warehouse-dir <hdfs path> as the parent directory when importing multiple tables.
Directly you have imported into hdfs or you imported to linux and then moved to HDFS?
Ans: Directly into HDFS. Sqoop's mappers connect to the database over JDBC and write their output straight to HDFS; the data does not land on the local Linux file system first.
Incremental load (Append mode or last modified mode) in sqoop
Ans: Append mode imports only rows whose --check-column value (typically an auto-increment id) is greater than --last-value. Last-modified mode uses a date/timestamp --check-column to also pick up rows updated since the last run, merging them with --merge-key.
Split by key in sqoop
Ans: --split-by <column> tells Sqoop which column to use when dividing the source rows into ranges, one range per mapper. By default it uses the table's primary key; for tables without one you must supply --split-by or run with a single mapper (-m 1).
What is RDD?
Ans: An RDD (Resilient Distributed Dataset) is Spark's core abstraction: an immutable, partitioned collection of records that is fault tolerant, since lost partitions can be recomputed from the lineage of transformations that produced them.
How do you create RDD?
By using parallelize (on an in-memory collection) or textFile (on a file or HDFS path).
What is Lazy evaluation in SPARK?
Ans: Transformations (map, filter, etc.) do not execute immediately; they only build up the lineage/DAG. Computation happens when an action (collect, count, saveAsTextFile) is called, which lets Spark optimize the whole plan.
Aws cloud based how do you run?
Ans: Typically on Amazon EMR, which provisions a Hadoop/Spark cluster on EC2, usually with the data stored in S3.
What is the difference between reduceByKey, combineByKey, and groupByKey?
Ans: groupByKey shuffles every (key, value) pair across the network and only then groups them, which is expensive. reduceByKey combines values for each key on the map side first, so far less data is shuffled. combineByKey is the general form underlying both: it takes separate functions to create a combiner, merge a value into it, and merge combiners, which is useful when the result type differs from the value type (e.g. computing averages).
What is fault tolerance in Oozie?
Ans: Oozie actions can be configured to retry on failure (retry-max and retry-interval), and every workflow node defines an error transition, typically to a kill node, so failures are caught and the workflow can be rerun from the failed node.