What is the default replication factor and how will you change it at file level?
The default replication factor is 3. It can be changed with the -setrep command.
At file level:
hadoop fs -setrep -w 3 /my/file
At directory level:
hadoop fs -setrep -R -w 3 /my/dir
How many daemons are there in hadoop?
There are 5 daemon processes which run in a Hadoop 1.x cluster:
A) NameNode
B) Secondary NameNode
C) DataNode
D) TaskTracker
E) JobTracker
What is ls -lrt?
Lists files in long format, sorted by last modified time in reverse order (oldest first).
l - long listing format
r - reverse the sort order
t - sort by last modified time
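A small demonstration (the directory and file names are made up for illustration):

```shell
mkdir -p /tmp/lrt_demo
touch -t 202001010000 /tmp/lrt_demo/old.txt   # set an older modification time
touch -t 202306150000 /tmp/lrt_demo/new.txt   # set a newer modification time
ls -lrt /tmp/lrt_demo                         # long listing, oldest first: old.txt before new.txt
```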
How to copy file content to another file using cat?
cat filename >> newfilename
(>> appends to the target file; a single > would overwrite it.)
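A quick illustration of the append vs overwrite behaviour (file names here are hypothetical):

```shell
# create two sample files to demonstrate with
printf 'alpha\n' > /tmp/file1.txt
printf 'beta\n'  > /tmp/file2.txt
cat /tmp/file1.txt >> /tmp/file2.txt   # append: file2 now holds beta, then alpha
cat /tmp/file1.txt > /tmp/file3.txt    # overwrite/create: file3 holds only alpha
```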
What is the default block size in HDFS? Is it configurable?
The default block size in Apache Hadoop 1.x is 64 MB (128 MB in Hadoop 2.x), whereas the default block size in Cloudera/Hortonworks distributions is 128 MB.
Yes, it is configurable. The value is given in bytes, typically a multiple such as 64, 128, 512, or 1024 MB; for example 134217728 bytes = 128 MB.
Example: hadoop fs -D dfs.block.size=134217728 -put local_name remote_location
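The byte value passed to -D dfs.block.size follows from MB × 1024 × 1024; a quick shell check of the arithmetic:

```shell
# block sizes must be supplied in bytes
echo $((64 * 1024 * 1024))    # 67108864  (64 MB)
echo $((128 * 1024 * 1024))   # 134217728 (128 MB)
```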
How to load files recursively in hive?
SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;
How to handle fault tolerance in RDD?
RDDs are fault tolerant through lineage: each RDD records the chain of transformations that produced it, so if a partition is lost, Spark recomputes just that partition from the source data instead of relying on replication.
Difference between copy and move?
mv moves or renames a file: after the move, the original file no longer exists at its source location.
cp copies a file: unlike mv, it does not delete the original, so the source file remains as it is.
Select 2nd highest salary from table in HIVE?
SELECT max(salary) FROM emptable WHERE salary < (SELECT max(salary) FROM emptable);
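The query's logic — the maximum value strictly below the overall maximum — can be mimicked in the shell on a made-up list of salaries:

```shell
# hypothetical salary values; sort -nu sorts numerically and drops duplicates,
# so the second-to-last line is the largest value below the maximum
printf '%s\n' 3000 5000 4000 5000 4500 | sort -nu | tail -n 2 | head -n 1
```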
Find Max date using Group By in HIVE
select order_productid, ordername, max(order_date) from orders group by order_productid, ordername;
(order is a reserved word, so the table is assumed to be named orders; every non-aggregated column in the select list must also appear in the GROUP BY.)
How to get the Flume source file and where to use in code?
Ans:
What are the default HDFS/MR ports available in Hadoop?
Commonly: NameNode web UI 50070 (RPC 8020), DataNode 50010 (data transfer) and 50075 (web UI), Secondary NameNode 50090, JobTracker web UI 50030, and TaskTracker web UI 50060 in Hadoop 1.x. In Hadoop 2.x the ResourceManager web UI runs on 8088.
What are the channel types available in flume?
FILE Channel - the File Channel writes the contents to a file on the file system after reading the event from a source. The file is deleted only after the contents are successfully delivered to the sink.
MEMORY Channel - events are read from the source into memory and passed to the sink. It is faster, but events are lost if the agent crashes.
What is source and sink in flume?
A source is the component of an Agent which receives data from the data generators and transfers it to one or more channels in the form of Flume events.
A sink stores the data into centralized stores like HBase and HDFS. It consumes the data (events) from the channels and delivers it to the destination. The destination of the sink might be another agent or the central stores.
Difference between External table and Managed table?
There are two types of tables in Hive: managed tables and external tables. The difference is in what happens when you drop a table: for a managed table, Hive deletes both the data and the metadata; for an external table, Hive deletes only the metadata and leaves the data in place.
Explain Project flow/architecture?
Ans:
Write a Query to get Max and Min salary for each department?
select deptid, max(sal), min(sal) from emp group by deptid;
Write a Query to get 3 Max salary for each department?
select deptid, sal from (select deptid, sal, dense_rank() over (partition by deptid order by sal desc) as rnk from emp) t where rnk <= 3;
(uses the same emp table as above; dense_rank keeps ties within the top 3)
What is Hive UDF?
A UDF (user-defined function) lets you plug custom logic into HiveQL. You write a Java class that extends org.apache.hadoop.hive.ql.exec.UDF and implements an evaluate() method, add the jar to the session, and register it with CREATE TEMPORARY FUNCTION.
What is the Cluster size in your project?
20 nodes; each data node has 2 TB of disk and 16 GB of RAM, and the name node has 6 TB of disk and 16 GB of RAM.
What is difference between mongodb and hbase?
HBASE: HBase is an open source database created by Apache for storing billions of rows with millions of columns. It is basically a column-oriented database in which data is stored in key-value pairs. The column-based, key-value structure is very performant for reporting and dashboarding usages.
MONGODB: It is a document-oriented database. All data in MongoDB is treated in JSON/BSON format. It is a schema-less database that can scale over terabytes of data.
Difference between hadoop 1.0 and 2.0?
Hadoop 1.x
- Supports only the MapReduce (MR) processing model; does not support non-MR tools.
- MR does both processing and cluster resource management.
- Limited scaling of nodes: up to 4,000 nodes per cluster.
- Works on the concept of slots; a slot can run either a Map task or a Reduce task only.
- A single Namenode manages the entire namespace.
- Has a Single Point of Failure (SPOF) because of the single Namenode; a Namenode failure needs manual intervention to overcome.
- The MR API is compatible with Hadoop 1.x; a program written for Hadoop 1 executes in Hadoop 1.x without any additional files.
- Limited as a platform for event processing, streaming, and real-time operations.
Hadoop 2.x
- Allows MR as well as other distributed computing models like Spark, Hama, Giraph, Message Passing Interface (MPI), and HBase coprocessors.
- YARN (Yet Another Resource Negotiator) does cluster resource management, and processing is done using different processing models.
- Better scalability: up to 10,000 nodes per cluster.
- Works on the concept of containers; containers can run generic tasks.
- Multiple Namenode servers manage multiple namespaces.
- Overcomes SPOF with a standby Namenode; in case of Namenode failure it can be configured for automatic recovery.
- A program written for Hadoop 1.x requires additional files to execute in Hadoop 2.x.
- Can serve as a platform for a wide variety of data analytics: event processing, streaming, and real-time operations are all possible.
What is spark and explain its architecture?
Apache Spark is a powerful open source processing engine built around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009. Its architecture consists of a driver program that creates the SparkContext, a cluster manager (standalone, YARN, or Mesos) that allocates resources, and executors on the worker nodes that run tasks and cache data.
Word count program in SPARK using Scala?
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.collect().foreach(println)
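The same split → group → count pipeline can be sketched with standard Unix tools (the sample text is made up), which is a handy way to see what each Spark step produces on a small input:

```shell
echo "to be or not to be" |
  tr ' ' '\n' |   # flatMap: one word per line
  sort |          # shuffle: bring equal keys together
  uniq -c         # reduceByKey: count occurrences per word
```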
Difference between HDFS and HIVE?
Hadoop is a framework that helps process large data sets across multiple computers. It includes Map/Reduce (parallel processing) and HDFS (a distributed file system).
Hive is a data warehouse built on top of HDFS and Map/Reduce. It provides a SQL-like query engine that converts queries into Map/Reduce jobs and runs them on the cluster.
Performance tuning in hive?
Partition and bucket large tables, use columnar formats (ORC/Parquet) with compression, prefer map-side joins for small tables, enable vectorized execution, run on Tez instead of plain MapReduce, and collect table statistics for the cost-based optimizer.
What is narrow and wide dependencies?
In Spark, a narrow dependency means each partition of the parent RDD is used by at most one partition of the child (e.g. map, filter), so no shuffle is needed. A wide dependency means a child partition depends on many parent partitions (e.g. groupByKey, join), which requires a shuffle and defines a stage boundary.
How you schedule the jobs?
By using Oozie (workflow and coordinator jobs).
How many Mappers and Reducers are being called in "Select * from table"?
None: no mapper or reducer is launched, because Hive serves a plain SELECT * as a simple fetch of the underlying files, much like viewing a file with the cat command in HDFS.
Example: hdfs dfs -cat <file name>
What are the File formats in hive?
- Text File.
- SequenceFile.
- RCFile.
- Avro Files.
- ORC Files.
- Parquet.
- Custom INPUTFORMAT and OUTPUTFORMAT
What is the difference between ORC & RC file format?
Both are columnar storage formats for Hive. RCFile (Record Columnar File) stores data in row groups with columns laid out separately. ORC (Optimized Row Columnar) is its successor: it adds lightweight indexes, per-stripe min/max statistics, better compression, and support for complex types, so it is generally faster and smaller than RCFile.
Difference between partition and bucketing?
Partitioning splits a table into separate HDFS directories based on the value of one or more columns (e.g. one directory per date), which lets queries skip irrelevant directories. Bucketing hashes a column's value into a fixed number of files (buckets) within each partition, which helps with sampling and with efficient joins on the bucketed column.
Difference between static & dynamic partition?
With a static partition, the partition value is specified explicitly in the insert/load statement, e.g. PARTITION (dt='2020-01-01'). With a dynamic partition, Hive determines the partition values from the data itself during the insert (enabled with SET hive.exec.dynamic.partition=true; and SET hive.exec.dynamic.partition.mode=nonstrict;).
How internally copy cmd work while transferring data from local to hdfs?
The client reads the local file, splits it into blocks, and asks the NameNode which DataNodes should hold each block. Each block is streamed to the first DataNode, which pipelines it to the next DataNode until the replication factor is met; the NameNode records only the metadata.
What are the file format in flume?
For the HDFS sink, hdfs.fileType can be SequenceFile (the default), DataStream (plain text with no serialization), or CompressedStream.
Command to copy local file to hdfs
hdfs dfs -copyFromLocal <source> <destination>
(hdfs dfs -put works the same way.)
Reg_Ex function in hive?
Hive provides regexp_extract(str, pattern, index) to pull out a matching group and regexp_replace(str, pattern, replacement) to substitute matches; RLIKE/REGEXP can be used in WHERE clauses for pattern matching.
How do you do indexing in hive?
CREATE INDEX idx_name ON TABLE tablename (colname) AS 'COMPACT' WITH DEFERRED REBUILD; followed by ALTER INDEX idx_name ON tablename REBUILD; (BITMAP indexes are also supported. Indexing was removed in Hive 3, where ORC's built-in indexes and materialized views are preferred.)
What is partition and types?
· Hive partitioning divides a large amount of data into a number of folders based on the values of table columns.
· A Hive partition is often used for distributing load horizontally; this has a performance benefit and helps in organizing data in a logical fashion.
Types of Partition in HIVE: static and dynamic.
Joins in hive? and map-side join and reduce side join?
Ans: Hive supports inner, left outer, right outer, full outer, and cross joins. In a map-side join, the smaller table is loaded into memory and joined in the mappers with no shuffle (Hive does this automatically via hive.auto.convert.join when one table is small enough). In a reduce-side join, rows from both tables are shuffled on the join key and joined in the reducers; it works for any table size but is slower.
(Note: Hive is schema on read.)
Hive performance tuning? (Partitioning, bucketing)
Ans: Partition and bucket large tables, store data in ORC/Parquet with compression, use map-side joins for small dimension tables, and enable vectorized execution.
What is SerDe in HIVE?
Ans: SerDe stands for Serializer/Deserializer. It is the interface Hive uses to read rows from HDFS files into column values and to write them back, e.g. LazySimpleSerDe for delimited text, or the CSV and JSON SerDes.
What is co-related subquery?
Ans: A subquery that references a column from the outer query, so it is logically evaluated once per row of the outer query, e.g. SELECT e.name FROM emp e WHERE e.sal > (SELECT avg(sal) FROM emp d WHERE d.deptid = e.deptid);
Difference between managed table and external table? what you will use in your project?
Ans: Dropping a managed table deletes both data and metadata; dropping an external table deletes only the metadata and the files remain in HDFS. External tables are typically used in projects when the data is shared with other tools or must survive a table drop.
What are the different joins in HIVE?
Ans: Inner join, left outer join, right outer join, full outer join, and cross join, plus the optimized variants: map join (small table held in memory) and sort-merge-bucket (SMB) join for bucketed tables.
How do you do parallel processing in sqoop?
Ans: Sqoop runs the import as a map-only MapReduce job; the number of parallel mappers is set with -m/--num-mappers (default 4), and --split-by chooses the column used to divide the rows among the mappers.
Why to use sqoop?
Ans: To transfer bulk data between relational databases and Hadoop (HDFS, Hive, HBase) efficiently and in parallel, in both directions (import and export), without writing custom JDBC code.
What problem you faced in initial stage in sqoop?
Ans:
How to handle incremental data in sqoop?
Ans: Use --incremental append (new rows only, tracked via --check-column and --last-value) or --incremental lastmodified (updated rows, tracked via a timestamp column). A saved Sqoop job stores the last value automatically between runs.
What are the Limitations of sqoop export?
Ans: Export is not atomic: each mapper commits separately, so a failed export can leave the target table partially loaded (mitigated with --staging-table). The target table must already exist, and column types must map cleanly between Hadoop and the database.
Does sqoop have a reduce function?
NO. Sqoop imports and exports run as map-only jobs; there is no aggregation, so no reducer is needed.
How did you import into a specific directory in sqoop?
Ans: Use --target-dir <hdfs path> for a single table, or --warehouse-dir <hdfs path> as the parent directory when importing multiple tables.
Directly you have imported into hdfs or you imported to linux and then moved to HDFS?
Ans: Directly into HDFS. Sqoop's mappers connect to the database over JDBC and write their output straight to HDFS; the data does not land on the local Linux file system first.
Incremental load (Append mode or last modified mode) in sqoop
Ans: Append mode imports only rows whose --check-column value (typically an auto-increment id) is greater than --last-value. Last-modified mode uses a date/timestamp --check-column to also pick up rows updated since the last run, merging them with --merge-key.
Split by key in sqoop
Ans: --split-by <column> tells Sqoop which column to use when dividing the source rows into ranges, one range per mapper. By default it uses the table's primary key; for tables without one you must supply --split-by or run with a single mapper (-m 1).
What is RDD?
Ans: An RDD (Resilient Distributed Dataset) is Spark's core abstraction: an immutable, partitioned collection of records that is fault tolerant, since lost partitions can be recomputed from the lineage of transformations that produced them.
How do you create RDD?
By using parallelize (on an in-memory collection) or textFile (on a file or HDFS path).
What is Lazy evaluation in SPARK?
Ans: Transformations (map, filter, etc.) do not execute immediately; they only build up the lineage/DAG. Computation happens when an action (collect, count, saveAsTextFile) is called, which lets Spark optimize the whole plan.
Aws cloud based how do you run?
Ans: Typically on Amazon EMR, which provisions a Hadoop/Spark cluster on EC2, usually with the data stored in S3.
What is the difference between reduceByKey, combineByKey, and groupByKey?
Ans: groupByKey shuffles every (key, value) pair across the network and only then groups them, which is expensive. reduceByKey combines values for each key on the map side first, so far less data is shuffled. combineByKey is the general form underlying both: it takes separate functions to create a combiner, merge a value into it, and merge combiners, which is useful when the result type differs from the value type (e.g. computing averages).
What is fault tolerance in Oozie?
Ans: Oozie actions can be configured to retry on failure (retry-max and retry-interval), and every workflow node defines an error transition, typically to a kill node, so failures are caught and the workflow can be rerun from the failed node.