What is the difference between Hadoop and Traditional RDBMS?
Hadoop:
Processes semi-structured and unstructured data.
Schema on Read.
Best suited for data discovery and massive storage/processing of unstructured data.
Writes are fast.
RDBMS:
Processes structured data.
Schema on Write.
Best suited for OLTP and complex ACID transactions.
Reads are fast.
What do the 4 V’s of Big Data denote?
IBM has a nice, simple explanation for the four critical features of big data:
a) Volume – Scale of data
b) Velocity – Analysis of streaming data
c) Variety – Different forms of data
d) Veracity – Uncertainty of data
How does big data analysis help businesses increase their revenue? Give an example.
Big data analysis is helping businesses differentiate themselves. For example, Walmart, the world’s largest retailer by revenue in 2014, uses big data analytics to increase its sales through better predictive analytics, customized recommendations, and new products launched based on customer preferences and needs. Walmart observed a significant 10% to 15% increase in online sales, amounting to $1 billion in incremental revenue. Many more companies such as Facebook, Twitter, LinkedIn, Pandora, JPMorgan Chase and Bank of America use big data analytics to boost their revenue.
Name some companies that use Hadoop.
Yahoo (one of the biggest users and the contributor of more than 80% of the Hadoop code)
Facebook
Netflix
Amazon
Adobe
eBay
Hulu
Spotify
Rubikloud
Twitter
Differentiate between Structured and Unstructured data.
Data that can be stored in traditional database systems in the form of rows and columns, for example online purchase transactions, is referred to as structured data. Data that can be stored only partially in traditional database systems, for example data in XML records, is referred to as semi-structured data. Unorganized and raw data that cannot be categorized as structured or semi-structured is referred to as unstructured data. Facebook updates, tweets on Twitter, reviews, web logs, etc. are all examples of unstructured data.
On what concepts does the Hadoop framework work?
The Hadoop framework works on the following two core components:
1) HDFS – The Hadoop Distributed File System is a Java-based file system for scalable and reliable storage of large datasets. Data in HDFS is stored in the form of blocks, and it operates on a master-slave architecture.
2) Hadoop MapReduce – This is a Java-based programming paradigm of the Hadoop framework that provides scalability across various Hadoop clusters. MapReduce distributes the workload into various tasks that can run in parallel. Hadoop MapReduce jobs perform two separate tasks: the map job and the reduce job. The map job breaks down the data sets into key-value pairs or tuples. The reduce job then takes the output of the map job and combines the data tuples into a smaller set of tuples. The reduce job is always performed after the map job is executed, as the sketch below illustrates.
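To make the map and reduce steps concrete, here is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapreduce API. The class names and the input/output paths passed in args are illustrative placeholders, not part of any particular distribution:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map job: break each input line into (word, 1) key-value pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce job: combine the map output into a smaller set of (word, total) tuples.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework groups all map output sharing the same key before invoking reduce, which is why the reduce step always follows the map step.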
What are the main components of a Hadoop Application?
Hadoop applications draw on a wide range of technologies that provide a great advantage in solving complex business problems.
Core components of a Hadoop application are:
1) Hadoop Common
2) HDFS
3) Hadoop MapReduce
4) YARN
Data Access Components are Pig and Hive.
Data Storage Component is HBase.
Data Integration Components are Apache Flume, Sqoop and Chukwa.
Data Management and Monitoring Components are Ambari, Oozie and Zookeeper.
Data Serialization Components are Thrift and Avro.
Data Intelligence Components are Apache Mahout and Drill.
What is Hadoop Streaming?
The Hadoop distribution includes a generic application programming interface for writing map and reduce jobs in any desired programming language such as Python, Perl, Ruby, etc. This is referred to as Hadoop Streaming. Users can create and run jobs with any kind of shell script or executable as the mapper or reducer, as in the example below.
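As an illustrative sketch, a streaming job can be launched from the command line with standard Unix utilities standing in as the mapper and reducer; the streaming jar path and the HDFS input/output directories below are assumptions that vary by installation and version:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hadoop/input \
    -output /user/hadoop/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc

Here the cat mapper passes each input line through unchanged and wc tallies lines, words and characters for each reduce partition; any executable that reads records from stdin and writes results to stdout can fill either role.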
What is the best hardware configuration to run Hadoop?
The best configuration for executing Hadoop jobs is dual-core machines or dual processors with 4GB or 8GB of RAM that use ECC memory. Hadoop benefits greatly from ECC memory even though it is not low-end hardware; ECC memory is recommended because many Hadoop users have experienced checksum errors when using non-ECC memory. However, the hardware configuration also depends on the workflow requirements and can change accordingly.
What are the most commonly defined input formats in Hadoop?
The most common input formats defined in Hadoop are listed below; a short configuration sketch follows the list.
· Text Input Format – This is the default input format in Hadoop. It breaks files into lines, using the byte offset of each line as the key and the line contents as the value.
· Key Value Input Format – This input format is used for plain text files wherein each line is split into a key and a value by a separator (a tab by default).
· Sequence File Input Format – This input format is used for reading files stored in Hadoop’s binary SequenceFile format.
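As a minimal sketch, selecting one of these formats on a MapReduce job is a single call on the Job object; the class and job names here are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input format demo");

        // Default: each line is a record (byte offset key, line value).
        job.setInputFormatClass(TextInputFormat.class);

        // Alternative: split each line on the first tab into a Text key and Text value.
        // job.setInputFormatClass(KeyValueTextInputFormat.class);

        // Alternative: read key-value records from a binary SequenceFile.
        // job.setInputFormatClass(SequenceFileInputFormat.class);
    }
}

The chosen input format determines the key and value types that the mapper receives, so the mapper’s type parameters must match it.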
What is Big Data?
Big data is defined as a voluminous amount of structured, semi-structured or unstructured data that has huge potential for mining but is so large that it cannot be processed using traditional database systems. Big data is characterized by its high volume, velocity and variety, which require cost-effective and innovative methods for information processing to draw meaningful business insights. More than the volume of the data, it is the nature of the data that determines whether it is considered Big Data or not.