
With the increase in internet penetration and usage, the data captured by Google has
grown exponentially year on year. For instance, in 2007 Google collected on
average 270 PB of data every month; by 2009 that figure had increased to 20,000 PB
every day. To manage such enormous volumes of data, a better platform was needed. Thus
a programming model called MapReduce was implemented, which could process these
20,000 PB per day. Google ran these MapReduce operations on a special file
system called the Google File System (GFS). Unfortunately, GFS is not open
source.

Doug Cutting and Yahoo! reverse engineered the GFS model and built a parallel
file system, the Hadoop Distributed File System (HDFS). Thus came Hadoop,
a framework – an open-source Apache project – that can be used for performing
operations on data in a distributed environment (using HDFS) with a simple
programming model called MapReduce.


In other words, Hadoop
can be thought of as a set of open-source programs and procedures which anyone can
use as the “backbone” of their big data operations. Hadoop
lets you store files bigger than what can be stored on one particular node or
server, so you can store very large files, and very many of them.

It is also a scalable and fault-tolerant system. In
the realm of Big Data, Hadoop falls primarily into the distributed processing
category but also has a powerful storage capability.

The core components of Hadoop are:

1.     Hadoop YARN – A resource management and scheduling
system that allocates resources across a cluster of machines. It manages the
resources of the systems that store the data and run the analysis.

2.     Hadoop MapReduce – MapReduce is named after the two basic operations this module
carries out: reading data and putting it into a format suitable
for analysis (map), and performing summary or mathematical operations on it (reduce). MapReduce
provides a programming model that makes combining the data from various hard
drives a much easier task. There are two parts to the programming model – the
map phase and the reduce phase – and it is at the interface between the two that the
“combining” of data occurs. Hadoop distributes the data across multiple
servers, and each server stores and analyzes its share of the data
locally. When you run a query on a large dataset, every server in this network
executes the query against its own local portion of the data, and the
results from all the servers are then consolidated. This consolidation is
handled by MapReduce (see the word-count sketch after this list).

3.     Hadoop Distributed File System (HDFS)

This is a self-healing, high-bandwidth, clustered file store, optimized for high-throughput access
to data. It can store any type of data, structured or complex, from any number
of sources in its original format. It is a file system
designed for storing very large files with streaming data-access patterns,
running on clusters of commodity hardware. By default, Hadoop
stores 3 copies of each data block on different nodes of the
cluster. Any time a node or machine holding a certain block of data fails,
another copy of that block is created on another node in the cluster, keeping the system
fault tolerant. In simpler terms, Hadoop distributes and replicates the dataset
efficiently across multiple nodes, so that if any node in the
Hadoop cluster fails, the dataset can still be returned correctly (see the
file-system sketch after this list).
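To make the map and reduce phases concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class name and the input/output paths are illustrative, not part of any particular deployment; the map phase emits a (word, 1) pair for every word it sees, and the reduce phase sums the counts that arrive for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: read a line of text and emit (word, 1) for every word in it.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: all counts for the same word arrive together; sum them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Each mapper runs close to the HDFS blocks it reads, and the framework shuffles the intermediate (word, count) pairs so that all values for a given word reach the same reducer, which is exactly the consolidation step described above.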
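The replication behaviour described for HDFS can likewise be sketched with the Hadoop FileSystem Java API. The file path below is purely illustrative, and a real cluster would supply its own configuration (for example the dfs.replication property in hdfs-site.xml, which defaults to 3); this is a sketch under those assumptions, not a definitive recipe.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS and dfs.replication from the cluster configuration.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a small file; HDFS splits larger files into blocks and
    // replicates each block (3 copies by default) across different nodes.
    Path file = new Path("/tmp/hello.txt");   // illustrative path
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello hadoop");
    }

    // Inspect how many replicas HDFS keeps for this file's blocks.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("replication factor: " + status.getReplication());

    // Request a different replication factor for an individual file if needed.
    fs.setReplication(file, (short) 2);
  }
}
```

If a node holding one of these replicas fails, HDFS re-creates the missing copy on another node, as described above.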

 

KEY CHARACTERISTICS

 

–       High Availability

MapReduce, a YARN-based system, has efficient load balancing. It ensures that jobs run and
fail independently, and jobs are restarted automatically on failure.

–       Scalability of Storage/Compute

Using the MapReduce model, applications can
scale from a single node to hundreds of nodes without having to re-architect
the system. Scalability is built into the model because data is chunked and
distributed as independent units of computation.

–       Controlling Cost  

As storage and analytic requirements evolve, adding or retiring nodes
is easy with Hadoop. Since you don't have to commit to more storage or
processing power ahead of time and can scale only when required, costs can be
controlled.

–       Agility and Innovation

It
is easy to apply new and constantly changing analytic techniques to this data
using MapReduce because data is stored in its original format and there is no
predefined schema.