
HADOOP – THE SOLUTION FOR BIG DATA

Introduction:

          Instead of a regular database or cloud storage, a different approach has been developed to manage Big Data: the Hadoop framework. The main technique used in Hadoop is clustering of systems: data is divided into small blocks and distributed across a cluster of machines.

          Hadoop is open-source software written in Java that can run on any platform. There is no dependence on proprietary protocols as in cloud computing. Using this framework, a huge amount of data can be stored and processed on a cluster of ordinary machines.



                
To study this concept, we first need to know what data and Big Data are, and what problems they raise.

          Data: Any real-world symbol (character, numeric, special character), or a group of them, is said to be data; it may be visual, audio, textual, and so on. In daily life, data comes from Facebook updates, YouTube uploads, Twitter tweets, Google, Wikipedia, Blogger, shopping websites, news channels, research reports, weather updates, business reports, etc.
                     
          Then where do we store all this data? If it were printed on paper, it would consume all the trees on earth; spread out, those pages would cover roughly the surface area of the moon. Millions of files are uploaded to websites every day, and channels store most of their data and videos in databases rather than on physical disks.

          Then the question arises: how do we store this data? We have many storage devices for the job, such as hard disks, pen drives, compact disks, and floppies. For a small amount of data these devices are sufficient, but what if the data is huge? Where does all this data go?


There are two traditional approaches to storing such huge amounts of data:

                                   1. File systems

                                   2. Databases

               A file system is a method of storing data on a disk, with little support for sophisticated retrieval, searching, and so on.

        A database is an integrated computer structure that stores a collection of end-user data and metadata (data about data), through which the end-user data is integrated and managed. A database is maintained by a system of programs called a Database Management System (DBMS). In that sense, a database resembles a very well-organised electronic filing cabinet, in which powerful software, the DBMS, helps manage the cabinet.



Need for new software technology:

               See the measurements of data: if we measure and compare, the units go from
bytes – kilobytes – megabytes – gigabytes – terabytes – petabytes – exabytes – zettabytes.

                      8 bits = 1 byte

                      1024 bytes = 1 kilobyte (KB)

                      1024 KB = 1 megabyte (MB)

                      1024 MB = 1 gigabyte (GB)

                      1024 GB = 1 terabyte (TB)

                      1024 TB = 1 petabyte (PB)

                      1024 PB = 1 exabyte (EB)

                      1024 EB = 1 zettabyte (ZB)

And so on...

Today, the data in use is measured in EXABYTES and ZETTABYTES.
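As a rough illustration of these magnitudes, here is a minimal Java sketch (not part of Hadoop) that expresses each binary unit as a power of 1024 and performs a couple of sample conversions; the 64 MB figure anticipates the default HDFS block size discussed later in this article.

    public class DataUnits {
        // Binary unit sizes in bytes, built up by successive factors of 1024.
        static final long KB = 1L << 10;  // 1024 bytes
        static final long MB = 1L << 20;  // 1024 KB
        static final long GB = 1L << 30;  // 1024 MB
        static final long TB = 1L << 40;  // 1024 GB
        static final long PB = 1L << 50;  // 1024 TB
        static final long EB = 1L << 60;  // 1024 PB
        // A zettabyte (1024 EB = 2^70 bytes) would overflow a signed 64-bit long.

        public static void main(String[] args) {
            // How many 64 MB blocks fit in one petabyte? Prints 16777216.
            System.out.println(PB / (64 * MB));
            // How many gigabytes in a terabyte? Prints 1024.
            System.out.println(TB / GB);
        }
    }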

                Every day, enormous volumes of data, measured in exabytes, are uploaded to databases and web storage. Traditional databases and web storage systems cannot meet today's demand. This is the curtain-raiser for the concept of BIG DATA.




Bigdata

          Big data usually includes data sets with sizes beyond the ability of commonly-used software tools 

to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a 

constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a 

single data set. Because of this difficulty, a new class of "big data" tools has arisen to handle sense-making over large quantities of data, as in the Apache Hadoop big data platform.


Achievements of science with this approach:

                The Large Hadron Collider (LHC) experiments involve about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering out, and not recording, more than 99.999% of these streams, there are about 100 collisions of interest per second.

As a result, working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments amounts to a 25-petabyte annual rate before replication (as of 2012). This becomes nearly 200 petabytes after replication.

If all sensor data were recorded at the LHC, the data flow would be extremely hard to work with: it would exceed a 150-million-petabyte annual rate, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5×10²⁰) bytes per day, almost 200 times more than all other sources in the world combined.

Decoding the human genome originally took 10 years; now it can be achieved in one week.


Work by Apache: 

          Hadoop was created by Doug Cutting and Michael J. Cafarella. Doug, who was working at 

Yahoo at the time, named it after his son's toy elephant. It was originally developed to support 

distribution for the Nutch search engine project. 



Architecture: 

             Hadoop consists of Hadoop Common, which provides access to the filesystems supported by Hadoop. The Hadoop Common package contains the necessary JAR files and scripts needed to start Hadoop. The package also provides source code, documentation, and a contribution section that includes projects from the Hadoop community.

                 For effective scheduling of work, every Hadoop-compatible filesystem should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is located. Hadoop applications can use this information to run work on the node where the data is and, failing that, on the same rack/switch, reducing backbone traffic.

                 The Hadoop Distributed File System (HDFS) uses this when replicating data, to try to keep 

different copies of the data on different racks. The goal is to reduce the impact of a rack power outage 

or switch failure so that even if these events occur, the data may still be readable. 

                 A small Hadoop cluster will include a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode, and Datanode. A slave or worker node acts as both a Datanode and a TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes; these are normally used only in non-standard applications. Hadoop requires JRE 1.6 or higher. The standard startup and shutdown scripts require SSH to be set up between the nodes in the cluster.
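A minimal sketch of how client code points at such a cluster is shown below, assuming the classic (Hadoop 1.x-style) property names fs.default.name and mapred.job.tracker; the host name "master" and the port numbers are placeholders, not values from this article.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ClusterClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Hypothetical addresses: the NameNode (HDFS) and the JobTracker (MapReduce).
            conf.set("fs.default.name", "hdfs://master:9000");   // classic HDFS URI property
            conf.set("mapred.job.tracker", "master:9001");       // classic JobTracker address property

            // Obtain a handle to the distributed filesystem described above.
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Connected to: " + fs.getUri());
        }
    }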

                 In a larger cluster, the HDFS is managed through a dedicated NameNode server to host the 

filesystem index, and a secondary NameNode that can generate snapshots of the namenode's memory 

structures, thus preventing filesystem corruption and reducing loss of data. Similarly, a standalone 

JobTracker server can manage job scheduling. In clusters where the Hadoop MapReduce engine is 

deployed against an alternate filesystem, the NameNode, secondary NameNode and Datanode 

architecture of HDFS is replaced by the filesystem-specific equivalent.




HDFS (Hadoop Distributed File System): 

               HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. A Hadoop instance typically has a single namenode, and a cluster of datanodes forms the HDFS cluster; not every node in the cluster needs to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The filesystem uses the TCP/IP layer for communication; clients use RPC to communicate with each other.
            
             HDFS stores large files (an ideal file size is a multiple of 64 MB) across multiple machines. It

achieves reliability by replicating the data across multiple hosts, and hence does not require RAID 

storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same 

rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move 

copies around, and to keep the replication of data high. HDFS is not fully POSIX compliant because 

the requirements for a POSIX filesystem differ from the target goals for a Hadoop application. 

The tradeoff of not having a fully POSIX compliant filesystem is increased performance for data 

throughput. HDFS was designed to handle very large files. 
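As a rough sketch of what this looks like from client code, the HDFS Java API can be used to write a file, request a replication factor, and read the file back. The NameNode address and the path below are hypothetical placeholders, assuming a default-style setup rather than anything specified in this article.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

            Path file = new Path("/user/demo/sample.txt");

            // Write a small file; HDFS splits large files into blocks (64 MB by default here)
            // and replicates each block, three copies by default.
            FSDataOutputStream out = fs.create(file);
            out.writeBytes("hello hdfs\n");
            out.close();

            // Ask for a specific replication factor for this file.
            fs.setReplication(file, (short) 3);

            // Read the file back.
            FSDataInputStream in = fs.open(file);
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            System.out.println(reader.readLine());
            reader.close();
        }
    }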

             HDFS has recently added high-availability capabilities, allowing the main metadata server (the 

Namenode) to be manually failed over to a backup in the event of failure. Automatic failover is being 

developed as well. Additionally, the filesystem includes what is called a Secondary Namenode, which 

misleads some people into thinking that when the Primary Namenode goes offline, the Secondary 

Namenode takes over. In fact, the Secondary Namenode regularly connects with the Primary 

Namenode and builds snapshots of the Primary Namenode's directory information, which is then saved to local or remote directories. These checkpointed images can be used to restart a failed Primary

Namenode without having to replay the entire journal of filesystem actions, and then edit the log to create 

an up-to-date directory structure. Since Namenode is the single point for storage and management of 

metadata, this can be a bottleneck for supporting a huge number of files, especially a large number of 

small files. HDFS Federation is a new addition which aims to tackle this problem to a certain extent by 

allowing multiple namespaces served by separate Namenodes.

           An advantage of using HDFS is data awareness between the JobTracker and TaskTracker. The 

JobTracker schedules map/reduce jobs to TaskTrackers with an awareness of the data location. An

example of this would be if node A contained data (x, y, z) and node B contained data (a, b, c). The 

JobTracker will schedule node B to perform map/reduce tasks on (a, b, c) and node A would be 

scheduled to perform map/reduce tasks on (x, y, z). This reduces the amount of traffic that goes over 

the network and prevents unnecessary data transfer. When Hadoop is used with other filesystems this 

advantage is not always available. This can have a significant impact on the performance of job 

completion times, which has been demonstrated when running data intensive jobs. 

                     Another limitation of HDFS is that it cannot be directly mounted by an existing operating system. Getting data into and out of the HDFS filesystem, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace (FUSE) virtual filesystem has been developed to address this problem, at least for Linux and some other UNIX systems.

File access can be achieved through the native Java API, the Thrift API (which can generate a client in the language of the user's choosing: C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, or OCaml), the command-line interface, or the HDFS-UI web app over HTTP.

JobTracker and TaskTracker: the MapReduce engine

            Above the file systems comes the MapReduce engine, which consists of one JobTracker, 

to which client applications submit MapReduce jobs. The JobTracker pushes work out to available 

TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. 

With a rack-aware filesystem, the JobTracker knows which node contains the data, and which 

other machines are nearby. If the work cannot be hosted on the actual node where the data resides, 

priority is given to nodes in the same rack. This reduces network traffic on the main backbone 

network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker 

on each node spawns off a separate Java Virtual Machine process to prevent the TaskTracker itself 

from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the 

JobTracker every few minutes to check its status. The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser.

          In earlier versions of Hadoop, if the JobTracker failed, all ongoing work was lost. Later Hadoop versions added some checkpointing to this process: the JobTracker records what it is up to in the filesystem, and when it starts up it looks for any such data, so that it can restart work from where it left off.
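To make the JobTracker/TaskTracker flow concrete, here is a minimal word-count job, essentially the canonical example, written against Hadoop's Java MapReduce API (the org.apache.hadoop.mapreduce classes). The input and output paths are supplied on the command line and are placeholders rather than values from this article.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map step: runs close to the node holding the data block, emits (word, 1) pairs.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce step: sums the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");  // older constructor; newer code uses Job.getInstance(conf)
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

When such a job is submitted, the JobTracker pushes the map tasks toward the nodes (or at least the racks) holding the input blocks, and each TaskTracker runs them in a separate JVM, as described above.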



Other applications:



Scheduling:
               
       By default Hadoop uses FIFO scheduling, with five optional scheduling priorities, to schedule jobs.



Fair scheduler:

          The fair scheduler was developed by Facebook. The goal of the fair scheduler is to provide fast 

response times for small jobs and QoS for production jobs. The fair scheduler has three basic concepts. 

                 1. Jobs are grouped into Pools. 

                 2. Each pool is assigned a guaranteed minimum share. 

                 3. Excess capacity is split between jobs. 

        By default, jobs that are uncategorized go into a default pool. Pools have to specify the minimum 

number of map slots, reduce slots, and a limit on the number of running jobs.
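As an illustrative sketch only, and not something taken from this article: with the classic (pre-YARN) fair scheduler, the cluster is typically configured, via its mapred.fairscheduler.poolnameproperty setting, to read each job's pool name from a job property (pool.name is used below as an assumed example), and a client then tags its job configuration accordingly. The pool name "production" is hypothetical; the guaranteed minimum shares themselves are defined in the scheduler's allocation file on the JobTracker.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class PoolAssignmentSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Assumption: the cluster's fair scheduler has been told (via its
            // mapred.fairscheduler.poolnameproperty setting) to read each job's pool
            // name from the "pool.name" job property. Both names are assumptions here.
            // "production" is a hypothetical pool with a guaranteed minimum share.
            conf.set("pool.name", "production");

            Job job = new Job(conf, "nightly-report");
            // ... configure the mapper, reducer, and input/output paths as in the
            // word-count sketch above, then submit with job.waitForCompletion(true).
        }
    }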

           The HDFS filesystem is not restricted to MapReduce jobs. It can be used for other applications, many of which are under development at Apache. The list includes the HBase database, the Apache Mahout machine learning system, and the Apache Hive data warehouse system. Hadoop can in theory be used for any sort of work that is batch-oriented rather than real-time, is very data-intensive, and is able to work on pieces of the data in parallel.



Prominent users of the HADOOP framework (Apache), in their words:



Amazon:

We build Amazon's product search indices using the streaming API and pre-existing C++, Perl, and Python tools.
We process millions of sessions daily for analytics, using both the Java and streaming APIs.
Our clusters vary from 1 to 100 nodes.



Adobe:

We currently have about 30 nodes running HDFS, Hadoop, and HBase in clusters ranging from 5 to 14 nodes, in both production and development. We plan a deployment on an 80-node cluster.
We constantly write data to HBase and run MapReduce jobs to process it, then store it back to HBase or to external systems.
Our production cluster has been running since October 2008.



eBay:

532-node cluster (8 × 532 cores, 5.3 PB).
Heavy usage of Java MapReduce, Pig, Hive, and HBase.
Using it for search optimization and research.




Facebook:

We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
Currently we have 2 major clusters:
An 1100-machine cluster with 8800 cores and about 12 PB of raw storage.
A 300-machine cluster with 2400 cores and about 3 PB of raw storage.
Each (commodity) node has 8 cores and 12 TB of storage.
We are heavy users of both the streaming and the Java APIs. We have built a higher-level data warehousing framework called Hive using these features. We have also developed a FUSE implementation over HDFS.



Twitter:

We use Hadoop to store and process tweets, log files, and many other types of data generated across Twitter. We use Cloudera's CDH2 distribution of Hadoop.
We use both Scala and Java to access Hadoop's MapReduce APIs.
We use Pig heavily for both scheduled and ad-hoc jobs, due to its ability to accomplish a lot with few statements.
We employ committers on Pig, Avro, Hive, and Cassandra, and contribute much of our internal Hadoop work to open source.



Yahoo!:

More than 100,000 CPUs in more than 40,000 computers running Hadoop.
Our biggest cluster: 4500 nodes (2 × 4-CPU boxes with 4 × 1 TB disks and 16 GB of RAM).
Used to support research for ad systems and web search.
Also used to do scaling tests to support development of Hadoop on larger clusters.
More than 60% of Hadoop jobs within Yahoo! are Pig jobs.



