What is Hadoop?
Hadoop is a distributed data platform developed by the Apache Software Foundation as a means to utilize computer clusters for handling extremely large volumes of data. Maintained under an open source license, it is a popular options for companies seeking a big data solution.
History of the Technology
The term "Big Data" is a buzzword often used in today's technology companies. While the concept for large distributed storage systems has desisted for quite some time, Hadoop is one of the first major frameworks to recently emerge and to popularize the concept, and understanding the history of it's development is crucial to understanding how it works and why it is important.
Hadoop emerged as a synthesis of two major paradigms in dealing with massive sets of data — GFS (Google File System) and MapReduce (a way for organizing data and running searches on a cluster) — but later, the capabilities of Hadoop were strongly expanded by many different enterprise contributors. Hadoop is more of an evolving "ecosystem" than a single piece of software.
Hadoop cofounder Doug Cutting has been working for Yahoo, with the goal of creating a web-scale crawler based search for a project codenamed "Nutch." This idea was to figure out the best way to handle robust storage of data across many different machines, with flexible management of loosely defined data. As Nutch matured, Cutting spent more time scrutinizing the scaling problems, which eventually lead to several papers being published. Google File System in this time, was a way to use commercially off-the-shelf computing resources involved with scaled solutions for storage of data.
When Google released a paper on GFS, Cutting and his team found a way to implement the concept in Java — creating the Nutch Distributed File System (NDFS). MapReduce rose in parallel to NDFS, which also was released as a Google separate publication — but Cutting was able to combine the two concepts in an original way, from Nutch into a new project under the umbrella called Lucene (which is maintained by the Apache Software Foundation). This new project, started in 2006, he named "Hadoop" after his son's Toy elephant. After some internal strife, Yahoo! officially put their full support behind the Hadoop concept, and new concepts began to flourish.
Throughout 2007, major players like Twitter, Facebook, and LinkedIn started doing serious work for Hadoop and expanding its open source ecosystem. By 2008, Hadoop branched out from under project Lucene and became a top level project. Many different sub-components of Hadoop came into being, including HBase, Facebook introduced Hive, a SQL implementation, and Amazon created a hosting service called Elastic MapReduce.
Cutting left Yahoo! and joined with Cloudera as a chief architect, a startup supported by a number of experts from Google, Facebook, Yahoo and BerekeleyDB. By this stage however, Hadoop had become very large — and well supported by all the major players in the Big Data and Web services sector. The best way to define Hadoop is — a collaborative distributed data project organized under the Apache Software Foundation, but contributed to by industry leaders in web applications.
How Hadoop Works
The best way to think about Hadoop is to think about stack-able blocks. Most blocks will have some generic shape, but they also will have connector units that allow the blocks to stick together. In Hadoop — by bundling the volume of the block (data) with a small connector (algorithmic search) — the system is able to stack new block indefinitely, while keeping everything connected in the same structure.
However, Hadoop goes a step further, with MapReduce, with the Map() procedure filtering and sorting the blocks, while the Reduce() algorithm handles summary operations. It would be like creating two houses out of toy blocks, but then for each house, making a list which counts all of the blocks by size, color, and height from the ground. This makes it easy to both estimate statistics about all the houses, or if you're looking for one particular Red block, you can first exclude all the houses without Red blocks before doing a deeper search on other parameters.
The MapReduce engine is effectively run by a distributed hierarchy of MapReduce "TaskTrackers" which are controlled by one central "JobTracker." The TaskTracker nodes are aware of other proximate nodes, and are able to adjust their schedules based on network demands. A TaskTracker has a set of available task slots, which it performs in order to meet the needs of the job (store, search, delete, etc). In effect, a Hadoop cluster is setup so for each Node in an array, the disk storage is being used to house the data, while the processor is being used by a TaskTracker to act across local and nearby named nodes.
The concepts behind Hadoop can be daunting, but the implications are far reaching. Hadoop in conjunction with HBase makes it easy to scale storage space and create schema on the fly. This makes for very dynamic content, and it also allows for data platforms to be developed which support redundancy. If a single server goes down, the data is already backed up somewhere else and no outage ever occurs. Read more for yourself from Apache or Cloudera.
Q & A
How do I get started with Hadoop and integrate it into my site?
Installation requires a download of the release of your choosing — latest stable release is recommended. Getting started is a bigger task, for seasoned Java programmers, jumping right in should be easier than starting from scratch. The best process for learning might be to find an open source project template and follow blogs or user guides on how to get up and running.
Managing a Grid of Computer sounds hard, can a dedicated cluster be run for me as a service?
Absolutely! While Hadoop was developed for enterprise level businesses to create value from using cheap hardware, some of those same enterprise level contributors offer small portions of their hardware as a service in conjuction with the software. Some hosting companies can sell Physical Clusters directly, while others offer Virtualized Clusters for getting started. Hadoop focused hosting exist — like Stratogen, Inetservices or Contegix, but large cloud providers like Amazon, Google or Rackspace, also provide Hadoop options.
Here's What Your Hadoop Hosting Needs To Offer
Hadoop is a collection of software tools that allow a network of computers to work on the same problem. So it is a way to do distributed computing. But using Hadoop is not for the newbie. For one thing, you should be experienced with Java. Hadoop can be hosted on a traditional server, but more and more is hosted on a cloud. For Hadoop, we recommend LiquidWeb.