The Best Hadoop Hosting: Who’s The Best For Your Site? [Updated: 2019]
What is Hadoop?
Hadoop is a distributed data platform developed by the Apache Software Foundation as a means to utilize computer clusters for handling extremely large volumes of data. Maintained under an open source license, it is a popular option for companies seeking a big data solution.
History of the Technology
The term “Big Data” is a buzzword often used in today’s technology companies. While the concept of large distributed storage systems has existed for quite some time, Hadoop is one of the first major frameworks to popularize it, and understanding the history of its development is crucial to understanding how it works and why it is important.
Hadoop emerged as a synthesis of two major paradigms for dealing with massive data sets: GFS (the Google File System) and MapReduce (a programming model for processing data in parallel across a cluster). Later, Hadoop’s capabilities were greatly expanded by many different enterprise contributors; Hadoop is more of an evolving “ecosystem” than a single piece of software.
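To make the MapReduce half of that synthesis concrete, here is a toy, single-machine sketch of the model (this is illustrative only and is not the Hadoop API): a map phase emits key-value pairs, a shuffle phase groups values by key, and a reduce phase aggregates each group. The function names below are invented for the example.

```python
from collections import defaultdict

def map_phase(records, mapper):
    # Apply the user-supplied mapper to every record, emitting (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_phase(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    # Apply the user-supplied reducer to each key's list of values.
    return {key: reducer(key, values) for key, values in groups.items()}

# The classic word-count example: map each line to (word, 1), reduce by summing.
def word_mapper(line):
    return [(word, 1) for word in line.split()]

def sum_reducer(word, counts):
    return sum(counts)

lines = ["big data big clusters", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines, word_mapper)), sum_reducer)
# counts is {"big": 3, "data": 2, "clusters": 1}
```

In a real Hadoop cluster, the map and reduce steps run on many machines at once and the shuffle moves data over the network, but the logical flow is the same.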
Hadoop cofounder Doug Cutting was working at Yahoo!, with the goal of creating a web-scale, crawler-based search engine for a project codenamed “Nutch.” The idea was to figure out the best way to handle robust storage of data across many different machines, with flexible management of loosely defined data. As Nutch matured, Cutting spent more time scrutinizing its scaling problems, which eventually led to several papers being published. Around this time, the Google File System emerged as a way to build scalable data storage on commodity, off-the-shelf hardware.
When Google released a paper on GFS, Cutting and his team found a way to implement the concept in Java, creating the Nutch Distributed File System (NDFS). MapReduce rose in parallel to NDFS, described by Google in a separate publication, and Cutting was able to combine the two concepts in an original way, spinning the work out of Nutch into a new subproject under the Lucene umbrella (maintained by the Apache Software Foundation). This new project, started in 2006, he named “Hadoop” after his son’s toy elephant. After some internal strife, Yahoo! officially put its full support behind the Hadoop concept, and new ideas began to flourish.
Throughout 2007, major players like Twitter, Facebook, and LinkedIn started doing serious work on Hadoop and expanding its open source ecosystem. By 2008, Hadoop branched out from under the Lucene project and became a top-level Apache project. Many different sub-components of Hadoop came into being: HBase emerged, Facebook introduced Hive (a SQL-like query layer), and Amazon created a hosting service called Elastic MapReduce.
Cutting left Yahoo! to become chief architect at Cloudera, a startup supported by a number of experts from Google, Facebook, Yahoo!, and BerkeleyDB. By this stage, however, Hadoop had become very large and well supported by all the major players in the Big Data and web services sector. The best way to define Hadoop is as a collaborative distributed data project organized under the Apache Software Foundation and contributed to by industry leaders in web applications.