Hadoop is an open source Apache software framework for the storage and large-scale processing of data sets on clusters of commodity hardware. It was created by Doug Cutting and Mike Cafarella in 2005.
It was originally developed to support distribution of the Nutch search engine project. Doug Cutting, who was working at Yahoo at the time and is now chief architect at Cloudera, named the project after his son's toy elephant.
Hadoop started out as a simple batch processing framework.
MapReduce is designed to process very large volumes of data of any type stored in HDFS by dividing each workload into multiple tasks that run in parallel across the servers of the cluster.
MapReduce lets people perform computations over big data sets that would be difficult to carry out
without this kind of architecture. It is a simple, efficient, and powerful computing framework.
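The map-shuffle-reduce pattern described above can be sketched in miniature. This is a hypothetical in-memory illustration of the classic word-count job, not Hadoop's actual API; real Hadoop jobs implement Mapper and Reducer classes in Java and the framework distributes the phases across the cluster.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate pairs by key, as the framework does
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(grouped):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in grouped}

lines = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Because each map task reads only its own slice of the input and each reduce task handles a disjoint set of keys, the same pattern scales out simply by running more tasks on more servers.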
The idea behind Hadoop is to move computation to data. Hadoop MapReduce provides a shared and integrated foundation where we can bring additional tools and build up this framework.
Scalability is at the core of a Hadoop system.
Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and
the Google File System. Hadoop's novel approach to data means we can keep all the data we have, and we can take that data and analyze it in new and interesting ways.
Schema on read refers to a data-analysis strategy used in newer data-handling tools such as Hadoop and other more involved database technologies. With schema on read, a schema is applied to the data as it is pulled out of storage, rather than enforced as the data is written, so raw data can be ingested first and interpreted later.
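A small sketch can make the idea concrete. In this hypothetical example (the field names and helper function are illustrative, not part of any Hadoop API), raw records are stored as plain comma-separated text, and a schema is supplied only at read time to turn the bytes into typed fields.

```python
# Raw records stored as-is, with no schema enforced at write time.
raw_records = [
    "2015-03-01,web,200",
    "2015-03-02,mobile,404",
]

def read_with_schema(records, schema):
    """Apply a (name, type) schema to each raw line at read time."""
    names = [name for name, _ in schema]
    casts = [cast for _, cast in schema]
    return [
        dict(zip(names, (cast(v) for cast, v in zip(casts, line.split(",")))))
        for line in records
    ]

# The schema is chosen by the reader; a different query could apply
# a different schema to exactly the same stored bytes.
schema = [("date", str), ("channel", str), ("status", int)]
rows = read_with_schema(raw_records, schema)
print(rows[0])  # {'date': '2015-03-01', 'channel': 'web', 'status': 200}
```

The contrast with schema on write is that a traditional database would reject or transform the raw lines at load time, whereas here the interpretation is deferred until a query actually needs it.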