Hadoop MapReduce is a framework for processing large amounts of data (on the order of terabytes). The large data sets are
processed in parallel across the nodes of a cluster, which is typically built from commodity hardware.
It consists of map jobs and reduce jobs. It works by splitting the data set and providing it to the map jobs. The output of the map jobs is
the input to the reduce jobs. The input data set and the final output of the reduce jobs are stored in HDFS. Map and reduce jobs are usually written in Java, but with Hadoop Streaming they can also be written in other languages such as Python and Ruby.
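
As a rough illustration of this data flow, the sketch below imitates the Hadoop Streaming convention, in which the mapper and reducer exchange tab-separated "key<TAB>value" lines and the reducer receives its input sorted by key. The word-count task and the function names mapper and reducer are assumptions made for this example. In a real streaming job each function would be a separate script reading standard input and writing standard output; here the sort step is simulated locally with sorted().

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation of the pipeline: map, sort by key, reduce.
text = ["the quick brown fox", "the lazy dog"]
mapped = list(mapper(text))
result = list(reducer(sorted(mapped)))
print(result)
```

The sort between the two phases matters: groupby only merges adjacent lines with the same key, which mirrors why the framework sorts the map output before handing it to the reducer.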
In classic Hadoop MapReduce (Hadoop 1.x), every cluster has a single JobTracker and one TaskTracker per worker node. The TaskTrackers execute the jobs, while the JobTracker schedules the
jobs onto the TaskTrackers.
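
The division of labour between the two can be pictured with a toy model. The classes below are hypothetical stand-ins written for this example and are not Hadoop's API: the scheduler simply hands pending tasks to each worker in rounds, one task per slot, and ignores considerations such as data locality and failure handling that the real JobTracker takes into account.

```python
from collections import deque

class TaskTracker:
    """Toy worker: runs whatever tasks it is handed (slots must be >= 1)."""
    def __init__(self, name, slots):
        self.name = name
        self.slots = slots
        self.completed = []

    def run(self, task):
        self.completed.append(task)  # stand-in for real execution

class JobTracker:
    """Toy scheduler: hands pending tasks to trackers in rounds, one per slot."""
    def __init__(self, trackers):
        self.trackers = trackers

    def schedule(self, tasks):
        pending = deque(tasks)
        while pending:
            for tracker in self.trackers:
                for _ in range(tracker.slots):
                    if not pending:
                        return
                    tracker.run(pending.popleft())

trackers = [TaskTracker("node-1", slots=2), TaskTracker("node-2", slots=1)]
JobTracker(trackers).schedule([f"map-{i}" for i in range(5)])
assignments = {t.name: t.completed for t in trackers}
print(assignments)
```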
The implementations of the map and reduce jobs are provided by the programmer. These are usually classes.
The map function is applied to every data element. A data element is a key-value pair. After the job executes, a new list of data elements is produced and the initial list is left unmodified.
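
A minimal sketch of this behaviour in plain Python follows. The temperature data and the name map_fn are assumptions made for illustration. The function returns a list because a map function may emit zero, one, or several output pairs per input pair.

```python
def map_fn(key, value):
    """Hypothetical map function: (city, celsius) -> [(city, fahrenheit)]."""
    return [(key, value * 9 / 5 + 32)]

records = [("berlin", 20.0), ("cairo", 35.0)]
original = list(records)  # copy kept only to show the input is untouched

# Apply the map function to every key-value pair and collect the new list.
mapped = [pair for key, value in records for pair in map_fn(key, value)]

print(mapped)
print(records == original)
```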
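The output from the map job is the input to the reduce job. The reduce job consists of a function that is applied to the input list grouped by key: it is called once for each key, with all the values that the map job produced for that key.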
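
The grouping step can be sketched as follows. The sample pairs and the name reduce_fn are assumptions for this example, and the grouping that Hadoop performs between the two phases is imitated here with a dictionary.

```python
from collections import defaultdict

def reduce_fn(key, values):
    """Hypothetical reduce function: sum all values seen for one key."""
    return (key, sum(values))

map_output = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1), ("a", 1)]

# Group the map output by key, as the framework does before reducing.
grouped = defaultdict(list)
for key, value in map_output:
    grouped[key].append(value)

# Call the reduce function once per key, in sorted key order.
reduced = [reduce_fn(key, grouped[key]) for key in sorted(grouped)]
print(reduced)
```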
The following steps are therefore performed by the MapReduce jobs:
- The map job operates on the input data set and filters it to extract useful values.
- The reduce job operates on the output of the map job and produces the final result.
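
The two steps above can be sketched end to end in plain Python. The "year,temperature" record format, the 9999 sentinel for a missing reading, and the function names are all assumptions made for this example: the map step discards unusable records and extracts (year, temperature) pairs, and the reduce step keeps the maximum per year.

```python
from collections import defaultdict

MISSING = 9999  # hypothetical sentinel marking a missing reading

def map_fn(record):
    """Filter out missing readings and extract (year, temperature)."""
    year, temp = record.split(",")
    temp = int(temp)
    if temp != MISSING:
        yield (year, temp)

def reduce_fn(year, temps):
    """Produce the final result for one year: its maximum temperature."""
    return (year, max(temps))

dataset = ["1949,111", "1949,9999", "1950,22", "1949,78", "1950,-11"]

# Map phase, followed by grouping of the map output by key.
grouped = defaultdict(list)
for record in dataset:
    for year, temp in map_fn(record):
        grouped[year].append(temp)

# Reduce phase.
final_result = dict(reduce_fn(year, temps) for year, temps in grouped.items())
print(final_result)
```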
MapReduce operates on the principle of divide and conquer, where a large problem is divided into small, manageable problems. After these small problems are solved in parallel, their solutions are combined to give the final solution to the original problem.
Because the small problems are solved in parallel, the overall problem is solved faster than it would be sequentially.
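
The same principle can be shown on a single machine. This is only a sketch under stated assumptions: the helper names split, solve and combine are invented for the example, the problem (a sum of squares) is arbitrary, and a thread pool stands in for the cluster. Hadoop distributes the work across machines, whereas threads here merely keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_chunks):
    """Divide: cut the data into roughly equal chunks."""
    size = -(-len(data) // n_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def solve(chunk):
    """Conquer: solve one small problem (sum of squares of a chunk)."""
    return sum(x * x for x in chunk)

def combine(partials):
    """Combine: merge the partial solutions into the final answer."""
    return sum(partials)

data = list(range(1, 101))
chunks = split(data, 4)

# Solve the small problems concurrently, then combine their results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(solve, chunks))

total = combine(partials)
print(total)
```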