MapReduce offers an effective and cost-efficient way of building data-processing applications.
This model relies on concepts such as parallel processing and data locality to deliver substantial benefits to programmers and organizations.
But there are so many programming models and frameworks on the market that choosing one becomes difficult.
And when it comes to Big Data, you can't choose just anything. You need technologies that can handle large volumes of data.
MapReduce is a great fit for that.
In this article, I’ll discuss what MapReduce really is and how it can be beneficial.
What Is MapReduce?
MapReduce is a programming model and software framework within Apache Hadoop. It is used for creating applications capable of processing massive amounts of data in parallel across thousands of nodes (grouped into clusters or grids) with fault tolerance and reliability.
This data processing happens on the database or filesystem where the data is stored. MapReduce can work with the Hadoop Distributed File System (HDFS) to access and manage large data volumes.
Google introduced the model in 2004, and Apache Hadoop popularized it. Within Hadoop, MapReduce is the processing layer or engine that runs MapReduce programs written in different languages, including Java, C++, Python, and Ruby.
Because MapReduce programs run in parallel, they are well suited to large-scale data analysis.
MapReduce splits a task into multiple smaller tasks using the "map" and "reduce" functions. The map step transforms each chunk of input into intermediate results, and the reduce step combines those results, which lowers the processing power required and the overhead on the cluster network.
Example: Suppose you are preparing a meal for a house full of guests. So, if you try to prepare all the dishes and do all the processes yourself, it will become hectic and time-consuming.
But suppose you involve some of your friends or colleagues (not guests) to help prepare the meal, distributing the different dishes among people who can work simultaneously. In that case, you will prepare the meal far faster and more easily while your guests are still in the house.
MapReduce works in a similar fashion with distributed tasks and parallel processing to enable a faster and easier way to complete a given task.
Apache Hadoop allows programmers to utilize MapReduce to execute models on large distributed data sets and use advanced machine learning and statistical techniques to find patterns, make predictions, spot correlations, and more.
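To make the map-then-reduce idea concrete, here is a minimal, self-contained sketch of the model in plain Python. It simulates the three stages (map, shuffle, reduce) on a word-count job; the function names are my own for illustration, not part of any MapReduce API.

```python
from collections import defaultdict

def map_words(document):
    """Map stage: emit an intermediate (word, 1) pair for each word."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle stage: group intermediate values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce stage: combine all values for one key into a final result."""
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog"]
intermediate = [pair for doc in documents for pair in map_words(doc)]
result = dict(reduce_counts(k, v) for k, v in shuffle(intermediate).items())
print(result["the"])  # both documents contain "the", so this prints 2
```

In a real cluster the map calls would run on different nodes and the shuffle would move data over the network, but the data flow is exactly this.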
Features of MapReduce
Some of the main features of MapReduce are:
- User interface: You get an intuitive user interface that provides reasonable detail on every aspect of the framework. It helps you configure, apply, and tune your tasks seamlessly.
- Payload: Applications utilize the Mapper and Reducer interfaces to enable the map and reduce functions. The Mapper maps input key-value pairs to intermediate key-value pairs, and the Reducer reduces the intermediate values sharing a key to a smaller set of values. The Reducer performs three functions: shuffle, sort, and reduce.
- Partitioner: It controls the partitioning of the intermediate map-output keys, deciding which Reducer receives each key.
- Reporter: It’s a function to report progress, update Counters, and set status messages.
- Counters: It represents global counters that a MapReduce application defines.
- OutputCollector: This facility collects data output by the Mapper (intermediate outputs) or the Reducer (job outputs).
- RecordWriter: It writes the data output or key-value pairs to the output file.
- DistributedCache: It efficiently distributes large, read-only, application-specific files.
- Data compression: The application writer can compress both job outputs and intermediate map outputs.
- Bad record skipping: You can skip certain bad records while processing your map inputs. This feature is controlled through the SkipBadRecords class.
- Debugging: You will get the option to run user-defined scripts and enable debugging. If a task in MapReduce fails, you can run your debug script and find the issues.
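The Partitioner listed above decides which reduce task receives each intermediate key. Hadoop's default behaves like a hash of the key modulo the number of reducers; the following is a rough Python sketch of that idea (the function and variable names are illustrative, not Hadoop's):

```python
def partition(key, num_reducers):
    """Assign an intermediate key to a reduce task, hash-partitioner style.
    All pairs that share a key must land on the same reducer."""
    return hash(key) % num_reducers

# Route some intermediate pairs into per-reducer buckets.
pairs = [("apple", 1), ("banana", 1), ("apple", 1)]
buckets = {}
for key, value in pairs:
    buckets.setdefault(partition(key, 4), []).append((key, value))
```

The important property is determinism within a job: every occurrence of "apple" hashes to the same bucket, so one reducer sees all of that key's values.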
Let’s understand the architecture of MapReduce by going deeper into its components:
- Job: A job in MapReduce is the actual task the MapReduce client wants to perform. It comprises several smaller tasks that combine to form the final task.
- Job History Server: It’s a daemon process that stores all the historical data about an application or task, such as the logs generated before or after a job executes.
- Client: A client (program or API) brings a job to MapReduce for execution or processing. In MapReduce, one or multiple clients can continuously send jobs to the MapReduce Manager for processing.
- MapReduce Master: A MapReduce Master divides a job into several smaller parts, ensuring tasks are progressing simultaneously.
- Job Parts: The sub-jobs or job parts are obtained by dividing the primary job. They are processed individually and combined at the end to produce the final output.
- Input data: It’s the dataset fed to MapReduce for task processing.
- Output data: It’s the final result obtained once the task is processed.
So, what really happens in this architecture is that the client submits a job to the MapReduce Master, which divides it into smaller, equal parts. This lets the job be processed faster, because smaller tasks take less time to process than one large task.
However, make sure the job is not divided into overly small tasks; otherwise, the overhead of managing the splits grows and you waste significant time on it.
Next, the job parts are handed to the Map and Reduce tasks. Each Map and Reduce task runs a program suited to the use case the team is working on; the programmer develops this logic to fulfill the requirements.
After this, the input data is fed to the Map task so that it can quickly generate the output as key-value pairs. This intermediate data is stored on local disk rather than on HDFS to avoid unnecessary replication: once the job is complete, the intermediate output can be thrown away, so replicating it on HDFS would be overkill. The output of each map task is then fed to the reduce task, that is, transferred to the machine running that reduce task.
Next, the transferred outputs are merged and passed to the user-defined reduce function. Finally, the reduced output is stored on HDFS.
Moreover, the process can involve several Map and Reduce tasks, depending on the end goal. The Map and Reduce algorithms are optimized to keep time and space complexity to a minimum.
Since MapReduce primarily involves Map and Reduce tasks, it’s pertinent to understand more about them. So, let’s discuss the phases of MapReduce to get a clear idea of these topics.
Phases of MapReduce
Mapping Phase

In this phase, the input data is mapped into output key-value pairs. Here, the key can be the ID of an address, while the value is the actual address.
There are two tasks in this phase: splits and mapping. Splits are the sub-parts or job parts divided from the main job; they are also called input splits. So, an input split is a chunk of input consumed by a single map.
Next, the mapping task takes place. It’s considered the first phase while executing a map-reduce program. Here, data contained in every split will be passed to a map function to process and generate the output.
The Map() function operates in memory on the input key-value pairs, generating intermediate key-value pairs. These new pairs serve as the input fed to the Reduce() or Reducer function.
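The split-then-map flow can be sketched as follows. This is a simplified simulation with made-up record data and helper names; real input splits are byte ranges of HDFS blocks, not Python list slices.

```python
def make_splits(data, num_splits):
    """Divide the input into roughly equal chunks (input splits),
    one per map task."""
    size = max(1, len(data) // num_splits)
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_task(split):
    """Map function: turn each record in a split into an
    intermediate (key, value) pair. Here: (record, its length)."""
    return [(record, len(record)) for record in split]

records = ["10 Downing St", "1600 Pennsylvania Ave", "221B Baker St"]
splits = make_splits(records, 3)            # three input splits
intermediate = [pair for split in splits     # each split feeds one map task
                for pair in map_task(split)]
```

Each element of `splits` would be consumed by a separate map task running in parallel; the `intermediate` pairs are what gets handed on to the Reducer.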
Reducer Phase

The intermediate key-value pairs obtained in the mapping phase serve as the input for the Reduce function or Reducer. As in the mapping phase, two tasks are involved: shuffle and reduce.
So, the key-value pairs obtained are sorted and shuffled before being fed to the Reducer. Next, the Reducer groups or aggregates the data according to its key, based on the reducer algorithm the developer has written.
Here, the values from the shuffling phase are combined to return an output value. This phase sums up the entire dataset.
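The sort-shuffle-reduce sequence can be sketched with Python's `sorted` and `groupby`. The aggregation here is summation, standing in for whatever reduce logic the developer writes:

```python
from itertools import groupby
from operator import itemgetter

# Intermediate pairs as they might arrive, unordered, from several map tasks.
intermediate = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]

# Sort and shuffle: order pairs by key so equal keys form adjacent groups.
shuffled = sorted(intermediate, key=itemgetter(0))

# Reduce: aggregate each key's values (summation as a stand-in).
reduced = {key: sum(value for _, value in group)
           for key, group in groupby(shuffled, key=itemgetter(0))}
print(reduced)  # {'a': 5, 'b': 5}
```

Sorting first is what makes `groupby` see all of a key's values together, which mirrors why the framework sorts the map outputs before handing them to the Reducer.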
Now, the complete process of executing Map and Reduce tasks is controlled by some entities. These are:
- Job Tracker: In simple words, the Job Tracker acts as a master responsible for executing a submitted job completely. It manages all the jobs and resources across a cluster. In addition, the Job Tracker schedules each map task on a Task Tracker that runs on a specific data node.
- Multiple Task Trackers: In simple words, multiple Task Trackers act as workers that carry out tasks following the Job Tracker’s instructions. A Task Tracker is deployed separately on every node in the cluster to execute the Map and Reduce tasks.
This works because a job is divided into several tasks that run on different data nodes in a cluster. The Job Tracker coordinates the work by scheduling the tasks and running them on multiple data nodes, while the Task Tracker on each data node executes its parts of the job and looks after each task.
Furthermore, the Task Trackers send progress reports to the Job Tracker. Each Task Tracker also periodically sends a “heartbeat” signal to the Job Tracker to notify it of the system status. In case of any failure, the Job Tracker can reschedule the job on another Task Tracker.
Output Phase

When you reach this phase, you will have the final key-value pairs generated by the Reducer. You can use an output formatter to translate the key-value pairs and write them to a file with the help of a record writer.
Why Use MapReduce?
Here are some of the benefits of MapReduce, explaining the reasons why you must use it in your big data applications:
Parallel Processing

In MapReduce, you can divide a job across different nodes, where every node handles a part of the job simultaneously. Dividing bigger tasks into smaller ones decreases complexity. And since the tasks run in parallel on different machines rather than on a single machine, the data takes significantly less time to process.
Data Locality

In MapReduce, you can move the processing unit to the data, not the other way around.
Traditionally, the data was brought to the processing unit for processing. However, with the rapid growth of data, this approach started posing many challenges: higher cost, greater time consumption, an overburdened master node, frequent failures, and reduced network performance.
But MapReduce helps overcome these issues by following a reverse approach – bringing a processing unit to data. This way, the data gets distributed among different nodes where every node can process a part of the stored data.
As a result, it offers cost-effectiveness and reduces processing time since each node works in parallel with its corresponding data part. In addition, since every node processes a part of this data, no node will be overburdened.
Security

The MapReduce model offers higher security. It helps protect your application from unauthorized access to data while enhancing cluster security.
Scalability and Flexibility
MapReduce is a highly scalable framework. It allows you to run applications across many machines and process data amounting to thousands of terabytes. It also offers the flexibility of processing data that is structured, semi-structured, or unstructured, and of any format or size.
You can write MapReduce programs in many programming languages, such as Java, R, Perl, and Python. This makes it easy for anyone to learn and write programs while ensuring their data processing requirements are met.
Use Cases of MapReduce
- Full-text indexing: MapReduce is used to perform full-text indexing. Its Mapper can map each word or phrase in a single document, and the Reducer writes all the mapped elements to an index.
- Calculating PageRank: Google has used MapReduce to calculate PageRank.
- Log analysis: MapReduce can analyze log files. It can break a large log file into various parts or splits while the mapper searches for accessed web pages.
A key-value pair is fed to the reducer whenever a web page is spotted in the log: the web page is the key, and the count “1” is the value. After the key-value pairs are emitted to the Reducer, they are aggregated per web page. The final output is the total number of hits for each web page.
- Reverse Web-Link Graph: The framework also finds usage in building a reverse web-link graph. Here, Map() takes a source web page as input and yields (target URL, source) pairs.
Next, Reduce() aggregates the list of source URLs associated with each target URL. Finally, it outputs each target together with its sources.
- Word counting: MapReduce is used to count how many times a word appears in a given document.
- Global warming: Organizations, governments, and companies can use MapReduce to analyze issues related to global warming.
For example, you may want to know how much the ocean’s temperature has increased due to global warming. For this, you can gather thousands of data points from across the globe, such as high temperature, low temperature, latitude, longitude, date, and time. Calculating the answer will take several map and reduce tasks in MapReduce.
- Drug trials: Traditionally, data scientists and mathematicians worked together to formulate a new drug that could fight an illness. With the dissemination of algorithms and MapReduce, IT departments in organizations can tackle issues that were previously handled only by supercomputers and Ph.D. scientists. For example, you can inspect the effectiveness of a drug for a group of patients.
- Other applications: MapReduce can process large-scale data that won’t otherwise fit in a relational database. It also lets data science tools run over distributed datasets, which was previously possible only on a single computer.
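The log-analysis use case above maps each accessed page to a (page, 1) pair and lets the reducer total the hits. A minimal sketch over made-up log lines (the log format and all names here are assumptions for illustration):

```python
from collections import defaultdict

def map_log_line(line):
    """Map: extract the requested page from a log line and emit (page, 1)."""
    parts = line.split()
    page = parts[1]  # assumed line format: "<ip> <page> <status>"
    yield (page, 1)

def reduce_hits(page, counts):
    """Reduce: total the hits recorded for one page."""
    return (page, sum(counts))

log = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /about.html 200",
    "10.0.0.3 /index.html 200",
]

# Shuffle: group the emitted (page, 1) pairs by page.
groups = defaultdict(list)
for line in log:
    for page, one in map_log_line(line):
        groups[page].append(one)

hits = dict(reduce_hits(p, c) for p, c in groups.items())
print(hits)  # {'/index.html': 2, '/about.html': 1}
```

In a real deployment, the large log file would be split across many mappers, while the grouping step would be handled by the framework’s shuffle.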
As a result of MapReduce’s robustness and simplicity, it finds applications in the military, business, science, etc.
MapReduce can prove to be a breakthrough technology. It’s not only faster and simpler but also more cost-efficient. Given its advantages and increasing usage, it’s likely to see higher adoption across industries and organizations.
You may also explore some best resources to learn Big Data and Hadoop.