MR (Map Reduce)
Introduction to Map Reduce
- It's a massively parallel processing framework.
- In general, parallel processing is hard to do without distributed data.
- Map Reduce uses Java by default.
- The aim of Map Reduce is to achieve data locality and parallelism.
- Spark is an alternative to Map Reduce (roughly 80% overlap in functionality).
- Hive, Sqoop, Pig, and Oozie are abstractions over Map Reduce.
- The daemons in Map Reduce are the Job Tracker (JT) and the Task Tracker (TT).
- Map Reduce also has two functions, the Mapper and the Reducer, which are responsible for the processing.
- A disadvantage of MR is resource pressure (the JT alone handles scheduling, monitoring, and resource allocation).
Advantages of Map Reduce
- Cluster Monitoring
- Cluster Management
- Resource Allocating
- Scheduling
- Execution
- Speculative Execution
Working Principle of Map Reduce
- The task for MR is given to the Job Tracker in the form of a JAR (Java) file (see the driver sketch after this list).
- The Job Tracker sends a request to the Name Node (for the block locations) and gets a response in return.
- Then the Job Tracker sends the task information to the Task Trackers.
- The Task Tracker requests the data from the Data Node and performs the Mapper task (MAP JVM).
- The Mapper tasks run in parallel on the nodes nearest to the data; when they are done, the Reducer executes its task on an existing node or a new node.
- The Reducer pulls the Mapper output via the HTTP protocol.
- After the Reducer job is finished, the result is stored in the local file system.
- Then the result is transferred to HDFS.
- The Task Tracker sends a heartbeat to the JT every 3 seconds (similar to HDFS), along with the job status.
- If a slave node fails, the JT gets the replica locations from the NN and the task is restarted on a replicated node; otherwise the job fails.
- The Mapper and Reducer JVMs send their job status to the slave node (TT).
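To make the flow above concrete, here is a minimal driver sketch using the org.apache.hadoop.mapreduce API; the names WordCountDriver, WordCountMapper, and WordCountReducer are assumed examples, not part of these notes (the latter two are sketched in the Mapper and Reducer sections below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver
{
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);    // locates the JAR handed to the JT
        job.setMapperClass(WordCountMapper.class);   // assumed example class
        job.setReducerClass(WordCountReducer.class); // assumed example class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and wait
    }
}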
About Mapper
- Input for the mapper is given in the form of blocks.
- The storage can be HDFS, NoSQL, RDBMS, or any other storage layer.
- No. of Mappers = No. of Blocks (the default, but not always).
- By default the TT will create 2 MAP JVMs (i.e., run 2 map tasks at a time).
- If the node has been assigned more than 2 tasks, the remaining tasks wait in a queue.
- The output is stored in the local file system and is called intermediate data (see the Mapper sketch below).
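A minimal Mapper sketch (an assumed word-count example; the class name WordCountMapper is illustrative). With the default Text input format, the key is the byte offset of the record and the value is the whole line:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException
    {
        // Emit an intermediate (word, 1) pair for every token in the record.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}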
About Reducer
- Input for the reducer is the output of the mapper.
- The storage can be HDFS, NoSQL, RDBMS, or any other storage layer.
- The number of Reducers is decided by the developer.
- The Reducers' output can't be reduced further.
- The output is stored in the local file system and then moved to HDFS (see the Reducer sketch below).
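A matching Reducer sketch (again an assumed word-count example; WordCountReducer is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException
    {
        // The framework has already grouped the Mapper output by key;
        // here we just sum the counts for each word.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum)); // final output, lands in HDFS
    }
}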
Map Reduce Input/Output Format
- Text Input & Text Output format (the default; see the snippet after this list):
  - Key = byte offset of the record
  - Value = the whole line of the record
- Key Value Input & Key Value Output format:
  - Key = the text before the first tab delimiter
  - Value = the remaining line
- Mapper side: SELECT-like statements (filtering and projection).
- Reducer side: GROUP BY-like statements (aggregation).
- Shuffling on the reducer side is of two types:
  - Sort by key
  - Group by key
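A sketch of how the driver chooses between the two formats (assuming the job object from the driver sketch above; both classes are real Hadoop input formats):

import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Text input format (the default): key = byte offset, value = whole line.
job.setInputFormatClass(TextInputFormat.class);

// Key Value input format: key = text before the first tab, value = the rest.
job.setInputFormatClass(KeyValueTextInputFormat.class);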
Map Reduce Programming Part
Syntax:
class ABC
{
    static class mapper {}
    static class reducer {}
    public static void main(String[] args)
    {}
}
- Data types
- Java Data types: int, float, String, long
- MR Data types: IntWritable, FloatWritable, Text, LongWritable
- When working with the Key/Value Input/Output formats, use the MR data types; otherwise use the Java data types (see the comparison below).
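A small illustrative comparison (DataTypeDemo is an assumed example; the Writable classes are the real Hadoop wrappers):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

class DataTypeDemo
{
    public static void main(String[] args)
    {
        int count = 1;                                    // Java data type
        IntWritable mrCount = new IntWritable(count);     // MR data type (serializable Writable)
        String word = "hadoop";                           // Java data type
        Text mrWord = new Text(word);                     // MR data type
        long offset = 0L;                                 // Java data type
        LongWritable mrOffset = new LongWritable(offset); // MR data type
        System.out.println(mrWord + " -> " + mrCount + " @ " + mrOffset);
    }
}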
- The Mapper and Reducer have these functions (see the sketch after this list):
  - map()
  - reduce()
  - setup()
  - cleanup()
  - run()
- The map() function is executed once for each record in a block (and reduce() once for each key).
- The setup() and cleanup() functions are executed only once per task.
- It is preferred to use the main() method instead of the run() method.
- The list of values for each key is handled internally by the framework.
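To show the lifecycle, here is a small Mapper sketch (StopWordMapper and its stop-word list are assumed examples): setup() runs once before the first record, map() runs once per record, and cleanup() runs once after the last record:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StopWordMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
    private static final IntWritable ONE = new IntWritable(1);
    private final Set<String> stopWords = new HashSet<>();
    private long records = 0;

    @Override
    protected void setup(Context context)
    {
        // Executed only once per map task, before the first record.
        stopWords.add("the");
        stopWords.add("a");
        stopWords.add("an");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException
    {
        // Executed once for each record in the block.
        records++;
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                context.write(new Text(token), ONE);
            }
        }
    }

    @Override
    protected void cleanup(Context context)
    {
        // Executed only once per map task, after the last record.
        System.err.println("Processed " + records + " records");
    }
}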