MR (Map Reduce)


Introduction to Map Reduce

  • It is a massively parallel processing framework.
  • In general, parallel processing is hard to do without distributed data.
  • Map Reduce uses Java by default.
  • The aim of Map Reduce is to achieve data locality and parallelism.
  • Spark is an alternative to Map Reduce (roughly 80% overlap in functionality).
  • Hive, Sqoop, Pig, and Oozie are abstractions built on top of Map Reduce.
  • The daemons in Map Reduce are the Job Tracker (JT) and the Task Tracker (TT).
  • Map Reduce also has two functions, the Mapper and the Reducer, which carry out the actual processing.
  • A disadvantage of MR is resource pressure on the Job Tracker, which handles both resource management and job scheduling.

Advantages of Map Reduce

  • Cluster Monitoring
  • Cluster Management
  • Resource Allocation
  • Scheduling
  • Execution
  • Speculative Execution (a slow task is re-launched on another node; see the configuration sketch below)
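
A minimal sketch of turning speculative execution on from driver code, assuming the Hadoop 2.x+ property names mapreduce.map.speculative and mapreduce.reduce.speculative (older MRv1 releases used mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution instead); the class name is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeConfig
{
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        // Re-launch slow map/reduce tasks on other nodes (Hadoop 2.x+ names)
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);
        Job job = Job.getInstance(conf, "speculative demo");
        System.out.println(job.getConfiguration().get("mapreduce.map.speculative"));
    }
}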

Working Principle of Map Reduce

  • The task for MR is submitted as a JAR (Java) file to the Job Tracker (a minimal driver sketch follows this list).
  • The Job Tracker sends a request to the Name Node, which responds with the locations of the data blocks.
  • The Job Tracker then sends the task information to the Task Trackers.
  • Each Task Tracker requests the data from its Data Node and performs the Mapper task (MAP JVM).
  • The Mapper tasks run in parallel on the nodes nearest to the data; when they are done, the Reducer executes its task on an existing node or a new one.
  • The reducer job fetches the mapper output via the HTTP protocol.
  • After the reducer job is finished, the result is stored in the local file system.
  • The result is then transferred to HDFS.
  • The Task Tracker sends a heartbeat to the JT every 3 seconds (similar to HDFS), along with the job status.
  • If a slave fails, the JT informs the NN and the task is restarted on a node holding a replica; otherwise the job fails.
  • The Mapper and Reducer JVMs send their job status to the slave node.
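
As a sketch of the submission step, here is a minimal driver that packages the job and hands it to the cluster via the standard org.apache.hadoop.mapreduce API. The class names WordCountDriver, WordCountMapper, and WordCountReducer are hypothetical; the mapper and reducer classes are sketched in the sections below:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver
{
    public static void main(String[] args) throws Exception
    {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);  // the JAR handed to the cluster
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1); // submit and wait
    }
}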

About Mapper

  • Input for the mapper is given in the form of blocks.
  • The storage can be HDFS, NoSQL, an RDBMS, or any storage layer.
  • Number of Mappers = Number of Blocks (the default, but not always).
  • By default a TT will run 2 map JVMs (map slots).
  • If a node has been assigned more than 2 tasks, the remaining tasks wait in a queue.
  • The output is stored in the local file system and is called intermediate data (see the mapper sketch below).
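
A minimal sketch of a mapper, assuming the default Text input format (key = byte offset, value = whole line) and a hypothetical word-count use case:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException
    {
        // Called once per record; emits intermediate (word, 1) pairs,
        // which land in the local file system as intermediate data.
        for (String token : value.toString().split("\\s+"))
        {
            if (!token.isEmpty())
            {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}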

About Reducer

  • Input for the reducer is the output of the mapper.
  • The storage can be HDFS, NoSQL, an RDBMS, or any storage layer.
  • The number of Reducers is decided by the developer.
  • The reducers' output can't be reduced further.
  • The output is stored in the local file system and then moved to HDFS (see the reducer sketch below).
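
A matching reducer sketch: the framework groups the mapper's intermediate (word, 1) pairs by key, and the reducer sums each group. The class name is hypothetical and pairs with the mapper sketch above:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException
    {
        // Called once per key; the list of values is supplied by the framework
        int sum = 0;
        for (IntWritable v : values)
        {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result); // final output, eventually stored in HDFS
    }
}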

Map Reduce Input/Output Format

  • Text Input & Text Output format
    • Key = byte offset of the record
    • Value = the whole line of the record
  • Key Value Input & Key Value Output format (a sketch of selecting the input format follows this list)
    • Key = text before the 1st tab delimiter
    • Value = the remaining part of the line
  • Mapper side: SELECT-style statements (filtering and projection)
  • Reducer side: GROUP BY-style statements (aggregation)
  • Shuffling is of two types on the reducer side:
    • Sort by key
    • Group by key
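
A minimal sketch of choosing between the two input formats in a driver, using the TextInputFormat and KeyValueTextInputFormat classes from org.apache.hadoop.mapreduce.lib.input; the class and job names are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatChoice
{
    public static void main(String[] args) throws Exception
    {
        Job textJob = Job.getInstance(new Configuration(), "text format demo");
        // Text input: key = byte offset (LongWritable), value = whole line (Text)
        textJob.setInputFormatClass(TextInputFormat.class);

        Job kvJob = Job.getInstance(new Configuration(), "key value format demo");
        // Key/Value input: key = text before the 1st tab, value = rest of the line
        kvJob.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}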

Map Reduce Programming Part

Syntax:

class ABC
{
    class mapper {}
    class reducer {}
    public static void main(String[] args)
    {}
}

  • Data types
    • Java data types: int, float, String, long
    • MR data types: IntWritable, FloatWritable, Text, LongWritable
  • When working with the Key/Value input/output formats, use the MR data types; otherwise use the Java data types (see the data-type sketch after this list).
  • The mapper and reducer will have these functions:
    • map()
    • reduce()
    • setup()
    • cleanup()
    • run()
  • The map() or reduce() function is executed once for each record in a block.
  • The setup() and cleanup() functions are executed only once per task (see the lifecycle sketch after this list).
  • It is preferred to use the main method instead of the run method.
  • The list of values for each key is handled internally by the framework.
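
A small sketch of wrapping Java values in the MR data types and unwrapping them again; the class and variable names are illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableTypes
{
    public static void main(String[] args)
    {
        // Java value -> MR Writable wrapper (serializable across the cluster)
        IntWritable count = new IntWritable(42);
        LongWritable offset = new LongWritable(0L);
        Text line = new Text("hello hadoop");

        // MR Writable -> plain Java value
        int c = count.get();
        long o = offset.get();
        String s = line.toString();
        System.out.println(s + " : " + c + " @ " + o);
    }
}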
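
A lifecycle sketch showing where setup(), map(), and cleanup() fit within one task; the class name is hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LifecycleMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
    @Override
    protected void setup(Context context)
    {
        // Runs once per task, before the first map() call
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException
    {
        // Runs once for each record in the block/split
    }

    @Override
    protected void cleanup(Context context)
    {
        // Runs once per task, after the last map() call
    }
}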
