Input Split and Speculative Execution

 


About Input Split

  • A block is the physical unit into which a file is divided for storage.
  • An input split sits between the blocks and the mapper: it is the chunk of data that one mapper processes.
  • A block is physical, whereas a split is logical.
  • Blocks are cut without regard to the structure of the data, so a record can end up divided across two blocks.
  • The split takes care of this: the task running under the Task Tracker can identify the remaining (missing) part of a record held on another node, and reading that data from the other node is called a lookup.
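  • The maximum split size (a value in bytes) can be set with the following property, which newer Hadoop releases name mapreduce.input.fileinputformat.split.maxsize:
    • conf.set("mapred.max.split.size","2");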
  • The number of splits may or may not equal the number of blocks, but the number of mappers always equals the number of splits.
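The relationship between block size, maximum split size, and the number of mappers can be sketched in plain Java. This is a simplified illustration, not Hadoop's actual code: the class and method names are hypothetical, the clamping rule mirrors the commonly documented max(minSize, min(maxSize, blockSize)) formula, and the small tolerance Hadoop applies to the last split is ignored.

```java
// Simplified, standalone sketch of split sizing (no Hadoop dependency).
// Assumption: names here are illustrative and the last-split tolerance
// used by real Hadoop is left out.
final class SplitSizeSketch {

    // The split size is the block size clamped between the min and max split size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of pieces of the given size needed to cover the file (ceiling division).
    // Simplification: an empty file is treated as zero pieces.
    static long countPieces(long fileLength, long pieceSize) {
        if (fileLength <= 0) {
            return 0;
        }
        return (fileLength + pieceSize - 1) / pieceSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileLength = 300 * mb;
        long blockSize = 128 * mb;

        long blocks = countPieces(fileLength, blockSize);

        // Default-like case: no effective max, so one split per block.
        long defaultSplit = computeSplitSize(blockSize, 1, Long.MAX_VALUE);
        // Smaller max split size: more splits (and mappers) than blocks.
        long smallSplit = computeSplitSize(blockSize, 1, 64 * mb);

        System.out.println("blocks = " + blocks);
        System.out.println("splits (default) = " + countPieces(fileLength, defaultSplit));
        System.out.println("splits (max 64 MB) = " + countPieces(fileLength, smallSplit));
    }
}
```

With a 300 MB file and 128 MB blocks, the sketch gives 3 blocks and 3 splits by default, but 5 splits (and so 5 mappers) once the maximum split size is lowered to 64 MB.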
Example:
Imagine a file stored in two blocks, with the data split as follows:

1,a,b
2,c,d
3,e,f
4,g,h
5,i,j

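Imagine that block B1 ends partway through the 3rd row and block B2 holds the rest of the data. Mapper 1 then performs a lookup for the remainder of the 3rd row in B2, the block that mapper 2 reads, so that the row is processed as one complete record.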
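The lookup in this example can be sketched as a small standalone Java simulation. It is only an illustration under stated assumptions: the class and method names are hypothetical, the boundary at byte 15 (inside the 3rd row) is chosen for the example, and the rule used here (a reader that does not start at byte 0 skips its first line, and every reader finishes the line that is in progress at its end offset) is modeled on the convention usually described for line-oriented record readers.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Standalone simulation of reading line records from byte-range splits.
// Assumption: names and the boundary position are illustrative only.
final class SplitBoundarySketch {

    // Index of the first byte after the next '\n' at or after `from`,
    // or data.length if there is no further newline.
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    // Returns the records owned by the split covering bytes [start, end).
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // The first line here belongs to the previous split's reader, so skip it.
            pos = nextLineStart(data, pos);
        }
        // Read every line that begins at or before `end`, even if it
        // finishes beyond `end` (this is the "lookup" into the next block).
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            int lineEnd = (next > pos && data[next - 1] == '\n') ? next - 1 : next;
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "1,a,b\n2,c,d\n3,e,f\n4,g,h\n5,i,j\n".getBytes(StandardCharsets.UTF_8);
        int boundary = 15; // falls inside the 3rd row
        System.out.println("mapper 1: " + readSplit(data, 0, boundary));
        System.out.println("mapper 2: " + readSplit(data, boundary, data.length));
    }
}
```

With the boundary at byte 15, mapper 1 receives rows 1 to 3 (reading the tail of row 3 from the second block) and mapper 2 receives rows 4 and 5, so each row is processed exactly once.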


About Speculative Execution

  • Consider a Job Tracker with Task Trackers under it, and suppose a particular task hangs.
  • The scheduler then creates a duplicate of the hung task, which runs against a replica of the task's data.
  • Whichever task completes first (the existing task or the new one) wins, and the other task is killed. This is the concept of Speculative Execution.
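The "first attempt to finish wins" idea can be illustrated with a small Java concurrency sketch. This is an analogy using a local thread pool, not Hadoop's scheduler: the class name, the method name, and the patienceMillis threshold (standing in for however the scheduler decides a task is hung) are all assumptions made for the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Local analogy for speculative execution (not Hadoop code).
// Assumption: `patienceMillis` is a hypothetical stand-in for the
// scheduler's decision that the original attempt is hung.
final class SpeculativeExecutionSketch {

    // Runs `original`; if it has not finished within `patienceMillis`,
    // starts `duplicate` and returns the result of whichever finishes first.
    static <T> T runSpeculatively(Callable<T> original, Callable<T> duplicate, long patienceMillis)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CompletionService<T> attempts = new ExecutorCompletionService<>(pool);
        try {
            attempts.submit(original);
            // Wait a limited time for the original attempt.
            Future<T> done = attempts.poll(patienceMillis, TimeUnit.MILLISECONDS);
            if (done != null) {
                return done.get(); // finished in time, no duplicate needed
            }
            // Original looks hung: launch the duplicate attempt.
            attempts.submit(duplicate);
            // Take whichever attempt completes first.
            return attempts.take().get();
        } finally {
            // Interrupt (kill) any attempt that is still running.
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        String winner = runSpeculatively(
                () -> { Thread.sleep(5000); return "original attempt"; },
                () -> { Thread.sleep(50); return "duplicate attempt"; },
                100);
        System.out.println("winner: " + winner);
    }
}
```

In the usage shown in main, the original attempt sleeps far longer than the patience threshold, so the duplicate is launched, finishes first, and the original is interrupted.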
