Input Split and Speculative Execution

About Input Split

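A block is the physical unit into which a file is divided when it is stored.
The input split sits between the blocks and the mapper.
A block is physical, whereas a split is logical.
Blocks are cut without regard to the structure of the data, so a record can end up divided across two blocks.
The split takes care of this: when a record is cut off at the end of a block, the Task Tracker identifies the node that holds the remaining (missing) part of the record and reads it from there. This read from another node is called a lookup.
The maximum split size can be set with the property below. The value is in bytes, and newer Hadoop releases name the same property mapreduce.input.fileinputformat.split.maxsize.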

conf.set("mapred.max.split.size","2");

The number of splits may or may not equal the number of blocks, but the number of mappers always equals the number of splits.
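To see why the number of splits can differ from the number of blocks, here is a minimal sketch of the split-size arithmetic. It follows the rule used by Hadoop's FileInputFormat, where the split size is max(minSize, min(maxSize, blockSize)). The class name, the 128 MB block size and the 256 MB file size are assumptions made up for illustration, and the count ignores the small slack Hadoop allows on the last split.

```java
// Illustrative sketch only: plain Java arithmetic, no Hadoop classes involved.
final class SplitSizeSketch {

    // The split size is the block size, capped by maxSize and floored by minSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Simplified split count: ceiling of fileSize / splitSize.
    static long countSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb; // assumed block size
        long fileSize = 256 * mb;  // assumed file size: exactly two blocks

        // No cap on the split size: one split per block.
        long defaultSplit = computeSplitSize(blockSize, 1, Long.MAX_VALUE);
        // Cap of 64 MB: each block is covered by two splits.
        long cappedSplit = computeSplitSize(blockSize, 1, 64 * mb);

        System.out.println("splits without a cap: " + countSplits(fileSize, defaultSplit));
        System.out.println("splits with 64 MB cap: " + countSplits(fileSize, cappedSplit));
    }
}
```

With these assumed sizes the file has two blocks. Without a cap it produces two splits (two mappers), and with a 64 MB maximum it produces four splits (four mappers).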

Example:

Imagine two blocks, with the data split as follows:

1,a,b

2,c,d

3,e,f

4,g,h

5,i,j

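Imagine the first part of the data (marked in red) is in block B1 and the rest (marked in green) is in block B2, with the 3rd row cut across the two blocks. Mapper 1 then looks up the remainder of the 3rd row in B2, the block that mapper 2 works on.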
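The lookup in this example can be sketched in plain Java. This is a simplified imitation of how a line-oriented record reader treats split boundaries, not Hadoop's actual code: a split that does not start at offset 0 skips its first (possibly partial) line, and every split reads past its own end to finish the last line it started. The class name and the boundary at byte 15 (the middle of the 3rd row) are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: simulates two splits over one in-memory byte array.
final class RecordBoundarySketch {

    // Returns the lines owned by the split [start, end); end is the offset
    // where the next split begins.
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Skip up to and including the first newline: that line belongs
            // to the previous split, which reads past its own end to finish it.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++;
        }
        List<String> records = new ArrayList<>();
        // Keep reading while a line STARTS at or before the end of the split.
        while (pos <= end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++; // may run into the next block: this is the lookup
            }
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.US_ASCII));
            pos = lineEnd + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "1,a,b\n2,c,d\n3,e,f\n4,g,h\n5,i,j\n".getBytes(StandardCharsets.US_ASCII);
        int boundary = 15; // assumed block boundary, inside the 3rd row

        System.out.println("mapper 1 reads: " + readSplit(data, 0, boundary));
        System.out.println("mapper 2 reads: " + readSplit(data, boundary, data.length));
    }
}
```

Under these assumptions mapper 1 receives rows 1 to 3 complete, fetching the tail of row 3 from the second block, and mapper 2 receives rows 4 and 5. No row is lost or read twice.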

About Speculative Execution

Suppose a Job Tracker has several Task Trackers under it and a particular task hangs or runs unusually slowly.
The scheduler then creates a duplicate of that task on another node, using a replica of its input data.
Whichever task completes first (the existing task or the new one) is accepted, and the other task is killed. This is the concept of Speculative Execution.
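The "first one to finish wins, the other is killed" idea can be imitated with the standard Java ExecutorService.invokeAny call, which returns the result of the first task that completes and cancels the rest. This is only an analogy running on local threads, not how Hadoop schedules attempts. The class name, attempt names and delays are hypothetical.

```java
import java.util.Arrays;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only: two local threads stand in for two task attempts.
final class SpeculativeSketch {

    // Hypothetical task attempt: sleeps for delayMillis, then reports its name.
    static Callable<String> attempt(String name, long delayMillis) {
        return () -> {
            Thread.sleep(delayMillis);
            return name;
        };
    }

    // Runs both attempts, returns the one that finishes first and cancels the other.
    static String runSpeculatively(Callable<String> original, Callable<String> duplicate) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // invokeAny returns the first successful result and cancels
            // the attempts that have not completed.
            return pool.invokeAny(Arrays.asList(original, duplicate));
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("no attempt completed", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String winner = runSpeculatively(
                attempt("original attempt (slow node)", 3000),
                attempt("speculative attempt", 100));
        System.out.println("finished first: " + winner);
    }
}
```

Here the slow original attempt stands in for the hung task. The duplicate finishes first, its result is used, and the original is interrupted.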
