Spark Standalone Architecture and Working
About Spark Standalone Architecture
- Spark Standalone follows a master/slave architecture with two daemons:
- Master Node
- Worker Node
- The driver program is the entry point of a Spark application and is responsible for the job.
Let's understand the architecture of Spark with an example:
- The client submits a job in the form of a JAR.
- The JAR contains the driver program.
- In cluster deploy mode the Spark master launches the driver on a worker node; in client mode the driver runs in the process that submitted the job.
- The driver requests resources from the cluster manager (in standalone mode, the Spark master).
- The cluster manager is responsible for allocating resources.
- The cluster manager allocates resources in the form of executors on the worker nodes.
- Data locality (where a particular piece of data lives) is taken into account when tasks are assigned to executors, so that tasks run close to their data where possible.
- The driver program splits the job into tasks and assigns them to the executors, which receive the application JAR.
- Once an executor has been created, it starts sending heartbeats to the driver program, and the cluster manager also informs the driver program that the resources have been allocated.
- Executors run on the worker nodes, while the master daemon schedules resources and keeps track of the workers.
- Each executor is assigned tasks, takes its share of the data as input, and starts processing.
- Idle worker nodes also send heartbeats to the Spark master to report that they are alive.
- Persisted (intermediate) data, which is the result of the executors' tasks, can be stored in three ways:
- Disk
- Memory
- Disk+Memory
- When the executors finish their tasks, the driver program checks whether any worker node is free. The cluster manager then allocates resources and the driver assigns the next job, for example a grouping over the persisted data. The final output can be stored in any of the storage options above.
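The three storage options can be tried out with a short sketch. It assumes Spark is on the classpath and runs in local mode so that it works on a single machine; the object name PersistSketch and the method sumOfSquares are made up for illustration, while StorageLevel.DISK_ONLY, MEMORY_ONLY and MEMORY_AND_DISK are the standard Spark storage levels.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Assumption: local mode so the sketch runs on one machine. On a standalone
  // cluster the master URL would look like spark://<master-host>:7077.
  private lazy val spark = SparkSession.builder()
    .master("local[2]")
    .appName("persist-sketch")
    .getOrCreate()

  // Persist an intermediate RDD at the given level, then run an action on it.
  def sumOfSquares(level: StorageLevel): Long = {
    val squares = spark.sparkContext
      .parallelize(1 to 100)
      .map(n => n.toLong * n)
      .persist(level)                 // only marks the RDD; nothing is stored yet
    val total = squares.reduce(_ + _) // the action computes and stores the partitions
    squares.unpersist()
    total
  }

  def main(args: Array[String]): Unit = {
    println(sumOfSquares(StorageLevel.DISK_ONLY))       // disk
    println(sumOfSquares(StorageLevel.MEMORY_ONLY))     // memory
    println(sumOfSquares(StorageLevel.MEMORY_AND_DISK)) // disk + memory
    spark.stop()
  }
}
```

The result is the same for every level; only where the intermediate partitions are kept changes.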
Key Points
- If an executor fails (a temporary crash), a replacement executor is launched and the driver program reschedules the same tasks on it.
- In Spark, replication can be achieved through RDD storage levels (for example MEMORY_ONLY_2).
- Spark supports in-memory processing.
- The driver program holds the application metadata.
- If the driver program fails, the Spark master can restart it (in cluster deploy mode, when the job was submitted with the --supervise flag) and the tasks are recomputed.
- To survive a permanent failure of the master, ZooKeeper enables HA (High Availability), i.e. more than one master: one active and the others passive (standby).
- If the active master fails, ZooKeeper elects one of the passive masters as the new active master.
- Workers and applications register with the active master, and this state is stored in ZooKeeper. When a passive master becomes the active master, it recovers that state and contacts the registered workers and applications.
- If a ZooKeeper node fails, this is handled automatically (if the leader dies, a follower is elected as the new leader).
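On the application side, a driver can be told about several masters at once so that it can find whichever one is currently active. The sketch below only builds the configuration; the host names master1 and master2 and the object name HaMasterSketch are hypothetical, and 7077 is the default standalone master port.

```scala
import org.apache.spark.SparkConf

object HaMasterSketch {
  // Hypothetical host names; 7077 is the default standalone master port.
  val masters: Seq[String] = Seq("master1:7077", "master2:7077")

  // List every master in one URL so the application can locate the active one.
  def conf(): SparkConf =
    new SparkConf()
      .setAppName("ha-sketch")
      .setMaster(masters.mkString("spark://", ",", ""))

  def main(args: Array[String]): Unit =
    println(conf().get("spark.master"))
}
```

The masters themselves still have to be started with ZooKeeper recovery enabled; that part is cluster configuration and is not shown here.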
RDD
- In Spark, an RDD is a collection of data elements distributed across the machines in the cluster.
- RDD stands for:
- Resilient (fault tolerance)
- Distributed (data distribution)
- Dataset (csv, json)
- RDD lineage is the graph of all the parent RDDs of an RDD.
- Ex: P1 --(Transformation)--> P2 --(Transformation)--> P3 --> ... --> Output
- The chain of transformations from one phase to the next is called the data lineage, and it is maintained as a DAG (which holds the metadata).
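The P1 -> P2 -> P3 chain can be inspected with the RDD method toDebugString, which returns a text description of an RDD and its parents. This is a minimal sketch assuming local mode; the object name LineageSketch and the sample strings are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object LineageSketch {
  def lineage(): String = {
    val spark = SparkSession.builder()
      .master("local[2]")          // assumption: local mode for the sketch
      .appName("lineage-sketch")
      .getOrCreate()

    val p1 = spark.sparkContext.parallelize(Seq("a b", "b c")) // P1
    val p2 = p1.flatMap(line => line.split(" "))               // P2
    val p3 = p2.map(word => (word, 1))                         // P3

    // Describes P3 and every parent RDD it was derived from; no job is run.
    p3.toDebugString
  }

  def main(args: Array[String]): Unit = println(lineage())
}
```

The printed text lists the newest RDD first and the original parallelized collection last.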
Fault Tolerance of Lineage(Transformation):
- If a particular transformation fails, Spark looks up its lineage and recomputes the transformation using the DAG.
- The persisted result of a transformation can be stored, for example, as:
- MEMORY_ONLY_2
- which stores 2 copies, one of them on another machine.
- The replication level has to be chosen by us, and the choice depends on the available memory.
- Every concept in Spark is based on RDDs (transformations + actions).
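A replicated storage level can be requested as in the following sketch. It assumes local mode, where there is no second machine, so Spark cannot actually place the second copy; the point is only to show the call. The object name ReplicationSketch is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ReplicationSketch {
  def replicatedLevel(): StorageLevel = {
    val spark = SparkSession.builder()
      .master("local[2]")          // assumption: local mode, so no peer holds the 2nd copy
      .appName("replication-sketch")
      .getOrCreate()

    val months = spark.sparkContext.parallelize(Seq("Jan", "Feb", "Mar"))
    months.persist(StorageLevel.MEMORY_ONLY_2) // in memory, replication factor 2
    months.count()                             // action that materializes the RDD
    months.getStorageLevel                     // the level recorded for this RDD
  }

  def main(args: Array[String]): Unit =
    println(replicatedLevel().replication)
}
```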
About RDD
- There are three ways to create an RDD:
- Parallelized Collection
- val rdd1 = sc.parallelize(Array("Jan", "Feb", "Mar"))
- rdd1.collect
- External Dataset
- val r = sc.textFile("file:/home/src.txt")
- r.collect
- Creating an RDD from an existing RDD
- val r = sc.textFile("file:/home/src.txt").flatMap(line => line.split(" "))
- val m = r.map(word => (word, 1))
- m.collect
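- RDDs are lazily evaluated: a transformation only records what has to be done, and the job does not execute until an action is called. When an action runs, Spark works from the final RDD back through its parents (a bottom-to-top approach) to decide what to compute. This avoids doing work for transformations whose result is never used.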
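Lazy evaluation can be observed by counting how often the function inside a transformation is called. The sketch below uses a Spark long accumulator for the count and assumes local mode with no task retries; the object and method names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object LazySketch {
  // Returns how often the map function had run before and after the action.
  def callsBeforeAndAfterAction(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]")          // assumption: local mode for the sketch
      .appName("lazy-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val calls = sc.longAccumulator("map calls")
    val months = sc.parallelize(Seq("Jan", "Feb", "Mar"))

    // Transformation: only recorded in the lineage, not executed yet.
    val upper = months.map { m => calls.add(1); m.toUpperCase }
    val before = calls.sum

    upper.collect()                // action: now the map function actually runs
    val after = calls.sum

    (before, after)
  }

  def main(args: Array[String]): Unit = println(callsBeforeAndAfterAction())
}
```

Before the action the counter is still zero; after collect it equals the number of elements.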
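Other Key Points
- Transformations are of two types:
- Narrow (each output partition depends on one input partition, no shuffle)
- Wide (with shuffle)
- Repartitioning increases or decreases the number of partitions of the output.
- Narrow transformations are pipelined within one stage; each wide transformation starts a new stage.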
- Terms
- num executors (number of executors on the worker nodes of the cluster)
- executor memory (memory available to each executor)
- executor cores (maximum number of tasks an executor can run at a time)
- driver memory (memory for the driver, which connects to the master after spark-submit and coordinates the jobs on the worker nodes; preferably 1-2 GB)
- Results that are collected are all received by the driver program, so the driver memory must be large enough to hold them; otherwise the job will fail.
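The points in this section can be sketched together in one small program. The sizes below are hypothetical values chosen only for illustration, the object name TuningSketch is made up, and the sketch runs in local mode, where the executor settings are accepted but have little effect. spark.executor.instances, spark.executor.memory and spark.executor.cores are the configuration keys behind the first three terms.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object TuningSketch {
  // Hypothetical sizes, for illustration only.
  def conf(): SparkConf = new SparkConf()
    .setMaster("local[2]")                // assumption: local mode for the sketch
    .setAppName("tuning-sketch")
    .set("spark.executor.instances", "2") // num executors
    .set("spark.executor.memory", "2g")   // executor memory
    .set("spark.executor.cores", "2")     // executor cores
  // Driver memory is left to spark-submit (--driver-memory): in client mode the
  // driver JVM is already running by the time this code executes.

  def run(): (Int, Int, String) = {
    val spark = SparkSession.builder().config(conf()).getOrCreate()

    // map is a narrow transformation: no data moves between partitions.
    val pairs = spark.sparkContext.parallelize(1 to 1000, 2).map(n => (n % 10, 1))

    val more  = pairs.repartition(4).getNumPartitions // increase the partition count
    val fewer = pairs.coalesce(1).getNumPartitions    // decrease the partition count

    // reduceByKey is a wide transformation: it shuffles and starts a new stage.
    val counts = pairs.reduceByKey(_ + _)
    (more, fewer, counts.toDebugString)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The lineage text returned for counts shows a ShuffledRDD, which marks the stage boundary introduced by the wide transformation.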