Spark Standalone Architecture and Working

 


About Spark Standalone Architecture

  • Spark follows a master-slave architecture with two daemons (see the start-up sketch after this list):
    • Master node
    • Worker node
  • The driver program is the starting point of a Spark application; it is started by the master and is responsible for the job.
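
A minimal sketch of how the two daemons are brought up, assuming a standard Spark distribution (the host name and SPARK_HOME path are assumptions):

    # On the master machine: start the Master daemon (cluster UI on port 8080 by default)
    $SPARK_HOME/sbin/start-master.sh

    # On each worker machine: start a Worker daemon and register it with the master
    # (older Spark releases call this script start-slave.sh)
    $SPARK_HOME/sbin/start-worker.sh spark://master-host:7077
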
Let's understand the architecture of Spark with an example:
  • The client submits the job in the form of a JAR file (a minimal spark-submit sketch follows this list).
  • Each JAR file contains a driver program.
  • The Spark master executes the driver code.
  • The driver requests resources from the cluster manager.
  • The cluster manager is responsible for allocating resources.
  • The cluster manager allocates resources in the form of executors.
  • The cluster manager checks data locality (where the relevant data resides) before assigning executors.
  • The driver program is responsible for assigning the job (the JAR file) to the executors.
  • Once an executor has been created, it starts sending heartbeats to the driver program, and the cluster manager also informs the driver program that the resources have been allocated.
  • The driver program runs on the master, and the executors run on the workers.
  • Executors are assigned tasks; they take the respective data as input and start processing.
  • Idle worker nodes also report to the Spark master that they are alive.
  • The persisted (intermediate) data, i.e. the result left after an executor's task is done, can be stored in 3 ways:
    • Disk
    • Memory
    • Disk + Memory
  • When the executors finish a task, the driver program checks whether any worker node is free; the cluster manager then allocates resources, the driver assigns the next job (say, grouping the persisted data), and the final output is stored to some storage.
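
A minimal sketch of the first step above, where the client submits the job as a JAR to the standalone master (the class name, JAR name and host are assumptions):

    # Submit the application JAR; the driver code packaged in the JAR is then executed
    spark-submit \
      --master spark://master-host:7077 \
      --class com.example.WordCountJob \
      wordcount-job.jar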

Key Points

  • If an executor fails, the driver program relaunches the executor with the same task (temporary crash).
  • In Spark, replication can be achieved through RDDs.
  • Spark has the feature of in-memory processing.
  • The driver program is the metadata holder.
  • If the driver program fails, the Spark master recreates it and starts recomputing the tasks.
  • In case of permanent failure of the master, ZooKeeper is the technology that enables HA (High Availability), i.e. more than one master (see the configuration sketch after this list).
  • ZooKeeper elects a passive master if the active master fails.
  • All passive masters receive heartbeats from the workers, which is useful when one of them becomes the active master.
  • If ZooKeeper itself fails, this is handled automatically (if the leader dies, a follower is elected as the new leader).
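
A sketch of the ZooKeeper-based HA setup mentioned above, added to conf/spark-env.sh on every master machine (the ZooKeeper host names and directory are assumptions):

    # Standby masters recover the cluster state from ZooKeeper on failover
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 -Dspark.deploy.zookeeper.dir=/spark"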

RDD

  • In Spark development, an RDD refers to a collection of data elements distributed across the various machines in the cluster.
  • RDD stands for:
    • Resilient (fault tolerance)
    • Distributed (data distribution)
    • Dataset (e.g. CSV, JSON)
  • RDD lineage is nothing but the graph of all the parent RDDs of an RDD (see the sketch after this list).
    • Ex: P1 ----(transformation)----> P2 ----(transformation)----> P3 ----> ... ----> Output
    • The transformations from one phase to the next are called the data lineage, which is maintained through the DAG (which holds the metadata).
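
A small spark-shell sketch of inspecting that lineage graph (the file path is an assumption; sc is the SparkContext that spark-shell predefines):

    // Build a chain of transformations: P1 -> P2 -> P3
    val p1 = sc.textFile("file:/home/src.txt")        // P1: base RDD
    val p2 = p1.flatMap(line => line.split(" "))      // P2: transformation
    val p3 = p2.map(word => (word, 1))                // P3: transformation

    // toDebugString prints the lineage, i.e. the parent RDDs this RDD was derived from
    println(p3.toDebugString)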

Fault Tolerance of Lineage(Transformation):

  • If a particular transformation fails, Spark finds the lineage and recomputes that transformation using the DAG.
  • The persisted result of each transformation can be stored with storage levels such as:
    • MEMORY_ONLY_2
      • which stores 2 copies, one of them on another machine (see the persist sketch after this list).
  • We have to decide on the replication level, which depends on the available memory.
  • Every concept in Spark is built on RDDs (transformations + actions).
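
A spark-shell sketch of persisting a transformation's result with the replicated storage level mentioned above (the file path is an assumption):

    import org.apache.spark.storage.StorageLevel

    val words = sc.textFile("file:/home/src.txt").flatMap(line => line.split(" "))

    // Keep two in-memory copies, the second on another machine, so a lost
    // partition can be served without recomputing the whole lineage
    words.persist(StorageLevel.MEMORY_ONLY_2)

    words.count   // action: triggers the computation and the caching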

About RDD

  • There are three ways to create an RDD:
    • Parallelized collection
      • val rdd1 = sc.parallelize(Array("Jan", "Feb", "Mar"))
      • rdd1.collect
    • External dataset
      • val r = sc.textFile("file:/home/src.txt")
      • r.collect
    • Creating an RDD from an existing RDD
      • val r = sc.textFile("file:/home/src.txt").flatMap(line => line.split(" "))
      • val m = r.map(word => (word, 1))
      • m.collect
  • RDDs are lazily evaluated, following a bottom-to-top approach: until an action is called, the job won't execute, because a chain of transformations is of no use without an action to produce a result (see the sketch below).
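
A spark-shell sketch of lazy evaluation with the word count built above (file path assumed): the transformations only describe the DAG, and nothing runs until the action at the end.

    // Transformations: no job is launched yet, only the lineage/DAG is built
    val counts = sc.textFile("file:/home/src.txt")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // The action triggers execution of the whole chain
    counts.collect().foreach(println)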

Other Keypoints



  • Transformations are of two types:
    • Narrow (one-to-one)
    • Wide (with shuffle)
  • Repartition increases/decreases the number of partitions of the output.
  • A new stage is created at every wide (shuffle) transformation; narrow transformations are pipelined within a stage.
  • Terms (see the spark-submit sketch after this list):
    • num-executors (number of executors across the worker nodes in the cluster)
    • executor-memory (memory space for each executor)
    • executor-cores (maximum number of tasks an executor can run at a time)
    • driver-memory (when spark-submit is run, the master connection is created with the driver and it coordinates the jobs on the worker nodes; preferably 1-2 GB)
  • All the collected data arrives combined at the driver program, so the driver must have enough memory to receive it; otherwise the job will fail.
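
A hedged spark-submit sketch wiring the terms above together (the values, class and JAR names are assumptions; --num-executors is typically honored on YARN, while standalone mode caps resources with --total-executor-cores instead):

    # executor-memory: memory per executor; executor-cores: max concurrent tasks per executor
    # driver-memory: memory for the driver, which holds the metadata and any collected results
    spark-submit \
      --master spark://master-host:7077 \
      --class com.example.WordCountJob \
      --executor-memory 4G \
      --executor-cores 2 \
      --driver-memory 2G \
      --total-executor-cores 8 \
      wordcount-job.jar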
