Spark


About Spark

  • Spark is a data processing framework and a solution for big data.
  • Spark is part of the Hadoop ecosystem; it can run on Hadoop but does not require it.
  • Spark SQL can connect to Hive to retrieve data (see the sketch after this list).
  • Spark has two data processing methods
    • Batch Processing
      • Deals with data at rest, such as files and databases.
    • Stream Processing
      • Deals with live, continuously arriving data.
  • Spark itself has relatively little built-in security.
  • Spark can run on a standalone file system and distributes data across memory before processing; this is called in-memory processing.
  • Spark can connect to virtually any data storage.
  • Spark replaces MapReduce (MR) as the processing engine in Hadoop.
  • As a data processing framework, Spark does either batch processing or stream processing.
  • Spark works with Java, Python, Scala and R.
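
A minimal sketch of Spark SQL reading from Hive, assuming a Hive-enabled Spark build and an existing Hive table named employees (the table name is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveExample")
  .enableHiveSupport()    // lets Spark SQL use Hive's metastore
  .getOrCreate()

spark.sql("SELECT * FROM employees").show()   // run the query and print the rows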

Architecture of Spark

  • Deployment Modes
    • Standalone (installation of only Spark on a single/multi-node setup)
    • YARN (Spark + Hadoop)
    • Mesos
  • Daemon Processes
    • Master (similar to the Master in HDFS)
    • Worker (similar to a Slave in HDFS)
  • Data is distributed across the cluster's memory and processed there, which is called memory computation.
  • Spark REPL (Read-Evaluate-Print Loop)
    • Any interactive command-line technology like this is called a REPL
    • spark-shell (Scala)
    • pyspark (Python)

APIs in Spark (Processing Data)

  • RDD (low-level, code-centric API)
  • DataFrames (rows and columns)
  • Dataset (combines the typed style of RDDs with the structure of DataFrames; see the sketch below)
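
A minimal sketch contrasting the three APIs in spark-shell (which provides sc and spark; the Person case class and sample data are illustrative):

case class Person(name: String, age: Int)
val people = Seq(Person("Ann", 30), Person("Bob", 25))

val rdd = sc.parallelize(people)          // RDD: you code the operations yourself
val df  = spark.createDataFrame(people)   // DataFrame: named columns, SQL-style queries
df.select("name").show()
val ds  = df.as[Person]                   // Dataset: a DataFrame with a type attached
ds.filter(_.age > 26).show()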

Lazy Evaluation

  • In Spark there are two kinds of operations, called transformations and actions
    • Transformations: map, group, sort, filter
    • Actions: count, save to file, show, collect output
  • Transformations only describe the job; an action at the end is what triggers execution. Spark plans the work backwards from the action (a bottom-up approach), which is why this is called Lazy Evaluation.
  • If no action follows the transformations, the job would produce nothing useful, so it is never run; lazy evaluation avoids this wasted work.
  • Ex (shown concretely in the sketch after this list)
    • Map -> Filter -> Group -> Sort -> ........... -> group (no output, as there is no action at the end)
    • Map -> Filter -> Group -> Sort -> ........... -> count (output, as there is an action at the end)
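
A minimal sketch of lazy evaluation in spark-shell (the file path is illustrative):

val words = sc.textFile("file:/home/user/data.txt")
  .flatMap(line => line.split(" "))   // transformation: no job runs yet
  .filter(word => word.nonEmpty)      // transformation: still nothing runs
words.count                           // action: the whole chain executes now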

Tooling

  • To start Spark (standalone master and workers)
    • sbin/start-all.sh
  • The Spark master UI is at localhost:8080
  • To start the Spark shell (Scala)
    • bin/spark-shell
  • To start PySpark (Python)
    • bin/pyspark
  • While a shell or job is running, its application UI is at localhost:4040
  • WordCount is the basic first program for Spark, similar to Hello World.
  • There are two types of variables in Spark's Scala shell (see the snippet after this list)
    • var (mutable; it can be reassigned)
    • val (immutable; similar to a final variable)
  • sc is the SparkContext, the entry point through which the shell submits jobs to Spark
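
A quick illustration of var versus val in the Scala shell:

var x = 1
x = 2        // allowed: a var can be reassigned
val y = 1
// y = 2     // compile error: a val is fixed once assigned, like a final variable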


Wordcount Program

// Read the file, split each line into words, and pair each word with a count of 1
val a = sc.textFile("file:/home/user/data.txt").flatMap(line => line.split(" ")).map(word => (word, 1))
// Sum the counts for each word
val b = a.reduceByKey(_ + _)
// Action: triggers the job and returns the results to the driver
b.collect
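
To write the counts to disk instead of returning them to the driver, collect can be replaced with the saveAsTextFile action (the output path is illustrative):

b.saveAsTextFile("file:/home/user/wordcount_out")   // writes one part file per partition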

