Spark
About Spark
- Spark is a data processing framework and a solution for big data.
- Spark is part of the Hadoop ecosystem: it can run on top of Hadoop, but it does not require it.
- Spark SQL can connect to Hive to retrieve data.
- Spark has two data processing modes (see the sketch after this list)
- Batch Processing
- Deals with data at rest, such as files and databases.
- Stream Processing
- Deals with live, continuously arriving data.
- Spark has comparatively limited built-in security.
- Spark can run on a standalone file system and distribute the data across memory; this is called in-memory processing.
- Spark can connect to many different data stores.
- Spark can replace MapReduce (MR) in Hadoop.
- Spark is a data processing framework that performs batch processing or stream processing.
- Spark works with Java, Python, Scala and R.
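To make the two processing modes concrete, here is a minimal spark-shell (Scala) sketch; the input path, host and port are assumptions, and spark is the SparkSession the shell provides.

// Batch processing: read a complete file that already exists (path is an assumption)
val batchDf = spark.read.text("file:/home/user/data.txt")
batchDf.count()                        // runs once over the whole file

// Stream processing: continuously read live data from a local socket (host/port are assumptions)
val streamDf = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()
val query = streamDf.writeStream
  .format("console")                   // print each micro-batch as it arrives
  .start()
// query.stop() ends the stream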
Architecture of Spark
- Deployment Modes
- Standalone (only Spark installed, on a single node or multiple nodes)
- YARN (Spark + Hadoop)
- Mesos
- Daemon Processes
- Master (similar to the Master/NameNode in HDFS)
- Worker (similar to a Slave/DataNode in HDFS)
- Data is distributed across the workers' memory and processed there, which is called in-memory computation.
- Spark REPL (Read-Evaluate-Print Loop)
- Any interactive command-line technology that reads input, evaluates it and prints the result in a loop is called a REPL.
- spark-shell (Scala)
- pyspark (Python)
APIs in Spark (Processing Data)
- RDD (low-level API, manipulated through code)
- DataFrames (rows and columns)
- Dataset (strongly typed extension of the DataFrame API; see the sketch below)
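A minimal spark-shell sketch of the three APIs with made-up sample data (the names and ages are assumptions); sc is the SparkContext and spark the SparkSession that the shell provides.

import spark.implicits._                                    // already imported by spark-shell

// RDD: low-level API of plain Scala objects, manipulated through code
val rdd = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
rdd.filter(_._2 > 26).collect()

// DataFrame: the same data organised into rows and named columns
val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
df.show()

// Dataset: typed rows backed by a case class
case class Person(name: String, age: Int)
val ds = Seq(Person("alice", 30), Person("bob", 25)).toDS()
ds.filter(_.age > 26).show()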
Lazy Evaluation
- In Spark there are two kinds of operations: transformations and actions
- Transformations: map, filter, groupByKey, sortBy
- Actions: count, collect, show, saveAsTextFile
- After the transformations of a job there must be an action at the end to trigger execution, because Spark evaluates the chain bottom up from the action; this is called Lazy Evaluation (see the sketch below the example).
- If no action follows the transformations, the job would produce nothing useful, so Spark executes nothing; lazy evaluation was introduced to avoid this wasted work.
- Ex
- Map->Filter->Group->Sort->...........->group (No output as there is no action at the end)
- Map->Filter->Group->Sort->...........->count (Output as there is action at the end)
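A minimal spark-shell sketch of lazy evaluation (the input path is an assumption): the transformations only build a plan, and nothing runs until the count action at the end.

val lines = sc.textFile("file:/home/user/data.txt")    // transformation: nothing is read yet
val words = lines.flatMap(_.split(" "))                // transformation: still nothing executed
val longWords = words.filter(_.length > 3)             // transformation: still nothing executed
// No action so far, so no job has run on the cluster
val n = longWords.count()                              // action: the whole chain executes now
println(n)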
Tooling
- To start Spark
- sbin/start-all.sh
- The Spark master web UI is at localhost:8080
- To start the Spark shell (Scala)
- bin/spark-shell
- The running application's UI is at localhost:4040
- To start pyspark (Python)
- bin/pyspark
- WordCount is the basic Spark program, similar to Hello World.
- There are two kinds of variable declarations in the Scala shell (see the sketch after this list)
- var (mutable, can be reassigned)
- val (immutable, similar to a final variable)
- sc is the SparkContext, the entry point through which the shell builds and submits jobs
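A minimal shell sketch of val vs var and of using sc (the numbers are arbitrary examples).

val x = 10          // immutable: reassigning x would be a compile error, like a final variable
var y = 10          // mutable: can be reassigned
y = 20

// sc is the SparkContext created by spark-shell; jobs are submitted through it
val nums = sc.parallelize(1 to 100)
nums.sum()          // action executed via the SparkContext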
Wordcount Program
// Read the input file, split each line into words, and pair each word with a count of 1
var a = sc.textFile("file:/home/user/data.txt").flatMap(line => line.split(" ")).map(word => (word, 1))
// Sum the counts for each word
var b = a.reduceByKey(_ + _)
// Action: collect the (word, count) pairs to the driver and display them
b.collect
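A possible follow-up (the output directory is an assumption): sort the counts and write them out with the saveAsTextFile action mentioned earlier.

val sorted = b.sortBy(_._2, ascending = false)               // transformation: highest counts first
sorted.saveAsTextFile("file:/home/user/wordcount-output")    // action: writes part files to the directory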