

About Spark

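  • Spark is a distributed data processing framework for big data.
  • Spark is not a subset of Hadoop: it is a separate engine that can run on a Hadoop cluster (YARN, HDFS) or without Hadoop.
  • Spark SQL queries structured data and can connect to Hive to retrieve data from Hive tables.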
  • Spark has two data processing modes
    • Batch Processing
      • Deals with data at rest, such as files and databases.
    • Stream Processing
      • Deals with live data as it arrives.
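  • Its security features, such as authentication, are turned off by default.
  • Spark can run on its own, using a local file system instead of HDFS. It distributes the data across the memory of the nodes and processes it there, which is called in-memory processing.
  • Spark can connect to many kinds of data storage, such as local files, HDFS, Hive and JDBC databases.
  • Spark can replace MapReduce (MR) as the processing engine in Hadoop.
  • In short, Spark is a data processing framework that does both batch processing and stream processing.
  • Spark provides APIs for Java, Python, Scala and R.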
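The difference between the two processing modes can be seen from code. The following is a minimal sketch, assuming Spark is on the classpath and run in local mode; the object name `BatchVsStream` is made up for illustration, and the built-in `rate` source stands in for a real live feed.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this sketch.
object BatchVsStream {
  // Returns (isStreaming of a batch dataset, isStreaming of a streaming dataset).
  def run(): (Boolean, Boolean) = {
    // local[*] runs Spark inside this JVM, so no cluster is needed.
    val spark = SparkSession.builder()
      .appName("BatchVsStream")
      .master("local[*]")
      .getOrCreate()
    try {
      // Batch: a bounded dataset that is fully known up front.
      val batch = spark.range(5)
      // Stream: the built-in "rate" test source keeps generating rows over time.
      // Assumption: it stands in here for a real live source.
      val stream = spark.readStream
        .format("rate")
        .option("rowsPerSecond", "1")
        .load()
      (batch.isStreaming, stream.isStreaming)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val (batchFlag, streamFlag) = run()
    println(s"batch.isStreaming = $batchFlag, stream.isStreaming = $streamFlag")
  }
}
```

Both datasets go through the same API; only the way they are read (`read` or `range` versus `readStream`) marks one as bounded and the other as unbounded.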

Architecture of Spark

  • Deployment Modes
    • Standalone (only Spark installed, on a single node or multiple nodes)
    • YARN (Spark + Hadoop)
    • Mesos
  • Daemon Processes
    • Master (similar to the master node in HDFS)
    • Worker (similar to the slave node in HDFS)
  • The data is distributed across the memory of the workers and then processed there, which is called in-memory computation
  • Spark REPL (Read-Eval-Print Loop)
    • An interactive command-line shell that reads a command, evaluates it and prints the result is called a REPL
    • spark-shell (Scala)
    • pyspark (Python)
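Each deployment mode is selected through a master URL, for example with `bin/spark-shell --master <url>`. The sketch below maps the modes listed above to the URL format Spark expects. The object name `MasterUrls`, the host name `master-host` and the extra `local` mode are assumptions added for illustration; 7077 and 5050 are the usual default ports and may differ on a given cluster.

```scala
// Hypothetical helper, not part of Spark: it only builds master URL strings.
object MasterUrls {
  // host is a placeholder; replace it with the real master host of the cluster.
  def forMode(mode: String, host: String = "master-host"): String =
    mode.toLowerCase match {
      case "local"      => "local[*]"            // single JVM, all cores (added for the sketch)
      case "standalone" => s"spark://$host:7077" // Spark's own cluster manager
      case "yarn"       => "yarn"                // cluster location comes from the Hadoop config
      case "mesos"      => s"mesos://$host:5050"
      case other =>
        throw new IllegalArgumentException(s"unknown deployment mode: $other")
    }

  def main(args: Array[String]): Unit =
    Seq("local", "standalone", "yarn", "mesos")
      .foreach(m => println(s"$m -> ${forMode(m)}"))
}
```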

APIs in Spark (Processing Data)

  • RDD (low-level API; processing is written as code)
  • DataFrame (rows and named columns)
  • Dataset (typed extension of the DataFrame API, for Scala and Java)
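The three APIs can express the same query. The following is a minimal sketch, assuming Spark is on the classpath and run in local mode; the `Person` case class, the sample data and the object name `ThreeApis` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for this sketch.
case class Person(name: String, age: Int)

object ThreeApis {
  // Counts the adults in the sample data with each of the three APIs.
  def run(): (Long, Long, Long) = {
    val spark = SparkSession.builder()
      .appName("ThreeApis")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables toDF, toDS and the $"column" syntax
    try {
      val people = Seq(Person("Ann", 31), Person("Bob", 17), Person("Cid", 45))

      // RDD: plain objects, logic written as ordinary functions.
      val adultsRdd = spark.sparkContext.parallelize(people)
        .filter(p => p.age >= 18)
        .count()

      // DataFrame: rows with named columns, queried by column name.
      val adultsDf = people.toDF().filter($"age" >= 18).count()

      // Dataset: the same columns, but each row is a typed Person.
      val adultsDs = people.toDS().filter(p => p.age >= 18).count()

      (adultsRdd, adultsDf, adultsDs)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit =
    println(run())
}
```

In Scala a DataFrame is a Dataset of generic rows, which is why the DataFrame and Dataset versions look so similar.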

Lazy Evaluation

  • In Spark there are two kinds of operations, called transformations and actions
    • Transformations: map, groupBy, sortBy, filter
    • Actions: count, saveAsTextFile, show, and other operations that produce output
  • Transformations are only recorded, not run. A job needs an action at the end: the action makes Spark work back through the recorded transformations and execute them. This is called lazy evaluation.
  • Without an action, the result of the transformations is never used, so Spark does not run them at all.
  • Example
    • Map -> Filter -> Group -> Sort -> ........... -> Group (no output, as there is no action at the end)
    • Map -> Filter -> Group -> Sort -> ........... -> Count (output, as there is an action at the end)
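Lazy evaluation can be observed directly by counting how often the function passed to a transformation has run. The following is a minimal sketch, assuming Spark is on the classpath and a simple local run without task retries; the object name `LazyEvaluationDemo` and the accumulator name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvaluationDemo {
  // Returns how many times the map function had run before and after the action.
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .appName("LazyEvaluationDemo")
      .master("local[*]")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      // Accumulator used as a counter of map-function invocations.
      val calls = sc.longAccumulator("mapCalls")

      // Transformations only: Spark records them but executes nothing yet.
      val result = sc.parallelize(1 to 5)
        .map { x => calls.add(1); x * 2 }
        .filter(_ % 4 == 0)
      val before = calls.value.longValue()

      // Action: count() triggers the whole recorded chain.
      result.count()
      val after = calls.value.longValue()

      (before, after)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val (before, after) = run()
    println(s"map calls before the action: $before, after the action: $after")
  }
}
```

The counter stays at zero until `count()` is called, which matches the two chains in the example: without a final action the chain produces nothing.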


  • To start Spark
    • sbin/
  • The Spark master web UI is at localhost:8080
  • To start the Spark shell (Scala)
    • bin/spark-shell
    • The shell's application UI is at localhost:4040
  • To start pyspark (Python)
    • bin/pyspark
  • WordCount is the basic first program for Spark, similar to Hello World.
  • There are two kinds of variables in Scala, the language of spark-shell
    • var (mutable: can be reassigned)
    • val (immutable: cannot be reassigned, similar to a final variable in Java)
  • sc is the SparkContext that the shell creates at startup; it is the entry point through which jobs are submitted to Spark
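The difference between the two kinds of variables shows up on reassignment. The following is a small plain-Scala sketch that needs no Spark; the object name `ValVsVar` and the variable names are made up for illustration.

```scala
object ValVsVar {
  // Returns the final values of a var and a val.
  def demo(): (Int, Int) = {
    var counter = 1        // var: may be reassigned later
    counter = counter + 1  // allowed

    val limit = 10         // val: fixed once it is initialized
    // limit = 11          // would not compile: "reassignment to val"

    (counter, limit)
  }

  def main(args: Array[String]): Unit =
    println(demo())
}
```

RDDs themselves are never modified in place, since each transformation returns a new one, so `val` is usually the natural choice for holding them.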

Wordcount Program

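// Runs in spark-shell, where sc (the SparkContext) is already defined.
val a = sc.textFile("file:/home/user/data.txt").flatMap(line => line.split(" ")).map(word => (word, 1)) // (word, 1) pairs
val b = a.reduceByKey(_ + _) // add up the 1s for each word
b.collect().foreach(println) // action: without it, lazy evaluation means nothing runs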
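The program above relies on the `sc` that spark-shell provides. Outside the shell the application has to create its own SparkContext. The following is a minimal sketch of the same word count as a standalone program, assuming Spark is on the classpath and run in local mode; the object name `WordCountApp` and the in-memory sample lines are made up so that the example does not depend on a file.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  // Counts the words in the given lines and returns the result to the driver as a Map.
  def countWords(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)               // stands in for sc.textFile(...)
      .flatMap(line => line.split(" ")) // one record per word
      .map(word => (word, 1))           // (word, 1) pairs
      .reduceByKey(_ + _)               // add up the 1s for each word
      .collect()                        // action: runs the job
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountApp").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      countWords(sc, Seq("to be or", "not to be"))
        .foreach { case (word, n) => println(s"$word: $n") }
    } finally sc.stop()
  }
}
```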


