Spark Yarn Cluster

 


  • Spark Submit in cluster mode (YARN) -> runs on the Hadoop + Spark cluster.
  • Sparklens integration: Sparklens is a profiling tool used to check memory allocation (executor memory, etc.).
  • Spark uses hash partitioning, the same algorithm that MapReduce uses.
  • Map (parallelism, operates on the input), Reduce (aggregation, produces the output).
  • No. of blocks = no. of tasks when the input is on HDFS.
  • No. of output tasks = no. of input tasks (unless a different partition count is specified).
  • Spark uses the hash-partition algorithm to decide which output partition, and hence which output task, a given record is sent to.
  • Hash partition (output partition) = hash(key) % (number of output tasks), as in the sketch below.
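
A minimal plain-Python sketch of that rule; the keys, values, and task count below are made-up examples, and Spark's own partitioner uses a different hash function, so the actual partition numbers can differ:

    num_output_tasks = 3  # assumed number of output tasks for this example
    records = [("apple", 1), ("banana", 2), ("apple", 3), ("cherry", 4)]

    for key, value in records:
        # output partition = hash(key) % (number of output tasks)
        partition = hash(key) % num_output_tasks
        print(f"key={key!r} value={value} -> partition {partition}")

Every record with the same key lands in the same output partition, which is what lets a single output task see all the values for that key.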

Few Commands

  • data.getNumPartitions()    (returns the number of partitions of the RDD)
  • reducedata.repartition(3).saveAsTextFile("out.txt")    (writes the output as 3 part files under the out.txt directory)
  • Removing Null Values
    • df.dropna(how="any").show() #removes row if any one column is null
    • df.dropna(how="all").show() #removes row if all columns are null
    • df.dropna(how="any",subset=["salary"]).show() #removes a row if the salary column is null
    • df.dropna(how="any",thresh=2).show() #keeps rows with at least 2 non-null values (thresh overrides how)
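
A self-contained PySpark sketch of the dropna variants above; the column names and sample rows are assumptions for illustration only:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dropna-demo").getOrCreate()

    # hypothetical rows containing nulls
    df = spark.createDataFrame(
        [("arun", 30, 1000), ("bala", None, None), (None, None, None)],
        ["name", "age", "salary"],
    )

    df.dropna(how="any").show()                     # drops a row if any column is null
    df.dropna(how="all").show()                     # drops a row only if every column is null
    df.dropna(how="any", subset=["salary"]).show()  # drops a row if salary is null
    df.dropna(thresh=2).show()                      # keeps rows with at least 2 non-null values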

Repartition or Coalesce

  • Repartition is used to increase the number of partitions (a new set of partitions is built from the parent partitions).
    • When it is performed, the whole data set is reshuffled, so the job may need to be tuned again.
    • It can be used to either increase or decrease the number of partitions, depending on the data.
  • Coalesce is used to reduce the number of partitions (existing partitions are merged into fewer new ones).
    • No shuffle happens here, so performance usually improves.
    • But parallelism is reduced, so it is not faster in every case (see the sketch below).
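
A small sketch contrasting the two calls; the data and partition counts are arbitrary examples:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(100), 4)   # start with 4 partitions
    print(rdd.getNumPartitions())         # 4

    more = rdd.repartition(8)             # full shuffle; can increase or decrease
    print(more.getNumPartitions())        # 8

    fewer = rdd.coalesce(2)               # merges existing partitions; no shuffle
    print(fewer.getNumPartitions())       # 2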

groupByKey vs reduceByKey

  • reduceByKey is preferred for calculations/aggregations.
  • reduceByKey uses a map-side combiner (the MapReduce combiner idea) for its tasks, as the sketch below shows.
    • It applies the reducer logic on the map side.
    • It acts as a mini reducer.
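
A word-count sketch showing the difference: reduceByKey combines values per partition before the shuffle, while groupByKey ships every individual pair across the network (the input words are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("reducebykey-vs-groupbykey").getOrCreate()
    sc = spark.sparkContext

    pairs = sc.parallelize(["spark", "yarn", "spark", "hdfs", "spark"]).map(lambda w: (w, 1))

    # reduceByKey: partial sums per partition (the "mini reducer"/combiner), then shuffle
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # groupByKey: shuffles every (word, 1) pair, then sums after the shuffle
    counts_grouped = pairs.groupByKey().mapValues(sum)

    print(counts.collect())
    print(counts_grouped.collect())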

Other important points

  • Dedup in Spark is the process of removing duplicate rows from a table (see the sketch after this list).
    • df.dropDuplicates(["column_name"]).show()
  • Duplicates can come from the kind of data the source provides.
  • Duplicates can also be created when the scheduler (Oozie) fails and the job is restarted.
  • PySpark UDFs are slow in Python; it is suggested to use other languages to overcome this challenge.
  • Moving-average calculation (3-point window) = ((-1) + (1) + (0)) / 3 = 0 (a window-function sketch follows this list).
    • Spark MLlib is used to run FP-Growth (frequent pattern mining).
    • confidence of a rule X => Y = support(X and Y) / support(X)
  • crosstab in PySpark is used to compute the frequency distribution between two variables (a pair-wise frequency table).
    • df.crosstab("name","age").show()
  • Size of the cluster = size of the system, i.e. the sum of the sizes of its nodes.
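
A sketch of dropDuplicates and crosstab on a toy DataFrame; the column names and rows are assumptions for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dedup-crosstab").getOrCreate()

    df = spark.createDataFrame(
        [("arun", 30), ("arun", 30), ("bala", 25), ("bala", 28)],
        ["name", "age"],
    )

    df.dropDuplicates().show()          # removes fully duplicate rows
    df.dropDuplicates(["name"]).show()  # keeps one row per name
    df.crosstab("name", "age").show()   # pair-wise frequency table of name vs age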
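
And for the moving-average point, a hedged sketch using a window over the previous, current, and next row; the day/value columns and sample numbers are assumptions, not from the source:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("moving-average").getOrCreate()

    df = spark.createDataFrame(
        [(1, -1.0), (2, 1.0), (3, 0.0), (4, 2.0)],
        ["day", "value"],
    )

    # 3-row moving average: previous row + current row + next row, divided by 3
    w = Window.orderBy("day").rowsBetween(-1, 1)
    df.withColumn("moving_avg", F.avg("value").over(w)).show()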
