Spark Yarn Cluster

 


  • Spark Submit in cluster mode (YARN) -> runs on the Hadoop + Spark cluster.
  • Sparklens integration: Sparklens is a profiling tool used to check memory allocation (executor memory, etc.).
  • Spark uses hash partitioning, the same algorithm that MapReduce uses.
  • Map (parallelism, operates on the input), Reduce (aggregation, produces the output).
  • No. of blocks = no. of tasks when the input is on HDFS.
  • No. of output tasks = no. of input tasks (unless a different partition count is specified).
  • Spark uses the hash-partition algorithm to decide which output partition, and hence which output task, a given record is sent to.
  • Hash partition (output partition) = hash(key) % (number of output tasks), as in the sketch below.
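
A minimal plain-Python sketch of that rule; the keys, values, and task count below are made-up examples, and Spark's own partitioner uses a different hash function, so the actual partition numbers can differ:

    num_output_tasks = 3  # assumed number of output tasks for this example
    records = [("apple", 1), ("banana", 2), ("apple", 3), ("cherry", 4)]

    for key, value in records:
        # output partition = hash(key) % (number of output tasks)
        partition = hash(key) % num_output_tasks
        print(f"key={key!r} value={value} -> partition {partition}")

Every record with the same key lands in the same output partition, which is what lets a single output task see all the values for that key.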

Few Commands

  • data.getNumPartitions()    (returns the number of partitions of the RDD)
  • reducedata.repartition(3).saveAsTextFile("out.txt")    (writes the output as 3 part files under the out.txt directory)
  • Removing Null Values
    • df.dropna(how="any").show() #removes row if any one column is null
    • df.dropna(how="all").show() #removes row if all columns are null
    • df.dropna(how="any",subset=["salary"]).show() #removes a row if the salary column is null
    • df.dropna(how="any",thresh=2).show() #keeps rows with at least 2 non-null values (thresh overrides how)
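
A self-contained PySpark sketch of the dropna variants above; the column names and sample rows are assumptions for illustration only:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dropna-demo").getOrCreate()

    # hypothetical rows containing nulls
    df = spark.createDataFrame(
        [("arun", 30, 1000), ("bala", None, None), (None, None, None)],
        ["name", "age", "salary"],
    )

    df.dropna(how="any").show()                     # drops a row if any column is null
    df.dropna(how="all").show()                     # drops a row only if every column is null
    df.dropna(how="any", subset=["salary"]).show()  # drops a row if salary is null
    df.dropna(thresh=2).show()                      # keeps rows with at least 2 non-null values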

Repartition or Coalesce

  • Repartition is used to increase the number of partitions (a new set of partitions is built from the parent partitions).
    • When it is performed, the whole data set is reshuffled, so the job may need to be tuned again.
    • It can be used to either increase or decrease the number of partitions, depending on the data.
  • Coalesce is used to reduce the number of partitions (existing partitions are merged into fewer new ones).
    • No shuffle happens here, so performance usually improves.
    • But parallelism is reduced, so it is not faster in every case (see the sketch below).
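
A small sketch contrasting the two calls; the data and partition counts are arbitrary examples:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(100), 4)   # start with 4 partitions
    print(rdd.getNumPartitions())         # 4

    more = rdd.repartition(8)             # full shuffle; can increase or decrease
    print(more.getNumPartitions())        # 8

    fewer = rdd.coalesce(2)               # merges existing partitions; no shuffle
    print(fewer.getNumPartitions())       # 2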

groupByKey vs reduceByKey

  • reduceByKey is preferred for calculations/aggregations.
  • reduceByKey uses a map-side combiner (the MapReduce combiner idea) for its tasks, as the sketch below shows.
    • It applies the reducer logic on the map side.
    • It acts as a mini reducer.
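
A word-count sketch showing the difference: reduceByKey combines values per partition before the shuffle, while groupByKey ships every individual pair across the network (the input words are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("reducebykey-vs-groupbykey").getOrCreate()
    sc = spark.sparkContext

    pairs = sc.parallelize(["spark", "yarn", "spark", "hdfs", "spark"]).map(lambda w: (w, 1))

    # reduceByKey: partial sums per partition (the "mini reducer"/combiner), then shuffle
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # groupByKey: shuffles every (word, 1) pair, then sums after the shuffle
    counts_grouped = pairs.groupByKey().mapValues(sum)

    print(counts.collect())
    print(counts_grouped.collect())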

Other important points

  • Dedup in Spark is the process of removing duplicate rows from a table (see the sketch after this list).
    • df.dropDuplicates(["column_name"]).show()
  • Duplicates can come from the kind of data the source provides.
  • Duplicates can also be created when the scheduler (Oozie) fails and the job is restarted.
  • PySpark UDFs are slow in Python; it is suggested to use other languages to overcome this challenge.
  • Moving-average calculation (3-point window) = ((-1) + (1) + (0)) / 3 = 0 (a window-function sketch follows this list).
    • Spark MLlib is used to run FP-Growth (frequent pattern mining).
    • confidence of a rule X => Y = support(X and Y) / support(X)
  • crosstab in PySpark is used to compute the frequency distribution between two variables (a pair-wise frequency table).
    • df.crosstab("name","age").show()
  • Size of the cluster = size of the system, i.e. the sum of the sizes of its nodes.
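
A sketch of dropDuplicates and crosstab on a toy DataFrame; the column names and rows are assumptions for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dedup-crosstab").getOrCreate()

    df = spark.createDataFrame(
        [("arun", 30), ("arun", 30), ("bala", 25), ("bala", 28)],
        ["name", "age"],
    )

    df.dropDuplicates().show()          # removes fully duplicate rows
    df.dropDuplicates(["name"]).show()  # keeps one row per name
    df.crosstab("name", "age").show()   # pair-wise frequency table of name vs age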
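
And for the moving-average point, a hedged sketch using a window over the previous, current, and next row; the day/value columns and sample numbers are assumptions, not from the source:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("moving-average").getOrCreate()

    df = spark.createDataFrame(
        [(1, -1.0), (2, 1.0), (3, 0.0), (4, 2.0)],
        ["day", "value"],
    )

    # 3-row moving average: previous row + current row + next row, divided by 3
    w = Window.orderBy("day").rowsBetween(-1, 1)
    df.withColumn("moving_avg", F.avg("value").over(w)).show()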
