Posts

NoSQL Databases and Pipeline Tool

  Kafka is a data pipeline tool. ZooKeeper is needed before installing Kafka because it acts as the cluster coordinator; Kafka stores all of its metadata in ZooKeeper.

  CAP: all three of Consistency, Availability and Partition tolerance cannot be achieved at the same time in NoSQL databases. Products like Facebook and Instagram approximate CAP by using multiple databases. CA together is not possible once a network partition occurs. In an RDBMS there is no replication, so CA is achieved. Examples of NoSQL databases: HBase, MongoDB, Cassandra.

  Datastores:
  Key-value datastore (Redis)
  Column-oriented datastore (HBase, Cassandra)
  Document-oriented datastore (MongoDB)
  Graph datastore (Neo4j)

  HBase is used for internal operations: lookup (very fast) and upsert (insert + update).
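  As a rough illustration of Kafka as a pipeline, here is a minimal sketch assuming the kafka-python package and a broker running on localhost:9092; the topic name events is made up for the example.

# Minimal sketch of pushing records through a Kafka pipeline with kafka-python.
# The broker address and the 'events' topic name are assumptions.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"user signed up")   # write one message to the 'events' topic
producer.flush()                             # make sure it reaches the broker

consumer = KafkaConsumer("events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:                     # read the messages back
    print(message.value)
    break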

Memory Calculations In Spark

  Dynamic Resource Allocation (DRA): Spark itself takes care of the memory calculations, which is generally not preferred.

  Cores: the number of concurrent tasks that can be executed in each executor. 5 cores means 5 parallel tasks can run per executor.

  No. of executors = (total no. of cores) / (no. of cores per executor)
  Executor memory = (RAM per node / no. of executors per node) - YARN overhead (about 2 GB) = x
  The executor memory range is (no. of cores - x).

  Driver memory varies depending on the workload.
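  A hypothetical sizing exercise for the formulas above; the cluster numbers (16 cores and 64 GB RAM per node, 5 cores per executor, 2 GB YARN overhead) are assumptions, not values from the post.

# Worked example of the executor sizing formulas. All cluster numbers are assumed.
cores_per_node = 16
ram_per_node_gb = 64
cores_per_executor = 5
yarn_overhead_gb = 2

# No. of executors per node = total cores / cores per executor (rounded down)
executors_per_node = cores_per_node // cores_per_executor            # -> 3

# Executor memory = (RAM / executors per node) - YARN overhead
executor_memory_gb = ram_per_node_gb / executors_per_node - yarn_overhead_gb

print(executors_per_node, round(executor_memory_gb, 1))              # 3 executors, ~19.3 GB each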

Spark Yarn Cluster

  Spark submit in cluster mode (YARN) -> (Hadoop + Spark). Sparklens integration is a profiling tool to check the memory allocation (executor memory, etc.).

  Spark uses hash partitioning, the same algorithm used by MapReduce. Map (parallelism, which acts as the input), Reduce (aggregation, which acts as the output). When the input is HDFS, no. of blocks = no. of tasks. No. of output tasks = no. of input tasks. Spark uses hash partitioning to decide which output partition goes to which output task:
  output partition = (hash of the key) % (no. of output tasks)

  Few commands:
  data.getNumPartitions    (returns the number of partitions)
  reducedata.repartition(3).saveAsTextFile("out.txt")

  Removing null values:
  df.dropna(how="any").show()                      #removes a row if any one column is null
  df.dropna(how="all").show()                      #removes a row only if all columns are null
  df.dropna(how="any", subset=["salary"]).show()   #removes a row if a null value is found in the salary column
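  A small sketch of the hash-partition rule above. Python's built-in hash() stands in for Spark's internal partitioner here, so the exact partition numbers will differ from Spark's, but the routing idea is the same.

# Illustrative only: each key is routed to partition hash(key) % number of output tasks.
num_output_tasks = 3

def output_partition(key):
    return hash(key) % num_output_tasks

for key in ["alice", "bob", "carol", "alice"]:
    print(key, "->", output_partition(key))   # the same key always lands in the same partition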

Spark Standalone Architecture and Working

  About the Spark standalone architecture. The daemons in Spark (master-slave architecture) are the Master node and the Worker nodes. The driver program is the starting point of a Spark application; it is started by the master and is responsible for the job.

  Let's understand the architecture of Spark with an example:
  The client submits a job in the form of a JAR; each JAR file contains a driver program.
  The Spark master executes the driver code.
  The driver requests resources from the cluster manager, which is responsible for resource allocation.
  The cluster manager allocates resources in the form of executors, and it checks data locality (where the data actually lives) before assigning them.
  The driver program is responsible for assigning the job (the JAR file) to the executors.
  Once an executor has been created it starts sending heartbeats to the driver program, and the cluster manager also informs the driver program that the resources have been allocated.
  The driver program runs on the master and the executors run on the workers. Exec
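  A minimal PySpark driver sketch for the flow described above; the standalone master URL (spark://master-host:7077), app name and HDFS path are assumptions for illustration.

# Sketch of a driver program submitted to a standalone cluster manager.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("standalone-demo")
         .master("spark://master-host:7077")   # request executors from the standalone master
         .getOrCreate())

# Work defined here is split into tasks that the driver hands to the executors.
counts = (spark.sparkContext.textFile("hdfs:///data/words.txt")
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
print(counts.take(5))

spark.stop()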

Spark

  About Spark. Spark is a data processing framework and a solution for big data; it is part of the Hadoop ecosystem. Spark SQL is used to connect with Hive for retrieving data.

  Spark has two data processing methods:
  Batch processing: dealing with files and databases.
  Stream processing: dealing with live data; it has less security.

  Spark can run on a standalone file system and distribute the data in memory, which is called in-memory processing. Spark can connect to any data storage. Spark replaces MapReduce in Hadoop as the engine for both batch and stream processing. Spark works with Java, Python, Scala and R.

  Architecture of Spark
  Deployment modes: Standalone (installation of only Spark on a single/multi node), YARN (Spark + Hadoop), Mesos.
  Daemon processes: Master (similar to the master in HDFS), Worker (similar to a slave in HDFS).
  The data is distributed across memory and then processed, which is called memory computation. Spark REPL (Read-Evaluate-Print Loop)
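  A rough sketch contrasting the two processing methods; the CSV path and the socket host/port are assumptions for illustration.

# Batch vs. stream processing in PySpark (illustrative paths and ports).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

# Batch processing: a bounded source such as a file or a database table.
batch_df = spark.read.csv("hdfs:///data/sales.csv", header=True)
batch_df.groupBy("region").count().show()

# Stream processing: an unbounded, live source (here a raw text socket).
stream_df = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())
query = stream_df.writeStream.format("console").start()
query.awaitTermination(10)   # let a few micro-batches run, then return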

Extended Topic In Hive

  TABLES

  Internal table / managed table: when the table is dropped, both the schema and the data are deleted from HDFS. It is managed by Hive and queried through Hive.
  CREATE TABLE TBL_NAME(CLM DTYPE, ...);

  External table: when the table is dropped, the schema (structure) is deleted but the data stays in HDFS. The metadata is still managed by Hive. It is used when the files live on a remote machine.
  CREATE EXTERNAL TABLE TBL_NAME(CLM DTYPE, ...);

  Hive partition: splits a large table into several parts based on one or more columns. A static partition needs the destination partition value to be specified; a dynamic partition doesn't, because Hive takes care of it internally. Partitioning works best when the partition column has a small number of distinct values (see the sketch below).

  Hive bucket: similar to partitioning, but with a hash function applied on top. It is generally used for large datasets. If the bucket count is not chosen properly, performance decreases.
  CREATE TABLE TBL_NAME(CLM DTYPE, ...) CLUSTERED BY (CLM) INTO 10 BUCKETS ROW F
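  A hedged sketch of the partition and bucket DDL above, issued through Spark SQL with Hive support; the table and column names (sales, staging_sales, users, region, user_id) are assumptions.

# Partitioned and bucketed tables created via Spark SQL with Hive support.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().appName("hive-ddl").getOrCreate()

# Partitioned table: data is split into one directory per region value.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
    PARTITIONED BY (region STRING)
""")

# Dynamic partitioning: Hive picks the destination partition from the data itself.
# staging_sales is a hypothetical source table.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("INSERT INTO sales PARTITION (region) SELECT id, amount, region FROM staging_sales")

# Bucketed table: rows are hashed on user_id into a fixed number of buckets.
spark.sql("""
    CREATE TABLE IF NOT EXISTS users (user_id INT, name STRING)
    CLUSTERED BY (user_id) INTO 10 BUCKETS
""")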

Hive

  About Hive. Hive was introduced by Facebook. It's a data warehouse (data collected from various sources) whose functionality is based on SQL. It's basically a query engine, though it can also be called a database. It's an open source framework. It's a vehicle that runs on an engine (MapReduce). It replaces Java, not MapReduce: instead of writing Java MapReduce jobs, we write Hive queries.

  Hive has a metastore which stores metadata, not data. The metadata is stored only in an RDBMS (Oracle, MySQL); the data you insert is stored in HDFS. In the absence of an external RDBMS for the metadata, Hive creates an embedded RDBMS called Derby. The combinations in Hive are:
  MySQL + Hive = remote metastore
  Derby + Hive = embedded metastore
  The drawback of the embedded metastore is data concurrency in a clustered system, since multiple nodes may be present.

  By default all Hive tables are stored under /user/hive/warehouse (check with bin/hadoop fs -ls /user/hive/warehouse). Without the LOAD command we can also move data files directly into the table's directory in the warehouse.
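  A small sketch of the split described above (schema in the metastore, rows in HDFS), using Spark SQL with Hive support; the warehouse path is the conventional default and the table name demo_tbl is an assumption.

# Metadata goes to the metastore, data files land under the warehouse directory.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-metastore-demo")
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE TABLE IF NOT EXISTS demo_tbl (id INT, name STRING)")
spark.sql("INSERT INTO demo_tbl VALUES (1, 'hive')")

# The schema comes from the metastore; the inserted rows live as files under
# /user/hive/warehouse/demo_tbl in HDFS.
spark.sql("DESCRIBE FORMATTED demo_tbl").show(truncate=False)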