Posts

Showing posts from July, 2023

NOSQL Database and Pipeline tool

Kafka is a data pipeline tool. ZooKeeper is needed before installing Kafka, as it is the cluster coordinator; Kafka stores all of its metadata in ZooKeeper.

NoSQL and CAP: all three CAP properties cannot be achieved at the same time in a NoSQL database. Products like Facebook and Instagram achieve CAP by using multiple databases. CA is not possible at the same time because of partitions. In an RDBMS no replication is present, so CA is achieved. Ex: HBase, MongoDB, Cassandra.

Datastores:
- Key-value datastore (Redis)
- Column-oriented datastore (HBase, Cassandra)
- Document-oriented datastore (MongoDB)
- Graph datastore (Neo4j)

HBase is used for internal operations: lookup (very fast) and upsert (insert + update).
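As a quick illustration of Kafka acting as a pipeline, here is a minimal sketch using the kafka-python client; the broker address (localhost:9092) and the topic name are assumptions, not details from the post.

```python
# Minimal sketch: send one message into a Kafka topic with kafka-python.
# Broker address and topic name below are assumed for illustration.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"hello from the pipeline")  # (topic, message bytes)
producer.flush()
producer.close()
```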

Memory Calculations In Spark

Dynamic Resource Allocation (DRA): Spark itself takes care of the memory calculations, which is highly not preferred.

Cores: the number of concurrent tasks that can be executed in each node; 5 cores means 5 parallel tasks can execute per executor.

No. of executors = (total no. of cores) / (no. of cores per executor)
Executor memory = ((RAM) / (no. of executors)) - (YARN memory (2 GB)) = x
Each executor then gets the chosen number of cores and x of memory.
Driver memory varies based on the condition.
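A minimal sketch of that arithmetic, assuming a made-up node size (16 cores, 64 GB RAM) and 5 cores per executor; none of these numbers come from the post.

```python
# Executor sizing sketch: node size and cores-per-executor are assumptions.
cores_per_node = 16
ram_per_node_gb = 64
cores_per_executor = 5
yarn_overhead_gb = 2

executors_per_node = cores_per_node // cores_per_executor                      # 3
executor_memory_gb = ram_per_node_gb // executors_per_node - yarn_overhead_gb  # 19

print(f"--executor-cores {cores_per_executor} "
      f"--num-executors {executors_per_node} (per node) "
      f"--executor-memory {executor_memory_gb}g")
```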

Spark Yarn Cluster

Spark Submit Cluster (YARN) -> (Hadoop + Spark)
SparkLens integration is a profiling tool to check the memory allocation (executor memory, etc.).
Spark uses hash partitioning, which is the same algorithm used by MapReduce. Map (parallelism, which acts as the input), Reduce (aggregation, which acts as the output).
No. of blocks = no. of tasks when the input is HDFS.
No. of output tasks = no. of input tasks.
Spark uses the hash-partition algorithm to decide which particular output partition should be sent to which particular output task:
Hash partition (output partition) = (hash of the key) % (no. of output tasks)

Few commands:
data.getNumPartitions()  # used to get the number of partitions
reducedata.repartition(3).saveAsTextFile("out.txt")

Removing null values:
df.dropna(how="any").show()  # removes the row if any one column is null
df.dropna(how="all").show()  # removes the row if all columns are null
df.dropna(how="any", subset=["salary"]).show()  # removes the row if a null value is found in the "salary" column
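A small sketch of the hash-partition rule above; the keys and the output-task count are made-up, and Spark internally uses its own hash function rather than Python's built-in hash.

```python
# Illustration of: output partition = hash(key) % number of output tasks.
# Keys and task count are made-up; Spark uses its own portable hash internally.
num_output_tasks = 3

for key in ["alice", "bob", "carol", "dave"]:
    partition = hash(key) % num_output_tasks
    print(f"key={key!r} -> output task {partition}")
```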

Spark Standalone Architecture and Working

About the Spark Standalone Architecture

The daemons in Spark (master-slave architecture) are:
- Master node
- Worker node

The driver program is the starting point of Spark; it is started by the master and is responsible for the job.

Let's understand the architecture of Spark with an example:
- The client submits the job in the form of a JAR; each JAR file contains a driver program.
- The Spark master executes the driver code.
- The driver requests resources from the cluster manager.
- The cluster manager is responsible for the allocation of resources, and it allocates them in the form of executors.
- The cluster manager finds data locality (the location of the particular data) before assigning executors.
- The driver program is responsible for assigning the job (the JAR file) to the executors.
- Once an executor has been created, it starts sending heartbeats to the driver program, and the cluster manager also informs the driver program that the resources have been allocated.
- The driver program runs on the master and the executors run on the workers. …
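A minimal sketch of driving a job against a standalone master from PySpark; the master URL (spark://master-host:7077) and the application name are assumptions.

```python
# Sketch: connect a driver to a standalone Spark master and run a tiny job.
# The master URL and app name are assumed values, not from the post.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://master-host:7077")   # standalone cluster manager
    .appName("standalone-demo")
    .getOrCreate()
)

# The driver asks the master for executors on the workers; the executors
# then run the tasks of this small job and report heartbeats back.
print(spark.sparkContext.parallelize(range(10)).sum())
spark.stop()
```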

Spark

About Spark
- Spark is a data processing framework which is a solution for big data.
- Spark is a subset of Hadoop; Spark SQL is used to connect with Hive for retrieving data.
- Spark has two data processing methods:
  Batch processing: dealing with files and databases.
  Stream processing: dealing with live data; it has less security.
- Spark can run on a standalone file system and then distribute the data in memory, which is called in-memory processing.
- Spark can connect to any data storage.
- Spark replaces MR in Hadoop.
- Spark is a data processing framework which does batch processing or stream processing.
- Spark works with Java, Python, Scala and R.

Architecture of Spark

Deployment modes:
- Standalone (installation of only Spark on a single/multi node)
- YARN (Spark + Hadoop)
- Mesos

Daemon processes:
- Master (similar to the master in HDFS)
- Worker (similar to the slave in HDFS)

The memory gets the data distributed into it and then processed, which is called memory computation. Spark REPL (Read-Evaluate-Print Loop) …
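A minimal sketch contrasting the two processing methods in PySpark; the file path, host and port are assumptions made for illustration.

```python
# Batch vs. stream processing sketch; paths, host and port are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

# Batch processing: dealing with files and databases.
batch_df = spark.read.csv("employees.csv", header=True, inferSchema=True)
batch_df.show()

# Stream processing: dealing with live data (a socket source here).
stream_df = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())
query = stream_df.writeStream.format("console").start()
query.awaitTermination()
```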

Extended Topic In Hive

TABLES

Internal table / managed table
- When the table is dropped, it deletes both the schema and the data from HDFS.
- It is managed by Hive; the queries are done through Hive.
- CREATE TABLE TBL_NAME(CLM DTYPE, ...);

External table
- When the table is dropped, the schema (structure) is deleted but not the data in HDFS.
- The metadata is managed by Hive.
- It is used when the files are on a remote machine.
- CREATE EXTERNAL TABLE TBL_NAME(CLM DTYPE, ...);

Hive partition
- It is used to split large tables into several tables based on one or more columns.
- A static partition needs the destination; a dynamic partition doesn't need the destination because Hive takes care of it internally.
- It is used when a small number of columns are present.

Hive bucket
- It is similar to partitioning, but with an extra hash function added to it.
- It is generally used for large datasets.
- If the bucket count is not chosen properly, the performance decreases.
- CREATE TABLE TBL_NAME(CLM DTYPE, ..) CLUSTERED BY (CLM) INTO 10 BUCKETS ROW F…
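A small sketch of a partitioned and a bucketed table created through Spark SQL with Hive support; the table and column names are assumptions, not from the post.

```python
# Partitioned and bucketed Hive tables via Spark SQL; names are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hive-ddl")
         .enableHiveSupport().getOrCreate())

# Partitioned table: rows are stored in one sub-directory per country value.
spark.sql("""
    CREATE TABLE IF NOT EXISTS emp_part (id INT, name STRING, salary DOUBLE)
    PARTITIONED BY (country STRING)
""")

# Bucketed table: rows are hashed on id into a fixed number of buckets.
spark.sql("""
    CREATE TABLE IF NOT EXISTS emp_bucket (id INT, name STRING, salary DOUBLE)
    CLUSTERED BY (id) INTO 10 BUCKETS
""")
```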

Hive

About Hive
- Hive was introduced by Facebook.
- It's a data warehouse (data collected from various sources).
- Its functionality is based on SQL.
- It's basically a query engine; it can also be called a database.
- It's an open-source framework.
- It's a vehicle that runs on an engine (MR).
- It replaces Java, not MapReduce: basically, we were using Java in MR, and to replace that we use Hive.
- Hive has metadata which stores information, but not the data.
- The Hive metadata is stored only in an RDBMS (Oracle, MySQL), but not the data you insert; the data you insert is stored in HDFS.
- In the absence of an RDBMS for the metadata, Hive creates an embedded RDBMS called Derby.
- The combinations in Hive are:
  MySQL + Hive = remote metastore
  Derby + Hive = embedded metastore
- The drawback of the embedded metastore is data concurrency in a clustered system, as multiple nodes may be present.
- By default all Hive tables are stored under /user/hive/warehouse (bin/hadoop fs -ls /user/hive/warehouse).
- Without the load command also we can move …
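A minimal sketch of the metadata-vs-data split described above, using Spark SQL with Hive support; the table name and HDFS input path are assumptions.

```python
# The table definition goes to the metastore (MySQL or embedded Derby),
# while the loaded rows land in HDFS under the warehouse directory.
# Table name and input path are assumed for illustration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hive-metastore-demo")
         .enableHiveSupport().getOrCreate())

spark.sql("CREATE TABLE IF NOT EXISTS emp (id INT, name STRING)")
spark.sql("LOAD DATA INPATH '/data/emp.csv' INTO TABLE emp")

# Default warehouse location (typically /user/hive/warehouse on HDFS).
print(spark.conf.get("spark.sql.warehouse.dir"))
```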

Input Split and Speculative Execution

About Input Split
- A block is a physical unit of a file which is divided into small units.
- The input split is in between the blocks and the mapper.
- A block is physical whereas a split is logical.
- The blocks can split unstructured data also.
- The Task Tracker has the intelligence to identify the balance (missing) data present on a particular node, which is taken care of by the split; when that data is read from the particular node it is called a lookup.
- The property for the maximum split size can be set by conf.set("mapred.max.split.size", "2");
- The number of splits may or may not be equal to the number of blocks, but the number of mappers always equals the number of splits.
- Example: imagine two blocks over which the data is split as follows: 1,a,b  2,c,d  3,e,f  4,g,h  5,i,j. If rows 1-2 and the first part of row 3 are in B1 and the rest is in B2, then mapper 1 looks up the remaining data of the 3rd row from B2.

About Speculative Execution
- If there is a Job Tracker with Task Trackers under it and a particular task gets hanged, then the scheduler will create a duplicate task of the hanged …
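A small sketch of the split/mapper arithmetic implied above (one mapper per split, split count driven by the maximum split size); the file size, block size and split size are made-up numbers.

```python
# Split/mapper arithmetic sketch; all sizes below are made-up assumptions.
import math

file_size_mb = 1000        # input file size
block_size_mb = 128        # HDFS block size
max_split_size_mb = 64     # e.g. mapred.max.split.size

num_blocks = math.ceil(file_size_mb / block_size_mb)
num_splits = math.ceil(file_size_mb / max_split_size_mb)
num_mappers = num_splits   # mappers always equal splits

print(f"blocks={num_blocks}, splits={num_splits}, mappers={num_mappers}")
```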

YARN (Yet Another Resource Negotiation)

About YARN
- It's a cluster manager that supports common execution.
- The Resource Manager (JT) and Node Manager (TT) are responsible for its functionality.
- It was introduced in Hadoop 2.
- Components like Spark can also run individually without YARN.

Functionality of YARN
- The Application Manager is a combination of the Scheduler and the Resource Manager.
- The Application Master acts between the Resource Manager and the Data Node.
- For every JAR the Resource Manager starts a new Application Master.
- There can be more than one Resource Manager (active and passive).
- When the job is submitted to the Application Manager, it sends the information to the Application Master; the data node starts the mapper task and, when it's done, the reducer task starts.
- When the reducer job is done, the Application Master sends the final heartbeat to the Application Manager.
- If more than one JAR is submitted, a new Application Master is started for each.
- If the Application Master fails, a new Application Master is started and the job …

MR(Map Reduce)

Introduction to Map Reduce
- It's a massively parallel processing framework; in general, without distributed data it's hard to do parallel processing.
- MapReduce uses Java by default.
- The aim of MapReduce is to achieve data locality and parallelism.
- Spark is an alternative to MapReduce (the functionality has about 80% similarity).
- Hive, Sqoop, Pig and Oozie are abstractions over MapReduce.
- The daemons in MapReduce are JT and TT.
- In MapReduce we also have two functions, the Mapper and the Reducer, which are responsible for achieving the processing.
- MR has the disadvantage of resource pressure.

Advantages of Map Reduce
- Cluster monitoring
- Cluster management
- Resource allocation
- Scheduling
- Execution
- Speculative execution (L)

Working Principle of Map Reduce
- The task for MR is given in the form of a JAR (Java) file to the Job Tracker.
- The Job Tracker sends a request to the Name Node and gets a response in return.
- Then the Job Tracker sends the task information to the Task Tracker.
- The …
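A minimal word-count sketch of the Mapper/Reducer split described above, written in the Hadoop Streaming style (stdin/stdout); it is an illustration, not the exact code from the post.

```python
# Word count in the classic map/reduce shape; reads lines from stdin.
import sys
from itertools import groupby

def mapper(lines):
    # Map: emit (word, 1) for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce: aggregate the counts for each word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    for word, count in reducer(mapper(sys.stdin)):
        print(f"{word}\t{count}")
```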

Hadoop Installation

If you are working on Windows, follow these steps; if you are on Linux you can skip to the Hadoop installation.
Download VMware. VMware properties to be modified as follows:
RAM: 4 GB
Disk (memory): 30 GB
Cores: 4
ISO file link: https://ubuntu.com/download/desktop/thank-you?version=22.04.2&architecture=amd64
Accept all permissions and start using Linux.
Hadoop installation: Hadoop Installation Steps

Quota(An Admin Feature)

There are 2 types of quotas:
- Space quota
- Normal quota

Space quota
Sets the size of a directory to a fixed number of bytes. The space quota is valid for future files (i.e. existing + setQuota(future)).
Commands for the space quota:
hadoop dfsadmin -setSpaceQuota 100 /folder
hadoop dfsadmin -clrSpaceQuota /folder

Normal quota
Sets the number of files in a directory to a fixed number. The normal quota is valid for future files (i.e. existing + setQuota(future)).
Commands for the normal quota:
hadoop dfsadmin -setQuota 3 /folder
hadoop dfsadmin -clrQuota /folder

Key points
Quotas are used when Hive compaction arises. Compaction deletes the old records and updates them with new records when duplicates are found, which takes a long time.
Command to check a quota:
hadoop fs -count -q -h -v /folder

HDFS(Hadoop Distributed File System)

On which platforms can Hadoop be implemented?
- Windows
- Mac
- Linux

Working and principles of HDFS
- Information about the data is transferred to the Master node (JP1); it's not a data transfer.
- The Client API acts as an intermediary between the Master and the Slave.
- The information transferred to the Master is the operation, file name, size, etc.
- An acknowledgement about the requested operation is sent to the Master from the Slave.
- A standalone file system is needed to create the distributed file system.
- Through the edge node (our PCs) we send data to the Slave nodes; the edge node doesn't contain any HDFS data, as we know that data never comes back from HDFS.
- When a request is sent to a Slave node, the Client API sends the information to the Master, which then sends the write/read operation information to the metadata (placement allocation (rack-awareness allocation), block size, replication).
- A pipeline is used to distribute the data.
- The Client API is not only responsible for the interaction between the Master and the Slave.
- The Master and Slave …

Hadoop

About the Hadoop Architecture
Hadoop consists of:
- HDFS (Hadoop Distributed File System) [parallel processing]
- MR (Map Reduce) [processing]
- Hive (query engine introduced by Facebook) [uses SQL]
- Pig (introduced by Yahoo) [uses Pig Latin]
- Sqoop (introduced by a group of people) [uses Java]
- Oozie (scheduler introduced by Yahoo) [XML, Java]
- Flume (messaging queue) [only incoming requests]
- Mahout (data science, AI, ML component)
- HBase (Hadoop database introduced by Facebook)

Key Points
- All the databases in Big Data are NoSQL (modern).
- Any framework in Big Data is loosely coupled. Loosely coupled: removing one component doesn't affect how the technology (Hadoop) functions. Ex: C Sharp
- Hadoop can be integrated with any other Big Data technology.

Question & Answers
Q. What is a file system?
Ans. It is used to read/write to and from the hard disk. Ex: NTFS (Windows), EXT (Linux), MACFS (Mac). A program in execution is called a process.
Q. What is a block?
Ans. A large file is divided into small units called chunks. Ex: NTFS 16K…

Road Map and Prerequisites of Big Data

Road Map
- Learn Linux
- Learn SQL
- Learn a programming language (Java and Python)
- Big Data concepts (HDFS, MR, Hive, Spark, etc.)
- Gather knowledge on ETL (Extract, Transform, Load) concepts
- Basics of cloud computing
- Work on projects, challenges and optimization
- Prepare a CV/resume and attend interviews

Prerequisites
- Linux environment
- Knowledge of Linux
- Knowledge of SQL
- Knowledge of a programming language (Java and Python)

Unveiling the Power of Big Data: A New Frontier in Information

What is Big Data?
A problem that arises due to the storage or processing of data is called big data. There are more than 10k solutions for big data, but one famous solution is Hadoop. The data has 5 problems: Volume, Value (purity), Visualization, Velocity and Variety. The problem can also be about processing. It can also be a technology for achieving speed.

The layers of data include:
- Automation
- Storage
- Testing
- Visualization
- DS, ML, AI

History of Hadoop
In the year 2002 Google introduced a new file system called GFS (Google File System), and in the year 2004 it introduced a new system for processing data called GMR (Google Map Reduce). Doug Cutting donated an open-source software to Apache called Hadoop (GFS + GMR), which is a combination of both systems introduced by Google. Hadoop consists of HDFS (Hadoop Distributed File System) and MR (Map Reduce).

Big Data technology is used by various organizations in the industry through commercial products such as: Cloudera, Hortonworks, EMR-Amazon …