Hadoop

About Hadoop Architecture

Hadoop consists of:

HDFS (Hadoop Distributed File System) [Distributed Storage]
MR (MapReduce) [Parallel Processing]
Hive (Query Engine introduced by Facebook) [Uses HiveQL, a SQL-like language]
Pig (Introduced by Yahoo) [Uses Pig Latin]
Sqoop (Moves data between relational databases and Hadoop) [Uses Java]
Oozie (Workflow Scheduler introduced by Yahoo) [XML, Java]
Flume (Log and event data collection) [Incoming data only]
Mahout (Data Science, AI, ML component)
HBase (Hadoop Database, modeled on Google's Bigtable)
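
To make the MR (MapReduce) entry above more concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps for a word count. It is plain Python, not the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real job would run the map and reduce steps in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce one after another on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

The point of the split is that each map call only needs its own line and each reduce call only needs the values for its own key, which is what lets a cluster spread the work across nodes.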

Key Points

Most databases in Big Data are NoSQL (modern, non-relational).
Frameworks in Big Data are generally loosely coupled.

Loosely Coupled: Removing one component doesn't stop the technology (Hadoop) from functioning. Ex: Hadoop still runs if Pig or Hive is removed.

Hadoop can be integrated with any other Big Data Technology.

Question & Answers

Q. What is a File System?

Ans. A file system manages how data is written to and read from a storage device such as a hard disk.

Ex: NTFS (Windows), EXT (Linux), HFS+/APFS (macOS)

A program in execution is called a process.

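Q. What is a Block?

Ans. A file is divided into fixed-size units called blocks (or chunks); a block is the smallest unit the file system reads or writes.

Ex: NTFS uses 4 KB blocks (clusters) by default on most volumes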

Q. What are Client and Server?

Ans. The client sends requests and the server responds to them.

Q. Types of File System

Standalone file system: NTFS, EXT, HFS+/APFS
Distributed file system: HDFS, S3 (strictly an object store, but used in the same role)

Q. Types of Distributed System Architecture

Master and Slave (Hadoop, Spark) [One master and N slaves]
Peer to Peer (NoSQL, e.g. Cassandra) [Every node is connected to every other node; there is no master]

Background processes are called daemon processes.

Hadoop 1.x has 5 daemon processes, each of them a Java program: NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker.

A node is an individual system or virtual machine, and a cluster is a group of nodes working together.
API: Application Programming Interface
BLOCK SIZE in Hadoop

Hadoop 1.x: 1 block = 64 MB (default)
Hadoop 2.x and later: 1 block = 128 MB (default)
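
As a rough illustration of what these defaults mean, the sketch below works out how many blocks a file occupies for a given block size. The helper name split_into_blocks is hypothetical and this is only arithmetic, not an HDFS call; it assumes the last block is allowed to be smaller than the block size, as it is in HDFS.

```python
MB = 1024 * 1024
GB = 1024 * MB


def split_into_blocks(file_size, block_size):
    """Return the size in bytes of each block a file of file_size bytes occupies."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds the leftover bytes.
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    for label, block_size in (("64 MB", 64 * MB), ("128 MB", 128 * MB)):
        blocks = split_into_blocks(1 * GB, block_size)
        print(f"1 GB file with {label} blocks: {len(blocks)} blocks")
```

With the same 1 GB file, doubling the block size halves the number of blocks the master has to keep track of.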

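Replication: Keeping duplicate copies (replicas) of the data

In general the default replication factor in Hadoop is 3
If we load 1 GB of data, we need a total of 3 GB of storage because of replication
No two replicas of the same block are stored on the same node
If one node fails, the data is still available from the replicas on other nodes, so it is not lost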
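
The replication rules above can be sketched in a few lines. This is a simplified model, not how HDFS actually chooses nodes: real HDFS uses a rack-aware placement policy, while the round-robin choice here is only an assumption made to keep the example short. All function names and the node names are hypothetical.

```python
def raw_storage(data_size, replication_factor=3):
    """Total storage consumed once every byte is stored replication_factor times."""
    return data_size * replication_factor


def place_replicas(block_id, nodes, replication_factor=3):
    """Pick replication_factor distinct nodes for one block (simple round-robin)."""
    if replication_factor > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def surviving_replicas(placement, failed_node):
    """Nodes that still hold the block after failed_node goes down."""
    return [node for node in placement if node != failed_node]


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]
    placement = place_replicas(0, nodes)
    print("1 GB of data needs", raw_storage(1), "GB of storage")
    print("block 0 is stored on", placement)
    print("after node1 fails, copies remain on", surviving_replicas(placement, "node1"))
```

Because every replica of a block sits on a different node, losing any single node still leaves two copies to read from.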

HDFS + MR --> HADOOP
HADOOP + OTHER COMPONENTS --> HADOOP FRAMEWORK
