HDFS (Hadoop Distributed File System)

 

On which platforms can Hadoop be implemented?

  • Windows
  • Mac
  • Linux

Working and principles of HDFS

  • Only information about the data is transferred to the Master Node (JP1); the data itself is never sent to the Master.
  • The Client API acts as an intermediary between the Master and the Slaves.
  • The information transferred to the Master includes the operation, file name, size, etc.
  • An acknowledgement about the requested operation is sent from the Slave back to the Master.
  • A standalone (local) file system is needed on each machine to build the distributed file system on top of it.
  • Through an Edge node (our PCs) we send data to the Slave nodes; the Edge node itself does not hold any HDFS data, because data never comes back out of HDFS to it.
  • When a request is made, the Client API first sends the information to the Master, which records the read/write operation in its metadata (placement allocation via rack awareness, block size, replication).
  • A pipeline is used to distribute (replicate) the data across the Slave nodes.
  • The Client API is not the only link between Master and Slave: every 3 seconds each Slave sends the Master a signal called a Heartbeat.
  • Blocks (the data) are stored on the Slaves, and the metadata is stored on the Master (see the commands after this list).
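
A minimal way to see this data/metadata split on a running cluster (the path /test/sample.txt below is only an assumed example, not from these notes):

  hdfs fsck /test/sample.txt -files -blocks -locations   -> lists each block of the file and the Data Nodes holding its replicas
  hdfs getconf -confKey dfs.blocksize                     -> prints the configured block size
  hdfs getconf -confKey dfs.replication                   -> prints the configured replication factor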

What if there is a failure in the HDFS system?

  • There are two types of Slave failures:
    • Temporary failure: network or software failure
    • Permanent failure: hardware failure
  • The process of re-creating the data of a failed node on other nodes is called Automatic Failover.
    • This way, loss of (under-replicated) data is prevented.
  • In the case of a permanent failure, once the failed node's data has been re-replicated to another node, the metadata is updated.
  • Temporary failure: if a node dies, its data is re-replicated while it is down, and the node later comes back alive, the Master Machine will not consider that node's old data for future requests or operations.
  • If no node is available to hold the re-replicated data, there is no automatic failover.
  • A read request is similar to a write request; the only difference is that there is no pipeline for reads: the Client API reads directly from the nodes.
  • If a node fails or becomes inactive during a write operation, the pipeline sends an acknowledgement to the Master, the Master updates its metadata, and the pipeline then resumes its operation.
  • If a write/read request fails, the whole chain of nodes/Client API for that request is killed.
  • If the Master fails, the system is lost, as all the metadata is stored in it.
  • Suggestion: when a node fails, whether temporarily or permanently, treat it as a permanent outage.
  • Up to Hadoop-1 there is no High Availability (HA), i.e. no fix for Master failure.
  • Hadoop-2 introduced the fix for Master failure.
  • Hadoop-2 enables the feature of having more than one Master.
  • JP1...JP5 have been renamed as follows:
    • JP1-Name Node
    • JP2-Data Node
    • JP3-Secondary Name Node
    • JP4-Resource Manager(Job Tracker)
    • JP5-Node Manager(Task Tracker)
  • When the HA concept is introduced, we have one active Name Node and n passive Name Nodes (see the commands after this list).
  • Zookeeper is a technology that runs on all the Name Nodes (active and passive); its task is to elect a new Name Node when the active Name Node fails.
  • Zookeeper works on a leader-follower policy: if the leader is dead, a follower becomes the leader.
  • The Secondary Name Node is not for High Availability (HA).
  • If the Active Name Node (ANN) fails and a Passive Name Node (PNN) becomes active, the edit logs and fsimage are updated in the SNN.
  • The SNN is called a checkpoint node, as checkpoint information is stored in it.
  • The shared metadata (edit logs) is kept on a centralised set of machines called Journal Nodes.
  • Hadoop (from Hadoop-2 onwards) is a High Availability cluster.
  • Types of Nodes:
    • H1
      • Master Node(NN+JT)
      • Slave Node(DN+TT)
      • Checkpoint Node(SNN)
    • H2
      • Master Node(NN+RM)
      • Slave Node(DN+NM)
      • Checkpoint Node(SNN)
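
A rough sketch of how the HA state can be inspected and a manual failover triggered (the Name Node IDs nn1 and nn2 are assumptions; the real IDs depend on how the Name Nodes are configured in hdfs-site.xml):

  hdfs haadmin -getServiceState nn1   -> shows whether Name Node nn1 is currently active or standby
  hdfs haadmin -failover nn1 nn2      -> hands the active role over from nn1 to nn2
  hdfs dfsadmin -report               -> reports the live and dead Data Nodes in the cluster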

Types of Cluster

  • Single Node (Pseudo-Distributed): all 5 daemons run on a single machine (see the check after this list).
  • Distributed Cluster: the 5 daemons run on separate nodes.
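
On a single-node (pseudo-distributed) Hadoop-2 setup, a quick way to confirm that all the daemons are running is the JDK's jps tool (the daemon names below are the standard Hadoop-2 names):

  jps   -> should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager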

Keypoints

  • The DN and TT (NM in Hadoop-2) can't be split onto different machines; they run together on each Slave.
  • Overloading the NN is not recommended.
  • SSH is required to start the 5 daemons (see the sketch after this list).
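
A minimal sketch of the passwordless SSH setup that the start scripts rely on (assuming a single-node Linux machine; the key type and file paths are common defaults, not something stated in these notes):

  ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa          -> generates an RSA key pair with an empty passphrase
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   -> authorizes the new key for logins to this machine
  ssh localhost                                     -> should now log in without prompting for a password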

Few Hadoop Commands

  • sbin/start-all.sh (or) start-all.sh   -> used to start Hadoop.
  • hadoop fs -ls /   -> used to list files in HDFS
  • hadoop fs -mkdir /test   -> used to create a directory
  • hadoop fs -put /local/path /hdfs/path   -> used to copy a file from the local file system into HDFS
  • sbin/stop-all.sh (or) stop-all.sh   -> used to stop Hadoop
  • hadoop fs -cat /test.txt   -> used to print the contents of a file in HDFS (a combined session follows this list)
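
A short end-to-end session tying these commands together (the local path /home/user/sample.txt is only an assumed example):

  start-all.sh                                 -> start the Hadoop daemons
  hadoop fs -mkdir /test                       -> create a directory in HDFS
  hadoop fs -put /home/user/sample.txt /test   -> copy a local file into /test
  hadoop fs -ls /test                          -> list the directory to confirm the copy
  hadoop fs -cat /test/sample.txt              -> print the file's contents from HDFS
  stop-all.sh                                  -> stop the Hadoop daemons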
