Extended Topics in Hive

 

TABLES

  • Internal Table/Managed Table
    • When the table is dropped, both the schema and the data are deleted from HDFS.
    • Both the metadata and the data are managed by Hive.
    • Queries are run through Hive.
    • Syntax: CREATE TABLE TBL_NAME(CLM DTYPE, ...);
  • External Table
    • When the table is dropped, the schema (structure) is deleted but the data stays in HDFS.
    • Only the metadata is managed by Hive.
    • It is used when the data files live outside Hive's warehouse directory, for example on a remote machine.
    • Syntax: CREATE EXTERNAL TABLE TBL_NAME(CLM DTYPE, ...);
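
To make the difference concrete, here is a minimal sketch of both table types. The table names, columns, and the LOCATION path are hypothetical and only for illustration; adjust them to your own data.

```sql
-- Managed (internal) table: Hive owns the data files as well as the metadata.
CREATE TABLE employees_managed (
  emp_id   INT,
  emp_name STRING,
  dept     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- External table: Hive tracks only the metadata; the files stay at LOCATION.
-- '/data/external/employees' is an assumed HDFS path.
CREATE EXTERNAL TABLE employees_external (
  emp_id   INT,
  emp_name STRING,
  dept     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/external/employees';

-- DROP TABLE employees_managed;   -- removes the metadata and the data files
-- DROP TABLE employees_external;  -- removes the metadata only; files remain
```

The two DROP statements are left commented out so the sketch can be pasted without deleting anything.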

Hive Partition

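  • It is used to split a large table into several parts based on the values of one or more columns. Each partition is stored in its own HDFS directory.
  • Static partitioning needs a destination: the target partition is named explicitly in the load or insert statement.
  • Dynamic partitioning doesn't need a destination, because Hive works out the partition of each row from the column values.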
  • It is used when the partition column has a small number of distinct values.
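
The static and dynamic cases above could look roughly like this. The tables sales_part and sales_staging and their columns are hypothetical; sales_staging is assumed to be an existing unpartitioned table with order_id, amount, and country columns.

```sql
-- Partitioned table: the partition column is declared separately
-- from the regular columns.
CREATE TABLE sales_part (
  order_id INT,
  amount   DOUBLE
)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Static partition: the destination partition is named in the statement.
INSERT INTO TABLE sales_part PARTITION (country = 'IN')
SELECT order_id, amount
FROM sales_staging
WHERE country = 'IN';

-- Dynamic partition: Hive takes the partition value from the last
-- column of the SELECT, so no destination is given.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE sales_part PARTITION (country)
SELECT order_id, amount, country
FROM sales_staging;
```

In the dynamic case the partition column has to come last in the SELECT list, since Hive matches it by position rather than by name.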

Hive Bucket

  • It is similar to partitioning, but a hash function is applied to the bucketing column to decide which bucket each row goes into.
  • It is generally used for large datasets.
  • If the bucket count is chosen badly, performance will decrease.
  • CREATE TABLE TBL_NAME(CLM DTYPE,..) CLUSTERED BY (CLM) INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
  • Rule of thumb: 2^n >= (size of file / block size), where the block size is 128 MB and n = number of buckets.
  • In general, the bucket count should change when the data volume changes. This is done rarely because it is time-consuming.
  • If the bucket count is not adjusted, performance may suffer.
  • As we cannot change the bucket count of existing data after creating the table, we create a new table with the new bucket count and overwrite the data into it.
  • If further sub-partitioning is not possible, go for bucketing.
  • It is used when the column has a large number of distinct values.
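
Changing the bucket count by rewriting the data into a new table can be sketched as follows. The table names and the new count of 32 are hypothetical; users_bucketed_10 is assumed to be an existing table with user_id and user_name columns, bucketed into 10 buckets.

```sql
-- On Hive 1.x and earlier, uncomment the next line so that inserts honor
-- the bucket definition. Hive 2.x and later always enforce bucketing.
-- SET hive.enforce.bucketing = true;

-- New table with the new bucket count.
CREATE TABLE users_bucketed_32 (
  user_id   INT,
  user_name STRING
)
CLUSTERED BY (user_id) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Rewrite every row; each row lands in the bucket selected by
-- hashing user_id modulo 32.
INSERT OVERWRITE TABLE users_bucketed_32
SELECT user_id, user_name
FROM users_bucketed_10;
```

Because every row is read and written again, this is the time-consuming step the notes above refer to.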

Hive ORC (Optimized Row Columnar) File Format

  • File size can be reduced by up to 75% in this format.
  • Query performance increases.
  • Data is stored in a columnar layout.
  • Hive ACID tables must be internal tables stored as ORC.
  • By comparison, the Parquet format is quoted at about 62% size reduction.
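
A minimal sketch of converting text data to ORC. The table names and columns are hypothetical, events_text is assumed to be an existing text-format table, and the ZLIB setting is just one of the available compression options (NONE, ZLIB, SNAPPY).

```sql
-- ORC table; 'orc.compress' selects the compression codec.
CREATE TABLE events_orc (
  event_id   INT,
  event_type STRING,
  payload    STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

-- Copy rows from the text table; Hive writes them out in ORC format.
INSERT INTO TABLE events_orc
SELECT event_id, event_type, payload
FROM events_text;
```

The actual size reduction depends on the data, so it is worth comparing the HDFS size of both tables after the copy.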

Hive ACID(Atomicity Consistency Isolation Durability)

  • Introduced in Hive 0.14.
  • An ACID table supports insert, update and delete.
  • It should be an internal table.
  • It should use only the ORC file format.
  • It must be a bucketed table.
  • The table property transactional must be set to true.
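
Putting the requirements above together, an ACID table might be set up like this. The table name, columns, and bucket count are hypothetical, and the sketch assumes the Hive server is already configured for transactions (including the compactor), which is done on the server side and not shown here.

```sql
-- Session settings needed for transactional tables.
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Internal, bucketed, ORC, and transactional = true.
CREATE TABLE accounts_acid (
  account_id INT,
  balance    DOUBLE
)
CLUSTERED BY (account_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level operations enabled by ACID.
INSERT INTO TABLE accounts_acid VALUES (1, 100.0), (2, 250.0);
UPDATE accounts_acid SET balance = balance + 50.0 WHERE account_id = 1;
DELETE FROM accounts_acid WHERE account_id = 2;
```

The UPDATE changes balance and not account_id, because the bucketing column of a transactional table cannot be updated.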
