Extended Topics in Hive
TABLES
- Internal Table / Managed Table
- When the table is dropped, Hive deletes both the schema (from the metastore) and the data (from HDFS).
- Both the data and the metadata are managed by Hive.
- The data is expected to be queried and modified through Hive.
- CREATE TABLE TBL_NAME(CLM DTYPE,....);
- External Table
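- When the table is dropped, only the schema (structure) is deleted; the data in HDFS is left in place.
- Only the metadata is managed by Hive.
- It is used when the files already live outside the Hive warehouse, for example at a shared HDFS path or on a remote machine.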
- CREATE EXTERNAL TABLE TBL_NAME(CLM DTYPE,..);
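The difference in drop behaviour can be sketched in HiveQL. This is a minimal illustration, not a definitive setup: the table names `employees_managed` and `employees_ext`, the columns, and the path `/data/employees` are all hypothetical.

```sql
-- Managed table: Hive owns the data files under its warehouse directory.
CREATE TABLE employees_managed (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: Hive only records metadata; the files stay at LOCATION.
-- '/data/employees' is an assumed path used for illustration.
CREATE EXTERNAL TABLE employees_ext (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/employees';

-- Removes the metadata and the data files.
DROP TABLE employees_managed;

-- Removes the metadata only; the files under /data/employees remain.
DROP TABLE employees_ext;
```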
Hive Partition
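- It splits a large table into separate directories (partitions) based on the values of one or more columns, so queries that filter on those columns read only the matching partitions.
- Static partitioning needs the destination partition to be named explicitly in the load or insert statement.
- Dynamic partitioning doesn't need a destination, because Hive derives the partition from the column values in the data.
- It suits columns with a small number of distinct values (low cardinality).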
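The static and dynamic cases above might look like the following sketch. The tables `sales_staging` and `sales_part` and their columns are hypothetical names chosen for illustration.

```sql
-- Hypothetical unpartitioned source table.
CREATE TABLE sales_staging (order_id INT, amount DOUBLE, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Partitioned target: the partition column is declared separately
-- from the regular columns.
CREATE TABLE sales_part (order_id INT, amount DOUBLE)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Static partition: the destination partition is named in the statement.
INSERT INTO TABLE sales_part PARTITION (country = 'IN')
SELECT order_id, amount FROM sales_staging WHERE country = 'IN';

-- Dynamic partition: Hive derives the partition from the last column
-- of the SELECT list.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE sales_part PARTITION (country)
SELECT order_id, amount, country FROM sales_staging;
```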
Hive Bucket
- It is similar to partitioning, but rows are assigned to a fixed number of buckets (files) by applying a hash function to the bucketing column.
- It is generally used for large datasets.
- A poorly chosen bucket count degrades performance.
- CREATE TABLE TBL_NAME(CLM DTYPE,..) CLUSTERED BY (CLM) INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
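- Rule of thumb: choose the bucket count as the smallest power of two, 2^n, such that 2^n >= (size of data / block size), where the block size is typically 128 MB.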
- In general the appropriate bucket count changes when the data volume changes. Rebucketing is done rarely because it is time-consuming.
- If the bucket count is not adjusted, performance may suffer.
- As the bucket count of existing data cannot be changed in place after the table is created, we create a new table with the new bucket count and overwrite the data into it.
- If sub-partitioning is not possible, go for bucketing.
- It suits columns with a large number of distinct values (high cardinality).
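As a worked sketch of the sizing rule and of rebucketing into a new table, assume a hypothetical 10 GB (10240 MB) dataset and a 128 MB block size. The table names `orders_b10` and `orders_b128` and their columns are made up for illustration.

```sql
-- 10240 / 128 = 80, and the next power of two at or above 80 is 128.
SELECT CAST(pow(2, ceil(log2(10240 / 128))) AS INT) AS bucket_count;

-- Existing table with the old bucket count (hypothetical).
CREATE TABLE orders_b10 (order_id INT, customer_id INT, amount DOUBLE)
CLUSTERED BY (customer_id) INTO 10 BUCKETS
STORED AS TEXTFILE;

-- New table with the recomputed bucket count.
CREATE TABLE orders_b128 (order_id INT, customer_id INT, amount DOUBLE)
CLUSTERED BY (customer_id) INTO 128 BUCKETS
STORED AS TEXTFILE;

-- On Hive versions before 2.0, also run:
--   SET hive.enforce.bucketing = true;
-- Rewrites every row, so the data is rehashed into 128 buckets.
INSERT OVERWRITE TABLE orders_b128
SELECT order_id, customer_id, amount FROM orders_b10;
```

Because the rewrite touches the full dataset, it is the expensive step that the notes above suggest doing only occasionally.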
Hive ORC (Optimized Row Columnar) File Format
- File size can decrease by up to 75% in this format.
- Query performance increases.
- Data is stored in columnar format.
- Hive ACID tables must be internal (managed) tables stored as ORC.
- Parquet is another columnar format; the size reduction quoted for it is about 62%.
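A minimal sketch of moving text data into ORC follows. The tables `logs_text` and `logs_orc` are hypothetical, and the ZLIB setting is just one of the available compression codecs.

```sql
-- Hypothetical source table stored as delimited text.
CREATE TABLE logs_text (ts STRING, level STRING, message STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Same columns stored as ORC with an explicit compression codec.
CREATE TABLE logs_orc (ts STRING, level STRING, message STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

-- Copying the rows rewrites them in columnar ORC files.
INSERT OVERWRITE TABLE logs_orc
SELECT ts, level, message FROM logs_text;
```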
Hive ACID(Atomicity Consistency Isolation Durability)
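- Row-level INSERT, UPDATE and DELETE were introduced in Hive 0.14.
- An ACID table supports INSERT, UPDATE and DELETE.
- It must be an internal (managed) table.
- It must use the ORC file format.
- It must be a bucketed table (this requirement was dropped in Hive 3.0).
- The table property 'transactional' must be set to 'true'.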
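Putting these requirements together, a transactional table might be declared as in the sketch below. The table `accounts` and its columns are hypothetical, and the two SET statements are the client-side settings usually needed. The server-side compactor configuration is assumed to be in place already.

```sql
-- Session settings commonly required for ACID operations.
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Managed, bucketed, ORC-backed table with the transactional flag set.
CREATE TABLE accounts (id INT, owner STRING, balance DOUBLE)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

INSERT INTO TABLE accounts VALUES (1, 'alice', 100.0), (2, 'bob', 50.0);

-- Update a non-bucketing column; the bucketing column (id) cannot be updated.
UPDATE accounts SET balance = 75.0 WHERE id = 2;

DELETE FROM accounts WHERE id = 1;
```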