Result of clustering a partitioned table in Hive
- Partitioning data is used for distributing load horizontally, helps to organizing data in a very logical fashion.
- Example:Assume user managing a large employee table and often run queries with where clauses that restrict the results to a particular country or department.
- For a quicker query response Hive table can be partitioned BY (country STRING, DEPT STRING). Partitioning tables changes however Hive structures the data storage and Hive can now create subdirectories reflecting the partitioning structure like…/employees/country=ABC/DEPT=XYZ.
- If query limits for employee from country=ABC, it will only scan the contents of 1 directory country=ABC.
- This can dramatically improve query performance, but providing the partitioning scheme reflects common filtering.
- Hive used to design too several partitions could optimize some queries, but be damaging for other important queries.
- Disadvantage several partitions having huge number of Hadoop files and file directories and Name Node should keep all data for the file system in memory.
- Decomposing data sets into more manageable parts using bucketing technique.
- For example, table using date as the top-level partition and employee_id as the second-level partition leads to too many small partitions.
- Employee_id contains the column it will be hashed by a user-defined number into buckets.
- Records with the same employee_id kept within the same bucket.
- Assuming the number of employee_id is much larger than the quantity of buckets, every bucket can have several employee_id.
- While creating table you can specify like CLUSTERED BY (employee_id) INTO XX BUCKETS; wherever XX is the number of buckets.
- Bucketing has many benefits.The number of buckets is fastened so it does not alternate with data.
- If a pair of tables are bucketed by employee_id, Hive will produce a logically correct sampling. Bucketing additionally aids in doing efficient map-side joins etc.