
I understand that when a Hive table is CLUSTERED BY on one column, Hive applies a hash function to that bucketing column and puts each row into one of the buckets. There is one file per bucket, i.e. if there are 32 buckets there are 32 files in HDFS.

What does it mean to have CLUSTERED BY on more than one column? For example, let's say the table has CLUSTERED BY (continent, country) INTO 32 BUCKETS.

How would the hash function be computed if there is more than one column?

How many files would be generated? Is it still 32?
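For reference, here is a minimal sketch of the kind of table I mean (the table, column, and staging-table names are made up):

    -- Hypothetical table bucketed on two columns
    CREATE TABLE user_info_bucketed (
        user_id   INT,
        name      STRING,
        continent STRING,
        country   STRING
    )
    CLUSTERED BY (continent, country) INTO 32 BUCKETS
    STORED AS ORC;

    -- On Hive versions before 2.0, bucketed inserts also need:
    SET hive.enforce.bucketing = true;

    INSERT INTO TABLE user_info_bucketed
    SELECT user_id, name, continent, country
    FROM user_info_staging;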


2 Answers

  1. Yes, the number of files is still 32.
  2. The hash function will operate by treating "continent,country" as a single string and then using that as its input (see the sketch below).
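As a rough sketch (assuming the hypothetical user_info_bucketed table from the question; the warehouse path is also made up), both points can be checked from the Hive CLI. pmod(hash(...)) only approximates Hive's internal bucket computation:

    -- One file per bucket: the table directory should hold 32 bucket files
    dfs -ls /user/hive/warehouse/user_info_bucketed;

    -- The bucket is derived from both columns together; the built-in hash()
    -- accepts several arguments and combines them into one hash code
    SELECT continent, country,
           pmod(hash(continent, country), 32) AS bucket   -- 0-based bucket number
    FROM user_info_bucketed
    LIMIT 10;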

Hope this helps!!

answered 2015-06-17T15:58:39.790

In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There is also a 0x7FFFFFFF in there, but that is not that important.) The hash_function depends on the type of the bucketing column. For an int, it is easy: hash_int(i) == i. For example, if user_id were an int and there were 10 buckets, we would expect all user_ids ending in 0 to be in bucket 1, all user_ids ending in 1 to be in bucket 2, and so on. For other datatypes it is a little trickier. In particular, the hash of a BIGINT is not the same as the BIGINT itself. The hash of a string or of a complex datatype will be some number derived from the value, but nothing humanly recognizable. For example, if user_id were a STRING, the user_ids in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you an even distribution across the buckets.
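A sketch of that rule using Hive's built-in hash() and pmod() functions (the table name is hypothetical, and pmod is only an approximation of the internal (hash & 0x7FFFFFFF) % num_buckets formula):

    -- bucket is roughly hash_function(bucketing_column) mod num_buckets
    SELECT user_id,
           hash(user_id)                           AS hash_value,       -- for an INT this is the value itself
           pmod(hash(user_id), 10)                 AS bucket_as_int,    -- 0-based; follows the last digit
           pmod(hash(CAST(user_id AS STRING)), 10) AS bucket_as_string  -- usually a different, non-obvious bucket
    FROM user_info_bucketed
    LIMIT 10;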

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables

answered 2015-06-16T18:20:58.963