distribution - Greenplum data distribution among segments

Question

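I have a Greenplum database with 10 segments, corresponding to 10 hard disks. My table is partitioned by date as the primary partition, with secondary partitions based on a hash ID. So for one month there will be 30 primary partitions, each containing 100 subpartitions, and data is loaded into the subpartitions based on hashid. The question is how these partitions are distributed among the segments.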

Guess 1:

seg1(equally distributed based on pri partition : 30datepartition/10=3 date partitions)
date1---0-99 sub partition of hashid
date2---0-99 sub partition of hashid
date3---0-99 sub partition of hashid

seg2(equally contains 30/10=3 date partitions)
date4---0-99 partition of hashid
date5---0-99 partition of hashid
date6---0-99 partition of hashid

...
..

seg10
date27---0-99 partition of hashid
date28---0-99 partition of hashid
date29---0-99 partition of hashid

or

Guess 2:

seg1(distributed by 100hashid/10=10 hashid partitions)
date1---0-9 partition of hashid
date2---0-9 partition of hashid
...
date30---0-9 partition of hashid

seg2(equally contains 100hashid/10=10 hashid partitions)
date1---10-19 partition of hashid
date2---10-19 partition of hashid
...
date30---10-19 partition of hashid

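How does this work? Is it guess 1 or guess 2? If both are wrong, please tell me how the data is distributed at the segment level.

Is subpartitioning by hash ID a good design? I process 6 million records per day and have to store a year of data, so I want each search to hit a very small portion of the data. In other words, I will determine the hash-id range from the key being queried and search only in those specific partitions.

Thanks, Ganesh.R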

score 2 · Accepted Answer

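In Greenplum, the distribution key determines how data is spread across all the segments in the cluster. Partitioning then breaks the data inside each segment into smaller chunks, just like partitioning in any other DBMS. So neither guess is quite right: every segment holds a slice of every partition.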
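To make that concrete, here is a rough Python simulation of the layout from the question (10 segments, 30 date partitions, 100 hashid subpartitions each). It is only a sketch: `segment_for` uses MD5 as a stand-in and is not Greenplum's actual hash function, and the row ids and row counts are made up for illustration.

```python
import hashlib
from collections import defaultdict

NUM_SEGMENTS = 10   # segments in the question
NUM_DAYS = 30       # primary (date) partitions for one month
NUM_SUBPARTS = 100  # hashid subpartitions per date partition


def segment_for(dist_key) -> int:
    """Map a distribution key to a segment (MD5 is an assumed stand-in hash)."""
    digest = hashlib.md5(str(dist_key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SEGMENTS


def simulate(rows_per_partition: int = 300):
    """Load synthetic rows and record which partitions each segment ends up holding."""
    partitions_on_segment = defaultdict(set)
    rows_on_segment = defaultdict(int)
    row_id = 0  # hypothetical unique column used as the distribution key
    for day in range(NUM_DAYS):
        for sub in range(NUM_SUBPARTS):
            for _ in range(rows_per_partition):
                seg = segment_for(row_id)                    # distribution: picks the segment
                partitions_on_segment[seg].add((day, sub))   # partitioning: picks the chunk inside it
                rows_on_segment[seg] += 1
                row_id += 1
    return partitions_on_segment, rows_on_segment


partitions_on_segment, rows_on_segment = simulate()
for seg in range(NUM_SEGMENTS):
    print(seg, len(partitions_on_segment[seg]), rows_on_segment[seg])
```

In this model each segment ends up with rows in all 30 x 100 partitions and roughly a tenth of the total rows, which is different from both guess 1 and guess 2.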

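You want to choose a distribution key that divides the data evenly across the cluster, and then use partitioning to subdivide the table. The idea is to set up your tables so that each segment database in the cluster works on a data set of roughly the same size. Overall database response will only be as fast as the slowest segment in the cluster.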
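A small sketch of why the choice of key matters, under the same assumptions as before (MD5 as a stand-in hash, not Greenplum's real one). It takes one day's worth of rows, scaled down from the 6 million in the question, and compares distributing them by the date column against distributing them by a hypothetical unique `row_id`. The date value is invented for the example.

```python
import hashlib
from collections import Counter

NUM_SEGMENTS = 10


def segment_for(key) -> int:
    """Map a distribution key to a segment (assumed stand-in hash)."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SEGMENTS


def busiest_share(keys) -> float:
    """Fraction of rows on the most loaded segment; 1 / NUM_SEGMENTS is the ideal."""
    counts = Counter(segment_for(k) for k in keys)
    return max(counts.values()) / sum(counts.values())


ROWS_ONE_DAY = 60_000  # scaled down from 6 million per day
rows = [("2024-01-01", row_id) for row_id in range(ROWS_ONE_DAY)]  # hypothetical data

share_by_date = busiest_share(date for date, _ in rows)     # key = date column
share_by_id = busiest_share(row_id for _, row_id in rows)   # key = unique id

print(f"distributed by date: busiest segment holds {share_by_date:.0%} of the day")
print(f"distributed by id:   busiest segment holds {share_by_id:.0%} of the day")
```

With the date as the key, every row for that day hashes to the same segment, so a query on one day is served by a single segment while the other nine sit idle. With a high-cardinality key the busiest segment holds close to one tenth of the rows.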

score 0 · Accepted Answer

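I'm not 100% sure, but I think partitioning is done per node. So in your example, each node would have 30 partitions.

If you want to specify the key to shard on, use DISTRIBUTED BY.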

score 0 · Accepted Answer

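When you create the table, the distribution key can be any key, e.g. event_id, and the partitioning is done on a date column, e.g. event_date. The best approach is to make the partition column part of the distribution key, so that the data is distributed properly and skew is avoided.

Thanks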
