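How should you design Fact and Dimension tables for faster joins in the new Azure SQL Data Warehouse?
Would hash-distributing the large fact tables and replicating the smaller dimension tables help speed up joins, or should indexing be the main consideration?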
Azure SQL Data Warehouse initially offered two table types: Round Robin and Hash Distributed (see the SQL DW table documentation at https://azure.microsoft.com/documentation/articles/sql-data-warehouse-develop-table-design/).
For dimension tables you would typically choose round-robin distribution. For fact tables, you would choose a hash-distributed table design.
**Edit:** Replicated tables are now supported as well, which may be a useful option for some dimension tables.
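As a rough illustration of the three distribution options mentioned above, here is a minimal T-SQL sketch. The table names, columns, and the choice of `ProductKey` as the hash column are assumptions made up for this example, not anything taken from the question.

```sql
-- Hypothetical schema for illustration only; all names and columns are assumptions.

-- Large fact table: hash-distributed on a column that is commonly used in joins.
CREATE TABLE dbo.FactSales
(
    SaleId     BIGINT         NOT NULL,
    ProductKey INT            NOT NULL,
    DateKey    INT            NOT NULL,
    Amount     DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX
);

-- Dimension table spread evenly across distributions (round robin).
CREATE TABLE dbo.DimDate
(
    DateKey      INT  NOT NULL,
    CalendarDate DATE NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED INDEX (DateKey)
);

-- Small dimension table with a full copy available to every compute node.
CREATE TABLE dbo.DimProduct
(
    ProductKey  INT           NOT NULL,
    ProductName NVARCHAR(100) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED INDEX (ProductKey)
);
```

Which column to hash on depends on your own join and aggregation patterns, and a column with skewed values can leave some distributions much larger than others.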
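Your basic premise of hash-distributing the large fact tables and replicating the smaller dimension tables works well in MPP environments such as PDW, but since SQL DW does not yet support replicated tables (hopefully some day), you need to use round-robin distribution instead.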
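If you can minimize data movement, you have taken a good step toward improving join performance. However, having the data on the right node is only half the battle; you should also consider your indexing strategy, just as you would in a regular (SMP) SQL Server environment.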
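One way to see whether a join incurs data movement is to prefix the query with `EXPLAIN`, which returns the plan as XML without running the statement. The sketch below assumes two hypothetical tables, `dbo.FactSales` and `dbo.DimProduct`, joined on `ProductKey`; substitute your own tables.

```sql
-- Hypothetical tables and columns; replace with your own.
-- EXPLAIN returns the distributed query plan as XML instead of executing the query.
EXPLAIN
SELECT f.SaleId,
       f.Amount,
       p.ProductName
FROM dbo.FactSales AS f
JOIN dbo.DimProduct AS p
    ON p.ProductKey = f.ProductKey;

-- In the returned XML, look at the operation_type attribute of each
-- dsql_operation element. Values such as SHUFFLE_MOVE or BROADCAST_MOVE
-- indicate that rows are being moved between distributions for the join.
```

If move steps show up for a join you expected to be local, the distribution choice of one of the tables is the first thing to revisit, before looking at indexes.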
Please note that REPLICATE in ADW is in public preview, and I think it is still buggy. I changed several small tables to REPLICATE, but when I join to these replicated tables and look at the EXPLAIN XML plan, I still see data movement steps, which should not be there if the data is replicated on all nodes. To investigate, I ran DBCC PDW_SHOWSPACEUSED on several of the replicated tables: instead of identical row counts across all nodes, the counts differ, and some nodes show a zero row count. I am no expert by any means, but I believe there is work to be done, and I cannot find any forums, discussions, or feedback pages where I can report these issues.
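For anyone who wants to repeat the check described above, the command looks roughly like this. The table name `dbo.DimProduct` is a placeholder, not one of the tables from my environment.

```sql
-- Hypothetical table name; replace with one of your replicated tables.
-- Returns space usage and row counts broken down by node and distribution,
-- so the row counts can be compared across nodes.
DBCC PDW_SHOWSPACEUSED("dbo.DimProduct");
```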