How should you design Fact and Dimension tables to speed up joins in the new Azure SQL Data Warehouse?
Azure SQL Data Warehouse initially offers two table types: Round Robin and Hash Distributed (see the SQL DW table documentation at https://azure.microsoft.com/documentation/articles/sql-data-warehouse-develop-table-design/).
Typically, for dimension tables you would choose Round Robin distribution. For fact tables, you would choose a HASH-distributed table design.
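Your basic premise of hash-distributing large fact tables and replicating smaller dimension tables works well in MPP environments such as PDW, but because SQL DW does not yet support replicated data (hopefully it will one day), you need to use Round Robin distribution for the dimension tables instead.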
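As a rough sketch of the design described above, the DDL might look like the following. The table and column names (dbo.FactSales, dbo.DimCustomer, CustomerKey and so on) are made up for illustration, and the choice of hash column is an assumption; pick a column that your joins actually use and that spreads rows evenly.

```sql
-- Hypothetical large fact table, hash-distributed on the join key.
CREATE TABLE dbo.FactSales
(
    SaleId       BIGINT         NOT NULL,
    CustomerKey  INT            NOT NULL,
    SaleDate     DATE           NOT NULL,
    Amount       DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);

-- Hypothetical small dimension table, spread round-robin
-- across the distributions.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  INT           NOT NULL,
    CustomerName NVARCHAR(100) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN
);
```

With this layout, a join between the two tables still has to move the dimension rows at query time, but the large fact table stays where it is.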
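If you can minimize data movement, you have taken a good step toward improving join performance. However, having the data on the right server is only half the battle; you should also consider an indexing strategy, just as you would in a regular (SMP) SQL Server environment.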
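One possible starting point for the indexing side, assuming a hypothetical dimension table dbo.DimCustomer that is joined on CustomerKey, is a secondary index on the join column. Whether it actually helps depends on your queries and data volumes, so treat it as something to test rather than a recommendation.

```sql
-- Hypothetical: index the dimension's join column so that lookups on
-- each distribution do not have to scan the whole table.
CREATE INDEX IX_DimCustomer_CustomerKey
    ON dbo.DimCustomer (CustomerKey);
```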
Please note that ADW REPLICATE is in public preview, but I think it is still buggy. I changed several small tables to REPLICATE, but when I join to these replicated tables and look at the EXPLAIN XML plan, I still see data movement steps, which should not be there if the data is replicated on all nodes. To investigate, I ran DBCC PDW_SHOWSPACEUSED on several of the replicated tables. Instead of the row count being identical across all nodes, the counts differ, with some nodes showing a zero row count. I am no expert by any means, but I believe there is work to be done. However, I cannot find any forums, discussions, or feedback pages where I can report these issues.
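For anyone who wants to repeat these checks, a minimal sketch follows. It assumes two hypothetical tables, dbo.FactSales and dbo.DimCustomer, joined on CustomerKey; substitute your own names.

```sql
-- Return the distributed query plan as XML without running the query.
-- In the output, look for steps whose operation type mentions a move
-- (for example a shuffle or broadcast move); those are the data
-- movement steps.
EXPLAIN
SELECT c.CustomerName, SUM(f.Amount) AS TotalAmount
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS c
    ON c.CustomerKey = f.CustomerKey
GROUP BY c.CustomerName;

-- Report row counts and space used for the table, one result row
-- per distribution, so the counts can be compared across nodes.
DBCC PDW_SHOWSPACEUSED("dbo.DimCustomer");
```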