我认为列存储的工作方式是,如果您将超过 102,400 行批量加载到列存储的一个分布中,它会自动压缩它。我没有在 Azure SQL DW 中观察到这一点。
我正在执行以下 CTAS 声明:
create table ColumnstoreDemoCTAS
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION=HASH(Column1))
AS
select top 102401 cast(1 as int) as Column1, f.*
from FactInternetSales f
cross join sys.objects o1
cross join sys.objects o2
现在我检查列存储行组的状态:
select t.name
,NI.distribution_id
,CSRowGroups.state_description
,CSRowGroups.total_rows
,CSRowGroups.deleted_rows
FROM sys.tables AS t
JOIN sys.indexes AS i
ON t.object_id = i.object_id
JOIN sys.pdw_index_mappings AS IndexMap
ON i.object_id = IndexMap.object_id
AND i.index_id = IndexMap.index_id
JOIN sys.pdw_nodes_indexes AS NI
ON IndexMap.physical_name = NI.name
AND IndexMap.index_id = NI.index_id
LEFT JOIN sys.pdw_nodes_column_store_row_groups AS CSRowGroups
ON CSRowGroups.object_id = NI.object_id
AND CSRowGroups.pdw_node_id = NI.pdw_node_id
AND CSRowGroups.distribution_id = NI.distribution_id
AND CSRowGroups.index_id = NI.index_id
WHERE t.name = 'ColumnstoreDemoCTAS'
ORDER BY 1,2,3,4 desc;
我最终得到一个具有 102401 行的 OPEN 行组。我是否误解了列存储的这种行为?Azure SQL DW 有什么不同吗?
如果我从 SSIS 执行相同数量的行的批量插入,我会看到相同的行为,所有这些都作为一个缓冲区。
我尝试了 Drew 的插入超过 650 万行的建议,但我仍然得到了所有 OPEN 行存储:
create table ColumnstoreDemoWide
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION=HASH(Column1))
AS
select top 7000000 ROW_NUMBER() OVER (ORDER BY f.ProductKey) as Column1, f.*
from FactInternetSales f
cross join sys.objects o
cross join sys.objects o2
cross join sys.objects o3