In our production system (SQL Server 2008 / R2) there is a table in which generated documents are stored.

Each document has a reference (varchar) and a sequence_nr (int). A document may be generated multiple times, and each iteration is saved in this table with an incremented sequence number. Each record also has a data column (varbinary), a timestamp, and a user tag.

The table is only queried during inserts and, later on, for auditing purposes.

The primary key for the table is clustered over the reference and sequence_nr columns.

As you can probably guess, documents are not generated in key order: since a document can be generated again at a later time, new rows do not arrive in the order of the clustered key.

I realized this after inserts in the table started timing out.

The inserts are performed with a stored procedure. The stored procedure determines the current max sequence_nr for the given reference and inserts the new row with the next sequence_nr.

I am fairly sure a poor choice of clustered index is causing the timeout problems. Records are inserted for already existing references, only with a different sequence_nr, so they may end up anywhere in the table, and most likely not at the end.

On to my question: would it be better to go for a non-clustered index as primary key or would it be better to introduce an identity column, make it a clustered primary key and keep an index for the combination of reference and sequence_nr?
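For concreteness: the first option only changes CLUSTERED to NONCLUSTERED in the primary key constraint shown further down. The second option would look roughly like the sketch below. The [id] column and the unique constraint name are hypothetical names chosen for illustration; nothing like them exists in the current schema.

```sql
-- Sketch of the second option: a surrogate identity key carries the clustered
-- index, so new rows are always appended at the end of the table.
-- [id] and [UQ_tbl_document_reference_sequence_nr] are hypothetical names.
CREATE TABLE [dbo].[tbl_document] (
    [id]            INT IDENTITY (1, 1) NOT NULL,
    [reference]     VARCHAR (50)        NOT NULL,
    [sequence_nr]   INT                 NOT NULL,
    [creation_date] DATETIME2           NOT NULL,
    [creation_user] NVARCHAR (50)       NOT NULL,
    [document_data] VARBINARY (MAX)     NOT NULL,
    CONSTRAINT [PK_tbl_document] PRIMARY KEY CLUSTERED ([id] ASC),
    -- Keeps (reference, sequence_nr) unique and still supports the
    -- MAX(sequence_nr) lookup performed during inserts.
    CONSTRAINT [UQ_tbl_document_reference_sequence_nr]
        UNIQUE NONCLUSTERED ([reference] ASC, [sequence_nr] ASC)
);
```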

Note that for the time being (and for as far as we can foresee) there is no need to query this table intensively, except when a new sequence_nr must be determined.

Edit in answer to questions: Tbh, I'm not sure about the timeout in the production environment. I do know that new documents get added by processes running in parallel.

Table:

CREATE TABLE [dbo].[tbl_document] (
    [reference]     VARCHAR(50)    NOT NULL,
    [sequence_nr]   INT            NOT NULL,
    [creation_date] DATETIME2      NOT NULL,
    [creation_user] NVARCHAR (50)  NOT NULL,
    [document_data] VARBINARY(MAX) NOT NULL
);

Primary Key:

ALTER TABLE [dbo].[tbl_document]
    ADD CONSTRAINT [PK_tbl_document] PRIMARY KEY CLUSTERED ([reference] ASC, [sequence_nr] ASC) 
    WITH (ALLOW_PAGE_LOCKS = ON, ALLOW_ROW_LOCKS = ON, PAD_INDEX = OFF, IGNORE_DUP_KEY = OFF, STATISTICS_NORECOMPUTE = OFF);

Stored procedure:

CREATE PROCEDURE [dbo].[usp_save_document] @reference     NVARCHAR (50),
                                           @sequence_nr   INT OUTPUT,
                                           @creation_date DATETIME2,
                                           @creation_user NVARCHAR(50),
                                           @document_data VARBINARY(max)
AS
  BEGIN
      SET NOCOUNT ON;

      DECLARE @current_sequence_nr INT

      SELECT @current_sequence_nr = max(sequence_nr)
      FROM   [dbo].[tbl_document]
      WHERE  [reference] = @reference

      IF @current_sequence_nr IS NULL
        BEGIN
            SELECT @sequence_nr = 1
        END
      ELSE
        BEGIN
            SELECT @sequence_nr = @current_sequence_nr + 1
        END

      INSERT INTO [dbo].[tbl_document]
                  ([reference],
                   [sequence_nr],
                   [creation_date],
                   [creation_user],
                   [document_data])
      VALUES      (@reference,
                   @sequence_nr,
                   @creation_date,
                   @creation_user,
                   @document_data)
  END 

Hope that helps.

2 Answers

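I would make the PK nonclustered, because:

  • A varchar key makes every leaf entry larger, which adds to the cost of keeping the B-tree balanced.
  • From what you say, you never scan multiple rows of this table at once.
answered 2013-08-22T09:40:11.490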

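A clustered index physically orders the table's records to match the index order, so it is only useful if you want to read a number of consecutive records in that order: the whole records can then be read with a sequential read from disk.

If you only use data that is present in the index, there is no gain in making it clustered, because a nonclustered index is itself kept in key order, separate from the data.

So for your specific case a nonclustered index is the right way to go. Inserts will not need to reorder the data (only the index), and finding the new sequence_nr can be done by looking at the index alone.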
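A minimal sketch of that change against the table from the question, assuming no foreign keys or other objects depend on the existing constraint. The sample reference value is hypothetical. Dropping a clustered primary key rebuilds the table as a heap, so on a large table this is best done in a maintenance window.

```sql
-- Drop the clustered primary key; the table becomes a heap.
ALTER TABLE [dbo].[tbl_document]
    DROP CONSTRAINT [PK_tbl_document];

-- Re-create the same key as a nonclustered index.
ALTER TABLE [dbo].[tbl_document]
    ADD CONSTRAINT [PK_tbl_document]
    PRIMARY KEY NONCLUSTERED ([reference] ASC, [sequence_nr] ASC);

-- The lookup from usp_save_document reads only key columns, so it can be
-- answered from the nonclustered index without touching the data rows.
DECLARE @reference VARCHAR(50) = 'DOC-0001';  -- hypothetical sample value

SELECT MAX([sequence_nr])
FROM   [dbo].[tbl_document]
WHERE  [reference] = @reference;
```

The variable is declared as VARCHAR(50) to match the column type, so the comparison needs no implicit conversion on the column side.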

answered 2013-08-22T09:41:45.890