3

I have a legacy product that I have to maintain. One of its tables looks something like the following example:

DECLARE @t TABLE
(
 id INT,
 DATA NVARCHAR(30)
);

INSERT  INTO @t
        SELECT  1,
                'name: Jim Ey'
        UNION ALL
        SELECT  2,
                'age: 43'
        UNION ALL
        SELECT  3,
                '----------------'
        UNION ALL
        SELECT  4,
                'name: Johnson Dom'
        UNION ALL
        SELECT  5,
                'age: 34'
        UNION ALL
        SELECT  6,
                '----------------'
        UNION ALL
        SELECT  7,
                'name: Jason Thwe'
        UNION ALL
        SELECT  8,
                'age: 22'

SELECT  *
FROM    @t;
/*
You will get the following result
id          DATA
----------- ------------------------------
1           name: Jim Ey
2           age: 43
3           ----------------
4           name: Johnson Dom
5           age: 34
6           ----------------
7           name: Jason Thwe
8           age: 22
*/

Now I want to get the information in the following form:

name           age
-------------- --------
Jim Ey         43
Johnson Dom    34
Jason Thwe     22

What is the easiest way to do this? Thanks.

4

4 Answers

4

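Out of (slightly morbid) curiosity, I tried to come up with a way to transform the exact input data you provided.

The better approach, of course, would be to structure the original data properly. With a legacy system that may not be possible, but you could create an ETL process that brings this information to an intermediate location, so that an ugly query like this does not need to run in real time.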
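
As a rough sketch of that ETL idea, a scheduled job could reload a staging table inside a transaction. The staging table #PersonStaging and its columns are hypothetical names, not part of the original schema, and the join assumes each age row directly follows its name row:

```sql
-- Sample source rows, shaped like the table in the question.
DECLARE @t TABLE ( id INT, DATA NVARCHAR(30) );

INSERT INTO @t ( id, DATA )
VALUES ( 1, N'name: Jim Ey' ),
       ( 2, N'age: 43' ),
       ( 3, N'----------------' ),
       ( 4, N'name: Johnson Dom' ),
       ( 5, N'age: 34' );

-- Hypothetical staging table; a real job would likely use a permanent table.
CREATE TABLE #PersonStaging ( Name NVARCHAR(30) NOT NULL, Age NVARCHAR(30) NULL );

BEGIN TRANSACTION;

    -- Full reload, applied as a single unit of work.
    DELETE FROM #PersonStaging;

    INSERT INTO #PersonStaging ( Name, Age )
    SELECT  SUBSTRING( n.DATA, 7, 30 ),   -- text after 'name: '
            SUBSTRING( a.DATA, 6, 30 )    -- text after 'age: '
    FROM    @t n
            INNER JOIN @t a ON a.id = n.id + 1   -- assumes consecutive ids
    WHERE   n.DATA LIKE N'name: %'
            AND a.DATA LIKE N'age: %';

COMMIT TRANSACTION;

SELECT Name, Age FROM #PersonStaging;

DROP TABLE #PersonStaging;
```

Age is kept as text here on purpose; converting it to INT is a separate validation step that the job could handle once it has decided what to do with malformed rows.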

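Example #1

This example assumes that all IDs are consistent and sequential (otherwise, an additional ROW_NUMBER() column or a new identity column would be needed to guarantee that the modulo operations on the ID are correct).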

SELECT
    Name = REPLACE( Name, 'name: ', '' ),
    Age = REPLACE( Age, 'age: ', '' )
FROM
(
    SELECT
        Name = T2.Data,
        Age = T1.Data,
        RowNumber = ROW_NUMBER() OVER( ORDER BY T1.Id ASC )

    FROM @t T1 
        INNER JOIN @t T2 ON T1.id = T2.id +1 -- offset by one to combine two rows
    WHERE T1.id % 3 != 0 -- skip delimiter records
) Q1
 -- skip every other record (minus delimiters, which have already been stripped)
WHERE RowNumber % 2 != 0

Example #2: No dependency on sequential IDs

This is a more practical example, because the actual ID values do not matter, only the row sequence.

DECLARE @NumberedData TABLE( RowNumber INT, Data VARCHAR( 100 ) );

INSERT @NumberedData( RowNumber, Data )
    SELECT 
        RowNumber = ROW_NUMBER() OVER( ORDER BY id ASC ),
        Data
    FROM @t;

SELECT 
    Name = REPLACE( N2.Data, 'name: ', '' ),
    Age = REPLACE( N1.Data, 'age: ', '' ) 
FROM @NumberedData N1 
    INNER JOIN @NumberedData N2 ON N1.RowNumber = N2.RowNumber + 1
WHERE ( N1.RowNumber % 3 ) = 2;

DELETE @NumberedData;

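Example #3: Cursor

Again, it would be best to avoid running a query like this in real time and to use a scheduled, transactional ETL process instead. In my experience, semi-structured data like this is prone to anomalies.

While examples #1 and #2 (and the solutions provided by others) demonstrate clever ways of working with the data, a more practical way to transform it is a cursor. Why? It may actually perform better (no nested queries, recursion, pivoting, or row numbering), and even if it is slower, it offers much better opportunities for error handling.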

-- this could be a table variable, temp table, or staging table
DECLARE @Results TABLE ( Name VARCHAR( 100 ), Age INT );

DECLARE @Index INT = 0, @Data VARCHAR( 100 ), @Name VARCHAR( 100 ), @Age INT;

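-- ORDER BY id: without it the fetch order is not guaranteed, and the loop
-- below depends on the name / age / delimiter sequence
DECLARE Person_Cursor CURSOR FOR SELECT Data FROM @t ORDER BY id;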
OPEN Person_Cursor;
FETCH NEXT FROM Person_Cursor INTO @Data;

WHILE( 1 = 1 )BEGIN -- busy loop so we can handle the iteration following completion
    IF( @Index = 2 ) BEGIN
        INSERT @Results( Name, Age ) VALUES( @Name, @Age );
        SET @Index = 0;
    END
    ELSE BEGIN
            -- optional: examine @Data for integrity

        IF( @Index = 0 ) SET @Name = REPLACE( @Data, 'name: ', '' );
        IF( @Index = 1 ) SET @Age = CAST( REPLACE( @Data, 'age: ', '' ) AS INT );
        SET @Index = @Index + 1;
    END

    -- optional: examine @Index to see that there are no superfluous trailing 
    -- rows or rows omitted at the end.

    IF( @@FETCH_STATUS != 0 ) BREAK;
    FETCH NEXT FROM Person_Cursor INTO @Data;
END

CLOSE Person_Cursor;
DEALLOCATE Person_Cursor;
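
The "examine @Data for integrity" step could also be run as a set-based pre-check before the cursor starts. The following is only a sketch under assumptions: the malformed sample rows are invented for illustration, and TRY_CONVERT requires SQL Server 2012 or later.

```sql
-- Hypothetical sample: row 2 has a non-numeric age, row 4 has a misspelled tag.
DECLARE @t TABLE ( id INT, DATA NVARCHAR(30) );

INSERT INTO @t ( id, DATA )
VALUES ( 1, N'name: Jim Ey' ),
       ( 2, N'age: unknown' ),
       ( 3, N'----------------' ),
       ( 4, N'nmae: Johnson Dom' ),
       ( 5, N'age: 34' );

-- List every row that is neither a name row, a delimiter row,
-- nor an age row whose value converts to INT.
SELECT  id, DATA
FROM    @t
WHERE   NOT ( DATA LIKE N'name: %'
              OR DATA = N'----------------'
              OR ( DATA LIKE N'age: %'
                   AND TRY_CONVERT( INT, SUBSTRING( DATA, 6, 30 ) ) IS NOT NULL ) )
ORDER BY id;
```

With this sample the query returns the rows with ids 2 and 4, which could then be logged or corrected before the transformation runs.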

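Performance

I created 100K rows of sample source data, and the three examples above seem roughly equivalent for transforming the data.

I then created a million rows of source data, and a query similar to the following gives excellent performance when selecting a subset of the rows (such as would be used in a grid on a web page or in a report).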

-- INT IDENTITY( 1, 1 ) numbers the rows for us
DECLARE @NumberedData TABLE( RowNumber INT IDENTITY( 1, 1 ), Data VARCHAR( 100 ) );

-- subset selection; ordering/filtering can be done here but it will need to preserve
-- the original 3 rows-per-result structure and it will impact performance
INSERT @NumberedData( Data )
    SELECT TOP 1000 Data FROM @t;

SELECT
    N1.RowNumber,
    Name = REPLACE( N2.Data, 'name: ', '' ),
    Age = REPLACE( N1.Data, 'age: ', '' ) 
FROM @NumberedData N1 
    INNER JOIN @NumberedData N2 ON N1.RowNumber = N2.RowNumber + 1
WHERE ( N1.RowNumber % 3 ) = 2;

DELETE @NumberedData;

I see execution times of 4-10 ms against a set of one million records (i7-3960x).

answered 2012-04-22T09:00:47.707
1

Given that table, you can do this:

;WITH DATA
AS
(
    SELECT
        SUBSTRING(t.DATA,CHARINDEX(':',t.DATA)+2,LEN(t.DATA)) AS value,
        SUBSTRING(t.DATA,0,CHARINDEX(':',t.DATA)) AS ValueType,
        ID,
        ROW_NUMBER() OVER(ORDER BY ID) AS RowNbr
    FROM
        @t AS t
    WHERE
        NOT t.DATA='----------------'
)
, RecursiveCTE
AS
(
    SELECT
        Data.RowNbr,
        Data.value,
        Data.ValueType,
        NEWID() AS ID
    FROM
        Data
    WHERE
        Data.RowNbr=1
    UNION ALL
    SELECT
        Data.RowNbr,
        Data.value,
        Data.ValueType,
        CASE 
            WHEN Data.ValueType='age'
            THEN RecursiveCTE.ID
            ELSE NEWID()
        END AS ID
    FROM
        Data
        JOIN RecursiveCTE
            ON RecursiveCTE.RowNbr+1=Data.RowNbr
)
SELECT
    pvt.name,
    pvt.age
FROM
    (
        SELECT
            ID,
            value,
            ValueType
        FROM
            RecursiveCTE
    ) AS SourceTable
    PIVOT
    (
        MAX(Value)
        FOR ValueType IN ([name],[age])
    ) AS pvt
-- the CTE recurses once per row, so lift the default limit of 100 levels
OPTION (MAXRECURSION 0)

Output

Name          Age
------------------
Jim Ey        43
Jason Thwe    22
Johnson Dom   34
answered 2012-04-22T09:13:21.007
1

A solution without self-joins or recursion, using a single pass over the rows from @t:

SELECT  *
FROM
(
        SELECT  
                CASE 
                    WHEN a.DATA LIKE 'name:%' THEN 'Name'
                    ELSE 'Age'
                END AS Attribute,
                CASE 
                    WHEN a.DATA LIKE 'name:%' THEN SUBSTRING(a.DATA, 7, 4000) --or LTRIM(SUBSTRING(...,6,...))
                    ELSE SUBSTRING(a.DATA, 6, 4000) --or LTRIM(SUBSTRING(...,5,...))
                END AS Value,
                (ROW_NUMBER() OVER(ORDER BY id) + 1) / 2 AS PseudoDenseRank
        FROM    @t a
        WHERE   a.DATA LIKE 'name:%' OR a.DATA LIKE 'age:%'
) b
PIVOT( MAX(b.Value) FOR b.Attribute IN ([Name], [Age]) ) pvt

Result:

PseudoDenseRank Name        Age
--------------- ----------- ---
1               Jim Ey      43
2               Johnson Dom 34
3               Jason Thwe  22

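Note 1: The derived table b filters the rows of @t using name:% and age:%, and groups each name row with its age row using (ROW_NUMBER() OVER(ORDER BY id) + 1) / 2. The result of derived table b: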

Attribute Value       ROW_NUMBER() OVER(ORDER BY id) PseudoDenseRank
--------- ----------- ------------------------------ ---------------
Name      Jim Ey      1                              1
Age       43          2                              1
Name      Johnson Dom 3                              2
Age       34          4                              2
Name      Jason Thwe  5                              3
Age       22          6                              3

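Note 2: If the id values of the name and age rows have no gaps (a gap would be, for example, (id 1, name: Jim Ey), (id 3, age: 43)), then you can use (a.id + 1) / 2 AS PseudoDenseRank instead of (ROW_NUMBER() OVER(ORDER BY id) + 1) / 2 AS PseudoDenseRank. In the sample data the delimiter rows (ids 3 and 6) leave such gaps between the filtered rows, so the shortcut applies only when the name and age rows have consecutive ids.

Note 3: If you use the (a.id + 1) / 2 AS PseudoDenseRank solution (to group the name and age rows), the first id value should be odd. If the first id value is even, you should use this expression instead: a.id / 2 AS PseudoDenseRank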
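
To make the arithmetic in Notes 2 and 3 concrete, here is a small sketch. It assumes hypothetical gap-free data (no delimiter rows, first id odd), which differs from the sample table in the question:

```sql
-- Hypothetical gap-free data: name and age rows only, ids 1..4.
DECLARE @t TABLE ( id INT, DATA NVARCHAR(30) );

INSERT INTO @t ( id, DATA )
VALUES ( 1, N'name: Jim Ey' ),
       ( 2, N'age: 43' ),
       ( 3, N'name: Johnson Dom' ),
       ( 4, N'age: 34' );

-- Integer division: (1+1)/2 = 1, (2+1)/2 = 1, (3+1)/2 = 2, (4+1)/2 = 2,
-- so each name row shares a group number with the age row that follows it.
SELECT  a.id,
        a.DATA,
        ( a.id + 1 ) / 2 AS PseudoDenseRank
FROM    @t a
ORDER BY a.id;
```

If the first id were even (for example ids 2 to 5), a.id / 2 would produce the same pairing, which is the case Note 3 describes.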

answered 2012-04-22T09:43:46.630
0

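If you upgrade to SQL Server 2012, here is another option; that version supports ORDER BY in the OVER clause for aggregate functions. This approach lets you pick out only the tags you know you want and find them, regardless of how many rows lie between names.

It will also work if the name and age do not always appear in the same order within the group of rows that represents a single person.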

with Ready2Pivot(tag,val,part) as (
  select
    CASE WHEN DATA like '_%:%' THEN SUBSTRING(DATA,1,CHARINDEX(':',DATA)-1) END as tag,
    CASE WHEN DATA like '_%:%' THEN SUBSTRING(DATA,CHARINDEX(':',DATA)+1,8000) END as val,
    max(id * CASE WHEN DATA LIKE 'name:%' THEN 1 ELSE 0 END)
    over (
      order by id
    )
  from @t
  where DATA like '_%:%'
)
  select [name], [age]
  from Ready2Pivot
  pivot (
    max(val)
    for tag in ([name], [age])
  ) as p

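If your legacy data has an entry with an extra item (such as "altName: Jimmy"), this query will ignore it. If your legacy data has no row (and no id number) for someone's age, it will give you NULL in that spot. It associates every piece of information with the nearest preceding row that has "name: ..." as its DATA, so it is important that every group of rows has a "name: ..." row.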
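
The two cases described above can be tried with a small sketch. The sample rows are invented for illustration, the LTRIM call is an addition to the query above that strips the space after the colon, and the running MAX ... OVER (ORDER BY id) requires SQL Server 2012 or later:

```sql
-- Hypothetical sample: Jim has an extra altName row, Johnson has no age row.
DECLARE @t TABLE ( id INT, DATA NVARCHAR(30) );

INSERT INTO @t ( id, DATA )
VALUES ( 1, N'name: Jim Ey' ),
       ( 2, N'altName: Jimmy' ),
       ( 3, N'age: 43' ),
       ( 4, N'----------------' ),
       ( 5, N'name: Johnson Dom' );

WITH Ready2Pivot ( tag, val, part ) AS (
  SELECT
    -- The CASE guards keep SUBSTRING from seeing rows without a colon.
    CASE WHEN DATA LIKE '_%:%' THEN SUBSTRING( DATA, 1, CHARINDEX( ':', DATA ) - 1 ) END,
    CASE WHEN DATA LIKE '_%:%' THEN LTRIM( SUBSTRING( DATA, CHARINDEX( ':', DATA ) + 1, 8000 ) ) END,
    -- Running maximum: the id of the nearest preceding 'name:' row.
    MAX( id * CASE WHEN DATA LIKE 'name:%' THEN 1 ELSE 0 END ) OVER ( ORDER BY id )
  FROM @t
  WHERE DATA LIKE '_%:%'
)
SELECT [name], [age]
FROM Ready2Pivot
PIVOT ( MAX( val ) FOR tag IN ( [name], [age] ) ) AS p;
```

With this sample the altName row should not appear in the output, and the row for Johnson Dom should carry NULL in the age column.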

answered 2012-04-23T01:32:56.237