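I need to develop an SSIS package that imports a flat file (with only 1 column) and compares each of its rows against 2 columns (a start column and an end column) of an existing table.

Flat file data: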

110000
111112
111113
112222
112525
113222
113434
113453
114343
114545

Each row of the flat file should be compared against a table with this structure/data:

id  start   end
8   110000  119099
8   119200  119999
3   200000  209999
3   200000  209999
2   300000  300049
2   770000  779999
2   870000  879999

Implementing this in a stored procedure would be very simple, but I can't get my head around how to do it in an SSIS package.

Any ideas? Any help is much appreciated.

1 Answer

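At the core, you will need to use a Lookup component. Write a query, SELECT T.id, T.start, T.[end] FROM dbo.MyTable AS T, and use that as your source. Map the input column to the start column and check id so that it is added to the data flow.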

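If you hit run, it will perform an exact lookup and only match values that equal a start value exactly, such as 110000 or 119200 (of the sample file, only 110000). To convert it to a range query, you will need to go into the Advanced tab. There should be 3 things you can check: amount of memory, rows, and customize the query. These options are typically only enabled when the Lookup's cache mode is set to partial cache or no cache. When you check the last one, you should get a query like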

SELECT * FROM 
(SELECT T.id, T.start, T.[end] FROM dbo.MyTable AS T) AS ref 
WHERE ref.start = ?

You will need to modify that to become

SELECT * FROM 
(SELECT T.id, T.start, T.[end] FROM dbo.MyTable AS T) AS ref 
WHERE ? BETWEEN ref.start AND ref.[end]
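
To make the intended semantics concrete, here is a minimal Python sketch (not SSIS code) of what the modified lookup should return for each flat-file row, using the sample data from the question. The name `range_lookup` is made up for illustration, and returning the first matching id is an assumption about how ties between overlapping or duplicate ranges would be resolved.

```python
# Reference rows from the question: (id, start, end)
RANGES = [
    (8, 110000, 119099),
    (8, 119200, 119999),
    (3, 200000, 209999),
    (3, 200000, 209999),
    (2, 300000, 300049),
    (2, 770000, 779999),
    (2, 870000, 879999),
]

# Values from the single-column flat file in the question
FLAT_FILE = [110000, 111112, 111113, 112222, 112525,
             113222, 113434, 113453, 114343, 114545]


def range_lookup(value, ranges=RANGES):
    """Return the id of the first range containing value, or None.

    Mirrors: WHERE ? BETWEEN ref.start AND ref.[end] (inclusive on both ends).
    """
    for row_id, start, end in ranges:
        if start <= value <= end:
            return row_id
    return None


# One lookup per flat-file row, as the data flow would do
matches = {value: range_lookup(value) for value in FLAT_FILE}
```

With the sample data, every flat-file value falls inside the first range, so each one maps to id 8. A value in the gap between 119099 and 119200 would come back as no match.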

In my experience, range queries can become rather inefficient. The lookup seems to cache only the exact values it has already seen, so if the source file had 110001, 110002, and 110003 you would see 3 separate queries sent to the database, even though all three fall in the same range. For small data sets that may not be so bad, but it led to some ugly load times for my DW.
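
The effect can be modelled with a toy cache in Python. This is only a sketch under the assumption that the cache is keyed on the exact input value rather than on the matched range; `query_database`, `cached_lookup`, and `queries_sent` are hypothetical stand-ins, not SSIS APIs.

```python
# Two of the reference ranges from the question: (id, start, end)
RANGES = [(8, 110000, 119099), (8, 119200, 119999)]

queries_sent = []  # log of values that caused a (simulated) database round trip
cache = {}         # assumed cache shape: exact input value -> id


def query_database(value):
    """Stand-in for one round trip running WHERE ? BETWEEN start AND end."""
    queries_sent.append(value)
    for row_id, start, end in RANGES:
        if start <= value <= end:
            return row_id
    return None


def cached_lookup(value):
    """Only hit the database for values that have not been seen before."""
    if value not in cache:
        cache[value] = query_database(value)
    return cache[value]


# Three distinct values in the same range, then a repeat of the first one
results = [cached_lookup(v) for v in [110001, 110002, 110003, 110001]]
```

In this model the three distinct values each trigger a round trip and only the repeated 110001 is served from the cache, which matches the pattern described above.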

An alternative to this is to explode the ranges. In my case, I had a source system that only kept date ranges and I needed to know by day what certain counts were. The range lookups were not performing well, so I crafted a query to convert a single row with a range of 2010-01-01 to 2013-07-07 into many rows, each with a single date: 2010-01-01, 2010-01-02, and so on. While this approach led to a longer pre-execute phase (it took a few minutes, as the query had to generate ~30k rows per day for the past 5 years), once the data was cached locally it was a simple seek to find a given transaction by day.
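
As an illustration of what "exploding" a date range means, here is a small Python sketch. It is not the query used in the package described above; the function name `explode_date_range` is hypothetical.

```python
from datetime import date, timedelta


def explode_date_range(start, end):
    """Yield every date from start to end, inclusive on both ends."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


# One source row with a range becomes one row per day
days = list(explode_date_range(date(2010, 1, 1), date(2013, 7, 7)))
```

Each resulting date can then be matched with a plain equality lookup instead of a range predicate.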

Preferably, I'd create a numbers table, fill it to the max of int and be done with it, but you might get by with just using an inline table-valued function to generate numbers (dbo.GenerateNumbers below is not built in; you would need to create it). Your query would then look something like

SELECT
    T.id
,   GN.number 
FROM 
    dbo.MyTable AS T
    INNER JOIN
        -- Make this big enough to satisfy your theoretical ranges
        dbo.GenerateNumbers(1000000) AS GN
        -- [end] is bracketed because END is a reserved word in T-SQL
        ON GN.number BETWEEN T.start AND T.[end];
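
The result of that join can be pictured as a table with one row per number covered by a range. The Python sketch below builds the same thing as a dictionary from the question's sample data; `explode_ranges` is a hypothetical name, and keeping the first id seen for a number is an assumption about how duplicate ranges should be handled.

```python
# Reference rows from the question: (id, start, end)
RANGES = [
    (8, 110000, 119099),
    (8, 119200, 119999),
    (3, 200000, 209999),
    (3, 200000, 209999),
    (2, 300000, 300049),
    (2, 770000, 779999),
    (2, 870000, 879999),
]


def explode_ranges(ranges):
    """Map every number covered by a range to that range's id."""
    exploded = {}
    for row_id, start, end in ranges:
        for number in range(start, end + 1):  # end is inclusive, like BETWEEN
            exploded.setdefault(number, row_id)  # keep the first id seen
    return exploded


lookup = explode_ranges(RANGES)

# After the up-front cost, each flat-file value is a plain equality lookup
found = lookup.get(111112)
missing = lookup.get(119150)
```

The trade-off is visible even in this toy version: seven range rows turn into tens of thousands of entries held in memory, in exchange for exact-match lookups.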

That would get used in a "straight" lookup without the need for any of the advanced features. The lookup is going to get very memory hungry though, so make the query as tight as possible. For example, cast GN.number from a bigint to an int in the source query if you know your values will fit in an int.

answered 2013-07-08T01:47:39.360