I have a task that requires me to pull in a set of related XML files, pick out a subset of records from them, transform some columns, and export the result to a single XML file in a different format.

I'm new to SSIS, and so far I've managed to first import two of the xml files (for simplicity, starting with just two files).

The first file, which we can call "Item", contains some basic metadata, including an ID that identifies related records in the second file, "Milestones". I filter my "valid records" using a Lookup transformation in my data flow, so I now have the valid Item IDs needed to fetch the records I want. I funnel these valid IDs (along with the rest of the columns from Item.xml) through a Sort, then into a Merge Join.

The second file is exposed as two outputs: one containing two columns (ItemID and RowID), the other containing all of the Milestone-related data plus a RowID. I put these through an inner Merge Join on RowID, so that I have the ItemID alongside the milestone data. Then I do a full outer Merge Join of both files on ItemID.

This gives me data sort of like this:

  • ItemID[1] - MilestoneData[2]
  • ItemID[1] - MilestoneData[3]
  • ItemID[1] - MilestoneData[4]
  • ItemID[1] - MilestoneData[5]
  • ItemID[2] - MilestoneData[6]
  • ItemID[2] - MilestoneData[7]
  • ItemID[2] - MilestoneData[8]

I can put this data through derived column transformations to create the columns of data I actually need, but I can't see how to structure this in a relational way/normalize it into a different xml format.

The idea is to output something like:

<item id="1">
      <Milestone id="2">
      </Milestone>
      <Milestone id="3">
      </Milestone>
</item>

Can anyone point me in the right direction?

UPDATE: A bit more detailed picture of what I have, and what I'd like to achieve:

Item.xml:

 <Items>
     <Item ItemID="1">
         <Title>
         Data
         </Title>
     </Item>
     <Item ItemID="2">
          ...
     </Item>
     ...
 </Items>

Milestone.xml:

  <Milestones>
  <Item ItemID="2">
       <MS id="3">
            <MS_DATA>
             Data
            </MS_DATA>
       </MS>
       <MS id="4">
            <MS_DATA>
             Data
            </MS_DATA>
       </MS>
   </Item>
   <Item ItemID="3">
       <MS id="5">
            <MS_DATA>
             Data
            </MS_DATA>
       </MS>
   </Item>
 </Milestones>

The way SSIS presents this when I use an XML Source is not entirely intuitive: the Item rows and the MS rows arrive as two separate outputs. I had to run these through a join to get the Milestones that correspond to specific Items. No problem there. I then run the result through a full outer join with the items, so I get a flattened table with multiple rows containing the same Item data and different MS data. Basically I get what I tried to show in the list above: lots of redundant Item data for each unique MilestoneData.

In the end it has to look similar to:

 <NewItems>
 <myNewItem ItemID="2">
       <SomeDataDirectlyFromItem>
            e.g. Title
       </SomeDataDirectlyFromItem>
       <DataConstructedFromMultipleColumnsInItem>
            <MyMilestones>
                  <MS_DATA_TRANSFORMED MSID="3">
                       data
                  </MS_DATA_TRANSFORMED>
                  <MS_DATA_TRANSFORMED MSID="4">
                       data
                  </MS_DATA_TRANSFORMED>
            </MyMilestones>
       </DataConstructedFromMultipleColumnsInItem>
 </myNewItem>
  <myNewItem ItemID="3">
       <SomeDataDirectlyFromItem>
            e.g. Title
       </SomeDataDirectlyFromItem>
       <DataConstructedFromMultipleColumnsInItem>
            <MyMilestones>
                  <MS_DATA_TRANSFORMED MSID="5">
                       data
                  </MS_DATA_TRANSFORMED>
            </MyMilestones>
       </DataConstructedFromMultipleColumnsInItem>
 </myNewItem>
 <myNewItem ItemID="4">
       <SomeDataDirectlyFromItem>
            e.g. Title
       </SomeDataDirectlyFromItem>
       <DataConstructedFromMultipleColumnsInItem>
            <MyMilestones></MyMilestones>
       </DataConstructedFromMultipleColumnsInItem>
 </myNewItem>
 </NewItems>

2 Answers

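How about importing the XML into relational tables (for example in tempdb) and then rebuilding the XML with FOR XML PATH? FOR XML PATH gives you a high degree of control over how the XML looks. Here is a very simple example: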

CREATE TABLE #items ( itemId INT PRIMARY KEY, title VARCHAR(50) NULL )
CREATE TABLE #milestones ( itemId INT, msId INT, msData VARCHAR(50) NOT NULL, PRIMARY KEY ( itemId, msId ) )
GO

DECLARE @itemsXML XML

SELECT @itemsXML = x.y
FROM OPENROWSET( BULK 'c:\temp\items.xml', SINGLE_CLOB ) x(y)

INSERT INTO #items ( itemId, title )
SELECT 
    i.c.value('@ItemID', 'INT' ),
    i.c.value('(Title/text())[1]', 'VARCHAR(50)' )
FROM @itemsXML.nodes('Items/Item') i(c)
GO


DECLARE @milestoneXML XML

SELECT @milestoneXML = x.y
FROM OPENROWSET( BULK 'c:\temp\milestone.xml', SINGLE_CLOB ) x(y)

-- Shred one row per MS element (not one per Item), so every milestone is kept
INSERT INTO #milestones ( itemId, msId, msData )
SELECT 
    i.c.value('@ItemID', 'INT' ),
    m.c.value('@id', 'INT' ) msId,
    m.c.value('(MS_DATA/text())[1]', 'VARCHAR(50)' ) msData
FROM @milestoneXML.nodes('Milestones/Item') i(c)
    CROSS APPLY i.c.nodes('MS') m(c)
GO

SELECT 
    i.itemId AS "@ItemID"
FROM #items i
    INNER JOIN #milestones ms ON i.itemId = ms.itemId
FOR XML PATH('myNewItem'), ROOT('NewItems'), TYPE


DROP TABLE #items 
DROP TABLE #milestones
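
The final SELECT above only emits the ItemID attribute, once per joined milestone row. To get closer to the nested layout in the question, a correlated subquery with its own FOR XML PATH can build the milestone elements per item. The following is a minimal sketch, not a definitive implementation: it uses table variables and made-up sample rows instead of the temp tables loaded from the files, and the element names are taken from the target format in the question.

```sql
-- Hypothetical sample rows standing in for the data shredded from the XML files
DECLARE @items TABLE ( itemId INT PRIMARY KEY, title VARCHAR(50) NULL )
DECLARE @milestones TABLE ( itemId INT, msId INT, msData VARCHAR(50) NOT NULL, PRIMARY KEY ( itemId, msId ) )

INSERT INTO @items ( itemId, title )
VALUES ( 2, 'Title 2' ), ( 3, 'Title 3' ), ( 4, 'Title 4' )

INSERT INTO @milestones ( itemId, msId, msData )
VALUES ( 2, 3, 'Data' ), ( 2, 4, 'Data' ), ( 3, 5, 'Data' )

SELECT
    i.itemId AS "@ItemID",                  -- attribute on <myNewItem>
    i.title  AS "SomeDataDirectlyFromItem", -- child element taken straight from the item
    (
        -- One <MS_DATA_TRANSFORMED MSID="..."> element per milestone of this item
        SELECT
            ms.msId   AS "@MSID",
            ms.msData AS "text()"
        FROM @milestones ms
        WHERE ms.itemId = i.itemId
        ORDER BY ms.msId
        FOR XML PATH('MS_DATA_TRANSFORMED'), TYPE
    ) AS "DataConstructedFromMultipleColumnsInItem/MyMilestones"
FROM @items i
ORDER BY i.itemId
FOR XML PATH('myNewItem'), ROOT('NewItems'), TYPE
```

Because the outer query selects from the items alone, each item appears exactly once, with its milestones nested inside it. For an item without milestones (item 4 in the sample rows) the subquery returns NULL, and FOR XML PATH omits NULL columns by default, so that item gets no MyMilestones element. If the empty element is required, that case needs extra handling.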
answered 2012-09-24T16:12:30.047

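I would try to handle this with a script component whose component type is transformation. Since you are new to SSIS, I assume you have not used it before. Basically, you:

  • define the input columns your component will expect (i.e. input_xml containing ItemID[1] - MilestoneData[2]; ...)
  • write the logic that cuts the rows apart and glues them back together in C#
  • define the output columns your component will use to deliver the transformed rows

You will face the problem that one row may end up being used twice, e.g.

ItemID[1] - MilestoneData[2]

will result in

<item id="1">
      <Milestone id="2">

I did something very similar with Pentaho Kettle, even without something like a script component where you define your own logic. But I guess SSIS lacks such a task here.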

answered 2012-09-24T09:22:40.390