java - Splitting a huge XML file into several files

Question

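We have been trying to split a huge 7 GB XML file into several files, and so far none of the options we tried looks promising. Let me explain:

The file comes from an external user, so we cannot change it. It has to be split before it can be loaded into the database.

On inspection, the Informatica mapping has about 4400 ports, which means at least 4400 nodes per Item. The file is cut into 11 different files.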



    <?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
    <file>
        <fileHeader>This has some information</fileHeader>
        <fileBody>
            <Item id="1">
                <definition>
                    <id>1</id>
                    <name>Something</name>
                    <description>This is a dummy</description>
                </definition>
                <raw_materials>
                    <material>
                        <name>polycarbonate</name>
                        <description>Something to describe</description>
                        <cost>24.33</cost>
                        <units>LB</units>
                    </material>
                    <material txt="this" />
                    <material txt="had to" />
                    <material txt="be splitted" />
                    <material txt="into 3" />
                    <material txt="different files"/>
                </raw_materials>
                <specs>
                    <rating_usa issuer_id="3">A</rating_usa>
                    <rating_cnd issuer_id="9">10</rating_cnd>
                    <rating_bra issuer_id="5">24.12</rating_bra>
                </specs>
                <budget>
                    <budget_usa>
                        <amount>465</amount>
                        <currency>USD</currency>
                        <usd_vs>1</usd_vs>
                    </budget_usa>
                    <budget_cnd>
                        <amount>30</amount>
                        <currency>CND</currency>
                        <usd_vs>1.24</usd_vs>
                    </budget_cnd>
                    <budget_bra>
                        <amount>20</amount>
                        <currency>BRP</currency>
                        <usd_vs>17.31</usd_vs>
                    </budget_bra>
                </budget>
                <vendor>  
                    <id>1HR24ZA</id>
                    <vendorName>Vendor</vendorName>
                    <deliveryRate>9.5</deliveryRate>
                    <location>
                        <country>Italy</country>
                        <address>Lamborghini Str. 245</address>
                        <phone>1234</phone>                 
                    </location>
                </vendor>
                <taxes>
                    <tax>
                        <country>MEX</country>
                        <federal_pct>16</federal_pct>
                        <currency>MXN</currency>
                        <pct_price>5</pct_price>
                    </tax>
                    <tax txt="this also"/>
                    <tax txt="contains too"/>
                    <tax txt="much nodes"/>
                </taxes>
            </Item>
            <Item id="2">
            </Item>
        </fileBody>
    </file>

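Here each Item has only 6 major tags (definition, raw_materials, specs, budget, vendor, taxes), but the real file has 9.

The original mapping looks like this: Source -> Source Qualifier -> Target (XML)

To try to fix the problem we changed the settings, with no noticeable improvement. After that, each output file was put into its own task within a single workflow, and all the tasks were run in parallel. The end result was the same as with the original.

After that we tried Java. DOM is not an option because it loads the whole file into memory. We then tried SAX and StAX; StAX performed better than SAX, so we went in that direction.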
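For reference, this is roughly what a StAX cursor loop looks like. It is only a minimal sketch that counts `Item` elements; the class name `ItemCounter` and the method `countItems` are made up for illustration, and it assumes a StAX (`javax.xml.stream`) implementation is on the classpath:

```java
import java.io.Reader;
import java.io.StringReader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of our real code.
class ItemCounter {

    // Pulls parse events one at a time and counts <Item> start tags.
    // The document is never materialised as a tree, unlike with DOM.
    static int countItems(Reader in) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "Item".equals(reader.getLocalName())) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<file><fileBody><Item id=\"1\"><definition/></Item>"
                + "<Item id=\"2\"/></fileBody></file>";
        System.out.println(countItems(new StringReader(xml))); // prints 2
    }
}
```

With a `FileInputStream` instead of the `StringReader`, the same loop walks the real file from start to end without building it in memory.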

It is worth mentioning that one of the final files produced by Informatica looks like this:



    <?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
    <file>
        <fileHeader>This has some information</fileHeader>
        <fileBody>
            <Item id="1">
                <raw_materials>
                    <material>
                        <name>polycarbonate</name>
                        <description>Something to describe</description>                    
                    </material>
                    <material txt="this" />
                    <material txt="is hardcore" />
                </raw_materials>
            </Item>
        </fileBody>
    </file>

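As you can see, you have to check whether a specific tag belongs in the file. So every time a new tag comes up you end up checking it against about 200 tags, and you do that for every file the tag might go into: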



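    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import javax.xml.stream.StreamFilter;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamReader;
    import javax.xml.stream.XMLStreamWriter;
    import javax.xml.stream.events.XMLEvent;

    // Raw collections on purpose: the server runs JVM 1.4.2, so no generics.
    public class XMLCopier implements StreamFilter {
        static boolean allowStream = false;
        static boolean tagFinished = false;

        private static boolean isWithinValidTag = false;
        // Parent tag name -> List of the child tag names wanted in this output file.
        // Filling the map is not shown in this snippet.
        private static Map tagMap = new HashMap();
        private static String currentTag = "";

        public static void main(String[] args) throws Exception {
            String filename = "/path/to/xml/xmlInput.xml";
            String fileOutputName = "/path/to/target/finalXML.xml";
            try {
                XMLInputFactory xmlif = XMLInputFactory.newInstance();
                FileInputStream fis = new FileInputStream(filename);
                XMLStreamReader xmlr = xmlif.createFilteredReader(
                        xmlif.createXMLStreamReader(fis), new XMLCopier());

                OutputStream outputFile = new FileOutputStream(fileOutputName);
                XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
                XMLStreamWriter xmlWriter = outputFactory.createXMLStreamWriter(outputFile);

                // Copy every event the filter lets through.
                while (xmlr.hasNext()) {
                    write(xmlr, xmlWriter);
                    xmlr.next();
                }
                write(xmlr, xmlWriter);
                xmlWriter.flush();
                xmlWriter.close();
                xmlr.close();
                outputFile.close();
                fis.close(); // XMLStreamReader.close() does not close the underlying stream
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        // Called by the filtered reader for every event; only events for which
        // this returns true are seen by the loop in main().
        public boolean accept(XMLStreamReader reader) {
            int eventType = reader.getEventType();
            if (eventType == XMLEvent.START_ELEMENT) {
                String currentName = reader.getLocalName();
                // Child of a wanted parent tag: linear search in that parent's list.
                if (isWithinValidTag) {
                    if (((List) tagMap.get(currentTag)).contains(currentName)) {
                        allowStream = true;
                    }
                }

                // A wanted parent tag starts here.
                if (tagMap.containsKey(currentName)) {
                    isWithinValidTag = true;
                    currentTag = currentName;
                    allowStream = true;
                }
            }
            return allowStream;
        }

        // Static so that it can be called from main(); only start tags are shown here.
        private static void write(XMLStreamReader xmlr, XMLStreamWriter writer) throws XMLStreamException {
            switch (xmlr.getEventType()) {
                case XMLEvent.START_ELEMENT:
                    final String localName = xmlr.getLocalName();
                    writer.writeStartElement(localName);
                    break;
            }
        }
    }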

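When we tried to do this in a single class we ended up with hard-to-maintain code that finished only about 5 minutes sooner than the Informatica process. We then split the class up to run in parallel, but that does not look promising either: it finishes only 7 minutes sooner than the Informatica process, probably because you are searching 200 tags for each of 4400 nodes, 11 times.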
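To make the cost described above concrete, here is a rough sketch of the alternative shape we have in mind: read the input once and pick the output with a single hash lookup per major tag, instead of running one filter per output file. Everything in it is an assumption for illustration: the class name `SectionSplitter`, the method `split`, the choice to copy each major tag whole (the real outputs also drop some child tags, which this does not do), the omission of `fileHeader`, and the absence of namespaces. It sticks to raw collections because of the JVM 1.4.2 restriction, and we have not measured it against the real file.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical sketch; names and output layout are assumptions.
class SectionSplitter {

    // outputs: major tag name (String) -> java.io.Writer for that output file.
    static void split(Reader in, Map outputs) throws XMLStreamException {
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        Map writers = new HashMap(); // major tag name -> XMLStreamWriter
        for (Iterator it = outputs.entrySet().iterator(); it.hasNext();) {
            Map.Entry entry = (Map.Entry) it.next();
            XMLStreamWriter w = outFactory.createXMLStreamWriter((Writer) entry.getValue());
            w.writeStartElement("file");
            w.writeStartElement("fileBody");
            writers.put(entry.getKey(), w);
        }

        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        String itemId = null;           // id of the <Item> currently being read
        XMLStreamWriter current = null; // non-null while inside a routed major tag
        int depth = 0;                  // element nesting inside that major tag

        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (current == null) {
                    if ("Item".equals(name)) {
                        itemId = reader.getAttributeValue(null, "id");
                        continue;
                    }
                    // One hash lookup chooses the output; no per-file tag scan.
                    current = (XMLStreamWriter) writers.get(name);
                    if (current == null) {
                        continue; // not a tag we route anywhere
                    }
                    current.writeStartElement("Item"); // wrapper for this Item
                    if (itemId != null) {
                        current.writeAttribute("id", itemId);
                    }
                }
                // Inside a routed tag everything is copied without any lookup.
                current.writeStartElement(name);
                for (int i = 0; i < reader.getAttributeCount(); i++) {
                    current.writeAttribute(reader.getAttributeLocalName(i),
                            reader.getAttributeValue(i));
                }
                depth++;
            } else if (current != null && (event == XMLStreamConstants.CHARACTERS
                    || event == XMLStreamConstants.CDATA
                    || event == XMLStreamConstants.SPACE)) {
                current.writeCharacters(reader.getText());
            } else if (current != null && event == XMLStreamConstants.END_ELEMENT) {
                current.writeEndElement();
                if (--depth == 0) {
                    current.writeEndElement(); // closes the <Item> wrapper
                    current = null;
                }
            }
        }
        reader.close();

        for (Iterator it = writers.values().iterator(); it.hasNext();) {
            XMLStreamWriter w = (XMLStreamWriter) it.next();
            w.writeEndDocument(); // closes <fileBody> and <file>
            w.flush();
            w.close();
        }
    }
}
```

Called with a map such as `"raw_materials"` -> one `FileWriter` and `"taxes"` -> another, this would produce all outputs in one pass over the input. The lookup happens once per major tag rather than once per node per file, which is the part we suspect dominates the current run time.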

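As you can see, this is not about how to do it, but about how to do it fast.

Do you have any ideas on how we could improve the file splitting?

PS: The server has JVM 1.4.2, so we have to stick with that. PS2: Only one Item is shown here; the real file has many.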
