
Suppose I have an XML file named pixelsTest.xml that looks like this...

<twa>
  <trackingPixels>
    <pixelNew pagekey="somepagekey">
        <html>
            <!--
            Google Code for Lead Tracking
            -->
            <script type="text/javascript">
                /*
                <![CDATA[
                */ var google_conversion_id = 10; var google_conversion_language = "en_US"; /*
                ]]>
                */
            </script>
            <script type="text/javascript" src="//www.SomeWebsite.com"></script>
            <noscript>
                <div style="display:inline;">
                    <img height="1" width="1" style="border-style:none;" alt="" src="//www.SomeOtherWebsite.com"/>
                </div>
            </noscript>
        </html>
        <publisher>Google adwords</publisher>
        <dateAddedToRegistry>2013-05-08</dateAddedToRegistry>
    </pixelNew>

    <pixelNew pagekey="someotherpagekey">
        <html>
            <script type="text/javascript">
                var axel = Math.random() + "";
                var a = axel * 10000000000000;
                document.write('<iframe src="www.somewebsite.com/ad" width="1" height="1" frameborder="0" style="display:none"></iframe>');
            </script>
            <noscript>
                <iframe src="https:www.somewebsite.com/ads" width="1" height="1" frameborder="0" style="display:none"></iframe>
            </noscript>
        </html>
        <publisher>Agency Doubleclick Tag</publisher>
        <dateAddedToRegistry>2013-04-17</dateAddedToRegistry>
    </pixelNew>

  </trackingPixels>
</twa>

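What I want to do is get everything inside each html element exactly as it is written in the file. In other words, I want to print the html element's markup verbatim. This is what my code looks like...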

    import groovy.xml.StreamingMarkupBuilder

    def f = new File('c:\\pixelsTest.xml')
    def x = new XmlSlurper().parse(f)
    def htmlList = []

    x.trackingPixels.children().each { px ->
        def html = new StreamingMarkupBuilder().bind { out << px.html } as String
        htmlList << html
    }

    htmlList.each { h ->
        println '-' * 79
        println h
    }

But I can't get it to reproduce the html element faithfully, which I checked by printing my htmlList. Here is my output...

 -------------------------------------------------------------------------------
 <html><script type='text/javascript'>
 /*

 */ var google_conversion_id = 10; var google_conversion_language = "en_US"; /*

 */
 </script><script src='//www.SomeWebsite.com' type='text/javascript'></script><noscript><div style='display:inline;'><img height='1' style='border-style:none;' alt='' width='1' src='//www.SomeOtherWebsite.com'></img></div></noscript></html>
 -------------------------------------------------------------------------------
 <html><script type='text/javascript'>
                var axel = Math.random() + "";
                var a = axel * 10000000000000;
                document.write('<iframe frameborder='0' height='1' style='display:none' width='1' src='www.somewebsite.com/ad'></iframe>');
      </script><noscript><iframe frameborder='0' height='1' style='display:none' width='1' src='https:www.somewebsite.com/ads'></iframe></noscript></html>  

But I want it stored in my htmlList exactly as it is printed below...

 -------------------------------------------------------------------------------
 <html>
     <!--
     Google Code for Lead Tracking
     -->
     <script type="text/javascript">
         /*
         <![CDATA[
         */ var google_conversion_id = 10; var google_conversion_language = "en_US"; /*
         ]]>
         */
     </script>
     <script type="text/javascript" src="//www.SomeWebsite.com"></script>
     <noscript>
         <div style="display:inline;">
             <img height="1" width="1" style="border-style:none;" alt="" src="//www.SomeOtherWebsite.com"/>
         </div>
     </noscript>
 </html>
 -------------------------------------------------------------------------------
 <html>
     <script type="text/javascript">
         var axel = Math.random() + "";
         var a = axel * 10000000000000;
         document.write('<iframe src="www.somewebsite.com/ad" width="1" height="1" frameborder="0" style="display:none"></iframe>');
     </script>
     <noscript>
         <iframe src="https:www.somewebsite.com/ads" width="1" height="1" frameborder="0" style="display:none"></iframe>
     </noscript>
 </html>  

It looks like XmlSlurper also drops things like the CDATA markers and the comments. Can anyone help me? Thanks!


1 Answer


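I eventually got the exact html markup, although not with XmlSlurper. I took a different approach and read the whole file into a single string. Then I used the substringsBetween and substringBetween methods from StringUtils to get the string enclosed by the html tags. Here is my code snippet.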

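    import org.apache.commons.lang.StringUtils

    // Read the whole file as one string so comments and CDATA markers stay untouched
    String file = new File('c:\\pixelsTest.xml').text
    def htmlList = []
    // substringsBetween returns null when nothing matches, so fall back to an empty list
    def newPixelList = StringUtils.substringsBetween(file, "<pixelNew", "</pixelNew>") ?: []
    for (int i = 0; i < newPixelList.size(); i++) {
        // Here I can access the html tag and other tags as well, like publisher...
        // Note: this yields the text between <html> and </html>, without the tags themselves
        htmlList[i] = StringUtils.substringBetween(newPixelList[i], "<html>", "</html>")
    }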
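
If the enclosing html tags should be part of each entry, as in the desired output in the question, the same string-based idea can be sketched with a plain regular expression instead of StringUtils. This is only a rough sketch and not what I actually ran: `extractHtmlBlocks` is a hypothetical helper name, the inline `sample` is a shortened stand-in for the real file, and it assumes the html tags carry no attributes and are never nested.

```groovy
// Hypothetical helper (not part of the original snippet): returns every
// <html>...</html> block found in the raw text, tags included, without parsing the XML.
List<String> extractHtmlBlocks(String xmlText) {
    // (?s) lets '.' match newlines; '.*?' is reluctant, so each match ends at the first </html>
    xmlText.findAll(/(?s)<html>.*?<\/html>/)
}

// Shortened stand-in for pixelsTest.xml (an assumption made for this example only)
def sample = '''<twa>
  <trackingPixels>
    <pixelNew pagekey="somepagekey">
      <html>
        <!-- Google Code for Lead Tracking -->
        <script type="text/javascript">/* <![CDATA[ */ var google_conversion_id = 10; /* ]]> */</script>
      </html>
      <publisher>Google adwords</publisher>
    </pixelNew>
    <pixelNew pagekey="someotherpagekey">
      <html>
        <noscript><iframe src="https:www.somewebsite.com/ads" width="1" height="1"></iframe></noscript>
      </html>
      <publisher>Agency Doubleclick Tag</publisher>
    </pixelNew>
  </trackingPixels>
</twa>'''

def htmlList = extractHtmlBlocks(sample)
htmlList.each { h ->
    println '-' * 79
    println h
}
```

Because nothing is parsed, the comment, the CDATA markers, the attribute order and the double quotes all come through unchanged. The trade-off is that this is text matching rather than XML handling, so it would miss an html tag written with attributes or extra whitespace.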
answered 2013-07-17T21:48:27.107