I want to use XmlSlurper to parse an HTML document that I fetch with HTTPBuilder. At first I tried this:

def response = http.get(path: "index.php", contentType: TEXT)
def slurper = new XmlSlurper()
def xml = slurper.parse(response)

But it throws an exception:

java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

I found a workaround: serve the DTD from a locally cached file. I also found a simple class implementation that should help here:

class CachedDTD {
/**
 * Return DTD 'systemId' as InputSource.
 * @param publicId
 * @param systemId
 * @return InputSource for locally cached DTD.
 */
  def static entityResolver = [
          resolveEntity: { publicId, systemId ->
            try {
              String dtd = "dtd/" + systemId.split("/").last()
              Logger.getRootLogger().debug "DTD path: ${dtd}"
              new org.xml.sax.InputSource(CachedDTD.class.getResourceAsStream(dtd))
            } catch (e) {
              //e.printStackTrace()
              Logger.getRootLogger().fatal "Fatal error", e
              null
            }
          }
  ] as org.xml.sax.EntityResolver

}

My package tree looks like this:

[image: package tree]

I also changed the code that parses the response, so it now looks like this:

def response = http.get(path: "index.php", contentType: TEXT)
def slurper = new XmlSlurper()
slurper.setEntityResolver(org.yuri.CachedDTD.entityResolver)
def xml = slurper.parse(response)

But now I get a java.net.MalformedURLException. The DTD path logged by the CachedDTD entityResolver is org/yuri/dtd/xhtml1-transitional.dtd, and I can't get it to work...
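
One possible cause, offered as an assumption rather than a confirmed diagnosis: getResourceAsStream returns null when the resource is not found, and an InputSource built around a null stream gives the parser nothing to read. The sketch below guards against that case by falling back to an empty DTD. The resolveDtd name and the sample page are made up for illustration. It assumes Groovy 3 or later; on Groovy 2.x, drop the groovy.xml.XmlSlurper import, since the class lives in groovy.util there.

```groovy
import groovy.xml.XmlSlurper
import org.xml.sax.EntityResolver
import org.xml.sax.InputSource

// Hypothetical sample page that references the XHTML transitional DTD.
def html = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><head><title>Hello</title></head><body><p>Hi</p></body></html>'''

// Look the DTD up on the classpath. If it is missing, fall back to an
// empty DTD so the parser never receives an InputSource without a stream.
def resolveDtd = { String publicId, String systemId ->
    String name = 'dtd/' + systemId.split('/').last()
    def stream = this.getClass().getResourceAsStream(name)
    stream != null ? new InputSource(stream) : new InputSource(new StringReader(''))
} as EntityResolver

// validating = false, namespaceAware = false, allowDocTypeDeclaration = true
def slurper = new XmlSlurper(false, false, true)
slurper.setEntityResolver(resolveDtd)

def page = slurper.parseText(html)
println page.head.title.text()
```

An empty DTD is enough only for pages that do not rely on entities declared in the DTD, such as &nbsp;. Pages that use them need the real cached file on the classpath.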

2 Answers

I was able to solve my parsing issue by using another XmlSlurper constructor:

public XmlSlurper(boolean validating, boolean namespaceAware, boolean allowDocTypeDeclaration)

like this:

def parser = new XmlSlurper(false, false, true)

In my case, disabling validation (first parameter false) and allowing the DOCTYPE declaration (third parameter true) did the trick.
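
A minimal sketch of that constructor in use, assuming Groovy 3 or later (on Groovy 2.x, drop the import and rely on groovy.util.XmlSlurper). The sample document is invented for illustration and keeps its DTD inline, so nothing is fetched over the network.

```groovy
import groovy.xml.XmlSlurper

// Hypothetical document with an inline DOCTYPE (internal subset only).
def xmlText = '''<?xml version="1.0"?>
<!DOCTYPE catalog [
  <!ELEMENT catalog (item+)>
  <!ELEMENT item (#PCDATA)>
]>
<catalog><item>first</item><item>second</item></catalog>'''

// validating = false, namespaceAware = false, allowDocTypeDeclaration = true
def parser = new XmlSlurper(false, false, true)
def catalog = parser.parseText(xmlText)

// Collect the text of every <item> element.
def items = catalog.item.collect { it.text() }
println items
```

Note that a non-validating parser may still try to load an external DTD named in the DOCTYPE, so a document that points at a remote DTD can still need an entity resolver.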

answered 2014-11-28T19:33:34.497

You can solve these problems by combining an HTML parser with XmlSlurper:

http://sourceforge.net/projects/nekohtml/

Example usage is here:

http://groovy.codehaus.org/Testing+Web+Applications

answered 2010-09-19T16:16:38.907