groovy - Groovy HtmlUnit

Question

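I'm having trouble importing htmlunit (htmlunit.sf.net) into a Groovy script.

For now I'm just using an example script from the web, and it fails with "unable to resolve class com.gargoylesoftware.htmlunit.WebClient".

The script is: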

import com.gargoylesoftware.htmlunit.WebClient

client = new WebClient()
page = client.getPage('http://www.msnbc.msn.com/')
println page.anchors.collect{ it.hrefAttribute }.sort().unique().join('\n')

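I downloaded the source from the website and placed the com folder (and all of its contents) in the same directory as my script.

Does anyone know what the problem is? I'm not sure why the import fails.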

score 3 · Accepted Answer

You can use Grape to fetch the dependency for you when the script runs. The easiest way is to add a @Grab annotation to the import statement.

Like this:

@Grab('net.sourceforge.htmlunit:htmlunit:2.7')
import com.gargoylesoftware.htmlunit.WebClient

client = new WebClient()

// Added as HtmlUnit had problems with the JavaScript
client.javaScriptEnabled = false
page = client.getPage('http://www.msnbc.msn.com/')
println page.anchors.collect{ it.hrefAttribute }.sort().unique().join('\n')

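There is just one problem. That page seems to be too much for HtmlUnit to handle: every time I ran the code I got an OutOfMemoryError. I suggest downloading the HTML the normal way and then using something like NekoHTML or TagSoup to parse the HTML into XML and working with it that way.

This example uses TagSoup to work with HTML as XML in Groovy: http://blog.foosion.org/2008/06/09/parse-html-the-groovy-way/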
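
A minimal sketch of that TagSoup approach, not taken from the linked post. It assumes Groovy 3 or later (on older versions, drop the groovy.xml.XmlSlurper import) and the Maven coordinates org.ccil.cowan.tagsoup:tagsoup:1.2.1. The inline sample HTML is made up so the script runs without fetching the page; the commented-out line shows one way to download the real page.

```groovy
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser
import groovy.xml.XmlSlurper

// Made-up sample markup with an unclosed <p>, standing in for a real page.
// To fetch the real page instead: def html = 'http://www.msnbc.msn.com/'.toURL().text
def html = '''<html><body>
<a href="/b">B</a> <a href="/a">A</a>
<p>unclosed paragraph <a href="/b">B again</a>
</body></html>'''

// TagSoup's Parser is a SAX XMLReader that tolerates malformed HTML,
// so XmlSlurper can walk the result like ordinary XML.
def page = new XmlSlurper(new Parser()).parseText(html)

// Depth-first search for <a> elements that carry an href attribute.
def links = page.'**'
        .findAll { it.name() == 'a' && it.@href.text() }
        .collect { it.@href.text() }
        .sort()
        .unique()

println links.join('\n')
```

Because nothing here executes JavaScript or builds HtmlUnit's full page model, this is likely to need far less memory, at the cost of only seeing links present in the raw HTML.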

score 1 · Accepted Answer

You just need to download the zip file, extract the jar files, and put them on the classpath when compiling... you don't need the source.

http://sourceforge.net/projects/htmlunit/files/htmlunit/2.8/htmlunit-2.8.zip/download
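
The usual way to do this is the groovy command's -cp (classpath) option. As an alternative sketch, the jars can also be added to a class loader from inside the script. This assumes the jars from the zip were extracted into a lib directory next to the script; the directory name and the jarUrls helper are hypothetical, and the class has to be loaded by name because a plain import would still fail at compile time.

```groovy
// Hypothetical helper: URLs for every .jar file in a directory
// (returns an empty list if the directory does not exist).
List<URL> jarUrls(File dir) {
    (dir.listFiles() ?: [])
        .findAll { it.name.endsWith('.jar') }
        .sort { it.name }
        .collect { it.toURI().toURL() }
}

// Assumption: HtmlUnit and its dependency jars were extracted into ./lib
def libDir = new File('lib')
def urls = jarUrls(libDir)

// Child loader that can see the extracted jars
def loader = new GroovyClassLoader(this.class.classLoader)
urls.each { loader.addURL(it) }

if (urls) {
    // Load by name, since the class is not on the compile-time classpath
    def clientClass = loader.loadClass('com.gargoylesoftware.htmlunit.WebClient')
    def client = clientClass.getDeclaredConstructor().newInstance()
    println "Loaded ${client.getClass().name}"
} else {
    println "No jars found in ${libDir.absolutePath}"
}
```

Putting the jars on the classpath when launching the script is simpler and keeps the normal import statement working; the loader approach is only worth it when the launch command cannot be changed.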
