I'm using HtmlCleaner to parse HTML documents and have run into a small problem:

The HtmlCleaner properties guide says that if I set the useCdata flag to false, it will search for HTML inside script and style tags. OK, here I go:

scala> val cleanerProps = new CleanerProperties()
cleanerProps: org.htmlcleaner.CleanerProperties = org.htmlcleaner.CleanerProperties@203e9b48

scala> cleanerProps.setUseCdataForScriptAndStyle(false)

scala> val clnr = new HtmlCleaner(cleanerProps)
clnr: org.htmlcleaner.HtmlCleaner = org.htmlcleaner.HtmlCleaner@4a5800a9

scala> val test = """<script language="javascript">
     | document.write('<h1>Obviously a heading</h1>')
     | </script>"""
test: java.lang.String =
<script language="javascript">
document.write('<h1>Obviously a heading</h1>')
</script>

scala> clnr.clean(test).getElementsByName("h1", true)
res61: Array[org.htmlcleaner.TagNode] = Array()

Shouldn't HtmlCleaner find the h1? To make things even more confusing, the following works fine:

scala> val test2 = """document.write('<h1>Obviously a heading</h1>')"""
test2: java.lang.String =
"document.write('<h1>Obviously a heading</h1>')"

scala> clnr.clean(test2).getElementsByName("h1", true)
res62: Array[org.htmlcleaner.TagNode] = Array(h1)

Or:

scala> clnr.clean(test.replaceAllLiterally("script","style")).getElementsByName("h1", true)
res65: Array[org.htmlcleaner.TagNode] = Array(h1)

???
