
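I want to extract the XPath //DIV[@id="ps-content"] from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file).

I want to do it with a single command line, using one of the best parsers (such as Saxon-PE or BaseX).

The shortest solution I have found so far (it seems) is these two lines: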

java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
java -cp saxon9pe.jar net.sf.saxon.Query -s:"page.xhtml" -qs:"//DIV[@id='ps-content']"

But all it returns is this, not the block of HTML code I expected:

<?xml version="1.0" encoding="UTF-8"?>

My question is twofold:


2 Answers


I found the correct command line to run the query without TagSoup:

java -cp saxon9pe.jar net.sf.saxon.Query -s:"test.xhtm" -qs:"//*:div[@id='ps-content']"
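A likely reason the `*:div` form matches where `//DIV` did not: XPath name tests are case-sensitive, and XHTML elements live in the XHTML namespace, which an unprefixed name does not select unless a default element namespace is declared. The sketch below illustrates this with the JDK's built-in XPath 1.0 engine rather than Saxon, so it uses `local-name()` where Saxon accepts the `*:div` wildcard. The class name `NamespaceMatchDemo` and the inline sample document are made up for illustration and only stand in for the real page.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceMatchDemo {
    // Hypothetical sample: lowercase element names in the XHTML namespace.
    static final String XHTML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
      + "<div id=\"ps-content\">hello</div></body></html>";

    // Returns how many nodes the given XPath 1.0 expression selects in the sample.
    static int count(String xpath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // keep the XHTML namespace on each element
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XHTML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Wrong case and no namespace: selects nothing.
        System.out.println(count("//DIV[@id='ps-content']"));                   // 0
        // Right case, but still looks in no namespace: selects nothing.
        System.out.println(count("//div[@id='ps-content']"));                   // 0
        // Namespace-agnostic match on the local name: selects the element.
        System.out.println(count("//*[local-name()='div'][@id='ps-content']")); // 1
    }
}
```

If the query matches nothing, Saxon still prints the XML declaration, which would explain output consisting of only that line.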

Note that swapping the quote types like this does not work (on Win7):

java -cp saxon9pe.jar net.sf.saxon.Query -s:"test.xhtm" -qs:'//*:div[@id="ps-content"]'

Does anyone know how to add the TagSoup preprocessing step to the same command line?

Answered 2013-06-11T11:02:07.407

My latest attempts to integrate TagSoup into the same command line all failed:

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"
Error on line 1 column 17
  XPST0017 XQuery static error near #...:unparsed-text('page.html'))//#:
    System function unparsed-text#1 is not available with this host language/version
Static error(s) in query

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"fn:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"
Error on line 1 column 14
  XPST0017 XQuery static error near #...:unparsed-text('page.html'))//#:
    System function unparsed-text#1 is not available with this host language/version
Static error(s) in query

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html;unparsed=yes'))//*:div[@id='ps-content']"
Error on line 1 of *module with no systemId*:
  FODC0002: The file or directory
  file:/C:/Users/diego/Downloads/SaxonPE9-4-0-7J/page.html;unparsed=yes does not exist
Query processing failed: Run-time errors were reported

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html';'unparsed=yes'))//*:div[@id='ps-content']"
Error on line 1 column 39
  XPST0003 XQuery syntax error near #...ion('page.html';'unparsed=yes'#:
    expected ")", found ";"
Static error(s) in query
Answered 2013-06-14T21:03:31.900