r - Extracting data from an XMLNodeSet in R

Question

I am trying to extract data from an HTML document using the XML package. This is what I do:

library(XML)
sink("parse.txt")
parse<-htmlParse(file = "jdwaz.html",encoding = "GBK")
a=getNodeSet(parse,'//div[@class="amount"]')
print(a)

Then class(a) returns "XMLNodeSet", and its content in the txt file looks like this:

[[1]]
<div class="amount">
                    <span>总额 ￥113.80</span> <br /><span class="ftx-13">在线支付</span>
                                    </div> 

[[2]]
<div class="amount">
                    <span>总额 ￥99.00</span> <br /><span class="ftx-13">在线支付</span>
                                    </div>

I am showing only 2 of the 20 elements of a.

class([a]) returns "list". I want to get the text content, such as "总额 ￥99.00". I found an approach in "r - xpathApply on an XMLNodeSet (with XML package)", which uses xmlValue to get the text as follows:

x <- xpathApply(y, "//table/tr")
sapply(x,xmlValue)          ## it a list of nodes..
 " Test1.1  Test1.2 " " Test1.3  Test1.4 "

But that does not work in my case. When I call xmlValue(a), it returns:

Error in UseMethod("xmlValue") : no applicable method for 'xmlValue' applied to an object of class "XMLNodeSet"

I have not found a suitable way to handle the XMLNodeSet class. Help!

score 2 · Accepted Answer

I am using the rvest package to scrape data from web pages and ran into a problem similar to yours.

In rvest, html_nodes(read_html(url), css) returns an xml nodeset similar to your a. I learned a quick way to extract the content from an xml nodeset with a simple function, html_text(). To use it, wrap html_text() around the xml nodeset.

Consider the following example, which gets a company name from its domain:

library('rvest')
url <- 'https://who.is/whois/gmail.com' 
webpage <- read_html(url)
a <- html_nodes(webpage,'.col-md-7')
a[1] # returns the long xml nodeset
html_text(a) # converts the xml into vector with content extracted
html_text(a)[2] # gives you the company name

rvest appears to be a popular web scraping package, and there are plenty of help resources for it.
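
Applied to the markup from the question, the same idea might look like the sketch below. It is only an illustration: the inline `html` string stands in for jdwaz.html, and the names `page`, `amounts`, and `totals` are made up for the example. It uses the `xpath` argument of html_nodes() rather than a CSS selector, so that only the span without a class attribute is selected.

```r
library(rvest)

# Assumption: a small inline copy of the markup shown in the question,
# used here instead of the real jdwaz.html file.
html <- '<html><body>
<div class="amount"><span>总额 ￥113.80</span> <br /><span class="ftx-13">在线支付</span></div>
<div class="amount"><span>总额 ￥99.00</span> <br /><span class="ftx-13">在线支付</span></div>
</body></html>'

page <- read_html(html)

# Select, inside each div.amount, the span that has no class attribute
amounts <- html_nodes(page, xpath = '//div[@class="amount"]/span[not(@class)]')

# html_text() is vectorised over the nodeset: one string per node
totals <- html_text(amounts)
print(totals)
```

For a file on disk, read_html("jdwaz.html", encoding = "GBK") would presumably replace the inline string.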

score 2 · Accepted Answer

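To query within the nodes of an XML node set, use a leading "." so that the XPath is relative to the current node. Since you have two span tags, select the one without a class attribute.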

sapply(a, function(x) xpathSApply(x, ".//span[not(@class)]", xmlValue)) #OR
sapply(a, xpathSApply, ".//span[not(@class)]", xmlValue)
[1] "总额 ￥113.80" "总额 ￥99.00"
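
For context, a self-contained sketch of this approach follows. The error in the question arises because xmlValue() has no method for the XMLNodeSet container itself, so it has to be applied to each node in turn. The inline `html` string is an assumption standing in for jdwaz.html, and `doc`, `all_text`, `totals`, and `totals2` are illustrative names.

```r
library(XML)

# Assumption: inline copy of the markup from the question instead of the file.
html <- '<html><body>
<div class="amount"><span>总额 ￥113.80</span> <br /><span class="ftx-13">在线支付</span></div>
<div class="amount"><span>总额 ￥99.00</span> <br /><span class="ftx-13">在线支付</span></div>
</body></html>'

doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")
a <- getNodeSet(doc, '//div[@class="amount"]')

# xmlValue() works on single nodes, so apply it to each element of the set.
# This returns the full text of every div, including both spans.
all_text <- sapply(a, xmlValue)

# A relative XPath (leading ".") picks only the span without a class attribute.
totals <- sapply(a, function(node) xpathSApply(node, "./span[not(@class)]", xmlValue))

# The same spans can also be selected in one query against the whole document.
totals2 <- xpathSApply(doc, '//div[@class="amount"]/span[not(@class)]', xmlValue)

print(totals)
```

The direct query in the last step skips the intermediate node set, which may be simpler when the div nodes are not needed for anything else.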
