r - Web scraping in R

Question

I'm trying to get the values of 'Date Posted' and 'Date Updated', as shown in the image. The URL is: http://sulit.com.ph/3991016

I think I should be using xpathSApply, as suggested in the thread "Web Scraping (in R?)", but I just can't get it to work.

library(XML)
url = "http://sulit.com.ph/3991016"
doc = htmlTreeParse(url, useInternalNodes = TRUE)

date_posted = xpathSApply(doc, "??????????", xmlValue)

Does anyone also know a quick way to pick up the phrase "P27M", which is listed on the page as well? Any help would be appreciated.

score 3 · Accepted Answer

Here is another approach.

> require(XML)
> 
> url = "http://www.sulit.com.ph/index.php/view+classifieds/id/3991016/BEAUTIFUL+AYALA+HEIGHTS+QC+HOUSE+FOR+SALE"
> doc = htmlParse(url)
> 
> dates = getNodeSet(doc, "//span[contains(string(.), 'Date Posted') or contains(string(.), 'Date Updated')]")
> dates = lapply(dates, function(x){
+         temp = xmlValue(xmlParent(x)["span"][[2]])
+         strptime(gsub("^[[:space:]]+|[[:space:]]+$", "", temp), format = "%B %d, %Y")
+ 
+ })
> dates
[[1]]
[1] "2012-07-05"

[[2]]
[1] "2011-08-11"

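There is no need to use RCurl, since htmlParse will fetch and parse the URL itself. getNodeSet returns a list containing the nodes whose text includes "Date Posted" or "Date Updated". The lapply loops over those two nodes, first finding the parent node and then taking the value of its second "span" child. This part may not be very robust if the site uses a different format on different pages (which seems quite possible after looking at the site's HTML). SlowLearner's gsub cleans up the two dates. I added strptime to return the dates as date-time objects (POSIXlt) rather than strings, but that step is optional and depends on how you plan to use the information later. HTH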
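The parent-then-second-span lookup described above can also be written as a single XPath expression, which is the piece the question left as "??????????". The following is only a sketch under stated assumptions: the HTML snippet is hypothetical (the real page markup is not shown in this thread), the sample dates are placeholders, and get_date is a helper name made up for this illustration.

```r
library(XML)

# Hypothetical markup: a label <span> followed by a value <span> in the same parent.
html <- '<html><body>
  <div><span>Date Posted:</span><span> August 11, 2011 </span></div>
  <div><span>Date Updated:</span><span> July 5, 2012 </span></div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Hypothetical helper: take the first <span> that follows the label <span>,
# trim surrounding whitespace, and convert the text to a Date.
get_date <- function(doc, label) {
  path <- sprintf("//span[contains(., '%s')]/following-sibling::span[1]", label)
  raw <- xpathSApply(doc, path, xmlValue)
  trimmed <- gsub("^[[:space:]]+|[[:space:]]+$", "", raw)
  as.Date(trimmed, format = "%B %d, %Y")  # %B month names depend on the locale
}

date_posted <- get_date(doc, "Date Posted")
date_updated <- get_date(doc, "Date Updated")
```

If the real page nests the value differently (for example in a table cell rather than a sibling span), the following-sibling step would need to change accordingly.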

score 2 · Accepted Answer

This isn't elegant and probably isn't very robust, but it should work for this case.

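The first 4 lines after the require calls retrieve the URL and extract the text. grepl returns TRUE or FALSE for each element depending on whether the string we are looking for was found, and which converts that into an index into the list. We add 1 to it because, if you look at cleantext, you will see that the updated date is the next element in the list after the string "Date Updated". So with +1 we get the element after "Date Updated". The gsub lines just clean up the strings.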
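The which(grepl(...)) + 1 indexing can be tried without fetching the page. This is a minimal sketch on a made-up character vector standing in for cleantext. The helper value_after is hypothetical, and the guard for a missing label is an addition that the code in this answer does not have.

```r
# Made-up stand-in for the text nodes extracted from the page.
cleantext <- c("Seller details", "Date Posted", " August 11, 2011 ",
               "Date Updated", "  July 5, 2012", "Contact")

# Hypothetical helper: return the element right after the first one matching `label`.
value_after <- function(x, label) {
  idx <- which(grepl(label, x, fixed = TRUE))
  # Added guard: no match, or the match is the last element.
  if (length(idx) == 0 || idx[1] == length(x)) return(NA_character_)
  trimws(x[idx[1] + 1])
}

date_posted <- value_after(cleantext, "Date Posted")
date_updated <- value_after(cleantext, "Date Updated")
print(date_posted)
print(date_updated)
```

Without the guard, a page that lacks one of the labels makes the [[ ]] lookup fail with an error, which is one reason this approach is fragile.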

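The problem with "P27M" is that it isn't anchored to anything: it is just free text floating at an arbitrary position. If you are sure the price is always "P" followed by 1 to 3 digits and then "M", and that there is only one such string on the page, then grep or a regex will work. Otherwise it will be hard to get.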

require(XML)
require(RCurl)

myurl <- 'http://www.sulit.com.ph/index.php/view+classifieds/id/3991016/BEAUTIFUL+AYALA+HEIGHTS+QC+HOUSE+FOR+SALE'
mytext <- getURL(myurl)
myhtml <- htmlTreeParse(mytext, useInternalNodes = TRUE)
cleantext <- xpathApply(myhtml, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)

cleantext <- cleantext[!cleantext %in% " "]
cleantext <- gsub("  "," ", cleantext)

date_updated <- cleantext[[which(grepl("Date Updated",cleantext))+1]]
date_posted <- cleantext[[which(grepl("Date Posted",cleantext))+1]]
date_posted <- gsub("^[[:space:]]+|[[:space:]]+$","",date_posted)
date_updated <- gsub("^[[:space:]]+|[[:space:]]+$","",date_updated)

print(date_updated)
print(date_posted)
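For the "P27M" part, the pattern described earlier in this answer ("P", 1 to 3 digits, "M") could be sketched as follows. The sample text is made up, and the sketch only accepts the result when exactly one match is found, in line with the caveat above.

```r
# Made-up stand-in for the page text; the real wording around the price is unknown.
page_text <- c("BEAUTIFUL AYALA HEIGHTS QC HOUSE FOR SALE",
               "Asking price is P27M, negotiable",
               "Contact the seller for details")

# "P", then 1 to 3 digits, then "M", delimited by word boundaries.
price_pattern <- "\\bP[0-9]{1,3}M\\b"
matches <- unlist(regmatches(page_text,
                             gregexpr(price_pattern, page_text, perl = TRUE)))

# Accept the result only when the page contains exactly one such string.
price <- if (length(matches) == 1) matches else NA_character_
print(price)
```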
