r - How to scrape messages from a web-based forum using rvest

Question

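Take the vBulletin site in the example below. I want to scrape only the message text from a thread. However, the CSS selector for each message is #post_message_xxx, where xxx is a variable id number.

How can I partially match the selector in html_nodes so that I get every element whose id starts with post_message, regardless of how it ends?

Or perhaps I should ask a more general question: if I want to attribute each message to its author and keep track of message order, how should I go about scraping the page?

Thanks.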

library(rvest)
# html() is the older rvest reader; later rvest releases deprecate it in favor of read_html()
html <- html("http://www.acme.com/forums/new_rules_28429/")
cast <- html_nodes(html, "#post_message_28429")
cast

<div id="post_message_28429">&#13;            &#13;           Thanks for posting this.&#13;        </div> 

attr(,"class")
[1] "XMLNodeSet"

score 6 · Accepted Answer

Instead of a CSS selector, use an XPath selector with the starts-with() function:

cast <- html_nodes(html, xpath="//div[starts-with(@id,'post_message')]")
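For illustration, here is a minimal sketch of the same XPath on an inline HTML snippet, so it can be run without network access. The markup and the ids in it are made up for this example, and `read_html()` is used in place of `html()`, which later rvest releases deprecate.

```r
library(rvest)

# Hypothetical forum markup: ids follow the post_message_<number> pattern.
doc <- read_html('<html><body>
  <div id="post_message_28429">Thanks for posting this.</div>
  <div id="post_message_28430">Agreed.</div>
  <div id="sidebar">Not a post</div>
</body></html>')

# Select every <div> whose id attribute begins with "post_message".
posts <- html_nodes(doc, xpath = "//div[starts-with(@id, 'post_message')]")

# Extract the message text, trimming surrounding whitespace.
messages <- html_text(posts, trim = TRUE)
messages
```

The `sidebar` div is skipped because its id does not begin with the prefix, so only the two post nodes come back, in document order.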

score 5 · Accepted Answer

Or you can actually do the same thing with the "far less powerful" CSS selectors:

cast <- html_nodes(html, "div[id^='post_message']")
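On the more general question of authors and message order, one possible approach is to select each post's container first and then pull the author and the message out of each container, so the two stay aligned. The sketch below assumes hypothetical markup (a `div.post` wrapper and an `a.username` link); a real vBulletin page will likely use different class names, so treat the selectors as placeholders.

```r
library(rvest)

# Hypothetical markup: each post is wrapped in div.post and names its author
# in a.username. Real forum pages will differ.
doc <- read_html('<html><body>
  <div class="post">
    <a class="username">alice</a>
    <div id="post_message_28429">Thanks for posting this.</div>
  </div>
  <div class="post">
    <a class="username">bob</a>
    <div id="post_message_28430">Agreed.</div>
  </div>
</body></html>')

# One node per post, in document order.
containers <- html_nodes(doc, "div.post")

# html_node() (singular) returns exactly one result per container,
# so the columns line up even if a post lacks one of the pieces.
message_nodes <- html_node(containers, "div[id^='post_message']")

posts <- data.frame(
  order   = seq_along(containers),
  author  = html_text(html_node(containers, "a.username"), trim = TRUE),
  post_id = html_attr(message_nodes, "id"),
  message = html_text(message_nodes, trim = TRUE),
  stringsAsFactors = FALSE
)
posts
```

Working per container rather than running two independent page-wide selections avoids mismatched authors and messages when some post is missing an element.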
