
I'm using Rcrawler to extract the infoboxes of Wikipedia pages. I have a list of musicians and I'd like to extract their name, date of birth, date of death, instruments, labels, etc., then build a data frame with the artists in the list as rows and the extracted fields as columns.

The code below throws no errors, but I don't get any results either. The XPath expressions in the code work when I use rvest on its own.

What is wrong with my code?

library(Rcrawler)
jazzlist<-c("Art Pepper","Horace Silver","Art Blakey","Philly Joe Jones")

Rcrawler(Website = "http://en.wikipedia.org/wiki/Special:Search/", no_cores = 4, no_conn = 4, 
     KeywordsFilter = jazzlist,
     ExtractXpathPat = c("//th","//tr[(((count(preceding-sibling::*) + 1) = 5) and parent::*)]//td",
                         "//tr[(((count(preceding-sibling::*) + 1) = 6) and parent::*)]//td"),
     PatternsNames = c("artist", "dob", "dod"), 
     ManyPerPattern = TRUE, MaxDepth=1 )

2 Answers


Scrape data from a specific list of Wikipedia URLs

If you want to scrape a specific list of URLs that share a common pattern, use the ContentScraper function:

library(Rcrawler)
jazzlist<-c("Art Pepper","Horace Silver","Art Blakey","Philly Joe Jones")
target_pages = paste0('https://en.wikipedia.org/wiki/Special:Search/', gsub(" ", "_", jazzlist))
DATA <- ContentScraper(Url = target_pages,
                       XpathPatterns = c("//th",
                                         "//tr[(((count(preceding-sibling::*) + 1) = 5) and parent::*)]//td",
                                         "//tr[(((count(preceding-sibling::*) + 1) = 6) and parent::*)]//td"),
                       PatternsName = c("artist", "dob", "dod"),
                       asDataFrame = TRUE)
View(DATA)
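
If the dob and dod cells come back as raw infobox text rather than clean dates, a small post-processing step can pull the dates out. A minimal sketch, assuming DATA converts to a data frame with columns named after PatternsName; the dob_iso and dod_iso columns are illustrative names added here:

library(stringr)

df <- as.data.frame(DATA, stringsAsFactors = FALSE)
# Wikipedia infoboxes usually embed an ISO-style date such as (1925-09-01),
# so extracting it with a regex is more robust than parsing the free text.
df$dob_iso <- str_extract(df$dob, "\\d{4}-\\d{2}-\\d{2}")
df$dod_iso <- str_extract(df$dod, "\\d{4}-\\d{2}-\\d{2}")
head(df)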

Crawl Wikipedia and scrape data from a list of links

After some digging I found a list of hard bop musicians on Wikipedia, and I imagine you'd be interested in scraping data for all of those artists; in that case we'll use the Rcrawler function to collect and parse all of those pages automatically.

Rcrawler(Website = "https://en.wikipedia.org/wiki/List_of_hard_bop_musicians",
         no_cores = 4, no_conn = 4, MaxDepth = 1,
         ExtractXpathPat = c("//th",
                             "//tr[(((count(preceding-sibling::*) + 1) = 5) and parent::*)]//td",
                             "//tr[(((count(preceding-sibling::*) + 1) = 6) and parent::*)]//td"),
         PatternsNames = c("artist", "dob", "dod"),
         crawlZoneXPath = "//*[@class='mw-parser-output']")

# transform the extracted data into a data frame
df <- data.frame(do.call("rbind", DATA))
  • MaxDepth = 1: crawl only the links found on the start page
  • crawlZoneXPath: follow only links inside the page body (the list of artists)
  • ExtractXpathPat: the XPath patterns of the data to extract
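
For reference, when the crawl finishes, Rcrawler leaves its results in two global objects: INDEX, a data frame describing every URL it visited, and DATA, a list of extracted patterns with one element per matched page. A quick sanity check of a run:

str(INDEX)    # visited URLs with their depth and HTTP response
length(DATA)  # how many pages matched the extraction patterns
head(df)      # the data frame bound above with do.call("rbind", DATA)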

Rcrawler creator

answered 2018-11-15 at 21:19

I may be wrong, but I suspect you think the Rcrawler package works differently than it actually does. You may be confusing scraping with crawling.

Rcrawler simply starts from a given page and crawls any link from that page. You can narrow the paths it takes using URL filters or keyword filters, but those pages still have to be reached through the crawling process; it does not run searches.

The fact that you start from the Wikipedia search page suggests you may expect it to run a search on the terms you specified in jazzlist, but it won't do that. It will simply follow all the links on the Wikipedia search page, e.g. "Main Page", "Contents", "Featured content" in the left sidebar, and it may or may not eventually hit one of your terms, in which case it would scrape the data according to your XPath arguments.

The terms you specified will be very rare, so while it could eventually find them through article cross-links such as featured pages, it would take a very long time.
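
For completeness, narrowing the crawl with filters looks roughly like the sketch below; the crawlUrlfilter parameter is assumed to be available in your installed Rcrawler version (check ?Rcrawler). Even with filters, the crawler only reaches pages by following links from the start page; it never searches:

library(Rcrawler)
# Sketch only: restrict link-following to /wiki/ article pages and keep
# only pages that mention one of the search terms. Discovery still happens
# by crawling outward from the start page, not by querying Wikipedia.
Rcrawler(Website = "https://en.wikipedia.org/wiki/Special:Search/",
         no_cores = 4, no_conn = 4, MaxDepth = 1,
         crawlUrlfilter = "/wiki/",   # follow only links to article pages
         KeywordsFilter = jazzlist,   # keep only pages mentioning the terms
         ExtractXpathPat = "//th",
         PatternsNames = "artist")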

I think what you want is to not use Rcrawler at all, and instead call rvest functions in a loop over the search terms. You just need to append each term to the search URL you mentioned, replacing spaces with underscores:

library(rvest)
target_pages = paste0('https://en.wikipedia.org/wiki/Special:Search/', gsub(" ", "_", jazzlist))

for (url in target_pages){
    webpage = read_html(url)
    # do whatever else you want here with rvest functions 
}

Edit: per the OP's comment, below is a solution using his exact code for his specific case

library(rvest)
target_pages = paste0('https://en.wikipedia.org/wiki/Special:Search/', gsub(" ", "_", jazzlist))

data = data.frame()  # must exist before the loop so rbind() can grow it
for (url in target_pages){
    webpage = read_html(url)
    info <- webpage %>% html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "plainlist", " " ))]') %>% html_text()
    temp <- data.frame(info, stringsAsFactors = FALSE)
    data <- rbind(data, temp)
}
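
A possible extension, not part of the original answer: since the question also asks for instruments and labels, the whole infobox can be pulled as a two-column table so that each label ("Born", "Died", "Instruments", "Labels") stays paired with its value. A hedged sketch using only standard rvest calls:

library(rvest)

infoboxes <- lapply(target_pages, function(url) {
    page <- read_html(url)
    # the biography infobox is the table whose class contains "infobox"
    box <- html_node(page, xpath = '//table[contains(@class, "infobox")]')
    if (inherits(box, "xml_missing")) return(NULL)  # page has no infobox
    html_table(box, fill = TRUE)
})
names(infoboxes) <- jazzlist
infoboxes[["Art Pepper"]]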
answered 2018-07-31 at 13:45