r - R: Web scraping with unlist(xpathSApply( )) returns NULL

Question

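I am working with the following website: http://www.crowdrise.com/skollsechallenge

In particular, this page lists 57 crowdfunding campaigns. Each campaign has text detailing why they want to raise money, the total raised so far, and the team members. Some campaigns also specify a fundraising goal. I want to write some R code that scrapes and organizes this information from each of the 57 sites.

To arrive at a table containing all of this information for each of the 57 companies, I first wrote some code that lets me extract the name of each of the 57 companies: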

  #import packages
  library("RCurl")
  library("XML")
  library("stringr")

  url <- "http://www.crowdrise.com/skollSEchallenge"
  url.data <- readLines(url) 
  #the resulting url.data is a character vector, one element per line
  #strip carriage returns, tabs and newlines
  url.data <- gsub('\r','', gsub('\t','', gsub('\n','', url.data)))  
  index.list <- grep("username:",url.data)
  #index.list is an integer vector of the indexes of url.data that contain the name
  #of each of the 57 campaigns  
  length.index.list<-length(index.list)
  length.index.list
  vec <-vector ()

  #store the 57 usernames in one vector
    for(i in 1:length.index.list){
      username<-url.data[index.list[i]]
      real.username <- gsub("username:","",username)
      vec[i] <- c(real.username)
    }
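
For reference, the same extraction can be sketched without an explicit loop. This is only an illustration: the sample lines below are hypothetical stand-ins for the output of readLines(url), and the exact shape of the real "username:" lines is an assumption.

```r
# Hypothetical sample lines standing in for readLines(url);
# the real page content is not reproduced here.
url.data <- c("<script>", "username:'Arzu',", "other:1", "username:'SomeTeam',")

# Keep only the lines containing "username:" and strip the label in one step.
vec <- sub("username:", "", grep("username:", url.data, value = TRUE), fixed = TRUE)

# Remove quotes, commas and spaces, as the scraping loop does later.
usernames <- gsub("[', ]", "", vec)
print(usernames)
```

Because grep(..., value = TRUE) returns the matching lines themselves, no index vector or pre-allocated result is needed.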

I then tried to create a loop that lets R visit each of the 57 campaign web pages and scrape it.

 # Extract all necessary paragraphs. Unlist flattens the list to 
 #create a character vector.

    for(i in 1:length(vec)){
    end.name<-gsub('\'','',vec[i])
    end.name<-gsub(',','',end.name)
    end.name<-gsub(' ','',end.name)
    user.address<-paste(c("http://www.crowdrise.com/skollSEchallenge/",
    end.name),collapse='') 
    user.url<-getURL(user.address)

    html <- htmlTreeParse(user.url, useInternalNodes = TRUE)
    website.donor<-unlist(xpathSApply(html,'//div[@class="grid1-4 "]//h4', xmlValue))
    website.title<-unlist(xpathSApply(html,'//div[@class="project_info"]',xmlValue))
    website.story<-unlist(xpathSApply(html,'//div[@id="thestory"]',xmlValue))
    website.fund<-unlist(xpathSApply(html,'//div[@class="clearfix"]',xmlValue))

    #(NOTE: doc.text<- readHTMLTable(webpage1) doesn't work 
    #due to the poor html structure of the website)
    # Replace all \n by spaces, and eliminate all \t
    website.donor <- gsub('\\n', ' ', website.donor)
    website.donor <- gsub('\\t','',website.donor)
    website.title <- gsub('\\n', ' ', website.title)
    website.title <- gsub('\\t','',website.title)
    website.story <- gsub('\\n', ' ', website.story)
    website.story <- gsub('\\t','',website.story)
    website.fund <- gsub('\\n', ' ', website.fund)
    website.fund <- gsub('\\t','',website.fund)

    ## all those tabs and spaces are just white spaces that we can trim
    website.title <- str_trim(website.title)
    website.fund   <- str_trim(website.fund)
    website.data<- cbind(website.title, website.story, website.fund, website.donor)
    data[[i]]<- website.data
    Sys.sleep(1)
   }
  data <- data.frame(do.call(rbind,data), stringAsFactors=F)

The commands

   unlist(xpathSApply(html,'//div[@class="grid1-4 "]//h4', xmlValue))
   unlist(xpathSApply(html,'//div[@class="project_info"]',xmlValue))
   unlist(xpathSApply(html,'//div[@id="thestory"]',xmlValue))
   unlist(xpathSApply(html,'//div[@class="clearfix"]',xmlValue))

give me NULL values, and I don't understand why.

Why do they come back as NULL, and how can I fix this?

Thank you,

score 1 · Accepted Answer

If I'm following this correctly, are you trying to fetch this URL string and 56 others like it?

url <- "http://www.crowdrise.com/skollSEchallenge/Arzu"
x <- getURL(url)

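However, that only returns a "Page Not Found" page for what you are trying to query. I think you want this URL instead, but I can't even get htmlParse to work on it.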

url <- "http://www.crowdrise.com/Arzu"
x <- readLines(url, encoding="latin1")
 #doc <- htmlParse(x)  # hangs
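
The NULL values in the question are consistent with getting a "Page Not Found" response: when an XPath expression matches no nodes, there is nothing for unlist() to flatten. A minimal sketch, assuming the XML package is installed and using a made-up error page in place of the real response:

```r
library(XML)

# Hypothetical stand-in for an error page: it has no div with id "thestory".
not.found <- "<html><body><h1>Page Not Found</h1></body></html>"
html <- htmlParse(not.found, asText = TRUE)

# No node matches the expression, so the result is empty ...
matches <- xpathSApply(html, '//div[@id="thestory"]', xmlValue)
print(length(matches))

# ... and unlist() of an empty result is NULL.
story <- unlist(matches)
print(is.null(story))
```

Checking length(matches) before unlist() is one way to tell "the page has no such element" apart from a parsing problem.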

If you check the site with http://validator.w3.org using latin1 encoding, you will see 323 errors, so you may need to parse the output of readLines instead

x[grep('"thestory"', x)+1]
[1] "\t\t\t<p><p><em><strong>&quot;We can overcome misunderstanding by ...
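
Continuing in that direction, here is a rough sketch of cleaning up such a line with base R only. The input lines are invented for illustration (they are not the real page), and it assumes the story sits on the single line after the "thestory" marker, which may not hold for every campaign page.

```r
# Hypothetical lines standing in for readLines("http://www.crowdrise.com/Arzu");
# the markup and story text are made up for illustration.
x <- c('<div id="thestory">',
       '\t\t\t<p><em><strong>&quot;Example story text&quot;</strong></em></p>',
       '</div>')

# Take the line that follows the "thestory" marker, as above.
raw.story <- x[grep('"thestory"', x) + 1]

# Strip tags, decode the one entity used here, and trim whitespace.
story <- gsub("<[^>]+>", "", raw.story)
story <- gsub("&quot;", '"', story, fixed = TRUE)
story <- trimws(story)
print(story)
```

A regex like this is fragile on real HTML, so it is best treated as a fallback for pages that the parser cannot handle.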

score 0 · Accepted Answer

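Glad to hear you are interested in Crowdrise. We offer an API that would probably work much better than automatically scraping our site. Get in touch through our contact form or message me directly, and we can discuss what you need and how we can help.

Thanks!

Dave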
