I am working with the following website: http://www.crowdrise.com/skollsechallenge
Specifically, there are 57 crowdfunding campaigns on this page. Each campaign has text explaining why it wants to raise money, the total amount raised so far, and the team members. Some campaigns also specify a fundraising goal. I would like to write some R code to scrape and organize this information from each of the 57 campaign pages.
To build a table containing all of this information for each of the 57 campaigns, I first wrote a function that lets me extract the name of each campaign:
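The table I want to end up with has one row per campaign; roughly, an empty skeleton would look like this (the column names are just placeholders I chose):
#rough skeleton of the target table: one row per campaign
result <- data.frame(title = character(0),
                     story = character(0),
                     fund  = character(0),
                     donor = character(0),
                     stringsAsFactors = FALSE)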
#import packages
library("RCurl")
library("XML")
library("stringr")
url <- "http://www.crowdrise.com/skollSEchallenge"
url.data <- readLines(url)
#url.data is now a character vector, one element per line of the page source
#remove carriage returns, tabs, and newlines
url.data <- gsub('\r','', gsub('\t','', gsub('\n','', url.data)))
index.list <- grep("username:",url.data)
#index.list is an integer vector giving the positions in url.data that contain
#the name of each of the 57 campaigns
length.index.list <- length(index.list)
length.index.list
vec <- vector()
#store the 57 usernames in one vector
for(i in 1:length.index.list){
  username <- url.data[index.list[i]]
  real.username <- gsub("username:", "", username)
  vec[i] <- real.username
}
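As a side note, the same extraction can also be written as a single vectorized call; this sketch is equivalent to the loop above and only relies on url.data and index.list as defined there:
#equivalent vectorized version of the username loop
vec <- gsub("username:", "", url.data[index.list])
length(vec) #should be 57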
Then I tried to create a loop that lets R visit each of the 57 campaign web pages and do the web scraping.
# Extract all necessary paragraphs. Unlist flattens the list to
# create a character vector.
data <- list()  #initialize the list that will hold one row per campaign
for(i in 1:length(vec)){
  #clean the username so it can be appended to the base URL
  end.name <- gsub('\'', '', vec[i])
  end.name <- gsub(',', '', end.name)
  end.name <- gsub(' ', '', end.name)
  user.address <- paste(c("http://www.crowdrise.com/skollSEchallenge/",
                          end.name), collapse='')
  user.url <- getURL(user.address)
  html <- htmlTreeParse(user.url, useInternalNodes = TRUE)
  website.donor <- unlist(xpathSApply(html, '//div[@class="grid1-4 "]//h4', xmlValue))
  website.title <- unlist(xpathSApply(html, '//div[@class="project_info"]', xmlValue))
  website.story <- unlist(xpathSApply(html, '//div[@id="thestory"]', xmlValue))
  website.fund  <- unlist(xpathSApply(html, '//div[@class="clearfix"]', xmlValue))
  #(NOTE: doc.text <- readHTMLTable(webpage1) doesn't work
  #due to the poor html structure of the website)
  # Replace all \n by spaces, and eliminate all \t
  website.donor <- gsub('\\n', ' ', website.donor)
  website.donor <- gsub('\\t', '', website.donor)
  website.title <- gsub('\\n', ' ', website.title)
  website.title <- gsub('\\t', '', website.title)
  website.story <- gsub('\\n', ' ', website.story)
  website.story <- gsub('\\t', '', website.story)
  website.fund <- gsub('\\n', ' ', website.fund)
  website.fund <- gsub('\\t', '', website.fund)
  ## all those tabs and spaces are just white space that we can trim
  website.title <- str_trim(website.title)
  website.fund <- str_trim(website.fund)
  website.data <- cbind(website.title, website.story, website.fund, website.donor)
  data[[i]] <- website.data
  Sys.sleep(1)
}
data <- data.frame(do.call(rbind, data), stringsAsFactors = FALSE)
The commands
unlist(xpathSApply(html,'//div[@class="grid1-4 "]//h4', xmlValue))
unlist(xpathSApply(html,'//div[@class="project_info"]',xmlValue))
unlist(xpathSApply(html,'//div[@id="thestory"]',xmlValue))
unlist(xpathSApply(html,'//div[@class="clearfix"]',xmlValue))
give me NULL values, and I don't understand why.
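To isolate the problem, here is a minimal sketch that re-runs the same queries for just the first campaign page, outside the loop (it only assumes that vec[1] from the code above holds the first extracted username):
#minimal reproduction on a single campaign page
library(RCurl)
library(XML)
first.name <- gsub(' ', '', gsub(',', '', gsub('\'', '', vec[1])))
first.url  <- paste0("http://www.crowdrise.com/skollSEchallenge/", first.name)
first.html <- htmlTreeParse(getURL(first.url), useInternalNodes = TRUE)
#each of these prints NULL instead of the expected text
xpathSApply(first.html, '//div[@class="grid1-4 "]//h4', xmlValue)
xpathSApply(first.html, '//div[@id="thestory"]', xmlValue)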
Why do these queries return NULL, and how can I fix this?
Thank you,