I did some data cleaning with rvest. Hopefully it generalizes to your case:
library(rvest)
library(dplyr)
Get the url:
# Parse the page (read_html() replaces html() from older rvest releases)
url <- read_html('http://examine.com/rubric/effects/view/552/Symptoms+of+Irritable+Bowel+Syndrome/all/')
The title of each study is held in an <a> link wrapper. Grab those and strip the line breaks. Add a study number and put everything in a df.
selector_name<-"a"
titles<-html_nodes(url, selector_name) %>% html_text()
titles <- gsub("[\r\n\t]", "", titles)
titles <- as.data.frame(titles)
titles$studyno <- 1:nrow(titles)
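To see what this cleanup step does without hitting the site, here is a minimal base-R sketch. The `raw_titles` strings are invented for illustration and are not scraped from the page.

```r
# Hypothetical strings standing in for html_text() output; not real page data
raw_titles <- c("\n\tStudy A\r\n", "\tStudy B\n")

# Strip carriage returns, newlines and tabs, as in the step above
clean_titles <- gsub("[\r\n\t]", "", raw_titles)

# One row per title, numbered in document order
toy_titles <- data.frame(titles = clean_titles, stringsAsFactors = FALSE)
toy_titles$studyno <- seq_len(nrow(toy_titles))
print(toy_titles)
```

Note that the plain "a" selector picks up every link on a page, so on a page with navigation links it would probably need narrowing before the numbering lines up with the studies.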
As you noted, the content sits in a table, so use the <td> wrapper to get the info and strip the line breaks:
selector_name<-"td"
content<-html_nodes(url, selector_name) %>% html_text()
content <- gsub("[\r\n\t]", "", content)
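The same extraction can be tried offline on a small inline document. This is only a sketch: the HTML string below is made up, and it assumes rvest is installed.

```r
library(rvest)

# Hypothetical inline HTML standing in for the real page
toy_page <- read_html(
  "<html><body><a href='#'>\n\tStudy A\n</a><table><tr><td>Change of Effect</td><td>\n\tdecrease\n</td></tr></table></body></html>"
)

# Pull the text of every <td> cell, then drop line breaks and tabs
toy_cells <- html_nodes(toy_page, "td") %>% html_text()
toy_cells <- gsub("[\r\n\t]", "", toy_cells)
print(toy_cells)
```

The cells come back as one flat character vector in document order, which is why the next step has to fold them back into label/value pairs.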
Then tidy it up a bit and use match to get the df:
df <- as.data.frame(matrix(content, ncol = 2, byrow = TRUE))
df$studyno <- cumsum(df$V1=="Change of Effect")
df$title <- titles$titles[match(df$studyno, titles$studyno)]
head(df,7)
# V1 V2 studyno
#1 Change of Effect decrease 1
#2 Trial Design meta 1
#3 Trial Length na 1
#4 Number of Subjects 392 1
#5 Gender mixed 1
#6 Change of Effect Decrease (Statistically Significant, p-value 2
#7 Trial Design Double Blind 2
#title
#1 Effect of fibre, antispasmodics, and peppermint oil in the treatment ...
#2 Effect of fibre, antispasmodics, and peppermint oil in the treatment ...
#3 Effect of fibre, antispasmodics, and peppermint oil in the treatment ...
#4 Effect of fibre, antispasmodics, and peppermint oil in the treatment ...
#5 Effect of fibre, antispasmodics, and peppermint oil in the treatment ...
#6 Treatment Of Irritable Bowel Syndrome With Peppermint Oil. A Double-...
#7 Treatment Of Irritable Bowel Syndrome With Peppermint Oil. A Double-...
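The reshape, cumsum and match steps can be checked on a toy vector. This is a sketch under two assumptions: every study block starts with a "Change of Effect" row, and the titles are in the same order as the blocks. All values and the `toy_` names are made up.

```r
# Hypothetical flattened <td> text: label, value, label, value, ...
toy_content <- c("Change of Effect", "decrease",
                 "Trial Design",     "meta",
                 "Change of Effect", "increase",
                 "Trial Design",     "double blind")

# Hypothetical title lookup, one row per study
toy_titles <- data.frame(titles = c("Study A", "Study B"),
                         studyno = 1:2,
                         stringsAsFactors = FALSE)

# Fold the flat vector into two columns, filling row by row
toy_df <- as.data.frame(matrix(toy_content, ncol = 2, byrow = TRUE),
                        stringsAsFactors = FALSE)

# Each "Change of Effect" row starts a new study, so a running count labels the block
toy_df$studyno <- cumsum(toy_df$V1 == "Change of Effect")

# Look up the title for each row by its study number
toy_df$title <- toy_titles$titles[match(toy_df$studyno, toy_titles$studyno)]
print(toy_df)
```

If a study on the real page lacks a "Change of Effect" row, the running count would drift and the titles would be attached to the wrong blocks, so it is worth spot-checking a few rows.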