我可以使用以下功能获取所有可用软件包的列表:
ap <- available.packages()
但是我怎样才能从 R 中获取这些包的描述,所以我可以有data.frame
两列:包和描述?
我可以使用以下功能获取所有可用软件包的列表:
ap <- available.packages()
但是我怎样才能从 R 中获取这些包的描述,所以我可以有data.frame
两列:包和描述?
编辑一个近十年的公认答案。您可能想要的不是刮(除非您想练习刮),而是使用现有界面:tools::CRAN_package_db()
. 例子:
> db <- tools::CRAN_package_db()[, c("Package", "Description")]
> dim(db)
[1] 18978 2
>
该函数带来(当前)66 列,这里感兴趣的是其中的一部分。
我实际上认为你想要“包”和“标题”作为“描述”可以运行到几行。所以这是前者,如果你真的想要“描述”,只需将“描述”放在最后的子集中:
R> ## from http://developer.r-project.org/CRAN/Scripts/depends.R and adapted
R>
R> require("tools")
R>
R> getPackagesWithTitle <- function() {
+ contrib.url(getOption("repos")["CRAN"], "source")
+ description <- sprintf("%s/web/packages/packages.rds",
+ getOption("repos")["CRAN"])
+ con <- if(substring(description, 1L, 7L) == "file://") {
+ file(description, "rb")
+ } else {
+ url(description, "rb")
+ }
+ on.exit(close(con))
+ db <- readRDS(gzcon(con))
+ rownames(db) <- NULL
+
+ db[, c("Package", "Title")]
+ }
R>
R>
R> head(getPackagesWithTitle()) # I shortened one Title here...
Package Title
[1,] "abc" "Tools for Approximate Bayesian Computation (ABC)"
[2,] "abcdeFBA" "ABCDE_FBA: A-Biologist-Can-Do-Everything of Flux ..."
[3,] "abd" "The Analysis of Biological Data"
[4,] "abind" "Combine multi-dimensional arrays"
[5,] "abn" "Data Modelling with Additive Bayesian Networks"
[6,] "AcceptanceSampling" "Creation and evaluation of Acceptance Sampling Plans"
R>
Dirk 提供了一个非常棒的答案,在完成我的解决方案然后看到他的解决方案后,我辩论了一段时间发布我的解决方案,因为害怕看起来很傻。但我还是决定发布它,原因有两个:
我接近这个想法我需要做一些网络抓取并选择 crantastic 作为要抓取的网站。首先,我将提供代码,然后提供两个对我学习非常有帮助的抓取资源:
library(RCurl)
library(XML)
URL <- "http://cran.r-project.org/web/checks/check_summary.html#summary_by_package"
packs <- na.omit(XML::readHTMLTable(doc = URL, which = 2, header = T,
strip.white = T, as.is = FALSE, sep = ",", na.strings = c("999",
"NA", " "))[, 1])
Trim <- function(x) {
gsub("^\\s+|\\s+$", "", x)
}
packs <- unique(Trim(packs))
u1 <- "http://crantastic.org/packages/"
len.samps <- 10 #for demo purpose; use:
#len.samps <- length(packs) # for all of them
URL2 <- paste0(u1, packs[seq_len(len.samps)])
scraper <- function(urls){ #function to grab description
doc <- htmlTreeParse(urls, useInternalNodes=TRUE)
nodes <- getNodeSet(doc, "//p")[[3]]
return(nodes)
}
info <- sapply(seq_along(URL2), function(i) try(scraper(URL2[i]), TRUE))
info2 <- sapply(info, function(x) { #replace errors with NA
if(class(x)[1] != "XMLInternalElementNode"){
NA
} else {
Trim(gsub("\\s+", " ", xmlValue(x)))
}
}
)
pack_n_desc <- data.frame(package=packs[seq_len(len.samps)],
description=info2) #make a dataframe of it all
资源:
我想尝试使用 HTML 刮板(rvest)作为练习,因为available.packages()
in OP 不包含包Descriptions。
library('rvest')
url <- 'https://cloud.r-project.org/web/packages/available_packages_by_name.html'
webpage <- read_html(url)
data_html <- html_nodes(webpage,'tr td')
length(data_html)
P1 <- html_nodes(webpage,'td:nth-child(1)') %>% html_text(trim=TRUE) # XML: The Package Name
P2 <- html_nodes(webpage,'td:nth-child(2)') %>% html_text(trim=TRUE) # XML: The Description
P1 <- P1[lengths(P1) > 0 & P1 != ""] # Remove NULL and empty ("") items
length(P1); length(P2);
mdf <- data.frame(P1, P2, row.names=NULL)
colnames(mdf) <- c("PackageName", "Description")
# This is the problem! It lists large sets column-by-column,
# instead of row-by-row. Try with the full list to see what happens.
print(mdf, right=FALSE, row.names=FALSE)
# PackageName Description
# A3 Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModels
# abbyyR Access to Abbyy Optical Character Recognition (OCR) API
# abc Tools for Approximate Bayesian Computation (ABC)
# abc.data Data Only: Tools for Approximate Bayesian Computation (ABC)
# ABC.RAP Array Based CpG Region Analysis Pipeline
# ABCanalysis Computed ABC Analysis
# For small sets we can use either:
# mdf[1:6,] #or# head(mdf, 6)
但是,尽管对于小型数组/数据框列表(子集)工作得很好,但我遇到了完整列表的显示问题,其中数据将逐列显示或未对齐。如果能以某种方式在新窗口中分页并正确格式化,我会很高兴。我尝试使用page,但我无法让它很好地工作。
编辑: 推荐的方法不是上面的,而是使用 Dirk 的建议(来自下面的评论):
db <- tools::CRAN_package_db()
colnames(db)
mdf <- data.frame(db[,1], db[,52])
colnames(mdf) <- c("Package", "Description")
print(mdf, right=FALSE, row.names=FALSE)
但是,这仍然受到提到的显示问题的影响......