r - Ten highest column values in an R data frame

Question

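I am currently working on a project to extract keywords from blocks of text. Below is an example of the first three items in the initial list. (Apologies for the length.)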

descriptest<-c("Columbia University is one of the world's most important centers of research and at the same time a distinctive and distinguished learning environment for undergraduates and graduate students in many scholarly and professional fields. The University recognizes the importance of its location in New York City and seeks to link its research and teaching to the vast resources of a great metropolis. It seeks to attract a diverse and international faculty and student body, to support research and teaching on global issues, and to create academic relationships with many countries and regions. It expects all areas of the university to advance knowledge and learning at the highest level and to convey the products of its efforts to the world.", 
"", "UMass Amherst was born in 1863 as a land-grant agricultural college set on 310 rural acres with four faculty members, four wooden buildings, 56 students and a curriculum combining modern farming, science, technical courses, and liberal arts.\n\nOver time, the curriculum, facilities, and student body outgrew the institution's original mission. In 1892 the first female student enrolled and graduate degrees were authorized. By 1931, to reflect a broader curriculum, \"Mass Aggie\" had become Massachusetts State College. In 1947, \"Mass State\" became the University of Massachusetts at Amherst.\n\nImmediately after World War II, the university experienced rapid growth in facilities, programs and enrollment, with 4000 students in 1954. By 1964, undergraduate enrollment jumped to 10,500, as Baby Boomers came of age. The turbulent political environment also brought a \"sit-in\" to the newly constructed Whitmore Administration Building. By the end of the decade, the completion of Southwest Residential Complex, the Alumni Stadium and the establishment of many new academic departments gave UMass Amherst much of its modern stature.\n\nIn the 1970s continued growth gave rise to a shuttle bus service on campus as well as several important architectural additions: the Murray D. Lincoln Campus Center, with a hotel, office space, fine dining restaurant, campus store and passageway to a multi-level parking garage; the W.E.B. Du Bois Library, named \"tallest library in the world\" upon its completion in 1973; and the Fine Arts Center, with performance space for world-class music, dance and theater.\n\nThe next two decades saw the emergence of UMass Amherst as a major research facility with the construction of the Lederle Graduate Research Center and the Conte National Polymer Research Center. Other programs excelled as well. In 1996 UMass Basketball became Atlantic 10 Conference champs and went to the NCAA Final Four. Before the millennium, both the William D. 
Mullins Center, a multi-purpose sports and convocation facility, and the Paul Robsham Visitors Center bustled with activity, welcoming thousands of visitors to the campus each year.\n\nUMass Amherst entered the 21st century as the flagship campus of the state's five-campus University system, and enrollment of nearly 24,000 students and a national and international reputation for excellence.")

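I would like to do this in R with the tm package, because a DocumentTermMatrix is a clean matrix to work with on large data. I also use TfIdf weighting to compare keywords across the corpus against keywords within each entry itself.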

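I am stuck: I can use max.col to get the top keyword, but my matrix has several maxima with the same value, and I don't just want the maximum, I really want a list of the top ten values. Here is the sample code: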

library(RWeka)
library(tm)
library(koRpus)
library(RKEA)
library(corpora)
library(wordcloud)
library(plyr)

changeindextoname <- function(indexnumber) {
  name <- colnames(z2[indexnumber])
  return(name)
}

removestuff <- function(d) {
  d <- tm_map(d, tolower)
  d <- tm_map(d, removePunctuation)
  d <- tm_map(d, removeNumbers)
  d <- tm_map(d, stripWhitespace)
  d <- tm_map(d, skipWords)
  d <- tm_map(d, removeWords, stopwords('english'))
}

descripcorpora<-Corpus(VectorSource(descriptest))
descripcorpora<-removestuff(descripcorpora)
ddtm <- DocumentTermMatrix(descripcorpora, control = list(weighting=weightTfIdf, stopwords=T))
f2<-as.data.frame(inspect(ddtm))
z2<-f2
z3<-max.col(z2)
dfwithmax<-cbind(z3, z2)
dfwithmax$word<-lapply(dfwithmax$z3, changeindextoname)
finaldf<-subset(dfwithmax, select=c("z3", "word", "learning", "tallest", "center", "seeks", "teaching"))

finaldf looks like this:

finaldf
  z3     word   learning     tallest     center      seeks   teaching
1 106 learning 0.04953008 0.000000000 0.00000000 0.04953008 0.04953008
2 183  tallest 0.00000000 0.000000000 0.00000000 0.00000000 0.00000000
3  35   center 0.00000000 0.007204375 0.04322625 0.00000000 0.00000000

This approach seems to work, but in row 1 it cannot account for the fact that "seeks", "learning", and "teaching" all have the same value.

Also, when all columns are zero (as in row 2), max.col still returns an index. How do I get rid of that?

I am trying to avoid looping over columns or rows, because the matrix is very large and that takes a long time.

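I would appreciate any suggestions or ideas on how to write a function that can be applied to, or iterate over, each column and collect the results in a list, so that I can then apply the changeindextoname function and return the column names in a list.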

Thanks in advance!

score 2 · Accepted Answer

For each document, the five highest values:

apply(as.matrix(ddtm),1,function(x) 
         colnames(as.matrix(ddtm))[order(x,decreasing=TRUE)[1:5]])

  Docs
       1            2            3        
  [1,] "teaching"   "york"       "center" 
  [2,] "seeks"      "year"       "umass"  
  [3,] "learning"   "worlds"     "campus" 
  [4,] "university" "worldclass" "amherst"
  [5,] "research"   "world"      "four"
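
Document 2 is an empty string, so every term listed for it above has weight zero. The following is a minimal base-R sketch of one way to drop zero-weight terms and, optionally, keep all terms tied at the cutoff. The matrix `m` is a toy stand-in for `as.matrix(ddtm)` with made-up weights, and `top_terms` and its `keep_ties` argument are hypothetical names introduced here for illustration, not part of tm.

```r
# Toy stand-in for as.matrix(ddtm): rows are documents, columns are terms.
# The weights are illustrative only.
m <- matrix(
  c(0.05, 0.00, 0.00, 0.05, 0.05,
    0.00, 0.00, 0.00, 0.00, 0.00,
    0.00, 0.01, 0.04, 0.00, 0.00),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs  = c("1", "2", "3"),
                  Terms = c("learning", "tallest", "center", "seeks", "teaching"))
)

# Hypothetical helper: names of the n highest-weighted terms in one row.
# Zero-weight terms are dropped, so an all-zero row gives an empty result.
# With keep_ties = TRUE, every term tied with the n-th value is also kept.
top_terms <- function(x, n = 10, keep_ties = FALSE) {
  x <- sort(x[x > 0], decreasing = TRUE)
  if (length(x) == 0) return(character(0))
  k <- min(n, length(x))
  if (keep_ties) k <- sum(x >= x[k])
  names(x)[seq_len(k)]
}

# lapply() over row indices instead of apply(), because documents can
# return different numbers of terms and a list holds ragged results.
top <- lapply(seq_len(nrow(m)), function(i) top_terms(m[i, ], n = 2))
names(top) <- rownames(m)

# Same cutoff, but keeping all terms tied at the second position.
top_with_ties <- lapply(seq_len(nrow(m)),
                        function(i) top_terms(m[i, ], n = 2, keep_ties = TRUE))
names(top_with_ties) <- rownames(m)
```

This still visits every row once, and it assumes the dense matrix fits in memory; for a very large DocumentTermMatrix that assumption may not hold.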

Note that you did not provide the code for skipWords, so I used this:

skipWords <- function(x) removeWords(x, c(stopwords("english")))

Also take a look at tm_reduce to rewrite the removestuff function:

removestuff <- tm_reduce(x, list(tolower, removePunctuation, ...))
