
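I want to stem the documents in a corpus of plain-text documents using the tm package in R. When I apply the SnowballStemmer function to all documents in the corpus, only the last word of each document gets stemmed.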

library(tm)
library(Snowball)
library(RWeka)
library(rJava)
path <- c("C:/path/to/directory")
corp <- Corpus(DirSource(path),
               readerControl = list(reader = readPlain, language = "en_US",
                                    load = TRUE))
tm_map(corp,SnowballStemmer) #stemDocument has the same problem

I think this has to do with the way the documents are read into the corpus. Some simple examples illustrate the point:

> vec<-c("running runner runs","happyness happies")
> stemDocument(vec) 
   [1] "running runner run" "happyness happi" 

> vec2<-c("running","runner","runs","happyness","happies")
> stemDocument(vec2)
   [1] "run"    "runner" "run"    "happy"  "happi" <- 

> corp<-Corpus(VectorSource(vec))
> corp<-tm_map(corp, stemDocument)
> inspect(corp)
   A corpus with 2 text documents

   The metadata consists of 2 tag-value pairs and a data frame
   Available tags are:
     create_date creator 
   Available variables in the data frame are:
     MetaID 

   [[1]]
   run runner run

   [[2]]
   happy happi

> corp2<-Corpus(DirSource(path),readerControl=list(reader=readPlain,language="en_US" ,  load=T))
> corp2<-tm_map(corp2, stemDocument)
> inspect(corp2)
   A corpus with 2 text documents

   The metadata consists of 2 tag-value pairs and a data frame
     Available tags are:
     create_date creator 
   Available variables in the data frame are:
     MetaID 

   $`1.txt`
   running runner runs

   $`2.txt`
   happyness happies

2 Answers


Load the required libraries

library(tm)
library(Snowball)

Create the vector

vec<-c("running runner runs","happyness happies")

Create a corpus from the vector

vec<-Corpus(VectorSource(vec))

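It is very important to check the class of the documents in our corpus and to preserve it, because we want a standard corpus that the R functions can understand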

class(vec[[1]])

vec[[1]]
<<PlainTextDocument (metadata: 7)>>
running runner runs

This will most likely report PlainTextDocument

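So now we fix our faulty stemDocument function. First we convert the plain text document to character, then we split the text into words, apply stemDocument (which now works correctly on single words), and paste the result back together. Most importantly, we convert the output back into the PlainTextDocument class provided by the tm package.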

stemDocumentfix <- function(x)
{
    PlainTextDocument(paste(stemDocument(unlist(strsplit(as.character(x), " "))),collapse=' '))
}

Now we can use the standard tm_map on our corpus

vec1 = tm_map(vec, stemDocumentfix)

The result is

vec1[[1]]
<<PlainTextDocument (metadata: 7)>>
run runner run

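The most important thing to remember is to always preserve the document class in the corpus. I hope this is a simple solution to your problem, since it uses only functions from the two loaded libraries.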
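The same split, stem, and paste idea can be sketched without depending on a particular stemmer. This is only an illustration under stated assumptions: `stem_each_word` and its `stem_fun` argument are hypothetical names, not part of tm. The usage part assumes that the SnowballC package (for `wordStem`) and a tm version that provides `content_transformer` are installed, and it is skipped otherwise.

```r
# Hypothetical helper (not part of tm): stem every whitespace-separated
# word in `text`. `stem_fun` is any function that maps a character vector
# of single words to a character vector of stems.
stem_each_word <- function(text, stem_fun) {
  words <- unlist(strsplit(text, "[[:space:]]+"))
  words <- words[nzchar(words)]           # drop empty tokens
  paste(stem_fun(words), collapse = " ")  # rebuild a single string
}

# Usage sketch, run only if the assumed packages are available.
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("SnowballC", quietly = TRUE)) {
  corp <- tm::VCorpus(tm::VectorSource(c("running runner runs",
                                         "happyness happies")))
  # content_transformer() applies the function to the document content
  # and keeps the document class, so no manual re-wrapping is needed.
  stem_doc <- tm::content_transformer(
    function(x) stem_each_word(x, SnowballC::wordStem)
  )
  corp <- tm::tm_map(corp, stem_doc)
  print(as.character(corp[[1]]))
}
```

Note that a document whose content spans several lines is collapsed into one string by this helper, which may or may not be what you want.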

answered 2014-08-22T05:42:48.663

The problem I see is that wordStem takes a vector of words, but the Corpus plainTextReader assumes that each word in the documents it reads is on its own line. In other words, the following text would confuse plainTextReader, because you would end up with 3 "words" in your document:

From ancient grudge break to new mutiny,
Where civil blood makes civil hands unclean.
From forth the fatal loins of these two foes

Instead, the document should be:

From
ancient
grudge
break
to
new
mutiny
where 
civil
...etc...

Note that punctuation also confuses wordStem, so you would have to take it out as well.

Another way to do this without modifying your actual documents is to define a function that splits the text into words and removes any non-alphanumeric characters that appear before or after a word. Here is a simple one:

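wordStem2 <- function(x) {
    # Split the document text into individual words on spaces.
    mywords <- unlist(strsplit(x, " "))
    # Trim non-word characters from the start and end of each word.
    mycleanwords <- gsub("^\\W+|\\W+$", "", mywords, perl = TRUE)
    # Drop entries that are empty after trimming.
    mycleanwords <- mycleanwords[mycleanwords != ""]
    wordStem(mycleanwords)
}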

corpA <- tm_map(mycorpus, wordStem2)
corpB <- Corpus(VectorSource(corpA))

Now just use corpB as your usual Corpus.
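
To see what the cleanup step inside wordStem2 does on its own, here is a small base R sketch that needs no extra packages. The input line is taken from the sample text above, with one made-up token in parentheses added for illustration.

```r
# One line of the sample text, plus a made-up token wrapped in parentheses.
line <- "From ancient grudge break to new mutiny, (civil-blood)"

# Split on single spaces, as wordStem2 does.
words <- unlist(strsplit(line, " "))

# Strip non-word characters only at the start or end of each token.
clean <- gsub("^\\W+|\\W+$", "", words, perl = TRUE)

# Drop tokens that became empty after trimming.
clean <- clean[clean != ""]

print(clean)
```

The trailing comma and the parentheses are removed, while the hyphen inside "civil-blood" is kept, because the pattern is anchored to the ends of each token.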

answered 2011-09-04T22:19:29.737