r - Computing a term document matrix while matching words inside strings

Question

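Posting this separately so that other users can find it easily.

The question is about how the tm package currently computes a term document matrix. I want to tweak that slightly, as described below.

Currently, a term document matrix is built by looking for a term such as "milky" as a separate word in each document, not as a string. For example, suppose we have 2 documents: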

 document 1: "this is a milky way galaxy"
 document 2: "this is a milkyway galaxy"

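With the way the current algorithm (the tm package) works, "milky" is found in the first document but not in the second, because the algorithm looks for the term milky as a separate word. If the algorithm instead searched for milky as a string, the way the function grepl does, it would also find the term "milky" in the second document.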

grepl('milky', 'this is a milkyway galaxy')
TRUE

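Can someone help me create a term document matrix that meets this requirement, that is, one that finds the term milky in both documents? Note that I do not want a solution specific to one word such as milky. I want a general solution that I can apply at a larger scale to handle all such cases. The solution does not have to use the tm package; I only need to end up with a term document matrix that meets the requirement. In other words, every term in the matrix should be looked up as a string (not just as a word) within the full text of each document, the way grepl would do it.

The code I currently use to get the term document matrix is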

doc1 <-  "this is a document about milkyway"
doc2 <-  "milky way is huge"

library(tm)
# VectorSource reads the character vector directly; recent tm releases
# expect doc_id and text columns when DataframeSource is used instead
tmp.corpus <- Corpus(VectorSource(c(doc1, doc2)))
tmpDTM <- TermDocumentMatrix(tmp.corpus,
                             control = list(tolower = TRUE, removeNumbers = TRUE,
                                            removePunctuation = TRUE, stopwords = TRUE,
                                            wordLengths = c(2, Inf)))
tmp.df <- as.data.frame(as.matrix(tmpDTM))
tmp.df

         1 2
document 1 0
huge     0 1
milky    0 1
milkyway 1 0
way      0 1

score 0 · Accepted Answer

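I'm not sure that tm makes it easy (or even possible) to select or group features based on regular expressions. The text package quanteda does, however, through a thesaurus argument that groups terms according to a dictionary when it constructs its document-feature matrix.

(quanteda uses the generic term "feature" because here your category is the set of terms containing the phrase "milky", rather than the original "terms".)

The valuetype argument can be "glob" (the default), a regular expression ("regex"), or a literal match ("fixed"). Below I show the glob and regular expression versions.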

require(quanteda)
myDictGlob <- dictionary(list(containsMilky = c("milky*")))
myDictRegex <- dictionary(list(containsMilky = c("^milky")))
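To illustrate why the glob and regex patterns above select the same tokens, here is a small base R sketch. It does not use quanteda at all; the token vector `toks` is made up for illustration. Note that both patterns are anchored at the start of a token, so a token such as "supermilky" would presumably not match either of them, whereas a plain substring search in the style of grepl would.

```r
# Made-up tokens, as they might look after lowercasing and splitting on whitespace
toks <- c("milky", "milkyway", "way", "supermilky")

# glob2rx() (utils package) converts a glob into an equivalent regular expression
glob_as_regex <- glob2rx("milky*")   # gives "^milky"

# Anchored at the start of the token, like "milky*" or "^milky"
starts_with <- grepl(glob_as_regex, toks)

# Literal substring search anywhere in the token
contains <- grepl("milky", toks, fixed = TRUE)

print(data.frame(token = toks, starts_with, contains))
```

If matches anywhere inside a token are wanted, the pattern would have to drop the start anchor.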

(plainDfm <- dfm(c(doc1, doc2)))
## Creating a dfm from a character vector ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 2 documents
## ... indexing features: 9 feature types
## ... created a 2 x 9 sparse dfm
## ... complete. 
## Elapsed time: 0.008 seconds.
## Document-feature matrix of: 2 documents, 9 features.
## 2 x 9 sparse Matrix of class "dfmSparse"
## features
## docs    this is a document about milkyway milky way huge
## text1    1  1 1        1     1        1     0   0    0
## text2    0  1 0        0     0        0     1   1    1

dfm(c(doc1, doc2), thesaurus = myDictGlob, valuetype = "glob", verbose = FALSE)
## Document-feature matrix of: 2 documents, 8 features.
## 2 x 8 sparse Matrix of class "dfmSparse"
##       this is a document about way huge CONTAINSMILKY
## text1    1  1 1        1     1   0    0             1
## text2    0  1 0        0     0   1    1             1
dfm(c(doc1, doc2), thesaurus = myDictRegex, valuetype = "regex", verbose = FALSE)
## Document-feature matrix of: 2 documents, 8 features.
## 2 x 8 sparse Matrix of class "dfmSparse"
##       this is a document about way huge CONTAINSMILKY
## text1    1  1 1        1     1   0    0             1
## text2    0  1 0        0     0   1    1             1
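If a solution without any text package is acceptable, the grepl idea from the question can also be applied directly. The following is only a rough sketch in base R: `substring_tdm` is a hypothetical helper name, the vocabulary is assumed to be every whitespace-separated word in the corpus, and the cells record presence (0 or 1) rather than counts.

```r
doc1 <- "this is a document about milkyway"
doc2 <- "milky way is huge"
docs <- c(doc1, doc2)

# Hypothetical helper: terms in rows, documents in columns,
# 1 where the term occurs anywhere in the document text
substring_tdm <- function(docs, terms = NULL) {
  docs <- tolower(docs)
  if (is.null(terms)) {
    # Assumed default vocabulary: every distinct word in the corpus
    terms <- sort(unique(unlist(strsplit(docs, "\\s+"))))
    terms <- terms[nzchar(terms)]
  }
  # fixed = TRUE treats each term as a literal string, not a regex
  hits <- lapply(docs, function(d) {
    vapply(terms, grepl, logical(1), x = d, fixed = TRUE, USE.NAMES = FALSE)
  })
  # Each document's hits fill one column
  matrix(as.integer(unlist(hits)), nrow = length(terms),
         dimnames = list(terms, as.character(seq_along(docs))))
}

tdm <- substring_tdm(docs)
print(tdm)
```

One consequence of substring matching is that short terms match almost everywhere: "a" is found inside "way", and "way" inside "milkyway". Passing an explicit vocabulary, for example `substring_tdm(docs, terms = c("milky", "huge"))`, keeps the matrix limited to the terms of interest.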
