r - How to keep punctuation with TermDocumentMatrix() in R

Question

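I have a large data frame in which I identify patterns in strings and then extract them. I have provided a small subset to illustrate my task. I generate my patterns by creating a TermDocumentMatrix with multi-word terms. I use these patterns with stri_extract and str_replace from the stringi and stringr packages to search within the "punct_prob" data frame.

My problem is that I need to keep the punctuation in "punct_prob$description" intact to preserve the literal meaning of each string. For example, I can't have 2.35mm become 235mm. However, the TermDocumentMatrix procedure I am using removes the punctuation (or at least the periods), so my pattern-search functions can't match them.

In short... how do I keep punctuation when generating a TDM? I tried including removePunctuation=FALSE in the TermDocumentMatrix control parameter, but with no success.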

library(tm)
punct_prob = data.frame(description = tolower(c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
                                    "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
                                    "TITANIUM LINE POWER P. B F.O. TRIP SPR",
                                    "MEDESY SPECIAL ITEM")))

punct_prob$description = as.character(punct_prob$description)

# a control for the number of words in phrases
max_ngram = max(sapply(strsplit(punct_prob$description, " "), length))

#set up ngrams and tdm
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = max_ngram, max = max_ngram))}
punct_prob_corpus = Corpus(VectorSource(punct_prob$description))
punct_prob_tdm <- TermDocumentMatrix(punct_prob_corpus, control = list(tokenize = BigramTokenizer, removePunctuation=FALSE))
inspect(punct_prob_tdm)

Inspection result - no punctuation....

                                   Docs
Terms                              1 2 3 4
  angle head 2 1 for 2 35mm bur    1 0 0 0
  contra angle head 2 1 for 2 35mm 1 0 0 0
  line mini p b f o trip spray     0 1 0 0
  line power p b f o trip spr      0 0 1 0
  titanium line mini p b f o trip  0 1 0 0
  titanium line power p b f o trip 0 0 1 0

Thanks in advance for your help :)

score 3 · Accepted Answer

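The problem is not so much TermDocumentMatrix as the RWeka-based n-gram tokenizer. RWeka removes punctuation when it tokenizes.

If you use the NLP tokenizer instead, the punctuation is preserved. See the code below.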
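To illustrate why a tokenizer that splits only on whitespace keeps "2.35mm" and "p.b" intact, here is a minimal base R sketch. The helper name whitespace_ngrams() is hypothetical and is not part of tm, NLP, or RWeka; it only mimics the idea of forming n-grams from whitespace-separated words.

```r
# Hypothetical helper (not from tm/NLP): build n-grams from whitespace-separated words.
whitespace_ngrams <- function(text, n) {
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]               # drop empty strings from leading whitespace
  if (length(tokens) < n) return(character(0))   # too few words: no n-gram can be formed
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Punctuation inside words is untouched because only whitespace is used as a separator.
whitespace_ngrams("contra angle head 2:1 for 2.35mm bur", 7)
whitespace_ngrams("titanium line mini p.b f.o. trip spray", 6)
```

Because the split pattern is whitespace only, characters such as "." and ":" are never treated as separators, which is the property the search patterns depend on.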

P.S. I removed a space in your third text line, so that "P. B" is "P.B", as it is in line 2.

library(tm)
punct_prob = data.frame(description = tolower(c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
                                                "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
                                                "TITANIUM LINE POWER P.B F.O. TRIP SPR",
                                                "MEDESY SPECIAL ITEM")))
punct_prob$description = as.character(punct_prob$description)

max_ngram = max(sapply(strsplit(punct_prob$description, " "), length))

punct_prob_corpus = Corpus(VectorSource(punct_prob$description))

# n-gram tokenizer built from ngrams() and words() in the NLP package (attached by tm)
NLPBigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), max_ngram), paste, collapse = " "), use.names = FALSE)
}

punct_prob_tdm <- TermDocumentMatrix(punct_prob_corpus, control = list(tokenize = NLPBigramTokenizer))
inspect(punct_prob_tdm)

<<TermDocumentMatrix (terms: 3, documents: 4)>>
Non-/sparse entries: 3/9
Sparsity           : 75%
Maximal term length: 38
Weighting          : term frequency (tf)

                                        Docs
Terms                                    1 2 3 4
  contra angle head 2:1 for 2.35mm bur   1 0 0 0
  titanium line mini p.b f.o. trip spray 0 1 0 0
  titanium line power p.b f.o. trip spr  0 0 1 0
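One caveat, stated as an assumption about newer tm releases rather than something from this answer: as far as I know, Corpus() there can return a SimpleCorpus, for which custom tokenizer functions are ignored. If the output above cannot be reproduced, a variant using VCorpus() may help. The names descriptions, vcorpus, and ngram_tokenizer below are illustrative only.

```r
library(tm)  # tm depends on NLP, which provides ngrams() and words()

descriptions <- tolower(c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
                          "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
                          "TITANIUM LINE POWER P.B F.O. TRIP SPR",
                          "MEDESY SPECIAL ITEM"))

# Length (in words) of the longest description; used as the n-gram size
max_ngram <- max(lengths(strsplit(descriptions, " ")))

# Assumption: the installed tm version honours custom tokenizers for VCorpus objects
vcorpus <- VCorpus(VectorSource(descriptions))

# Same tokenizer idea as in the answer: n-grams over whitespace-separated words
ngram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), max_ngram), paste, collapse = " "), use.names = FALSE)
}

tdm <- TermDocumentMatrix(vcorpus, control = list(tokenize = ngram_tokenizer))
Terms(tdm)
```

Terms(tdm) returns the term strings, which can then be used as search patterns in stringi or stringr.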

score 1 · Accepted Answer

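The quanteda package is smart enough to tokenize without treating intra-word punctuation characters as "punctuation". This makes it very easy to construct your matrix: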

txt <- c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
         "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
         "TITANIUM LINE POWER P.B F.O. TRIP SPR",
         "MEDESY SPECIAL ITEM")

require(quanteda)
myDfm <- dfm(txt, ngrams = 6:8, concatenator = " ")
t(myDfm)
#                                        docs
# features                                text1 text2 text3 text4
#   contra angle head for 2.35mm bur          1     0     0     0
#   titanium line mini p.b f.o trip           0     1     0     0
#   line mini p.b f.o trip spray              0     1     0     0
#   titanium line mini p.b f.o trip spray     0     1     0     0
#   titanium line power p.b f.o trip          0     0     1     0
#   line power p.b f.o trip spr               0     0     1     0
#   titanium line power p.b f.o trip spr      0     0     1     0

If you want to keep the "punctuation", it will be tokenized as a separate token wherever it ends a term:

myDfm2 <- dfm(txt, ngrams = 8, concatenator = " ", removePunct = FALSE)
t(myDfm2)
#                                          docs
# features                                  text1 text2 text3 text4
#   titanium line mini p.b f.o . trip spray     0     1     0     0
#   titanium line power p.b f.o . trip spr      0     0     1     0

Note that the ngrams argument is fully flexible and can take a vector of n-gram sizes, as in the first example, where ngrams = 6:8 indicates that it should form 6-, 7-, and 8-grams.
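The dfm(txt, ngrams = ...) call above reflects the quanteda interface at the time of this answer. In more recent quanteda releases, to the best of my knowledge, n-grams are formed on a tokens object before the matrix is built. A hedged sketch of that workflow follows; the variable names toks and ngram_toks are illustrative, and the exact tokenization of strings such as "2:1" or "F.O." may differ between versions, so no output is shown.

```r
library(quanteda)

txt <- c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
         "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
         "TITANIUM LINE POWER P.B F.O. TRIP SPR",
         "MEDESY SPECIAL ITEM")

# Tokenize first; remove_punct = TRUE drops stand-alone punctuation tokens
toks <- tokens(txt, remove_punct = TRUE)

# Form 6-, 7-, and 8-grams, joined with a space instead of the default concatenator
ngram_toks <- tokens_ngrams(toks, n = 6:8, concatenator = " ")

# Build the document-feature matrix (dfm() lowercases features by default)
myDfm <- dfm(ngram_toks)
featnames(myDfm)
```

Setting remove_punct = FALSE in tokens() is the counterpart of removePunct = FALSE in the second example.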
