r - How to keep the beginning- and end-of-sentence markers with quanteda

Question

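I am trying to create 3-grams using R's quanteda package.

I am struggling to find a way to keep the beginning- and end-of-sentence markers in the n-grams, that is, the <s> and </s> in the code below.

I thought that using keptFeatures with a regular expression matching them would keep them, but the chevron markers are always removed.

How can I prevent the chevron markers from being removed, or what is the best way to delimit the beginning and end of a sentence with quanteda?

As a bonus question: what is the advantage of docfreq(mydfm) over colSums(mydfm)? The results of str(colSums(mydfm)) and str(docfreq(mydfm)) are almost identical (Named num [1:n] for the former, Named int [1:n] for the latter).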

library(quanteda)
text <- "<s>I'm a sentence and I'd better be formatted properly!</s><s>I'm a second sentence</s>"

qc <- corpus(text)

mydfm  <- dfm(qc, ngram=3, removeNumbers = F, stem=T, keptFeatures="\\</?s\\>")

names(colSums(mydfm))

# Output:
# [1] "s_i'm_a"    "i'm_a_sentenc"    "a_sentenc_and"    "sentenc_and_i'd"
# [2] "and_i'd_better"   "i'd_better_be"    "better_be_format"   
# [3] "be_format_proper" "format_proper_s"  "proper_s_s"   "s_s_i'm"    
# [4] "i'm_a_second"   "a_second_sentenc"   "second_sentenc_s"

Edit:

Corrected keepFeatures to keptFeatures in the code snippet.

score 2 · Accepted Answer

To return a simple vector, just unlist the tokenizedText object returned from tokenize() (which is a specially classed list, with additional attributes). Here I used what = "fasterword", which splits on "\\s"; it is a tiny bit smarter than what = "fastestword", which splits on " ".

# how to not remove the <s>, and return a vector 
unlist(toks <- tokenize(text, ngrams = 3, what = "fasterword"))
## [1] "<s>I'm_a_sentence"                "a_sentence_and"                  
## [3] "sentence_and_I'd"                 "and_I'd_better"                  
## [5] "I'd_better_be"                    "better_be_formatted"             
## [7] "be_formatted_properly!</s><s>I'm" "formatted_properly!</s><s>I'm_a" 
## [9] "properly!</s><s>I'm_a_second"     "a_second_sentence</s>"

To keep the n-grams within sentences, tokenize the object twice: first by sentence, then by fasterword.

# keep it within sentence
(sents <- unlist(tokenize(text, what = "sentence")))
## [1] "<s>I'm a sentence and I'd better be formatted properly!"
## [2] "</s><s>I'm a second sentence</s>" 
tokenize(sents, ngrams = 3, what = "fasterword")
## tokenizedText object from 2 documents.
## Component 1 :
## [1] "<s>I'm_a_sentence"      "a_sentence_and"         "sentence_and_I'd"       "and_I'd_better"        
## [5] "I'd_better_be"          "better_be_formatted"    "be_formatted_properly!"
## 
## Component 2 :
## [1] "</s><s>I'm_a_second"   "a_second_sentence</s>"
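In the output above, the sentence tokenizer leaves the closing </s> of the first sentence attached to the start of the second. If that matters, one option is to split on the markers themselves before forming n-grams. The following is a minimal base R sketch of that idea and does not use quanteda at all; split_marked() and word_ngrams() are hypothetical helper names introduced only for illustration, and the approach assumes every sentence is wrapped in a well-formed <s>...</s> pair.

```r
# Base R only; split_marked() and word_ngrams() are hypothetical helpers,
# not part of quanteda.
text <- "<s>I'm a sentence and I'd better be formatted properly!</s><s>I'm a second sentence</s>"

# Extract every <s>...</s> span, markers included (non-greedy match).
split_marked <- function(x) {
  regmatches(x, gregexpr("<s>.*?</s>", x, perl = TRUE))[[1]]
}

# Build word n-grams from a single sentence, splitting on whitespace.
word_ngrams <- function(sentence, n = 3L, sep = "_") {
  words <- strsplit(sentence, "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1L)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1L)], collapse = sep),
         character(1))
}

sents <- split_marked(text)

# N-grams are formed per sentence, so none of them spans a sentence boundary.
trigrams <- unlist(lapply(sents, word_ngrams, n = 3L))
print(trigrams)
```

Because each sentence is processed separately, the markers stay glued to the first and last word of their own sentence, and no trigram combines </s> with the following <s>.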

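To keep the chevron markers in a dfm, you can pass the same options that you used in the tokenize() call above, since dfm() calls tokenize() but with different defaults: it uses the options most users would probably want, whereas tokenize() is more conservative.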

# Bonus questions:
myDfm <- dfm(text, verbose = FALSE, what = "fasterword", removePunct = FALSE)
# "chevron" markers are not removed
features(myDfm)
## [1] "<s>i'm"              "a"                   "sentence"            "and"                 "i'd"                
## [6] "better"              "be"                  "formatted"           "properly!</s><s>i'm" "second"             
## [11] "sentence</s>"

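The final part of the bonus question is the difference between docfreq() and colSums(). The former returns the number of documents in which a term occurs; the latter sums the columns to get the total term frequency across documents. See below how the two differ for the term "representatives".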

# Difference between docfreq() and colSums():
myDfm2 <- dfm(inaugTexts[1:4], verbose = FALSE)
myDfm2[, "representatives"]
## Document-feature matrix of: 4 documents, 1 feature.
## 4 x 1 sparse Matrix of class "dfmSparse"
##                  features
## docs              representatives
##   1789-Washington               2
##   1793-Washington               0
##   1797-Adams                    2
##   1801-Jefferson                0
docfreq(myDfm2)["representatives"]
## representatives 
##               2 
colSums(myDfm2)["representatives"]
## representatives 
##               4
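The distinction can also be reproduced without quanteda. The sketch below uses a plain base R matrix as a stand-in for the dfm, with the "representatives" counts copied from the output above; it illustrates what the two quantities mean and is not a claim about how docfreq() is implemented internally.

```r
# Base R only: a plain matrix stands in for the dfm, holding the
# "representatives" counts shown in the output above.
counts <- matrix(
  c(2, 0, 2, 0),
  ncol = 1,
  dimnames = list(
    docs = c("1789-Washington", "1793-Washington",
             "1797-Adams", "1801-Jefferson"),
    features = "representatives"
  )
)

# Total term frequency: add up the counts in each column.
total_freq <- colSums(counts)

# Document frequency: count the documents in which the term occurs at least once.
doc_freq <- colSums(counts > 0)

print(total_freq)  # representatives: 4
print(doc_freq)    # representatives: 2
```

The two numbers coincide only when a term never occurs more than once in any document, which is why the results can look almost identical on small inputs.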

Update: some commands and behaviours changed in quanteda v0.9.9:

To return a simple vector, keeping the chevrons:

as.character(toks <- tokens(text, ngrams = 3, what = "fasterword"))
#  [1] "<s>I'm_a_sentence"                "a_sentence_and"                   "sentence_and_I'd"                
#  [4] "and_I'd_better"                   "I'd_better_be"                    "better_be_formatted"             
#  [7] "be_formatted_properly!</s><s>I'm" "formatted_properly!</s><s>I'm_a"  "properly!</s><s>I'm_a_second"    
# [10] "a_second_sentence</s>"

To keep the n-grams within sentences:

(sents <- as.character(tokens(text, what = "sentence")))
# [1] "<s>I'm a sentence and I'd better be formatted properly!" "</s><s>I'm a second sentence</s>"                       
tokens(sents, ngrams = 3, what = "fasterword")
# tokens from 2 documents.
# Component 1 :
# [1] "<s>I'm_a_sentence"      "a_sentence_and"         "sentence_and_I'd"       "and_I'd_better"         "I'd_better_be"         
# [6] "better_be_formatted"    "be_formatted_properly!"
# 
# Component 2 :
# [1] "</s><s>I'm_a_second"   "a_second_sentence</s>"

Bonus question, part 1:

featnames(dfm(text, verbose = FALSE, what = "fasterword", removePunct = FALSE))
#  [1] "<s>i'm"              "a"                   "sentence"            "and"                 "i'd"                
#  [6] "better"              "be"                  "formatted"           "properly!</s><s>i'm" "second"             
# [11] "sentence</s>"

Bonus question, part 2 remains unchanged.

score 1 · Accepted Answer

How about something like this:

ngrams(
  tokenize(
    unlist(
      segment(text, what = "other", delimiter = "(?<=\\</s\\>)", perl = TRUE)),
    what = "fastestword", simplify = TRUE),
  n = 3L)

# [1] "<s>I'm_a_sentence"              "a_sentence_and"                
# [3] "sentence_and_I'd"               "and_I'd_better"                
# [5] "I'd_better_be"                  "better_be_formatted"           
# [7] "be_formatted_properly!</s>"     "formatted_properly!</s>_<s>I'm"
# [9] "properly!</s>_<s>I'm_a"         "<s>I'm_a_second"               
#[11] "a_second_sentence</s>"

Or, if you don't want n-grams that cross sentence boundaries:

unlist(
  ngrams(
    tokenize(
      unlist(
        segment(text, what = "other", delimiter = "(?<=\\</s\\>)", perl = TRUE)),
      what = "fastestword"),
    n = 3L))
#[1] "<s>I'm_a_sentence"          "a_sentence_and"            
#[3] "sentence_and_I'd"           "and_I'd_better"            
#[5] "I'd_better_be"              "better_be_formatted"       
#[7] "be_formatted_properly!</s>" "<s>I'm_a_second"           
#[9] "a_second_sentence</s>"

I leave the custom options (e.g. removePunct = TRUE, etc.) to you.
