To return a simple vector, just unlist the "tokenizedText" object returned from
tokenize() (which is a specially classed list, with additional attributes). Here I used
what = "fasterword", which splits on "\\s"; it's a tiny bit smarter than
what = "fastestword", which splits on " ".
# how to not remove the <s>, and return a vector
unlist(toks <- tokenize(text, ngrams = 3, what = "fasterword"))
## [1] "<s>I'm_a_sentence" "a_sentence_and"
## [3] "sentence_and_I'd" "and_I'd_better"
## [5] "I'd_better_be" "better_be_formatted"
## [7] "be_formatted_properly!</s><s>I'm" "formatted_properly!</s><s>I'm_a"
## [9] "properly!</s><s>I'm_a_second" "a_second_sentence</s>"
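The difference between the two splitting rules can be seen without quanteda. The following is a minimal base R sketch, not quanteda's actual implementation: it assumes only that "fasterword" behaves like a split on the regular expression "\\s" and "fastestword" like a split on a literal space, as described above. The sample string x is made up for illustration.

```r
# Made-up sample: a tab between "a" and "sentence", single spaces elsewhere
x <- "I'm a\tsentence and more"

# Split on any single whitespace character (regex "\\s"): the tab is a separator
by_whitespace <- strsplit(x, "\\s")[[1]]

# Split on a literal space only: the tab stays inside one token
by_space <- strsplit(x, " ", fixed = TRUE)[[1]]

length(by_whitespace)
length(by_space)
```

With plain space-separated input, such as the text in the question, both rules should give the same tokens; they only diverge when tabs or newlines are present.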
To keep the n-grams within sentences, tokenize the object twice: first into sentences, then with what = "fasterword".
# keep it within sentence
(sents <- unlist(tokenize(text, what = "sentence")))
## [1] "<s>I'm a sentence and I'd better be formatted properly!"
## [2] "</s><s>I'm a second sentence</s>"
tokenize(sents, ngrams = 3, what = "fasterword")
## tokenizedText object from 2 documents.
## Component 1 :
## [1] "<s>I'm_a_sentence" "a_sentence_and" "sentence_and_I'd" "and_I'd_better"
## [5] "I'd_better_be" "better_be_formatted" "be_formatted_properly!"
##
## Component 2 :
## [1] "</s><s>I'm_a_second" "a_second_sentence</s>"
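The two-pass idea (split into sentences, then build n-grams per sentence) can be sketched in base R. This is only an illustration under stated assumptions: make_ngrams() is a hypothetical helper, not a quanteda function, and the sentences are supplied already split and without the <s> markers.

```r
# Hypothetical helper: join every run of n consecutive tokens with sep
make_ngrams <- function(tokens, n = 3, sep = "_") {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = sep),
         character(1))
}

# Assumed input: one element per sentence (first pass already done)
sents <- c("I'm a sentence and I'd better be formatted properly!",
           "I'm a second sentence")

# Second pass: split each sentence on whitespace, then form trigrams
trigrams <- lapply(strsplit(sents, "\\s+"), make_ngrams, n = 3)
trigrams
```

Because each sentence is processed separately, no trigram can span a sentence boundary.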
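To preserve the chevron markers in a dfm, you can pass the same options that you used in the tokenize() call above, since dfm() calls tokenize() but with different defaults: it uses the options that most users would probably want, whereas tokenize() is more conservative.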
# Bonus questions:
myDfm <- dfm(text, verbose = FALSE, what = "fasterword", removePunct = FALSE)
# "chevron" markers are not removed
features(myDfm)
## [1] "<s>i'm" "a" "sentence" "and" "i'd"
## [6] "better" "be" "formatted" "properly!</s><s>i'm" "second"
## [11] "sentence</s>"
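The final part of the bonus question is the difference between docfreq() and colSums(). The former returns the number of documents in which a term occurs; the latter sums the columns to get the total term frequency across documents. See below how these differ for the term "representatives".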
# Difference between docfreq() and colSums():
myDfm2 <- dfm(inaugTexts[1:4], verbose = FALSE)
myDfm2[, "representatives"]
## Document-feature matrix of: 4 documents, 1 feature.
## 4 x 1 sparse Matrix of class "dfmSparse"
## features
## docs representatives
## 1789-Washington 2
## 1793-Washington 0
## 1797-Adams 2
## 1801-Jefferson 0
docfreq(myDfm2)["representatives"]
## representatives
## 2
colSums(myDfm2)["representatives"]
## representatives
## 4
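The same distinction can be reproduced on an ordinary matrix. The sketch below uses base R only and does not call quanteda: the "representatives" column copies the counts shown above, while the "example" column is made up for illustration.

```r
# Rows are documents, columns are features (matrix() fills column by column)
counts <- matrix(
  c(2, 0, 2, 0,   # "representatives": counts from the output above
    1, 1, 3, 0),  # "example": made-up counts
  nrow = 4,
  dimnames = list(
    docs = c("1789-Washington", "1793-Washington", "1797-Adams", "1801-Jefferson"),
    features = c("representatives", "example")
  )
)

# Document frequency: number of documents in which each feature occurs
doc_freq <- colSums(counts > 0)

# Total term frequency: sum of each feature's counts across documents
total_freq <- colSums(counts)

doc_freq["representatives"]
total_freq["representatives"]
```

Here "representatives" occurs in 2 documents but 4 times in total, which matches the docfreq() and colSums() results above.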
Update: some commands and behaviours changed in quanteda v0.9.9.
To return a simple vector, keeping the chevrons:
as.character(toks <- tokens(text, ngrams = 3, what = "fasterword"))
# [1] "<s>I'm_a_sentence" "a_sentence_and" "sentence_and_I'd"
# [4] "and_I'd_better" "I'd_better_be" "better_be_formatted"
# [7] "be_formatted_properly!</s><s>I'm" "formatted_properly!</s><s>I'm_a" "properly!</s><s>I'm_a_second"
# [10] "a_second_sentence</s>"
To keep within sentences:
(sents <- as.character(tokens(text, what = "sentence")))
# [1] "<s>I'm a sentence and I'd better be formatted properly!" "</s><s>I'm a second sentence</s>"
tokens(sents, ngrams = 3, what = "fasterword")
# tokens from 2 documents.
# Component 1 :
# [1] "<s>I'm_a_sentence" "a_sentence_and" "sentence_and_I'd" "and_I'd_better" "I'd_better_be"
# [6] "better_be_formatted" "be_formatted_properly!"
#
# Component 2 :
# [1] "</s><s>I'm_a_second" "a_second_sentence</s>"
Bonus question, part 1:
featnames(dfm(text, verbose = FALSE, what = "fasterword", removePunct = FALSE))
# [1] "<s>i'm" "a" "sentence" "and" "i'd"
# [6] "better" "be" "formatted" "properly!</s><s>i'm" "second"
# [11] "sentence</s>"
Bonus question, part 2 remains unchanged.