1

使用 text2vec 包,我创建了一个词汇表。

vocab = create_vocabulary(it_0, ngram = c(2L, 2L)) 

词汇看起来像这样

> vocab
Number of docs: 120 
0 stopwords:  ... 
ngram_min = 2; ngram_max = 2 
Vocabulary: 
                    terms terms_counts doc_counts
    1:    knight_severely            1          1
    2:       movie_expect            1          1
    3: recommend_watching            1          1
    4:        nuke_entire            1          1
    5:      sense_keeping            1          1
   ---                                           
14467:         stand_idly            1          1
14468:    officer_loyalty            1          1
14469:    willingness_die            1          1
14470:         fight_bane            3          3
14471:     bane_beginning            1          1

如何检查列 terms_counts 的范围?我需要这个,因为它在修剪过程中对我有帮助,这是我的下一步

pruned_vocab = prune_vocabulary(vocab, term_count_min = <BLANK>)

下面的代码是可重现的

library(text2vec)

text <- c(" huge fan superhero movies expectations batman begins viewing christopher 
          nolan production pleasantly shocked huge expectations dark knight christopher 
          nolan blew expectations dust happen film dark knight rises simply big expectations 
          blown production true cinematic experience behold movie exceeded expectations terms 
          action entertainment",                                                       
          "christopher nolan outdone morning tired awake set film films genuine emotional 
          eartbeat felt flaw nolan films vision emotion hollow bought felt hero villain 
          alike christian bale typically brilliant batman felt bruce wayne heavily embraced
          final installment bale added emotional depth character plot point astray dark knight")

it_0 = itoken( text,
               tokenizer = word_tokenizer,
               progressbar = T)

vocab = create_vocabulary(it_0, ngram = c(2L, 2L)) 
vocab
4

2 回答 2

1

尝试range(vocab$vocab$terms_counts)

于 2016-11-26T07:17:00.793 回答
1

vocab是一些元信息(文档数量、ngram 大小等)的列表,主要data.frame/data.table包含字数和每个字数的文档。

如前所述vocab$vocab,您需要的是(data.table有计数)。

您可以通过调用找到内部结构str(vocab)

List of 5
 $ vocab         :Classes ‘data.table’ and 'data.frame':    82 obs. of  3 variables:
  ..$ terms       : chr [1:82] "plot_point" "depth_character" "emotional_depth" "bale_added" ...
  ..$ terms_counts: int [1:82] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ doc_counts  : int [1:82] 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, ".internal.selfref")=<externalptr> 
 $ ngram         : Named int [1:2] 2 2
  ..- attr(*, "names")= chr [1:2] "ngram_min" "ngram_max"
 $ document_count: int 2
 $ stopwords     : chr(0) 
 $ sep_ngram     : chr "_"
 - attr(*, "class")= chr "text2vec_vocabulary"
于 2016-11-26T07:22:24.470 回答