r - 整洁的文本：从以下术语-文档矩阵计算 Zipf 定律

Question

我尝试了来自http://tidytextmining.com/tfidf.html的代码。我的结果可以在这张图片中看到。

我的问题是：如何重写代码以产生词频和排名之间的负关系？

以下是术语-文档矩阵。任何意见都受到高度赞赏。

 # Zipf 's law

freq_rk < -DTM_words %>%
group_by(document) %>%
mutate(rank=row_number(),
       'term_frequency'=count/total)

freq_rk %>%
ggplot(aes(rank,term_frequency,color=document)) +
geom_line(size=1.2,alpha=0.8)


DTM_words
 # A tibble: 4,530 x 5
     document       term count     n total
        <chr>      <chr> <dbl> <int> <dbl>
 1        1      activ     1     1   109
 2        1 agencydebt     1     1   109
 3        1     assess     1     1   109
 4        1      avail     1     1   109
 5        1     balanc     2     1   109
 # ... with 4,520 more rows

score 2 · Accepted Answer

要使用row_number()排名，您需要确保您的数据框按排序n，即一个单词在文档中的使用次数。让我们看一个例子。听起来您是从正在整理的文档术语矩阵开始的？（我将使用一些类似于 quanteda 的 DTM 的示例数据。）

library(tidyverse)
library(tidytext)

data("data_corpus_inaugural", package = "quanteda")
inaug_dfm <- quanteda::dfm(data_corpus_inaugural, verbose = FALSE)

ap_td <- tidy(inaug_dfm)
ap_td
#> # A tibble: 44,725 x 3
#>           document   term count
#>              <chr>  <chr> <dbl>
#>  1 1789-Washington fellow     3
#>  2 1793-Washington fellow     1
#>  3      1797-Adams fellow     3
#>  4  1801-Jefferson fellow     7
#>  5  1805-Jefferson fellow     8
#>  6    1809-Madison fellow     1
#>  7    1813-Madison fellow     1
#>  8     1817-Monroe fellow     6
#>  9     1821-Monroe fellow    10
#> 10      1825-Adams fellow     3
#> # ... with 44,715 more rows

请注意，在这里，您有一个整齐的数据框，每行一个单词，但它不是按count每个单词在每个文档中使用的次数排序的。如果我们row_number()在这里尝试分配排名，那是没有意义的，因为这些单词都是按顺序排列的。

相反，我们可以通过递减计数来安排。

ap_td <- tidy(inaug_dfm) %>%
  group_by(document) %>%
  arrange(desc(count)) 

ap_td
#> # A tibble: 44,725 x 3
#> # Groups:   document [58]
#>         document  term count
#>            <chr> <chr> <dbl>
#>  1 1841-Harrison   the   829
#>  2 1841-Harrison    of   604
#>  3     1909-Taft   the   486
#>  4 1841-Harrison     ,   407
#>  5     1845-Polk   the   397
#>  6   1821-Monroe   the   360
#>  7 1889-Harrison   the   360
#>  8 1897-McKinley   the   345
#>  9 1841-Harrison    to   318
#> 10 1881-Garfield   the   317
#> # ... with 44,715 more rows

现在我们可以使用row_number()来获得排名，因为数据框实际上是排名/排列/排序/排序/无论你想说什么。

ap_td <- tidy(inaug_dfm) %>%
  group_by(document) %>%
  arrange(desc(count)) %>%
  mutate(rank = row_number(),
         total = sum(count),
         `term frequency` = count / total)

ap_td
#> # A tibble: 44,725 x 6
#> # Groups:   document [58]
#>         document  term count  rank total `term frequency`
#>            <chr> <chr> <dbl> <int> <dbl>            <dbl>
#>  1 1841-Harrison   the   829     1  9178       0.09032469
#>  2 1841-Harrison    of   604     2  9178       0.06580954
#>  3     1909-Taft   the   486     1  5844       0.08316222
#>  4 1841-Harrison     ,   407     3  9178       0.04434517
#>  5     1845-Polk   the   397     1  5211       0.07618499
#>  6   1821-Monroe   the   360     1  4898       0.07349939
#>  7 1889-Harrison   the   360     1  4744       0.07588533
#>  8 1897-McKinley   the   345     1  4383       0.07871321
#>  9 1841-Harrison    to   318     4  9178       0.03464807
#> 10 1881-Garfield   the   317     1  3240       0.09783951
#> # ... with 44,715 more rows

ap_td %>%
  ggplot(aes(rank, `term frequency`, color = document)) +
  geom_line(alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10()

score 0 · Accepted Answer

描述线性回归（即不是 Zipf 定律）的图只会添加带有线性回归模型 (lm) 的平滑。

freq_rk %>%
ggplot(aes(rank,term_frequency,color=document)) +
geom_line(size=1.2,alpha=0.8) +
geom_smooth(method = lm)

要识别 Austen 的分布与您的分布之间的差异，请运行以下代码：

奥斯汀：

ggplot(freq_by_rank, aes(rank, fill = book) + geom_density(alpha = 0.5) + labs(title = "Austen linear")
ggplot(freq_by_rank, aes(rank, fill = book) + geom_density(alpha = 0.5) + scale_x_log10() + labs(title = "Austen Logarithmic")

汤姆的样品

ggplot(freq_rk, aes(rank, fill = document) + geom_density(alpha = 0.5) + labs(title = "Sample linear")
ggplot(freq_rk, aes(rank, fill = document) + geom_density(alpha = 0.5) + scale_x_log10() + labs(title = "Sample Logarithmic")

r - 整洁的文本：从以下术语-文档矩阵计算 Zipf 定律

2 回答 2

Related

Reference