r - R tidytext stop_words are not filtered consistently from gutenbergr downloads

Question

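This is a strange puzzle. I downloaded two texts from Project Gutenberg: Alice's Adventures in Wonderland and Ulysses. The stop words are removed from Alice, but they remain in Ulysses. The problem persists even when anti_join is replaced with filter(!word %in% stop_words$word).

How do I get the stop_words out of Ulysses?

Thanks for your help!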

Plot of the top 15 tf_idf words for Alice & Ulysses

library(gutenbergr)
library(dplyr)
library(stringr)
library(tidytext)
library(ggplot2)

titles <- c("Alice's Adventures in Wonderland", "Ulysses")


books <- gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = c("title", "author"))


data(stop_words)


tidy_books <- books %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(title, word, sort=TRUE) %>%
  ungroup()


plot_tidy_books <- tidy_books %>%
  bind_tf_idf(word, title, n) %>%
  arrange(desc(tf_idf))       %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  mutate(title = factor(title, levels = unique(title)))


plot_tidy_books %>%
  group_by(title) %>%
  arrange(desc(n))%>%
  top_n(15, tf_idf) %>%
  mutate(word=reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill=title)) +
  geom_col(show.legend = FALSE) +
  labs(x=NULL, y="tf-idf") +
  facet_wrap(~title, ncol=2, scales="free") +
  coord_flip()

score 3 · Accepted Answer

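After a bit of digging in the tokenized Ulysses, it turns out that the text "it’s" actually uses a right single quotation mark (’) rather than an apostrophe ('). stop_words in tidytext uses an apostrophe, so the two never match. You have to replace the right single quotation mark with an apostrophe.

I discovered this by: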
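To make the mismatch concrete, here is a minimal base R sketch. The vector `stop_list` is a tiny hypothetical stand-in for `stop_words$word`, not the real tidytext data; the point is only that a token spelled with U+2019 is a different string from the one spelled with an ASCII apostrophe.

```r
# Hypothetical stand-in for stop_words$word (assumption: the real list
# spells contractions with the ASCII apostrophe, as described above).
stop_list <- c("it's", "the", "and")

curly    <- "it\u2019s"  # right single quotation mark, as in the Ulysses text
straight <- "it's"       # ASCII apostrophe

curly %in% stop_list     # FALSE: slips past anti_join() and filter() alike
straight %in% stop_list  # TRUE
```

This would also explain why swapping anti_join for filter(!word %in% stop_words$word) changes nothing: both rely on exact string equality.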

> utf8ToInt('it’s')
[1]  105  116 8217  115
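The same check can be run over a whole vector of tokens rather than one word at a time. The following is a rough sketch in base R; `non_ascii_chars` is a hypothetical helper name, and the sample tokens are made up for illustration.

```r
# List every distinct non-ASCII character in a character vector of tokens,
# together with its Unicode code point.
non_ascii_chars <- function(words) {
  # Split every token into single characters and keep the distinct ones
  chars <- as.character(unique(unlist(strsplit(words, ""))))
  code_points <- vapply(chars, utf8ToInt, integer(1), USE.NAMES = FALSE)
  keep <- code_points > 127L  # anything above 127 is outside ASCII
  data.frame(
    char = chars[keep],
    code_point = code_points[keep],
    hex = sprintf("U+%04X", code_points[keep]),
    stringsAsFactors = FALSE
  )
}

# Made-up sample tokens; in practice this might be unique(tidy_books$word)
tokens <- c("it\u2019s", "alice", "don\u2019t", "rabbit")
non_ascii_chars(tokens)  # one row: code point 8217, i.e. U+2019
```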

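Googling 8217 led me to a reference page for the character. From there it's as easy as grabbing the C++/Java source form \u2019 and adding a mutate with gsub before your anti_join: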

tidy_books <- books %>%
  unnest_tokens(word, text) %>%
  mutate(word = gsub("\u2019", "'", word)) %>% 
  anti_join(stop_words) %>%
  count(title, word, sort=TRUE) %>%
  ungroup()
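To see which tokens were slipping through before this fix, one option is to compare stop-word membership before and after normalization. This is only a sketch: `missed_by_curly_apostrophe` is a hypothetical helper, and `stop_list` is again a small made-up stand-in for `stop_words$word`.

```r
# Return the tokens that are NOT in the stop list as written, but WOULD be
# once the right single quotation mark is replaced with an apostrophe.
missed_by_curly_apostrophe <- function(words, stop_list) {
  normalized <- gsub("\u2019", "'", words, fixed = TRUE)
  unique(words[!(words %in% stop_list) & (normalized %in% stop_list)])
}

# Made-up tokens and stop list, for illustration only
words     <- c("it\u2019s", "don\u2019t", "stephen", "it's", "bloom\u2019s")
stop_list <- c("it's", "don't", "the")

missed_by_curly_apostrophe(words, stop_list)
# returns the two curly-quoted tokens "it’s" and "don’t"
```

An empty result after the mutate step above would suggest the apostrophe mismatch is no longer letting stop words through.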

Which results in a plot where the stop words are filtered out of both books.
