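There is a lovely piece of code in Section 3.3 of Tidy Text Mining that I am trying to replicate on my own dataset. With my data, however, I can't get ggplot to "remember" that I want the data in descending order and that I want a certain top_n.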

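I can run the code from Tidy Text Mining and get the same chart the book shows. But when I run it on my own dataset, the facets don't show the top_n (they seem to show a random number of categories), and the data within each facet is not sorted in descending order.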

I can reproduce the problem with some random text data and the full code, but I can also reproduce it with mtcars, which really confuses me.

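I expect the chart below to show mpg in descending order within each facet, and to give me only the top 1 category per facet. It doesn't work for me.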

require(tidyverse)

mtcars %>%
  arrange (desc(mpg)) %>%
  mutate (gear = factor(gear, levels = rev(unique(gear)))) %>%
  group_by(am) %>%
  top_n(1) %>%
  ungroup %>%
  ggplot (aes (gear, mpg, fill = am)) +
  geom_col (show.legend = FALSE) +
  labs (x = NULL, y = "mpg") +
  facet_wrap(~am, ncol = 2, scales = "free") + 
  coord_flip()

What I really want, though, is a chart sorted like the one in the Tidy Text book (sample data only).

require(tidyverse)
require(tidytext)

starwars <- tibble (film = c("ANH", "ESB", "ROJ"),
                  text = c("It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire. During the battle, Rebel spies managed to steal secret plans to the Empire's ultimate weapon, the DEATH STAR, an armored space station with enough power to destroy an entire planet. Pursued by the Empire's sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy.....",
                           "It is a dark time for the Rebellion. Although the Death Star has been destroyed, Imperial troops have driven the Rebel forces from their hidden base and pursued them across the galaxy. Evading the dreaded Imperial Starfleet, a group of freedom fighters led by Luke Skywalker has established a new secret base on the remote ice world of Hoth. The evil lord Darth Vader, obsessed with finding young Skywalker, has dispatched thousands of remote probes into the far reaches of space....",
                           "Luke Skywalker has returned to his home planet of Tatooine in an attempt to rescue his friend Han Solo from the clutches of the vile gangster Jabba the Hutt. Little does Luke know that the GALACTIC EMPIRE has secretly begun construction on a new armored space station even more powerful than the first dreaded Death Star. When completed, this ultimate weapon will spell certain doom for the small band of rebels struggling to restore freedom to the galaxy...")) %>%
  unnest_tokens(word, text) %>%
  mutate(film = as.factor(film)) %>%
  count(film, word, sort = TRUE) %>%
  ungroup()

total_wars <- starwars %>%
  group_by(film) %>%
  summarize(total = sum(n))

starwars <- left_join(starwars, total_wars)

starwars <- starwars %>%
  bind_tf_idf(word, film, n)

starwars %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(film) %>%
  top_n(10) %>%
  ungroup %>%
  ggplot(aes(word, tf_idf, fill = film)) +
  geom_col(show.legend = FALSE) +
  labs (x = NULL, y = "tf-idf") +
  facet_wrap(~film, ncol = 2, scales = "free") +
  coord_flip()

1 Answer

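I believe what is tripping you up here is that top_n() defaults to the last variable in the table unless you tell it which variable to use for ordering. In the example in our book, the last variable in the data frame is tf_idf, so that is what gets used for ordering. In the mtcars example, top_n() uses the last column of the data frame for ordering; that happens to be carb.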
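
As a minimal sketch of that default, assuming only dplyr and the built-in mtcars data (the names by_default and by_mpg are made up for illustration), compare top_n() with and without an explicit ordering variable:

```r
library(dplyr)

# No wt argument: top_n() falls back to the last column, which is carb here.
# Ties on carb are all kept, so a group can return more than one row.
by_default <- mtcars %>%
  group_by(am) %>%
  top_n(1) %>%
  ungroup()

# Explicit wt argument: rank by mpg within each am group instead.
by_mpg <- mtcars %>%
  group_by(am) %>%
  top_n(1, mpg) %>%
  ungroup()

nrow(by_default)  # rows kept when ranking by carb
nrow(by_mpg)      # rows kept when ranking by mpg
```

Because ties are kept, ranking by carb can return several rows per group, which would be consistent with the seemingly random number of categories in each facet.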

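You can always tell top_n() which variable to use for ordering by passing it as an argument. For example, check out this similar workflow using the diamonds dataset.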

library(tidyverse)

diamonds %>%
  arrange(desc(price)) %>%
  group_by(clarity) %>%
  top_n(10, price) %>%
  ungroup %>%
  ggplot(aes(cut, price, fill = clarity)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~clarity, scales = "free") + 
  scale_x_discrete(drop=FALSE) +
  coord_flip()

Created on 2018-05-17 by the reprex package (v0.2.0).

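These example datasets aren't exactly parallel, because unlike the tidy text data frame they don't have one row per combination of features. Still, I'm pretty sure top_n() is the problem.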

answered 2018-05-17T17:57:09.100