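I'm using a dataset called HappyDB for a class presentation, analyzing demographic differences in word frequency. I'm using tidytext for most of the analysis and following its online guide to create most of my visuals. However, I've run into a problem creating a word-frequency plot with labeled words. My dataset is structured differently from theirs. I thought I had accounted for that, but clearly I haven't. Here is their example code that produces the plot (comparing Jane Austen with the Brontë sisters and H.G. Wells):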
library(tidyr)
frequency <- bind_rows(mutate(tidy_bronte, author = "Brontë Sisters"),
                       mutate(tidy_hgwells, author = "H.G. Wells"),
                       mutate(tidy_books, author = "Jane Austen")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  spread(author, proportion) %>%
  gather(author, proportion, `Brontë Sisters`:`H.G. Wells`)
library(scales)
# expect a warning about rows with missing values being removed
ggplot(frequency, aes(x = proportion, y = `Jane Austen`, color = abs(`Jane Austen` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Jane Austen", x = NULL)
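That code produces this plot:
I want to replicate this with demographic groups from my dataset, but I keep getting an error. Here is my code, which uses a dataset I have already tidied: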
library(dplyr)
library(tidyr)
library(ggplot2)
library(tidytext)
library(stringr)
windowsFonts(Franklin=windowsFont("Franklin Gothic Demi"))
marriedmen <- tidy_hm[which(tidy_hm$marital =="married" &
                              tidy_hm$gender == "m"),]
marriedwomen <- tidy_hm[which(tidy_hm$marital =="married" &
                                tidy_hm$gender == "f"),]
singlemen <- tidy_hm[which(tidy_hm$marital =="single" &
                             tidy_hm$gender == "m"),]
frequency <- bind_rows(mutate(marriedmen, status = "Married men"),
                       mutate(marriedwomen, status = "Married women"),
                       mutate(singlemen, status = "Single men")) %>%
  count(status, word) %>%
  group_by(status) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  spread(status, proportion) %>%
  gather(status, proportion, `Married women`:`Single men`)
library(scales)
# expect a warning about rows with missing values being removed
ggplot(frequency, aes(x = proportion, y = 'Married men', color = abs(`Married men` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~status, ncol = 2) +
  theme(legend.position="none") +
  labs(y = NULL, x = NULL)
But I keep getting this error:
Error in log(x, base) : non-numeric argument to mathematical function
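I tried removing the scale lines, but that causes a lot of data to be dropped, and the plot doesn't look the way it should: it has no lines, labels, or colors. I'm new to R and to coding in general, so any help is appreciated.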