To summarize the phrases as in the original question, I did
anti <-
    hate_crime %>%
    # keep only the two years being compared
    filter(DATA_YEAR %in% c("2009", "2017")) %>%
    # one logical column per bias of interest
    mutate(
        ANTI_WHITE = grepl("Anti-White", BIAS_DESC),
        ANTI_BLACK = grepl("Anti-Black", BIAS_DESC),
        ANTI_HISPANIC = grepl("Anti-Hispanic", BIAS_DESC)
    ) %>%
    select(DATA_YEAR, starts_with("ANTI"))
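I then used group_by() and summarize_all() to create a count of each occurrence (note that sum() of a logical vector is the number of TRUE values), and used pivot_longer() to create a 'tidy' summary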
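anti %>%
    group_by(DATA_YEAR) %>%
    # sum() of a logical column counts its TRUE values
    summarize_all(~ sum(.)) %>%
    # names_to is passed by name; newer tidyr versions may not accept it positionally
    tidyr::pivot_longer(starts_with("ANTI"), names_to = "BIAS", values_to = "COUNT")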
The result is something like the following (there were errors importing the data with read_csv() that I did not investigate)
# A tibble: 6 x 3
  DATA_YEAR BIAS          COUNT
      <dbl> <chr>         <int>
1      2009 ANTI_WHITE      539
2      2009 ANTI_BLACK     2300
3      2009 ANTI_HISPANIC   486
4      2017 ANTI_WHITE      722
5      2017 ANTI_BLACK     2101
6      2017 ANTI_HISPANIC   444
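The counting step relies on sum() treating TRUE as 1 and FALSE as 0. A minimal sketch of that idiom on a made-up vector (the values in `bias_desc` are invented for illustration and are not taken from the data set):

```r
# Hypothetical BIAS_DESC values, invented for illustration
bias_desc <- c("Anti-White", "Anti-Black or African American",
               "Anti-White;Anti-Asian", "Anti-Jewish")

# grepl() returns one logical value per element ...
is_anti_white <- grepl("Anti-White", bias_desc)

# ... and sum() counts how many of them are TRUE
n_anti_white <- sum(is_anti_white)
n_anti_white
```

Here the first and third elements contain "Anti-White", so the count is 2. A compound entry such as "Anti-White;Anti-Asian" is counted once for each bias it contains.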
Visualization seems to be a second, separate question.
The code can be simplified a little by defining a function
# number of elements of x that contain bias
n_with_bias <- function(x, bias)
    sum(grepl(bias, x))
and then avoiding the need to mutate the data separately
hate_crime %>%
    filter(DATA_YEAR %in% c("2009", "2017")) %>%
    group_by(DATA_YEAR) %>%
    summarize(
        ANTI_WHITE = n_with_bias(BIAS_DESC, "Anti-White"),
        ANTI_BLACK = n_with_bias(BIAS_DESC, "Anti-Black"),
        ANTI_HISPANIC = n_with_bias(BIAS_DESC, "Anti-Hispanic")
    ) %>%
    tidyr::pivot_longer(starts_with("ANTI"), names_to = "BIAS", values_to = "N")
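A base R approach, on the other hand, might create vectors for the years of interest and for all biases (using strsplit() to isolate the components of compound biases)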
years <- c("2009", "2017")
biases <- unique(unlist(strsplit(hate_crime$BIAS_DESC, ";")))
and then create a vector of biases for each year of interest
bias_by_year <- split(hate_crime$BIAS_DESC, hate_crime$DATA_YEAR)[years]
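and iterate over each year and bias (nested iteration can be inefficient when the number of elements is large, e.g., 10,000s, but that is not a concern here)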
sapply(bias_by_year, function(bias) sapply(biases, n_with_bias, x = bias))
The result is a matrix (wrap it in as.data.frame() if a classic data.frame is needed) with the count of every bias in each year
2009 2017
Anti-Black or African American 2300 2101
Anti-White 539 722
Anti-Jewish 932 983
Anti-Arab 0 106
Anti-Protestant 38 42
Anti-Other Religion 111 85
Anti-Islamic (Muslim) 0 0
Anti-Gay (Male) 0 0
Anti-Asian 128 133
Anti-Catholic 52 72
Anti-Heterosexual 21 33
Anti-Hispanic or Latino 486 444
Anti-Other Race/Ethnicity/Ancestry 296 280
Anti-Multiple Religions, Group 48 52
Anti-Multiple Races, Group 180 202
Anti-Lesbian (Female) 0 0
Anti-Lesbian, Gay, Bisexual, or Transgender (Mixed Group) 0 0
Anti-American Indian or Alaska Native 68 244
Anti-Atheism/Agnosticism 10 6
Anti-Bisexual 24 24
Anti-Physical Disability 24 66
Anti-Mental Disability 70 89
Anti-Gender Non-Conforming 0 13
Anti-Female 0 48
Anti-Transgender 0 117
Anti-Native Hawaiian or Other Pacific Islander 0 15
Anti-Male 0 25
Anti-Jehovah's Witness 0 7
Anti-Mormon 0 12
Anti-Buddhist 0 15
Anti-Sikh 0 18
Anti-Other Christian 0 24
Anti-Hindu 0 10
Anti-Eastern Orthodox (Russian, Greek, Other) 0 0
Unknown (offender's motivation not known) 0 0
This avoids the need to enter each bias in the summarize() step. I'm not sure how to do that computation in a readable tidy-style analysis.
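One possible tidy-style sketch, offered as an assumption rather than as part of the analysis above, is to split compound biases into one row per component with tidyr::separate_rows() and then count rows with dplyr::count(). The `toy` tibble below is invented for illustration and stands in for hate_crime; only the column names DATA_YEAR and BIAS_DESC come from the original data.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for hate_crime; the rows are invented for illustration
toy <- tibble(
    DATA_YEAR = c(2009, 2009, 2017, 2017, 2017),
    BIAS_DESC = c("Anti-White", "Anti-White;Anti-Asian", "Anti-Asian",
                  "Anti-Gay (Male)", "Anti-White")
)

tidy_counts <-
    toy %>%
    filter(DATA_YEAR %in% c(2009, 2017)) %>%
    # one row per component of a compound bias
    separate_rows(BIAS_DESC, sep = ";") %>%
    # number of rows for each year / bias combination
    count(DATA_YEAR, BIAS = BIAS_DESC, name = "N")
tidy_counts
```

This is not identical to the grepl() approach: it matches whole components rather than substrings, and year / bias combinations that never occur are simply absent instead of appearing as 0.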
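Note that in the sapply() table above, any bias containing a ( is zero in both years. This is because grepl() treats the ( in the bias as a regular-expression grouping symbol; fix this by adding fixed = TRUE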
# fixed = TRUE matches bias as a literal string rather than a regular expression
n_with_bias <- function(x, bias)
    sum(grepl(bias, x, fixed = TRUE))
and the updated result
2009 2017
Anti-Black or African American 2300 2101
Anti-White 539 722
Anti-Jewish 932 983
Anti-Arab 0 106
Anti-Protestant 38 42
Anti-Other Religion 111 85
Anti-Islamic (Muslim) 107 284
Anti-Gay (Male) 688 692
Anti-Asian 128 133
Anti-Catholic 52 72
Anti-Heterosexual 21 33
Anti-Hispanic or Latino 486 444
Anti-Other Race/Ethnicity/Ancestry 296 280
Anti-Multiple Religions, Group 48 52
Anti-Multiple Races, Group 180 202
Anti-Lesbian (Female) 186 133
Anti-Lesbian, Gay, Bisexual, or Transgender (Mixed Group) 311 287
Anti-American Indian or Alaska Native 68 244
Anti-Atheism/Agnosticism 10 6
Anti-Bisexual 24 24
Anti-Physical Disability 24 66
Anti-Mental Disability 70 89
Anti-Gender Non-Conforming 0 13
Anti-Female 0 48
Anti-Transgender 0 117
Anti-Native Hawaiian or Other Pacific Islander 0 15
Anti-Male 0 25
Anti-Jehovah's Witness 0 7
Anti-Mormon 0 12
Anti-Buddhist 0 15
Anti-Sikh 0 18
Anti-Other Christian 0 24
Anti-Hindu 0 10
Anti-Eastern Orthodox (Russian, Greek, Other) 0 22
Unknown (offender's motivation not known) 0 0
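The effect of fixed = TRUE can be seen on a two-element example. This is only a sketch: the second string, "Anti-Gay Male", is a hypothetical value invented to show what the regular expression actually matches, and it does not appear in the data.

```r
# First element is a real bias label; second is a hypothetical variant
x <- c("Anti-Gay (Male)", "Anti-Gay Male")

# Default: the pattern is a regular expression, so "(Male)" is a group
# that matches the text "Male" without the parentheses
regex_match <- grepl("Anti-Gay (Male)", x)

# fixed = TRUE: the pattern is matched as a literal string
fixed_match <- grepl("Anti-Gay (Male)", x, fixed = TRUE)

regex_match  # FALSE  TRUE
fixed_match  #  TRUE FALSE
```

As a regular expression, the pattern never matches the label it was built from, which is why those rows were zero before the change.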