I have a data frame with the letters of the English alphabet and their frequency. Now it would be nice to also know the frequency of the vowels and the consonants and the total number of occurrences - and since I want to plot all of this information, I need it to be in one data frame.

So I often find myself in a situation like this:

df <- data.frame(letter = letters, freq = sample(1:100, length(letters)))

df_vowels <- data.frame(letter = "vowels", freq = sum(df[df$letter %in% c("a", "e", "i", "o", "u"), ]$freq))
df_consonants <- data.frame(letter = "consonants", freq = sum(df[!df$letter %in% c("a", "e", "i", "o", "u"), ]$freq))
df_totals <- data.frame(letter = "totals", freq = sum(df$freq))

df <- rbind(df, df_vowels, df_consonants, df_totals)

Am I doing this the right way or is there a more elegant solution for this?

Looks like my description was terribly confusing:

Basically, I want to add new categories (rows) to the data frame. In this very simple example, it's simply summarized data.

(For time series plots I'm using the aggregate function.)

2 Answers

Edit: here is a fairly elegant answer to the third version of your question:

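library(dplyr)
library(ggplot2)

df <- data.frame(letter = letters, freq = sample(1:100, length(letters)),
                 stringsAsFactors = FALSE)

# One row per letter (a no-op here, but it handles repeated letters)
df <- df %>% group_by(letter) %>% summarize(freq = sum(freq))

# Totals for consonants (is_vowel == FALSE) and vowels (is_vowel == TRUE)
df.tots <- df %>% group_by(is_vowel = letter %in% c('a','e','i','o','u')) %>%
                  summarize(freq = sum(freq))

# Now we just rbind your three summary rows onto the df, then pipe it into your ggplot.
# Each summary row is a one-row data frame, so freq stays numeric
# (rbind-ing a vector like c('vowels', 312) would coerce freq to character).
df %>%
  rbind(data.frame(letter = 'vowels',     freq = df.tots$freq[df.tots$is_vowel]))  %>%
  rbind(data.frame(letter = 'consonants', freq = df.tots$freq[!df.tots$is_vowel])) %>%
  rbind(data.frame(letter = 'total',      freq = sum(df.tots$freq)))               %>%
  ggplot( ... your_ggplot_command_goes_here ...)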

  #qplot(data=..., x=letter, y=freq, stat='identity', geom='histogram')
  # To keep your x-axis in order, i.e. our summary rows at bottom,
  # you have to explicitly set order of factor levels:
  # df$letter = factor(df$letter, levels=df$letter)

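Voilà!

Notes:

  1. We need data.frame(... stringsAsFactors=F) so that we can later append the rows 'vowels', 'consonants' and 'total', since these do not appear among the factor levels of 'letter'.
  2. Note that dplyr's group_by(is_vowel = ...) lets us insert a new column (as mutate would) and then split on that expression (group_by), all in one compact line. Neat. I never knew it could do that.
  3. You should be able to use bind_rows at the end; I couldn't get it to work.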
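
To illustrate note 1, and why it is safer to append summary rows as one-row data frames, here is a minimal base-R sketch. The toy data and the names toy, bad and good are made up for illustration. Appending a bare vector such as c("vowels", 3) turns the freq column into character, because the vector itself is character; appending a data frame keeps freq numeric.

```r
# Toy data; the names toy, bad and good are hypothetical, for illustration only
toy <- data.frame(letter = c("a", "b"), freq = c(3L, 5L),
                  stringsAsFactors = FALSE)

# c("vowels", 3) is a character vector, so freq gets coerced to character
bad <- rbind(toy, c("vowels", 3))

# A one-row data frame keeps each column's own type
good <- rbind(toy, data.frame(letter = "vowels", freq = 3L,
                              stringsAsFactors = FALSE))

class(bad$freq)   # character
class(good$freq)  # integer
```

A character freq column would be treated as a discrete variable by ggplot, so the second form is usually what you want before plotting.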

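Edit: second version. You say you want to do an aggregation, so we take it that each letter has more than one record in df. You seem to be simply splitting df into vowels and consonants and then merging again, so I don't see that any new column is needed other than is_vowel. One way is with dplyr: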

require(dplyr)
#  I don't see why you don't just overwrite df here with df2, the df of totals...
df2 = df %>% group_by(letter) %>% summarize(freq = sum(freq))
   letter     freq
1       a      150
2       b       33
3       c       54
4       d      258
5       e      285
6       f      300
7       g      198
8       h       27
9       i       36
10      j      189
..    ...      ...

# Now add a logical column, so we can split on it when aggregating
# df or df2 ....
df$is_vowel = df$letter %in% c('a','e','i','o','u')

# Then your total vowels are:
df %>% filter(is_vowel==T) %>% summarize(freq = sum(freq))
     freq
      312
# ... and total consonants ...
df %>% filter(is_vowel==F) %>% summarize(freq = sum(freq))
     freq
     1011

Here is another way, a one-liner, if you want to avoid dplyr:

split(df, df$letter %in% c("a", "e", "i", "o", "u") )
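
The split call returns a list of two data frames, named "FALSE" (consonants) and "TRUE" (vowels). As a rough sketch of how the group totals could then be computed, assuming the letter/freq layout from the question (the toy data and the names toy, parts and totals are illustrative only):

```r
# Toy data with the same columns as the question's df (values are made up)
toy <- data.frame(letter = c("a", "b", "c", "e"), freq = c(10, 20, 30, 45),
                  stringsAsFactors = FALSE)
vowels <- c("a", "e", "i", "o", "u")

# split() on a logical gives a list with elements "FALSE" and "TRUE", in that order
parts <- split(toy, toy$letter %in% vowels)

# Sum freq within each piece; the result is a named numeric vector
totals <- sapply(parts, function(d) sum(d$freq))
names(totals) <- c("consonants", "vowels")
totals
```

The renaming step relies on both groups being present; if the data contained no vowels, the list would have only one element.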

By the way, you can form the list (or set) of consonants more easily by subtracting the vowels from all the letters:

setdiff(letters, c("a", "e", "i", "o", "u"))
# "b" "c" "d" "f" "g" "h" "j" "k" "l" "m" "n" "p" "q" "r" "s" "t" "v" "w" "x" "y" "z"
answered 2015-07-04T18:33:38.510

You could try

 v2 <- with(df, tapply(freq, c('consonants', 'vowels')[letter %in% 
              v1+1L], FUN=sum))

 df1 <- rbind(df, data.frame(letter=c(names(v2),"Total"), 
            freq=c(v2, sum(v2)), stringsAsFactors=FALSE))
 library(ggplot2)
 ggplot(df1, aes(x=letter, y=freq)) +
                  geom_bar(stat='identity')
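
The grouping in the tapply call relies on %in% binding more tightly than +, so letter %in% v1 + 1L is read as (letter %in% v1) + 1L: FALSE becomes 1 and TRUE becomes 2, and those indices pick 'consonants' or 'vowels' from the label vector. A small sketch with made-up inputs (the names x, idx, labels and res are illustrative only):

```r
v1 <- c("a", "e", "i", "o", "u")
x  <- c("a", "b", "e", "z")          # made-up letters

# Logical + integer gives integer: TRUE -> 2L, FALSE -> 1L
idx <- x %in% v1 + 1L

# Use the indices to look up a group label for each letter
labels <- c("consonants", "vowels")[idx]

# Sum made-up frequencies within each label
res <- tapply(c(1, 2, 3, 4), labels, FUN = sum)
res
```

Here res holds 6 for consonants (2 + 4) and 4 for vowels (1 + 3).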

Data

set.seed(24)
df <- data.frame(letter= sample(letters,200, replace=TRUE),
 freq = sample(1:100, 200, replace=TRUE), stringsAsFactors=FALSE)
v1 <- c("a", "e", "i", "o", "u")
answered 2015-07-04T18:34:13.787