I have a lot of old, untidy data in Excel files (50 sheets, each with 400-500 rows). Part of my data looks like this:

Elements= c("Project name ONE","John","Smith","Sara","Project name TWO","stardust","soil","sunflower","juice","doe","tobacco", "Project name THREE","phi","rho","omega")

Units= c("NA", "3", "5", "6", "NA", "21", "19", "31", "24", "1", "5", "NA", "21", "21", "22")

df= data.frame(Elements, Units)

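In my large data set, the number of rows per project varies widely.

I want to create a new column, "Group", that identifies which project each row belongs to. For the example above, the result would look like this: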

Group =c(1,1,1,1,2,2,2,2,2,2,2,3,3,3,3)

df = data.frame(Elements, Units, Group)

But I also want to sum the values in the "Units" column for all rows below each "NA" cell and put the result in a new "Sum" column:

Sum= c("14", "NA", "NA", "NA", "101", "NA", "NA", "NA", "NA", "NA", "NA", "64", "NA", "NA", "NA")

My final product would look like this:

df = data.frame(Elements, Units, Group, Sum)

2 Answers

You can also do it like this:

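library(dplyr)
library(readr)  # type_convert() turns the "NA" strings into real NA values
df %>%
  type_convert() %>%
  # each NA in Units starts a new project, so the running count is the group id
  group_by(grp = cumsum(is.na(Units))) %>%
  # NA^FALSE is 1 and NA^TRUE is NA, so the total survives on the first row only
  mutate(Sum = (NA^(row_number() != 1)) * sum(Units, na.rm = TRUE))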

# A tibble: 15 x 4
# Groups:   grp [3]
   Elements           Units   grp   Sum
   <chr>              <dbl> <int> <dbl>
 1 Project name ONE      NA     1    14
 2 John                   3     1    NA
 3 Smith                  5     1    NA
 4 Sara                   6     1    NA
 5 Project name TWO      NA     2   101
 6 stardust              21     2    NA
 7 soil                  19     2    NA
 8 sunflower             31     2    NA
 9 juice                 24     2    NA
10 doe                    1     2    NA
11 tobacco                5     2    NA
12 Project name THREE    NA     3    64
13 phi                   21     3    NA
14 rho                   21     3    NA
15 omega                 22     3    NA
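
To see why this works, here is a minimal base R sketch of the two tricks on a toy vector. The vector `units` and the names `grp` and `mask` are made up for illustration and are not part of the answer's code:

```r
# Toy vector (hypothetical): NA marks the header row of each project block
units <- c(NA, 3, 5, 6, NA, 21, 19)

# is.na() is TRUE on header rows; cumsum() turns that into a running group id
grp <- cumsum(is.na(units))
print(grp)   # 1 1 1 1 2 2 2

# In R, NA^0 is 1 and NA^1 is NA. A logical exponent is coerced to 0/1,
# so this yields 1 where the condition is FALSE and NA where it is TRUE.
mask <- NA^(seq_along(units) != 1)
print(mask)  # 1 NA NA NA NA NA NA
```

Multiplying the group total by such a mask keeps the total on the first row and turns every other row into NA. In the answer, `row_number()` is evaluated per group, so the first row of each project keeps its total.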
Answered 2022-02-15T18:43:23.890

You could do something like this:

Note that I changed your input example so that missing values are real missing values (NA) rather than the string "NA":

df <- data.frame(Elements = c("Project name ONE","John","Smith","Sara","Project name TWO","stardust","soil","sunflower","juice","doe","tobacco", "Project name THREE","phi","rho","omega"),
                 Units    = c(NA, "3", "5", "6", NA, "21", "19", "31", "24", "1", "5", NA, "21", "21", "22"))

library(tidyverse)
df %>%
  mutate(project = if_else(is.na(Units), Elements, NA_character_),
         Units   = as.numeric(Units)) %>%
  fill(project) %>%
  group_by(project) %>%
  filter(row_number() != 1) %>%
  mutate(Sum = if_else(row_number() == 1, sum(Units, na.rm = TRUE), NA_real_)) %>%
  ungroup()

# A tibble: 12 x 4
   Elements  Units project              Sum
   <chr>     <dbl> <chr>              <dbl>
 1 John          3 Project name ONE      14
 2 Smith         5 Project name ONE      NA
 3 Sara          6 Project name ONE      NA
 4 stardust     21 Project name TWO     101
 5 soil         19 Project name TWO      NA
 6 sunflower    31 Project name TWO      NA
 7 juice        24 Project name TWO      NA
 8 doe           1 Project name TWO      NA
 9 tobacco       5 Project name TWO      NA
10 phi          21 Project name THREE    64
11 rho          21 Project name THREE    NA
12 omega        22 Project name THREE    NA

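So what are we doing here?

  • We define the project (or group) by taking "Elements" wherever Units is NA, then filling that value downward.
  • We also convert your Units column to numeric (in your example it is a character variable).
  • Then we group by project.
  • We filter out the first row of each group, because it only holds the project name, which now has its own column.
  • Then we compute the sum of Units per project and put that into the first row of each project.

If you do not want to drop the first row, the one that holds the project name in "Elements", you can simply remove the line with filter(...).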
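
As a rough sketch of that variant, here is the same pipeline without the filter(...) line. The input is shortened to two projects for brevity, and the result name `res` is only illustrative:

```r
library(dplyr)
library(tidyr)

# Shortened input (assumption: first two projects only, with real NA values)
df <- data.frame(
  Elements = c("Project name ONE", "John", "Smith", "Sara",
               "Project name TWO", "stardust", "soil"),
  Units    = c(NA, "3", "5", "6", NA, "21", "19")
)

res <- df %>%
  mutate(project = if_else(is.na(Units), Elements, NA_character_),
         Units   = as.numeric(Units)) %>%
  fill(project) %>%
  group_by(project) %>%
  # no filter() here, so the project-name row stays and receives the total
  mutate(Sum = if_else(row_number() == 1, sum(Units, na.rm = TRUE), NA_real_)) %>%
  ungroup()

print(res)
```

Because the project-name row has Units = NA, `na.rm = TRUE` keeps it from affecting the total, and the sum lands on that header row, which matches the layout asked for in the question.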

Answered 2022-02-15T18:26:22.493