0

我包含一个数据集“区域”

House_No. Info_On_Area
1a        Names of neighbouringhouse in 100m  1b   1c    1d    1e 
1a        Area of neighbouringhouse  in 100m  500  1000  1500  300
1a        Names of neighbouringhouse in 300m  1b   1c    1d    1e   1f    1g   1h
1a        Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000
2a        Names of neighbouringhouse in 100m  2b   2c    2d    2e 
2a        Area of neighbouringhouse  in 100m  500  1000  1500  300
2a        Names of neighbouringhouse in 300m  2b   2c    2d    2e   2f    2g   2h
2a        Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000

我想创建一个数据框,我可以让表格显示为

House_No. Area of neighbouringhouse in 100m Area of neighbouringhouse  in 300m 

我使用 dplyr 并将不同的门牌号 CT <- data.frame(but %>% group_by( House_No.)) 分组并尝试使用 rowSums。但是,我收到错误消息说信息不是数字。我认为这是因为我需要将行值中的数字设为数字,但我不知道该怎么做。我被困在这个阶段,无法继续前进。

我确实研究了类似的解决方案,但他们似乎没有一个数据框,他们正在努力对行值求和,例如sum rows in data.frame 或 matrixSum by Rows in R

如果有任何帮助,我将不胜感激!谢谢 :)

4

2 回答 2

3

用于stringr::str_extract_*检索数字然后spread使用pivot_wider

library(tidyverse)
df %>%  
   #extract everything up to 1+ digits followed by m
   mutate(flag = str_extract(Info_On_Area,'.*\\d+m'), 
          #extract any 1 or more digits followed by space or at the end
          SumArea = map_dbl(Info_On_Area, ~sum(as.numeric(str_extract_all(.x, '\\d+(?=\\s|$)', simplify = TRUE))))) %>% 
   filter(str_detect(Info_On_Area, 'Area')) %>% 
   #As suggested by @Uwe
   pivot_wider(id_cols = House_No., names_from = flag, values_from = SumArea)

# A tibble: 2 x 3
  House_No. `Area of neighbouringhouse  in 100m` `Area of neighbouringhouse  in 300m`
  <chr>                                    <dbl>                                <dbl>
1 1a                                        3300                                 6300
2 2a                                        3300                                 6300

数据

df <- structure(list(House_No. = c("1a", "1a", "1a", "1a", "2a", "2a", 
"2a", "2a"), Info_On_Area = c("Names of neighbouringhouse in 100m  1b   1c    1d    1e", 
"Area of neighbouringhouse  in 100m  500  1000  1500  300", "Names of neighbouringhouse in 300m  1b   1c    1d    1e   1f    1g   1h", 
"Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000", 
"Names of neighbouringhouse in 100m  2b   2c    2d    2e", "Area of neighbouringhouse  in 100m  500  1000  1500  300", 
"Names of neighbouringhouse in 300m  2b   2c    2d    2e   2f    2g   2h", 
"Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000"
)), class = "data.frame", row.names = c(NA, -8L))
于 2019-12-15T06:19:04.377 回答
2

这里的困难在于信息以宽格式和长格式的混合形式呈现。Info_On_Area是一个字符列,其中包含变量名称以及由空格分隔的任意数量的值。因此,Info_On_Area需要分两步进行。首先,提取变量名称,然后提取数字,以便随后转换为数字和求和。

幸运的是,OP 只对简化问题的区域信息感兴趣。

1. tidyverse 方法

library(dplyr)
library(purrr)
library(stringr)
library(tidyr)
area %>% 
  filter(Info_On_Area %>% str_detect("^Area")) %>% 
  separate(Info_On_Area, c("var", "val"), sep = "(?<=00m)") %>% 
  mutate(Area = map_int(val, ~ str_extract_all(. , "\\d+") %>% unlist() %>% as.integer() %>% sum())) %>%
  pivot_wider(id_cols = House_No., names_from = var, values_from = Area)
# A tibble: 2 x 3
  House_No. `Area of neighbouringhouse  in 100m` `Area of neighbouringhouse  in 300m`
  <chr>                                    <int>                                <int>
1 1a                                        3300                                 6300
2 2a                                        3300                                 6300

结果各有一行House_No.这与A. Suliman 的解决方案不同,该解决方案每个显示两行House_No.(不再在A. Suliman 的答案的编辑版本中)。其他差异包括使用separate()and函数,一个带有lookbehindpivot_wider()的正则表达式,以及作为管道中的第一步应用。 "(?<=00m)"filter()

2. data.table 方法

为了完整起见,这里还有一个data.table解决方案:

library(data.table)
library(magrittr)
setDT(area)[Info_On_Area %like% "^Area", 
            c(.(House_No.= House_No.), tstrsplit(Info_On_Area, "(?<=00m)", perl = TRUE))][
              , str_extract_all(V3, "\\d+") %>% unlist() %>% as.integer() %>% sum(), by = .(House_No., V2)][
                , dcast(.SD, House_No. ~ V2, value.var = "V1")]
   House_No. Area of neighbouringhouse  in 100m Area of neighbouringhouse  in 300m
1:        1a                               3300                               6300
2:        2a                               3300                               6300

数据

area <- structure(list(House_No. = c("1a", "1a", "1a", "1a", "2a", "2a", 
"2a", "2a"), Info_On_Area = c("Names of neighbouringhouse in 100m  1b   1c    1d    1e", 
"Area of neighbouringhouse  in 100m  500  1000  1500  300", "Names of neighbouringhouse in 300m  1b   1c    1d    1e   1f    1g   1h", 
"Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000", 
"Names of neighbouringhouse in 100m  2b   2c    2d    2e", "Area of neighbouringhouse  in 100m  500  1000  1500  300", 
"Names of neighbouringhouse in 300m  2b   2c    2d    2e   2f    2g   2h", 
"Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000"
)), class = "data.frame", row.names = c(NA, -8L))
于 2019-12-15T19:37:59.240 回答