python - Wide-to-long data table conversion with variables in both columns and rows

Question

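I have a csv containing multiple tables, with variables stored in both rows and columns.
About this csv:

I want to go from "wide" to "long"
There are multiple "data frames" in the one csv
Each "data frame" has different types of variables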

> df3
     V1          V2    V3     V4      V5     V6      V7    V8
1   nyc 123 main st month      1       2      3       4     5
2   nyc 123 main st     x  58568  567567 567909   35876 56943
3   nyc 123 main st     y   5345    3673   3453    3467   788
4   nyc 123 main st     z  53223  563894 564456   32409 56155
5                                                            
6    la  63 main st month      1       2      3       4     5
7    la  63 main st     a  87035 7467456   3363     863 43673
8    la  63 main st     b    345     456    345     678   345
9    la  63 main st     c  86690 7467000   3018     185 43328
10                                                           
11   sf 953 main st month      1       2      3       4     5
12   sf 953 main st     x 457456    3455 345345   56457  3634
13   sf 953 main st     b   5345    3673   3453    3467   788
14   sf 953 main st     z 452111    -218 341892   52990  2846

> df4
18 city     address month      x       y      z       a     b       c
19  nyc 123 main st     1  58568    5345  53223    null  null    null
20  nyc 123 main st     2 567567    3673 563894    null  null    null
21  nyc 123 main st     3 567909    3453 564456    null  null    null
22  nyc 123 main st     4  35876    3467  32409    null  null    null
23  nyc 123 main st     5  56943     788  56155    null  null    null
24   la  63 main st     1   null    null   null   87035   345   86690
25   la  63 main st     2   null    null   null 7467456   456 7467000
26   la  63 main st     3   null    null   null    3363   345    3018
27   la  63 main st     4   null    null   null     863   678     185
28   la  63 main st     5   null    null   null   43673   345   43328
29   sf 953 main st     1 457456    null 452111    null  5345    null
30   sf 953 main st     2   3455    null   -218    null  3673    null
31   sf 953 main st     3 345345    null 341892    null  3453    null
32   sf 953 main st     4  56457    null  52990    null  3467    null
33   sf 953 main st     5   3634    null   2846    null   788    null

The first table (df3) is the data I have; the second (df4) is the transformed result I want.

I am most comfortable in R, but I am practising Python, so any approach works.

score 0 · Accepted Answer

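The sample dataset provided by the OP suggests that all data frames within the csv file

have the same structure, i.e., the same number, names, and positions of columns
for all "sub-frames", the month columns V4, ..., V8 refer to the same months 1 to 5

If this is true, we can treat the whole csv file as one data frame and convert it to the desired format by reshaping with melt() and dcast() from the data.table package: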

library(data.table)
setDT(df3)[, melt(.SD, id.vars = paste0("V", 1:3), na.rm = TRUE)][
  V3 != "month", dcast(.SD, V1 + V2 + rleid(variable) ~ forcats::fct_inorder(V3))][
    , setnames(.SD, 1:3, c("city", "address", "month"))]

    city     address month      x    y      z       a    b       c
 1:   la  63 main st     1     NA   NA     NA   87035  345   86690
 2:   la  63 main st     2     NA   NA     NA 7467456  456 7467000
 3:   la  63 main st     3     NA   NA     NA    3363  345    3018
 4:   la  63 main st     4     NA   NA     NA     863  678     185
 5:   la  63 main st     5     NA   NA     NA   43673  345   43328
 6:  nyc 123 main st     1  58568 5345  53223      NA   NA      NA
 7:  nyc 123 main st     2 567567 3673 563894      NA   NA      NA
 8:  nyc 123 main st     3 567909 3453 564456      NA   NA      NA
 9:  nyc 123 main st     4  35876 3467  32409      NA   NA      NA
10:  nyc 123 main st     5  56943  788  56155      NA   NA      NA
11:   sf 953 main st     1 457456   NA 452111      NA 5345      NA
12:   sf 953 main st     2   3455   NA   -218      NA 3673      NA
13:   sf 953 main st     3 345345   NA 341892      NA 3453      NA
14:   sf 953 main st     4  56457   NA  52990      NA 3467      NA
15:   sf 953 main st     5   3634   NA   2846      NA  788      NA

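The fct_inorder() function from Hadley's forcats package is used here to order the columns by first appearance rather than alphabetically (a, b, c, x, y, z).

Note that the cities are also sorted alphabetically. If this matters (which I doubt), the original order can be preserved as well by using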

forcats::fct_inorder(V1) + V2 + rleid(variable) ~ forcats::fct_inorder(V3)

as the dcast() formula.
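Since the question mentions practising Python, the same melt-then-cast idea can be sketched in pandas. This is a minimal sketch and not part of the original answer: it assumes the data has been retyped as comma-separated text with empty separator rows (the `raw` string below is such a retyping of the question's sample), and it relies on the same assumption that every block shares one month header row.

```python
import io
import pandas as pd

# Sample from the question, retyped as CSV (assumed layout, not the OP's file).
raw = """V1,V2,V3,V4,V5,V6,V7,V8
nyc,123 main st,month,1,2,3,4,5
nyc,123 main st,x,58568,567567,567909,35876,56943
nyc,123 main st,y,5345,3673,3453,3467,788
nyc,123 main st,z,53223,563894,564456,32409,56155
,,,,,,,
la,63 main st,month,1,2,3,4,5
la,63 main st,a,87035,7467456,3363,863,43673
la,63 main st,b,345,456,345,678,345
la,63 main st,c,86690,7467000,3018,185,43328
,,,,,,,
sf,953 main st,month,1,2,3,4,5
sf,953 main st,x,457456,3455,345345,56457,3634
sf,953 main st,b,5345,3673,3453,3467,788
sf,953 main st,z,452111,-218,341892,52990,2846
"""

df3 = pd.read_csv(io.StringIO(raw))
value_cols = list(df3.columns[3:])            # V4 ... V8

data = df3.dropna(how="all")                  # drop the empty separator rows

# Check the shared-header assumption, then map V4..V8 to month numbers.
headers = data.loc[data["V3"] == "month", value_cols]
assert (headers.nunique() == 1).all(), "blocks use different month headers"
month_of = {col: int(headers[col].iloc[0]) for col in value_cols}

body = data[data["V3"] != "month"]            # keep only the variable rows
var_order = list(body["V3"].unique())         # x, y, z, a, b, c (first appearance)

# Wide -> long: one row per (city, address, variable, month column).
long = body.melt(id_vars=["V1", "V2", "V3"], value_vars=value_cols,
                 var_name="col", value_name="value")
long["month"] = long["col"].map(month_of)

# Long -> wide again, this time with the variables as columns.
wide = (
    long.set_index(["V1", "V2", "month", "V3"])["value"]
    .unstack("V3")
    .reindex(columns=var_order)
    .reset_index()
    .rename(columns={"V1": "city", "V2": "address"})
)
wide.columns.name = None
print(wide)
```

Combinations of city and variable that do not occur in the input become NaN, so the value columns come out as floats.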

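Data

Unfortunately, the OP did not provide the output of dput(df3), which made it unnecessarily difficult to reproduce the dataset as printed in the question: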

df3 <- readr::read_table(
  "     V1          V2    V3     V4      V5     V6      V7    V8
  1   nyc 123 main st month      1       2      3       4     5
  2   nyc 123 main st     x  58568  567567 567909   35876 56943
  3   nyc 123 main st     y   5345    3673   3453    3467   788
  4   nyc 123 main st     z  53223  563894 564456   32409 56155
  5                                                            
  6    la  63 main st month      1       2      3       4     5
  7    la  63 main st     a  87035 7467456   3363     863 43673
  8    la  63 main st     b    345     456    345     678   345
  9    la  63 main st     c  86690 7467000   3018     185 43328
  10                                                           
  11   sf 953 main st month      1       2      3       4     5
  12   sf 953 main st     x 457456    3455 345345   56457  3634
  13   sf 953 main st     b   5345    3673   3453    3467   788
  14   sf 953 main st     z 452111    -218 341892   52990  2846"
)
library(data.table)
setDT(df3)[, V2 := paste(X3, V2)][, c("X1", "X3") := NULL]
setDF(df3)[]

    V1          V2    V3     V4      V5     V6    V7    V8
1  nyc 123 main st month      1       2      3     4     5
2  nyc 123 main st     x  58568  567567 567909 35876 56943
3  nyc 123 main st     y   5345    3673   3453  3467   788
4  nyc 123 main st     z  53223  563894 564456 32409 56155
5              NA            NA      NA     NA    NA    NA
6   la  63 main st month      1       2      3     4     5
7   la  63 main st     a  87035 7467456   3363   863 43673
8   la  63 main st     b    345     456    345   678   345
9   la  63 main st     c  86690 7467000   3018   185 43328
10             NA            NA      NA     NA    NA    NA
11  sf 953 main st month      1       2      3     4     5
12  sf 953 main st     x 457456    3455 345345 56457  3634
13  sf 953 main st     b   5345    3673   3453  3467   788
14  sf 953 main st     z 452111    -218 341892 52990  2846

score 0 · Accepted Answer

First, it would help if your df had proper column names; insert the column names after reading in the data.

I used the dplyr and stringr libraries for this analysis (gather() and spread() come from tidyr) and renamed the first 3 columns:

library(dplyr)
library(tidyr)    # gather() and spread()
library(stringr)  # str_sub()

df <- data.frame(stringsAsFactors=FALSE,
        city = c("nyc", "nyc", "nyc"),
     address = c("123 main st", "123 main st", "123 main st"),
       month = c("x", "y", "z"),
          X1 = c(58568L, 5345L, 53223L),
          X2 = c(567567L, 3673L, 563894L),
          X3 = c(567909L, 3453L, 564456L),
          X4 = c(35876L, 3467L, 32409L),
          X5 = c(56943L, 788L, 56155L)
)

df %>% gather(Type, Value, -c(city:month)) %>% 
        spread(month, Value) %>%
        mutate(month = str_sub(Type, 2, 2)) %>%
        select(-Type) %>%
        select(c(city, address, month, x:z))

  city     address month      x    y      z
1  nyc 123 main st     1  58568 5345  53223
2  nyc 123 main st     2 567567 3673 563894
3  nyc 123 main st     3 567909 3453 564456
4  nyc 123 main st     4  35876 3467  32409
5  nyc 123 main st     5  56943  788  56155
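
The answer above reshapes a single block. A minimal sketch of extending the idea to several blocks in plain Python, one block at a time, could look like the following. It is not from the original answers: `reshape_blocks` is a hypothetical helper, and the comma-separated `raw` sample (two blocks retyped from the question) is an assumed layout. Each block's own month row is used, so the blocks would not need to share the same months.

```python
import csv
import io

# Two blocks retyped from the question as comma-separated text (assumed layout).
raw = """nyc,123 main st,month,1,2,3,4,5
nyc,123 main st,x,58568,567567,567909,35876,56943
nyc,123 main st,y,5345,3673,3453,3467,788
nyc,123 main st,z,53223,563894,564456,32409,56155
,,,,,,,
la,63 main st,month,1,2,3,4,5
la,63 main st,a,87035,7467456,3363,863,43673
la,63 main st,b,345,456,345,678,345
la,63 main st,c,86690,7467000,3018,185,43328
"""


def reshape_blocks(lines):
    """Return one dict per (city, address, month), built block by block."""
    records = []
    months, by_month = [], {}
    for row in csv.reader(lines):
        if len(row) < 4 or not row[2]:
            continue  # skip blank separator rows
        city, address, name, *values = row
        if name == "month":
            # A month row starts a new block and labels its value columns.
            months = [int(v) for v in values]
            by_month = {m: {"city": city, "address": address, "month": m}
                        for m in months}
            records.extend(by_month.values())
        else:
            # A variable row contributes one value to each month's record.
            for m, v in zip(months, values):
                by_month[m][name] = int(v)
    return records


records = reshape_blocks(io.StringIO(raw))
print(records[0])
```

Variables that a block lacks are simply absent from its records; writing the result with csv.DictWriter and restval="" would leave those cells empty.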
