python - Combine column names based on values to create another column

Question

I have a movie dataset with a column for each genre and a flag showing whether the movie belongs to that genre. For example:

Index Biography Comedy  Crime   Documentary Drama   Family  Fantasy
0   0   1   0   0   1   1   0
1   0   1   0   0   0   1   0
2   0   0   0   0   1   0   0
3   0   1   0   0   0   0   0
4   0   1   0   0   1   0   0
5   0   0   1   0   1   0   0
6   0   1   0   0   0   0   0

I would like a new column containing the names of the genres each movie belongs to, separated by spaces or commas:

Index  New column
0    Comedy Drama Family
1    Comedy Family
2    Drama
3    Comedy
4    Comedy Drama
5    Crime Drama

Please share code in R or Python. Thanks for your help.

score 2 · Accepted Answer

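Use matrix multiplication in Python (this assumes `Index` is the DataFrame's index, so only the 0/1 genre columns take part in the product):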

df.dot(df.columns + " ")

which gives

Index
0    Comedy Drama Family
1          Comedy Family
2                  Drama
3                 Comedy
4           Comedy Drama
5            Crime Drama
6                 Comedy

To make it more general:

sep = ", "
df.dot(df.columns + sep).str.rstrip(sep)

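That is, append the separator to each column name, perform the matrix-vector multiplication, then strip the trailing separator from the right.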
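Why this works may not be obvious, so here is a minimal pure-Python sketch of what the dot product computes for one row. It relies on two facts about Python strings: multiplying a string by 1 returns it unchanged and multiplying by 0 returns an empty string, and adding strings concatenates them. The variable names (`row`, `terms`, `joined`) are illustrative only.

```python
# Pure-Python sketch of what df.dot(df.columns + sep) computes for a single row.
columns = ["Biography", "Comedy", "Crime", "Documentary", "Drama", "Family", "Fantasy"]
row = [0, 1, 0, 0, 1, 1, 0]  # first row of the example data
sep = ", "

# int * str repeats the string: 1 * "Comedy, " == "Comedy, ", 0 * "Comedy, " == "".
terms = [flag * (name + sep) for flag, name in zip(row, columns)]

# Summing the terms is string concatenation, which leaves one trailing separator.
joined = "".join(terms)

# rstrip removes the trailing separator characters.
result = joined.rstrip(sep)
print(result)  # Comedy, Drama, Family
```

Note that the trick depends on the flags being exactly 0 or 1; a value of 2 would repeat the genre name twice.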

score 1 · Accepted Answer

Basic Python code:

import pandas as pd
df = pd.read_csv('test.csv')

def check_genre(row):
    # Build a space-separated string of every genre flagged with 1.
    # Column names are case-sensitive and must match the CSV header.
    s = ""
    if row['Biography'] == 1:
        s = s + ' Biography'
    if row['Comedy'] == 1:
        s = s + ' Comedy'
    if row['Crime'] == 1:
        s = s + ' Crime'
    if row['Documentary'] == 1:
        s = s + ' Documentary'
    if row['Drama'] == 1:
        s = s + ' Drama'
    if row['Family'] == 1:
        s = s + ' Family'
    if row['Fantasy'] == 1:
        s = s + ' Fantasy'

    # Drop the leading space left by the first match.
    return s.strip()

df['genre'] = df.apply(check_genre, axis=1)

print(df)

score 1 · Accepted Answer

In pandas, you can extract the index labels of the values in each row that equal 1, then join them into a string:

df.apply(lambda row: " ".join(row[row == 1].index), axis=1)

# Index
# 0    Comedy Drama Family
# 1          Comedy Family
# 2                  Drama
# 3                 Comedy
# 4           Comedy Drama
# 5            Crime Drama
# 6                 Comedy
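For a reproducible version, the sketch below rebuilds the question's table and stores the result in a new column. It assumes the data is loaded with `Index` as the DataFrame index; the column name `genres` and the variable `genre_cols` are hypothetical names chosen for illustration.

```python
import io
import pandas as pd

data = """Index Biography Comedy Crime Documentary Drama Family Fantasy
0 0 1 0 0 1 1 0
1 0 1 0 0 0 1 0
2 0 0 0 0 1 0 0
3 0 1 0 0 0 0 0
4 0 1 0 0 1 0 0
5 0 0 1 0 1 0 0
6 0 1 0 0 0 0 0
"""
# Assumption: "Index" is used as the row index, so only genre flags remain as columns.
df = pd.read_csv(io.StringIO(data), sep=r"\s+", index_col="Index")

# Remember the genre columns before adding the new column.
genre_cols = list(df.columns)

# For each row, keep the labels whose value is 1 and join them with a space.
df["genres"] = df[genre_cols].apply(
    lambda row: " ".join(row[row == 1].index), axis=1
)
print(df["genres"].tolist())
```

Selecting `df[genre_cols]` explicitly keeps the new column out of the computation if the cell is run again.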

score 1 · Accepted Answer

Base R -

# df[-1] drops the Index column so that an Index value of 1 is not mistaken for a genre flag
df$new_col <- apply(df[-1], 1, function(x) paste0(names(x)[x == 1], collapse = ' '))

dplyr-

library(dplyr)

df %>%
  group_by(Index) %>%
  summarise(new_col = paste0(names(.[-1])[cur_data() == 1], collapse = ' '))

#  Index new_col            
#  <int> <chr>              
#1     0 Comedy Drama Family
#2     1 Comedy Family      
#3     2 Drama              
#4     3 Comedy             
#5     4 Comedy Drama       
#6     5 Crime Drama        
#7     6 Comedy

Data

df <- structure(list(Index = 0:6, Biography = c(0L, 0L, 0L, 0L, 0L, 
0L, 0L), Comedy = c(1L, 1L, 0L, 1L, 1L, 0L, 1L), Crime = c(0L, 
0L, 0L, 0L, 0L, 1L, 0L), Documentary = c(0L, 0L, 0L, 0L, 0L, 
0L, 0L), Drama = c(1L, 0L, 1L, 0L, 1L, 1L, 0L), Family = c(1L, 
1L, 0L, 0L, 0L, 0L, 0L), Fantasy = c(0L, 0L, 0L, 0L, 0L, 0L, 
0L)), class = "data.frame", row.names = c(NA, -7L))

score 1 · Accepted Answer

df %>%
  apply(1, function(x){which(x == 1)}) %>% 
  lapply(function(x){
    paste(names(x), collapse = " ")
    }) %>%
  unlist() -> df$your_new_column

score 1 · Accepted Answer

my.movies <- read.table(text = 'Index Biography Comedy  Crime   Documentary Drama   Family  Fantasy
0   0   1   0   0   1   1   0
1   0   1   0   0   0   1   0
2   0   0   0   0   1   0   0
3   0   1   0   0   0   0   0
4   0   1   0   0   1   0   0
5   0   0   1   0   1   0   0
6   0   1   0   0   0   0   0', header = T)
library(tidyverse)
my.movies %>%
  pivot_longer(!Index, names_to = 'genre') %>%
  filter(value !=0) %>%
  group_by(Index) %>%
  summarise(genre = toString(genre))
#> # A tibble: 7 x 2
#>   Index genre                
#>   <int> <chr>                
#> 1     0 Comedy, Drama, Family
#> 2     1 Comedy, Family       
#> 3     2 Drama                
#> 4     3 Comedy               
#> 5     4 Comedy, Drama        
#> 6     5 Crime, Drama         
#> 7     6 Comedy

Created on 2021-05-30 by the reprex package (v2.0.0)

score 0 · Accepted Answer

Reduced to a one-liner:

unstack
filter
aggregate

import io
import pandas as pd

df = pd.read_csv(io.StringIO("""Index Biography Comedy  Crime   Documentary Drama   Family  Fantasy
0   0   1   0   0   1   1   0
1   0   1   0   0   0   1   0
2   0   0   0   0   1   0   0
3   0   1   0   0   0   0   0
4   0   1   0   0   1   0   0
5   0   0   1   0   1   0   0
6   0   1   0   0   0   0   0"""), sep=r"\s+").set_index("Index")

df.unstack().to_frame().loc[lambda d: d[0].eq(1)].reset_index().groupby("Index").agg({"level_0":" ".join})

Index	level_0
0	Comedy Drama Family
1	Comedy Family
2	Drama
3	Comedy
4	Comedy Drama
5	Crime Drama
6	Comedy
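The three steps listed above (unstack, filter, aggregate) can be easier to follow when written out one at a time. This is a minimal sketch, not the answer's exact code: the names `long`, `hits`, `flag`, and `genre` are introduced here for illustration, and `rename_axis` is used so the genre level gets a readable name instead of `level_0`.

```python
import io
import pandas as pd

data = """Index Biography Comedy Crime Documentary Drama Family Fantasy
0 0 1 0 0 1 1 0
1 0 1 0 0 0 1 0
2 0 0 0 0 1 0 0
3 0 1 0 0 0 0 0
4 0 1 0 0 1 0 0
5 0 0 1 0 1 0 0
6 0 1 0 0 0 0 0
"""
df = pd.read_csv(io.StringIO(data), sep=r"\s+", index_col="Index")

# 1. Unstack: one entry per (genre, Index) pair, holding the 0/1 flag.
long = df.rename_axis(columns="genre").unstack().rename("flag")

# 2. Filter: keep only the pairs where the movie belongs to the genre.
hits = long[long.eq(1)].reset_index()

# 3. Aggregate: join the genre names of each movie with a space.
result = hits.groupby("Index")["genre"].agg(" ".join)
print(result.tolist())
```

A movie with no genre flagged would be filtered out in step 2 and therefore be absent from `result`.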

score 0 · Accepted Answer

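Posting an answer in R/dplyr

Assume `main_df` is the data frame shown in the first picture. Pivot the data frame longer so that all the genre columns are in tidy format. Then group_by index, since each index value is one movie, and collapse the genre column using paste.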

main_df%>%
  pivot_longer(cols=-index)%>%
  filter(value>0)%>% # filter where movie is part of the genre i.e 1
  group_by(index)%>%
  mutate(new_genre = paste(name,collapse = ","))%>%
  ungroup()%>%
  distinct(index,new_genre)-> main_df2

# if you want to merge back to the original data frame use left_join

left_join(main_df, main_df2,by="index")
