I have a movie dataset with one column per genre, where each value indicates whether the movie belongs to that genre. For example:

Index Biography Comedy  Crime   Documentary Drama   Family  Fantasy
0   0   1   0   0   1   1   0
1   0   1   0   0   0   1   0
2   0   0   0   0   1   0   0
3   0   1   0   0   0   0   0
4   0   1   0   0   1   0   0
5   0   0   1   0   1   0   0
6   0   1   0   0   0   0   0

I want a new column that lists the names of the genres each movie belongs to, separated by spaces or commas:

Index  New column
0    Comedy Drama Family
1    Comedy Family
2    Drama
3    Comedy
4    Comedy Drama
5    Crime Drama

Please share code in R or Python. Thanks for your help.

8 Answers

2

In Python, using matrix multiplication:

df.dot(df.columns + " ")

to get

Index
0    Comedy Drama Family
1          Comedy Family
2                  Drama
3                 Comedy
4           Comedy Drama
5            Crime Drama
6                 Comedy

To make it more general:
sep = ", "
df.dot(df.columns + sep).str.rstrip(sep)

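That is, append the separator to each column name, perform the matrix-vector multiplication, and then right-strip the separator from the end of each result.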
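To see why the multiplication works, here is a minimal plain-Python sketch of what the dot product does for a single row. It assumes the flags are the integers 0 or 1; the helper name dot_label is made up for illustration and is not part of pandas.

```python
genres = ["Biography", "Comedy", "Crime", "Documentary", "Drama", "Family", "Fantasy"]
row = [0, 1, 0, 0, 1, 1, 0]  # flags for movie 0 in the question


def dot_label(flags, names, sep=" "):
    """Dot product of 0/1 flags with separator-suffixed names (hypothetical helper)."""
    out = ""
    for flag, name in zip(flags, names):
        # 1 * "Comedy " == "Comedy " and 0 * "Comedy " == "",
        # so "summing" the products concatenates only the flagged names.
        out += flag * (name + sep)
    # Remove exactly one trailing separator, if there is one.
    if sep and out.endswith(sep):
        out = out[: -len(sep)]
    return out


print(dot_label(row, genres))
print(dot_label(row, genres, sep=", "))
```

One thing to be aware of: str.rstrip(sep) removes any trailing characters that appear in sep, not the exact suffix. That is harmless for these genre names, but the sketch slices off the exact separator instead.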

answered 2021-05-30T07:12:30.427
1

Basic Python code:

import pandas as pd
df = pd.read_csv('test.csv')

def check_genre(row):
    # Collect the name of every genre column flagged with 1.
    # Column names must match the dataset's capitalisation exactly.
    s = ""
    if row['Biography'] == 1:
        s = s + ' Biography'
    if row['Comedy'] == 1:
        s = s + ' Comedy'
    if row['Crime'] == 1:
        s = s + ' Crime'
    if row['Documentary'] == 1:
        s = s + ' Documentary'
    if row['Drama'] == 1:
        s = s + ' Drama'
    if row['Family'] == 1:
        s = s + ' Family'
    if row['Fantasy'] == 1:
        s = s + ' Fantasy'

    # Drop the leading space left by the first appended genre.
    return s.strip()

df['genre'] = df.apply(check_genre, axis=1)

print(df)
answered 2021-05-30T07:06:18.507
1

In pandas, you can take the index labels of the values in each row that equal 1, then join them into a string:

df.apply(lambda row: " ".join(row[row == 1].index), axis=1)

# Index
# 0    Comedy Drama Family
# 1          Comedy Family
# 2                  Drama
# 3                 Comedy
# 4           Comedy Drama
# 5            Crime Drama
# 6                 Comedy
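
Broken down for a single row, the lambda above does roughly the following. This is a small self-contained sketch on a made-up two-movie frame (rows 0 and 5 of the question's data with only some of the genre columns), not the asker's full dataset.

```python
import pandas as pd

# Hypothetical subset of the question's data: movies 0 and 5, four genres.
df = pd.DataFrame(
    {"Biography": [0, 0], "Comedy": [1, 0], "Crime": [0, 1], "Drama": [1, 1]},
    index=pd.Index([0, 5], name="Index"),
)

row = df.loc[0]                    # one movie as a Series indexed by genre name
mask = row == 1                    # boolean Series: True where the genre applies
selected = list(row[mask].index)   # the labels of the flagged genres
label = " ".join(selected)         # the string for this one movie

# The same steps applied to every row at once.
labels = df.apply(lambda r: " ".join(r[r == 1].index), axis=1)
print(labels)
```

With axis=1, apply passes each row as a Series whose index is the column names, which is why .index yields genre names here.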
answered 2021-05-30T07:11:52.403
1

Base R:

# df[-1] drops the Index column so that an Index value of 1 is not mistaken for a genre flag
df$new_col <- apply(df[-1], 1, function(x) paste0(names(x)[x == 1], collapse = ' '))

dplyr:

library(dplyr)

df %>%
  group_by(Index) %>%
  summarise(new_col = paste0(names(.[-1])[cur_data() == 1], collapse = ' '))

#  Index new_col            
#  <int> <chr>              
#1     0 Comedy Drama Family
#2     1 Comedy Family      
#3     2 Drama              
#4     3 Comedy             
#5     4 Comedy Drama       
#6     5 Crime Drama        
#7     6 Comedy             

Data

df <- structure(list(Index = 0:6, Biography = c(0L, 0L, 0L, 0L, 0L, 
0L, 0L), Comedy = c(1L, 1L, 0L, 1L, 1L, 0L, 1L), Crime = c(0L, 
0L, 0L, 0L, 0L, 1L, 0L), Documentary = c(0L, 0L, 0L, 0L, 0L, 
0L, 0L), Drama = c(1L, 0L, 1L, 0L, 1L, 1L, 0L), Family = c(1L, 
1L, 0L, 0L, 0L, 0L, 0L), Fantasy = c(0L, 0L, 0L, 0L, 0L, 0L, 
0L)), class = "data.frame", row.names = c(NA, -7L))
answered 2021-05-30T06:57:07.120
1
library(magrittr)  # provides %>%

# Assumes df holds only the 0/1 genre columns (no Index column).
df %>%
  apply(1, function(x){which(x == 1)}) %>% 
  lapply(function(x){
    paste(names(x), collapse = " ")
    }) %>%
  unlist() -> df$your_new_column
answered 2021-05-30T06:53:21.770
1
my.movies <- read.table(text = 'Index Biography Comedy  Crime   Documentary Drama   Family  Fantasy
0   0   1   0   0   1   1   0
1   0   1   0   0   0   1   0
2   0   0   0   0   1   0   0
3   0   1   0   0   0   0   0
4   0   1   0   0   1   0   0
5   0   0   1   0   1   0   0
6   0   1   0   0   0   0   0', header = T)
library(tidyverse)
my.movies %>%
  pivot_longer(!Index, names_to = 'genre') %>%
  filter(value !=0) %>%
  group_by(Index) %>%
  summarise(genre = toString(genre))
#> # A tibble: 7 x 2
#>   Index genre                
#>   <int> <chr>                
#> 1     0 Comedy, Drama, Family
#> 2     1 Comedy, Family       
#> 3     2 Drama                
#> 4     3 Comedy               
#> 5     4 Comedy, Drama        
#> 6     5 Crime, Drama         
#> 7     6 Comedy

Created on 2021-05-30 by the reprex package (v2.0.0)

answered 2021-05-30T06:55:22.007
0

This reduces to a one-liner:

  • unstack
  • filter
  • aggregate
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""Index Biography Comedy  Crime   Documentary Drama   Family  Fantasy
0   0   1   0   0   1   1   0
1   0   1   0   0   0   1   0
2   0   0   0   0   1   0   0
3   0   1   0   0   0   0   0
4   0   1   0   0   1   0   0
5   0   0   1   0   1   0   0
6   0   1   0   0   0   0   0"""), sep=r"\s+").set_index("Index")

df.unstack().to_frame().loc[lambda d: d[0].eq(1)].reset_index().groupby("Index").agg({"level_0":" ".join})
Index  level_0
0      Comedy Drama Family
1      Comedy Family
2      Drama
3      Comedy
4      Comedy Drama
5      Crime Drama
6      Comedy
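
For readers who find the one-liner dense, the three steps (unstack, filter, aggregate) can be written out separately. This is a sketch under the same assumptions as the code above; the intermediate names long, hits and labels, and the axis name "genre", are introduced here only for illustration.

```python
import io

import pandas as pd

raw = """Index Biography Comedy Crime Documentary Drama Family Fantasy
0 0 1 0 0 1 1 0
1 0 1 0 0 0 1 0
2 0 0 0 0 1 0 0
3 0 1 0 0 0 0 0
4 0 1 0 0 1 0 0
5 0 0 1 0 1 0 0
6 0 1 0 0 0 0 0"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+").set_index("Index")

# 1. unstack: wide frame -> long Series indexed by (genre, Index)
long = df.unstack().rename_axis(["genre", "Index"])

# 2. filter: keep only the (genre, Index) pairs flagged with 1
hits = long[long == 1]

# 3. aggregate: join the genre names of each movie into one string
labels = hits.reset_index().groupby("Index")["genre"].agg(" ".join)
print(labels)
```

Naming the index levels avoids the auto-generated level_0 column, so the grouped column can be referred to as genre.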
answered 2021-05-30T07:00:44.223
0

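Posting an answer in R/dplyr.

Assume main_df is the data frame shown in the first image. Pivot the data frame longer so that all the genre columns end up in tidy format, group_by the index (since each index value is one movie), and collapse the genre column using paste.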

library(tidyverse)  # dplyr and tidyr provide the verbs used below

main_df%>%
  pivot_longer(cols=-index)%>%
  filter(value>0)%>% # filter where movie is part of the genre i.e 1
  group_by(index)%>%
  mutate(new_genre = paste(name,collapse = ","))%>%
  ungroup()%>%
  distinct(index,new_genre)-> main_df2

# if you want to merge back to the original data frame use left_join

left_join(main_df, main_df2,by="index")
answered 2021-05-30T06:54:28.843