The dataset is available at https://www.kaggle.com/shivamb/netflix-shows-and-movies-exploratory-analysis/data

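This is an exploratory data analysis of the shows in the Netflix dataset. The data wrangling has two main goals. The first is to extract the year on its own from the date_added column. The second is to create a new column holding the number of seasons of a given show, taken from the duration column. I have relied on the separate() function from the tidyr package (loaded with tidyverse) to achieve both goals.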

The code is as follows:

# Netflix EDA ----
# https://www.kaggle.com/shivamb/netflix-shows-and-movies-exploratory-analysis

library(tidyverse)
library(lubridate)    

net_flix <- read.csv("netflix_titles_nov_2019.csv")

net_flix_wrangled_tbl <- net_flix %>%
    separate(date_added, 
             into = c("date","month","year"),
             sep = "-",
             remove = FALSE)%>%
    separate(duration,
             into = c("count","show_type"),
             sep = " ",
             remove = FALSE)%>%
    glimpse()

If you would rather not download the data, you can use the data frame below instead:

sf <- data.frame(date_added = c("30-11-19", "29-11-19", "", "12-07-19", "", "16-09-19"), 
duration = c("1 Season", "67 min", "135 min", "2 Seasons", "107 min", "3 Seasons"))

The output is obtained by using separate() to split out the date parts and to pull the number of seasons out of the duration column.

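However, could this be done in a better and more robust way, using the lubridate package to get the year, and ifelse()/filter() or regex functions to get only the number of seasons and not the movie minutes?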

1 Answer

Here is one option:

library(dplyr)
library(lubridate)


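sf %>%
  mutate(date_added = dmy(date_added),  # parse "dd-mm-yy"; empty strings become NA
         date = day(date_added), month = month(date_added),
         year = year(date_added), 
         # numeric part of duration (seasons or minutes)
         count = readr::parse_number(as.character(duration)),
         # remaining unit label, trimmed of the leftover leading space
         show_type = stringr::str_trim(stringr::str_remove(duration, as.character(count))))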


#  date_added  duration date month year count show_type
#1 2019-11-30  1 Season   30    11 2019     1    Season
#2 2019-11-29    67 min   29    11 2019    67       min
#3       <NA>   135 min   NA    NA   NA   135       min
#4 2019-07-12 2 Seasons   12     7 2019     2   Seasons
#5       <NA>   107 min   NA    NA   NA   107       min
#6 2019-09-16 3 Seasons   16     9 2019     3   Seasons
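
The question also asks for the number of seasons only, without the movie minutes. The following is a minimal sketch of one way to do that, not part of the approach above: it keeps the number only when duration mentions "Season" and leaves NA for movies. The column names year_added and seasons and the object name seasons_tbl are illustrative assumptions, and the sample data is the one given in the question.

```r
library(dplyr)
library(lubridate)
library(stringr)

# Sample data from the question (strings kept as character, not factor)
sf <- data.frame(
  date_added = c("30-11-19", "29-11-19", "", "12-07-19", "", "16-09-19"),
  duration   = c("1 Season", "67 min", "135 min", "2 Seasons", "107 min", "3 Seasons"),
  stringsAsFactors = FALSE
)

seasons_tbl <- sf %>%
  mutate(
    # Parse day-month-year and keep only the year;
    # quiet = TRUE silences the parse warning for the empty strings (they become NA)
    year_added = year(dmy(date_added, quiet = TRUE)),
    # Extract the leading number only for rows that mention "Season";
    # movie rows (minutes) get NA
    seasons = if_else(str_detect(duration, "Season"),
                      as.integer(str_extract(duration, "\\d+")),
                      NA_integer_)
  )

seasons_tbl
```

To keep only the TV shows afterwards, filter(seasons_tbl, !is.na(seasons)) drops the movie rows.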
answered 2020-04-22T09:01:43.747