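I'm new to R and am practicing with the Titanic data set from Kaggle. I'm trying to split the last name, first name, salutation, and extra information in the Name column into separate columns so that I can try to classify passengers by age: adult or child.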

Here is some sample data from the training data set:

head(traindf,5)
# Source: local data frame [5 x 12]
# 
# PassengerId Survived Pclass
# 1           1        0      3
# 2           2        1      1
# 3           3        1      3
# 4           4        1      1
# 5           5        0      3
# Variables not shown: Name (chr), Sex (fctr), Age (dbl), SibSp (int), Parch
# (int), Ticket (fctr), Fare (dbl), Cabin (fctr), Embarked (fctr)

Here is a sample that includes the names:

select(traindf,Survived,Pclass,Name,Sex)
# Source: local data frame [891 x 4]
# 
# Survived Pclass                                                Name    Sex
# 1         0      3                             Braund, Mr. Owen Harris   male
# 2         1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female
# 3         1      3                              Heikkinen, Miss. Laina female
# 4         1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel) female
# 5         0      3                            Allen, Mr. William Henry   male
# 6         0      3                                    Moran, Mr. James   male
# 7         0      1                             McCarthy, Mr. Timothy J   male
# 8         0      3                      Palsson, Master. Gosta Leonard   male
# 9         1      3   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female
# 10        1      2                 Nasser, Mrs. Nicholas (Adele Achem) female

I can separate the last name from the rest of the column with the following code:

require(tidyr) # for the separate() function

traindfnames <- traindf %>%
  separate(Name, c("Lastname","Salutation"), sep = ",")

traindfnames 
# Source: local data frame [891 x 13]
# 
# PassengerId Survived Pclass  Lastname
# 1            1        0      3    Braund
# 2            2        1      1   Cumings
# 3            3        1      3 Heikkinen
# 4            4        1      1  Futrelle
# 5            5        0      3     Allen
# 6            6        0      3     Moran
# 7            7        0      1  McCarthy
# 8            8        0      3   Palsson
# 9            9        1      3   Johnson
# 10          10        1      2    Nasser
# ..         ...      ...    ...       ...
# Variables not shown: Salutation (chr), Sex (fctr), Age (dbl), SibSp (int),
# Parch (int), Ticket (fctr), Fare (dbl), Cabin (fctr), Embarked (fctr)

However, when I try to add a field for the first name:

traindfnames <- traindf %>%
separate(Name, c("Lastname","Salutation","firstname"), sep =",,")

I get this error:

# Error: Values not split into 3 pieces at 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 2

Am I using incorrect syntax, or is it not possible to split one column into 3 fields?

1 Answer

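Having looked at the data, I think the easiest approach is to use something like str_match() from the stringr package. If you assume data$Name has the form "[Lastname], [Salutation]. [Firstname]", then a regular expression that matches it is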

str_match(data$Name, "([A-Za-z]*),\\s([A-Za-z]*)\\.\\s(.*)")
#      [,1]                                                  [,2]        [,3]   [,4]                                   
# [1,] "Braund, Mr. Owen Harris"                             "Braund"    "Mr"   "Owen Harris"                          
# [2,] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Cumings"   "Mrs"  "John Bradley (Florence Briggs Thayer)"
# [3,] "Heikkinen, Miss. Laina"                              "Heikkinen" "Miss" "Laina"                                
# [4,] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"        "Futrelle"  "Mrs"  "Jacques Heath (Lily May Peel)"        
# [5,] "Allen, Mr. William Henry"                            "Allen"     "Mr"   "William Henry"                        
# [6,] "Moran, Mr. James"                                    "Moran"     "Mr"   "James" 

So you need to add columns 2 to 4 of the matrix above to your original data frame. I'm not sure whether you can do this with separate(). Writing

separate(data, Name, c("Lastname", "Salutation", "Firstname"), sep = "[,\\.]") 

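will try to split each entry at every comma or period, but it runs into a problem at entry 514, which looks like "Rothschild, Mrs. Martin (Elizabeth L. Barrett)" (note the second period): that entry splits into four pieces instead of three.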
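
If your installed tidyr has the extra argument to separate() (more recent releases do; I have not checked which version introduced it), the extra period can be folded into the last piece instead. This is only a sketch on a two-row sample I made up from names mentioned in this thread, and the separator pattern "[,.]\\s*" is my own choice, not something from the question:

```r
library(tidyr)

# Hypothetical sample: one name from the question plus entry 514
data <- data.frame(
  Name = c("Braund, Mr. Owen Harris",
           "Rothschild, Mrs. Martin (Elizabeth L. Barrett)"),
  stringsAsFactors = FALSE
)

# Split at a comma or period and swallow the spaces that follow it.
# extra = "merge" stops after three pieces, so any later period
# (the "L." in entry 514) stays inside Firstname.
res <- separate(data, Name, c("Lastname", "Salutation", "Firstname"),
                sep = "[,.]\\s*", extra = "merge")
res
```

With extra = "merge" the second row should come out as Lastname "Rothschild", Salutation "Mrs" and Firstname "Martin (Elizabeth L. Barrett)".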

In short, the easiest way I can see to do what you want is

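# Columns 2:4 of the str_match() result hold the capture groups in
# pattern order: last name, salutation, first name
data[c("Lastname", "Salutation", "Firstname")] <-
    str_match(data$Name, "([A-Za-z]*),\\s([A-Za-z]*)\\.\\s(.*)")[, 2:4]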
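
One caveat: the pattern above only allows the letters A-Z and a-z in the last name and salutation, and it is not anchored to the start of the string. Below is a rough sketch of a looser, anchored variant that takes everything up to the first comma and everything up to the first period. The sample rows and the names strict and loose are hypothetical, chosen just to show the difference:

```r
library(stringr)

# Hypothetical sample; the last two names contain characters outside A-Za-z
data <- data.frame(
  Name = c("Braund, Mr. Owen Harris",
           "O'Dwyer, Miss. Ellen \"Nellie\"",
           "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)"),
  stringsAsFactors = FALSE
)

# Letters-only pattern from the answer, for comparison
strict <- str_match(data$Name, "([A-Za-z]*),\\s([A-Za-z]*)\\.\\s(.*)")

# Looser variant (an assumption, not part of the original answer):
# anything before the first comma, anything before the first period, the rest
loose <- str_match(data$Name, "^([^,]*),\\s*([^.]*)\\.\\s*(.*)$")

strict[, 2:3]  # row 2 loses the "O'", row 3 does not match at all (NA)
loose[, 2:4]

data[c("Lastname", "Salutation", "Firstname")] <- loose[, 2:4]
data
```

Whether you want the looser pattern depends on how clean the salutations need to be for your classification. It is worth checking sum(is.na(...)) on the matched columns of the full data set either way.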
answered 2014-10-06T21:52:13.850