
Good day,

I have two [probably] very trivial questions for your excellent review.

Question #1

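I have a relatively tidy df (dat) with dim 10299 x 563. The 563 variables common to both datasets [that created] dat are "subject" (numeric), "label" (numeric), and columns 3:563 (variable names taken from a text file). Observations 1:2947 come from a "test" dataset, while observations 2948:10299 come from a "training" dataset.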

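I would like to insert a column (header = 'type') into dat that holds the string "test" for rows 1:2947 and the string "train" for rows 2948:10299, so that I can later group by dataset, or apply similar aggregate functions, in dplyr/tidyr.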

I created a test df (testdf = 1:10299; dim(testdf) = 10299 x 1) and then:

testdf[1:2947, "type"] <- "test"
testdf[2948:10299, "type"] <- "train"
> head(testdf, 2); tail(testdf, 2)
  X1.10299 type
1        1 test
2        2 test
      X1.10299  type
10298    10298 train
10299    10299 train

So I really don't like that there is now a column named X1.10299.

Questions:

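  • Given the use case above, is there a better, more convenient way to create a column containing what I am looking for?
  • What is a good way to actually insert that column into "dat" so that I can later use it for grouping with dplyr?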

Question #2

The way I arrived at my [nearly] tidy df (dat) above was to take two dfs (test and train), with dims 2947 x 563 and 7352 x 563 respectively, and bind them together.

I confirmed that all of my variable names are present after the binding effort with:

test.names <- names(test)
train.names <- names(train)
identical(test.names, train.names)
> TRUE

Interestingly, and of primary concern, if I try to use the bind_rows function from "dplyr" to perform the same binding exercise:

dat <- bind_rows(test, train)

it returns a data frame that apparently keeps all of my observations (x: 10299), but my variable count has now dropped from 563 to 470!

Questions:

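  • Does anyone know why my variables are being chopped?
  • Is this the best way to combine two dfs of the same structure so that I can slice/dice them later with dplyr/tidyr?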

Thank you for taking the time to consider these questions.

Sample test/train dfs for review (the leftmost numbers are the df indices):

test df test[1:10, 1:5]

   subject labels tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1        2      5         0.2571778       -0.02328523       -0.01465376
2        2      5         0.2860267       -0.01316336       -0.11908252
3        2      5         0.2754848       -0.02605042       -0.11815167
4        2      5         0.2702982       -0.03261387       -0.11752018
5        2      5         0.2748330       -0.02784779       -0.12952716
6        2      5         0.2792199       -0.01862040       -0.11390197
7        2      5         0.2797459       -0.01827103       -0.10399988
8        2      5         0.2746005       -0.02503513       -0.11683085
9        2      5         0.2725287       -0.02095401       -0.11447249
10       2      5         0.2757457       -0.01037199       -0.09977589

train df train[1:10, 1:5]

   subject label tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1        1     5         0.2885845      -0.020294171        -0.1329051
2        1     5         0.2784188      -0.016410568        -0.1235202
3        1     5         0.2796531      -0.019467156        -0.1134617
4        1     5         0.2791739      -0.026200646        -0.1232826
5        1     5         0.2766288      -0.016569655        -0.1153619
6        1     5         0.2771988      -0.010097850        -0.1051373
7        1     5         0.2794539      -0.019640776        -0.1100221
8        1     5         0.2774325      -0.030488303        -0.1253604
9        1     5         0.2772934      -0.021750698        -0.1207508
10       1     5         0.2805857      -0.009960298        -0.1060652

Actual code (disregard the function calls; I am doing most of my testing through the console).

The dataset I am using with this code: http://archive.ics.uci.edu/ml/machine-learning-databases/00240/

run_analysis <- function () {
    #Vars available for use throughout the function that should be preserved
    vars <- read.table("features.txt", header = FALSE, sep = "")
    lookup_table <- data.frame(activitynum = c(1,2,3,4,5,6), 
                               activity_label = c("walking", "walking_up", 
                                                  "walking_down", "sitting", 
                                                  "standing", "laying"))
    test <- test_read_process(vars, lookup_table)
    train <- train_read_process(vars, lookup_table)
}

test_read_process <- function(vars, lookup_table) {
    #read in the three documents for cbinding later
    test.sub <- read.table("test/subject_test.txt", header = FALSE)
    test.labels <- read.table("test/y_test.txt", header = FALSE)
    test.obs <- read.table("test/X_test.txt", header = FALSE, sep = "")

    #cbind the cols together and set remaining colNames to var names in vars
    test.dat <- cbind(test.sub, test.labels, test.obs)  
    colnames(test.dat) <- c("subject", "labels", as.character(vars[,2]))

    #Use lookup_table to set the "test_labels" string values that correspond
    #to their integer IDs
    #test.lookup <- merge(test, lookup_table, by.x = "labels", 
    #               by.y ="activitynum", all.x = T)

    #Remove temporary symbols from globalEnv/memory
    rm(test.sub, test.labels, test.obs)

    #return
    return(test.dat)
}

train_read_process <- function(vars, lookup_table) {
    #read in the three documents for cbinding
    train.sub <- read.table("train/subject_train.txt", header = FALSE)
    train.labels <- read.table("train/y_train.txt", header = FALSE)
    train.obs <- read.table("train/X_train.txt", header = FALSE, sep = "")

    #cbind the cols together and set remaining colNames to var names in vars
    train.dat <- cbind(train.sub, train.labels, train.obs)    
    colnames(train.dat) <- c("subject", "label", as.character(vars[,2]))

    #Clean up temporary symbols from globalEnv/memory
    rm(train.sub, train.labels, train.obs, vars)

    return(train.dat)
}

1 Answer


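The problem you are facing stems from duplicate names in the list of variables you use to create your data frame objects. If you make sure the column names are unique and shared between the objects, the code will run. I have included a complete working example based on the code you used above (fixes and various edits are noted in the comments):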

vars <- read.table(file="features.txt", header=F, stringsAsFactors=F)

##  FRS: This is the source of original problem:
duplicated(vars[,2])
vars[317:340,2]
duplicated(vars[317:340,2])
vars[396:419,2]

##  FRS: I edited the following to both account for your data and variable
##    issues:
test_read_process <- function() {
  #read in the three documents for cbinding later
  test.sub <- read.table("test/subject_test.txt", header = FALSE)
  test.labels <- read.table("test/y_test.txt", header = FALSE)
  test.obs <- read.table("test/X_test.txt", header = FALSE, sep = "")

  #cbind the cols together and set remaining colNames to var names in vars
  test.dat <- cbind(test.sub, test.labels, test.obs)  
  #colnames(test.dat) <- c("subject", "labels", as.character(vars[,2]))
  colnames(test.dat) <- c("subject", "labels", paste0("V", 1:nrow(vars)))

  return(test.dat)
}

train_read_process <- function() {
  #read in the three documents for cbinding
  train.sub <- read.table("train/subject_train.txt", header = FALSE)
  train.labels <- read.table("train/y_train.txt", header = FALSE)
  train.obs <- read.table("train/X_train.txt", header = FALSE, sep = "")

  #cbind the cols together and set remaining colNames to var names in vars
  train.dat <- cbind(train.sub, train.labels, train.obs)    
  #colnames(train.dat) <- c("subject", "labels", as.character(vars[,2]))
  colnames(train.dat) <- c("subject", "labels", paste0("V", 1:nrow(vars)))

  return(train.dat)
}


test_df <- test_read_process()
train_df <- train_read_process()

identical(names(test_df), names(train_df))


library("dplyr")

## FRS: These could be piped together but I've kept them separate for clarity:
train_df %>%
  mutate(test="train") -> 
  train_df

test_df %>%
  mutate(test="test") -> 
  test_df

test_df %>% 
  bind_rows(train_df) -> 
  out_df

head(out_df)
out_df

##  FRS: You can set your column names to those of the original 
##    variable list but you still have duplicates to deal with:
names(out_df) <- c("subject", "labels", as.character(vars[,2]), "test")

duplicated(names(out_df))
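
The duplicates flagged by that last line still have to be resolved before the original feature names can be used as column names. A minimal sketch of one way to do it, assuming base R's make.unique() is an acceptable renaming scheme (it appends .1, .2, ... to repeated names). The feature names, the toy_* objects, and all values below are made up for illustration and are not taken from the real dataset. The sketch also creates the label column during the bind itself through the .id argument of bind_rows(), as an alternative to the two mutate() calls above:

```r
library(dplyr)

# Hypothetical feature list with one repeated name, standing in for vars[, 2].
feature_names <- c("energy", "energy", "mean_x")

# make.unique() leaves the first occurrence alone and suffixes the repeats.
unique_names <- make.unique(feature_names)

# Toy stand-ins for test_df and train_df (all values are made up).
toy_test <- data.frame(1:2, c(5, 5), c(0.1, 0.2), c(0.3, 0.4), c(0.5, 0.6))
toy_train <- data.frame(3:5, c(4, 4, 5), c(0.7, 0.8, 0.9),
                        c(1.0, 1.1, 1.2), c(1.3, 1.4, 1.5))
names(toy_test) <- c("subject", "labels", unique_names)
names(toy_train) <- c("subject", "labels", unique_names)

# Naming the arguments and supplying .id records each row's source
# data frame ("test" or "train") in a new column called "type".
toy_out <- bind_rows(test = toy_test, train = toy_train, .id = "type")

# The new column can then be used for grouping.
toy_counts <- toy_out %>%
  group_by(type) %>%
  summarise(n_obs = n())
```

Whether suffixed names such as energy.1 are meaningful enough is a judgment call; building descriptive names by hand from the feature documentation may be preferable for a final tidy dataset.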
answered 2015-05-11T18:02:59.410