r - How do I dynamically add data to a data frame?

Question

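I have data that needs to be cleaned for every line in a file, and I want to insert the cleaned data into an SQLite3 database. I'm using the RSQLite library, which requires a data frame. This is the code I'm trying to get working: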

# Select feature names for use as column names in X train/test loading
feature_names <- unlist(dbGetQuery(con, "select feature_name from features order by feature_id"), use.names = FALSE);

# Load X training data
X_train_lines <- readLines("data/train/X_train.txt"); # Space delimited with leading and trailing spaces
X_train_values <- vector("list", length(X_train_lines));
names(X_train_values) <- feature_names; # colnames or names?
for (index in 1:length(X_train_lines)) {
  cleaned_line <- gsub("^ *|(?<= ) | *$", "", X_train_lines[index], perl=TRUE); # remove extraneous whitespaces
  X_train_values[index] <- strsplit(cleaned_line, " "); # Wondering if X_train_values[index] is correct? 
}
# Write features data to features table
dbWriteTable(con, "X_train", as.data.frame(X_train_values), row.names = FALSE);

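Although the code runs without complaint, I get an error when I try to view the database with DbVisualizer:

An error occurred while performing the operation:
malformed database schema (X_train) - too many columns on X_train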

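My only guess is that the rows and columns are somehow transposed. My column names should be the values of the feature_names vector.

Also, if anyone has suggestions for a better approach...

Update

I tried doing a dput, although I don't know what I'm looking at. Here is the top of the summary: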

head(summary(X_train_values))

                   Length Class    Mode       
tBodyAcc-mean()-X "561"  "-none-" "character"
tBodyAcc-mean()-Y "561"  "-none-" "character"
tBodyAcc-mean()-Z "561"  "-none-" "character"
tBodyAcc-std()-X  "561"  "-none-" "character"
tBodyAcc-std()-Y  "561"  "-none-" "character"
tBodyAcc-std()-Z  "561"  "-none-" "character"

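Once again, this leads me to believe the data is all jumbled. It should have 561 columns, some of which are shown above as tBodyAcc-mean()-X and so on. The column values should be floating-point numbers, which I don't see above.

The table command doesn't work: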

table(X_train_values)
Error in table(X_train_values) : 
  attempt to make a table with >= 2^31 elements

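I should have 7,352 rows and 561 columns.

Update 2

I believe my problem is that I'm trying to use a list as if it were an array of arrays. For example, in Ruby I could do this: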

x_train_values = []
x_train_lines.each { |line| x_train_values << line.split(' ') }

score 0 · Accepted Answer

In the following lines

for (index in 1:length(X_train_lines)) {
    cleaned_line <- gsub("^ *|(?<= ) | *$", "", X_train_lines[index], perl=TRUE);
    X_train_values[index] <- strsplit(cleaned_line, " ");
}

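you are using single square brackets ([) to access a column of a data frame when you should be using double square brackets ([[). When you use X_train_lines[index], a data frame with a single column, equal to X_train_lines[index], is returned. When you use X_train_lines[[index]], however, the actual contents of that column are returned (see http://adv-r.had.co.nz/Subsetting.html for details).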
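To illustrate the difference between the two operators, here is a minimal sketch on made-up data. The names values and df are hypothetical and are not taken from the question's data.

```r
# Hypothetical toy data, only to show how `[` and `[[` differ
values <- list(a = c("1", "2"), b = c("3", "4"))

sub_list <- values["a"]    # single brackets: still a list, of length 1
element  <- values[["a"]]  # double brackets: the character vector itself

class(sub_list)  # "list"
class(element)   # "character"

# The same holds for data frames, which are lists of columns
df <- data.frame(x = c(0.5, 1.5), y = c(2, 3))
class(df["x"])    # "data.frame"
class(df[["x"]])  # "numeric"
```

The same distinction applies on the left-hand side of an assignment: values[["a"]] <- ... replaces the element itself, while values["a"] <- ... expects a list-like replacement.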

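Now, the way gsub works is that it first converts its argument to character using as.character and then processes it. In your case, X_train_lines[index] returns a data.frame whose single column is a factor (I guess), so when it is coerced to character you get the factor's integer level codes rather than the actual contents! So you are actually calling gsub on a string that looks like "1:2:3:...". If you use double brackets instead, gsub coerces a factor (rather than a data frame) to character, which works as intended.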
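The coercion described above can be checked with a small sketch. The input line_factor and the data frame wrapped are hypothetical toy values; the regular expression is the one from the question. The exact text produced when a whole data frame is coerced may vary between R versions, so it is only printed here.

```r
# Hypothetical toy input: a factor holding one padded line
line_factor <- factor("  0.25  -0.5 ")

# as.character() on a factor returns its labels, so gsub() sees the real text
cleaned <- gsub("^ *|(?<= ) | *$", "", line_factor, perl = TRUE)
print(cleaned)

# Coercing a whole data frame with a factor column gives a deparsed
# representation of the column instead of its labels
wrapped <- data.frame(f = factor(c("a", "b")))
print(as.character(wrapped))
```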

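By the way, in R you don't need to end lines with ;. It is only needed to separate multiple statements on the same line.

Finally, it is best to avoid for loops where you can, because they can be slow and because there are more efficient functions with simpler syntax that usually do what you need (such as lapply, apply, sweep, etc.). For column/row/element operations on data frames, matrices and so on, you can use apply, in which case your code would look like this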

apply(X_train_values, 2, gsub, pattern = "^ *|(?<= ) | *$",
    replacement = "", perl = TRUE)
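As a rough sketch of a loop-free version of the whole loading step, the lines can be split and stacked into one row per line before converting to a data frame. This is one possible approach under stated assumptions, not necessarily what the asker ended up using: the two input lines and the names f1, f2, f3 below are hypothetical stand-ins for the contents of data/train/X_train.txt and for feature_names.

```r
# Hypothetical stand-ins for readLines("data/train/X_train.txt") and feature_names
X_train_lines <- c("  0.1  -0.2 3.0e-01 ", " 4.0 5.5  -6.25")
feature_names <- c("f1", "f2", "f3")

# trimws() drops leading/trailing blanks; splitting on " +" absorbs repeated spaces
fields <- strsplit(trimws(X_train_lines), " +")

# One row per line: convert each line's fields to numeric and stack them
X_train_matrix <- do.call(rbind, lapply(fields, as.numeric))
colnames(X_train_matrix) <- feature_names

X_train <- as.data.frame(X_train_matrix)
print(dim(X_train))
```

With the real file, the resulting data frame could then be passed to dbWriteTable(con, "X_train", X_train, row.names = FALSE) as in the question. Because each line becomes a row, the features end up as columns rather than the other way round. read.table is another option worth trying, since by default it splits on runs of whitespace.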
