r - R split() function: object size increase problem

Question

I have the following data set:

> head(data)
  X    UserID NPS V3 V4 V5                                   Event              V7          Element                            ElementValue 
1 1 254727216  10  0 19 10 nps.agent.14b.no other attempt was made 10/4/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
2 2 298379949   0  0 28 11 nps.agent.14b.no other attempt was made 9/30/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
3 3 254710917   0  0 20 12 nps.agent.14b.no other attempt was made 9/15/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
4 4 238919392   7  0 17  9 nps.agent.14b.no other attempt was made 9/17/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
5 5 144693025  10  0 18 10 nps.agent.14b.no other attempt was made 9/17/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
6 6 249978568   5  0 21 12 nps.agent.14b.no other attempt was made 9/18/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made

When I split the data set with:

data_splitted <- split(data,data$UserID)

The problem is that when I try this on the full data set rather than this sample, the huge increase in size exceeds my memory.

> format(object.size(data),units="Mb")
[1] "0.2 Mb"
> format(object.size(data_splitted),units="Mb")
[1] "45.7 Mb"

Any insight into why this happens, and whether there is a workaround, would be much appreciated.

score 3 · Accepted Answer

Try this:

data$UserID <- as.character(data$UserID)
data_splitted <- split(data,data$UserID)

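What seems to be happening in your case is that, because the ID is numeric, the number is used as the index (position) in the list being created, which is clearly not what you want. Since the ids are large numbers, R fills the gaps with as many empty list elements as needed (hence the large object size). Making the id a character variable avoids this.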
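
One way to check whether empty elements are what inflates the list is to compare the number of elements split() returns with the number of distinct ids, and to count the elements that have zero rows. A minimal sketch on a made-up toy data frame (`toy`, `parts`, and `n_rows` are illustrative names, not from the question):

```r
# Toy data standing in for the real data set (values are illustrative)
toy <- data.frame(
  UserID = c(254727216, 298379949, 254710917, 254727216),
  NPS    = c(10, 0, 0, 7)
)

parts <- split(toy, toy$UserID)

# How many list elements were created, versus how many distinct ids exist
length(parts)
length(unique(toy$UserID))

# Rows per element; a count above zero here would indicate empty padding
n_rows <- vapply(parts, nrow, integer(1))
sum(n_rows == 0)

# Per-element memory footprint in bytes, largest first
head(sort(vapply(parts, function(p) as.numeric(object.size(p)), numeric(1)),
          decreasing = TRUE))
```

If the two lengths match and no element is empty, the overhead may come from elsewhere, for example from the full set of factor levels that each piece keeps for its factor columns.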

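Another approach, which keeps the id variable intact in the one-row data frames, is:

data_splitted <- list()
# as.character() makes each id a list name rather than a numeric position
for (i in seq_len(nrow(data)))
  data_splitted[[as.character(data$UserID[i])]] <- data[i, ]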

To access elements of the newly created list, you need to quote the number if you use the $ operator:

data_splitted$"144693025"
data_splitted[["144693025"]]
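
The as.character() conversion is what keeps the loop-built list compact. A small sketch of the difference between assigning into a list by numeric position and by name (`by_position` and `by_name` are illustrative names):

```r
# Assigning at a numeric position pads every earlier position with NULL
by_position <- list()
by_position[[5]] <- "a"
length(by_position)   # 5

# Assigning under a character name creates exactly one named element
by_name <- list()
by_name[["5"]] <- "a"
length(by_name)       # 1

# Both access forms need the quoted name
by_name$"5"
by_name[["5"]]
```

With ids in the hundreds of millions, the positional form would try to allocate a list that long, so the small index 5 is used here on purpose.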

Another option is to prefix the numeric ids with characters. For example:

data$UserID <- paste0("id",data$UserID)
data_splitted <- split(data,data$UserID)

This makes accessing list items more convenient:

data_splitted$id144693025
data_splitted$id238919392

score 1 · Accepted Answer

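If you have many similar strings, use factors instead of strings. (If you don't need to process their contents, don't store them at all, or store only, say, the hostname, again as a factor. You can use a grep regular expression to capture only the fields you need, such as hostname and error code, and throw everything else away.)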
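
As a rough illustration of the storage difference, object.size() can compare a character vector of repeated strings with the same data held as a factor. The vector `msgs` below is made up, and the exact byte counts depend on the platform and R version (the comparison assumes a 64-bit build):

```r
# Made-up column of 100,000 identical log messages
msgs <- rep("nps.agent.14b.no other attempt was made", 1e5)
msgs_factor <- factor(msgs)

# A factor stores one integer code per element plus each distinct level once
object.size(msgs)
object.size(msgs_factor)
nlevels(msgs_factor)
```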

Next, make your life easier by changing or post-processing your log file, from:

nps.agent.14b.no other attempt was made

to:

nps.agent.14b:no other attempt was made

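Now you can simply split on ':' (or '|'). Look up best practices for log files; a lot of good material has been written on the subject. If every line is guaranteed to have exactly one hostname and one error code, you can store them as separate hostname and error-code fields.

So your code should be as simple as: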

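> s <- 'nps.agent.14b:no other attempt was made'
> as.factor(strsplit(s, ':', fixed = TRUE)[[1]])
[1] nps.agent.14b             no other attempt was made
Levels: no other attempt was made nps.agent.14b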

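Again, if you don't need to process "no other attempt was made", don't store it. Or your log messages could abbreviate it to "NEA". Or, if it conveys no extra information, drop it.

I suggest you revisit your log file format and make it as concise and informative as possible.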
