
I use fread to import very large .csv files. Some columns have trailing whitespace after the text that I need to remove, and this takes too much time (hours).

The following code works, but the command timed with system.time is very slow (about 12 seconds on my computer, and the real files are much bigger).

library(data.table)
library(stringr)

# Create example-data
df.1 <- rbind(c("Text1        ", 1, 2), c("Text2        ", 3, 4), c("Text99       ", 5, 6))

colnames(df.1) <- c("Tx", "Nr1", "Nr2")
dt.1 <- data.table(df.1)
for (i in 1:15) {
  dt.1 <- rbind(dt.1, dt.1)
}

# Trim the "Tx"-column
dt.1[, rowid := 1:nrow(dt.1)]
setkey(dt.1, rowid)
system.time( dt.1[, Tx2 :={ str_trim(Tx) }, by=rowid] )
dt.1[, rowid:=NULL]
dt.1[, Tx:=NULL]
setnames(dt.1, "Tx2", "Tx")

Is there a faster way to trim whitespace in data.tables?


3 Answers


You can operate on just the unique values of "Tx" (assuming you actually have some duplication, as your example suggests):

dt.1[, Tx2:=str_trim(Tx),     by=1:nrow(dt.1)]
dt.1[, Tx3:=str_trim(Tx),     by=Tx]

dt.1[, all.equal(Tx2,Tx3)]    # TRUE

Using gsub instead of str_trim, as in @DWin's answer, will also speed things up, whether or not you have duplicated "Tx" values.

Edit: As @DWin points out, there's no reason to go row by row in the first place, so str_trim doesn't need to be applied per row. I've changed my answer accordingly.
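Putting both points together, a minimal self-contained sketch (the example column is rebuilt here, and the Tx2/Tx3 names are just illustrative):

```r
library(data.table)
library(stringr)

# Rebuild a column with trailing whitespace, as in the question's example
dt.1 <- data.table(Tx = rep(c("Text1   ", "Text2   ", "Text99  "), 2^13))

# str_trim is already vectorized: one call over the whole column, no 'by' at all
dt.1[, Tx2 := str_trim(Tx)]

# Grouping by Tx calls str_trim only once per distinct value
dt.1[, Tx3 := str_trim(Tx), by = Tx]

dt.1[, all.equal(Tx2, Tx3)]   # TRUE
```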

answered 2013-10-08T19:32:44.993
system.time( dt.1[, Tx2 :={ str_trim(Tx) }, by=rowid] )
   user  system elapsed 
 19.026   0.105  19.021 

system.time(  dt.1[,  Tx2 := gsub("\\s+$", "", as.character(Tx)), by=rowid]) 
   user  system elapsed 
  4.789   0.053   4.773 
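Combining this with the other answer's observation that there's no reason to go by rows, a sketch of the fully vectorized gsub variant (example data rebuilt here so it runs on its own):

```r
library(data.table)

# Example column with trailing whitespace
dt.1 <- data.table(Tx = rep(c("Text1   ", "Text2   "), 2^13))

# One vectorized gsub call over the whole column: no grouping at all
dt.1[, Tx2 := gsub("\\s+$", "", Tx)]
```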
answered 2013-10-08T19:33:58.370

You can use str_trim from the stringr package together with mutate from dplyr:

df %>%
  mutate(column1 = str_trim(column1))
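Run end to end, a minimal sketch (the data frame df and its column1 are illustrative, not from the question):

```r
library(dplyr)
library(stringr)

# Illustrative data frame with trailing whitespace
df <- data.frame(column1 = c("Text1   ", "Text99  "), stringsAsFactors = FALSE)

df <- df %>%
  mutate(column1 = str_trim(column1))

df$column1   # "Text1" "Text99"
```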
answered 2018-03-31T14:29:28.470