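I have read the FAQ, but it is still not clear to me what the implications are of using a key versus not using one on a fairly large data.table built by concatenating a list of data frames.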

From my experiments I only see a difference in performance, but I am not sure whether there is anything else.

# install.packages(c("data.table", "stringi"), dependencies = TRUE)
library(data.table)
library(stringi)
download.file("https://www.ssa.gov/oact/babynames/state/namesbystate.zip", destfile = "namesbystate.zip", mode = "wb")
unzip("namesbystate.zip", exdir = ".")
# List all the per-state text files
filelist = list.files(path = ".", pattern = "\\.TXT$")
colnamelist = c("State", "gender", "year", "name", "frequency")
# Read each CSV-formatted text file into a data.frame
babynames = lapply(filelist, FUN = read.csv, header = FALSE, col.names = colnamelist)
# rbindlist() already returns a data.table, so no extra data.table() call is needed
DT = rbindlist(babynames, use.names = FALSE, fill = FALSE)
dim(DT) #[1] 5647426       5

# Case 1: no key, grouped by name and year
setkey(DT, NULL)
system.time(head(DT[, stri_length(name), by = c("name", "year")]))
#    user  system elapsed 
#  156.47    0.03  157.64 

# Case 2: keyed on year, grouped by name only
setkey(DT, year)
system.time(head(DT[, stri_length(name), by = name]))
#    user  system elapsed 
#    8.90    0.00    8.99 

The output is the same in both cases:

      name year V1
1:    Anna 1910  4
2:   Annie 1910  5
3: Dorothy 1910  7
4:   Elsie 1910  5
5:   Helen 1910  5
6:    Lucy 1910  4
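
For reference, a like-for-like check of the keyed and unkeyed cases could be sketched as follows. This is only a minimal sketch on a small synthetic table: the data, the column name `len`, and the use of base `nchar()` in place of `stri_length()` are assumptions made for illustration, not part of the experiment above. It uses the same `by` in both runs, so that the key is the only thing that changes.

```r
library(data.table)

# Small synthetic table (hypothetical data, not the SSA files)
set.seed(1)
DT <- data.table(
  name = sample(c("Anna", "Annie", "Dorothy", "Elsie"), 1e5, replace = TRUE),
  year = sample(1910:1920, 1e5, replace = TRUE)
)

# Unkeyed: group by name and year
setkey(DT, NULL)
unkeyed <- DT[, .(len = nchar(name[1L])), by = c("name", "year")]

# Keyed on year: setkey() sorts the rows by year and records the key
setkey(DT, year)
key(DT)
keyed <- DT[, .(len = nchar(name[1L])), by = c("name", "year")]

# Put both results in the same row order before comparing,
# since the order of groups follows the row order of DT
setorder(unkeyed, name, year)
setorder(keyed, name, year)
all.equal(unkeyed, keyed, check.attributes = FALSE)
```

Wrapping each grouped call in `system.time()` would then time the two cases with identical grouping.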