
Question

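I am currently trying to write a function that filters certain rows of a disk.frame object using a regular expression. Unfortunately, I am running into problems with how my search string is evaluated inside the filter call. My idea was to pass the regular expression as a string to a function argument (e.g. storm_name) and then pass that argument on to my filtering call. To filter the rows I used the %like% function included in {data.table}.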

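My problem is that the storm_name object is evaluated inside the disk.frame. However, since storm_name exists only in the function environment and not in the disk.frame object, I get the following error: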

Error in .checkTypos(e, names_x) : 
  Object 'storm_name' not found amongst name, year, month, day, hour and 8 more

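I have already tried to evaluate the storm_name object in the parent frame using eval(storm_name, env = parent.env()), but that did not help either. Interestingly, the problem only occurs with {disk.frame} objects, not with {data.table} objects.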

I have since found a solution with {dplyr}. However, I would appreciate any ideas on how to solve this with {data.table}.

Reproducible example

# Load packages
library(data.table)
library(disk.frame)

# Create data table and diskframe object of storm data
storms_df <- as.disk.frame(storms)
storms_dt <- as.data.table(storms)

# Create search function
grep_storm_name <- function(dfr, storm_name){
  
  dfr[name %like% storm_name]
  
}

# Check function with data.table object
grep_storm_name(storms_dt, "^A")

# Check function with diskframe object
grep_storm_name(storms_df, "^A")

Session info

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_Sweden.1252  LC_CTYPE=English_Sweden.1252    LC_MONETARY=English_Sweden.1252
[4] LC_NUMERIC=C                    LC_TIME=English_Sweden.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] disk.frame_0.5.0  purrr_0.3.4       dplyr_1.0.7       data.table_1.14.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7            benchmarkmeData_1.0.4 pryr_0.1.4            pillar_1.6.4         
 [5] compiler_4.1.0        iterators_1.0.13      tools_4.1.0           digest_0.6.27        
 [9] bit_4.0.4             jsonlite_1.7.2        lifecycle_1.0.1       tibble_3.1.6         
[13] lattice_0.20-44       pkgconfig_2.0.3       rlang_0.4.12          Matrix_1.3-3         
[17] foreach_1.5.1         rstudioapi_0.13       DBI_1.1.1             parallel_4.1.0       
[21] bigassertr_0.1.4      bigreadr_0.2.4        httr_1.4.2            stringr_1.4.0        
[25] globals_0.14.0        generics_0.1.1        fs_1.5.0              vctrs_0.3.8          
[29] bit64_4.0.5           grid_4.1.0            tidyselect_1.1.1      glue_1.6.0           
[33] listenv_0.8.0         R6_2.5.1              future.apply_1.7.0    parallelly_1.25.0    
[37] fansi_1.0.0           magrittr_2.0.1        codetools_0.2-18      ellipsis_0.3.2       
[41] fst_0.9.4             assertthat_0.2.1      future_1.21.0         benchmarkme_1.0.7    
[45] utf8_1.2.2            stringi_1.7.6         doParallel_1.0.16     crayon_1.4.2 

2 Answers


I don't know the exact cause, but it has to do with environments, search paths, and the like. For example, these work:

storms_df[name %like% "^A"]

nm <- "^A"
storms_df[name %like% nm]

grep1 <- function(dfr, storm_name) { dfr[name %like% "^A"]; }
grep1(storms_df)

But this does not:

grep2 <- function(dfr, storm_name) { dfr[name %like% storm_name]; }
grep2(storms_df, "^A")
# Error in .checkTypos(e, names_x) : 
#   Object 'storm_name' not found amongst name, year, month, day, hour and 8 more

We can use eval(substitute(..)).

grep3 <- function(dfr, storm_name) { 
  eval(substitute(dfr[name %like% storm_name], list(storm_name = storm_name)))
}
grep3(storms_df, "^A")
#        name  year month   day  hour   lat  long              status category  wind pressure ts_diameter hu_diameter
#      <char> <num> <num> <int> <num> <num> <num>              <char>    <ord> <int>    <int>       <num>       <num>
#   1:    Amy  1975     6    27     0  27.5 -79.0 tropical depression       -1    25     1013          NA          NA
#   2:    Amy  1975     6    27     6  28.5 -79.0 tropical depression       -1    25     1013          NA          NA
#   3:    Amy  1975     6    27    12  29.5 -79.0 tropical depression       -1    25     1013          NA          NA
# ...

(grep3(storms_dt, "^A") works as well.)

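This works by changing the symbol storm_name inside the [-expression from storm_name to the literal string. Since this is done on the unevaluated expression, no lookup has happened yet, and there is no search through this environment and the inherited ones to find storm_name.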
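The substitution step can be looked at on its own in base R, without either package loaded. This is only a sketch: the top-level storm_name stands in for the function argument, and bquote() is shown as an alternative spelling that is not part of the answer above.

```r
# Stand-in for the function argument (assumption for this sketch).
storm_name <- "^A"

# substitute() replaces the symbol with its value; nothing is evaluated yet,
# so neither `dfr` nor `%like%` has to exist at this point.
call_subst <- substitute(dfr[name %like% storm_name],
                         list(storm_name = storm_name))

# bquote() builds a call of the same shape; .() marks the part to fill in.
call_bquote <- bquote(dfr[name %like% .(storm_name)])

print(call_subst)
print(call_bquote)

# The symbol `storm_name` no longer occurs in the call.
"storm_name" %in% all.names(call_subst)
```

Inside a function, either call would then be passed to eval(), as grep3 does.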

If you inspect it manually:

debug(grep3)
grep3(storms_df, "^A")
# debugging in: grep3(storms_df, "^A")
# debug at #1: {
#     eval(substitute(dfr[name %like% storm_name], list(storm_name = storm_name)))
# }
# Browse[2]> 
substitute(dfr[name %like% storm_name], list(storm_name = storm_name))
# dfr[name %like% "^A"]

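I think this has to do with how disk.frame affects the environment inside [ and the calling/parent environments. Interestingly (to me), you can see that the search path for variables is not empty; it just is not the one we would expect: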

grep2 <- function(dfr, storm_name) { dfr[name %like% storm_name]; }
grep2(storms_df, "^A")
# Error in .checkTypos(e, names_x) : 
#   Object 'storm_name' not found amongst name, year, month, day, hour and 8 more

### but let's pre-define `storm_name` outside of the function,
### then re-define the function (no change)
storm_name <- "^A"
grep2 <- function(dfr, storm_name) { dfr[name %like% storm_name]; }
head(grep2(storms_df, "^A"), 2)
#      name  year month   day  hour   lat  long              status category  wind pressure ts_diameter hu_diameter
#    <char> <num> <num> <int> <num> <num> <num>              <char>    <ord> <int>    <int>       <num>       <num>
# 1:    Amy  1975     6    27     0  27.5   -79 tropical depression       -1    25     1013          NA          NA
# 2:    Amy  1975     6    27     6  28.5   -79 tropical depression       -1    25     1013          NA          NA

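That seems to work, but we can see that it is using the outer version of storm_name rather than the argument version: the name values still start with A even after changing the pattern to "^B".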

head(grep2(storms_df, "^B"), 2)
#      name  year month   day  hour   lat  long              status category  wind pressure ts_diameter hu_diameter
#    <char> <num> <num> <int> <num> <num> <num>              <char>    <ord> <int>    <int>       <num>       <num>
# 1:    Amy  1975     6    27     0  27.5   -79 tropical depression       -1    25     1013          NA          NA
# 2:    Amy  1975     6    27     6  28.5   -79 tropical depression       -1    25     1013          NA          NA

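Frankly, I don't know enough about the internals of disk.frame to say whether this is a bug or a necessity, given what it has to do to support data.table's non-standard evaluation on datasets that are not fully in memory.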
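The symptom, a lookup that skips the function's own frame, can be imitated in base R. The sketch below is a toy model only and makes no claim about how disk.frame is implemented; toy_filter, wrong_env, grep_ok and grep_off are hypothetical names made up for the illustration.

```r
# Toy `[`-like helper: captures the condition unevaluated, then evaluates it
# with the data frame's columns first and `enclos` as the fallback.
toy_filter <- function(df, cond, enclos = NULL) {
  caller <- parent.frame()            # frame of the function that called us
  cond_expr <- substitute(cond)
  if (is.null(enclos)) enclos <- caller
  keep <- eval(cond_expr, envir = df, enclos = enclos)
  df[keep, , drop = FALSE]
}

df <- data.frame(name = c("Amy", "Bob", "Ana"), stringsAsFactors = FALSE)

# An unrelated environment that happens to hold its own `pattern`.
wrong_env <- new.env(parent = baseenv())
assign("pattern", "^B", envir = wrong_env)

# Falls back to the caller's frame, so the argument is found.
grep_ok <- function(df, pattern) toy_filter(df, grepl(pattern, name))

# Falls back to the unrelated environment, so the argument is ignored.
grep_off <- function(df, pattern) {
  toy_filter(df, grepl(pattern, name), enclos = wrong_env)
}

grep_ok(df, "^A")$name   # "Amy" "Ana"
grep_off(df, "^A")$name  # "Bob"
```

If wrong_env had no pattern at all, grep_off would fail with an "object not found" error, which mirrors the first half of the behaviour shown above.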


If you are concerned about performance (a fair question), the eval(substitute(..)) approach does not seem to suffer much:

bench::mark(
  raw = dfr[name %like% "^A"],
  subst = eval(substitute(dfr[name %like% storm_name], list(storm_name = storm_name))),
  iterations = 1000
)
# # A tibble: 2 x 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                  memory               time               gc                  
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                  <list>               <list>             <list>              
# 1 raw          12.9ms   16.8ms      55.2    1.69MB     3.97   933    67      16.9s <data.table [990 x 13]> <Rprofmem [669 x 3]> <bench_tm [1,000]> <tibble [1,000 x 3]>
# 2 subst        12.8ms   15.8ms      60.5    1.69MB     3.25   949    51      15.7s <data.table [990 x 13]> <Rprofmem [669 x 3]> <bench_tm [1,000]> <tibble [1,000 x 3]>

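In repeated benchmarks I actually see subst come out slightly faster, which suggests that part of the difference has nothing to do with adding eval(substitute(..)). This gap (55.2 versus 60.5 `itr/sec`) is the largest I have seen; a repeat just now gave 57.1 and 57.5, so I don't think a performance penalty is an issue.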

answered 2022-01-20T16:23:19.103

This now works as of disk.frame v0.6.
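For code that has to run on both sides of that release, a version guard is one option. This is a minimal sketch that assumes "v0.6" means version 0.6 or later; needs_workaround is a hypothetical helper, and in real use the installed version would come from packageVersion("disk.frame").

```r
# Hypothetical helper: TRUE when the given version predates the fix.
needs_workaround <- function(installed, fixed_in = "0.6") {
  package_version(installed) < package_version(fixed_in)
}

needs_workaround("0.5.0")  # version from the session info in the question
needs_workaround("0.6.0")
```

When the helper returns TRUE, the eval(substitute(..)) form from the other answer could be used in place of the direct dfr[name %like% storm_name] call.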

answered 2022-01-30T11:13:49.433