r - Best way to add a column to a data.table by reading data from a variable URL

Question

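I have a .csv file containing the transaction IDs of nearly 1 million transactions related to a Bitcoin wallet (sent and received), which I read into R as a data table. Now I am trying to add another column to the table that lists the fee for each transaction. This can be done with an API call.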

For example, to get the fee for the txid 73336c8b2f8bbf9c4165de515765463d6e835a9f3f87bf822d8bcb23c074ae7f, I have to open: https://blockchain.info/q/txfee/73336c8b2f8bbf9c4165de515765463d6e835a9f3f87bf822d8bcb23c074ae7f and read the data there directly.

What I did: first, I edited the .csv file in Excel to add a new column holding the URL for each row. Then I wrote the following code in R:

for (i in 1:nrow(transactions))
  transactions$fee[i] <- scan(transactions$url[i])

But this way it only updates 2-3 rows per second. I am new to R, so I assume there is a more efficient way to do the same thing.

score 1 · Accepted Answer

We can do quite a bit better than scan() (~15x) by using curl::curl_fetch_memory, e.g. with your URL:

URL <- "https://blockchain.info/q/txfee/73336c8b2f8bbf9c4165de515765463d6e835a9f3f87bf822d8bcb23c074ae7f"

microbenchmark::microbenchmark(
  times = 50L,
  scan = scan(URL, what = integer(), quiet = TRUE),
  GET = as.integer(httr::content(httr::GET(URL))),
  curl = as.integer(rawToChar(curl::curl_fetch_memory(URL)$content))
)
# Unit: microseconds
#  expr      min       lq       mean    median        uq       max neval
#  scan 9388.292 9885.680 10216.9262 10164.120 10502.839 11016.553    50
#   GET 7195.900 7611.485  8342.2855  7832.446  7948.521 22781.104    50
#  curl  511.834  565.067   611.4956   610.391   642.799   790.482    50

identical(
  scan(URL, what = integer(), quiet = TRUE),
  as.integer(rawToChar(curl::curl_fetch_memory(URL)$content))
)
# [1] TRUE

Note: I used integer because it fits your particular URL, but as.numeric may be more appropriate.
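To make the note about as.numeric concrete: R integers are 32-bit, so a value above 2147483647 would come back as NA from as.integer, while as.numeric handles it. Below is a minimal sketch of how the curl call might be wrapped and applied to the whole table. The helper names parse_fee and fetch_fee are hypothetical, and the sketch assumes the table has a url column as described in the question.

```r
# Hypothetical helper: convert a raw response body into a numeric fee.
# as.numeric is used instead of as.integer so that values above
# .Machine$integer.max (2147483647) are not turned into NA.
parse_fee <- function(raw_body) {
  as.numeric(trimws(rawToChar(raw_body)))
}

# Hypothetical helper: fetch the fee for a single URL.
# Returns NA_real_ if the request itself fails, so one bad row
# does not abort the whole run.
fetch_fee <- function(url) {
  tryCatch(
    parse_fee(curl::curl_fetch_memory(url)$content),
    error = function(e) NA_real_
  )
}

# Usage, assuming `transactions` is a data.table with a `url` column:
# transactions[, fee := vapply(url, fetch_fee, numeric(1L))]
```

This still issues one request per row in sequence, so it only replaces the per-request overhead of scan(); it does not address network latency.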

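That said, I still think hitting the network is the biggest bottleneck, and you may find it pays off to try to fetch a payload covering more than one transaction at a time. If that is not possible, your biggest performance improvement will come from parallelization.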
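One way the parallelization could look is to issue the requests concurrently through the curl package's multi interface (curl_fetch_multi, new_pool, multi_run) rather than spawning separate R processes. This is a sketch and not a benchmarked solution: the function name fetch_fees_concurrent and the max_connections default are assumptions, and the server may throttle or reject many simultaneous requests, so the connection limit would need tuning.

```r
# Hypothetical helper: fetch many URLs concurrently and return the fees
# as a numeric vector in the same order as `urls`.
# Entries stay NA where a request fails or returns a non-200 status.
fetch_fees_concurrent <- function(urls, max_connections = 10L) {
  fees <- rep(NA_real_, length(urls))

  # Pool that caps the number of simultaneous connections.
  pool <- curl::new_pool(total_con = max_connections,
                         host_con = max_connections)

  # Build a callback that remembers which position it should fill.
  on_done <- function(idx) {
    force(idx)
    function(res) {
      if (res$status_code == 200L) {
        fees[idx] <<- as.numeric(rawToChar(res$content))
      }
    }
  }

  # Queue all requests; nothing is sent until multi_run() is called.
  for (i in seq_along(urls)) {
    curl::curl_fetch_multi(
      urls[i],
      done = on_done(i),
      fail = function(msg) NULL,  # leave NA on failure
      pool = pool
    )
  }

  # Perform the queued requests and block until all have finished.
  curl::multi_run(pool = pool)
  fees
}

# Usage, assuming `transactions` is a data.table with a `url` column:
# transactions[, fee := fetch_fees_concurrent(url)]
```

With close to a million rows it would probably be safer to call this on chunks of the url column and save intermediate results, so that an interruption does not lose everything fetched so far.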
