ruby - URL encoding problem with curly braces

Question

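I'm having trouble fetching data from GitHub Archive.

The main problem is encoding the {} and .. in my URL. Maybe I'm misreading the GitHub API, or I don't understand the encoding correctly.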

require 'open-uri'
require 'faraday'

conn = Faraday.new(:url => 'http://data.githubarchive.org/') do |faraday|
  faraday.request  :url_encoded             # form-encode POST params
  faraday.response :logger                  # log requests to STDOUT
  faraday.adapter  Faraday.default_adapter  # make requests with Net::HTTP
end

#query = '2015-01-01-15.json.gz' #this one works!!
query = '2015-01-01-{0..23}.json.gz' #this one doesn't work
encoded_query = URI.encode(query)

response = conn.get(encoded_query)
p response.body

score 1 · Accepted Answer

The GitHub Archive example for retrieving a range of files is:

wget http://data.githubarchive.org/2015-01-01-{0..23}.json.gz

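The {0..23} part is expanded into the range 0 .. 23 before any file is fetched. Strictly speaking, the shell does this through brace expansion (in bash, for example), not wget itself. You can test this by running the command with the -v flag, which returns: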

wget -v http://data.githubarchive.org/2015-01-01-{0..1}.json.gz
--2015-06-11 13:31:07--  http://data.githubarchive.org/2015-01-01-0.json.gz
Resolving data.githubarchive.org... 74.125.25.128, 2607:f8b0:400e:c03::80
Connecting to data.githubarchive.org|74.125.25.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2615399 (2.5M) [application/x-gzip]
Saving to: '2015-01-01-0.json.gz'

2015-01-01-0.json.gz                                        100%[===========================================================================================================================================>]   2.49M  3.03MB/s   in 0.8s

2015-06-11 13:31:09 (3.03 MB/s) - '2015-01-01-0.json.gz' saved [2615399/2615399]

--2015-06-11 13:31:09--  http://data.githubarchive.org/2015-01-01-1.json.gz
Reusing existing connection to data.githubarchive.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 2535599 (2.4M) [application/x-gzip]
Saving to: '2015-01-01-1.json.gz'

2015-01-01-1.json.gz                                        100%[===========================================================================================================================================>]   2.42M   867KB/s   in 2.9s

2015-06-11 13:31:11 (867 KB/s) - '2015-01-01-1.json.gz' saved [2535599/2535599]

FINISHED --2015-06-11 13:31:11--
Total wall clock time: 4.3s
Downloaded: 2 files, 4.9M in 3.7s (1.33 MB/s)

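In other words, the values are substituted into the URL, and wget then fetches each resulting URL. This isn't obvious from the example itself, but you can find mentions of it "out there". For example, in "All the Wget Commands You Should Know":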

7. Download a list of sequentially numbered files from a server
wget http://example.com/images/{1..20}.jpg

To do what you want, you need to iterate over the range in Ruby, using something like this untested code:

0.upto(23) do |i|
  response = conn.get("/2015-01-01-#{ i }.json.gz")
  p response.body
end
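
As a rough sketch of the same loop with a basic status check, so that a missing hour doesn't end up in the results: the helper name fetch_hourly_archives is hypothetical, and the HTTP call is left to a block so the example runs without network access. With the Faraday connection from the question, the block could call conn.get(path) and return the response's status and body.

```ruby
# Hypothetical helper: yields each hourly path to the block, which performs the
# request and returns [status, body]. Keeping the HTTP call in the block lets
# the loop work with Faraday, Net::HTTP, or a stub.
def fetch_hourly_archives(date, hours = 0..23)
  hours.each_with_object({}) do |hour, bodies|
    path = "/#{date}-#{hour}.json.gz"
    status, body = yield(path)
    if status == 200
      bodies[path] = body
    else
      warn "Skipping #{path}: HTTP #{status}"
    end
  end
end

# Stubbed usage (assumption: hour 1 is missing) so the sketch runs offline.
# With the connection from the question, the block could be:
#   { |path| r = conn.get(path); [r.status, r.body] }
archives = fetch_hourly_archives('2015-01-01', 0..2) do |path|
  path.end_with?('-1.json.gz') ? [404, ''] : [200, "contents of #{path}"]
end
p archives.keys
```

The result is a hash keyed by path, holding only the bodies of requests that returned 200.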

score 1 · Accepted Answer

To better understand what's going wrong, let's start with the example given in the GitHub Archive documentation:

wget http://data.githubarchive.org/2015-01-01-{0..23}.json.gz

The thing to note here is that {0..23} is automatically expanded by bash. You can see this by running the following command:

echo {0..23}
> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

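This means wget never sees a single URL containing braces: the shell hands it 24 separate URLs, and it fetches each one in turn. The problem you're running into is that Ruby doesn't expand {0..23} automatically the way bash does, so you're making a literal request for http://data.githubarchive.org/2015-01-01-{0..23}.json.gz, which doesn't exist.

Instead, you need to loop through 0..23 yourself and make one call each time: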

(0..23).each do |n|
  # The file name contains only URL-safe characters, so no encoding is needed
  # (URI.encode was removed in Ruby 3.0).
  query = "2015-01-01-#{n}.json.gz"
  response = conn.get(query)
  p response.body
end
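
If you would rather keep the bash-style pattern in your code, the expansion itself can be sketched in plain Ruby. The helper name expand_brace_range is hypothetical, and it only covers a single ascending numeric range such as {0..23}, not everything bash brace expansion supports.

```ruby
# Hypothetical helper: expands one bash-style "{a..b}" numeric range in a string
# into an array of strings, one per number. Ascending ranges only.
def expand_brace_range(pattern)
  match = pattern.match(/\{(\d+)\.\.(\d+)\}/)
  return [pattern] unless match # nothing to expand

  first = match[1].to_i
  last = match[2].to_i
  (first..last).map { |n| pattern.sub(match[0], n.to_s) }
end

urls = expand_brace_range('http://data.githubarchive.org/2015-01-01-{0..23}.json.gz')
puts urls.length
puts urls.first
puts urls.last
```

Each element of urls can then be requested one at a time, just like the loop above.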
