wget - wget returns a 3-byte home page

Question

I am trying to download the web page at www.oabt.org. With a browser I get the full HTML, but with wget I only get a 3-byte page.

➜  spider git:(master) wget http://www.oabt.org/
--2013-02-06 01:45:11--  http://www.oabt.org/
Resolving www.oabt.org... 125.64.93.243
Connecting to www.oabt.org|125.64.93.243|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3 [text/html]
Saving to: ‘index.html’


100%[===============================================================================>] 3           --.-K/s   in 0s      

2013-02-06 01:45:12 (117 KB/s) - ‘index.html’ saved [3/3]

➜  spider git:(master) ✗ xxd -l 100 ./index.html 
0000000: efbb bf

How can I fetch this site's home page correctly?

score 1 · Accepted Answer

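I dumped the HTTP connection with Wireshark and compared the headers sent by wget with those sent by the browser. I then tried to reproduce the browser's request using wget's --header parameter, adding headers until I found that the site needs an Accept-Encoding: gzip header in order to reply correctly.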
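The header comparison described above can be sketched as a small shell helper. This is only an illustration: it assumes each request was saved as a plain text file (for example from Wireshark's "Follow TCP Stream" view), and the function and file names are hypothetical, not something the original answer used.

```shell
# Hypothetical helper: list the header names of one saved raw HTTP request.
# Assumes line 1 is the request line and a blank line ends the headers.
header_names() {
  tr -d '\r' < "$1" |
    awk -F: 'NR > 1 && /^$/ { exit } NR > 1 { print tolower($1) }' |
    sort -u
}

# Hypothetical helper: print header names present in request $2 but absent from $1.
missing_headers() {
  hdr_a=$(mktemp)
  hdr_b=$(mktemp)
  header_names "$1" > "$hdr_a"
  header_names "$2" > "$hdr_b"
  comm -13 "$hdr_a" "$hdr_b"   # lines unique to the second file
  rm -f "$hdr_a" "$hdr_b"
}

# Usage (file names are placeholders):
#   missing_headers wget_request.txt browser_request.txt
```

Each name it prints is a candidate to try with wget's --header option, one at a time, until the server's reply changes.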

In short, the working command becomes:

 wget --header='Accept-Encoding: gzip' http://www.oabt.org/index.php

But this saves the gzipped content as-is...
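If a page has already been saved this way, one way to tell whether the file is really gzip-compressed is to look at its first two bytes, which are 1f 8b for gzip data. The helpers below are a minimal sketch with hypothetical names. The 3-byte file from the question (ef bb bf, a UTF-8 byte order mark) would not match.

```shell
# Hypothetical helper: succeed if the file starts with the gzip magic bytes 1f 8b.
is_gzip() {
  magic=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
  [ "$magic" = "1f8b" ]
}

# Hypothetical helper: decompress a saved file in place, but only if it is gzipped.
gunzip_if_needed() {
  if is_gzip "$1"; then
    gunzip -c "$1" > "$1.tmp" && mv "$1.tmp" "$1"
  fi
}

# Usage (assuming wget saved the response as index.php):
#   gunzip_if_needed index.php
```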

If you want to decompress the page on the fly, use the following command:

wget -O- --header='Accept-Encoding: gzip' \
http://www.oabt.org/index.php | gunzip - > index.html

...and the gzipped content will be decompressed and redirected to the index.html file.
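The pipeline above fails if a server ignores the header and replies uncompressed, because gunzip rejects input that is not gzip data. A slightly more tolerant variant, sketched here with hypothetical function names, relies on gzip's documented -f behaviour: combined with -d and -c, it passes unrecognized input through unchanged.

```shell
# Hypothetical helper: decompress stdin if it is gzip data, otherwise copy it as-is.
# Relies on "gzip -dcf" passing non-gzip input through unchanged.
maybe_gunzip() {
  gzip -dcf
}

# Hypothetical wrapper around the command from the answer.
fetch_page() {
  wget -q -O- --header='Accept-Encoding: gzip' "$1" | maybe_gunzip
}

# Usage:
#   fetch_page http://www.oabt.org/index.php > index.html
```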
