I have to crawl the website http://docbao.com.vn/ using wget, but wget always prints this message:

HTTP request sent, awaiting response... No data received.
Retrying.

For example, I tried to crawl all the pages in the category http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec, and this was the result:

congnh@congnh-pc:~/Source/datasection/congnh-crawler/sh$ wget "http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec" -O -
--2013-02-20 23:53:16--  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Resolving docbao.com.vn (docbao.com.vn)... 123.30.51.174
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:17--  (try: 2)  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:19--  (try: 3)  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:22--  (try: 4)  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:27--  (try: 5)  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:32--  (try: 6)  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:38--  (try: 7)  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:45--  (try: 8)  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:53--  (try: 9)  http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
...

Why does wget retry "unlimitedly"? Or what is the problem?
Thanks
Cong

1 Answer

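Sorry to state the obvious, but: wget retries because it receives no data. It sends the HTTP headers, and the remote host then closes the connection immediately. I can only guess that this non-standard behavior is due to a server-side misconfiguration, possibly a deliberate one.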

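After some digging, I found that the content is served once you signal that you can handle gzip-encoded responses. You can do this by adding --header="accept-encoding: gzip" to your wget command. This is still a problem for crawling with wget, because it cannot recurse into gzip-compressed content. You will need to write a script to handle this case, or use another tool that can handle such content.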
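
As a rough illustration of the "write a script" route, here is a minimal Python sketch using only the standard library. It assumes the server behaves as described above (it answers only when gzip is accepted); the helper names decode_body, extract_links and fetch are made up for this example and are not part of wget or any library, and I have not verified this against the site itself.

```python
import gzip
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


def decode_body(raw: bytes, content_encoding: str) -> bytes:
    """Decompress the body if the server labelled it as gzip; otherwise return it unchanged."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(raw)
    return raw


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> list:
    """Return absolute URLs for all links found in the given HTML text."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def fetch(url: str, timeout: float = 10.0) -> str:
    """Request a page while advertising gzip support, and return the decoded text."""
    # urllib does not decompress responses on its own, so we do it in decode_body.
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        encoding = response.headers.get("Content-Encoding", "")
        # Assumption: fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
    return decode_body(raw, encoding).decode(charset, errors="replace")
```

To list the links on the category page from the question, you would call extract_links(fetch(url), url) with url set to http://docbao.com.vn/chuyenmuc/muc-1/Quoc_te.dec. A real crawler would also need a queue of pages to visit, a set of already-visited URLs, and some delay between requests.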

Side note: be aware that not all websites allow their content to be crawled. Check their terms of service before doing so.

Answered 2013-03-01T14:13:54.170