I have a file containing millions of URLs. The data file looks like this:
http://wonderland.cjfallon.ie/
http://www.youtube.com/
http://www.starfall.com/
http://education.scholastic.co.uk/
http://www.scoilnet.ie/
http://www.nessy.com/
http://www.senteacher.org/
http://scoop.it/
http://www.moviemaker.com/
http://learni.st/
http://www.twitter.com/
http://www.facebook.com/
http://www.gutenberg.org/
http://www.gutenberg.org/cache/epub/42361/pg42361.txt
I want to crawl them. The job is bound by network I/O, so I want to use multiple threads or gevent to handle it.
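For reference, here is a minimal sketch of the thread-based approach using only the Python 3 standard library. It is not the code from my gists; the names `fetch_status` and `crawl`, the pool size of 20, and the 10-second timeout are assumptions chosen for illustration. The idea is that each URL is fetched on a worker thread and any exception is recorded per URL, so a single bad host cannot stop the whole run.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch_status(url, timeout=10):
    """Return the final HTTP status code for url (urlopen follows redirects)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.status


def crawl(urls, fetch=fetch_status, workers=20):
    """Fetch every URL on a thread pool.

    Returns a dict mapping each URL to its status code, or to the
    exception raised while fetching it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to the URL it was submitted for.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Record the failure and keep going with the other URLs.
                results[url] = exc
    return results
```

Something like `crawl(line.strip() for line in open("urls.txt"))` would run it over a file (the filename is hypothetical). With millions of URLs it would probably be better to feed the pool in batches rather than submit everything at once, since this sketch keeps one future per URL in memory.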
My multithreaded code works: https://gist.github.com/young001/5449751
But when I use gevent, with this code: https://gist.github.com/young001/baa3eebbf7342c5ac077 it always fails with output like this:
status is 200
status is 200
Internal error in evhttp
the url is down http://web2.socialcomputingmagazine.com/the_social_graph_issues_and_strategies_in_2008.htm
the reason
status is 200
status is 200
status is 200
status is 200
status is 200
status is 200
status is 301
status is 200
status is 301
status is 200
status is 200
Internal error in evhttp
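Then it stops. I don't know why this happens.
Can anyone help?
It seems like everything should run smoothly, but it doesn't, and it is driving me crazy.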