0

BadStatusLine: '' error when calling tldextract.extract(url)

subdomain, domain, tld = tldextract.extract(url)
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 194, in extract
    return TLD_EXTRACTOR(url)
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 128, in __call__
    return self._extract(netloc)
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 132, in _extract
    registered_domain, tld = self._get_tld_extractor().extract(netloc)
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 165, in _get_tld_extractor
    tlds = frozenset(tld for tld_source in tld_sources for tld in tld_source())
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 165, in <genexpr>
    tlds = frozenset(tld for tld_source in tld_sources for tld in tld_source())
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 204, in _PublicSuffixListSource
    page = _fetch_page('http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1')
  File "/usr/local/venv/local/lib/python2.7/site-packages/tldextract/tldextract.py", line 198, in _fetch_page
    return unicode(urllib2.urlopen(url).read(), 'utf-8')
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 400, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 418, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1180, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
    raise BadStatusLine(line)
BadStatusLine: ''
4

3 Answers

4

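This happens because the mozilla.org URL in the stack trace (http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1) is unavailable, and tldextract tries to update from that URL when it is first installed. This live update can be disabled (see below), but the uncaught exception is a tldextract bug. It should only log the exception and fall back seamlessly to the PSL snapshot bundled with the package.

This has been fixed in tldextract 1.2.1, just released to PyPI. It switches to a GitHub mirror of the PSL, so upgrading should resolve the uncaught exception.

Another release, coming soon, will avoid future uncaught exceptions when, for example, the GitHub PSL mirror is unavailable.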
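The log-and-fall-back behaviour described above could look roughly like the sketch below. This is not tldextract's actual code: `load_suffixes`, `fetch_live` and `bundled_suffixes` are hypothetical names used only for illustration, and the import shim is there so the snippet runs on both Python 2 (as in the traceback) and Python 3.

```python
import logging

try:  # Python 2, as in the traceback
    from httplib import BadStatusLine
except ImportError:  # Python 3
    from http.client import BadStatusLine

log = logging.getLogger(__name__)


def load_suffixes(fetch_live, bundled_suffixes):
    """Return the live suffix set, or the bundled one if fetching fails.

    `fetch_live` is a hypothetical zero-argument callable that downloads
    and returns an iterable of suffixes; it is not part of tldextract.
    """
    try:
        return frozenset(fetch_live())
    except (IOError, BadStatusLine) as exc:
        # Log the failure and carry on instead of letting it reach the caller.
        log.warning("Live PSL fetch failed (%r); using bundled copy", exc)
        return frozenset(bundled_suffixes)
```

With this shape, a dead server costs a warning in the log rather than a crash inside every call to the extractor.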

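Turning off the default fetch

In earlier versions, you can avoid this problem by turning off the default fetch that runs on first install. Construct your own TLDExtract callable with fetch=False. From the docs: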

import tldextract
no_fetch_extract = tldextract.TLDExtract(fetch=False)
no_fetch_extract('http://www.google.com')
answered 2013-10-09T20:23:52.540
2

The package is trying to download the Public Suffix List from a URL that is currently not working:

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1

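This is due to a DDoS attack against that URL, which Mozilla has blocked for now.

The problem has been reported to the project and a fix has been proposed, although the fix only works if you already have a cached copy of the Public Suffix List.

In the meantime, use the publicsuffix package instead; it bundles the data in the package itself and needs no URL request.

Update: Mozilla now hosts the file at https://publicsuffix.org/list/effective_tld_names.dat, and any request to the MXR source repository without an mxr.mozilla.org Referer header is redirected to that new location.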
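If you download the file from the new location yourself, the rules are easy to pull out. The sketch below assumes the published list format (blank lines and lines starting with `//` are ignored, and a rule is read up to the first whitespace); `parse_psl` is a hypothetical helper, not part of tldextract or publicsuffix, and it keeps wildcard (`*.`) and exception (`!`) rules verbatim rather than interpreting them.

```python
def parse_psl(text):
    """Return the set of rules found in Public Suffix List text.

    Blank lines and lines starting with '//' are skipped; each rule is
    the first whitespace-delimited token on its line.
    """
    rules = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('//'):
            continue
        rules.add(line.split()[0])
    return rules


# A tiny hand-written sample in the same format as the real file.
sample = u"""// example comment
com

*.ck
!www.ck
"""
print(sorted(parse_psl(sample)))
```

Feeding it the decoded contents of effective_tld_names.dat instead of `sample` gives the full rule set without going through mxr.mozilla.org.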

answered 2013-10-09T10:14:15.647
0

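This happens because http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1 is not being served.

If you want to keep using tldextract to get the subdomain, domain and tld, a temporary workaround is to use a cache file, for example in project/tldextractor/__init__.py: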

import os 
import tldextract
TLD_CACHE_PATH = os.path.join(
    os.path.abspath(os.path.dirname(__file__)), 'tldextract_cache')
tldextractor = tldextract.TLDExtract(cache_file=TLD_CACHE_PATH, fetch=False)

The cache file project/tldextractor/tldextract_cache is available at https://gist.github.com/AJamesPhillips/6899560

Then:

from .tldextractor import tldextractor
tldextractor('http://subdomain.domain.tld')
answered 2013-10-09T10:13:40.363