python - Python 3 RecursionError when decoding Unicode (with BeautifulSoup/RoboBrowser)

Question

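I'm building a web-scraping component with BeautifulSoup and RoboBrowser, and I've hit a strange problem in one case. The offending page has the same chrome and structure as all the other cases that work fine, but its main data field (a neatly labelled div) is one enormous line (roughly 3,000 characters of Japanese text) with no line breaks. It is full of BR tags (they use them, rather horribly, to format tables...) and a few SPAN tags for formatting, but the entire body text is a single line.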

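That doesn't seem like it should be a problem, but my scraper dies with RecursionError: maximum recursion depth exceeded in comparison after spitting out hundreds (possibly thousands) of pairs of these same lines: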

File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1126, in decode
  indent_contents, eventual_encoding, formatter)
File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1195, in decode_contents
  formatter))

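I initially blamed BeautifulSoup and thought the sheer number of BR tags was the cause, but the problem actually seems to lie with Unicode. Here is the code that throws it: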

File "/Users/myself/Projects/Scraper/scrape.py", line 207, in articles
  self._childtext = re.sub('<[^<]+?>', '', str(self._one_child).replace('<br/>', '\n'))
File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1039, in __unicode__
  return self.decode()

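I thought it might be the line length, which is why I parse the DIV block piece by piece rather than all in one go, but that didn't help in the slightest. No matter how small the chunk, the str(bsObject) call seems to send the Unicode parser into a frenzy.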
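One way to sidestep the str() call on line 207 might be an event-driven pass over the raw markup, which builds no tree and so does not recurse once per nesting level. The sketch below is a swapped-in technique using the standard library's html.parser, not the approach in scrape.py. BrAwareTextExtractor and fragment_to_text are hypothetical names, and it assumes the fragment's raw HTML is available as a string, which nothing in the post confirms.

```python
from html.parser import HTMLParser


class BrAwareTextExtractor(HTMLParser):
    """Collect the text of an HTML fragment, turning <br> tags into newlines.

    Hypothetical helper for illustration; not part of bs4 or RoboBrowser.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Called for <br>; all other opening tags are simply dropped.
        if tag == "br":
            self._parts.append("\n")

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing forms such as <br/> or <br />.
        if tag == "br":
            self._parts.append("\n")

    def handle_data(self, data):
        # Text between tags is kept unchanged.
        self._parts.append(data)

    def text(self):
        return "".join(self._parts)


def fragment_to_text(html_fragment):
    """Strip tags from html_fragment, keeping text and <br> line breaks."""
    parser = BrAwareTextExtractor()
    parser.feed(html_fragment)
    parser.close()
    return parser.text()
```

For a fragment such as `<div>一行目<br/>二行目</div>`, fragment_to_text would return the two text runs separated by a newline, which is roughly what the re.sub call on line 207 is after.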

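To thicken the plot slightly: I copied the entire page source as one long string into a new Python sandbox so I could test different code against it without logging in to the site all the time. Python immediately refused to compile the code (complaining that it contained non-UTF-8 characters), even after I ran the text through vi and forced it to save as UTF-8. However, inserting line breaks into the text to split it into smaller chunks stopped this error from appearing, even though I didn't change or remove a single character of the text itself; at that point the script compiled and scraped the page perfectly.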

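I don't know where to go from here. I have no control over the site I'm scraping. I have thought about forcing line breaks into the response object in RoboBrowser before BeautifulSoup touches it, which is a horrible hack but seems like it would solve the problem, but I don't know how to do that. Can anyone suggest another approach?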
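For reference, the line-break hack described above could look roughly like this on a plain string. It is only a sketch: break_after_br is a hypothetical name, the pattern assumes the BR tags carry no attributes, and how to hand the result back to RoboBrowser is left open. Whether it actually avoids the RecursionError is untested here.

```python
import re

# Matches <br>, <br/>, <br /> in any letter case (assumes no attributes).
_BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def break_after_br(raw_html):
    """Return raw_html with a newline appended after every <br> tag.

    No existing character is changed or removed; newlines are only added.
    """
    return _BR_TAG.sub(lambda match: match.group(0) + "\n", raw_html)
```

The processed string could then be parsed directly, for example with BeautifulSoup(processed, "html.parser"), instead of going through the browser's own parsed page.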

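(Unfortunately I can't link to the page I'm scraping, because it belongs to a research data vendor that requires a login and has no permanent URLs for individual items.)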

Edit: full stack trace added below...

Traceback (most recent call last):
  File "scrape.py", line 112, in <module>
    dataScrape()
  File "scrape.py", line 39, in dataScrape
    for article in scraper.articles():
  File "/Users/myself/Projects/Scraper/scrape.py", line 207, in articles
    self._childtext = re.sub('<[^<]+?>', '', str(self._one_child).replace('<br/>', '\n'))
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1039, in __unicode__
    return self.decode()
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1126, in decode
    indent_contents, eventual_encoding, formatter)
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1195, in decode_contents
    formatter))
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1126, in decode
    indent_contents, eventual_encoding, formatter)
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1195, in decode_contents
    formatter))
#
# These lines repeat identically several hundred times, then...
#
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1192, in decode_contents
    text = c.output_ready(formatter)
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 716, in output_ready
    output = self.format_string(self, formatter)
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 158, in format_string
    if not isinstance(formatter, collections.Callable):
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/abc.py", line 182, in __instancecheck__
    if subclass in cls._abc_cache:
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/_weakrefset.py", line 75, in __contains__
    return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison
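The alternating decode / decode_contents frames suggest that bs4 recurses once per level of tag nesting until CPython's recursion limit (1000 by default) is reached. A minimal sketch of one commonly used stopgap is to raise that limit only around the failing call. recursion_limit is a hypothetical helper name, and the value to pass is a guess that depends on how deeply the parsed tree is nested; setting it too high can crash the interpreter instead of raising RecursionError.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily set the interpreter's recursion limit, then restore it."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        # Restore the old limit even if the wrapped code raised.
        sys.setrecursionlimit(previous)
```

The str(self._one_child) call from line 207 would then sit inside a `with recursion_limit(...):` block. This treats the symptom only; it does not explain why the parsed tree is so deep.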
