I'm trying to parse YouTube comments with BeautifulSoup 4 on Python 2.7. For any YouTube video I try, I get back text full of BOMs, and not just at the start of the file:

<p> thank you kind sir :)</p>

One shows up in almost every comment. That's not the case for other sites (guardian.co.uk). The code I'm using:

import urllib2
from bs4 import BeautifulSoup

# Source (should be taken from file to allow updating but not during wip):
source_url = 'https://www.youtube.com/watch?v=aiYzrCjS02k&feature=related'

# Get html from source:
response = urllib2.urlopen(source_url)
html = response.read()

# html comes with BOM everywhere, which is real ***, get rid of it!
html = html.decode("utf-8-sig")

soup = BeautifulSoup(html)

strings = soup.findAll("div", {"class" : "comment-body"})
print strings

As you can see, I've already tried decoding, but as soon as I run it through BeautifulSoup, the BOM characters come back. Any ideas?
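
For reference, here's a quick way to count how many BOMs survive the decode (using the html variable after the decode above; this is just a sanity check, not part of my actual script):

# Count the decoded BOM characters. "utf-8-sig" only strips a BOM at the
# very start of the data, so any interior ones remain after decoding.
print html.count(u'\ufeff')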

1 Answer

This looks like invalid output on YouTube's side, but since you can't just tell them to fix it, you need a workaround.

So, here's a simple workaround:

# html comes with BOM everywhere, which is real ***, get rid of it!
html = html.replace(b'\xEF\xBB\xBF', b'')
html = html.decode("utf-8")

(The b prefixes are unnecessary but harmless in Python 2.7, and they will make your code work in Python 3… On the other hand, they will break it in Python 2.5, so if that matters more to you, get rid of them.)
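
Dropped into the code from your question, the whole thing would look roughly like this (same URL and class name, nothing else changed; a sketch I haven't run against the live page):

import urllib2
from bs4 import BeautifulSoup

source_url = 'https://www.youtube.com/watch?v=aiYzrCjS02k&feature=related'

# Get the raw bytes from the source:
response = urllib2.urlopen(source_url)
html = response.read()

# Strip every UTF-8 BOM byte sequence, then decode normally:
html = html.replace(b'\xEF\xBB\xBF', b'')
html = html.decode("utf-8")

soup = BeautifulSoup(html)
strings = soup.findAll("div", {"class": "comment-body"})
print strings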

Alternatively, you can decode first and then replace(u'\uFEFF', u''). This should have exactly the same effect (the extra BOMs decode harmlessly to U+FEFF, which the replace then strips). But I think it makes more sense to fix the UTF-8 and then decode it, rather than decoding and then fixing the result.
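
In code, that variant is just as short (again assuming the same response object as in your question; untested sketch):

html = response.read()

# Decode first; each interior BOM becomes the character U+FEFF, then strip it:
html = html.decode("utf-8")
html = html.replace(u'\uFEFF', u'')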

Answered on 2012-10-30T22:05:09.813