I'm trying to parse YouTube comments with BeautifulSoup 4 on Python 2.7. For any YouTube video I try, I get back text full of BOMs, and not just at the start of the file:

<p> thank you kind sir :)</p>

One shows up in almost every comment. That's not the case for other sites (guardian.co.uk). The code I'm using:

import urllib2
from bs4 import BeautifulSoup

# Source (should be taken from file to allow updating but not during wip):
source_url = 'https://www.youtube.com/watch?v=aiYzrCjS02k&feature=related'

# Get html from source:
response = urllib2.urlopen(source_url)
html = response.read()

# html comes with BOM everywhere, which is real ***, get rid of it!
html = html.decode("utf-8-sig")

soup = BeautifulSoup(html)

strings = soup.findAll("div", {"class" : "comment-body"})
print strings

As you can see, I've already tried decoding, but as soon as I run it through BeautifulSoup, the BOM characters come back. Any ideas?
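
For reference, here's a quick way to count how many BOMs survive the decode (using the html variable after the decode above; this is just a sanity check, not part of my actual script):

# Count the decoded BOM characters. "utf-8-sig" only strips a BOM at the
# very start of the data, so any interior ones remain after decoding.
print html.count(u'\ufeff')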

1 Answer

This looks like invalid output on YouTube's side, but since you can't just tell them to fix it, you need a workaround.

So, here's a simple workaround:

# html comes with BOM everywhere, which is real ***, get rid of it!
html = html.replace(b'\xEF\xBB\xBF', b'')
html = html.decode("utf-8")

(The b prefixes are unnecessary but harmless in Python 2.7, and they will make your code work in Python 3… On the other hand, they will break it in Python 2.5, so if that matters more to you, get rid of them.)
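
Dropped into the code from your question, the whole thing would look roughly like this (same URL and class name, nothing else changed; a sketch I haven't run against the live page):

import urllib2
from bs4 import BeautifulSoup

source_url = 'https://www.youtube.com/watch?v=aiYzrCjS02k&feature=related'

# Get the raw bytes from the source:
response = urllib2.urlopen(source_url)
html = response.read()

# Strip every UTF-8 BOM byte sequence, then decode normally:
html = html.replace(b'\xEF\xBB\xBF', b'')
html = html.decode("utf-8")

soup = BeautifulSoup(html)
strings = soup.findAll("div", {"class": "comment-body"})
print strings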

Alternatively, you can decode first and then replace(u'\uFEFF', u''). This should have exactly the same effect (the extra BOMs decode harmlessly to U+FEFF, which the replace then strips). But I think it makes more sense to fix the UTF-8 and then decode it, rather than decoding and then fixing the result.
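
In code, that variant is just as short (again assuming the same response object as in your question; untested sketch):

html = response.read()

# Decode first; each interior BOM becomes the character U+FEFF, then strip it:
html = html.decode("utf-8")
html = html.replace(u'\uFEFF', u'')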

Answered on 2012-10-30T22:05:09.813