I'm trying to parse YouTube comments with BeautifulSoup 4 in Python 2.7. Whichever YouTube video I try, I get text full of BOMs, and not just at the start of the file:
<p> thank you kind sir :)</p>
One shows up in nearly every comment. This does not happen with other websites (guardian.co.uk). The code I'm using:
import urllib2
from bs4 import BeautifulSoup

# Source (should be taken from file to allow updating but not during wip):
source_url = 'https://www.youtube.com/watch?v=aiYzrCjS02k&feature=related'
# Get html from source:
response = urllib2.urlopen(source_url)
html = response.read()
# html comes with BOM everywhere, which is real ***, get rid of it!
html = html.decode("utf-8-sig")
soup = BeautifulSoup(html)
strings = soup.findAll("div", {"class" : "comment-body"})
print strings
As you can see, I've already tried decoding, but as soon as I run the text through BeautifulSoup, the BOM characters come back. Any ideas?
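One thing that may be worth checking: as far as I understand it, the "utf-8-sig" codec only drops a BOM at the very start of the byte stream, so any U+FEFF characters embedded further into the page would survive the decode. Here is a minimal sketch of that behaviour, assuming the stray characters really are U+FEFF. The `strip_bom_chars` helper and the `raw` sample bytes are made up for illustration and stand in for the result of `response.read()`:

```python
BOM = u"\ufeff"  # the character a UTF-8 BOM decodes to


def strip_bom_chars(text):
    """Remove every U+FEFF from an already-decoded unicode string."""
    return text.replace(BOM, u"")


# Hypothetical sample standing in for response.read(): one BOM at the
# start of the stream and one embedded inside a comment paragraph.
raw = u"\ufeff<p>\ufeff thank you kind sir :)</p>".encode("utf-8")

decoded = raw.decode("utf-8-sig")   # removes only the leading BOM
cleaned = strip_bom_chars(decoded)  # removes the embedded one as well
```

If that assumption holds, passing `cleaned` rather than `decoded` to BeautifulSoup should keep the character out of the extracted comments.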