python - Using Beautiful Soup with accents and different characters

Question

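I'm using Beautiful Soup to pull the medalists from past Olympic Games. It trips over the accents used in some event and athlete names. I've seen similar problems posted online, but I'm new to Python and can't work out how to apply the answers to my code.

If I print my soup, the accents look fine. But once I start parsing the soup (and writing it to a CSV file), the accented characters become garbled. 'Louis Perrée' becomes 'Louis Perr√©e'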

from BeautifulSoup import BeautifulSoup
import urllib2

response = urllib2.urlopen('http://www.databaseolympics.com/sport/sportevent.htm?sp=FEN&enum=130')
html = response.read()
soup = BeautifulSoup(html)

g = open('fencing_medalists.csv','w')
t = soup.findAll("table", {'class' : 'pt8'})

for table in t:
    rows = table.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            theText=str(td.find(text=True))
            #theText=str(td.find(text=True)).encode("utf-8")
            if theText!="None":
                g.write(theText)
            else: 
                g.write("")
            g.write(",")
        g.write("\n")

Thanks very much for your help.

score 3 · Accepted Answer

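If you are dealing with unicode, always treat a response read from disk or the network as a bag of bytes rather than a string.

The text you are reading is probably UTF-8 encoded, so it should be decoded first.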

import codecs
# ...
content = response.read()
html = codecs.decode(content, 'utf-8')
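
As a rough illustration of why the bytes-versus-text distinction matters, the sketch below reproduces the garbling from the question without any network access. It assumes the page bytes are UTF-8 and that whatever displayed the CSV read them as Mac OS Roman; the second part is a guess based on the 'Perr√©e' pattern, not something the question states. It is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

# The athlete's name as text (a unicode string).
name = u'Louis Perr\u00e9e'

# What actually travels over the network or sits on disk: bytes.
raw = name.encode('utf-8')

# Decoding with the matching codec recovers the original text.
decoded_ok = raw.decode('utf-8')

# Decoding the same bytes with a different codec yields mojibake.
# ASSUMPTION: Mac OS Roman is the codec the CSV viewer applied.
decoded_wrong = raw.decode('mac_roman')

print(decoded_ok == name)                             # True
print(len(raw), len(name))                            # 13 12
print(decoded_wrong == u'Louis Perr\u221a\u00a9e')    # True
```

The accented character occupies two bytes in UTF-8, and a single-byte codec turns those two bytes into two unrelated characters. If this is what happened, the file on disk may already be valid UTF-8 and the garbling appears only when it is opened with the wrong encoding.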

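You also need to encode your unicode text as UTF-8 before writing it to the output file. Use codecs.open to open the output file and specify the encoding. It will transparently handle the output encoding for you.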

g = codecs.open('fencing_medalists.csv', 'wb', encoding='utf-8')

And make the following changes to the code that writes the strings:

    theText = td.find(text=True)
    if theText is not None:
        g.write(unicode(theText))
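
Putting the output side together, a minimal sketch of the write path might look like the following. The rows are made-up stand-ins for the text pulled from each td (no scraping is done here), and the file goes to a temporary directory rather than the working directory. The code is written to run on Python 2.7 and Python 3, so it uses u'' literals in place of the unicode() call above.

```python
import codecs
import os
import tempfile

# HYPOTHETICAL sample data standing in for the scraped cell text;
# None mimics a cell where td.find(text=True) found nothing.
rows = [
    [u'Louis Perr\u00e9e', None],
    [u'\u00e9p\u00e9e', u'individual'],
]

path = os.path.join(tempfile.mkdtemp(), 'fencing_medalists.csv')

# codecs.open encodes every unicode string written to the file as UTF-8.
g = codecs.open(path, 'w', encoding='utf-8')
try:
    for cols in rows:
        for the_text in cols:
            if the_text is not None:
                g.write(the_text)
            g.write(u',')
        g.write(u'\n')
finally:
    g.close()

# Read the raw bytes back to confirm what ended up on disk.
with open(path, 'rb') as f:
    data = f.read()

first_line = data.split(b'\n')[0]
```

Here first_line holds the UTF-8 bytes of the first row, with the accented character stored as the two bytes \xc3\xa9. On Python 3 the built-in open also accepts an encoding argument, so open(path, 'w', encoding='utf-8') would be a reasonable alternative to codecs.open.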

Edit: BeautifulSoup probably does the unicode decoding automatically, so you can skip the codecs.decode on the response.
