python - Beautiful Soup, Python and the Swedish characters ÅÄÖ

Question

I'm using BeautifulSoup to scrape a Swedish web page. On the web page, the information I want to extract looks like this:

"Öhman Företagsobligationsfond"

When I print the information from the Python script it looks like this:

"Ã&ndash;hman FÃ¶retagsobligationsfond"

I'm new to Python. I have searched for answers and tried putting # -*- coding: utf-8 -*- at the beginning of the code, but it does not work.

I'm thinking of moving from Sweden to solve this issue.

score 3 · Accepted Answer

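Using # -*- coding: utf-8 -*- only specifies the encoding of the source code file itself. The page you are parsing probably declares the wrong encoding (or none at all), so Beautiful Soup decodes it incorrectly. Try specifying the encoding when you construct the soup. Here is a small example: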

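# -*- coding: utf-8 -*-
# Python 2 example. With the coding line above, markup is a UTF-8 byte string.
from bs4 import BeautifulSoup  # Beautiful Soup 3: from BeautifulSoup import BeautifulSoup

markup = '''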
<html>
    <head>
        <title>Övriga fakta</title>
        <meta charset="latin-1" />
    </head>
    <body>
        <h1>Öhman Företagsobligationsfond</h1>
        <p>Detta är en svensk sida.</p>
    </body>
</html>
'''

# No encoding given: the declared latin-1 charset is used, so the UTF-8 text is garbled.
soup = BeautifulSoup(markup)
print soup.find('h1')

try:
    # Beautiful Soup 4: override the declared encoding
    soup = BeautifulSoup(markup, from_encoding='utf-8')
except TypeError:
    # Beautiful Soup 3 uses a different keyword name
    soup = BeautifulSoup(markup, fromEncoding='utf-8')

print soup.find('h1')

The output is:

<h1>Ãhman FÃ¶retagsobligationsfond</h1>
<h1>Öhman Företagsobligationsfond</h1>
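The first line is what UTF-8 bytes look like when they are decoded with a single-byte encoding. A minimal sketch of that mechanism, using only the standard library and Python 3 syntax (unlike the Python 2 snippet above); the variable names are illustrative, not part of any library:

```python
# The text as it should appear on the page.
original = "Öhman Företagsobligationsfond"

# What the server actually sends: the text encoded as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes with the wrong single-byte codec yields mojibake.
# Each non-ASCII letter is two UTF-8 bytes, so it becomes two characters.
as_cp1252 = utf8_bytes.decode("cp1252")
as_latin1 = utf8_bytes.decode("latin-1")

print(as_cp1252)        # Ã–hman FÃ¶retagsobligationsfond
print(repr(as_latin1))  # the byte 0x96 becomes an invisible control character

# Reversing the wrong decode recovers the original text.
repaired = as_cp1252.encode("cp1252").decode("utf-8")
print(repaired)         # Öhman Företagsobligationsfond
```

With Windows-1252 the second byte of "Ö" shows up as a dash, as in the question; with Latin-1 it is an unprintable control character, which is why the garbled line above appears to start with just "Ã".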

The parameter is named from_encoding in Beautiful Soup 4 and fromEncoding in version 3.
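For Python 3 with Beautiful Soup 4, a minimal sketch of the same fix might look like the following. It assumes the bs4 package is installed and that the page arrives as bytes, as it would from a network read; from_encoding is ignored when the markup is already a str. The sample markup and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Simulated response body: UTF-8 bytes, but the page wrongly declares latin-1.
markup = """
<html>
  <head><meta charset="latin-1" /></head>
  <body><h1>Öhman Företagsobligationsfond</h1></body>
</html>
""".encode("utf-8")

# Without an override, the declared charset is used and the text is garbled.
wrong = BeautifulSoup(markup, "html.parser")

# Overriding the encoding tells Beautiful Soup how to decode the bytes.
right = BeautifulSoup(markup, "html.parser", from_encoding="utf-8")

print(repr(wrong.h1.get_text()))
print(right.h1.get_text())  # Öhman Företagsobligationsfond
```

The parser is named explicitly here because different parsers can handle encoding declarations differently.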
