I'm using BeautifulSoup to scrape a Swedish web page. On the web page, the information I want to extract looks like this:

"Öhman Företagsobligationsfond"

When I print the information from the Python script it looks like this:

"Öhman Företagsobligationsfond"

I'm new to Python. I have searched for answers and tried putting # -*- coding: utf-8 -*- at the beginning of the code, but it does not work.

I'm thinking of moving from Sweden to solve this issue.

1 Answer

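Using # -*- coding: utf-8 -*- only specifies the encoding of the source code file itself. The page you are parsing probably declares the wrong encoding (or none at all), so Beautiful Soup decodes it incorrectly. Try specifying the encoding when you construct the soup. Here is a small example: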

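# -*- coding: utf-8 -*-
# Python 2 example; the import falls back to Beautiful Soup 3 if version 4 is not installed.
try:
    from bs4 import BeautifulSoup
except ImportError:
    from BeautifulSoup import BeautifulSoup

markup = '''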
<html>
    <head>
        <title>Övriga fakta</title>
        <meta charset="latin-1" />
    </head>
    <body>
        <h1>Öhman Företagsobligationsfond</h1>
        <p>Detta är en svensk sida.</p>
    </body>
</html>
'''

soup = BeautifulSoup(markup)
print soup.find('h1')

try:
    # Version 4
    soup = BeautifulSoup(markup, from_encoding='utf-8')
except TypeError:
    # Version 3
    soup = BeautifulSoup(markup, fromEncoding='utf-8')

print soup.find('h1')

The output is:

<h1>Ãhman Företagsobligationsfond</h1>
<h1>Öhman Företagsobligationsfond</h1>

The parameter is named from_encoding in Beautiful Soup 4 and fromEncoding in version 3.
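
To see why a wrong encoding produces this kind of garbage, here is a minimal sketch that leaves Beautiful Soup out entirely and uses only the standard library. It assumes Python 3 and assumes the page bytes are really UTF-8 but get decoded as Latin-1, which matches the example above; the variable names are purely illustrative.

```python
# Python 3, standard library only.
text = "Öhman Företagsobligationsfond"

# Bytes as a server would send them for a UTF-8 encoded page.
raw = text.encode("utf-8")

# Decoding with the wrong codec turns each two-byte UTF-8 sequence
# into two unrelated Latin-1 characters.
garbled = raw.decode("latin-1")

# Decoding with the right codec restores the original text.
correct = raw.decode("utf-8")

# If only the garbled string is available, reversing the wrong step
# recovers the bytes, which can then be decoded properly.
repaired = garbled.encode("latin-1").decode("utf-8")

print(ascii(garbled))    # escaped form, safe to print on any terminal
print(correct == text)
print(repaired == text)
```

Passing from_encoding (or fromEncoding) tells Beautiful Soup which codec to use for that decoding step, instead of letting it trust the page's own declaration.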

answered 2012-11-11T10:01:34.590