python - Using Beautiful Soup to extract Norwegian text from HTML files, losing Norwegian characters

Question

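I have a Python script that uses Beautiful Soup to pull text out of the HTML files in a directory. However, I cannot get the encoding to work properly. At first I thought the problem might be in the HTML files themselves. But when I look at the source of an HTML file in Notepad.exe, I see this: Vi er her for deg, og du må gjerne ta kontakt med oss på 815 32 000 eller på Facebook om du har noen spørsmål.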

However, when I view the same HTML file in Internet Explorer, I see this: Vi er her for deg, og du mÃ¥ gjerne ta kontakt med oss pÃ¥ 815 32 000 eller pÃ¥ Facebook om du har noen spÃ¸rsmÃ¥l.

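The Internet Explorer text is also the same text that my Python script appends to my text file. So the encoding is clearly detectable, and it is no surprise that IE doesn't understand it, but I can't figure out why Python can't handle it. The encoding should be latin-1, which I don't think is the issue. Here is my code: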

import os
import glob
from bs4 import BeautifulSoup

path = "c:\\users\\me\\downloads\\"

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read())
    with open("example.txt", "a") as myfile:
        myfile.write(soup.get_text())
        myfile.close()

Since this seems to mangle the encoding, I thought I could pass latin-1 as the encoding, like this:

soup = BeautifulSoup(open(markup, "r").read())
soup = soup.prettify("latin-1")

But this gives me the error:

Traceback (most recent call last):
  File "bsoup.py", line 12, in <module>
    myfile.write(soup.get_text())
AttributeError: 'bytes' object has no attribute 'get_text'

score 2 · Accepted Answer

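.prettify('latin-1') already returns bytes, so you can write the result straight to the file, but you have to open that file in binary mode (note the 'ab' mode used below):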

soup = BeautifulSoup(open(markup, "r").read())
with open("example.txt", "ab") as myfile:
    myfile.write(soup.prettify('latin-1'))

There is no need to call myfile.close(); the with statement already takes care of that.
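A minimal sketch of both points, using only the standard library. The bytes in `data` are a hypothetical stand-in for what `soup.prettify('latin-1')` would return, and the temporary output path is likewise only an assumption for illustration:

```python
import os
import tempfile

# Hypothetical stand-in for the bytes returned by soup.prettify('latin-1').
data = "Vi er her for deg, og du må gjerne ta kontakt.\n".encode("latin-1")

# Temporary output file so the sketch does not touch real data.
out_path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Binary append mode accepts bytes; the with block closes the file on exit.
with open(out_path, "ab") as myfile:
    myfile.write(data)
file_closed = myfile.closed  # True once the with block has exited

# A text-mode file rejects bytes with a TypeError.
try:
    with open(out_path, "a", encoding="latin-1") as myfile:
        myfile.write(data)
    text_mode_error = None
except TypeError as exc:
    text_mode_error = exc

# Read the file back as raw bytes to see what was actually stored.
with open(out_path, "rb") as f:
    written = f.read()
```

Here `file_closed` ends up `True` without any explicit `close()` call, and the text-mode attempt fails before anything is written, so the file holds only the bytes from the binary-mode write.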

To save just the text, open the file in text mode ('a') and specify the encoding to use when saving:

soup = BeautifulSoup(open(markup, "r").read())
with open("example.txt", "a", encoding='latin-1') as myfile:
    myfile.write(soup.get_text())

Now Python will automatically encode the Unicode text to latin-1 for you.
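The same explicit-encoding idea can be applied on the reading side as well. The following is only a sketch under the assumption that the HTML files are stored as UTF-8 (which the garbled output in the question suggests); the file names are hypothetical and Beautiful Soup is left out so the example stays self-contained:

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "page.html")     # hypothetical input file
out_path = os.path.join(workdir, "example.txt")  # hypothetical output file

text = "du må gjerne ta kontakt om du har noen spørsmål"

# Simulate an HTML file that was saved as UTF-8.
with open(in_path, "wb") as f:
    f.write(text.encode("utf-8"))

# Decode explicitly as UTF-8 instead of relying on the platform default.
with open(in_path, "r", encoding="utf-8") as f:
    decoded = f.read()

# Encode explicitly as Latin-1 on the way out.
with open(out_path, "a", encoding="latin-1") as myfile:
    myfile.write(decoded)

# Inspect the raw bytes that ended up on disk.
with open(out_path, "rb") as f:
    raw = f.read()
```

With both encodings stated, `decoded` holds the correct Unicode string and `raw` contains one Latin-1 byte per Norwegian character (for example `0xE5` for å).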

When you see things like Ã¥ instead of å, you are interpreting UTF-8 bytes as Latin-1.
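This effect is easy to reproduce in a few lines. The sketch below encodes a word as UTF-8, decodes the bytes with the wrong codec, and then reverses the mistake; the variable names are illustrative only:

```python
original = "må"

# UTF-8 stores å as the two bytes 0xC3 0xA5.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as Latin-1 turns each byte into its own character.
garbled = utf8_bytes.decode("latin-1")

# Re-encoding as Latin-1 recovers the original bytes, which can then be
# decoded correctly as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")

print(utf8_bytes)  # b'm\xc3\xa5'
print(garbled)     # mÃ¥
print(repaired)    # må
```

The round trip works because Latin-1 maps every byte value to exactly one character, so no information is lost when the bytes are decoded with the wrong codec.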

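You may want to read up on Python and Unicode:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder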
