python - Unicode disappears in html.parser

Question

I'm extracting HTML from some web pages that contain Unicode characters, like so:

import urllib.request

def extract(url):
     """ Adapted from Python3_Google_Search.py """
     user_agent = ("Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
                   "AppleWebKit/525.13 (KHTML, like Gecko) "
                   "Chrome/0.2.149.29 Safari/525.13")
     request = urllib.request.Request(url)
     request.add_header("User-Agent", user_agent)
     response = urllib.request.urlopen(request)
     html = response.read().decode("utf8")
     return html

As you can see, I'm decoding the response properly, so html is now a unicode string. When I print html, I can see the Unicode characters.

I use html.parser to parse the HTML, and I subclass HTMLParser:

from html.parser import HTMLParser
class Parser(HTMLParser):
  def __init__(self):
    super().__init__()
    ## some init stuff
  #### rest of class

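When I parse the HTML with this class, the Unicode characters seem to be dropped before they reach handle_data; they simply disappear. The documentation doesn't say anything about encodings. Why does the HTML parser remove non-ascii characters, and how can I fix this?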

score 0 · Accepted Answer

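Apparently html.parser calls handle_entityref, rather than handle_data, whenever it encounters a non-ascii character written as a named character reference (for example &eacute;). It is passed the name of the reference, and to convert that into a unicode character I use: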

html.entities.html5[name]
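The lookup above can be folded into a parser subclass. What follows is only a minimal sketch, not the one true way to do it: TextCollector and extract_text are hypothetical names made up for illustration, and it swaps the direct html.entities.html5 lookup for html.unescape, because as far as I can tell many names appear in that dictionary only with a trailing semicolon while handle_entityref receives the name without one. It also assumes Python 3.4 or later, where the convert_charrefs argument exists.

```python
from html import unescape
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical example class: collects text without losing non-ASCII characters."""

    def __init__(self):
        # convert_charrefs=False asks the parser to report character references
        # through handle_entityref/handle_charref instead of decoding them itself.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        # Plain text, including literal non-ASCII characters, arrives here.
        self.chunks.append(data)

    def handle_entityref(self, name):
        # name comes without "&" and ";", e.g. "eacute" for &eacute;.
        self.chunks.append(unescape("&" + name + ";"))

    def handle_charref(self, name):
        # name is e.g. "233" for &#233; or "xe9" for &#xe9;.
        self.chunks.append(unescape("&#" + name + ";"))


def extract_text(markup):
    """Hypothetical helper: feed the markup and join everything collected."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)
```

Under these assumptions, extract_text("<p>caf&eacute; &hellip; 日本語</p>") should give back "café … 日本語". On Python 3.5 and later, leaving convert_charrefs at its default of True should make the two extra handlers unnecessary, since the parser then decodes the references and hands the resulting text to handle_data.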

Python's documentation doesn't mention this. I have never seen documentation worse than Python's.
