python - Decoding HTML entities with Python

Question

I am trying to decode HTML entities from NYTimes.com, and I cannot figure out what I am doing wrong.

Take, for example:

"U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"

I have tried BeautifulSoup, decode('iso-8859-1'), and django.utils.encoding's smart_str, all without success.

score 22 · Accepted Answer

>>> from HTMLParser import HTMLParser
>>> print HTMLParser().unescape('U.S. Adviser&#8217;s Blunt Memo on Iraq: '
...                             'Time &#8216;to Go Home&#8217;')
U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’

HTMLParser.unescape() is undocumented in Python 2. Python 3.4+ fixes this by exposing the function publicly as html.unescape().
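For readers on Python 3.4 or later, here is a minimal sketch of the same call using the standard-library html.unescape(). The variable names are only for illustration.

```python
import html

title = ("U.S. Adviser&#8217;s Blunt Memo on Iraq: "
         "Time &#8216;to Go Home&#8217;")

# html.unescape() converts numeric character references and named
# entities into the corresponding Unicode characters.
decoded = html.unescape(title)
print(decoded)
```

The result is an ordinary str, so no separate decoding step is needed before working with it.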

score 20 · Accepted Answer

Actually, what you have are not HTML entities. There are THREE varieties of those &.....; thingies. For example, &#160; &#xa0; &nbsp; all mean U+00A0 NO-BREAK SPACE.

&#160; (the type you have) is a "numeric character reference" (decimal).
&#xa0; is a "numeric character reference" (hexadecimal).
&nbsp; is an entity.

Further reading: http://htmlhelp.com/reference/html40/entities/

Here you will find code for Python 2.x that handles all three in one scan through the input: http://effbot.org/zone/re-sub.htm#unescape-html
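On Python 3.4+, the standard-library html.unescape() also handles all three varieties in one pass. This is not the effbot code linked above, just a short sketch to check that the decimal, hexadecimal, and named forms decode to the same character:

```python
import html

# Three spellings of U+00A0 NO-BREAK SPACE: decimal reference,
# hexadecimal reference, and named entity.
forms = ["&#160;", "&#xa0;", "&nbsp;"]

decoded = [html.unescape(form) for form in forms]

for form, char in zip(forms, decoded):
    # repr() makes the otherwise invisible character visible.
    print(form, "->", repr(char))
```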

score 18 · Accepted Answer

This does work:

from BeautifulSoup import BeautifulStoneSoup
s = "U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"
decoded = BeautifulStoneSoup(s, convertEntities=BeautifulStoneSoup.HTML_ENTITIES)

If you want a byte string instead of a Unicode object, you need to encode it with an encoding that supports the characters used; ISO-8859-1 does not:

result = decoded.encode("UTF-8")
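To illustrate why ISO-8859-1 fails here, this sketch tries both encodings on the decoded headline. It is written for Python 3, where str.encode() is the equivalent call, and the variable names are only for illustration.

```python
# The decoded headline contains U+2019 and U+2018 (curly quotes).
decoded = "U.S. Adviser\u2019s Blunt Memo on Iraq: Time \u2018to Go Home\u2019"

# UTF-8 can represent every Unicode character, so this succeeds.
utf8_bytes = decoded.encode("utf-8")

# ISO-8859-1 only covers U+0000 to U+00FF, so the curly quotes
# cannot be encoded and a UnicodeEncodeError is raised.
try:
    decoded.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(latin1_ok)
```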

It's unfortunate that you need an external module for something like this; simple HTML/XML entity decoding should be in the standard library, and not require me to use a library with meaningless class names like "BeautifulStoneSoup". (Class and function names should not be "creative", they should be meaningful.)

score 5 · Accepted Answer

Try this:

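import re

def _callback(matches):
    # Convert the captured decimal code point to a Unicode character.
    try:
        return unichr(int(matches.group(1)))
    except (ValueError, OverflowError):
        # Out-of-range code point: leave the original reference untouched.
        return matches.group(0)

def decode_unicode_references(data):
    # Match "&#NNNN" followed by ";" or, for sloppy markup, by whitespace.
    return re.sub(r"&#(\d+)(;|(?=\s))", _callback, data)

data = "U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"
print decode_unicode_references(data)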
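The snippet above targets Python 2 (unichr, print statement) and only covers decimal references. A possible Python 3 sketch of the same regex idea, extended to hexadecimal references, might look like this. The name decode_numeric_references is hypothetical, and unlike the original it requires the closing semicolon.

```python
import re

# Matches decimal (&#8217;) and hexadecimal (&#x2019;) references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|(\d+));")


def decode_numeric_references(text):
    """Replace numeric character references with their characters."""

    def _replace(match):
        hex_digits, dec_digits = match.groups()
        try:
            if hex_digits:
                code_point = int(hex_digits, 16)
            else:
                code_point = int(dec_digits)
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the reference as written.
            return match.group(0)

    return _NUMERIC_REF.sub(_replace, text)


data = "U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"
print(decode_numeric_references(data))
```

Named entities such as &nbsp; are deliberately left alone by this sketch; html.unescape() is the simpler choice when those need decoding too.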
