python - How do I unescape HTML entities in a string in Python 3.1?

Question

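I have looked all around and only found solutions for Python 2.6 and earlier, nothing on how to do this in Python 3.x. (I only have access to a Win7 box.)

I must be able to do this in 3.1, preferably without external libraries. Currently I have httplib2 installed and access to command-prompt curl (that is how I get the source code for pages). Unfortunately, curl does not decode HTML entities, and as far as I can tell there is no command in its documentation to decode them.

Yes, I have tried to get Beautiful Soup to work, many times, without success in 3.x. If you can provide explicit instructions on how to get it working in Python 3 in an MS Windows environment, I would be very grateful.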

So, to be clear, I need to turn a string like this: "Suzy &amp; John" into a string like this: "Suzy & John".

score 216 · Accepted Answer

You can use the function html.unescape:

In Python 3.4+ (thanks to J.F. Sebastian for the update):

import html
html.unescape('Suzy &amp; John')
# 'Suzy & John'

html.unescape('&quot;')
# '"'

In Python 3.3 or earlier:

import html.parser
html.parser.HTMLParser().unescape('Suzy &amp; John')

In Python 2:

import HTMLParser
HTMLParser.HTMLParser().unescape('Suzy &amp; John')
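The three variants above can be folded into one import that picks whichever is available. This is only a sketch: the fallback chain is my own arrangement rather than part of the answer, and on Python 3.9+ only the first branch can succeed, because HTMLParser.unescape was removed there.

```python
# Pick the best available unescape function for the running interpreter.
try:
    from html import unescape  # Python 3.4+
except ImportError:
    try:
        from html.parser import HTMLParser  # Python 3.0 to 3.3
    except ImportError:
        from HTMLParser import HTMLParser  # Python 2
    unescape = HTMLParser().unescape

print(unescape('Suzy &amp; John'))
print(unescape('&quot;quoted&quot;'))
```

With html.unescape, numeric references such as &#38; and &#x26; are decoded as well as named ones.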

score 15 · Accepted Answer

You can use xml.sax.saxutils.unescape for this purpose. The module is included in the Python standard library and is portable between Python 2.x and Python 3.x.

>>> import xml.sax.saxutils as saxutils
>>> saxutils.unescape("Suzy &amp; John")
'Suzy & John'
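Note that saxutils.unescape only replaces &amp;amp;, &amp;lt; and &amp;gt; by default; anything else has to be supplied through its optional entities dictionary. A small sketch of the difference, where the sample text and the extra_entities mapping are made up for illustration:

```python
from xml.sax.saxutils import unescape

text = "&quot;Suzy &amp; John&quot; &lt;3"

# Default call: only &amp;, &lt; and &gt; are replaced, so &quot; survives.
default_result = unescape(text)

# Extra replacements are passed as a dict mapping entity text to its replacement.
extra_entities = {"&quot;": '"', "&apos;": "'"}
extended_result = unescape(text, extra_entities)

print(default_result)
print(extended_result)
```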

score 8 · Accepted Answer

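Apparently I don't have a high enough reputation to do anything but post this. unutbu's answer does not unescape quotation marks. The only thing I found that did was this function: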

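import re
# Python 3 module name; in Python 2 this was htmlentitydefs
from html.entities import name2codepoint as n2cp

def decodeHtmlentities(string):
    def substitute_entity(match):
        ent = match.group(2)
        if match.group(1) == "#":
            # Decimal character reference such as &#38; (chr replaces Python 2's unichr)
            try:
                return chr(int(ent))
            except ValueError:
                return match.group()  # not decimal (e.g. &#x26;): leave unchanged
        # Named entity such as &quot;; unknown names are left unchanged
        cp = n2cp.get(ent)
        if cp:
            return chr(cp)
        return match.group()
    entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")
    return entity_re.subn(substitute_entity, string)[0]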

I got it from this page.

score 3 · Accepted Answer

Python 3.x also has html.entities.

Answered 2010-03-02T03:01:41.623
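For reference, html.entities exposes lookup tables rather than a ready-made unescape function. A minimal sketch of how its name2codepoint and codepoint2name dictionaries can be used (the entity names here are picked only as examples):

```python
from html.entities import name2codepoint, codepoint2name

# Map an entity name (without the leading "&" and trailing ";") to its character.
amp = chr(name2codepoint["amp"])
eacute = chr(name2codepoint["eacute"])
print(amp, eacute)

# The reverse table maps a code point back to the entity name.
print(codepoint2name[ord("&")])
```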

score 2 · Accepted Answer

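In my case, I had an HTML string that had been escaped with the AS3 escape function. After an hour of Googling I had not found anything useful, so I wrote this recursive function to serve my needs. Here it is: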

def unescape(string):
    # Python 3 port: str has no .decode(), so hex digits are converted with int()/chr()
    index = string.find("%")
    if index == -1:
        return string
    if string[index+1:index+2] == 'u':
        # %uXXXX: escaped Unicode character, four hex digits
        end = index + 6
        replace_with = chr(int(string[index+2:end], 16))
    else:
        # %XX: two hex digits
        end = index + 3
        replace_with = chr(int(string[index+1:end], 16))
    # Recurse only on the remainder so that a decoded "%" is not decoded again
    return string[:index] + replace_with + unescape(string[end:])

Edit-1: Added functionality to handle Unicode characters.
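For long inputs, a non-recursive variant avoids hitting Python's recursion limit. The following is a sketch of that idea using a regular expression, not the answer author's code; the name unescape_as3 is hypothetical, and it assumes the input uses only the %XX and %uXXXX forms described above.

```python
import re

# %uXXXX (four hex digits) or %XX (two hex digits)
_ESCAPE_RE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")

def unescape_as3(text):
    """Decode %XX and %uXXXX sequences in one pass (hypothetical helper)."""
    def replace(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _ESCAPE_RE.sub(replace, text)

print(unescape_as3("Suzy%20%26%20John"))
```

Because the substitution is a single pass, a decoded "%" is never decoded again, and malformed sequences are left as they are.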

score 1 · Accepted Answer

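I'm not sure whether this is a built-in library, but it looks like what you need, and it supports 3.1.

From: http://docs.python.org/3.1/library/xml.sax.utils.html?highlight=html%20unescape

xml.sax.saxutils.unescape(data, entities={}): Unescape '&amp;', '&lt;', and '&gt;' in a string of data.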
