(Edit: I am using Python 2.7.) (Edit 2: I have already checked "Convert XML/HTML Entities into Unicode String in Python", and the solutions there do not work. Please do not mark this as already answered.)

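I have not been able to find a Python package that reliably converts text containing HTML entities. HTMLParser works for some inputs but has plenty of problems, and BeautifulSoup never seems to convert to unicode. How can I use a single method to get the Unicode representation of strings a through d?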

I think the problem is that some of my text contains both unicode characters and HTML entities (like example string d).

# -*- coding: utf-8 -*-
import HTMLParser
from bs4 import BeautifulSoup

# Byte strings containing HTML entities; dstring also has a non-ASCII character
astring = "P&amp;O."
bstring = "&amp; "
cstring = "&gt;"
dstring = "&gt; 150ÎC"

pars = HTMLParser.HTMLParser()
a1 = BeautifulSoup(astring)
a2 = pars.unescape(astring)
print "a1:", a1
print "a2:", a2
b1 = BeautifulSoup(bstring)
b2 = pars.unescape(bstring)
print "b1:", b1
print "b2:", b2
c1 = BeautifulSoup(cstring)
c2 = pars.unescape(cstring)
print "c1:", c1
print "c2:", c2
d1 = BeautifulSoup(dstring)
try:
    d2 = pars.unescape(dstring)
except Exception:
    d2 = "HTML Parse Broke!"
print "d1:", d1
print "d2:", d2

This gives the following output:

a1: P&O.
a2: P&O.
b1: & 
b2: & 
c1: >
c2: >
d1: > 150ÎC
d2: HTML Parse Broke!

Edit 3: kalhartt's suggestion led me to the solution. To keep strings with mixed character encodings from breaking, I used .decode('utf-8').
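A minimal sketch of the decode-first approach from Edit 3, assuming the input bytes are UTF-8. The helper name `to_unicode` is hypothetical, and the import fallback is only there so the snippet also runs on Python 3, where `html.unescape` takes the place of `HTMLParser.unescape`:

```python
# -*- coding: utf-8 -*-
try:
    from html import unescape              # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser      # Python 2.7, as in the question
    unescape = HTMLParser().unescape


def to_unicode(raw, encoding="utf-8"):
    """Hypothetical helper: decode bytes first, then resolve HTML entities."""
    if isinstance(raw, bytes):             # on Python 2, a plain str is bytes
        raw = raw.decode(encoding)         # assumption: the input is UTF-8
    return unescape(raw)


# Byte string holding an entity plus the UTF-8 bytes for "Î"
mixed = b"&gt; 150\xc3\x8eC"
result = to_unicode(mixed)
```

Decoding before unescaping means the entity replacement only ever sees unicode text, so it never has to combine raw non-ASCII bytes with unicode characters.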

1 Answer

If you want to work with Unicode, use unicode strings. With that change, everything in your example works as expected.

# -*- coding: utf-8 -*-
import HTMLParser
from bs4 import BeautifulSoup

# Same inputs as in the question, but as unicode literals
astring = u"P&amp;O."
bstring = u"&amp; "
cstring = u"&gt;"
dstring = u"&gt; 150ÎC"

pars = HTMLParser.HTMLParser()
a1 = BeautifulSoup('<span>%s</span>' % astring)
a2 = pars.unescape(astring)
print "a1:", a1
print "a2:", a2
b1 = BeautifulSoup('<span>%s</span>' % bstring)
b2 = pars.unescape(bstring)
print "b1:", b1
print "b2:", b2
c1 = BeautifulSoup('<span>%s</span>' % cstring)
c2 = pars.unescape(cstring)
print "c1:", c1
print "c2:", c2
d1 = BeautifulSoup('<span>%s</span>' % dstring)
try:
    d2 = pars.unescape(dstring)
except Exception:
    d2 = "HTML Parse Broke!"
print "d1:", d1
print "d2:", d2

This gives the following output:

a1: <span>P&amp;O.</span>
a2: P&O.
b1: <span>&amp; </span>
b2: & 
c1: <span>&gt;</span>
c2: >
d1: <span>&gt; 150ÎC</span>
d2: > 150ÎC

BeautifulSoup encodes them; HTMLParser decodes them.
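To see the two directions side by side without BeautifulSoup, here is a small sketch that uses the standard library's escape function as a stand-in for the encoding applied when markup is printed. It illustrates the round trip only and is not a description of BeautifulSoup's internals; the import fallbacks are an assumption to keep it runnable on both Python 2.7 and Python 3:

```python
try:
    from html import escape, unescape      # Python 3.4+
except ImportError:
    from cgi import escape                 # Python 2.7
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

text = u"> 150\xceC"                        # u"\xce" is the character "Î"

encoded = escape(text)                      # text -> markup: ">" becomes "&gt;"
decoded = unescape(encoded)                 # markup -> text: "&gt;" becomes ">"
```

If only the text is needed from a BeautifulSoup object, its `get_text()` method returns a unicode string rather than serialized markup.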

Answered 2013-10-08T03:24:16.150