python - How do I get rid of the ☎ unicode character?

Question

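While web scraping, and after getting rid of all the html tags, I ended up with the black telephone character \u260e (☎) in my unicode output. Unlike in this answer, I want to get rid of it as well.

I used the following regular expression in Scrapy to eliminate the html tags: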

pattern = re.compile("<.*?>|&nbsp;|&amp;",re.DOTALL|re.M)

Then I tried to match \u260e, and I think I got caught by the backslash plague. I tried these patterns without success:

pattern = re.compile("<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\\\u260e",re.DOTALL|re.M)

None of these worked and I still have \u260e in the output. How can I make it disappear?

score 7 · Accepted Answer

Using Python 2.7.3, the following works fine for me:

import re

pattern = re.compile(u"<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
s = u"bla ble \u260e blo"
re.sub(pattern, "", s)

Output:

u'bla ble  blo'

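As pointed out by @Zack, this works because the pattern is now a unicode string: inside a u"..." literal the escape sequence \u260e is converted when the literal is read, so it is no longer six separate characters but the single character ☎ itself (however many bytes are used to store that little black phone).

Once both the string being searched and the regular expression contain the black telephone itself, rather than the character sequence \u260e, they match.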
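As a rough illustration of the point above, the sketch below checks that the escape really is one character and then applies the same pattern to a string that also contains tags. It assumes Python 3.3 or later, where the u prefix is accepted but optional; the helper name strip_markup and the sample input are made up for this example.

```python
import re

# In a unicode literal the escape \u260e is one character (code point U+260E).
phone = u"\u260e"
print(len(phone))                  # 1
print(len(phone.encode("utf-8")))  # 3 bytes once encoded as UTF-8

# Same alternation as in the answer; the pattern itself is a unicode string.
pattern = re.compile(u"<.*?>|&nbsp;|&amp;|\u260e", re.DOTALL | re.M)

def strip_markup(text):
    """Hypothetical helper: drop tags, the two entities and the black telephone."""
    return pattern.sub(u"", text)

cleaned = strip_markup(u"bla <b>ble</b> \u260e&nbsp;blo")
print(cleaned)  # bla ble blo
```

Calling pattern.sub directly is equivalent to re.sub(pattern, ..., s) and just avoids passing the compiled pattern around.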

score 4 · Accepted Answer

If your string is already unicode, there are two easy ways. Obviously, the second one affects more than just ☎.

>>> import string
>>> foo = u"Lorum ☎ Ipsum"
>>> foo.replace(u'☎', '')
u'Lorum  Ipsum'
>>> "".join(s for s in foo if s in string.printable)
u'Lorum  Ipsum'

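See "Remove non-ascii characters but leave periods and spaces" for more on string.printable.
If you don't want the multiple spaces, see "Shortest way to remove multiple spaces in a string in Python".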
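Putting both approaches together with the whitespace cleanup, a minimal sketch could look like the following. It assumes Python 3.3 or later (it should also run on Python 2.7); the variable names are illustrative only, and the split/join idiom is just one common way to collapse repeated spaces.

```python
import string

foo = u"Lorum \u260e Ipsum"

# 1) Remove only the black telephone; everything else is left untouched.
only_phone_removed = foo.replace(u"\u260e", u"")

# 2) Keep only characters listed in string.printable (ASCII letters, digits,
#    punctuation and whitespace). This drops every non-ASCII character.
allowed = set(string.printable)
printable_only = u"".join(ch for ch in foo if ch in allowed)

# Either way a double space is left behind; split() with no arguments splits
# on any run of whitespace, so re-joining with one space collapses it.
collapsed = u" ".join(printable_only.split())

print(repr(only_phone_removed))
print(repr(printable_only))
print(repr(collapsed))
```

Note that the second approach would also strip accented letters and any other non-ASCII text, so it only fits when the output is expected to be plain ASCII.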

score 1 · Accepted Answer

You could try using BeautifulSoup, as described here, with something like

soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
