python - Removing escaped entities from a string in Python

Question

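I have a huge csv file of tweets. I read them all in and store them in two separate dictionaries, one for negative tweets and one for positive tweets. I want to read the file and parse it into the dictionaries while removing any punctuation. I used this code: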

tweets = []
for (text, sentiment) in pos_tweets.items() + neg_tweets.items():
    shortenedText = [e.lower() and e.translate(string.maketrans("",""), string.punctuation) for e in text.split() if len(e) >= 3 and not e.startswith('http')]
print shortenedText

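Everything works fine except for one small problem. Unfortunately, the huge csv file I downloaded has altered some of the punctuation. I am not sure what this is called, so I cannot really google it, but in effect some sentences begin like this: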

"ampampFightin"
"&quot;The truth is out there"
"&altThis is the way I feel"

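Is there a way to get rid of all of these? I notice that the latter two begin with an ampersand. Would a simple search for that get rid of them? (The only reason I am asking rather than just trying is that there are too many tweets for me to check by hand.)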

score 4 · Accepted Answer

First unescape the HTML entities, then remove the punctuation:

import string
import HTMLParser

tweets = []
for (text, sentiment) in pos_tweets.items() + neg_tweets.items():
    # Turn entities such as &quot; back into the characters they stand for
    text = HTMLParser.HTMLParser().unescape(text)
    # Lowercase each word and strip punctuation; skip short words and links
    shortenedText = [e.lower().translate(string.maketrans("",""), string.punctuation) for e in text.split() if len(e) >= 3 and not e.startswith('http')]
    print shortenedText
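The code above targets Python 2. As a rough sketch of the same two steps on Python 3, where the standard library exposes html.unescape and str.maketrans, something like the following could work. The function name clean_tweet and the sample tweet are made up for illustration and are not part of the original code.

```python
import html
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Hypothetical helper: unescape entities, then lowercase and strip punctuation."""
    # Step 1: turn entities such as &quot; back into real characters.
    text = html.unescape(text)
    # Step 2: keep words of 3+ characters that are not links.
    return [
        word.lower().translate(_STRIP_PUNCT)
        for word in text.split()
        if len(word) >= 3 and not word.startswith("http")
    ]


sample = "&quot;The truth is out there, see http://example.com"
print(clean_tweet(sample))
```

Unescaping has to come first: once the punctuation is stripped, the & and ; that mark an entity are gone and the entity can no longer be recognized.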

Here is an example of how unescape works:

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape("&quot;The truth is out there")
u'"The truth is out there'
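For comparison, here is a small sketch of the same call on Python 3, assuming html.unescape from the standard library is used in place of HTMLParser.HTMLParser().unescape. It also shows a limit of the approach: a string such as "ampampFightin" from the question contains no & at all, so there is nothing left for unescape to recognize.

```python
import html

# A well-formed entity is converted back to the character it stands for.
quoted = html.unescape("&quot;The truth is out there")
print(quoted)

# Several entities in one string are all handled in a single call.
mixed = html.unescape("Tom &amp; Jerry &lt;3")
print(mixed)

# Text whose "&" and ";" were already stripped is left untouched.
damaged = html.unescape("ampampFightin")
print(damaged)
```

Strings damaged like the last one would need a separate, dataset-specific cleanup, which is outside what unescape does.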

UPD: to fix the UnicodeDecodeError, use text.decode('utf8'). Here is a good explanation of why you need to do this.
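The idea behind the decode call is to turn raw bytes into text before unescaping. A minimal sketch of that ordering on Python 3, assuming the tweets arrive as UTF-8 encoded bytes (the sample value is invented for illustration):

```python
import html

# Invented sample: a tweet as raw UTF-8 bytes, as it might come from a file
# opened in binary mode.
raw = "&quot;caf\u00e9&quot; is open".encode("utf-8")

# Decode to text first, then unescape the entities.
text = raw.decode("utf-8")
unescaped = html.unescape(text)
print(unescaped)
```

On Python 3, reading the csv in text mode with an explicit encoding would make the separate decode step unnecessary.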
