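I have a simple RSS feed script that fetches the content of each article, runs it through some simple processing, and then saves it to a database.
The problem is that after the text goes through the code below, all curly ("accented") apostrophes and quotation marks are stripped from it.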
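import unicodedata
from BeautifulSoup import BeautifulSoup  # assumes BeautifulSoup 3; with bs4 it would be `from bs4 import BeautifulSoup`

# this is just an example string, I use feed_parser to download the feeds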
string = """  <p>This is a sentence. This is a sentence. I'm a programmer. I’m a programmer, however I don’t graphic design.</p>"""
text = BeautifulSoup(string)
# does some simple soup processing
string = text.renderContents()
string = string.decode('utf-8', 'ignore')
string = string.replace('<html>','')
string = string.replace('</html>','')
string = string.replace('<body>','')
string = string.replace('</body>','')
# NFKD-normalize, encode to UTF-8 bytes, then print only the bytes below 128
string = unicodedata.normalize('NFKD', string).encode('utf-8', 'ignore')
print "".join([x for x in string if ord(x)<128])
The result is:
> <p> </p><p>This is a sentence. This is a sentence. I'm a programmer. Im a programmer, however I dont graphic design.</p>
All of the curly (HTML-entity style) quotes and apostrophes are stripped. How can I fix this?
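For what it's worth, here is a minimal sketch of one possible workaround, assuming the goal is plain ASCII output. The curly apostrophe (U+2019) has no NFKD decomposition, so normalization leaves it unchanged and the ASCII filter then discards it. Mapping the typographic quotes to their ASCII equivalents before filtering keeps them. The names `PUNCTUATION_MAP` and `to_ascii` are hypothetical and not part of the original script.

```python
import unicodedata

# Hypothetical mapping from typographic punctuation to ASCII look-alikes.
# Extend it with any other characters the feeds contain.
PUNCTUATION_MAP = {
    0x2018: u"'",   # left single quotation mark
    0x2019: u"'",   # right single quotation mark (the curly apostrophe)
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
}

def to_ascii(text):
    """Return an ASCII-only copy of a Unicode string, keeping quotes."""
    # Replace curly quotes first; NFKD alone does not decompose them.
    text = text.translate(PUNCTUATION_MAP)
    # Split accented letters into base letter + combining mark.
    text = unicodedata.normalize('NFKD', text)
    # Drop whatever is still non-ASCII (e.g. the combining marks).
    return text.encode('ascii', 'ignore').decode('ascii')

sample = u"I\u2019m a programmer, however I don\u2019t graphic design."
print(to_ascii(sample))
# prints: I'm a programmer, however I don't graphic design.
```

If the database can store UTF-8, another option is to skip the ASCII filter entirely and save the Unicode text as-is, which preserves the original punctuation.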