python - Remove all possibly unwanted characters from a Python string at once

Question

I am using the Python module newspaper3k to extract an article summary from the article's web URL, like this:

from newspaper import Article
article = Article('https://www.abcd....vnn.com/dhdhd')
article.download()
article.parse()
article.nlp()
text = article.summary
print (text)

This gives:

Often hailed as Hollywood\xe2\x80\x99s long standing, commercially successful filmmaker, Spielberg\xe2\x80\x99s lifetime gross, if you include his productions, reaches a mammoth\xc2\xa0$17.2 billion\xc2\xa0\xc2\xad\xe2\x80\x93 unadjusted for inflation.
\r\rThe original\xc2\xa0Jurassic Park\xc2\xa0($983.8 million worldwide), which released in 1993, remains Spielberg\xe2\x80\x99s highest grossing film.
Ready Player One,\xc2\xa0currently advancing at a running total of $476.1 million, has become Spielberg\xe2\x80\x99s seventh highest grossing film of his career.It will eventually supplant Aamir\xe2\x80\x99s 2017 blockbuster\xc2\xa0Dangal\xc2\xa0(1.29 billion yuan) if it achieves the Maoyan\xe2\x80\x99s lifetime forecast of 1.31 billion yuan ($208 million) in the PRC.

I just want to remove all the unwanted characters, such as \xe2\x80\x99s, and I would like to avoid chaining multiple replace calls. What I want is simply:

Often hailed as Hollywood long standing, commercially successful filmmaker, 
Spielberg lifetime gross, if you include his productions, reaches a 
mammoth $17.2 billion unadjusted for inflation.
The original Jurassic Park ($983.8 million worldwide), 
which released in 1993, remains Spielberg highest grossing film.
Ready Player One,currently advancing at a running total of $476.1 million, 
has become Spielberg seventh highest grossing film of his career.
It will eventually supplant Aamir 2017 blockbuster Dangal (1.29 billion yuan) 
if it achieves the Maoyan lifetime forecast of 1.31 billion yuan ($208 million) in the PRC

score 0 · Accepted Answer

Try using a regular expression:

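import re
# Match the literal sequence \xe2\x80\x99 followed by "s", not a character class.
clear_str = re.sub(r'\xe2\x80\x99s', '', your_input)

re.sub replaces every occurrence of the pattern in your_input with the second argument. A character-class pattern such as [abc] matches any single one of a, b, or c, so writing the pattern as [\xe2\x80\x99s] would also delete every plain "s" in the text; the literal sequence above removes only the unwanted run.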
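To see how a character class differs from a literal sequence here, the following is a small sketch. The sample string is an assumption made up for illustration; it is not taken from the article in the question.

```python
import re

# Hypothetical sample: a mis-decoded right single quote followed by "s".
sample = "Spielberg\xe2\x80\x99s lifetime gross"

# Character class: matches any ONE of \xe2, \x80, \x99 or "s", anywhere in the text.
as_class = re.sub(r'[\xe2\x80\x99s]', '', sample)

# Literal sequence: matches only \xe2 \x80 \x99 s appearing together, in that order.
as_sequence = re.sub(r'\xe2\x80\x99s', '', sample)

print(as_class)     # the "ss" in "gross" is stripped as well
print(as_sequence)  # only the unwanted run is removed
```

The character-class form is only safe when none of the listed characters can occur legitimately in the text.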

score 0 · Accepted Answer

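The article was decoded incorrectly. The website probably declares the wrong encoding, but that is hard to prove because the question does not include a valid URL for reproducing the output.

The escape codes indicate that UTF-8 is the correct encoding, so encode the text directly back to bytes (latin1 is a 1:1 mapping from the first 256 Unicode code points to bytes) and then decode it as UTF-8: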

text = text.encode('latin1').decode('utf8')

Result:

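Often hailed as Hollywood’s long standing, commercially successful filmmaker, Spielberg’s lifetime gross, if you include his productions, reaches a mammoth $17.2 billion – unadjusted for inflation.

The original Jurassic Park ($983.8 million worldwide), which released in 1993, remains Spielberg’s highest grossing film.
Ready Player One, currently advancing at a running total of $476.1 million, has become Spielberg’s seventh highest grossing film of his career.It will eventually supplant Aamir’s 2017 blockbuster Dangal (1.29 billion yuan) if it achieves the Maoyan’s lifetime forecast of 1.31 billion yuan ($208 million) in the PRC.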
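The round trip can be checked without the original URL by simulating the fault. This is a minimal sketch: the helper name `fix_mojibake` and the sample sentence are assumptions for illustration, and the fallback of returning the text unchanged is one possible design choice, not part of the answer above.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as Latin-1 (hypothetical helper)."""
    try:
        return text.encode('latin1').decode('utf8')
    except UnicodeError:
        # Either not Latin-1 encodable or not valid UTF-8 afterwards: leave it alone.
        return text

# Simulate the fault: UTF-8 bytes interpreted as Latin-1.
original = "Spielberg\u2019s gross reaches a mammoth\u00a0$17.2 billion"
garbled = original.encode('utf8').decode('latin1')

print(repr(garbled))                       # shows the \xe2\x80\x99 and \xc2\xa0 runs
print(fix_mojibake(garbled) == original)   # the round trip restores the text
```

Text that was decoded correctly in the first place usually fails one of the two steps, so the helper returns it untouched.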

score 0 · Accepted Answer

You can use Python's encode/decode to get rid of every non-Latin character:

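# Assumes text is a UTF-8 byte string (Python 2 str / Python 3 bytes); a Python 3 str has no .decode().
data = text.decode('utf-8')
# Drop every character that has no Latin-1 encoding; the result is a byte string again.
text = data.encode('latin-1', 'ignore')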
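For Python 3, the same idea might look like the sketch below. It assumes the raw UTF-8 bytes are available; the function name `drop_non_latin1` and the sample bytes are hypothetical.

```python
def drop_non_latin1(raw: bytes) -> str:
    """Decode UTF-8 bytes and discard characters outside Latin-1 (hypothetical helper)."""
    text = raw.decode('utf-8')
    # 'ignore' silently skips anything that Latin-1 cannot represent.
    return text.encode('latin-1', 'ignore').decode('latin-1')

# Hypothetical sample: a curly apostrophe (U+2019) and a no-break space (U+00A0).
raw = "Spielberg\u2019s mammoth\u00a0$17.2 billion".encode('utf-8')
cleaned = drop_non_latin1(raw)
print(repr(cleaned))
```

The curly apostrophe is dropped because it lies outside Latin-1, but the no-break space survives as \xa0 because it is a Latin-1 character, so this approach does not remove everything the question lists.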

score 0 · Accepted Answer

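First, use .encode('ascii', errors='ignore') to drop all non-ASCII characters.

If you need this text for some kind of sentiment analysis, you may also want to remove special characters such as \n, \r, and so on. This can be done by first escaping the escape characters and then replacing them with a regular expression.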

from newspaper import Article
import re
article = Article('https://www.abcd....vnn.com/dhdhd')
article.download()
article.parse()
article.nlp()
text = article.summary
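text = text.encode('ascii', errors='ignore')  # bytes with every non-ASCII character dropped
text = str(text)[2:-1]  # the bytes repr turns `\n` into `\\n`; the slice strips the b'...' wrapper (Python 3)
text = re.sub(r'\\.', '', text)  # removes every backslash together with the character after it
print(text)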
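A variant that avoids the bytes repr is sketched below: it decodes the ASCII bytes back to a string and collapses whitespace runs with a regular expression. The function name `clean_summary` and the sample string are assumptions for illustration.

```python
import re

def clean_summary(text: str) -> str:
    """Drop non-ASCII characters and collapse whitespace (hypothetical helper)."""
    ascii_only = text.encode('ascii', errors='ignore').decode('ascii')
    # \s+ covers \r, \n, tabs and spaces, so "\r\r" becomes a single space.
    return re.sub(r'\s+', ' ', ascii_only).strip()

# Hypothetical sample modelled on the garbled summary in the question.
sample = "mammoth\xc2\xa0$17.2 billion\r\rThe original\xc2\xa0Jurassic Park"
print(clean_summary(sample))
```

Because the mis-decoded no-break spaces are deleted rather than replaced, neighbouring words can fuse (here "mammoth$17.2" and "originalJurassic"). Repairing the encoding before cleaning avoids that.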
