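I have a small regex problem with text extracted by Goose.
I have used Goose to extract clean text from an HTML page. The output Goose gives is good, but there is one small problem: I get strings like the one below.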
My name is Sam\'s, I like to play \'football\'
The actual text looks like
My name is Sam's, I like to play 'football'
I am trying to get rid of the backslashes. When I run the code below on the text extracted by Goose, it has no effect, yet the same code works perfectly if I type the text in myself.
This is what I tried:
re.sub(r"\\","",text) or
text.replace("\\","")
text.decode()
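For comparison, here is a minimal sketch of the case that does work, using the sample sentence from above typed in as a literal. The variable names are only illustrative. Note that both `re.sub` and `str.replace` return a new string, so the result has to be assigned to something.

```python
import re

# Sample sentence from the question, written so that it contains real backslash characters.
sample = "My name is Sam\\'s, I like to play \\'football\\'"

# Both calls return a new string; `sample` itself is left unchanged.
via_regex = re.sub(r"\\", "", sample)
via_replace = sample.replace("\\", "")

print(via_regex)
print(via_replace)
```

On a hand-typed string both variants strip the backslashes, which matches what I see when I enter the text myself.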
Please find the full code below:
import re
from goose import Goose
url = 'http://economictimes.indiatimes.com/news/politics-and- nation/swach-bharat-drives-draws-inspiration-from-mahatma- gandhi/articleshow/49203355.cms'
g = Goose()
article = g.extract(url=url)
text=article.cleaned_text
print text
.....International School here on Friday, Gandhi\'s 146th birth anniversary.Gurjit Singh said that apart from Gandhi\'s birth anniversary,....
text=re.sub(r"\\","",text)
print text
.....International School here on Friday, Gandhi\'s 146th birth anniversary.Gurjit Singh said that apart from Gandhi\'s birth anniversary,....
How can I get rid of the backslashes?
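In case it helps narrow this down, below is a small diagnostic sketch that reports which character actually sits in front of each apostrophe. The helper name `chars_before_quotes` is hypothetical, and it assumes the extracted text is a Unicode string. It does not fix anything; it only shows whether the character is a plain ASCII backslash (U+005C) or something that merely looks like one.

```python
import unicodedata

def chars_before_quotes(text):
    """Hypothetical helper: list (char, code point, Unicode name) for each
    non-alphanumeric, non-space character that directly precedes an apostrophe."""
    quotes = u"'\u2018\u2019"  # straight and curly apostrophes
    found = []
    for i, ch in enumerate(text):
        if ch in quotes and i > 0:
            prev = text[i - 1]
            if not prev.isalnum() and not prev.isspace():
                name = unicodedata.name(prev, u"UNKNOWN")
                found.append((prev, hex(ord(prev)), name))
    return found

# Short fragment in the style of the extracted article text.
fragment = u"Friday, Gandhi\\'s 146th birth anniversary."
for entry in chars_before_quotes(fragment):
    print(entry)
```

If the reported code point is not `0x5c`, a pattern written for the ASCII backslash would not be expected to match, and the replacement would need to target the character that is actually there.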