python - How do I format tweets collected through the Twitter API using Python?

Question

I collected some tweets through the Twitter API and then counted the words using split(' ') in Python. However, some of the words look like this:

correct! 
correct.
,correct
blah"
...

So how can I format the tweets so that they contain no punctuation? Or should I perhaps try another way to split the tweets? Thanks.

score 3 · Accepted Answer

You can use re.split to split on multiple characters:

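from string import punctuation
import re

# Character class: any ASCII punctuation character or whitespace
puncrx = re.compile(r'[{}\s]'.format(re.escape(punctuation)))
# your_tweet is the tweet text; filter(None, ...) drops the empty strings
print(list(filter(None, puncrx.split(your_tweet))))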
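As a quick check of the splitting approach above, the sketch below runs it on a made-up tweet built from the words in the question. The sample text and the helper name `split_words` are assumptions for illustration. Note that `#` and `@` are part of `string.punctuation`, so hashtags and mentions lose their prefix with this pattern.

```python
import re
from string import punctuation

# Character class matching any ASCII punctuation character or whitespace.
puncrx = re.compile(r'[{}\s]'.format(re.escape(punctuation)))

def split_words(text):
    """Split text on punctuation/whitespace and drop the empty strings."""
    return [word for word in puncrx.split(text) if word]

# Hypothetical sample text, not real API output.
sample = 'correct! correct. ,correct blah" #python'
print(split_words(sample))
```

If hashtags and mentions should survive intact, the findall variant with `[\w#@]+` is probably the better fit.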

Alternatively, find only the words made up of runs of certain characters:

print(re.findall(r'[\w#@]+', your_tweet))

For example:

print(re.findall(r'[\w@#]+', 'talking about #python with @someone is so much fun! Is there a     140 char limit? So not cool!'))
# ['talking', 'about', '#python', 'with', '@someone', 'is', 'so', 'much', 'fun', 'Is', 'there', 'a', '140', 'char', 'limit', 'So', 'not', 'cool']

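I originally had a smiley in the example, but of course those end up being filtered out by this approach, so that is something to be aware of.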
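If smileys matter for your analysis, one possible workaround is to put an emoticon alternative in front of the word pattern. This is only a sketch: the emoticon pattern below is a small hand-picked assumption (eyes, optional nose, mouth), not a complete list, and the names `TOKEN_RX` and `tokenize` are made up for illustration.

```python
import re

# Assumption: a tiny emoticon pattern such as :) ;-) =D :-( :P
# It is listed first so that a smiley is kept as one token instead of dropped.
TOKEN_RX = re.compile(r"[:;=][-']?[)(DPp]|[\w@#]+")

def tokenize(text):
    """Return words, #hashtags, @mentions and simple emoticons, in order."""
    return TOKEN_RX.findall(text)

print(tokenize('talking about #python with @someone is so much fun :) So not cool :-('))
```

A pattern this short will miss many emoticons and all emoji, so treat it as a starting point.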

score 1 · Accepted Answer

Try removing the punctuation from the string before doing the split.

import string

s = "Some nice sentence.  This has punctuation!"
# Python 3 form; in Python 2 this was s.translate(string.maketrans("", ""), string.punctuation)
out = s.translate(str.maketrans("", "", string.punctuation))

Then perform the split on out.
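Put together with the word counting from the question, the whole flow might look like the sketch below. It assumes Python 3, where `str.maketrans` takes the characters to delete as its third argument. The extended sample sentence and the use of `collections.Counter` are additions for illustration.

```python
import string
from collections import Counter

s = "Some nice sentence.  This has punctuation! Nice, nice."

# Python 3: the third argument of str.maketrans lists characters to delete.
table = str.maketrans("", "", string.punctuation)
out = s.translate(table)

# split() with no argument collapses runs of whitespace, unlike split(' '),
# which would produce empty strings for the double space.
words = out.lower().split()
counts = Counter(words)
print(counts.most_common(1))
```

Lowercasing before counting is a design choice here so that "Nice" and "nice" are counted as the same word.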

score 1 · Accepted Answer

I suggest cleaning the special symbols out of the text before splitting, using the following code:

tweet_object["text"] = re.sub(u'[!?@#$.,#:\u2026]', '', tweet_object["text"])

You need to import re before you can use the sub function:

import re
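As a rough end-to-end illustration of this approach, the sketch below applies the substitution to a made-up tweet and then splits it. The dictionary is a hypothetical stand-in for one tweet returned by the API, and only its "text" key is assumed.

```python
import re

# Hypothetical stand-in for one tweet returned by the API.
tweet_object = {"text": "So much fun, #python with @someone\u2026 correct!"}

# Remove the listed symbols (\u2026 is the horizontal ellipsis character).
tweet_object["text"] = re.sub('[!?@#$.,:\u2026]', '', tweet_object["text"])

# Split on any whitespace once the symbols are gone.
words = tweet_object["text"].split()
print(words)
```

Because `@` and `#` are in the character class, mentions and hashtags become plain words. Drop those two characters from the pattern if you want to keep them.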
