python - How do I remove the date from tweets?

Question

I need some advice... I have some tweets:

Mon Apr 06 22:19:45 PDT @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. :( You shoulda got David Carr of Third Day to do it. ;D
Mon Apr 06 22:19:49 PDT is upset that he can't update his Facebook by texting it... and might cry as a result :( School today also. Blah!
Mon Apr 06 22:19:53 PDT @Kenichan I dived many times for the ball. Managed to save 50% :( The rest go out of bounds
Mon Apr 06 22:19:57 PDT my whole body feels itchy and like its on fire :(

How can I remove the Mon Apr 06 22:19:57 PDT part? With a regular expression?

score 2 · Accepted Answer

If this is one string, just split each line at the first PDT:

for line in tweets.splitlines():
    print(line.split(' PDT ', 1)[1])

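Each line is split at the first occurrence of PDT (with the surrounding spaces), and the second half of the result is printed.

But perhaps you could instead stop the code that produces these strings from adding the date in the first place?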
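One caveat with the split above: indexing [1] raises IndexError for any line that does not contain ' PDT '. A minimal sketch of a more forgiving variant, using str.partition, is shown here. The helper name strip_prefix and the second sample line are made up for illustration.

```python
def strip_prefix(line, marker=" PDT "):
    """Return the text after the first marker, or the line unchanged if the marker is absent."""
    head, sep, tail = line.partition(marker)
    # sep is an empty string when the marker was not found
    return tail if sep else line


tweets = (
    "Mon Apr 06 22:19:57 PDT my whole body feels itchy and like its on fire :(\n"
    "a line without a timestamp"  # hypothetical line, only here to show the fallback
)

cleaned = [strip_prefix(line) for line in tweets.splitlines()]
for line in cleaned:
    print(line)
```

This still assumes the timezone is always PDT, exactly as the original split does.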

score 2 · Accepted Answer

for line in lines:
    print(line[24:])

If the date/time format is always the same, it can be this simple.
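The 24 in the slice is the length of the timestamp plus the space that follows it. A small sketch that derives the number from a sample prefix instead of hard-coding it might look like this; the names SAMPLE_PREFIX and PREFIX_LEN are only illustrative.

```python
# Length of the fixed-width prefix, including the trailing space
SAMPLE_PREFIX = "Mon Apr 06 22:19:45 PDT "
PREFIX_LEN = len(SAMPLE_PREFIX)

lines = [
    "Mon Apr 06 22:19:53 PDT @Kenichan I dived many times for the ball. Managed to save 50% :( The rest go out of bounds",
    "Mon Apr 06 22:19:57 PDT my whole body feels itchy and like its on fire :(",
]

# Drop the first PREFIX_LEN characters of every line
stripped = [line[PREFIX_LEN:] for line in lines]
for text in stripped:
    print(text)
```

This only works while every timestamp has exactly the same width, for example a zero-padded day and a three-letter timezone.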

score 1 · Accepted Answer

If they are all strings stored in the same way, you can split:

tweet = "Mon Apr 06 22:19:57 PDT SomeGuy Im not white enough to be excited for a new version of Windows"

tweet = tweet.split(None, 5)[-1]

This leaves tweet holding

"SomeGuy Im not white enough to be excited for a new version of Windows"
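To see why split(None, 5) works, it may help to look at the pieces it returns: at most five splits are made on whitespace, so the first five items are the date tokens and the sixth is the rest of the tweet, left untouched. A short sketch, using a sample string with extra spaces added for illustration:

```python
# Sample tweet; the run of three spaces is deliberate, to show it survives
tweet = "Mon Apr 06 22:19:57 PDT @Kenichan I dived   many times"

parts = tweet.split(None, 5)  # sep=None splits on whitespace, at most 5 times
date_tokens = parts[:5]       # weekday, month, day, time, timezone
text = parts[-1]              # everything after the fifth split

print(date_tokens)
print(text)
```

Note that for a line holding only the timestamp, parts would have five items and parts[-1] would be the timezone, so a length check may be worth adding if such lines can occur.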

score 0 · Accepted Answer

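Splitting each tweet into a list of words and dropping the first five words (the five date tokens) seems more likely to keep working if the timezone changes.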

clean_tweets = []

for tweet in tweets:
    words = tweet.split()
    del words[0:5]
    clean_tweet = " ".join(words)
    clean_tweets.append(clean_tweet)

By default, split() splits on whitespace, so you don't have to specify a delimiter.
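The same idea can be written more compactly with a slice and a list comprehension. This is only a sketch; the second sample tweet uses an invented EDT timestamp to show that the timezone value does not matter.

```python
tweets = [
    "Mon Apr 06 22:19:57 PDT my whole body feels itchy and like its on fire :(",
    "Tue Apr 07 01:02:03 EDT a made-up tweet in another timezone",  # hypothetical sample
]

# Split on whitespace, skip the five date tokens, and join the rest back together
clean_tweets = [" ".join(tweet.split()[5:]) for tweet in tweets]

for clean_tweet in clean_tweets:
    print(clean_tweet)
```

One side effect of split() followed by join() is that any run of whitespace inside the tweet collapses to a single space.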

score 0 · Accepted Answer

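I assume you can't rely on PDT, because you can't assume the timezone will always be PDT. The most easily recognizable part of the string seems to be [0-9]+:[0-9]+:[0-9]+ - the time.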

/^.*[0-9]+:[0-9]+:[0-9]+\s+[A-Z]{3}\s*(.*)$/

This captures the string that follows the time and the all-caps three-letter timezone.
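The pattern above is written in /.../ notation. A minimal sketch of how it could be used from Python with the re module follows; the names TIME_AND_ZONE and text_after_timestamp are made up for illustration, and the EDT timestamp is an invented sample.

```python
import re

# Same pattern as above: anything, a time, a three-letter uppercase zone, then the text
TIME_AND_ZONE = re.compile(r"^.*[0-9]+:[0-9]+:[0-9]+\s+[A-Z]{3}\s*(.*)$")


def text_after_timestamp(line):
    """Return the captured text after the timestamp, or the line unchanged if nothing matches."""
    match = TIME_AND_ZONE.match(line)
    return match.group(1) if match else line


result = text_after_timestamp(
    "Mon Apr 06 22:19:49 EDT is upset that he can't update his Facebook by texting it..."
)
print(result)
```

Because the leading .* is greedy, a tweet whose own text contains something like "10:30:00 EST" would be cut at that later position, so this is best treated as a starting point.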

score 0 · Accepted Answer

If the date and timezone vary, I would create a general pattern:

import re

data = """Mon Apr 06 22:19:45 PDT @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. :( You shoulda got David Carr of Third Day to do it. ;D
Mon Apr 06 22:19:49 PDT is upset that he can't update his Facebook by texting it... and might cry as a result :( School today also. Blah!
Mon Apr 06 22:19:53 PDT @Kenichan I dived many times for the ball. Managed to save 50% :( The rest go out of bounds
Mon Apr 06 22:19:57 PDT my whole body feels itchy and like its on fire :("""

# weekday, month, day, hh:mm:ss, three-letter timezone
pattern = r'[a-zA-Z]{3}\s[a-zA-Z]{3}\s\d{2}\s(\d{2}\:*){3}\s[a-zA-Z]{3}'

for line in data.splitlines():
    line = re.sub(pattern, '', line)
    print("{}\n".format(line))

Output:

 @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. :( You shoulda got David Carr of Third Day to do it. ;D

 is upset that he can't update his Facebook by texting it... and might cry as a result :( School today also. Blah!

 @Kenichan I dived many times for the ball. Managed to save 50% :( The rest go out of bounds

 my whole body feels itchy and like its on fire :(
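The output of this answer keeps a leading space, because the pattern stops at the timezone and does not consume the space after it. A hedged variant is sketched below: it anchors the pattern to the start of each line, also removes the trailing spaces, and processes the whole string in one call with re.MULTILINE. The name DATE_PREFIX is illustrative, and the stricter \d{2}:\d{2}:\d{2} time is a substitution of mine, not the answer's original sub-pattern.

```python
import re

# weekday, month, day, hh:mm:ss, timezone, then the spaces or tabs that follow
DATE_PREFIX = re.compile(
    r"^[A-Za-z]{3}\s[A-Za-z]{3}\s\d{2}\s\d{2}:\d{2}:\d{2}\s[A-Za-z]{3}[ \t]+",
    re.MULTILINE,  # let ^ match at the start of every line, not just the string
)

data = """Mon Apr 06 22:19:53 PDT @Kenichan I dived many times for the ball.
Mon Apr 06 22:19:57 PDT my whole body feels itchy and like its on fire :("""

cleaned = DATE_PREFIX.sub("", data)
print(cleaned)
```

Anchoring with ^ also means a date-like string in the middle of a tweet is left alone.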
