I want to tokenize tweets that contain multiple emoji which are not separated by whitespace. I have tried both NLTK's TweetTokenizer and spaCy, but neither tokenizes emoji skin tone modifiers correctly. This needs to run over a huge dataset, so performance may be a concern. Any suggestions?
You may need to use Firefox or Safari to see the exact skin tone emoji, because Chrome sometimes fails to render them!
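As far as I understand, the root cause is that a skin tone emoji is not a single character: it is an emoji modifier sequence, i.e. a base emoji code point followed by a separate skin tone modifier code point (U+1F3FB through U+1F3FF). A minimal check (the thumbs-up emoji here is just an illustrative example, not from my actual data):

# A skin tone emoji is two code points: base emoji + skin tone modifier
s = "\U0001F44D\U0001F3FD"       # THUMBS UP SIGN + EMOJI MODIFIER FITZPATRICK TYPE-4
print(len(s))                    # 2 -> two code points, rendered as one glyph
print([hex(ord(c)) for c in s])  # ['0x1f44d', '0x1f3fd']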
# NLTK
from nltk.tokenize.casual import TweetTokenizer
sentence = "I'm the most famous emoji but what about and "
t = TweetTokenizer()
print(t.tokenize(sentence))
# Output
["I'm", 'the', 'most', 'famous', 'emoji', '', '', '', 'but', 'what', 'about', '', 'and', '', '', '', '', '', '']
and
# Spacy
import spacy
nlp = spacy.load("en_core_web_sm")
sentence = nlp("I'm the most famous emoji but what about and ")
print([token.text for token in sentence])
# Output
['I', "'m", 'the', 'most', 'famous', 'emoji', '', '', '', 'but', 'what', 'about', '', 'and', '', '', '', '', '', '']
Expected output
["I'm", 'the', 'most', 'famous', 'emoji', '', '', '', 'but', 'what', 'about', '', 'and', '', '', '', '']