python - Extracting hashtags from Unicode (foreign-language) paragraphs in Python

Question

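I am trying to write a function that extracts hashtags from paragraphs, basically words that start with # (#cool #life #cars #سيارات).

I have tried several approaches, such as split() and regular expressions, but none of them handled Unicode characters from languages such as Arabic or Russian.

I tried split(), which works fine, but it picks up any word, and in my case I cannot include words that contain special characters such as ,.%$]{}{)(.. I also tried to add some validation, such as limiting the word length to no more than 15 characters.

I tried this approach: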

def _strip_hash_tags(self, ):
    """tags should not be more than 15 characters"""
    hash_tags = re.compile(r'(?i)(?<=\#)\w+')
    return [i for i in hash_tags.findall(self.content) if len(i) < 15]

This only works for English, not for other languages. Any suggestions?

score 3 · Accepted Answer

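As discussed here: python regex with utf8 issue.

First, you should use re.compile(ur'<unicode string>'). It is also a good idea to add the re.UNICODE flag (though I am not sure whether it is really needed here).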

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import re


def strip_hash_tags(content):
    """tags should not be more than 15 characters"""
    # Python 2: re.UNICODE makes \w match non-ASCII letters as well
    hash_tags = re.compile(ur'(?i)(?<=\#)\w+', re.UNICODE)
    return [i for i in hash_tags.findall(content) if len(i) < 15]

# named "text" rather than "str" to avoid shadowing the built-in type
text = u"am trying to work on a function to extract hashtags from paragraphs, basically words that starts with # (#cool #life #cars #سيارات)"

print(strip_hash_tags(text))

# [u'cool', u'life', u'cars', u'\u0633\u064a\u0627\u0631\u0627\u062a']
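
The code above targets Python 2, where the ur'' prefix is valid and re.UNICODE is what lets \w match non-ASCII letters. For readers on Python 3, here is a minimal sketch of the same idea. It assumes Python 3, where str patterns are Unicode-aware by default and the ur'' prefix is a syntax error. The names HASHTAG_RE, extract_hashtags and max_len are illustrative and do not come from the original answer. The original filter len(i) < 15 drops tags of exactly 15 characters, so the sketch uses <= to match the stated limit.

```python
import re

# A '#' immediately before (lookbehind), then one or more word characters.
# For str patterns in Python 3, \w already covers Unicode letters and digits.
HASHTAG_RE = re.compile(r'(?<=#)\w+')


def extract_hashtags(content, max_len=15):
    """Return hashtag bodies (without '#') of at most max_len characters."""
    return [tag for tag in HASHTAG_RE.findall(content) if len(tag) <= max_len]


text = "words that start with # (#cool #life #cars #\u0633\u064a\u0627\u0631\u0627\u062a)"
print(extract_hashtags(text))

# For contrast: re.ASCII restricts \w to [a-zA-Z0-9_], which reproduces the
# "English only" behaviour described in the question.
print(re.findall(r'(?<=#)\w+', text, re.ASCII))
```

Because \w never matches punctuation, words are cut off at characters such as , . % $ ] { }, which addresses the special-character concern from the question. A tag longer than max_len is dropped entirely rather than truncated.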
