python - Splitting text by Unicode script in Python

Question

I have some text like this:

এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের

The language is Bengali (except for the one English word, of course).

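I want to get a list of the Bengali words in the text (i.e., a word tokenization problem). The Unicode range for Bengali is 0980 to 09FF. There is also the \p{Bengali} script property (which I don't know how to use). This is what I have: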

import re
Pattern = re.compile(r'\[\u0980-\u09FF]+')
Words = split(Pattern, Text)

This doesn't work. How can I make it work? If possible, I would also prefer to use \p{Bengali} rather than an explicit Unicode range.

score 4 · Accepted Answer

Python's built-in re module does not yet understand Unicode script properties such as \p{...}.

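Your version should work once you remove the backslash that escapes the opening bracket and use findall() instead of split() (you weren't even calling re.split(), but I assume that's just a typo).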

Also, since you are not using Python 3, as you said in your recent comment, you may need to pass the re.UNICODE option and make sure that text is actually a Unicode string.

import re
pattern = re.compile(ur'[\u0980-\u09FF]+', re.UNICODE)
words = re.findall(pattern, text)
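For readers on Python 3, here is a minimal sketch of the same fix. It assumes Python 3.3 or later, where str is already Unicode and the re module accepts \u escapes inside a raw pattern, so neither the ur'' prefix nor re.UNICODE is needed. The names text, bengali_run, and words are illustrative only.

```python
import re

# Python 3 sketch: str is already Unicode, so no ur'' prefix or re.UNICODE flag.
text = "এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের"

# One or more consecutive code points from the Bengali block (U+0980 to U+09FF).
bengali_run = re.compile(r"[\u0980-\u09FF]+")

# findall() returns the matching runs; split() would return the text between them.
words = bengali_run.findall(text)
print(words)
```

Because the pattern matches only the Bengali block, the English word, the parentheses, and the comma are dropped rather than returned as separate tokens.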

score 0 · Accepted Answer

You can install the alternative regex library with pip:

pip3 install regex

and use the \p{ScriptName} pattern to match the script you're looking for:

import regex
t = "এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের"
t = regex.findall(r"[\p{Bengali}]+", t)
print(t)

More information about the regex module
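If installing a third-party package is not an option, a rough approximation is possible with the standard library alone. The sketch below is not equivalent to \p{Bengali}: it relies on the assumption that the Unicode character name of every character of interest starts with "BENGALI", which is a naming heuristic rather than the actual Script property. The helper names is_bengali and bengali_words are hypothetical.

```python
import unicodedata
from itertools import groupby


def is_bengali(ch):
    # Heuristic (assumption): treat a character as Bengali when its Unicode
    # name starts with "BENGALI". Unnamed characters fall back to "".
    return unicodedata.name(ch, "").startswith("BENGALI")


def bengali_words(text):
    # Group consecutive characters by the heuristic and keep the Bengali runs.
    return ["".join(run) for keep, run in groupby(text, key=is_bengali) if keep]


text = "এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের"
print(bengali_words(text))
```

Characters that Unicode assigns to other blocks or to the Common script, such as the danda (U+0964, named DEVANAGARI DANDA), are not picked up by this check, so the regex module remains the more faithful choice when it is available.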

score -1 · Accepted Answer

You can just split on whitespace:

>>> import re
>>> x = 'এর জন্য বুদ্ধির (Reason) প্রয়োজন নেই, প্রয়োজন নিজের'
>>> re.split('\s', x)
['\xe0\xa6\x8f\xe0\xa6\xb0', '\xe0\xa6\x9c\xe0\xa6\xa8\xe0\xa7\x8d\xe0\xa6\xaf', '\xe0\xa6\xac\xe0\xa7\x81\xe0\xa6\xa6\xe0\xa7\x8d\xe0\xa6\xa7\xe0\xa6\xbf\xe0\xa6\xb0', '(Reason)', '\xe0\xa6\xaa\xe0\xa7\x8d\xe0\xa6\xb0\xe0\xa6\xaf\xe0\xa6\xbc\xe0\xa7\x8b\xe0\xa6\x9c\xe0\xa6\xa8', '\xe0\xa6\xa8\xe0\xa7\x87\xe0\xa6\x87,', '\xe0\xa6\xaa\xe0\xa7\x8d\xe0\xa6\xb0\xe0\xa6\xaf\xe0\xa6\xbc\xe0\xa7\x8b\xe0\xa6\x9c\xe0\xa6\xa8', '\xe0\xa6\xa8\xe0\xa6\xbf\xe0\xa6\x9c\xe0\xa7\x87\xe0\xa6\xb0']
