python - How to filter out unicode characters in Python?

Question

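I am currently working with unicode values in Python. First of all, the existing questions and answers have been a great help. Thank you :)

I am now stuck on a project in which I want to isolate the unicode values of each language.

For example, a certain function should accept only Hindi code points, from unicode value 0900 to 097F, and reject all other unicode values.

So far I have written: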

for i in range(len(l1)):
    for j in range(len(l1[i])):
        unn = '%04x' % ord(l1[i][j])
        unn1 = int(unn, 16)
        if unn1 not in range(2304, 2431):
            l1[i] = l1[i].replace(l1[i][j], '')

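This code takes the values from the list l1 and does what I want. The problem is that it handles one character and then terminates at line 3.

When I run it again manually, it handles another one or two characters and then terminates again.

I can't even put it in a loop...

Please help.

Update:

I didn't want to create another post, so I am reusing this one. I got some help and modified the code. There was an indexing problem.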

for i in range(len(dictt)):
    j=0
    while(1):
        if j >= len(dictt[i]):
            break
        unn = '%04x' % ord(dictt[i][j])
        unn1 = int(unn, 16)
        j = j+1
        if unn1 not in range(2304, 2431):
            dictt[i] = dictt[i].replace(dictt[i][j-1], '')
            j=0

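This code works perfectly for my earlier query, that is, for that specific range. But if I change the range or the function, the same problem appears again on the same line. Why does that line give an error?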

score 1 · Accepted Answer

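The best solution is most likely to use a regular expression to filter out the unwanted characters. You basically need a regular expression that matches your Hindi characters, but as far as I know the "re" module has bugs with Hindi characters, so I suggest installing the "regex" module with the following command:

$ pip install regex

After that, you can simply check word by word whether every word is written in Hindi: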

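import regex

# Assumption: a Hindi word is one or more characters from U+0900 to U+097F.
HINDI_WORD_REGEX = '[\u0900-\u097F]+'
your_string = 'नमस्ते hello'  # sample input; replace with your own text
words = your_string.split(' ')
for word in words:
    if not regex.fullmatch(HINDI_WORD_REGEX, word):
        # Handle the non-Hindi word however you need; here it is just reported.
        print('not Hindi:', word)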
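
If the goal is to strip non-Hindi characters rather than test whole words, a negated character class can do it in one substitution. This is a minimal sketch, assuming Python 3 and the Devanagari block U+0900 to U+097F from the question; `keep_devanagari` is a hypothetical helper name. It uses the standard `re` module, which handles a plain code point range like this one; `regex.sub` takes the same arguments if you prefer the third-party module.

```python
import re

# Matches every character that is NOT in the Devanagari block (U+0900..U+097F).
NON_DEVANAGARI = re.compile('[^\u0900-\u097F]')

def keep_devanagari(text):
    """Return text with all non-Devanagari characters removed."""
    return NON_DEVANAGARI.sub('', text)

# Spaces, digits and Latin letters are dropped along with everything else
# outside the block.
print(keep_devanagari('abc \u0928\u092e 123'))
```

Note that whitespace is removed too; add `\s` inside the character class if word boundaries should survive.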

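You can also find some useful information related to your question here:

Python - pyparsing unicode characters

Python unicode regular expression matching failing with some unicode characters - bug or mistake?

I hope this at least helps you get started. Good luck!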

score 0 · Accepted Answer


def filter_chars(text, allowed):
    # Keep only characters whose code point is in `allowed`, e.g. range(0x0900, 0x0980).
    return ''.join(char for char in text if ord(char) in allowed)
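
A usage sketch for this approach, assuming Python 3, where `range` objects support fast membership tests on integers. The name `filter_by_code_point` and the sample strings are illustrative only. Note that the stop value of a `range` is exclusive, so covering U+097F requires 0x0980 as the upper bound.

```python
# Devanagari block: U+0900 through U+097F inclusive (stop value is exclusive).
DEVANAGARI = range(0x0900, 0x0980)

def filter_by_code_point(text, allowed):
    """Keep only characters whose code point is in `allowed`."""
    return ''.join(ch for ch in text if ord(ch) in allowed)

# Hypothetical sample data: mixed Devanagari and Latin text.
words = ['\u0915\u0916 text', 'abc', '\u0928\u092e!']
cleaned = [filter_by_code_point(w, DEVANAGARI) for w in words]
print(cleaned)
```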

answered 2013-06-11T09:06:56.007

score 0 · Accepted Answer

Try this:

def converter(string_, range_=(0x0900, 0x0980)):
    """Filter the unicode characters, keeping code points in [low, high)."""
    low, high = range_  # default covers U+0900..U+097F; the upper bound is exclusive
    return ''.join(c for c in string_ if low <= ord(c) < high)
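
On why the first loop in the question fails: `range(len(l1[i]))` is computed once, but `replace` shortens the string inside the loop, so a later index `j` can point past the end and `l1[i][j]` raises IndexError. The sketch below, assuming Python 3 and hypothetical sample data, reproduces that failure and then avoids it by building a new string instead of editing the one being indexed. `broken_filter` and `safe_filter` are illustrative names.

```python
def broken_filter(s):
    """Mimics the question's loop: indexes a string that shrinks as it goes."""
    for j in range(len(s)):          # length is fixed here, before any removal
        if not (0x0900 <= ord(s[j]) <= 0x097F):
            s = s.replace(s[j], '')  # s gets shorter, j keeps growing
    return s

def safe_filter(s):
    """Builds a new string, so no index can go stale."""
    return ''.join(c for c in s if 0x0900 <= ord(c) <= 0x097F)

sample = 'ab\u0915\u0916cd'
try:
    broken_filter(sample)
except IndexError as exc:
    print('IndexError:', exc)

print(safe_filter(sample) == '\u0915\u0916')  # True
```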
