python - Python: Testing for UTF-8 characters in a string

Question

I need to test whether a string that has already been encoded with str.encode('utf-8') is right-to-left. I tried

if u'\u200f' in str.decode('utf-8'):
  print 'found it'

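It neither complains nor works.

Q: What is the correct syntax for testing whether a single non-ASCII character occurs in a string? This is Python 2.6; I cannot use 3.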

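Q: I remember reading that even without an explicit RLM, predominantly right-to-left characters default to RTL. Does anyone know a way to test such a string without knowing which language to expect (i.e. the string could be Arabic, Hebrew, or any other RTL language)?

Thanks for any help.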

score 8 · Accepted Answer

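Every Unicode character has a "bidirectional" class. You can look it up with unicodedata.bidirectional. The function returns a string such as 'L', 'R', 'AL', etc., with the following meanings: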

| L   | Left_To_Right           | any strong left-to-right character                                |
| LRE | Left_To_Right_Embedding | U+202A: the LR embedding control                                  |
| LRO | Left_To_Right_Override  | U+202D: the LR override control                                   |
| R   | Right_To_Left           | any strong right-to-left (non-Arabic-type) character              |
| AL  | Arabic_Letter           | any strong right-to-left (Arabic-type) character                  |
| RLE | Right_To_Left_Embedding | U+202B: the RL embedding control                                  |
| RLO | Right_To_Left_Override  | U+202E: the RL override control                                   |
| PDF | Pop_Directional_Format  | U+202C: terminates an embedding or override control               |
| EN  | European_Number         | any ASCII digit or Eastern Arabic-Indic digit                     |
| ES  | European_Separator      | plus and minus signs                                              |
| ET  | European_Terminator     | a terminator in a numeric format context, includes currency signs |
| AN  | Arabic_Number           | any Arabic-Indic digit                                            |
| CS  | Common_Separator        | commas, colons, and slashes                                       |
| NSM | Nonspacing_Mark         | any nonspacing mark                                               |
| BN  | Boundary_Neutral        | most format characters, control codes, or noncharacters           |
| B   | Paragraph_Separator     | various newline characters                                        |
| S   | Segment_Separator       | various segment-related control codes                             |
| WS  | White_Space             | spaces                                                            |
| ON  | Other_Neutral           | most other symbols and punctuation marks                          |

For example:

In [3]: import unicodedata as UD
In [5]: UD.bidirectional(u'\u0688')
Out[5]: 'AL'

In [6]: UD.bidirectional(u'\u200f')
Out[6]: 'R'

In [7]: UD.bidirectional(u'H')
Out[7]: 'L'

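So you can guess whether a string is right-to-left by checking whether it consists mostly of characters whose bidirectional class is R or AL.

For example,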

# coding: utf-8
import unicodedata as UD

texts = ['ڈوگرى'.decode('utf-8'),
         u'Hello']
for text in texts:
    x = len([None for ch in text if UD.bidirectional(ch) in ('R', 'AL')])/float(len(text))
    print('{t} => {c}'.format(t=text.encode('utf-8'), c='RTL' if x>0.5 else 'LTR'))

yields

ڈوگرى => RTL
Hello => LTR
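One caveat with a plain ratio over len(text): digits, spaces and punctuation have neutral or weak classes, so they count against RTL even though they carry no direction of their own. A variation, offered only as a sketch, is to compare strong characters alone. The helper name guess_direction and the tie-breaking rule are my own assumptions, not part of unicodedata:

```python
# -*- coding: utf-8 -*-
import unicodedata

STRONG_RTL = ('R', 'AL')   # strong right-to-left classes
STRONG_LTR = ('L',)        # strong left-to-right class


def guess_direction(text):
    """Guess 'RTL' or 'LTR' for a unicode string (hypothetical helper).

    Only strong characters are counted; digits, spaces and punctuation
    are ignored. Returns None if the text has no strong characters.
    A tie is reported as 'LTR' (an arbitrary choice).
    """
    rtl = ltr = 0
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in STRONG_RTL:
            rtl += 1
        elif cls in STRONG_LTR:
            ltr += 1
    if rtl == 0 and ltr == 0:
        return None
    return 'RTL' if rtl > ltr else 'LTR'
```

With this version a short Hebrew word followed by a run of digits and punctuation is still reported as RTL, whereas the ratio over the whole string would fall below 0.5. Whether that is the behaviour you want depends on your data.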

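Regarding the first question:

Q: What is the correct syntax for testing whether a single non-ASCII character occurs in a string? This is Python 2.6; I cannot use 3.

Your way of testing whether a character is in a unicode object is correct. If u'\u200f' in str.decode('utf-8') neither complains nor works, then u'\u200f' is simply not in that unicode object.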
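To make that concrete, here is a small sketch that decodes a UTF-8 byte string and then tests it two ways: for the explicit mark U+200F, and for any character with a strong RTL class. The function names and sample data are made up for illustration. Note that it avoids calling the variable str, which would shadow the built-in type.

```python
# -*- coding: utf-8 -*-
import unicodedata

RLM = u'\u200f'  # RIGHT-TO-LEFT MARK


def contains_rlm(raw):
    """True if the UTF-8 byte string `raw` contains U+200F."""
    return RLM in raw.decode('utf-8')


def has_rtl_char(raw):
    """True if `raw` contains any character of bidi class R or AL."""
    text = raw.decode('utf-8')
    return any(unicodedata.bidirectional(ch) in ('R', 'AL') for ch in text)


# Sample inputs (assumed data): Latin text with an explicit mark,
# and Hebrew text without one.
with_mark = (u'abc' + RLM).encode('utf-8')
without_mark = u'\u05e9\u05dc\u05d5\u05dd'.encode('utf-8')
```

Here contains_rlm(without_mark) is False even though the text is Hebrew, which is probably what happened in your case: RTL text usually relies on the implicit direction of its letters and carries no explicit mark. has_rtl_char(without_mark) is True.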
