Every Unicode character has a "bidirectional" class. You can look it up with unicodedata.bidirectional. The function returns a string such as 'L', 'R', or 'AL', with the following meanings:
| L | Left_To_Right | any strong left-to-right character |
| LRE | Left_To_Right_Embedding | U+202A: the LR embedding control |
| LRO | Left_To_Right_Override | U+202D: the LR override control |
| R | Right_To_Left | any strong right-to-left (non-Arabic-type) character |
| AL | Arabic_Letter | any strong right-to-left (Arabic-type) character |
| RLE | Right_To_Left_Embedding | U+202B: the RL embedding control |
| RLO | Right_To_Left_Override | U+202E: the RL override control |
| PDF | Pop_Directional_Format | U+202C: terminates an embedding or override control |
| EN | European_Number | any ASCII digit or Eastern Arabic-Indic digit |
| ES | European_Separator | plus and minus signs |
| ET | European_Terminator | a terminator in a numeric format context, includes currency signs |
| AN | Arabic_Number | any Arabic-Indic digit |
| CS | Common_Separator | commas, colons, and slashes |
| NSM | Nonspacing_Mark | any nonspacing mark |
| BN | Boundary_Neutral | most format characters, control codes, or noncharacters |
| B | Paragraph_Separator | various newline characters |
| S | Segment_Separator | various segment-related control codes |
| WS | White_Space | spaces |
| ON | Other_Neutral | most other symbols and punctuation marks |
For example:
In [3]: import unicodedata as UD
In [5]: UD.bidirectional(u'\u0688')
Out[5]: 'AL'
In [6]: UD.bidirectional(u'\u200f')
Out[6]: 'R'
In [7]: UD.bidirectional(u'H')
Out[7]: 'L'
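The same lookup can be applied to every character of a mixed string. The following is only a minimal sketch: the helper name bidi_classes and the sample string are made up for illustration, and the code is written to run on both Python 2.6+ and Python 3.

```python
import unicodedata

def bidi_classes(text):
    """Return a list of (character, bidirectional class) pairs.

    Hypothetical helper, not part of unicodedata.
    """
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Latin letter, ASCII digit, space, Hebrew alef, Arabic alef, Arabic-Indic digit one
sample = u'a1 \u05d0\u0627\u0661'
for ch, cls in bidi_classes(sample):
    # Print the code point instead of the character to avoid console encoding issues
    print('U+{0:04X} {1}'.format(ord(ch), cls))
```

For this sample the classes come out as L, EN, WS, R, AL, and AN, in that order, matching the table above.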
So you can guess whether a string is right-to-left by checking whether it consists mostly of characters whose bidirectional class is R or AL.
For example,
# coding: utf-8
import unicodedata as UD
texts = ['ڈوگرى'.decode('utf-8'),
         u'Hello']
for text in texts:
    # Fraction of characters whose bidirectional class is strongly right-to-left
    x = len([None for ch in text if UD.bidirectional(ch) in ('R', 'AL')])/float(len(text))
    print('{t} => {c}'.format(t=text.encode('utf-8'), c='RTL' if x>0.5 else 'LTR'))
yields
ڈوگرى => RTL
Hello => LTR
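The heuristic above can also be wrapped in a reusable function. This is a rough sketch rather than a definitive implementation: the name guess_direction, the threshold parameter, and the choice to treat an empty string as 'LTR' are all assumptions added here. It takes unicode strings and runs on both Python 2.6+ and Python 3.

```python
import unicodedata

def guess_direction(text, threshold=0.5):
    """Guess 'RTL' or 'LTR' from the share of R/AL characters.

    Hypothetical helper; `threshold` and the empty-string
    fallback to 'LTR' are assumptions, not part of any library.
    """
    if not text:
        return 'LTR'  # nothing to measure, so fall back to a default
    rtl_count = sum(1 for ch in text
                    if unicodedata.bidirectional(ch) in ('R', 'AL'))
    ratio = rtl_count / float(len(text))
    return 'RTL' if ratio > threshold else 'LTR'

print(guess_direction(u'\u0688\u0648\u06af\u0631\u0649'))  # the Arabic-script sample above
print(guess_direction(u'Hello'))
```

Because every character counts toward the denominator, strings with many digits, spaces, or punctuation marks can fall below the threshold even when all of their letters are right-to-left, so the result is only a guess.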
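Regarding the first question:
Q: What is the correct syntax for testing whether a single non-ASCII character occurs in a string? Python 2.6, I cannot use 3.
Your method of testing whether a character is in a unicode string is correct. If u'\u200f' in str.decode('utf-8') raises no error but still does not come out True, then u'\u200f' is simply not in that unicode string.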
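As a small illustration of that membership test, the sketch below decodes a UTF-8 byte string and checks for U+200F (RIGHT-TO-LEFT MARK). The function name contains_rlm and the sample byte strings are hypothetical; the code runs on both Python 2.6+ and Python 3.

```python
def contains_rlm(data):
    """Return True if the UTF-8 byte string `data` contains U+200F.

    Hypothetical helper for illustration only.
    """
    # Decode first so the comparison is between unicode text and unicode text
    return u'\u200f' in data.decode('utf-8')

# U+200F is encoded in UTF-8 as the three bytes E2 80 8F
with_mark = b'abc\xe2\x80\x8fdef'
without_mark = b'abcdef'

print(contains_rlm(with_mark))
print(contains_rlm(without_mark))
```

The first call prints True and the second prints False.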