python - How to check whether a byte in a UTF-8 byte array is ASCII a-zA-Z in Python

Question

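Given a UTF-8 byte array, how can I check whether any single byte is in the character range a-zA-Z, knowing that these characters are each represented by a single byte? Since the integer value of each ASCII alphabetic character is exactly one byte in UTF-8, and no single byte of a multibyte character ever matches the integer value of one of these characters, checking the integer value of the byte looks like the fastest and safest approach.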

This works for me, but is it the most efficient?

def isAsciiAlphaByte(c):
    return ((c>96 and c<123) or (c> 64 and c<91))

>>> isAsciiAlphaByte(b"abc"[0])
True

score 2 · Accepted Answer

You can call .isalpha() on the whole bytearray and on its slices (including a one-byte slice):

>>> a = b"azAZ 123"
>>> b = bytearray(a)
>>> b.isalpha() # not all bytes in the array are ascii letters
False
>>> b[:4].isalpha() # but the first 4 bytes are alphabetic ([a-zA-Z])
True
>>> b[0:1].isalpha() # you need to use the slice notation even for a single byte
True

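The above relies on the fact that, although UTF-8 is a variable-width character encoding, no byte of a multibyte character falls in the ASCII range of letters.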
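That property can be checked by brute force. The following is a minimal sketch for Python 3 (the function name `multibyte_bytes_are_non_ascii` is mine, not from the answer): it encodes every non-ASCII code point and confirms that each resulting byte is 0x80 or higher, which puts it outside the whole ASCII range and therefore outside a-z and A-Z.

```python
def multibyte_bytes_are_non_ascii():
    """Return True if no byte of any multibyte UTF-8 sequence is below 0x80."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points cannot be encoded as UTF-8
        if min(chr(cp).encode("utf-8")) < 0x80:
            return False
    return True

print(multibyte_bytes_are_non_ascii())  # True
```

The loop covers a little over a million code points, so it takes a moment to run.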

It also assumes that the .isalpha() method of bytearray does not depend on the locale; for example, b"abа".isalpha() is locale-dependent on Python 2.

If you want to test an individual byte:

>>> from curses.ascii import isalpha
>>> b[0]
97
>>> isalpha(b[0]) # it accepts either integer (byte) or a string
True
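On platforms where the curses package is not available, a membership test against a precomputed set is one possible substitute. This is only a sketch for Python 3; `ASCII_LETTER_BYTES` and `is_ascii_alpha_byte` are names I made up for illustration.

```python
import string

# Iterating over a bytes object in Python 3 yields integers,
# so this set holds the 52 byte values for a-z and A-Z.
ASCII_LETTER_BYTES = frozenset(string.ascii_letters.encode("ascii"))

def is_ascii_alpha_byte(byte):
    """Return True if the integer byte value is an ASCII letter."""
    return byte in ASCII_LETTER_BYTES

data = bytearray(b"azAZ 123")
print([is_ascii_alpha_byte(b) for b in data])
# → [True, True, True, True, False, False, False, False]
```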

score 1 · Accepted Answer

You can use reduce to reduce a sequence to a single value. Here I simply apply a binary `and` after calling str.isalpha on each byte in the bytearray:

from functools import reduce  # a builtin on Python 2, needs this import on Python 3
ba = bytearray(b'test data')
reduce(lambda x,y: x and y, (chr(b).isalpha() for b in ba))

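But really,

str(ba).isalpha()

would work just fine on Python 2. On Python 3, str(ba) returns the repr of the bytearray, so call ba.isalpha() directly instead.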
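On Python 3 the same reduction is more commonly written with all(), which stops at the first non-letter, whereas reduce() visits every element. A minimal sketch, using an explicit range comparison so that bytes above 0x7F are never treated as letters:

```python
ba = bytearray(b"test data")

# all() short-circuits as soon as one byte fails the test.
every_byte_is_letter = all(65 <= b <= 90 or 97 <= b <= 122 for b in ba)

print(every_byte_is_letter)              # False: the space is not a letter
print(ba.isalpha())                      # False: the built-in method agrees
print(bytearray(b"testdata").isalpha())  # True
```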

score 1 · Accepted Answer

As I commented, you wouldn't test a single byte... but you can use string methods, including isalpha(), directly on the bytearray type:

>>> s = 'nowisthetime'
>>> b = bytearray(s, "UTF-8")
>>> b
bytearray(b'nowisthetime')
>>> b.isalpha()
True

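Edited to add: however, the isalpha() method does not seem to use the encoding to process each character, so it appears to work only for ASCII letters. For example: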

>>> b2 = bytearray("αβγ", "utf_8")
>>> b2.isalpha()
False
>>> str(b2,"utf_8")
'αβγ'
>>> str(b2,"utf_8").isalpha()
True

So if you really need to know about other alphabets, this may not be so hot. Oh well, it's faster anyway... :(

PS: I used Idle/Python 3.3 above. On Python 2 you would need to use u"" strings for the Greek letters.
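To make the difference concrete, here is a small sketch for Python 3 that contrasts the byte-level test with the decoded, Unicode-aware test. The two helper names are illustrative only.

```python
def ascii_letters_only(raw):
    """True if every byte is A-Z or a-z (byte-level test, no decoding)."""
    return raw.isalpha()

def unicode_letters_only(raw, encoding="utf-8"):
    """True if the decoded text consists solely of alphabetic characters."""
    return raw.decode(encoding).isalpha()

greek = "αβγ".encode("utf-8")
print(ascii_letters_only(greek))    # False
print(unicode_letters_only(greek))  # True
print(ascii_letters_only(b"abc"), unicode_letters_only(b"abc"))  # True True
```

Which one is appropriate depends on whether "letter" means a-zA-Z only or any alphabetic character.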

score 0 · Accepted Answer

As others have mentioned, non-ASCII encodings such as UTF-8 are multibyte encodings. That means you are in for trouble. Here is how:

>>> x = bytearray("string","utf8")
>>> y = bytearray([x[0]])
>>> print y
s

Everything looks fine, but what happens if we don't stick to English letters?

>>> x = bytearray(u"touché","utf8")
>>> len(x)
7

Uh oh...

>>> y = bytearray([x[-1]])
>>> print y
©

That's not good.
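What happened above is that é occupies two bytes in UTF-8, so printing only the final byte shows half of a character. The sketch below (Python 3) lays out the bytes. Both bytes of é are at or above 0x80, so a range check on integer values never mistakes them for ASCII letters.

```python
x = "touché".encode("utf-8")

print(len(x))                    # 7: six characters, but é takes two bytes
print([hex(b) for b in x[-2:]])  # ['0xc3', '0xa9']
print(x[-2:].decode("utf-8"))    # é

# Only the first five bytes are ASCII letters.
print([65 <= b <= 90 or 97 <= b <= 122 for b in x])
# → [True, True, True, True, True, False, False]
```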

score 0 · Accepted Answer

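This function looks like the faster solution: in the timeit run below it takes about 26% less time than chr(b).isalpha() (0.229 s versus 0.310 s), and in the cProfile run about 41% less (0.335 s versus 0.571 s). Either way, chr(b).isalpha() is very fast and saves writing a separate function, so both work fine.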

def isalphabyte(c):
    return ((c>96 and c<123) or (c> 64 and c<91))

>>> a = bytearray(b"azAZ 123")
>>> isalphabyte(a[0])
True
>>> isalphabyte(a[4])
False

>>> timeit.timeit('for i in range(1000000): chr(b"abc"[0]).isalpha()',number=1)
0.31040439769414263
>>> timeit.timeit('for i in range(1000000): isalphabyte(b"abc"[0])',"from __main__ import isalphabyte",number=1)
0.22895044913212814

>>> cProfile.run('for i in range(1000000): chr(b"abc"[0]).isalpha()')
         2000003 function calls in 0.571 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.364    0.364    0.571    0.571 <string>:1(<module>)
  1000000    0.156    0.000    0.156    0.000 {built-in method chr}
        1    0.000    0.000    0.571    0.571 {built-in method exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1000000    0.051    0.000    0.051    0.000 {method 'isalpha' of 'str' objects}


>>> cProfile.run('for i in range(1000000): isalphabyte(b"abc"[0])')
         1000003 function calls in 0.335 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000000    0.133    0.000    0.133    0.000 <pyshell#74>:1(isalphabyte)
        1    0.202    0.202    0.335    0.335 <string>:1(<module>)
        1    0.000    0.000    0.335    0.335 {built-in method exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
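For anyone repeating the measurement, timing several variants through the same harness keeps the comparison fair. The following is a sketch for Python 3. `in_set` and `via_chr` are names I introduced for illustration, and absolute timings will differ from machine to machine.

```python
import timeit

def isalphabyte(c):
    return (96 < c < 123) or (64 < c < 91)

LETTERS = frozenset(b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

def in_set(c):
    return c in LETTERS

def via_chr(c):
    # Caveat: str.isalpha() is Unicode-aware, so this also returns True
    # for byte values such as 0xE9, which chr() maps to 'é'.
    return chr(c).isalpha()

sample = b"azAZ 123"
for fn in (isalphabyte, in_set, via_chr):
    seconds = timeit.timeit(lambda: [fn(b) for b in sample], number=100000)
    print(fn.__name__, round(seconds, 3))
```

Note that via_chr is not equivalent to the other two on bytes above 0x7F, so it is only safe when the input is known to be pure ASCII.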
