python - Most efficient way to check strings for illegal characters

Question

When the set of illegal characters spans many separate ranges and individual code points, what is the most efficient way to check strings against it?

I timed two methods, and one is surprisingly much slower than the other (see the code below, assuming my timing is sound). Can the search-pattern method below be improved on? The solution need not be restricted to regexes.

import re
import timeit

# match pattern
matchPat = re.compile(r'[^'
                   r'\u0000-\u0008'    # C0 block first segment
                   r'\u000B\u000C'     # VT and FF; TAB U+0009, LF U+000A, and CR U+000D stay allowed
                   r'\u000E-\u001F'    # rest of C0
                   r'\u007F'           # disallow DEL U+007F
                   r'\u0080-\u009F'    # All C1 block
                   r'\u2028\u2029'     # LS and PS unicode newlines
                   r'\uD800-\uDFFF'    # surrogate block
                   r'\uFFFE\uFFFF'     # non-characters
                   r'\uFEFF]*$',       # BOM only allowed at the start of the stream
                   )

# search pattern
searchPat = re.compile(r'['
                   r'\u0000-\u0008'    # C0 block first segment
                   r'\u000B\u000C'     # VT and FF; TAB U+0009, LF U+000A, and CR U+000D stay allowed
                   r'\u000E-\u001F'    # rest of C0
                   r'\u007F'           # disallow DEL U+007F
                   r'\u0080-\u009F'    # All C1 block
                   r'\u2028\u2029'     # LS and PS unicode newlines
                   r'\uD800-\uDFFF'    # surrogate block
                   r'\uFFFE\uFFFF'     # non-characters
                   r'\uFEFF]',         # BOM only allowed at the start of the stream
                   )

s = 'allow TAB 0009, LF 000A, and CR 000D -- only allowed at the start of the stream' # sample legal string

def fmatch(s):
    # Valid if the whole string consists only of allowed characters.
    return matchPat.match(s) is not None

def fsearch(s):
    # Valid if no illegal character occurs anywhere in the string.
    return searchPat.search(s) is None

print('fmatch==', timeit.timeit("fmatch(s)", setup="from __main__ import fmatch, s", number=1000000))
print('fsearch==', timeit.timeit("fsearch(s)", setup="from __main__ import fsearch, s", number=1000000))


$ python3 valid.py
fmatch== 5.631323281995719
fsearch== 1.320517893997021
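
As a non-regex point of comparison, one option that could be timed alongside the two patterns is a precomputed frozenset of the illegal characters, checked with isdisjoint. This is only a sketch: the names ILLEGAL_RANGES, ILLEGAL_CHARS, and is_valid_set are hypothetical, the ranges are copied from searchPat above (so the BOM is rejected everywhere, as in the pattern), and whether it actually beats searchPat.search would have to be measured.

```python
# Hypothetical names; the ranges mirror the character class in searchPat.
ILLEGAL_RANGES = [
    (0x0000, 0x0008),  # C0 block, first segment
    (0x000B, 0x000C),  # VT and FF
    (0x000E, 0x001F),  # rest of C0
    (0x007F, 0x009F),  # DEL plus the whole C1 block
    (0x2028, 0x2029),  # LS and PS Unicode newlines
    (0xD800, 0xDFFF),  # surrogate block
    (0xFEFF, 0xFEFF),  # BOM
    (0xFFFE, 0xFFFF),  # non-characters
]

# Expand the inclusive ranges once, at import time.
ILLEGAL_CHARS = frozenset(
    chr(cp) for lo, hi in ILLEGAL_RANGES for cp in range(lo, hi + 1)
)

def is_valid_set(s):
    # True when no character of s is in the illegal set.
    return ILLEGAL_CHARS.isdisjoint(s)
```

It can be dropped into the same timeit harness by importing is_valid_set in the setup string in place of fsearch.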
