python - python regex: how to split string into distinct groups based on alphabets, digits and punctuation

Question

I am learning regular expressions using Python 2.7.

Given a sentence (assume lowercase ASCII) such as:

input = 'i like: a, b, 007 and c!!'

How would I tokenize the input string into

['i', 'like', ':', 'a', ',', 'b', ',', '007', 'and', 'c', '!!']

I can write the automaton and code the transition matrix in C++, but I would like to do this in Python.

I am unable to come up with a regex that will match these distinct classes of letters, digits and punctuation in one go.

I have seen a couple of Stack Overflow posts here and here, but do not quite follow their approaches.

I have been trying this for some time now and would appreciate your help.

P.S. This is not a homework question.

score 3 · Accepted Answer

>>> import re
>>> from string import punctuation
>>> text = 'i like: a, b, 007 and c!!'
>>> re.findall(r'\w+|[{0}]+'.format(re.escape(punctuation)), text)
['i', 'like', ':', 'a', ',', 'b', ',', '007', 'and', 'c', '!!']
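
As a minimal sketch of the same idea wrapped in a reusable function: the name `tokenize` and the constant `TOKEN_RE` are hypothetical and not part of the original answer. It assumes the punctuation set is `string.punctuation` and passes it through `re.escape` so that characters such as `]`, `\` and `^` are treated literally inside the character class.

```python
import re
from string import punctuation

# Hypothetical helper, not from the original answer.
# \w+ grabs runs of letters, digits and underscore; the bracketed class
# grabs runs of punctuation. re.escape keeps ']', '\\', '^' and '-' literal.
TOKEN_RE = re.compile(r'\w+|[' + re.escape(punctuation) + r']+')


def tokenize(text):
    """Return word-like and punctuation tokens, skipping whitespace."""
    return TOKEN_RE.findall(text)


tokens = tokenize('i like: a, b, 007 and c!!')
print(tokens)
```

Note that `\w` also matches the underscore, so `_` ends up inside word tokens rather than punctuation tokens.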

This also works, but wherever it cannot match alphanumeric characters it matches any run of non-whitespace characters instead:

>>> re.findall(r'\w+|\S+',text)
['i', 'like', ':', 'a', ',', 'b', ',', '007', 'and', 'c', '!!']
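
The two patterns agree on the sample sentence, but they can differ when punctuation is followed directly by a letter or digit. The sketch below illustrates that on a made-up string; the variable names are hypothetical. The last pattern is an extra assumption for the case where letters and digits should also land in separate tokens, relying on the question's premise of lowercase ASCII input.

```python
import re
from string import punctuation

sample = 'c!!d'  # made-up input: punctuation immediately followed by a letter

# \S+ is greedy over all non-whitespace, so it swallows the trailing 'd'.
loose = re.findall(r'\w+|\S+', sample)

# An explicit punctuation class stops at the first non-punctuation character.
strict = re.findall(r'\w+|[' + re.escape(punctuation) + r']+', sample)

# Assumption: lowercase ASCII only. Letters, digits and everything else
# (except whitespace) become three separate token classes.
by_class = re.findall(r'[a-z]+|[0-9]+|[^a-z0-9\s]+', 'abc123!!')

print(loose)     # ['c', '!!d']
print(strict)    # ['c', '!!', 'd']
print(by_class)  # ['abc', '123', '!!']
```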
