
I just learned about re.Scanner while looking for ways to parse a series of lines whose format can be a bit flexible. From what I can tell (without knowing what it's meant to do), it's exactly what I want, but I'm having an issue.

I define my scanner:

scanner = re.Scanner([
    (r"([0-9]+(?:\ h|h))",    lambda scanner,token:("HOURS", token)),
    ])

results, remainder = scanner.scan(line)

which should be able to find something like '1h' or '1 h' in the supplied string. However, this only works if the hours value is at the beginning of the string.

Passing in:

1 h words words words
bla 2 h words words

only the first entry gets parsed as an hour. Without being able to read up on Scanner, I assumed it would find a match anywhere in the supplied string, but it looks like it only matches at the beginning. It also seems to ignore a lot of the standard regex constructs (like () for capturing and (?:) for non-capturing groups).

Should I be looking somewhere else? Is it a bad idea to use a class that doesn't look like it's going to make it into the official version of Python?


1 Answer


Scanner.scan does indeed start at the beginning of the line, and requires that every bit of it match some pattern. The scan method stops at the first point where none of the patterns match, and the rest of the string is returned as the remainder.
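
To make that stopping behaviour concrete, here is a minimal sketch that runs the two sample lines through a scanner with only the hours rule. The pattern `[0-9]+ ?h` is a simplified stand-in for the one in the question (an assumption, chosen to match the same '1h' and '1 h' inputs), and the variable names are illustrative.

```python
import re

# One rule only: digits, an optional space, then "h".
# Simplified stand-in for the question's pattern (assumption).
scanner = re.Scanner([
    (r"[0-9]+ ?h", lambda scanner, token: ("HOURS", token)),
])

# The hours value is at the start, so it is consumed; scanning then
# stops at the first character that no rule matches.
first = scanner.scan("1 h words words words")

# Nothing matches at position 0, so no tokens are produced and the
# whole string comes back as the remainder.
second = scanner.scan("bla 2 h words words")

print(first)
print(second)
```

In both cases the unmatched text comes back as the second element of the returned tuple, which is why the hours in the second line are never reached.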

If you want to skip over anything that does not match, just put

(r'.', lambda scanner, token: None),

at the end of the list of patterns/functions.


The Scanner class has been in the standard library for quite a few years now (at least as far back as 2003); it's just not documented (yet?).

I don't think you have to worry about it disappearing any time soon. And even if it were to disappear, the definition of the Scanner class is quite short and is right here.


import re
line = '''\
1 h words words words
bla 2 h words words
'''

scanner = re.Scanner([
    (r"([0-9]+(?:\ h|h))",    lambda scanner, token: ("HOURS", token)),
    (r'.', lambda scanner, token: None),
    ], flags=re.DOTALL)

results, remainder = scanner.scan(line)
print(results)

yields

[('HOURS', '1 h'), ('HOURS', '2 h')]
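
If relying on an undocumented class is still a concern, the same tokens can probably be collected with the documented `re.finditer`, which searches anywhere in the string rather than requiring every character to match a rule. This is only a sketch of an alternative, not something from the question; the name `HOURS` and the simplified pattern `[0-9]+ ?h` are assumptions.

```python
import re

line = '''\
1 h words words words
bla 2 h words words
'''

# Simplified stand-in for the question's pattern (assumption):
# digits, an optional space, then "h".
HOURS = re.compile(r"[0-9]+ ?h")

# finditer yields every non-overlapping match, wherever it occurs,
# so no catch-all rule is needed to skip the other text.
results = [("HOURS", m.group()) for m in HOURS.finditer(line)]
print(results)
```

With a single token type this is arguably simpler; Scanner becomes more attractive once several token patterns each need their own handler.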
answered 2013-10-09T19:32:07.187