
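I want to parse a file into a list of tokens. Each token consists of at least one line but may span several. Each token matches a regular expression. I want to signal an error if the input is not a pure sequence of tokens (i.e. there must be no garbage before, between, or after them). I am not concerned about memory consumption, since the input files are relatively small.

In Perl, I would use something like this (pseudocode):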

$s = slurp_file ();
while ($s ne '') {
  if ($s =~ s/^\nsection (\d)\n\n//p) {
    push (@r, ['SECTION ' . $1, ${^MATCH}]);
  } elsif ($s =~ s/^some line\n//p) {
    push (@r, ['SOME LINE', ${^MATCH}]);
  [...]
  } else {
    die ("Found garbage: " . Dumper ($s));
  }
}

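I could of course port this 1:1 to Python, but is there a more Pythonic way to do it? (I do not want to parse line by line and then build a hand-crafted state machine on top of that.)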


1 Answer


There is an undocumented tool in the re module, re.Scanner, which may help here. You could use it like this:

import re
import sys

def section(scanner, token):
    return "SECTION", scanner.match.group(1)

def some_line(scanner, token):
    return "SOME LINE", token

def garbage(scanner, token):
    sys.exit('Found garbage: {}'.format(token))

# scanner will attempt to match these patterns in the order listed.
# If there is a match, the second argument is called.
scanner = re.Scanner([
    (r"section (\d+)$", section),
    (r"some line$", some_line),
    (r"\s+", None),  # skip whitespace
    (r".+", garbage),  # if you get here it's garbage
    ], flags=re.MULTILINE)


tokens, remainder = scanner.scan('''\

section 1

some line
''')
for token in tokens:
    print(token)

yields

('SECTION', '1')
('SOME LINE', 'some line')
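
The example above unpacks `remainder` but never inspects it. As a hedged sketch (not part of the original answer): `scan` appears to stop at the first position where no pattern matches and to return the unconsumed text as the second element, so the catch-all garbage rule could be dropped and `remainder` checked instead. The helper name `scan_strict` is hypothetical, and since `re.Scanner` is undocumented, this behaviour is an assumption based on the current CPython implementation.

```python
import re

# No catch-all rule here: anything unmatched is left in `remainder`.
scanner = re.Scanner([
    (r"some line$", lambda scanner, token: ("SOME LINE", token)),
    (r"\s+", None),  # skip whitespace
], flags=re.MULTILINE)

def scan_strict(text):
    """Hypothetical helper: return the tokens, or raise if any input is left over."""
    tokens, remainder = scanner.scan(text)
    if remainder:
        raise ValueError("Found garbage: {!r}".format(remainder))
    return tokens

if __name__ == "__main__":
    for token in scan_strict("some line\n\nsome line\n"):
        print(token)
```

Raising an exception rather than calling `sys.exit` leaves it to the caller to decide how to report the error.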
answered 2013-06-20T13:03:51.333
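
If relying on an undocumented class is a concern, a similar result can be sketched with documented `re` features only, along the lines of the "Writing a Tokenizer" recipe in the `re` documentation. This is a minimal sketch, not the answerer's method: the names `TOKEN_SPEC`, `MASTER` and `tokenize` are hypothetical, and the two patterns are taken from the Perl pseudocode in the question. Each token pattern is wrapped in a named group, the groups are joined into one alternation, and the compiled pattern is matched repeatedly at the current offset. Any position where nothing matches is reported as garbage.

```python
import re

# Hypothetical token table: (token name, pattern consuming whole lines).
TOKEN_SPEC = [
    ("SECTION", r"\nsection \d+\n\n"),
    ("SOME_LINE", r"some line\n"),
]

# One alternation with a named group per token kind.
MASTER = re.compile(
    "|".join("(?P<{}>{})".format(name, pattern) for name, pattern in TOKEN_SPEC)
)

def tokenize(text):
    """Return a list of (kind, matched_text); raise ValueError on garbage."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)  # anchored at pos, so nothing is skipped
        if m is None or m.end() == pos:
            raise ValueError(
                "Found garbage at offset {}: {!r}".format(pos, text[pos:pos + 40])
            )
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

if __name__ == "__main__":
    for token in tokenize("\nsection 1\n\nsome line\n"):
        print(token)
```

Because the match is always attempted exactly at `pos`, leading, intermediate, and trailing garbage are all detected by the same check.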