python - Parsing a string pattern (Python)

Question

I have a file with the following data:

<<row>>12|xyz|abc|2.34<</row>>
<<eof>>

The file may contain several rows like this. I am trying to write a parser that reads each row in the file and returns an array of all rows. What would be the best way to do this? The code has to be written in Python. It should either skip lines that do not start with <<row>> or raise an error.

=======> UPDATE <========

I just found out that a single <<row>> can span multiple lines, so my code and the code in the answers below no longer work. Can someone please suggest an efficient solution?

The data files can contain hundreds to several thousands of rows.

score 1 · Accepted Answer

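import re

def parseFile(fileName):
  # Matches one complete row on a single line: digits | word | word | decimal
  pattern = re.compile(r'<<row>>(\d+)\|(\w+)\|(\w+)\|([\d.]+)<</row>>$')

  def parseLine(line):
    m = pattern.match(line)
    if m:
      return m.groups()

  with open(fileName) as f:
    # Keep only lines that start with <<row>> and match the full pattern
    return [ values for values in (
      parseLine(line)
        for line in f
        if line.startswith('<<row>>')) if values ]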

And? Isn't mine the same? ;-)
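
For the multi-line rows described in the question's update, a per-line regex can no longer see a whole row. One possible sketch, not taken from any of the answers, reads the whole file into a string and lets a regex compiled with re.DOTALL match across line breaks. The names ROW_RE and parse_text are hypothetical. It assumes the file fits in memory (plausible for a few thousand rows) and that line breaks inside a row are not part of the data.

```python
import re

# Non-greedy match of everything between the markers.
# re.DOTALL lets "." match newlines, so a row may span several lines.
ROW_RE = re.compile(r'<<row>>(.*?)<</row>>', re.DOTALL)

def parse_text(text):
    """Return a list of column lists, one per <<row>>...<</row>> block."""
    rows = []
    for body in ROW_RE.findall(text):
        # Assumption: line breaks inside a row are not data, so drop them.
        body = body.replace('\r', '').replace('\n', '')
        rows.append(body.split('|'))
    return rows

sample = (
    "<<row>>12|xyz|abc|2.34<</row>>\n"
    "<<row>>13|foo|\nbar|5.6<</row>>\n"
    "<<eof>>\n"
)
print(parse_text(sample))
```

For a real file, the text would come from something like f.read() inside a with open(...) block. Note that this version silently ignores anything outside the markers, so it does not cover the "raise an error" requirement.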

score 1 · Accepted Answer

A simple approach without regular expressions:

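output = []
with open('input.txt', 'r') as f:
    for line in f:
        # Strip first: lines read from a file end with '\n', so comparing
        # the raw line to '<<eof>>' would never be true.
        line = line.strip()
        if line == '<<eof>>':
            break
        elif not line.startswith('<<row>>'):
            continue
        else:
            # Drop the 7-character <<row>> prefix and 8-character <</row>> suffix
            output.append(line[7:-8].split('|'))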

This takes every line that starts with <<row>>, until it reaches a line that contains only <<eof>>.
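
The question also allows raising an error instead of skipping bad lines. A minimal sketch of that stricter variant follows, assuming single-line rows. The function name parse_rows_strict and the choice to skip blank lines are assumptions for illustration, not part of the answer.

```python
def parse_rows_strict(lines):
    """Parse single-line rows; raise ValueError on any unexpected line."""
    prefix, suffix = '<<row>>', '<</row>>'
    output = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if line == '<<eof>>':
            break  # end marker reached, ignore anything after it
        if not line:
            continue  # assumption: blank lines are harmless and skipped
        if not (line.startswith(prefix) and line.endswith(suffix)):
            raise ValueError(f'line {number}: not a valid row: {line!r}')
        # Slice off the markers, then split the remainder into columns.
        output.append(line[len(prefix):-len(suffix)].split('|'))
    return output

print(parse_rows_strict(['<<row>>12|xyz|abc|2.34<</row>>\n', '<<eof>>\n']))
```

Because it accepts any iterable of lines, an open file object can be passed in directly.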

score 0 · Accepted Answer

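A good practice is to test for the unwanted cases and skip them. Once you are sure you have a valid line, you can process it. Note that the actual processing is not inside an if statement. If rows are not split across multiple lines, you only need two tests: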

rows = list()
with open('newfile.txt') as file:
    for line in file.readlines():
        line = line.strip()
        if not line.startswith('<<row>>'):
            continue
        if not line[-8:] == '<</row>>':
            continue
        row = line[7:-8]
        rows.append(row)

If rows are split across multiple lines, you need to save the previous line in some cases:

rows = list()
prev = None
with open('newfile.txt') as file:
    for line in file.readlines():
        line = line.strip()
        if not line.startswith('<<row>>') and prev is not None:
            line = prev + line
        if not line.startswith('<<row>>'):
            continue
        if not line[-8:] == '<</row>>':
            prev = line
            continue
        row = line[7:-8]
        rows.append(row)
        prev = None

If needed, you can split the columns with: cols = row.split('|')
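
Going one step beyond the split, the columns could also be converted to typed values. The sketch below assumes every row has exactly four columns shaped like the sample in the question (integer, text, text, decimal). The helper name to_record is hypothetical.

```python
def to_record(row):
    """Split 'int|text|text|decimal' into a typed 4-tuple."""
    cols = row.split('|')
    if len(cols) != 4:
        raise ValueError(f'expected 4 columns, got {len(cols)}: {row!r}')
    # int() and float() raise ValueError themselves on malformed numbers.
    return int(cols[0]), cols[1], cols[2], float(cols[3])

print(to_record('12|xyz|abc|2.34'))  # (12, 'xyz', 'abc', 2.34)
```

Applied to the rows list built above, this would be [to_record(row) for row in rows].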
