I just discovered that the structure of my file can vary, and my regex only works some of the time because of this variation. My regex is:
v6 = re.findall(r'(?s)----------\s*LOW VOLTAGE SUMMARY BY AREA.*?\rACTIVITY.+?',wholefile)

It currently matches the following portion of the file:

----------           LOW VOLTAGE SUMMARY BY AREA            ----------

         BUS   NAME   BASKV    VOLT    TIME       AREA     ZONE

       12006  [AMISTAD 69.0]   0.971   1.8700  10 NEW MEXICO    121
       11223  [WHITESA213.8]   0.918   1.9900  11 EL PASO       110
       70044  [B.HYDROB4.16]   0.955   2.3233  70 PSCOLORADO    703
       70044  [B.HYDROB4.16]   0.955   2.3233  70 PSCOLORADO    703
       79086  [PAGOSA   115]   0.937   2.0333  73 WAPA R.M.     791

ACTIVITY? 
PDEV

ENTER OUTPUT DEVICE CODE:
 0 FOR NO OUTPUT
 1 FOR PROGRESS WINDOW

However, that portion of the file sometimes looks like this:

    ----------           LOW VOLTAGE SUMMARY BY AREA            ----------

         BUS   NAME   BASKV    VOLT    TIME       AREA     ZONE

       12006  [AMISTAD 69.0]   0.742  13.2060  10 NEW MEXICO    121
       11223  [WHITESA213.8]   0.916   1.8367  11 EL PASO       110
       70187  [FTGARLND69.0]   0.936  19.6099  70 PSCOLORADO    710
       73216  [WINDRIVR 115]   0.858   3.6100  73 WAPA R.M.     750

(VFSCAN) AT TIME = 20.0000 UP TO  100 BUSES WITH LOW FREQUENCY BELOW 59.600:

X ----- BUS ------ X    FREQ       X ----- BUS ------ X    FREQ
12063 [ROSEBUD 13.8]   59.506     

In both cases, I want to capture only the following portion:

----------           LOW VOLTAGE SUMMARY BY AREA            ----------

     BUS   NAME   BASKV    VOLT    TIME       AREA     ZONE

   12006  [AMISTAD 69.0]   0.971   1.8700  10 NEW MEXICO    121
   11223  [WHITESA213.8]   0.918   1.9900  11 EL PASO       110
   70044  [B.HYDROB4.16]   0.955   2.3233  70 PSCOLORADO    703
   70044  [B.HYDROB4.16]   0.955   2.3233  70 PSCOLORADO    703
   79086  [PAGOSA   115]   0.937   2.0333  73 WAPA R.M.     791

How can I make my regex return the section above regardless of which version of the file I am looking at?

2 Answers

This should work:

v6 = re.findall(r'(?s)----------\s*LOW VOLTAGE SUMMARY BY AREA.*?\r(?:ACTIVITY|\(VFSCAN\)).+?',wholefile)
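
One thing to watch for: when a pattern contains a capturing group, re.findall returns only the group's text, so the alternation needs to be non-capturing (or a lookahead) to get the whole section back. Below is a minimal sketch of the lookahead variant, which also leaves the terminating ACTIVITY or (VFSCAN) line out of the match. The names SUMMARY_RE and find_summaries are hypothetical, and the pattern assumes the terminating keyword starts a line (optionally indented) after a \r or \n; it makes no other assumption about the file's line endings.

```python
import re

# Hypothetical helper names; the pattern is a sketch, not a tested parser
# for every variant of this output file.
SUMMARY_RE = re.compile(
    r"-{10}\s*LOW VOLTAGE SUMMARY BY AREA"  # section header
    r".*?"                                  # lazily take everything after it
    # Stop (without consuming) before trailing whitespace, a line break,
    # and a line that starts with ACTIVITY or (VFSCAN).
    r"(?=\s*[\r\n][ \t]*(?:ACTIVITY|\(VFSCAN\)))",
    re.DOTALL,
)


def find_summaries(wholefile):
    """Return every LOW VOLTAGE SUMMARY section found in the text."""
    # No capturing groups in the pattern, so findall yields whole matches.
    return SUMMARY_RE.findall(wholefile)
```

Calling find_summaries(wholefile) then gives a list with one string per summary section, each ending at the last table row.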
Answered 2012-10-01T14:08:20.063

I would not recommend a regular expression here; do some parsing instead. Assuming your data is in a string called data:

lines = data.split("\n")

# find start of header
for index, line in enumerate(lines):
    if "LOW VOLTAGE SUMMARY BY AREA" in line:
        start_index = index
        break

# find first data entry (line starting with whitespace and then a number)
for index, line in enumerate(lines[start_index:]):
    if line.strip() and line.split()[0].isdigit():
        first_entry_index = start_index + index
        break

# find last data entry (line starting with whitespace and then a number)
end_entry_index = first_entry_index
for index, line in enumerate(lines[first_entry_index:]):
    if not line.strip():
        continue  # blank lines do not end the table
    if not line.split()[0].isdigit():
        break  # first non-data line (e.g. ACTIVITY? or (VFSCAN))
    # remember the most recent data row; this also covers data that
    # ends with only entries and whitespace
    end_entry_index = first_entry_index + index

# print all lines from the header through the last data entry
print("\n".join(lines[start_index:end_entry_index + 1]))
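
For reuse, the same line-scanning idea can be wrapped in a function. This is only a sketch under a few assumptions: the function name extract_low_voltage_summary is made up for illustration, the table is assumed to contain at least one data row, and a data row is taken to be any line whose first field is all digits. It uses str.splitlines so that \r, \n, and \r\n line endings are all handled.

```python
def extract_low_voltage_summary(data):
    """Return the summary header through its last data row, or None.

    Hypothetical helper: assumes at least one data row follows the header.
    """
    lines = data.splitlines()

    # Locate the section header.
    start = next(
        (i for i, line in enumerate(lines)
         if "LOW VOLTAGE SUMMARY BY AREA" in line),
        None,
    )
    if start is None:
        return None

    last = start
    seen_row = False
    for i in range(start + 1, len(lines)):
        fields = lines[i].split()
        if not fields:
            continue  # blank lines never end the section
        if fields[0].isdigit():
            seen_row = True
            last = i  # most recent data row
        elif seen_row:
            break  # first non-data line after the rows
        else:
            last = i  # column-heading line before the first row

    return "\n".join(lines[start:last + 1])
```

Because the scan stops at the first non-numeric line after the rows, it does not need to know whether the section is followed by ACTIVITY? or by (VFSCAN).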
Answered 2012-10-01T14:12:05.103