I'm writing a Python package that parses data from a live NOAAPORT feed and stores it in an SQL database. I'm having a hard time understanding regular expressions.

Specifically, I want to match the LAT...LON lines:

LAT...LON 3153 10127 3153 10118 3152 10118 3142 10122
      3141 10127 3152 10127

The best I've come up with is:

r'^LAT...LON(.*)'

But every two numbers after LAT...LON form a latitude and longitude point, and I can't seem to get the pattern to match the next line of points.

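This part is optional, but I'd also like to group certain parts of this tornado warning. I want to split apart the WMO header "TORSJT" (the first three letters are the advisory type, and the last three are the weather office that issued it: TOR = Tornado Warning, SJT = San Angelo, TX weather office).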

Then I'd like to separate out just the warning text:

BULLETIN - EAS ACTIVATION REQUESTED
TORNADO WARNING
NATIONAL WEATHER SERVICE SAN ANGELO TX
802 PM CDT SAT APR 7 2012

THE NATIONAL WEATHER SERVICE IN SAN ANGELO HAS ISSUED A

* TORNADO WARNING FOR...
  NORTHWESTERN IRION COUNTY IN WEST CENTRAL TEXAS...

* UNTIL 815 PM CDT

* AT 757 PM CDT...A SEVERE THUNDERSTORM CAPABLE OF PRODUCING A
  TORNADO WAS OVER EXTREME NORTHWESTERN IRION COUNTY...OR 24 MILES
  NORTHEAST OF BIG LAKE...MOVING SOUTH SOUTHWEST AT 15 MPH. THIS
  STORM HAS A HISTORY OF PRODUCING A TORNADO AND MAY PRODUCE A
  TORNADO AT ANY TIME.

  IN ADDITION TO DANGEROUS TORNADIC WINDS...OTHER HAZARDS INCLUDE...
  LARGE DAMAGING HAIL UP TO TENNIS BALL SIZE.
  DAMAGING STRAIGHT LINE WINDS IN EXCESS OF 60 MPH.
  POTENTIALLY DEADLY LIGHTNING.

*THE TORNADO WILL REMAIN OVER MAINLY RURAL AREAS OF...
  NORTHWESTERN IRION COUNTY.

PRECAUTIONARY/PREPAREDNESS ACTIONS...

A SEVERE THUNDERSTORM WATCH REMAINS IN EFFECT UNTIL 1000 PM CDT
SATURDAY EVENING FOR WEST CENTRAL TEXAS.

&&

Basically, I'd like to put everything into a dictionary, with each of the three parts under its own key.

Here is the full warning for reference:

368
WFUS54 KSJT 080102
TORSJT
TXC235-080115-
/O.NEW.KSJT.TO.W.0012.120408T0102Z-120408T0115Z/

BULLETIN - EAS ACTIVATION REQUESTED
TORNADO WARNING
NATIONAL WEATHER SERVICE SAN ANGELO TX
802 PM CDT SAT APR 7 2012

THE NATIONAL WEATHER SERVICE IN SAN ANGELO HAS ISSUED A

* TORNADO WARNING FOR...
  NORTHWESTERN IRION COUNTY IN WEST CENTRAL TEXAS...

* UNTIL 815 PM CDT

* AT 757 PM CDT...A SEVERE THUNDERSTORM CAPABLE OF PRODUCING A
  TORNADO WAS OVER EXTREME NORTHWESTERN IRION COUNTY...OR 24 MILES
  NORTHEAST OF BIG LAKE...MOVING SOUTH SOUTHWEST AT 15 MPH. THIS
  STORM HAS A HISTORY OF PRODUCING A TORNADO AND MAY PRODUCE A
  TORNADO AT ANY TIME.

  IN ADDITION TO DANGEROUS TORNADIC WINDS...OTHER HAZARDS INCLUDE...
  LARGE DAMAGING HAIL UP TO TENNIS BALL SIZE.
  DAMAGING STRAIGHT LINE WINDS IN EXCESS OF 60 MPH.
  POTENTIALLY DEADLY LIGHTNING.

*THE TORNADO WILL REMAIN OVER MAINLY RURAL AREAS OF...
  NORTHWESTERN IRION COUNTY.

PRECAUTIONARY/PREPAREDNESS ACTIONS...

A SEVERE THUNDERSTORM WATCH REMAINS IN EFFECT UNTIL 1000 PM CDT
SATURDAY EVENING FOR WEST CENTRAL TEXAS.

&&

LAT...LON 3153 10127 3153 10118 3152 10118 3142 10122
      3141 10127 3152 10127
TIME...MOT...LOC 0102Z 355DEG 11KT 3148 10125

$$
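
The dictionary described above could be sketched roughly as follows. This is only an illustration based on the single sample product: parse_warning and the key names are hypothetical, and it assumes line endings have already been normalized to plain newlines, that the header block ends at the first blank line, and that the warning text ends at the "&&" line.

```python
import re

def parse_warning(text):
    """Return the product ID, warning text and LAT...LON points as a dict.

    Assumptions (taken from the one sample product, not from a spec):
    newline-normalized input, header block ending at the first blank
    line, warning text ending at the "&&" separator line.
    """
    # Six capital letters alone on a line, e.g. TORSJT -> ("TOR", "SJT").
    header = re.search(r'^([A-Z]{3})([A-Z]{3})[ \t]*$', text, re.MULTILINE)

    # Everything between the first blank line and the "&&" separator.
    _, _, rest = text.partition('\n\n')
    body, _, _ = rest.partition('\n&&')

    # Whitespace-separated integers after LAT...LON, across line breaks.
    latlon = re.search(r'^LAT\.\.\.LON((?:\s+\d+)+)', text, re.MULTILINE)
    numbers = [int(n) for n in latlon.group(1).split()] if latlon else []

    return {
        'header': ({'type': header.group(1), 'office': header.group(2)}
                   if header else None),
        'text': body.strip(),
        # Pair consecutive values: (lat, lon), (lat, lon), ...
        'latlon': list(zip(numbers[0::2], numbers[1::2])),
    }
```

Called on the full warning, this should give 'TOR' and 'SJT' under 'header', the bulletin body under 'text', and six integer pairs under 'latlon'. Other product types may be laid out differently, so the blank-line and "&&" assumptions would need checking against real feed data.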

3 Answers

You're essentially trying to perform a multi-line regex match.

Instead of .*, which stops at the end of the line because . does not match a newline by default, try something like this:

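import re

# Escape the dots so they match literal periods. [0-9\s]+ takes digits and
# whitespace (including newlines) and stops at the "T" of TIME...MOT...LOC.
# re.MULTILINE is not needed here because the pattern has no ^ or $.
regex = re.compile(r'LAT\.\.\.LON([0-9\s]+)')
matches = regex.search('''LAT...LON 3153 10127 3153 10118 3152 10118 3142 10122
      3141 10127 3152 10127
TIME...MOT...LOC 0102Z 355DEG 11KT 3148 10125''')
# str.split() with no argument splits on any whitespace run and drops empties.
print(matches.group(1).split())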

In a live console session:

>>> import re
>>>
>>> regex = re.compile(r'LAT\.\.\.LON([0-9\s]+)')
>>> matches = regex.search('''LAT...LON 3153 10127 3153 10118 3152 10118 3142 10122
...       3141 10127 3152 10127
... TIME...MOT...LOC 0102Z 355DEG 11KT 3148 10125''')
>>> print(matches.group(1).split())
['3153', '10127', '3153', '10118', '3152', '10118', '3142', '10122', '3141', '10127', '3152', '10127']
>>>
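
Since the question mentions that every two numbers form a point, the flat list can then be paired up. A minimal sketch, where to_points is a hypothetical helper and two things are assumed rather than taken from the question: that the values are hundredths of a degree, and that longitudes are positive degrees west (which fits a San Angelo, TX product).

```python
def to_points(tokens):
    """Pair a flat list of numeric strings into (lat, lon) tuples in degrees."""
    # Assumption: values are hundredths of a degree, e.g. '3153' -> 31.53.
    values = [int(t) / 100.0 for t in tokens]
    # Even positions are latitudes, odd positions are longitudes.
    # Assumption: longitudes are degrees west, so they are negated.
    return [(lat, -lon) for lat, lon in zip(values[0::2], values[1::2])]
```

If the token count is odd, zip silently drops the unpaired last value, so a length check may be worth adding before storing anything in the database.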
answered 2012-04-09T04:14:32.630

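Others have covered how to construct the regex, but you should also pay attention to how you use re. If you want to grab a chunk of text, re.findall will be easier, since it scans the whole string, line breaks included, and returns a list of strings (rather than a match object) when the pattern has at most one group. Obviously, it depends on what you want to do. By contrast, re.match only looks for a match at the beginning of the string.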
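
A small sketch of that difference, using the lines from the question. The variable names are only illustrative.

```python
import re

text = '''LAT...LON 3153 10127 3153 10118 3152 10118 3142 10122
      3141 10127 3152 10127
TIME...MOT...LOC 0102Z 355DEG 11KT 3148 10125'''

# Isolate the LAT...LON block first, so the numbers on the TIME line
# are not picked up as well.
block = re.search(r'LAT\.\.\.LON([\d\s]+)', text).group(1)

# findall returns plain strings, one per match.
numbers = re.findall(r'\d+', block)

# match only tries at the very start of the string, even with re.MULTILINE,
# so a pattern that begins on the third line is not found.
at_start = re.match(r'TIME', text, re.MULTILINE)   # None
found = re.search(r'^TIME', text, re.MULTILINE)    # a match object
```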

answered 2012-04-09T04:40:44.523

Unless the DOTALL flag is specified, . matches everything except a newline.
So just add a \n and then another .*

import re

test = '''LAT...LON 3153 10127 3153 10118 3152 10118 3142 10122
      3141 10127 3152 10127
TIME...MOT...LOC 0102Z 355DEG 11KT 3148 10125'''

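# Escape the dots so they match literal periods; each .* stops at a newline,
# so the explicit \n carries the match onto the second line.
result = re.search(r'LAT\.\.\.LON.*\n.*', test)

print(result.group())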

Or:

result = re.search(r'LAT\.\.\.LON[\d\s]+\n[\d\s]+', test)
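
For comparison, here is a sketch of what the DOTALL flag changes. It assumes the block is always followed by a line starting with TIME, as in the sample; that is not guaranteed for every product.

```python
import re

test = '''LAT...LON 3153 10127 3153 10118 3152 10118 3142 10122
      3141 10127 3152 10127
TIME...MOT...LOC 0102Z 355DEG 11KT 3148 10125'''

# Without DOTALL, .* stops at the first newline: only the first row.
first_line = re.search(r'LAT\.\.\.LON(.*)', test).group(1)

# With DOTALL, . also matches newlines; the lazy .*? stops just before
# the TIME line (assumed to follow the block).
both_lines = re.search(r'LAT\.\.\.LON(.*?)\nTIME', test, re.DOTALL).group(1)

print(first_line.split())
print(both_lines.split())
```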
answered 2012-04-09T04:15:47.320