I am trying to extract content from a string that looks like this:

A.content content 
  content 
B.content  C. content content
content D.content

This is my regex pattern in Python:

reg = re.compile(r'''(?xi)
     (\w\.\t*\s*)+ (?# e.g. A. or b.)
     (.+)          (?# the alphanumeric content with common symbols)
     ^(?:\1)       (?# e.g. 'not A.' or 'not b.')
     ''')

m = reg.findall(s)

Let me give an example. Suppose I have the following string:

s = '''
 a.   $1000 abcde!?
 b.  (December 31, 1993.)
 c.  8/1/2013
 d.   $690 * 10% = 69 Blah blah
'''

The following regex works and returns the contents of the capture group:

reg = re.compile(r'''(?xi)
            \w\.\t*
            ([^\n]+) (?# anything not newline char)
''')

for c in reg.findall(s): print("line:", c)
>>>line:    $1000 abcde!?
>>>line:  (December 31, 1993.)
>>>line:    8/1/2013
>>>line:   $690 * 10% = 69 Blah blah

If the content spills over onto another line, the regex no longer works:

s = '''
   a.   $1000 abcde!? B.     December 
   31, 1993 c.  8/1/2013 D.   $690 * 10% = 
   69 Blah blah
'''
reg = re.compile(r'''(?xi)
     (\w\.\t*\s*)+ (?# e.g. A. or b.)
     (.+)          (?# the alphanumeric content with common symbols)
     ^(?:\1)       (?# e.g. 'not A.' or 'not b.')
     ''')
for c in reg.findall(s): print("line:", c)  # no matches :(
>>> blank :(

I want to get the same matches whether or not newlines separate the content.

That is why I tried to use a negated match group. Any ideas on how to solve this with a regex, or with some other workaround?

Thanks.

Paul

1 Answer

I see what you want. You want to split

a.   $1000 abcde!? B.     December 
31, 1993 c.  8/1/2013 D.   $690 * 10% = 
69 Blah blah

into

  • a. $1000 abcde!?
  • B. December \n31, 1993
  • c. 8/1/2013
  • D. $690 * 10% = \n69 Blah blah

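Right? Then what you need is a negative lookahead assertion, so that each item's body stops just before the next label: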

reg = re.compile(r'''(?xs)   # flags must lead the pattern; no need for i, but for s (dot matches newlines)
     (\b\w\.\s*)         # e.g. A. or b. (word boundary to restrict to 1 letter)
     ((?:(?!\b\w\.).)+)  # everything until the next A. or b.
     ''')

Use it with findall():

>>> reg.findall(s)
[('a.   ', '$1000 abcde!? '), ('B.     ', 'December \n   31, 1993 '), 
 ('c.  ', '8/1/2013 '), ('D.   ', '$690 * 10% = \n   69 Blah blah\n')]
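To get from those tuples to the bulleted items above, the label and body can be trimmed and runs of whitespace collapsed. What follows is only a minimal sketch, assuming Python 3: `split_items` and `ITEM` are hypothetical names, the flags are passed as arguments rather than inline, and folding each newline into a single space is an assumption about the desired output rather than something the question states.

```python
import re

# Same idea as the pattern above: a one-character label, then every
# character that does not start a new label.
ITEM = re.compile(r'''
    (\b\w\.)\s*            # label, e.g. "a." or "B."
    ((?:(?!\b\w\.).)+)     # body: everything up to the next label
    ''', re.VERBOSE | re.DOTALL)


def split_items(text):
    """Return (label, body) pairs with runs of whitespace collapsed."""
    return [(label, ' '.join(body.split()))
            for label, body in ITEM.findall(text)]


s = '''
   a.   $1000 abcde!? B.     December 
   31, 1993 c.  8/1/2013 D.   $690 * 10% = 
   69 Blah blah
'''

for label, body in split_items(s):
    print(label, body)
# a. $1000 abcde!?
# B. December 31, 1993
# c. 8/1/2013
# D. $690 * 10% = 69 Blah blah
```

Because the whitespace is normalized after matching, the same pairs come back whether the items sit on one line or wrap across several.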

If you don't want the a. parts, use a non-capturing group for the label:

reg = re.compile(r'''(?xs)   # flags must lead the pattern; no need for i, but for s (dot matches newlines)
     (?:\b\w\.\s*)       # e.g. A. or b. (word boundary to restrict to 1 letter)
     ((?:(?!\b\w\.).)+)  # everything until the next A. or b.
     ''')
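
With only one capturing group, findall() returns a flat list of strings instead of tuples. The sketch below, assuming Python 3, shows that, and cross-checks it against `re.split`, which is a different technique (no lookahead) offered only as a possible workaround. The names `bodies_only`, `bodies` and `split_bodies` are hypothetical, and the two approaches are only claimed to agree for this sample input.

```python
import re

s = '''
   a.   $1000 abcde!? B.     December 
   31, 1993 c.  8/1/2013 D.   $690 * 10% = 
   69 Blah blah
'''

bodies_only = re.compile(r'''
    (?:\b\w\.\s*)          # label, matched but not captured
    ((?:(?!\b\w\.).)+)     # body: the only capturing group
    ''', re.VERBOSE | re.DOTALL)

# One capturing group, so findall yields plain strings; collapse whitespace.
bodies = [' '.join(b.split()) for b in bodies_only.findall(s)]

# Alternative without lookahead: cut the text at every label. The first
# piece is whatever precedes the first label, so it is skipped.
split_bodies = [' '.join(p.split())
                for p in re.split(r'\b\w\.\s*', s)[1:]]

print(bodies)
# ['$1000 abcde!?', 'December 31, 1993', '8/1/2013', '$690 * 10% = 69 Blah blah']
print(bodies == split_bodies)
# True
```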
answered 2013-03-04T20:58:44.093