python - Optional regex match fails in Python

Question

tickettypepat = (r'MIS Notes:.*(//p//)?.*')
retype = re.search(tickettypepat,line)
if retype:
  print retype.group(0)
  print retype.group(1)

Given the input:

MIS Notes: //p//

can anyone tell me why group(0) is

MIS Notes: //p//

and group(1) comes back as None?

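I was originally using a regular expression because, before I ran into this problem, the matching was more complex than just matching //p//. Here is the full code. I'm still new to this, so forgive my noobness. I'm sure there are better ways to do most of this, and if anyone feels like pointing them out, that would be great. But apart from the regex for //[pewPEW]// being too greedy, it seems to be functional. I appreciate the help.

It takes text and cleans up/converts a few things.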

filename = (r'.\4-12_4-26.txt')
import re
import sys
#Clean up output from the web to ensure that you have one category per line
f = open(filename)
w = open('cleantext.txt','w')

origdatepat = (r'(Ticket Date: )([0-9]+/[0-9]+/[0-9]+),( [0-9]+:[0-9]+ [PA]M)')
tickettypepat = (r'MIS Notes:.*(//[pewPEW]//)?.*')

print 'Begining Blank Line Removal'
for line in f:
    redate = re.search(origdatepat,line)
    retype = re.search(tickettypepat,line)
    if line == ' \n':
        line = ''
        print 'Removing blank Line'
#remove ',' from time and date line    
    elif redate:
        line = redate.group(1) + redate.group(2)+ redate.group(3)+'\n'
        print 'Redating... ' + line

    elif retype:
        print retype.group(0)
        print retype.group(1)
        
        if retype.group(1) == '//p//':
            line = line + 'Type: Phone\n'
            print 'Setting type for... ' + line
        elif retype.group(1) == '//e//':
            line = line + 'Type: Email\n'
            print 'Setting type for... ' + line
        elif retype.group(1) == '//w//':
            line = line + 'Type: Walk-in\n'
            print 'Setting type for... ' + line
        elif retype.group(1) == ('' or None):
            line = line + 'Type: Ticket\n'
            print 'Setting type for... ' + line

    w.write(line)

print 'Closing Files'                 
f.close()
w.close()

Here is some sample input.

Ticket No.: 20100426132 
Ticket Date: 04/26/10, 10:22 AM 
Close Date:  
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending  
Priority: Medium  
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes: some random stuff //p// followed by more stuff
Key Words:  

Ticket No.: 20100426132 
Ticket Date: 04/26/10, 10:22 AM 
Close Date:  
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending  
Priority: Medium  
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes: //p//
Key Words:  

Ticket No.: 20100426132 
Ticket Date: 04/26/10, 10:22 AM 
Close Date:  
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending  
Priority: Medium  
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes: //e// stuff....
Key Words:  


Ticket No.: 20100426132 
Ticket Date: 04/26/10, 10:22 AM 
Close Date:  
Primary User: XXX
Branch: XXX
Help Tech: XXX
Status: Pending  
Priority: Medium  
Application: xxx
Description: some issue
Resolution: some resolution
MIS Notes:
Key Words:

score 4 · Accepted Answer

MIS Notes:.*(//p//)?.* works like this, with "MIS Notes: //p//" as the target:

MIS Notes: matches "MIS Notes:", no surprises here.
.* immediately runs to the end of the string (match so far "MIS Notes: //p//")
(//p//)? is optional. Nothing happens.
.* has nothing left to match, since we are already at the end of the string. Because the star allows zero repetitions of the preceding atom, the regex engine stops and reports the entire string as the match, with the sub-group unmatched (group(1) is None in Python).
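
The walk-through above can be checked directly. This is a minimal sketch written for Python 3 (the question itself uses Python 2 print statements); the variable names are only illustrative.

```python
import re

line = "MIS Notes: //p//"
m = re.search(r"MIS Notes:.*(//p//)?.*", line)

whole = m.group(0)     # the entire matched text
optional = m.group(1)  # None: the optional group never took part in the match
span = m.span(1)       # (-1, -1) is what re reports for a non-participating group

print(repr(whole), optional, span)
```

The first .* has already consumed "//p//", and because the group is optional the engine has no reason to give any of it back.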

Now when you change the regex to MIS Notes:.*(//p//).*, the behavior changes:

MIS Notes: matches "MIS Notes:", still no surprises here.
.* immediately runs to the end of the string (match so far "MIS Notes: //p//")
(//p//) is necessary. The engine starts to backtrack character by character in order to fulfill this requirement. (Match so far "MIS Notes: ")
(//p//) can match. Sub-group one is saved and contains "//p//".
.* runs to the end of the string. Hint: If you are not interested in what it matches, it is superfluous and you can remove it.
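
A short sketch of the mandatory-group variant, assuming Python 3 and dropping the trailing .* as the hint suggests. The sample strings are taken from, or modelled on, the question's input.

```python
import re

pattern = re.compile(r"MIS Notes:.*(//p//)")

with_marker = pattern.search("MIS Notes: some random stuff //p// followed by more stuff")
without_marker = pattern.search("MIS Notes: no marker here")

marker = with_marker.group(1)  # '//p//', found by backtracking from the end of the line
print(marker, without_marker)  # a line without the marker no longer matches at all
```

The trade-off is that a "MIS Notes:" line without the marker now produces no match object, so that case has to be handled separately.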

Now when you change the regex to MIS Notes:.*?//(p)//, the behavior changes again:

MIS Notes: matches "MIS Notes:", and still no surprises here.
.*? is non-greedy and checks the following atom before it proceeds (match so far "MIS Notes: ")
//(p)// can match. Sub-group one is saved and contains "p".
Done. Note that no backtracking occurs, this saves time.
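
One visible difference between the greedy and non-greedy forms shows up when the marker occurs more than once. The sketch below (Python 3, with a made-up input line) compares where group 1 starts in each case.

```python
import re

# Made-up line containing the marker twice.
line = "MIS Notes: first //p// then //p// again"

lazy = re.search(r"MIS Notes:.*?//(p)//", line)
greedy = re.search(r"MIS Notes:.*//(p)//", line)

lazy_start = lazy.start(1)      # position of the 'p' in the first marker
greedy_start = greedy.start(1)  # position of the 'p' in the last marker

print(lazy_start, greedy_start)
```

The non-greedy pattern stops at the first marker, while the greedy one backtracks from the end of the line and settles on the last.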

Now if you know that there can be no / before the //p//, you can use: MIS Notes:[^/]*//(p)//:

MIS Notes: matches "MIS Notes:", you get the idea.
[^/]* can fast-forward to the first slash (this is faster than .*?)
//(p)// can match. Sub-group one is saved and contains "p".
Done. Note that no backtracking occurs, this saves time. This should be faster than version #3.
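
Applied to the question's //[pewPEW]// case, version #4 could look roughly like this. It is a sketch in Python 3; the helper ticket_type, the TYPE_NAMES mapping, and the choice to treat upper- and lower-case markers alike are assumptions for illustration, not part of the original code.

```python
import re

# Hypothetical names; only the regex shape comes from the answer above.
TYPE_PAT = re.compile(r"MIS Notes:[^/]*//([pewPEW])//")
TYPE_NAMES = {"p": "Phone", "e": "Email", "w": "Walk-in"}

def ticket_type(line):
    """Return a type label for a 'MIS Notes:' line, or None for any other line."""
    if not line.startswith("MIS Notes:"):
        return None
    m = TYPE_PAT.search(line)
    if m is None:
        return "Ticket"  # no marker found: fall back to the default type
    return TYPE_NAMES[m.group(1).lower()]

print(ticket_type("MIS Notes: some random stuff //p// followed by more stuff"))
print(ticket_type("MIS Notes: //e// stuff...."))
print(ticket_type("MIS Notes:"))
```

Because [^/]* cannot cross a slash, a note with a stray "/" before the marker (a date, for example) would fall through to "Ticket". That is the price of the "no / before the marker" assumption.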

score 1 · Accepted Answer

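Regex quantifiers are greedy by default, which means .* matches as much as it can, here the entire rest of the string. So there is nothing left for the optional group to match. group(0) is always the entire matched string.

Based on your comment, why are you using a regex at all? Wouldn't something like this be enough: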

if line.startswith('MIS Notes:'): # starts with that string
    data = line[len('MIS Notes:'):] # the rest is the interesting part
    if '//p//' in data:
        stuff, sep, rest = data.partition('//p//') # or something like that
    else:
        pass # other stuff
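
The same idea as a self-contained Python 3 function. This is only a sketch: split_notes and its (found, before, after) return shape are hypothetical, chosen to make the behaviour of str.partition visible.

```python
def split_notes(line, marker="//p//"):
    """Split a 'MIS Notes:' line around marker; return None for other lines."""
    prefix = "MIS Notes:"
    if not line.startswith(prefix):
        return None
    data = line[len(prefix):]
    # partition returns (data, '', '') when the marker is absent,
    # so an empty separator means "not found".
    before, sep, after = data.partition(marker)
    return (bool(sep), before, after)

print(split_notes("MIS Notes: some random stuff //p// followed by more stuff"))
print(split_notes("MIS Notes:"))
```

Since partition already reports whether the separator was found, the separate `in` test is not strictly needed.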

score 0 · Accepted Answer

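For your purposes, the pattern is ambiguous. It is better to group by prefix or by suffix. In the example here, I chose prefix grouping. Basically, if //p// appears in the line, the prefix group is not None (it may still be an empty string, as in the first result). The suffix is everything after the //p// item, or everything in the line if it is not present.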

import re
lines = ['MIS Notes: //p//',
    'MIS Notes: prefix//p//suffix']

tickettypepat = (r'MIS Notes: (?:(.*)//p//)?(.*)')
for line in lines:
    m = re.search(tickettypepat,line)
    print 'line:', line
    if m: print 'groups:', m.groups()
    else: print 'groups:', m

Result:

line: MIS Notes: //p//
groups: ('', '')
line: MIS Notes: prefix//p//suffix
groups: ('prefix', 'suffix')
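
The two sample lines both contain the marker. To see how the same pattern reports a line without it, here is a small Python 3 sketch; the "plain text" line is made up for illustration.

```python
import re

pat = re.compile(r"MIS Notes: (?:(.*)//p//)?(.*)")

no_marker = pat.search("MIS Notes: plain text").groups()
bare_marker = pat.search("MIS Notes: //p//").groups()

print(no_marker)    # the prefix group is None when //p// is absent
print(bare_marker)  # the prefix group is '' (not None) when //p// is present
```

Testing the first group against None, rather than for truthiness, is what separates "marker present with an empty prefix" from "marker absent".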
