
I need to implement a Python regular expression to search for all occurrences of A1a, A_1_a, A-1-a, _A_1_a_ or _A1a, where:

  • A can be A to Z.
  • 1 can be 1 to 9.
  • a can be a to z.

There are only three characters (letter, number, letter), separated by underscores, dashes or nothing. The case in the search string needs to be matched exactly.

The main problem I am having is that sometimes these three-character combinations are connected to other text by dashes and underscores. I would also like one regular expression that finds A1a, A-1-a and A_1_a alike.

Also, I forgot to mention that this is an XML file.

Thanks, this found every occurrence of what I was looking for with a slight modification, [-]?[A][-]?[1][-]?[a][-]?, but I need these to be variables, something like:

    [-]?[var_A][-]?[var_3][-]?[Var_a][-]? 

Would that be done like this:

    regex = r"[-]?[%s][-]?[%s][-]?[%s][-]?" 
    print re.findall(regex,var_A,var_Num,Var_a)

Or more like:

    regex = ''.join(['r','\"','[-]?[',Var_X,'][-]?[',Var_Num,'][-]?[',Var_x,'][-]?','\"'])
    print regex 
    for sstr in searchstrs: 
            matches = re.findall(regex, sstr, re.I)

But this isn't working.

Sample lines of the file before running the script:

<t:ION t:SA="BoolObj" t:H="2098947" t:P="2098944" t:N="AN7 Result" t:CI="Boolean_Register" t:L="A_3_a  Fdr2" t:VS="true">
<t:ION t:SA="RegisterObj" t:H="20971785" t:P="20971776" t:N="ART1 Result 1" t:CI="NumericVariable_Register" t:L="A3a1 Status" t:VS="1">
<t:ION t:SA="ModuleObj" t:H="2100736" t:P="2097152" t:N="AND/OR 14" t:CI="AndOr_Module" t:L="A_3_a_2 Energized from Norm" t:S="0" t:SC="5">

After running the script, what I am getting (it deletes the entire line and leaves only what is below):

  • B_1_c
  • B1c1
  • B_1_c_2

What I want to get:

<t:ION t:SA="BoolObj" t:H="2098947" t:P="2098944" t:N="AN7 Result" t:CI="Boolean_Register" t:L="B_1_c  Fdr2" t:VS="true">
<t:ION t:SA="RegisterObj" t:H="20971785" t:P="20971776" t:N="ART1 Result 1" t:CI="NumericVariable_Register" t:L="B1c1 Status" t:VS="1">
<t:ION t:SA="ModuleObj" t:H="2100736" t:P="2097152" t:N="AND/OR 14" t:CI="AndOr_Module" t:L="B_1_c_2 Energized from Norm" t:S="0" t:SC="5">

import re
import os

search_file_name = 'Alarms Test.fwn'
pattern = 'A3a'
fileName, fileExtension = os.path.splitext(search_file_name)
newfilename = fileName + '_' + pattern + fileExtension
outfile = open(newfilename, 'wb')


def find_ext(text):
    matches = re.findall(r'([_-]?[A{1}][_-]?[3{1}][_-]?[a{1}][_-]?)', text)
    records = [m.replace('3', '1').replace('A', 'B').replace('a', 'c') for m in matches]
    if matches:
        outfile.writelines(records)
        return 1
    else:
        outfile.writelines(text)
        return 0


def main():
    success = 0
    count = 0
    with open(search_file_name, 'rb') as searchfile:
        try:
            searchstrs = searchfile.readlines()
            for s in searchstrs:
                success = find_ext(s)
                count = count + success
        finally:
            searchfile.close()

    print count

if __name__ == "__main__":
    main()
4 Answers

You want to use the following to find your matches.

matches = re.findall(r'([_-]?[a-z][_-]?[1-9][_-]?[a-z][_-]?)', s, re.I)

regex101 demo

If you want the matches with all - and _ characters removed, you can do this:

import re

s = '''
A1a _A_1 A_ A_1_a A-1-a _A_1_a_ _A1a _A-1-A_ a1_a  A-_-5-a
_A-_-5-A a1_-1 XMDC_A1a or XMDC-A1a or XMDC_A1-a XMDC_A_1_a_ _A-1-A_
'''

def find_this(text):
    matches = re.findall(r'([_-]?[a-z][_-]?[1-9][_-]?[a-z][_-]?)', text, re.I)
    records = [m.replace('-', '').replace('_', '') for m in matches]
    print(records)

find_this(s)

Output

['A1a', 'A1a', 'A1a', 'A1a', 'A1a', 'A1A', 'a1a', 'A1a', 'A1a', 'A1a', 'A1a', 'A1A']

See working demo
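The question also asks how to build the pattern from variables and how to rename a code without losing the rest of the line. A minimal sketch follows, written for Python 3 and assuming the separators inside a code should be kept as they are. The helper names build_pattern and rename are hypothetical, not from the original post. It uses re.sub with a replacement function, which returns the whole line with only the matched code changed, and it omits re.I because the question asks for exact case.

```python
import re

def build_pattern(letter_upper, digit, letter_lower):
    """Compile a case-sensitive pattern for e.g. A3a, A_3_a or A-3-a."""
    sep = r"([_-]?)"  # capture each optional separator so it can be reused
    return re.compile(
        re.escape(letter_upper) + sep + re.escape(digit) + sep + re.escape(letter_lower)
    )

def rename(line, old, new):
    """Replace old=(A, 3, a) with new=(B, 1, c), keeping the separators."""
    pattern = build_pattern(*old)

    def swap(match):
        sep1, sep2 = match.groups()
        return new[0] + sep1 + new[1] + sep2 + new[2]

    # sub() returns the full line; text outside the match is left untouched.
    return pattern.sub(swap, line)

line = 't:CI="Boolean_Register" t:L="A_3_a  Fdr2" t:VS="true">'
print(rename(line, ("A", "3", "a"), ("B", "1", "c")))
```

Writing the return value of rename for every line, whether or not it changed, keeps the rest of the file intact.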

answered 2013-10-18T19:08:24.603

To quickly pull out the A1a codes without punctuation, and without having to rebuild the string from captured parts...

t = '''A1a _B_2_z_ 
A_1_a 
A-1-a 
_A_1_a_ 
_C1c '''

re.findall("[A-Z][0-9][a-z]",t.replace("-","").replace("_",""))

Output:

['A1a', 'B2z', 'A1a', 'A1a', 'A1a', 'C1c']

(But if you don't want to capture from FILE.TXT-2b, you will have to be careful with most of these solutions...)
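One possible way to handle that caveat is a negative lookbehind. The sketch below is written for Python 3 and assumes that a code only counts when it is not glued directly to a preceding letter or digit. That rule is an assumption about the data, not something stated in the question. The names CODE and find_codes are hypothetical.

```python
import re

# Assumption: a code must not directly follow a letter or digit, so the
# "T-2b" inside "FILE.TXT-2b" is skipped while "XMDC_A-1-a" still matches.
CODE = re.compile(r"(?<![A-Za-z0-9])[A-Z][_-]?[1-9][_-]?[a-z]")

def find_codes(text):
    """Return matches with separators stripped, e.g. 'A_1_a' -> 'A1a'."""
    return [m.replace("-", "").replace("_", "") for m in CODE.findall(text)]

print(find_codes("A1a _B_2_z_ XMDC_A-1-a FILE.TXT-2b A3a1"))
# → ['A1a', 'B2z', 'A1a', 'A3a']
```

Because the separators are stripped only after matching, text such as FILE.TXT-2b is never merged into something that looks like a code.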

answered 2013-10-18T19:23:24.580

If the characters can be separated by multiple underscores or dashes (e.g. A__1a):

[_-]*[A-Z][_-]*[1-9][_-]*[a-z]

If there can be only one or zero underscores or dashes:

[_-]?[A-Z][_-]?[1-9][_-]?[a-z]
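The difference between the two patterns can be checked with a short sketch. It is written for Python 3 (re.fullmatch requires 3.4 or later), and the sample strings are made up for illustration.

```python
import re

single = re.compile(r"[_-]?[A-Z][_-]?[1-9][_-]?[a-z]")  # at most one separator
multi = re.compile(r"[_-]*[A-Z][_-]*[1-9][_-]*[a-z]")   # any number of separators

samples = ["A1a", "A_1_a", "_A-1-a", "A__1a", "A--1--a"]
for text in samples:
    # fullmatch() only succeeds if the whole string fits the pattern.
    print(text, bool(single.fullmatch(text)), bool(multi.fullmatch(text)))
```

The first three samples fit both patterns; A__1a and A--1--a fit only the second.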
answered 2013-10-18T18:48:41.727
regex = r"[A-Z][-_]?[1-9][-_]?[a-z]"
print(re.findall(regex, some_string_variable))

That should work.

To capture only the parts you are interested in, wrap them in parentheses:

regex = r"([A-Z])[-_]?([1-9])[-_]?([a-z])"
print(re.findall(regex, some_string_variable))

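If the underscores or dashes (or the lack of them) must match each other, and a mismatch would otherwise return bad results, you would need a state machine, and regex is stateless.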
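For this particular constraint, Python's re module supports backreferences, which may be enough without a separate state machine. A minimal sketch for Python 3 follows, assuming "must match" means both separators are the same character or both are absent. The names CONSISTENT and consistent_codes are hypothetical.

```python
import re

# Group 2 captures the first separator (possibly empty); \2 then requires
# the second separator to be exactly the same text.
CONSISTENT = re.compile(r"([A-Z])([-_]?)([1-9])\2([a-z])")

def consistent_codes(text):
    """Return only codes whose two separators agree."""
    return [m.group(0) for m in CONSISTENT.finditer(text)]

print(consistent_codes("A1a A_1_a A-1-a A_1-a A1_a"))
# → ['A1a', 'A_1_a', 'A-1-a']
```

Mixed forms such as A_1-a and A1_a are rejected because the second separator differs from the captured first one.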

answered 2013-10-18T18:48:59.890