1

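I'm looking for a simple way to open a file and search each line to see whether it has unclosed parentheses or quotes. If a line has an unclosed parenthesis or quote, I want to print that line to a file. I know I could do this with an ugly block of if/for statements, but I suspect Python has a better way, perhaps with the re module (which I know nothing about) or something else. I just don't know the language well enough to do it.

Thanks!

Edit: some sample lines. They may be easier to read if you copy them into Notepad or similar and turn off word wrap (some of the lines are quite long). Also, the file has over 100k lines, so something efficient would be great!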

SL  ID=0X14429A0B TY=STANDARD OWN=0X429A03 EXT=22 SLTK=0X1C429A0B MP=0X684003F0 SUB=0X24400007
RT  ID=0X18429A19 TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0C CMDS=(N:0X8429A04,C:0X14429A0B) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
RT  ID=0X18429A1A TY=CALONSC OWN=0X14429A0B EXLP=0X14429A08 CMDS=(R:0X8429A04,N:0X8429A05,C:0X14429A0B) SGCC=2 REL=2 DESC="AURANT YD TO TRK.1" ATIS=T
RT  ID=0X18429A1B TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0A CMDS=(R:0X8429A04,R:0X8429A05,C:0X14429A0B) SGCC=2 REL=3 DESC="AURANT YD TO TRK.2" ATIS=T
SL  ID=0X14429A0C TY=STANDARD OWN=0X429A03 EXT=24 SLTK=0X1C429A0B MP=0X684003F1 SUB=0X24400007
RT  ID=0X18429A1C TY=CALONSC OWN=0X14429A0C EXLP=0X14429A0B CMDS=(N:0X8429A04,C:0X14429A0C) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
TK  ID=0X1C429A08 TY=BLKTK OWN=0X429A03 EXT=12 LRMP=0X6C40BDAF LEN=5837 FSPD=60 PSPD=65 QUAL=TRK.1 MAXGE=0 MAXGW=0 JAL=4 ALT=12 SUB=0X24400007 RULES=(CTC:B:UP:0X24400007:485.7305:486.8359:T) LLON=-118.1766772 RLON=-118.1620059 LLAT=34.06838375 RLAT=34.07811764 LELE=416.6983 RELE=425.0596 ULAD=NO URAD=NO
PT  ID=0X20429A0F TY=STANDARD OWN=0X1C429A08 LTK=0X1C40006C RTK=0X1C429A0C REL=1 LEN=1 LQUAL="TRK.1" RQUAL="TRK.1"
PTK OWN=0X1C429A08 PID=0X1C429A13
4

8 Answers

6

If you don't expect any backwards mismatched parentheses (i.e. ")("), you can do this:

with open("myFile.txt","r") as readfile, open("outFile.txt","w") as outfile:
    for line in readfile:
        if line.count("(") != line.count(")") or line.count('"') % 2 != 0:
            outfile.write(line)

Otherwise, you'll have to count them one at a time to check for mismatches, like this:

with open("myFile.txt","r") as readfile, open("outFile.txt","w") as outfile:
    for line in readfile:
        count = 0
        for char in line:
            if char == ")":
                count -= 1
            elif char == "(":
                count += 1
            if count < 0:
                break
        if count != 0 or line.count('"') % 2 != 0:
            outfile.write(line)
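
To make the difference between the two checks concrete, here is a minimal sketch that wraps each one in a function and runs both on a few made-up strings. The names `count_check` and `depth_check` and the sample strings are assumptions for illustration, not part of the original answer.

```python
def count_check(line):
    # Compares totals only, so the order of the parentheses is ignored.
    return line.count("(") != line.count(")") or line.count('"') % 2 != 0


def depth_check(line):
    # Tracks nesting depth, so a ")" that arrives before its "(" is caught.
    depth = 0
    for char in line:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if depth < 0:
            break
    return depth != 0 or line.count('"') % 2 != 0


samples = ["CMDS=(N:1,C:2)", "CMDS=)N:1,C:2(", "CMDS=(N:1,C:2"]
for sample in samples:
    print(sample, count_check(sample), depth_check(sample))
```

Only the second sample separates the two: the totals match, so the counting check lets it through, while the depth check flags it.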

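I can't think of a better way to handle it. Python's built-in re module doesn't support recursive regular expressions, so a regex-only solution is out of the question.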
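
A non-recursive pattern can still be applied in a loop, though. The sketch below is a different technique from the character loop, not a recursive regex: it repeatedly deletes innermost `(...)` pairs until nothing changes, then looks for leftover parentheses. The names `INNERMOST_PAIR` and `unbalanced_by_stripping` are hypothetical, and quotes are only checked by parity, so parentheses inside quoted text are not treated specially.

```python
import re

# Matches a "(" and ")" pair with no other parentheses between them.
INNERMOST_PAIR = re.compile(r"\([^()]*\)")


def unbalanced_by_stripping(line):
    previous = None
    # Keep removing innermost pairs until a pass changes nothing.
    while previous != line:
        previous = line
        line = INNERMOST_PAIR.sub("", line)
    # Any parenthesis that survives had no partner.
    return "(" in line or ")" in line or line.count('"') % 2 != 0


print(unbalanced_by_stripping("OWN=0X1C(42(9A))08"))
print(unbalanced_by_stripping("PID=0X1C42(9A13"))
print(unbalanced_by_stripping(")("))
```

Each pass is a full scan of the line, so deeply nested input costs more than the single-pass counter; for lines like the samples in the question the nesting is shallow.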

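One more thing: given your data, it may be better to put the check into a function and split each line into its values, which is easy to do with a regex, like so: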

import re

# Capture the value of each KEY=VALUE pair, up to the next KEY= or the end of the line.
splitre = re.compile(r".*?=(.*?)(?:(?=\s*?\S*?=)|(?=\s*$))")

def matchParens(text):
    # True when text has unbalanced parentheses or an odd number of quotes.
    count = 0
    for char in text:
        if char == ")":
            count -= 1
        elif char == "(":
            count += 1
        if count < 0:
            break
    return count != 0 or text.count('"') % 2 != 0

with open("myFile.txt","r") as readfile, open("outFile.txt","w") as outfile:
    for line in readfile:
        if any(matchParens(text) for text in splitre.findall(line)):
            outfile.write(line)

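The reason this may be better is that it checks each value individually. If one value has an open parenthesis and a later value has a close parenthesis, the line is still reported as unbalanced instead of the two cancelling out.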
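
As a small illustration of that point, the sketch below runs the same kind of check on a whole line and then on each value separately. The helper name `is_unbalanced` and the sample line `A=(1 B=2)` are made up for this example; the regex is the one from the answer.

```python
import re

splitre = re.compile(r".*?=(.*?)(?:(?=\s*?\S*?=)|(?=\s*$))")


def is_unbalanced(text):
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if depth < 0:
            break
    return depth != 0 or text.count('"') % 2 != 0


line = "A=(1 B=2)"
values = splitre.findall(line)                      # the value of each KEY=VALUE pair
whole_line = is_unbalanced(line)                    # the "(" and ")" cancel out
per_value = [is_unbalanced(value) for value in values]

print(values, whole_line, per_value)
```

Checked as a whole, the line looks balanced; checked value by value, both `(1` and `2)` are flagged.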

answered 2012-08-09T19:43:05.670
5

Using a parser package may seem like overkill, but it's pretty quick:

text = """\
SL  ID=0X14429A0B TY=STANDARD OWN=0X429A03 EXT=22 SLTK=0X1C429A0B MP=0X684003F0 SUB=0X24400007
RT  ID=0X18429A19 TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0C CMDS=(N:0X8429A04,C:0X14429A0B) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
RT  ID=0X18429A1A TY=CALONSC OWN=0X14429A0B EXLP=0X14429A08 CMDS=(R:0X8429A04,N:0X8429A05,C:0X14429A0B) SGCC=2 REL=2 DESC="AURANT YD TO TRK.1" ATIS=T
RT  ID=0X18429A1B TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0A CMDS=(R:0X8429A04,R:0X8429A05,C:0X14429A0B) SGCC=2 REL=3 DESC="AURANT YD TO TRK.2" ATIS=T
SL  ID=0X14429A0C TY=STANDARD OWN=0X429A03 EXT=24 SLTK=0X1C429A0B MP=0X684003F1 SUB=0X24400007
RT  ID=0X18429A1C TY=CALONSC OWN=0X14429A0C EXLP=0X14429A0B CMDS=(N:0X8429A04,C:0X14429A0C) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
TK  ID=0X1C429A08 TY=BLKTK OWN=0X429A03 EXT=12 LRMP=0X6C40BDAF LEN=5837 FSPD=60 PSPD=65 QUAL=TRK.1 MAXGE=0 MAXGW=0 JAL=4 ALT=12 SUB=0X24400007 RULES=(CTC:B:UP:0X24400007:485.7305:486.8359:T) LLON=-118.1766772 RLON=-118.1620059 LLAT=34.06838375 RLAT=34.07811764 LELE=416.6983 RELE=425.0596 ULAD=NO URAD=NO
PT  ID=0X20429A0F TY=STANDARD OWN=0X1C429A08 LTK=0X1C40006C RTK=0X1C429A0C REL=1 LEN=1 LQUAL="TRK.1" RQUAL="TRK.1"
PTK OWN=0X1C429A08 PID=0X1C429A13 GOOD
PTK OWN=0X1C429A(08 PID=0X1C429A13 BAD
PTK OWN=0X1C429A08 )PID=0X1C429A13 BAD
PTK OWN=0X1C(42(9A))08 PID=0X1C429A13 GOOD
PTK OWN=0X1C(42(9A))08 PID=0X1C42(9A13 BAD
PTK OWN=0X1C(42(9A))08 PID=0X1C42"("9A13 GOOD
"""

from pyparsing import nestedExpr, quotedString

paired_exprs = nestedExpr('(',')')  |  quotedString

for i, line in enumerate(text.splitlines(), start=1):
    # use pyparsing expression to strip out properly nested quotes/parentheses
    stripped_line = paired_exprs.suppress().transformString(line)

    # if there are any quotes or parentheses left, they were not 
    # properly nested
    if any(unwanted in stripped_line for unwanted in '()"\''):
        print(i, ':', line)

Prints:

10 : PTK OWN=0X1C429A(08 PID=0X1C429A13 BAD
11 : PTK OWN=0X1C429A08 )PID=0X1C429A13 BAD
13 : PTK OWN=0X1C(42(9A))08 PID=0X1C42(9A13 BAD
answered 2012-08-09T19:48:52.957
3
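  1. Just extract all the interesting symbols from a line.
  2. Push every opening symbol onto a stack, and pop from the stack whenever you get the matching closing symbol.
  3. If the stack ends up empty, the symbols are balanced. If the stack underflows or doesn't fully unwind, the line is unbalanced.

Below is sample code that checks a line - I've inserted a stray parenthesis in the first line.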

d = """SL  ID=0X14429A0B TY=STANDARD OWN=0X429A(03 EXT=22 SLTK=0X1C429A0B MP=0X684003F0 SUB=0X24400007
RT  ID=0X18429A19 TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0C CMDS=(N:0X8429A04,C:0X14429A0B) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
RT  ID=0X18429A1A TY=CALONSC OWN=0X14429A0B EXLP=0X14429A08 CMDS=(R:0X8429A04,N:0X8429A05,C:0X14429A0B) SGCC=2 REL=2 DESC="AURANT YD TO TRK.1" ATIS=T
RT  ID=0X18429A1B TY=CALONSC OWN=0X14429A0B EXLP=0X14429A0A CMDS=(R:0X8429A04,R:0X8429A05,C:0X14429A0B) SGCC=2 REL=3 DESC="AURANT YD TO TRK.2" ATIS=T
SL  ID=0X14429A0C TY=STANDARD OWN=0X429A03 EXT=24 SLTK=0X1C429A0B MP=0X684003F1 SUB=0X24400007
RT  ID=0X18429A1C TY=CALONSC OWN=0X14429A0C EXLP=0X14429A0B CMDS=(N:0X8429A04,C:0X14429A0C) SGCC=2 REL=1 DESC="AURANT YD-INDSTRY LD" ATIS=T
TK  ID=0X1C429A08 TY=BLKTK OWN=0X429A03 EXT=12 LRMP=0X6C40BDAF LEN=5837 FSPD=60 PSPD=65 QUAL=TRK.1 MAXGE=0 MAXGW=0 JAL=4 ALT=12 SUB=0X24400007 RULES=(CTC:B:UP:0X24400007:485.7305:486.8359:T) LLON=-118.1766772 RLON=-118.1620059 LLAT=34.06838375 RLAT=34.07811764 LELE=416.6983 RELE=425.0596 ULAD=NO URAD=NO
PT  ID=0X20429A0F TY=STANDARD OWN=0X1C429A08 LTK=0X1C40006C RTK=0X1C429A0C REL=1 LEN=1 LQUAL="TRK.1" RQUAL="TRK.1"
PTK OWN=0X1C429A08 PID=0X1C429A13"""

def unbalanced(line):
    close_symbols = {'"' : '"', '(': ")", '[': ']', "'" : "'"}
    syms = [x for x in line if x in '\'"[]()']
    stack = []
    for s in syms:
        try:
            if len(stack) > 0 and s == close_symbols[stack[-1]]:
                stack.pop()
            else:
                stack.append(s)
        except KeyError: # top of the stack is a closing symbol, so nothing can match it
            return True
    return len(stack) != 0


print(unbalanced("hello 'there' () []"))
print(unbalanced("hello 'there\"' () []"))
print(unbalanced("]["))

lines = d.splitlines()  # in your case you can do open("file.txt").readlines()

print([line for line in lines if unbalanced(line)])

For large files you don't want to read the whole file into memory, so use a snippet like this instead:

with open("file.txt") as infile:
    for line in infile:
        if unbalanced(line):
            print(line.rstrip("\n"))
answered 2012-08-09T19:42:53.157
1

Regex - if your lines don't contain nested parentheses, the solution is pretty simple:

import re

for line in myFile:
    if re.search(r"\([^\(\)]*($|\()", line):
        # this line contains unbalanced parentheses.
        print(line)
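
To see what that pattern does and does not catch, here is a short sketch that applies it to a few made-up strings. The name `UNCLOSED` and the sample strings are assumptions for illustration.

```python
import re

# "(" followed by no parentheses until either the end of the line or another "(".
UNCLOSED = re.compile(r"\([^\(\)]*($|\()")

samples = [
    "CMDS=(N:1,C:2)",   # balanced, not nested
    "CMDS=(N:1,C:2",    # unclosed "("
    "CMDS=N:1,C:2)",    # stray ")" with no "("
    "A=(1(2))",         # balanced but nested
]
results = [bool(UNCLOSED.search(sample)) for sample in samples]
print(results)
```

The pattern finds the unclosed `(`, but it does not look for a stray `)`, and it flags the nested case even though that one is balanced, which is why it only suits input without nesting.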

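If nested parentheses are a possibility, it gets a bit more complicated:

for line in myFile:
    paren_stack = []
    unbalanced = False
    for char in line:
        if char == '(':
            paren_stack.append(char)
        elif char == ')':
            if paren_stack:
                paren_stack.pop()
            else:
                # a ')' with no matching '(': this line contains unbalanced parentheses.
                unbalanced = True
                break
    if unbalanced or paren_stack:
        # a '(' left on the stack also means the line is unbalanced
        print(line)
answered 2012-08-09T19:45:07.993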
0

I would just do something like:

for line in open(file, 'r'):
    if line.count('"') % 2 != 0 or line.count('(') != line.count(')'):
        print(line)

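But I can't be sure this fits your needs exactly.

More robust:

for line in open(file, 'r'):
    paren_count = 0
    paren_count_start_quote = 0
    quote_open = False
    bad_quote = False
    for char in line:
        if char == ')':
            paren_count -= 1
        elif char == '(':
            paren_count += 1
        elif char == '"':
            quote_open = not quote_open
            if quote_open:
                # remember the nesting depth at the opening quote
                paren_count_start_quote = paren_count
            elif paren_count != paren_count_start_quote:
                # the depth changed inside the quotes
                bad_quote = True
                break
        if paren_count < 0:
            break
    # report each line once, whichever condition caught it
    if bad_quote or quote_open or paren_count != 0:
        print(line)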

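I haven't tested the robust version, but I think it should work. It now also catches things like (" ) ", where a parenthesis is closed inside a pair of quotes, and prints the line.

answered 2012-08-09T19:39:08.057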
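
One way to try the `(" ) "` case is to put the same logic in a function that returns a flag instead of printing. This is only a sketch of that idea; the name `quote_aware_unbalanced` and the sample strings are assumptions, not part of the original answer.

```python
def quote_aware_unbalanced(line):
    paren_count = 0
    paren_count_at_quote = 0
    quote_open = False
    for char in line:
        if char == ")":
            paren_count -= 1
        elif char == "(":
            paren_count += 1
        elif char == '"':
            quote_open = not quote_open
            if quote_open:
                # Remember the nesting depth at the opening quote.
                paren_count_at_quote = paren_count
            elif paren_count != paren_count_at_quote:
                # The depth changed between the quotes.
                return True
        if paren_count < 0:
            return True
    return quote_open or paren_count != 0


print(quote_aware_unbalanced('(" ) "'))           # parenthesis closed inside the quotes
print(quote_aware_unbalanced('DESC="A (B) C"'))   # balanced inside the quotes
print(quote_aware_unbalanced('CMDS=(N:1,C:2'))    # unclosed parenthesis
```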
0

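Check this code

from tokenize import generate_tokens

def syntaxCheck(line):
    def readline():
        yield line
        yield ''
    par,quo,dquo = 0,0,0
    # (parenthesis delta, single-quote toggle, double-quote toggle) per token
    count = { '(': (1,0,0),')': (-1,0,0),"'": (0,1,0),'"':(0,0,1) }
    # generate_tokens takes a readline callable that returns str; it may raise
    # on input the tokenizer rejects, such as a '(' still open at the end
    for _,token,_,_,_ in generate_tokens(readline().__next__):
        countPar,countQuo,countDQuo = count.get(token,(0,0,0))
        par  += countPar
        quo  ^= countQuo
        dquo ^= countDQuo
    return par,quo,dquo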

Note that parentheses inside closed quotes are not counted, because the quoted text comes back as a single string token.

answered 2012-08-09T19:50:36.140
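
The single-token behaviour can be seen directly with the standard `tokenize` module. This is a small sketch on a made-up line of valid Python (the `source` string is an assumption, chosen so the tokenizer accepts it); it lists the token strings and picks out the parenthesis tokens.

```python
import io
import tokenize

# A quoted string that contains an unmatched "(" inside a balanced call.
source = 'f(DESC="AURANT (YD")\n'

tokens = [tok.string for tok in tokenize.generate_tokens(io.StringIO(source).readline)]
paren_tokens = [t for t in tokens if t in ("(", ")")]

print(tokens)
print(paren_tokens)
```

The `(` inside the quotes never shows up as its own token, so only the outer pair is seen.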
-1

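Are the parentheses and quotes supposed to be closed on every line? If so, you could simply count the parentheses and quotes. If the count is even, they match. If it's odd, one is missing. Put that logic in a function, dump the lines of the text file into an array, and call map to run the function on each string in the array.

My Python is rusty, but assuming everything "should" be on the same line, that's how I'd do it.

answered 2012-08-09T19:29:54.847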
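
A rough sketch of that description, assuming the lines are already in a list, could look like the following. The name `odd_counts` and the sample lines are made up. Note that parity alone cannot tell `((` from `()`, so the third sample slips through.

```python
def odd_counts(line):
    # Flag a line when it has an odd number of parentheses or of double quotes.
    parens = line.count("(") + line.count(")")
    quotes = line.count('"')
    return parens % 2 != 0 or quotes % 2 != 0


lines = ['A=(1) B="x"', 'A=(1 B="x"', 'A=((1 B="x"']
flags = list(map(odd_counts, lines))
print(flags)
```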
-1

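Well, my solution may not be as fancy, but I say you just count the number of parentheses and quotes. If the result isn't even, you know you're missing something!

answered 2012-08-09T20:18:28.973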