34

Python中解析匹配括号中包含的文本块的最佳方法是什么?

"{ { a } { b } { { { c } } } }"

最初应该返回:

[ "{ a } { b } { { { c } } }" ]

把它作为输入应该返回:

[ "a", "b", "{ { c } }" ]

应该返回:

[ "{ c }" ]

[ "c" ]

[]
4

9 回答 9

51

或者这个 pyparsing 版本:

>>> from pyparsing import nestedExpr
>>> txt = "{ { a } { b } { { { c } } } }"
>>>
>>> nestedExpr('{','}').parseString(txt).asList()
[[['a'], ['b'], [[['c']]]]]
>>>
于 2009-10-31T01:50:04.790 回答
30

伪代码:

For each string in the array:
    Find the first '{'. If there is none, leave that string alone.
    Init a counter to 0. 
    For each character in the string:  
        If you see a '{', increment the counter.
        If you see a '}', decrement the counter.
        If the counter reaches 0, break.
    Here, if your counter is not 0, you have invalid input (unbalanced brackets)
    If it is, then take the string from the first '{' up to the '}' that put the
     counter at 0, and that is a new element in your array.
于 2009-10-30T18:35:28.690 回答
6

我对 Python 有点陌生,所以请放轻松,但这里有一个可行的实现:

def balanced_braces(args):
    parts = []
    for arg in args:
        if '{' not in arg:
            continue
        chars = []
        n = 0
        for c in arg:
            if c == '{':
                if n > 0:
                    chars.append(c)
                n += 1
            elif c == '}':
                n -= 1
                if n > 0:
                    chars.append(c)
                elif n == 0:
                    parts.append(''.join(chars).lstrip().rstrip())
                    chars = []
            elif n > 0:
                chars.append(c)
    return parts

t1 = balanced_braces(["{{ a } { b } { { { c } } } }"]);
print t1
t2 = balanced_braces(t1)
print t2
t3 = balanced_braces(t2)
print t3
t4 = balanced_braces(t3)
print t4

输出:

['{ a } { b } { { { c } } }']
['a', 'b', '{ { c } }']
['{ c }']
['c']
于 2009-10-30T19:26:54.450 回答
5

解析使用lepl(可通过安装$ easy_install lepl):

from lepl import Any, Delayed, Node, Space

expr = Delayed()
expr += '{' / (Any() | expr[1:,Space()[:]]) / '}' > Node

print expr.parse("{{a}{b}{{{c}}}}")[0]

输出:

节点
 +- '{'
 +- 节点
 | +- '{'
 | +-'一个'
 | `-'}'
 +- 节点
 | +- '{'
 | +-'b'
 | `-'}'
 +- 节点
 | +- '{'
 | +- 节点
 | | +- '{'
 | | +- 节点
 | | | +- '{'
 | | | +- 'c'
 | | | `-'}'
 | | `-'}'
 | `-'}'
 `-'}'
于 2009-10-30T23:54:44.347 回答
3

更清洁的解决方案。这将返回包含在最外层括号中的字符串。如果返回 None ,则没有匹配项。

def findBrackets( aString ):
   if '{' in aString:
      match = aString.split('{',1)[1]
      open = 1
      for index in xrange(len(match)):
         if match[index] in '{}':
            open = (open + 1) if match[index] == '{' else (open - 1)
         if not open:
            return match[:index]
于 2010-05-06T10:51:05.403 回答
2

您也可以一次全部解析它们,尽管我发现它的{a}意思"a"并不["a"]奇怪。如果我正确理解了格式:

import re
import sys


_mbrack_rb = re.compile("([^{}]*)}") # re.match doesn't have a pos parameter
def mbrack(s):
  """Parse matching brackets.

  >>> mbrack("{a}")
  'a'
  >>> mbrack("{{a}{b}}")
  ['a', 'b']
  >>> mbrack("{{a}{b}{{{c}}}}")
  ['a', 'b', [['c']]]

  >>> mbrack("a")
  Traceback (most recent call last):
  ValueError: expected left bracket
  >>> mbrack("{a}{b}")
  Traceback (most recent call last):
  ValueError: more than one root
  >>> mbrack("{a")
  Traceback (most recent call last):
  ValueError: expected value then right bracket
  >>> mbrack("{a{}}")
  Traceback (most recent call last):
  ValueError: expected value then right bracket
  >>> mbrack("{a}}")
  Traceback (most recent call last):
  ValueError: unbalanced brackets (found right bracket)
  >>> mbrack("{{a}")
  Traceback (most recent call last):
  ValueError: unbalanced brackets (not enough right brackets)
  """
  stack = [[]]
  i, end = 0, len(s)
  while i < end:
    if s[i] != "{":
      raise ValueError("expected left bracket")
    elif i != 0 and len(stack) == 1:
      raise ValueError("more than one root")
    while i < end and s[i] == "{":
      L = []
      stack[-1].append(L)
      stack.append(L)
      i += 1
    stack.pop()
    stack[-1].pop()
    m = _mbrack_rb.match(s, i)
    if m is None:
      raise ValueError("expected value then right bracket")
    stack[-1].append(m.group(1))
    i = m.end(0)
    while i < end and s[i] == "}":
      if len(stack) == 1:
        raise ValueError("unbalanced brackets (found right bracket)")
      stack.pop()
      i += 1
  if len(stack) != 1:
    raise ValueError("unbalanced brackets (not enough right brackets)")
  return stack[0][0]


def main(args):
  if args:
    print >>sys.stderr, "unexpected arguments: %r" % args
  import doctest
  r = doctest.testmod()
  print r
  return r[0]

if __name__ == "__main__":
  sys.exit(main(sys.argv[1:]))
于 2009-10-30T19:42:02.487 回答
2

如果您想使用解析器(在这种情况下为 lepl),但仍想要中间结果而不是最终解析列表,那么我认为这就是您正在寻找的那种东西:

>>> nested = Delayed()
>>> nested += "{" + (nested[1:,...]|Any()) + "}"
>>> split = (Drop("{") & (nested[:,...]|Any()) & Drop("}"))[:].parse
>>> split("{{a}{b}{{{c}}}}")
['{a}{b}{{{c}}}']
>>> split("{a}{b}{{{c}}}")
['a', 'b', '{{c}}']
>>> split("{{c}}")
['{c}']
>>> split("{c}")
['c']

起初可能看起来不透明,但它真的很简单:o)

嵌套是嵌套括号匹配器的递归定义(定义中的“+”和[...]在匹配后将所有内容保留为单个字符串)。然后split表示尽可能多地匹配 ("[:]") 被 "{" ... "}" (我们用 "Drop" 丢弃) 并包含嵌套表达式或任何字母的内容。

最后,这是一个 lepl 版本的“all in one”解析器,它给出的结果格式与上面的 pyparsing 示例相同,但(我相信)它在输入中的空格如何出现方面更灵活:

>>> with Separator(~Space()[:]):
...     nested = Delayed()
...     nested += Drop("{") & (nested[1:] | Any()) & Drop("}") > list
...
>>> nested.parse("{{ a }{ b}{{{c}}}}")
[[['a'], ['b'], [[['c']]]]]
于 2009-10-31T12:38:37.067 回答
1

使用Grako(语法编译器)

#!/usr/bin/env python
import json
import grako # $ pip install grako

grammar_ebnf = """
    bracketed = '{' @:( { bracketed }+ | any ) '}' ;
    any = /[^{}]+?/ ;
"""
model = grako.genmodel("Bracketed", grammar_ebnf)
ast = model.parse("{ { a } { b } { { { c } } } }", "bracketed")
print(json.dumps(ast, indent=4))

输出

[
    "a", 
    "b", 
    [
        [
            "c"
        ]
    ]
]
于 2014-10-05T18:17:45.743 回答
1

这是我为类似用例提出的解决方案。这大致基于公认的伪代码答案。我不想为外部库添加任何依赖项:

def parse_segments(source, recurse=False):
    """
    extract any substring enclosed in parenthesis
    source should be a string
    """
    unmatched_count = 0
    start_pos = 0
    opened = False
    open_pos = 0
    cur_pos = 0

    finished = []
    segments = []

    for character in source:
        #scan for mismatched parenthesis:
        if character == '(':
            unmatched_count += 1
            if not opened:
                open_pos = cur_pos
            opened = True

        if character == ')':
            unmatched_count -= 1

        if opened and unmatched_count == 0:
            segment = source[open_pos:cur_pos+1]
            segments.append(segment)
            clean = source[start_pos:open_pos]
            if clean:
                finished.append(clean)
            opened = False
            start_pos = cur_pos+1

        cur_pos += 1

    assert unmatched_count == 0

    if start_pos != cur_pos:
        #get anything that was left over here
        finished.append(source[start_pos:cur_pos])

    #now check on recursion:
    for item in segments:
        #get rid of bounding parentheses:
        pruned = item[1:-1]
        if recurse:
            results = parse_tags(pruned, recurse)
            finished.expand(results)
        else:
            finished.append(pruned)

    return finished
于 2014-12-08T15:22:17.517 回答