python - Splitting blocks of text with a regular expression - Python

Question

I get the following output from the Stanford parser:

nicaragua president ends visit to finland .

nn(ends-3, nicaragua-1)
nn(ends-3, president-2)
nsubj(visit-4, ends-3)
xsubj(finland-6, ends-3)
root(ROOT-0, visit-4)
aux(finland-6, to-5)
xcomp(visit-4, finland-6)

guatemala president ends visit to tropos .

nn(ends-3, guatemala-1)
nn(ends-3, president-2)
nsubj(visit-4, ends-3)
xsubj(finland-6, ends-3)
root(ROOT-0, visit-4)
aux(tropos-6, to-5)
xcomp(visit-4, tropos-6)

[...]

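I need to split this output so that, for each sentence, I get a tuple containing the sentence and a list of all its dependencies, i.e. (sentence, [list of dependencies]). Can anyone suggest how to do this in Python? Thanks!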

score 0 · Accepted Answer

You could do something like this, although it's probably overkill for the structure you're parsing. It should be relatively easy to extend if you need to parse the dependencies as well. I haven't run this yet, or even checked the syntax, so don't kill me if it doesn't work right away.

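# Parser states
READ_SENT = 0  # expecting a sentence line
PRE_DEPS = 1   # expecting the blank line after the sentence
DEPS = 2       # reading dependency lines until a blank line

def parse_output(text):
    state = READ_SENT
    results = []
    sent = None
    deps = []
    for line in text.splitlines():
        line = line.strip()
        if state == READ_SENT:
            if not line:
                continue  # tolerate extra blank lines between blocks
            sent = line
            state = PRE_DEPS
        elif state == PRE_DEPS:
            if line:
                raise ValueError('invalid format: expected a blank line after the sentence')
            state = DEPS
        elif state == DEPS:
            if line:
                deps.append(line)
            else:
                # A blank line ends the dependency block.
                results.append((sent, deps))
                sent = None
                deps = []
                state = READ_SENT
    # Flush the last block if the input does not end with a blank line.
    if sent is not None:
        results.append((sent, deps))
    return results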
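
Since the question title asks about regular expressions, here is a rough sketch of that route too. It assumes the layout is exactly what the sample shows (sentence, blank line, dependency lines, blank line, and so on) and that each dependency looks like rel(word-index, word-index). The names split_blocks, parse_dep and DEP_RE are made up for this example; they are not part of the Stanford parser or any library.

```python
import re

# Assumed dependency shape: rel(governor-index, dependent-index)
DEP_RE = re.compile(r'^([^(]+)\((.+)-(\d+), (.+)-(\d+)\)$')

def split_blocks(text):
    """Return [(sentence, [dependency lines])], using blank lines as separators."""
    chunks = [c.strip() for c in re.split(r'\n\s*\n', text.strip()) if c.strip()]
    # Chunks are assumed to alternate: sentence, dependencies, sentence, ...
    # A trailing chunk without a partner is silently dropped by zip().
    return [(sent, deps.splitlines())
            for sent, deps in zip(chunks[0::2], chunks[1::2])]

def parse_dep(line):
    """Turn 'nn(ends-3, nicaragua-1)' into ('nn', ('ends', 3), ('nicaragua', 1))."""
    m = DEP_RE.match(line.strip())
    if m is None:
        raise ValueError('not a dependency line: %r' % line)
    rel, gov, gov_i, dep, dep_i = m.groups()
    return rel, (gov, int(gov_i)), (dep, int(dep_i))
```

This is shorter than the state machine, but it does no format checking: if a sentence is ever missing its dependency block, the sentence/dependency pairing shifts without any error, so the state machine is the safer choice for messy input.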
