1

I have the following input:

str = """

    Q: What is a good way of achieving this?

    A: I am not sure. Try the following:

    1. Take this first step. Execute everything.

    2. Then, do the second step

    3. And finally, do the last one



    Q: What is another way of achieving this?

    A: I am not sure. Try the following alternatives:

    1. Take this first step from before. Execute everything.

    2. Then, don't do the second step

    3. Do the last one and then execute the above step

"""

I want to capture the QA pairs in the input, but I can't come up with a good regex to do it. This is what I have managed so far:

(?ms)^[\s#\-\*]*(?:Q)\s*:\s*(\S.*?\?)[\s#\-\*]+(?:A)\s*:\s*(\S.*)$

However, this captures the input as follows:

('Q', 'What is a good way of achieving this?')
('A', "I am not sure. Try the following:\n    1. Take this first step. Execute everything.\n    2. Then, do the second step\n    3. And finally, do the last one\n\n    Q: What is another way of achieving this?\n    A: I am not sure. Try the following alternatives:\n    1. Take this first step from before. Execute everything.\n    2. Then, don't do the second step\n    3. Do the last one and then execute the above step\n")

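Notice how the second QA pair is swallowed by the first one. If I add the non-greedy ? to the end of the answer group, it no longer captures the enumerations. Any suggestions on how to solve this?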

4 Answers

1

Simply using this works fine for me. You only need to trim some whitespace.

(?s)(Q):((?:(?!A:).)*)(A):((?:(?!Q:).)*)

Usage example:

>>> import re
>>> str = """
...
...     Q: What is a good way of achieving this?
...
...     A: I am not sure. Try the following:
...
...     1. Take this first step. Execute everything.
...
...     2. Then, do the second step
...
...     3. And finally, do the last one
...
...
...
...     Q: What is another way of achieving this?
...
...     A: I am not sure. Try the following alternatives:
...
...     1. Take this first step from before. Execute everything.
...
...     2. Then, don't do the second step
...
...     3. Do the last one and then execute the above step
...
... """
>>> regex = r"(?s)(Q):((?:(?!A:).)*)(A):((?:(?!Q:).)*)"
>>> match = re.findall(regex, str)
>>> map(lambda x: [part.strip().replace('\n', '') for part in x], match)
[['Q', 'What is a good way of achieving this?', 'A', 'I am not sure. Try the following:    1. Take this first step. Execute everything.    2. Then, do the second step    3. And finally, do the last one'], ['Q', 'What is another way of achieving this?', 'A', "I am not sure. Try the following alternatives:    1. Take this first step from before. Execute everything.    2. Then, don't do the second step    3. Do the last one and then execute the above step"]]

I even added a little something to help you clean up the leftover whitespace.
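For readers on Python 3, where map returns an iterator instead of a list, the same idea can be sketched with a list comprehension. This is only an illustration of the pattern above: the helper name extract_pairs and the shortened sample text are made up for the example, and the groups that captured the literal "Q" and "A" are dropped.

```python
import re

text = """
    Q: What is a good way of achieving this?

    A: I am not sure. Try the following:

    1. Take this first step.

    2. Then, do the second step


    Q: What is another way of achieving this?

    A: Try the following alternatives:

    1. Don't do the second step
"""

# (?:(?!A:).)* consumes one character at a time as long as "A:" does not
# start at the current position; the answer part does the same with "Q:".
QA_PATTERN = re.compile(r"(?s)Q:((?:(?!A:).)*)A:((?:(?!Q:).)*)")


def extract_pairs(source):
    """Return (question, answer) tuples with the outer whitespace trimmed."""
    return [(q.strip(), a.strip()) for q, a in QA_PATTERN.findall(source)]


pairs = extract_pairs(text)
for question, answer in pairs:
    print(repr(question), repr(answer))
```

Unlike the session above, this keeps the newlines inside each answer, so the numbered steps stay on separate lines.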

Answered 2013-05-03T18:13:31.187
1

A lazy, though not the best, workaround is to split the string on "Q:" and then parse each part with a simple /Q:(.+)A:(.+)/msU (generic regex notation).
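A rough Python sketch of this split-then-parse idea follows. Two things are assumptions of the sketch rather than part of the answer: Python's re module has no ungreedy U flag (re.U means Unicode there), so a lazy quantifier is used instead, and because splitting on "Q:" removes that marker, the per-chunk pattern only looks for "A:". The function name and sample text are hypothetical.

```python
import re

text = """
    Q: What is a good way of achieving this?

    A: I am not sure. Try the following:

    1. Take this first step.

    2. Then, do the second step


    Q: What is another way of achieving this?

    A: Try the following alternatives:

    1. Don't do the second step
"""


def split_then_parse(source):
    """Split on "Q:" and pull a (question, answer) pair out of each chunk."""
    pairs = []
    for chunk in source.split("Q:"):
        # The "Q:" marker is already gone, so only "A:" separates the parts.
        match = re.search(r"(.+?)A:(.+)", chunk, flags=re.DOTALL)
        if match:  # chunks without "A:" (e.g. leading whitespace) are skipped
            pairs.append((match.group(1).strip(), match.group(2).strip()))
    return pairs


pairs = split_then_parse(text)
for question, answer in pairs:
    print(repr(question), repr(answer))
```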

Answered 2013-05-03T17:39:56.673
0

I haven't gotten around to writing huge regular expressions (yet), so here is my non-regex solution:

>>> str = """

    Q: What is a good way of achieving this?

    A: I am not sure. Try the following:

    1. Take this first step. Execute everything.

    2. Then, do the second step

    3. And finally, do the last one



    Q: What is another way of achieving this?

    A: I am not sure. Try the following alternatives:

    1. Take this first step from before. Execute everything.

    2. Then, don't do the second step

    3. Do the last one and then execute the above step

"""
>>> qas = str.strip().split('Q:')
>>> clean_qas = map(lambda x: x.strip().split('A:'), filter(None, qas))
>>> print clean_qas
[['What is a good way of achieving this?\n\n    ', ' I am not sure. Try the following:\n\n    1. Take this first step. Execute everything.\n\n    2. Then, do the second step\n\n    3. And finally, do the last one'], ['What is another way of achieving this?\n\n    ', " I am not sure. Try the following alternatives:\n\n    1. Take this first step from before. Execute everything.\n\n    2. Then, don't do the second step\n\n    3. Do the last one and then execute the above step"]]

You should clean up the whitespace, though. Or you could do what Puciek said.

Just for fun:

>>> clean_qas = map(lambda x: map(lambda s: s.strip(), x.strip().split('A:')), filter(None, qas))
>>> print clean_qas
[['What is a good way of achieving this?', 'I am not sure. Try the following:\n\n    1. Take this first step. Execute everything.\n\n    2. Then, do the second step\n\n    3. And finally, do the last one'], ['What is another way of achieving this?', "I am not sure. Try the following alternatives:\n\n    1. Take this first step from before. Execute everything.\n\n    2. Then, don't do the second step\n\n    3. Do the last one and then execute the above step"]]

It looks ugly, though.
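The sessions above are Python 2 (print statement, map and filter returning lists). A possibly less ugly Python 3 sketch of the same non-regex idea uses str.partition, which splits on the first "A:" only; the function name and the shortened sample text are made up for illustration.

```python
text = """
    Q: What is a good way of achieving this?

    A: I am not sure. Try the following:

    1. Take this first step.

    2. Then, do the second step


    Q: What is another way of achieving this?

    A: Try the following alternatives:

    1. Don't do the second step
"""


def parse_without_regex(source):
    """Split on "Q:", then cut each chunk at its first "A:"."""
    pairs = []
    for chunk in source.split("Q:"):
        question, marker, answer = chunk.partition("A:")
        if marker:  # an empty marker means the chunk contained no "A:"
            pairs.append((question.strip(), answer.strip()))
    return pairs


pairs = parse_without_regex(text)
print(pairs)
```

Because partition stops at the first "A:", a later "A:" inside the answer text stays part of the answer, whereas split('A:') would break it into extra pieces.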

Answered 2013-05-03T18:32:52.123
0

A slight modification of your original solution:

(?ms)^[\s#\-\*]*(?:Q)\s*:\s+(\S[^\n\r]*\?)[\s#\-\*]+(?:A)\s*:\s+(\S.*?)\s*(?=$|Q\s*:\s+)
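  • The colon after Q and after A must be followed by at least one whitespace character (\s+ instead of \s*).
  • Instead of matching the question non-greedily (which does not allow multiple ?'s in one question), newlines are disallowed in the question.
  • Instead of matching to the end of the string, the answer is matched non-greedily until it is followed by either the end of the string or another question.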

Use re.findall to get all question/answer matches.
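One caveat when trying this with Python's re module: with (?m) enabled, $ also matches at the end of every line, so the lazy answer group can stop after its first line, and the trailing \s* consumes the indentation that ^ needs in order to find the next question. The sketch below is a possible variation, not the answerer's exact pattern: it swaps $ for \Z (end of string only) and moves the whitespace into the lookahead. The sample text is shortened for illustration.

```python
import re

text = """
    Q: What is a good way of achieving this?

    A: I am not sure. Try the following:

    1. Take this first step.

    2. Then, do the second step


    Q: What is another way of achieving this?

    A: Try the following alternatives:

    1. Don't do the second step
"""

QA_PATTERN = re.compile(
    r"(?ms)^[\s#\-\*]*(?:Q)\s*:\s+(\S[^\n\r]*\?)"  # question: one line ending in ?
    r"[\s#\-\*]+(?:A)\s*:\s+(\S.*?)"               # answer: lazy, may span lines
    r"(?=\s*(?:\Z|Q\s*:\s+))"                      # stop before end of string or next Q:
)

# findall returns one (question, answer) tuple per match.
pairs = QA_PATTERN.findall(text)
for question, answer in pairs:
    print(repr(question), repr(answer))
```

As with the original pattern, an answer that itself contains something like "Q: " would be cut short at that point.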

Answered 2013-05-03T18:33:18.210