
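For example, I have 3 sentences, shown below, and one of them contains a citation mark (Warren and Pereira, 1982). The citation is always placed in parentheses in this format: (~string~comma(,)~space~number~)

He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits.

I am using a regex to extract only the middle sentence, but it keeps printing all 3 sentences. The result should look like this:

The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).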


2 Answers


Setup: 2 texts representing the cases of interest:

text = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits."

t2 = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural. CHAT-80 was a state of the art natural language system that was impressive on its own merits."

First, match the case where the citation sits at the end of the sentence:

p1 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"

Then match the case where the citation is not at the end of the sentence:

p2 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)"
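As a quick check of why two patterns are needed, the sketch below runs each one against both sample texts. It assumes the patterns are written as raw strings with the character class `[A-Za-z]`. Each pattern should find the cited sentence only in the text it was written for.

```python
import re

text = ("He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
        "The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). "
        "CHAT-80 was a state of the art natural language system that was impressive on its own merits.")
t2 = ("He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
      "The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) "
      "fgbhdr was a state of the art natural. "
      "CHAT-80 was a state of the art natural language system that was impressive on its own merits.")

# p1: the closing parenthesis is immediately followed by the period.
p1 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
# p2: at least one non-period character sits between ")" and the period.
p2 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)"

for name, pattern in (("p1", p1), ("p2", p2)):
    print(name, "on text:", re.findall(pattern, text))
    print(name, "on t2:  ", re.findall(pattern, t2))
```

p1 comes back empty on t2 and p2 comes back empty on text, so neither pattern covers both placements of the citation by itself.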

Combine the two cases with the `|` regex operator:

import re

p_main = re.compile(r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
                    r"|\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)")

Run it:

>>> print(re.findall(p_main, text))
[('The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).', '')]

>>> print(re.findall(p_main, t2))
[('', 'The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural.')]

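In both cases you get the sentence with the citation. Because the pattern has two capture groups, `re.findall` returns a tuple per match, and the group for the alternative that did not match is an empty string.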
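If a flat list of strings is more convenient than tuples, one possible variation is to fold both cases into a single capture group by letting the text after the closing parenthesis be optional (`[^.]*` may match nothing). This is a sketch, not part of the original answer, and the names `P_CITED` and `cited_sentences` are made up for illustration.

```python
import re

# Hypothetical merged pattern: "[^.]*" allows zero or more characters
# between the closing parenthesis and the sentence-ending period.
P_CITED = re.compile(r"\. (.*\([A-Za-z]+ .* [0-9]+\)[^.]*\.)")

def cited_sentences(s):
    """Return the captured sentences as a flat list of strings."""
    return P_CITED.findall(s)

text = ("He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
        "The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). "
        "CHAT-80 was a state of the art natural language system that was impressive on its own merits.")
t2 = ("He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
      "The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) "
      "fgbhdr was a state of the art natural. "
      "CHAT-80 was a state of the art natural language system that was impressive on its own merits.")

print(cited_sentences(text))
print(cited_sentences(t2))
```

Like the patterns above, this one needs a ". " in front of the cited sentence, so it will not find a citation in the very first sentence of a text. Because `.*` is greedy, it can also pull in earlier sentences when more than one sentence comes before the citation.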

A good resource is the Python regular expression documentation and the accompanying Regex HOWTO page.

Cheers

answered 2017-08-13T10:27:17.267
text = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits."

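You can split the text into a list of sentences and then pick the ones that end with ")".

# The last element after the split is an empty string, so drop it.
sentences = text.split(".")[:-1]

for sentence in sentences:
    # endswith() is safe on empty strings, unlike sentence[-1]
    if sentence.endswith(")"):
        print(sentence.strip())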
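The split above only catches a citation that sits right before the period. A possible extension, sketched below, is to split into sentences first and then keep every sentence that contains a citation anywhere. The names `CITATION` and `sentences_with_citation` are hypothetical, the citation shape is assumed from the question's "(string, number)" format, and the sentence split is deliberately naive (a period followed by whitespace), so abbreviations such as "e.g. " would break it.

```python
import re

# Assumed citation shape: "(", a letter, any non-parenthesis text, ", ", digits, ")".
CITATION = re.compile(r"\([A-Za-z][^()]*, [0-9]+\)")

def sentences_with_citation(s):
    # Naive sentence split: break at whitespace that follows a period.
    sentences = re.split(r"(?<=\.)\s+", s.strip())
    return [sent for sent in sentences if CITATION.search(sent)]

text = ("He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
        "The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). "
        "CHAT-80 was a state of the art natural language system that was impressive on its own merits.")
t2 = ("He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
      "The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) "
      "fgbhdr was a state of the art natural. "
      "CHAT-80 was a state of the art natural language system that was impressive on its own merits.")

print(sentences_with_citation(text))
print(sentences_with_citation(t2))
```

Splitting first keeps each result to a single sentence, and it also works when the cited sentence is the first one in the text.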
answered 2017-08-13T08:47:18.093