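I want to read a text, use a regular expression to find every instance of a pattern, and print the matched strings. With re.search(), I can successfully grab and print the first instance of the pattern: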

import re

text = "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian."

match = re.search(r'(cello|Cello)(\W{1,80}\w{1,60}){0,9}\W{0,20}(lillian|Lillian)', text)
print(match.group())

Unfortunately, re.search() only finds the first instance of the pattern, so I substituted re.findall():

import re

text = "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian."

match = re.findall(r'(cello|Cello)(\W{1,80}\w{1,60}){0,9}\W{0,20}(lillian|Lillian)', text)
print(match)

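This routine finds both instances of the target pattern in the sample text, but I can't find a way to print the sentences in which the pattern occurs. The print call in the second snippet produces [('Cello', ' with', 'Lillian'), ('Cello', ' yellow', 'Lillian')] rather than the output I want: "Cello is a yellow parakeet who sings with Lillian. Cello is a yellow Lillian."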

Is there a way to modify the second snippet to get the desired output? I would be very grateful for any advice on this.

2 Answers

Description

The regular expression below uses lookaheads to capture every complete sentence that contains both Cello and Lillian.

(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$)).*?\.(?=\s|$))

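The expression breaks down into these functional components:

  • (?:(?<=\.)\s+|^) starts the match at the beginning of a sentence: either after a . followed by any amount of whitespace, or at the start of the string
  • ( opens capture group 1, which will capture the whole sentence
  • (?= opens the lookahead
    • (?:(?!\.(?:\s|$)).)*? keeps the regex engine inside this sentence by refusing to step over a . that is followed by whitespace or the end of the string
    • \b matches a word boundary
    • [Cc]ello matches the desired text with either a lowercase or an uppercase first letter
    • (?=\s|\.|$) looks ahead to ensure the word is followed by whitespace, a ., or the end of the string
    • ) closes the lookahead
  • (?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$)) does essentially the same thing, but for Lillian
  • .*?\.(?=\s|$) captures the rest of the sentence, including the period, and ensures the period is followed by whitespace or the end of the string
  • ) closes capture group 1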
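
The same breakdown can be written directly into the pattern. The sketch below is an addition for Python readers, not part of the original answer: it assumes Python's re.VERBOSE flag to spread the expression over several lines with comments, and re.DOTALL so that . also matches newlines. The name SENTENCE_RE and the short sample text are made up for illustration.

```python
import re

# Same expression as above, laid out with re.VERBOSE so each piece carries a comment.
# SENTENCE_RE is an illustrative name only.
SENTENCE_RE = re.compile(r"""
    (?:(?<=\.)\s+|^)               # start after a period plus whitespace, or at string start
    (                              # open capture group 1: the whole sentence
      (?=                          # lookahead 1: the sentence contains Cello
        (?:(?!\.(?:\s|$)).)*?      #   advance without crossing a sentence-ending period
        \b[Cc]ello(?=\s|\.|$)      #   the word, followed by whitespace, a period, or the end
      )
      (?=                          # lookahead 2: same idea for Lillian
        (?:(?!\.(?:\s|$)).)*?
        \b[Ll]illian(?=\s|\.|$)
      )
      .*?\.(?=\s|$)                # consume the sentence up to its closing period
    )                              # close capture group 1
""", re.VERBOSE | re.DOTALL)

text = "Cello sings with Lillian. Toby is a clown. Lillian likes Cello."
sentences = SENTENCE_RE.findall(text)
print(sentences)
```

Because the pattern has exactly one capture group, findall returns a plain list of sentences rather than tuples.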

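Code example

I don't know Python very well, so I've provided a PHP example. Note that in the match statement I used the s option, which allows . to match newline characters.

Input text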

Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian.
Cello likes Lillian and kittens.
Lillian likes Cello and dogs.  Cello has no friends. And Lillian also hasn't met anyone.

Code

<?php
$sourcestring="your source string";
preg_match_all('/(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$)).*?\.(?=\s|$))/s',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>

Matches

$matches Array:
(
    [0] => Array
        (
            [0] => Cello is a yellow parakeet who sings with Lillian.
            [1] =>  Cello is a yellow Lillian.
            [2] => 
Cello likes Lillian and kittens.
            [3] => 
Lillian likes Cello and dogs.
        )

    [1] => Array
        (
            [0] => Cello is a yellow parakeet who sings with Lillian.
            [1] => Cello is a yellow Lillian.
            [2] => Cello likes Lillian and kittens.
            [3] => Lillian likes Cello and dogs.
        )

)
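
Since the question is about Python, here is a rough Python counterpart of the PHP example. It is a sketch rather than part of the original answer: it assumes re.DOTALL as the equivalent of PHP's s option, and the variable names are illustrative.

```python
import re

source = (
    "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. "
    "Willy is a Wonka. Cello is a yellow Lillian.\n"
    "Cello likes Lillian and kittens.\n"
    "Lillian likes Cello and dogs.  Cello has no friends. And Lillian also hasn't met anyone."
)

# The expression from the answer, split across lines only for readability.
pattern = (
    r"(?:(?<=\.)\s+|^)"
    r"((?=(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$))"
    r"(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$))"
    r".*?\.(?=\s|$))"
)

# re.DOTALL plays the role of PHP's /s option.
# With exactly one capture group, findall returns a list of group-1 strings.
sentences = re.findall(pattern, source, flags=re.DOTALL)
for sentence in sentences:
    print(sentence)
```

This should print the same four sentences that appear under index [1] in the PHP output above.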

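If you strictly need to match only sentences where Cello appears before Lillian, you can use an expression like this one. Here I simply moved one closing parenthesis, so the Lillian lookahead is nested inside the Cello lookahead and starts searching right after Cello.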

(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$)(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$))).*?\.(?=\s|$))

Input text

Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian.
Cello likes Lillian and kittens.
Lillian likes Cello and dogs.  Cello has no friends. And Lillian also hasn't met anyone.

Output of capture group 1

[1] => Array
    (
        [0] => Cello is a yellow parakeet who sings with Lillian.
        [1] => Cello is a yellow Lillian.
        [2] => Cello likes Lillian and kittens.
    )
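
A Python sketch of the order-sensitive variant, under the same assumptions as before (re.DOTALL standing in for PHP's s option, illustrative variable names); it is not part of the original answer.

```python
import re

source = (
    "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. "
    "Willy is a Wonka. Cello is a yellow Lillian.\n"
    "Cello likes Lillian and kittens.\n"
    "Lillian likes Cello and dogs.  Cello has no friends. And Lillian also hasn't met anyone."
)

# The Lillian lookahead sits inside the Cello lookahead, so it is only tried
# from the position just after Cello: Lillian has to come later in the sentence.
ordered = (
    r"(?:(?<=\.)\s+|^)"
    r"((?=(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$)"
    r"(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$)))"
    r".*?\.(?=\s|$))"
)

in_order = re.findall(ordered, source, flags=re.DOTALL)
for sentence in in_order:
    print(sentence)
```

"Lillian likes Cello and dogs." is no longer reported, because nothing after Cello in that sentence matches Lillian.
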
Answered 2013-06-19T05:54:32.700

I would just put one big capture group around the two endpoints:

import re

text = "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian."

for match in re.findall(r'(Cello(?:\W{1,80}\w{1,60}){0,9}\W{0,20}Lillian)', text, flags=re.I):
    print(match)

Now you get both matches:

Cello is a yellow parakeet who sings with Lillian
Cello is a yellow Lillian

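A few tips:

  • flags=re.I makes the regex case-insensitive, so Cello matches both cello and Cello
  • (?:foo) works like (foo), except that the text it matches is not returned as a captured group. This is useful for grouping things together without capturing them.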
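
As a related option, the question's original pattern can be left untouched if you iterate over match objects instead. This is a minimal sketch, assuming re.finditer is acceptable in place of re.findall; group(0) of a match object is always the full matched text, regardless of how many capture groups the pattern contains.

```python
import re

text = "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian."

# The question's original pattern, capture groups and all.
pattern = r'(cello|Cello)(\W{1,80}\w{1,60}){0,9}\W{0,20}(lillian|Lillian)'

# With several capture groups, findall returns one tuple of groups per match.
print(re.findall(pattern, text))

# finditer yields match objects; group(0) is the whole match.
spans = [m.group(0) for m in re.finditer(pattern, text)]
print(spans)
```

Either approach returns the text from Cello through Lillian, without the sentence's closing period.
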
Answered 2013-06-19T03:02:57.027