python - RegEx for capturing scientific citations

Question

I'm trying to capture parenthesized text that contains at least one digit (think citations). This is my current regex, and it works fine: https://regex101.com/r/oOHPvO/5

\((?=.*\d).+?\)

So I want it to capture (Author 2000) and (2000), but not (Author).

I'm trying to use Python to capture all of these brackets, but in Python it also captures bracketed text that contains no digits.

import re

with open('text.txt') as f:
    f = f.read()

s = "\((?=.*\d).*?\)"

citations = re.findall(s, f)

citations = list(set(citations))

for c in citations:
    print (c)

Any idea what I'm doing wrong?

score 1 · Accepted Answer

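Probably the most reliable way to handle this is to add boundaries, in case your expression needs to grow later. For example, we can define character lists for each part of the data we want to collect: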

(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\)).

Demo

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\))."

test_str = "some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author, 2000) some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author; 2000)"

matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)

for matchNum, match in enumerate(matches, start=1):

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
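
As a rough sketch of how the three capture groups behave (the sample string and variable names here are illustrative, not from the original post), `re.findall` returns one tuple per citation. Because this pattern requires letters before the digits, a bare `(2000)` is not matched:

```python
import re

# Same pattern as above; IGNORECASE lets [a-z] match "Author".
pattern = re.compile(r"(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\)).", re.IGNORECASE)

# Hypothetical sample text for illustration.
sample = "see (Author) and (Author 2000), (Smith, 1999) or (2000)"

# With three capture groups, findall yields (name, separator, year) tuples.
groups = pattern.findall(sample)
print(groups)
```

If year-only citations such as `(2000)` must also be captured, this pattern would need its letter group made optional, or a different pattern altogether.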

Demo

const regex = /(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\))./mgi;
const str = `some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author, 2000) some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author; 2000)`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

RegEx Circuit

jex.im visualizes regular expressions:

score 1 · Accepted Answer

You can use

re.findall(r'\([^()\d]*\d[^()]*\)', s)

See the regex demo

Details

\( - a ( character
[^()\d]* - 0 or more characters other than (, ) and digits
\d - a digit
[^()]* - 0 or more characters other than ( and )
\) - a ) character.
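
To see why the lookahead from the question over-matches, it may help to run the two patterns side by side. In the sketch below (the sample text and variable names are assumptions for illustration), `(?=.*\d)` is satisfied by a digit anywhere later on the line, including one inside a later bracket, whereas the negated character classes cannot cross a `(` or `)`:

```python
import re

sample = "Some (Author) and (Author 2000)"

# Pattern from the question: the lookahead may find its digit past the closing ")".
lookahead = re.compile(r"\((?=.*\d).*?\)")
# Pattern from this answer: the digit must occur before the next parenthesis.
bounded = re.compile(r"\([^()\d]*\d[^()]*\)")

over_matched = lookahead.findall(sample)
correct = bounded.findall(sample)
print(over_matched)
print(correct)
```

On a short test string with a single bracket the two patterns can behave identically, which may be why the original looked fine in an online tester.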

See the regex graph:

Python demo:

import re
rx = re.compile(r"\([^()\d]*\d[^()]*\)")
s = "Some (Author) and (Author 2000)"
print(rx.findall(s)) # => ['(Author 2000)']

To get results without the parentheses, add a capturing group:

rx = re.compile(r"\(([^()\d]*\d[^()]*)\)")
                    ^                ^

See this Python demo.
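
Combined with the de-duplication step from the question, a minimal sketch might look like the following. The helper name `extract_citations` and the sample text are hypothetical; the question reads from `text.txt`, which is replaced here by an inline string so the example is self-contained:

```python
import re

# Capturing group returns the text without the surrounding parentheses.
CITATION_RX = re.compile(r"\(([^()\d]*\d[^()]*)\)")

def extract_citations(text):
    """Return unique citations in order of first appearance."""
    # dict.fromkeys drops duplicates while keeping insertion order.
    seen = dict.fromkeys(CITATION_RX.findall(text))
    return list(seen)

text = "As shown (Author 2000) and (Author), later (2000) and again (Author 2000)."
citations = extract_citations(text)
for c in citations:
    print(c)
```

Using `dict.fromkeys` instead of `list(set(...))` is a design choice made here to keep the citations in document order; `set` would also remove duplicates but without a guaranteed order.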
