javascript - Get the index of each capture in a JavaScript regular expression

Question

I want to match a regular expression like /(a).(b)(c.)d/ against "aabccde" and get the following information:

"a" at index = 0
"b" at index = 2
"cc" at index = 3

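How can I do this? String.match returns the list of matches and the index where the full match starts, but not the index of each capture.

Edit: a test case that does not work with a plain indexOf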

regex: /(a).(.)/
string: "aaa"
expected result: "a" at 0, "a" at 2

Note: this question is similar to "Javascript Regex: How to find index of each subexpression?", but I cannot modify the regex to make every subexpression a capturing group.

score 8 · Accepted Answer

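There is currently a proposal (stage 4) to implement this natively in JavaScript:

RegExp Match Indices for ECMAScript

ECMAScript RegExp Match Indices provide additional information about the start and end indices of captured substrings relative to the start of the input string.

...We propose the adoption of an additional indices property on the array result (the substrings array) of RegExp.prototype.exec(). This property would itself be an indices array containing a pair of start and end indices for each captured substring. Any unmatched capture groups would be undefined, similar to their corresponding element in the substrings array. In addition, the indices array would itself have a groups property containing the start and end indices for each named capture group.

Here is an example of how this works. The following snippet runs without errors in Chrome, at least: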

const re1 = /a+(?<Z>z)?/d;

// indices are relative to start of the input string:
const s1 = "xaaaz";
const m1 = re1.exec(s1);
console.log(m1.indices[0][0]); // 1
console.log(m1.indices[0][1]); // 5
console.log(s1.slice(...m1.indices[0])); // "aaaz"

console.log(m1.indices[1][0]); // 4
console.log(m1.indices[1][1]); // 5
console.log(s1.slice(...m1.indices[1])); // "z"

console.log(m1.indices.groups["Z"][0]); // 4
console.log(m1.indices.groups["Z"][1]); // 5
console.log(s1.slice(...m1.indices.groups["Z"])); // "z"

// capture groups that are not matched return `undefined`:
const m2 = re1.exec("xaaay");
console.log(m2.indices[1]); // undefined
console.log(m2.indices.groups.Z); // undefined

So, for the code in the question, we could do:

const re = /(a).(b)(c.)d/d;
const str = 'aabccde';
const result = re.exec(str);
// indices[0], like result[0], describes the indices of the full match
const matchStart = result.indices[0][0];
result.forEach((matchedStr, i) => {
  const [startIndex, endIndex] = result.indices[i];
  console.log(`${matchedStr} from index ${startIndex} to ${endIndex} in the original string`);
  console.log(`From index ${startIndex - matchStart} to ${endIndex - matchStart} relative to the match start\n-----`);
});

Output:

aabccd from index 0 to 6 in the original string
From index 0 to 6 relative to the match start
-----
a from index 0 to 1 in the original string
From index 0 to 1 relative to the match start
-----
b from index 2 to 3 in the original string
From index 2 to 3 relative to the match start
-----
cc from index 3 to 5 in the original string
From index 3 to 5 relative to the match start
-----

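Keep in mind that the indices array contains the indices of the matched groups relative to the start of the string, not relative to the start of the match.

A polyfill is available here.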
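
As a minimal sketch of how the `d` flag could be wrapped for the two test cases in the question: the helper name `captureIndices` is made up for illustration and is not part of the proposal or of any library. It assumes an engine that supports `hasIndices` (the `d` flag), and it only looks at the first match.

```javascript
// Hypothetical helper: returns { text, start, end } for every capture group
// of the first match of `re` in `str`, or [] when there is no match.
function captureIndices(re, str) {
  // Add the `d` flag if the caller did not, so exec() fills in `indices`.
  const flags = re.flags.includes('d') ? re.flags : re.flags + 'd';
  const match = new RegExp(re.source, flags).exec(str);
  if (match === null) return [];

  const captures = [];
  for (let i = 1; i < match.length; i++) {
    const range = match.indices[i]; // undefined when group i did not participate
    captures.push(range ? { text: match[i], start: range[0], end: range[1] } : null);
  }
  return captures;
}

// "a" at 0, "b" at 2, "cc" at 3
console.log(captureIndices(/(a).(b)(c.)d/, 'aabccde'));
// "a" at 0, "a" at 2 (the case a plain indexOf gets wrong)
console.log(captureIndices(/(a).(.)/, 'aaa'));
```

Groups that did not participate in the match are reported as `null` here; that is a choice made for this sketch, not something the proposal prescribes.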

score 6 · Accepted Answer

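A while ago I wrote MultiRegExp for this. As long as you don't have nested capture groups, it should do the trick. It works by inserting capture groups between the ones in your RegExp and using all the intermediate groups to calculate the requested group positions.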

var exp = new MultiRegExp(/(a).(b)(c.)d/);
exp.exec("aabccde");

should return

{0: {index:0, text:'a'}, 1: {index:2, text:'b'}, 2: {index:3, text:'cc'}}
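
To illustrate the idea of filler groups by hand (this is only a sketch of the principle, not MultiRegExp's actual code): the pattern below was rewritten manually so that every piece between the original captures sits in its own group. The names `filled`, `wanted` and `positionsFromFilledGroups` are invented for this example, and the approach assumes the groups are consecutive, non-nested, and start at the beginning of the match.

```javascript
// Original pattern: /(a).(b)(c.)d/
// Rewritten by hand: group 2 is a filler, groups 1, 3 and 4 are the original captures.
const filled = /(a)(.)(b)(c.)d/;
const wanted = [1, 3, 4];

function positionsFromFilledGroups(re, wantedGroups, str) {
  const m = re.exec(str);
  if (m === null) return null;

  const result = [];
  let offset = m.index; // the first group starts where the full match starts
  for (let i = 1; i < m.length; i++) {
    const text = m[i] ?? ''; // a group that did not participate has length 0
    if (wantedGroups.includes(i)) result.push({ index: offset, text });
    offset += text.length;   // the next group starts where this one ended
  }
  return result;
}

// [ { index: 0, text: 'a' }, { index: 2, text: 'b' }, { index: 3, text: 'cc' } ]
console.log(positionsFromFilledGroups(filled, wanted, 'aabccde'));
```

Because each group begins exactly where the previous one ends, summing the lengths of the preceding groups gives the start index of any capture.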

Live version

score 5 · Accepted Answer

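I created a small regex parser that can also handle nested groups like a charm. It's small but huge. Not really. Like Donald's hands. I would be really glad if someone could test it, so that it gets battle-tested. It can be found at: https://github.com/valorize/MultiRegExp2

Usage: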

let regex = /a(?: )bc(def(ghi)xyz)/g;
let regex2 = new MultiRegExp2(regex);

let matches = regex2.execForAllGroups('ababa bcdefghixyzXXXX');

Will output:
[ { match: 'defghixyz', start: 8, end: 17 },
  { match: 'ghi', start: 11, end: 14 } ]

score 1 · Accepted Answer

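So, you have a text and a regular expression:

var txt = "aabccde";
var re = /(a).(b)(c.)d/;

The first step is to get the list of all substrings that match the regular expression:

var subs = re.exec(txt);

Then you can do a simple search in the text for each substring. You have to keep the position of the last substring in a variable. I've named this variable cursor.

var cursor = subs.index;
for (var i = 1; i < subs.length; i++){
    var sub = subs[i];
    // search only from the end of the previous capture onwards
    var index = txt.indexOf(sub, cursor);
    cursor = index + sub.length;

    console.log(sub + ' at index ' + index);
}

Edit: Thanks to @nhahtdh, I've improved the mechanism and written a complete function: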

String.prototype.matchIndex = function(re){
    var res  = [];
    var subs = this.match(re);

    for (var cursor = subs.index, l = subs.length, i = 1; i < l; i++){
        var index = cursor;

        if (i+1 !== l && subs[i] !== subs[i+1]) {
            var nextIndex = this.indexOf(subs[i+1], cursor);
            while (true) {
                var currentIndex = this.indexOf(subs[i], index);
                if (currentIndex !== -1 && currentIndex <= nextIndex)
                    index = currentIndex + 1;
                else
                    break;
            }
            index--;
        } else {
            index = this.indexOf(subs[i], cursor);
        }
        cursor = index + subs[i].length;

        res.push([subs[i], index]);
    }
    return res;
}


console.log("aabccde".matchIndex(/(a).(b)(c.)d/));
// [ [ 'a', 1 ], [ 'b', 2 ], [ 'cc', 3 ] ]

console.log("aaa".matchIndex(/(a).(.)/));
// [ [ 'a', 0 ], [ 'a', 1 ] ] <-- problem here

console.log("bababaaaaa".matchIndex(/(ba)+.(a*)/));
// [ [ 'ba', 4 ], [ 'aaa', 6 ] ]

score 1 · Accepted Answer

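Based on the ECMA regular expression syntax, I've written a parser, or rather an extension of the RegExp class, that solves this problem (a fully indexed exec method) as well as other limitations of the JavaScript RegExp implementation, for example group-based search and replace. You can test and download the implementation here (it is also available as an NPM module).

The implementation works as follows (small example):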

//Retrieve content and position of: opening-, closing tags and body content for: non-nested html-tags.
var pattern = '(<([^ >]+)[^>]*>)([^<]*)(<\\/\\2>)';
var str = '<html><code class="html plain">first</code><div class="content">second</div></html>';
var regex = new Regex(pattern, 'g');
var result = regex.exec(str);

console.log(5 === result.length);
console.log('<code class="html plain">first</code>'=== result[0]);
console.log('<code class="html plain">'=== result[1]);
console.log('first'=== result[3]);
console.log('</code>'=== result[4]);
console.log(5=== result.index.length);
console.log(6=== result.index[0]);
console.log(6=== result.index[1]);
console.log(31=== result.index[3]);
console.log(36=== result.index[4]);

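I also tried @velop's implementation, but it seems to be buggy. For example, it doesn't handle backreferences correctly, e.g. "/a(?: )bc(def( \1 ghi)xyz)/g": when parentheses are added in front, the backreference \1 needs to be incremented accordingly (which is not the case in his implementation).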
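
The renumbering issue can be reproduced with plain regular expressions, independent of either library. In this small sketch the three patterns are made up for illustration: adding a group in front of an existing capture silently changes what \1 points at, unless the backreference is bumped to \2.

```javascript
const original = /(a)b\1/;     // \1 refers to (a)
const naive    = /(z?)(a)b\1/; // a group was added in front: \1 now refers to (z?)
const adjusted = /(z?)(a)b\2/; // backreference renumbered to keep pointing at (a)

console.log(original.exec('aba')[0]); // "aba"
console.log(naive.exec('aba')[0]);    // "ab"  (\1 matched the empty string captured by (z?))
console.log(adjusted.exec('aba')[0]); // "aba"
```

Any approach that inserts extra capture groups into a pattern therefore has to rewrite every numeric backreference that comes after the insertion point.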

score -2 · Accepted Answer

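I'm not sure exactly what your requirements are for the search, but here is how you could get the desired output for your first example using RegExp.exec() and a while loop.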

JavaScript

var myRe = /^a|b|c./g;
var str = "aabccde";
var myArray;
while ((myArray = myRe.exec(str)) !== null)
{
  var msg = '"' + myArray[0] + '" ';
  msg += "at index = " + (myRe.lastIndex - myArray[0].length);
  console.log(msg);
}

Output

"a" at index = 0
"b" at index = 2
"cc" at index = 3

Using the lastIndex property, you can subtract the length of the currently matched string to get the starting index.
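
The same start index is also exposed directly as the `index` property of each match, so the subtraction is optional. A minimal sketch using `String.prototype.matchAll` with the same rewritten pattern (the variable names `pattern`, `input` and `found` are only for this example):

```javascript
const pattern = /^a|b|c./g; // matchAll requires the g flag
const input = "aabccde";
const found = [];

for (const m of input.matchAll(pattern)) {
  // m.index is the start of this match in the input string
  found.push({ text: m[0], index: m.index });
  console.log('"' + m[0] + '" at index = ' + m.index);
}
```

Like the loop above, this depends on rewriting the regex as an alternation of the pieces of interest, so it reports separate matches rather than the captures of the original expression.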
