regex - Splitting a word with a regular expression in MATLAB; startIndex of the 'split' parts?

Question

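My goal is to generate a phonetic transcription for any word according to a set of rules.

First, I want to split the word into syllables. For example, I want an algorithm that finds 'ch' in a word and separates it out, like this: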

Input: 'aachbutcher'
Output: 'a' 'a' 'ch' 'b' 'u' 't' 'ch' 'e' 'r'

This is what I have so far:

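check=regexp('aachbutcher','ch');    % numeric array with the start index of each 'ch'

if ~isempty(check)                   % true when 'ch' was found

   % Request all four outputs explicitly so their order is unambiguous
   [match,split,startIndex,endIndex] = regexp('aachbutcher','ch','match','split','start','end');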

   %Now I split the 'aa', 'but' and 'er' into single characters:
   for i = 1:length(split)
       SingleLetters{i} = regexp(split{1,i},'.','match');
   end

end

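My question is: how do I put the cells back together so that they have the same format as the desired output? I only have the start indices of the matched parts ('ch'), but not of the split parts ('aa', 'but', 'er').

Any ideas?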

score 0 · Accepted Answer

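You don't need the indices or the lengths. The logic is simple: take the first element from split, then the first element from match, then the second element from split, and so on. split always has one more element than match, so it comes first and last.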

% Only the matches and the pieces between them are needed here
[match,split] = regexp('aachbutcher','ch','match','split');

%Now I split the 'aa', 'but' and 'er' into single characters:
SingleLetters=regexp(split{1,1},'.','match');

for i = 2:length(split)
   SingleLetters=[SingleLetters,match{i-1},regexp(split{1,i},'.','match')];
end
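
For this particular example, the same result can probably be obtained with a single call, without any reassembly. The sketch below is not the answerer's method: it relies on regexp trying the alternatives of `ch|.` from left to right, so 'ch' is matched as a unit wherever it occurs and every other character is matched on its own. The variable names `word` and `parts` are only illustrative.

```matlab
word = 'aachbutcher';

% Try 'ch' first at each position; otherwise fall back to any single character.
parts = regexp(word, 'ch|.', 'match');

% parts is a 1x9 cell array: {'a','a','ch','b','u','t','ch','e','r'}
disp(parts)
```

More multi-letter units could be added as further alternatives in front of the dot (for example `'sch|ch|.'`), with longer units listed first so that they win over shorter ones.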

score 0 · Accepted Answer

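So, you know the length of 'ch': it is 2. You also know where regexp found it, because those indices are stored in startIndex. I'm assuming (correct me if I'm wrong) that you want to split all the other letters of the word into single-letter cells, as in your output above. So you can use the startIndex data to build your output with a condition, like this: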

word = 'aachbutcher';
startIndex = regexp(word,'ch');      % start index of every 'ch' in the word

output = {};                         % collected pieces, in order
i = 1;
while i <= length(word)              % a while loop is needed: changing i inside a for loop has no effect
    if ismember(i,startIndex)
        output{end+1} = 'ch';        % position i starts a 'ch'
        i = i + 2;                   % skip both characters of 'ch'
    else
        output{end+1} = word(i);     % any other letter becomes its own cell
        i = i + 1;
    end
end

I don't have MATLAB at hand right now, so I can't test this. I hope it works for you! If not, let me know and I'll try another approach.
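
As for the start indices of the split parts that the question asks about: they can be derived from the match indices, since every split part begins right after a match ends (or at position 1) and ends right before the next match starts (or at the end of the word). A minimal sketch, assuming the default output order of regexp (start indices first, end indices second); `splitStart`, `splitEnd` and `pieces` are illustrative names:

```matlab
word = 'aachbutcher';

% With no output keywords, the first two outputs are the start and end indices.
[startIndex, endIndex] = regexp(word, 'ch');

% Each split part starts after the previous match and ends before the next one.
splitStart = [1, endIndex + 1];               % [1 5 10]
splitEnd   = [startIndex - 1, length(word)];  % [2 7 11]

% Recover the split parts from their index ranges: {'aa','but','er'}
pieces = arrayfun(@(s, e) word(s:e), splitStart, splitEnd, 'UniformOutput', false);
```

If the word starts or ends with 'ch', the corresponding range is empty (its end index is smaller than its start index), which matches the empty string that the 'split' output returns in that case.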
