string - Replacing multiple substrings using strrep in Matlab

Question

I have a large string (around 25M characters) in which I need to replace multiple substrings that follow a specific pattern.

Frame 1
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

Frame 2
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

Frame 7670
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

The substring I need to remove is "Frame #", which occurs about 7670 times. I can pass multiple search strings to strrep as a cell array:

strrep(text,{'Frame 1','Frame 2',..,'Frame 7670'},';')

However, this returns a cell array in which each cell holds my original string with only the corresponding substring from my input cell replaced.
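
To make the behaviour described above concrete, here is a small sketch on a made-up two-frame string (the sample text is an assumption, not the real data). Each output cell contains the full input with only one of the search strings replaced, so the replacements are never combined into a single result:

```matlab
% Illustrative sample, much shorter than the real 25M-character string
text = 'Frame 1 a Frame 2 b';

% A cell array of search strings gives one output cell per search string
out = strrep(text, {'Frame 1', 'Frame 2'}, ';');

% out{1} has only 'Frame 1' replaced, out{2} has only 'Frame 2' replaced
disp(out{1});   % '; a Frame 2 b'
disp(out{2});   % 'Frame 1 a ; b'
```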

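Is there a way to replace multiple substrings in a string other than using regexprep? I have noticed that it is much slower than strrep, which is why I am trying to avoid it.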

With regexprep it would be:

regexprep(text,'Frame \d*',';')

For a 25MB string, replacing all instances takes about 47 seconds.

Edit 1: Added the equivalent regexprep command

Edit 2: Added the size of the string for reference, the number of occurrences of the substring, and the execution time of regexprep

score 2 · Accepted Answer

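OK, in the end I found a way around the problem. Instead of using regexprep to change the substrings, I removed the "Frame " substring (including the space, but not the number):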

rawData = strrep(text,'Frame ','');

This results in something like this:

1
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

2
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

7670
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

Then I changed all commas (,) and newlines (\n) to semicolons (;), again using strrep, and created one big vector containing all the numbers:

rawData = strrep(rawData,sprintf('\r\n'),';');
rawData = strrep(rawData,';;',';');
rawData = strrep(rawData,';;',';');
rawData = strrep(rawData,',',';');
rawData = textscan(rawData,'%f','Delimiter',';');

Then I removed the unnecessary numbers (1,2,...,7670), since they sit at known positions in the array (each frame contains a fixed number of values).

rawData{1}(firstInstance:spacing:lastInstance)=[];
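
The whole workaround can be sketched on a tiny two-frame sample. The sample text and the index values below are assumptions chosen for illustration: each frame here holds 6 data values plus its frame number, so the frame numbers sit at positions 1, 8, ... with a spacing of 7.

```matlab
% Illustrative two-frame sample (the real string has about 7670 frames)
text = sprintf('Frame 1\r\n0,1,2\r\n3,4,5\r\nFrame 2\r\n6,7,8\r\n9,10,11');

% Strip the 'Frame ' prefix, leaving the frame number behind
rawData = strrep(text, 'Frame ', '');

% Turn line breaks and commas into a single delimiter
rawData = strrep(rawData, sprintf('\r\n'), ';');
rawData = strrep(rawData, ',', ';');

% Parse everything into one numeric column vector
rawData = textscan(rawData, '%f', 'Delimiter', ';');
values = rawData{1};

% Assumed layout for this sample: 1 frame number + 6 data values per frame
firstInstance = 1;
spacing = 7;
lastInstance = numel(values);

% Drop the leftover frame numbers
values(firstInstance:spacing:lastInstance) = [];
```

For this sample, values ends up as the column vector 0 through 11. With real data, spacing has to match the actual number of values per frame plus one.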

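Then I carried on with my processing. It seems that the extra strrep calls plus removing values from the array are much faster than the equivalent regexprep. On the 25M-character string, the whole operation took about 47 seconds with regexprep, whereas this workaround needs only about 5 seconds!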

Hope this helps.

score 1 · Accepted Answer

Using a regular expression:

result = regexprep(text,'Frame [0-9]+','');

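Regular expressions can be avoided as follows. I use strrep with suitable replacement strings that act as a mask. The resulting strings all have the same length and are guaranteed to be aligned, so they can be combined into the final result using the mask. I have also included the ; that you want. I don't know whether this will be faster than regexprep, but it is definitely more fun :-)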

% Data
text = 'Hello Frame 1 test string Frame 22 end of Frame 2 this'; %//example text
rep_orig = {'Frame 1','Frame 2','Frame 22'}; %//strings to be replaced.
%//May be of different lengths

% Computations    
rep_dest = cellfun(@(s) char(zeros(1,length(s))), rep_orig, 'uni', false);
%//series of char(0) of same length as strings to be replaced (to be used as mask)
aux = cell2mat(strrep(text,rep_orig.',rep_dest.'));
ind_keep = all(double(aux)); %//keep characters according to mask
ind_semicolon = diff(ind_keep)==1; %//where to insert ';' 
ind_keep = ind_keep | [ind_semicolon 0]; %// semicolons will also be kept
result = aux(1,:); %//for now
result(ind_semicolon) = ';'; %//include `;`
result = result(ind_keep); %//remove unwanted characters

With this example data:

>> text

text =

Hello Frame 1 test string Frame 22 end of Frame 2 this

>> result

result =

Hello ; test string ; end of ; this

score 1 · Accepted Answer

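I think this can be done using only textscan, which is known to be very fast. If you specify a 'CommentStyle', the 'Frame #' lines are stripped out. This probably only works because those 'Frame #' entries are on lines of their own. This code returns the raw data as one big vector: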

s = textscan(text,'%f','CommentStyle','Frame','Delimiter',',');
s = s{:}

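You may want to know how many elements there are in each frame, or even reshape the data into a matrix. You can use textscan again (or before the approach above) to get the data for the first frame only: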

f1 = textscan(text,'%f','CommentStyle','Frame 1','Delimiter',',');
f1 = f1{:}
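
As a rough sketch of the reshaping idea, the length of the first frame can be used as the row count. The sample text and the name frames are made up for illustration, and the sketch assumes every frame holds the same number of values:

```matlab
% Illustrative two-frame sample
text = sprintf('Frame 1\n0,1,2\n3,4,5\nFrame 2\n6,7,8\n9,10,11');

% All values, with every 'Frame #' line treated as a comment
s = textscan(text, '%f', 'CommentStyle', 'Frame', 'Delimiter', ',');
s = s{:};

% Values of the first frame only: reading stops at the 'Frame 2' line
f1 = textscan(text, '%f', 'CommentStyle', 'Frame 1', 'Delimiter', ',');
f1 = f1{:};

% One column per frame (assumes all frames have numel(f1) values)
frames = reshape(s, numel(f1), []);
```

Each column of frames then holds the values of one frame in reading order.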

In fact, if you only want the elements of the first row, you can use this:

l1 = textscan(text,'%f,','CommentStyle','Frame 1')
l1 = l1{:}

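Another benefit of textscan, though, is that you can use it to read the file directly (it looks like you may currently be doing this some other way), using fopen only to obtain a FID. That way the string data text never has to be held in memory.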
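
A minimal sketch of reading straight from a file might look like the following. The temporary file and its contents are placeholders created only so the example runs on its own. In practice you would pass the path of your own data file to fopen.

```matlab
% Create a small placeholder file (stand-in for the real data file)
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'Frame 1\n0,1,2\nFrame 2\n3,4,5\n');
fclose(fid);

% Open the file for reading and check that the open succeeded
fid = fopen(fname, 'r');
if fid == -1
    error('Could not open %s', fname);
end

% Parse the numbers directly from the file, skipping the 'Frame #' lines
s = textscan(fid, '%f', 'CommentStyle', 'Frame', 'Delimiter', ',');
fclose(fid);
s = s{:};

% Remove the placeholder file
delete(fname);
```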
