regex - How do I remove trailing comments with a regular expression?

Question

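For readers unfamiliar with MATLAB: I'm not sure which family they belong to, but MATLAB's regular expressions are documented in detail. MATLAB's comment character is % (percent) and its string delimiter is ' (apostrophe). A string delimiter inside a string is written as a doubled apostrophe ('this is how you write "it''s" in a string.'). To complicate matters further, the matrix transpose operators are also apostrophes (A' (Hermitian) or A.' (regular)).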

Now, for dark reasons (that I will not elaborate on :), I'm trying to interpret MATLAB code in MATLAB's own language.

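Right now I'm trying to remove all trailing comments in a cell array of strings, where each string contains one line of MATLAB code. At first glance this seems simple: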

>> str = 'simpleCommand(); % simple trailing comment';
>> regexprep(str, '%.*$', '')
ans =
    simpleCommand();

But of course, something like this can come along:

>> str = ' fprintf(''%d%*c%3.0f\n'', value, args{:}); % Let''s do this! ';
>> regexprep(str, '%.*$', '') 
ans = 
    fprintf('        %//   <-- WRONG!

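Obviously, we need to exclude from the match every comment character that sits inside a string, while also taking into account that a single apostrophe (or dot-apostrophe) following a statement is an operator, not a string delimiter.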

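Based on the assumption that the number of string opening/closing characters before the comment character must be even (which I know is incomplete, because of the matrix transpose operator), I came up with the following dynamic regular expression to handle this kind of case: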

>> str = {
       'myFun( {''test'' ''%''}); % let''s '                 
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % it''s '        
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % let''s '       
       'sprintf(str, ''%*8.0f%*s%c%3d\n'');  '
       'A = A.'';%tight trailing comment'
   };
>> 
>> C = regexprep(str, '(^.*)(?@mod(sum(\1==''''''''),2)==0;)(%.*$)', '$1')

However:

C = 
    'myFun( {'test' '%'}); '              %// success
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '  %// success
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '  %// success
    'sprintf(str, '%*8.0f%*s%c'           %// FAIL
    'A = A.';'                            %// success (although I'm not sure why)

So I'm almost there, but not quite :)
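One possible reading of the FAIL line, offered as an assumption based on the documented dynamic-expression behavior rather than anything stated in the question: `(?@cmd)` runs the command but discards its output, so it cannot act as a condition on the match. If that holds, the pattern above behaves like a plain greedy pattern that always cuts at the last % on the line. The sketch below drops the `(?@...)` part to check this against the same test lines.

```matlab
% Same test lines as in the question.
str = {
    'myFun( {''test'' ''%''}); % let''s '
    'sprintf(str, ''%*8.0f%*s%c%3d\n''); % it''s '
    'sprintf(str, ''%*8.0f%*s%c%3d\n''); % let''s '
    'sprintf(str, ''%*8.0f%*s%c%3d\n'');  '
    'A = A.'';%tight trailing comment'
    };

% Plain greedy pattern with the (?@...) part left out (assumption: the
% dynamic expression does not influence whether the pattern matches).
C = regexprep(str, '(^.*)(%.*$)', '$1');

% The fourth line has no comment, so the cut lands inside the format string.
disp(C{4})
```

Since `^.*` is greedy, the cut always lands on the last % in the line. That would also explain why the transpose line appears to succeed: its only % is the comment itself.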

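Unfortunately, I've run out of time to spend on this and need to move on to other things, so perhaps someone with more time would be kind enough to think about these questions: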

Are comment characters inside strings the only exception I need to look out for?
What is the correct and/or more efficient way to do this?

score 5 · Accepted Answer

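How do you feel about using undocumented features? If you don't object, you can use the mtree function to parse the code and strip the comments. No regular expressions are involved, and we all know we shouldn't try to parse context-free grammars with regular expressions.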

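This function is a full parser for MATLAB code, written in pure M-code. As far as I can tell it is an experimental implementation, but MathWorks already uses it in a few places (it is the same function that MATLAB Cody and Contests use to measure code length), and it can be used for other useful things.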

If the input is a cell array of strings, we do:

>> str = {..};
>> C = deblank(cellfun(@(s) tree2str(mtree(s)), str, 'UniformOutput',false))
C = 
    'myFun( { 'test', '%' } );'
    'sprintf( str, '%*8.0f%*s%c%3d\n' );'
    'sprintf( str, '%*8.0f%*s%c%3d\n' );'
    'sprintf( str, '%*8.0f%*s%c%3d\n' );'
    'A = A.';'

If you already have an M-file stored on disk, you can strip the comments simply with:

s = tree2str(mtree('myfile.m', '-file'))

If you want the comments kept in the tree, add: mtree(.., '-comments')

score 4 · Accepted Answer

This matches the conjugate transpose case by checking which characters are allowed directly before one:

Digits: 2'
Letters: A'
Dot: A.'
Closing parenthesis, brace and bracket: A(1)', A{1}' and [1 2 3]'

These are the only cases I can think of right now.

C = regexprep(str, '^(([^'']*''[^'']*''|[^'']*[\.a-zA-Z0-9\)\}\]]''[^'']*)*[^'']*)%.*$', '$1')

On your example, this returns

>> C = regexprep(str, '^(([^'']*''[^'']*''|[^'']*[\.a-zA-Z0-9\)\}\]]''[^'']*)*[^'']*)%.*$', '$1')

C = 

    'myFun( {'test' '%'}); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n');  '
    'A = A.';'
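The same rule (a quote is a transpose when it directly follows one of the characters listed above, and a string delimiter otherwise) can also be written as a plain character scanner instead of a regex. The following is a minimal sketch, not part of the original answer. The function name `stripTrailingComment` is hypothetical, and it assumes single-quoted strings only: double-quoted strings, `...` continuations and block comments are not handled.

```matlab
function out = stripTrailingComment(line)
% stripTrailingComment  Hypothetical helper, not from the original thread.
% Cuts a single line of MATLAB code at the first % found outside a
% single-quoted string.
out = line;
inString = false;
% Characters after which a quote is taken to be a transpose (assumption:
% the list from the answer above, plus underscore and a preceding quote).
operandEnd = ['a':'z', 'A':'Z', '0':'9', '_.)]}'''];
k = 1;
while k <= numel(line)
    c = line(k);
    if inString
        if c == ''''
            if k < numel(line) && line(k+1) == ''''
                k = k + 1;          % doubled quote: literal apostrophe
            else
                inString = false;   % closing delimiter
            end
        end
    elseif c == ''''
        % Directly after an operand this is a transpose; otherwise it
        % opens a string.
        if k == 1 || ~any(line(k-1) == operandEnd)
            inString = true;
        end
    elseif c == '%'
        out = line(1:k-1);          % everything from here on is comment
        return
    end
    k = k + 1;
end
end
```

For a cell array it could be applied with `cellfun(@stripTrailingComment, str, 'UniformOutput', false)`. Unlike the anchored regex, the scanner stops at the first % outside a string, so a comment that itself contains further % characters is removed in full.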

score 4 · Accepted Answer

Look what I found! :)

The comment stripping toolbox, by Peter J. Acklam.

For m-code, it contains the following regular expression:

mainregex = [ ...
     ' (                   ' ... % Grouping parenthesis (content goes to $1).
     '   ( ^ | \n )        ' ... % Beginning of string or beginning of line.
     '   (                 ' ... % Non-capturing grouping parenthesis.
     '                     ' ...
     '' ... % Match anything that is neither a comment nor a string...
     '       (             ' ... % Non-capturing grouping parenthesis.
     '           [\]\)}\w.]' ... % Either a character followed by
     '           ''+       ' ... %    one or more transpose operators
     '         |           ' ... % or else
     '           [^''%]    ' ... %   any character except single quote (which
     '                     ' ... %   starts a string) or a percent sign (which
     '                     ' ... %   starts a comment).
     '       )+            ' ... % Match one or more times.
     '                     ' ...
     '' ...  % ...or...
     '     |               ' ...
     '                     ' ...
     '' ...  % ...match a string.
     '       ''            ' ... % Opening single quote that starts the string.
     '         [^''\n]*    ' ... % Zero or more chars that are neither single
     '                     ' ... %   quotes (special) nor newlines (illegal).
     '         (           ' ... % Non-capturing grouping parenthesis.
     '           ''''      ' ... % An embedded (literal) single quote character.
     '           [^''\n]*  ' ... % Again, zero or more chars that are neither
     '                     ' ... %   single quotes nor newlines.
     '         )*          ' ... % Match zero or more times.
     '       ''            ' ... % Closing single quote that ends the string.
     '                     ' ...
     '   )*                ' ... % Match zero or more times.
     ' )                   ' ...
     ' [^\n]*              ' ... % What remains must be a comment.
              ];

  % Remove all the blanks from the regex.
  mainregex = mainregex(~isspace(mainregex));

which becomes

mainregex  = '((^|\n)(([\]\)}\w.]''+|[^''%])+|''[^''\n]*(''''[^''\n]*)*'')*)[^\n]*'

and should be used as

C = regexprep(str, mainregex, '$1')

So far it has withstood all my tests, so I think this should solve my problem quite nicely :)
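As a quick sanity check, the compact pattern can be run against the lines from the question plus two extra cases: a comment containing a second %, and a transpose followed by a comment containing an apostrophe. This is only a sketch. The variable names `lines` and `stripped` and the two extra test lines are not from the thread.

```matlab
mainregex = '((^|\n)(([\]\)}\w.]''+|[^''%])+|''[^''\n]*(''''[^''\n]*)*'')*)[^\n]*';

lines = {
    'myFun( {''test'' ''%''}); % let''s '
    'sprintf(str, ''%*8.0f%*s%c%3d\n''); % it''s '
    'sprintf(str, ''%*8.0f%*s%c%3d\n'');  '
    'A = A.'';%tight trailing comment'
    'x = 1; % first % second'
    'B = A''; % it''s'
    };

stripped = regexprep(lines, mainregex, '$1');

% Print each input next to its stripped version for inspection.
for k = 1:numel(lines)
    fprintf('[%s] -> [%s]\n', lines{k}, stripped{k});
end
```

Because the first group consumes only code and strings and stops at the first % outside a string, the remainder of the line is treated as comment no matter how many % or apostrophe characters it contains.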

score 2 · Accepted Answer

I prefer to abuse checkcode (the replacement for the old mlint) to do the parsing. Here is a suggestion:

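function strNC = removeComments(str)
% Strip trailing comments, using checkcode to confirm that each cut is safe.
if iscell(str)
    strNC = cellfun(@removeComments, str, 'UniformOutput', false);
elseif regexp(str, '%', 'once')
    err = getCheckCodeId(str);
    % Tentatively cut from the last % to the end of the line.
    strNC = regexprep(str, '%[^%]*$', '');
    errNC = getCheckCodeId(strNC);
    if strcmp(err, errNC)
        % Same diagnostic: the cut did no harm, so look for another %.
        strNC = removeComments(strNC);
    else
        % Diagnostic changed: the % was presumably part of the code, so undo the cut.
        strNC = str;
    end
else
    strNC = str;
end
end

function errid = getCheckCodeId(line)
% Write the line to a temporary file and return the ID of the first
% checkcode message ('' if there is none).
fName = 'someTempFileName.m';
fh = fopen(fName, 'w');
fprintf(fh, '%s\n', line);
fclose(fh);
if exist('checkcode')
    structRep = checkcode(fName, '-id');
else
    structRep = mlint(fName, '-id');   % fallback for older releases
end
delete(fName);
if isempty(structRep)
    errid = '';
else
    errid = structRep.id;
end
end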

For each line, it checks whether trimming from the last % to the end of the line introduces an error.

For your example, it returns:

>> removeComments(str)

ans = 

    'myFun( {'test' '%'}); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n');  '
    'A = A.';'

It does not remove the suppression directive %#ok, so you get:

>> removeComments('a=1; %#ok')

ans =

a=1; %#ok

This is probably a good thing.

score 1 · Accepted Answer

How about making sure that all apostrophes before the comment come in pairs, like this:

>> str = {
       'myFun( {''test'' ''%''}); % let''s '                 
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % it''s '        
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % let''s '       
       'sprintf(str, ''%*8.0f%*s%c%3d\n'');  '
   };

>> C = regexprep(str, '^(([^'']*''[^'']*'')*[^'']*)%.*$', '$1')

C = 
    myFun( {'test' '%'}); 
    sprintf(str, '%*8.0f%*s%c%3d\n'); 
    sprintf(str, '%*8.0f%*s%c%3d\n'); 
    sprintf(str, '%*8.0f%*s%c%3d\n');
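Two limitations of this pairs-only pattern are worth checking before relying on it. The sketch below uses two extra lines (the first taken from the question, the second made up for illustration): one with a transpose before the comment, and one whose comment contains a second %.

```matlab
pairsOnly = '^(([^'']*''[^'']*'')*[^'']*)%.*$';

lines = {
    'A = A.'';%tight trailing comment'
    'x = 1; % first % second'
    };

C = regexprep(lines, pairsOnly, '$1');

% Line 1: the single transpose quote is unpaired, so the pattern does not
% match and the comment is left in place.
disp(C{1})

% Line 2: [^'']* is greedy and also consumes % characters, so only the
% text from the last % onward is removed.
disp(C{2})
```

The first case is the transpose problem mentioned in the question. The second follows from anchoring the comment with `%.*$` after a greedy prefix, which makes the pattern cut at the last eligible % rather than the first.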
