ruby - How to handle C-style comments in Ruby using Parslet?

Question

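Taking the code example from Parslet's own creator (available at this link) as a starting point, I need to extend it so that it retrieves all the non-commented text from a file written in a C-like syntax.

The provided example successfully parses C-style comments and treats those regions as ordinary whitespace. However, this simple example only expects 'a' characters in the non-commented areas of the file, as in this sample input: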

         a
      // line comment
      a a a // line comment
      a /* inline comment */ a 
      /* multiline
      comment */

The rule used to detect the non-commented text is simply:

   rule(:expression) { (str('a').as(:a) >> spaces).as(:exp) }

Therefore, I need to generalize the previous rule so that it captures all other (non-commented) text from a more generic file, such as:

     word0
  // line comment
   word1 // line comment
  phrase /* inline comment */ something 
  /* multiline
  comment */

I am new to Parsing Expression Grammars, and none of my previous attempts have succeeded.

score 4 · Accepted Answer

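The general idea is that everything is code (that is, non-comment) until one of the sequences // or /* appears. You can reflect this with a rule like the following: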

rule(:code) {
  (str('/*').absent? >> str('//').absent? >> any).repeat(1).as(:code)
}

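As I mentioned in my comment, there is a small problem with strings. When a comment marker occurs inside a string, it is obviously part of the string. If you removed such "comments" from your code, you would alter the meaning of that code. Therefore the parser has to know what a string is, and that every character inside it belongs to it. The other issue is escape sequences. For example, the string "foo \" bar /*baz*/", which contains a literal double quote, would otherwise be parsed as "foo \" followed by some more code. This obviously needs to be handled. I have written a complete parser that covers all of the cases above: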

require 'parslet'

class CommentParser < Parslet::Parser
  rule(:eof) { 
    any.absent? 
  }

  rule(:block_comment_text) {
    (str('*/').absent? >> any).repeat.as(:comment)
  }

  rule(:block_comment) {
    str('/*') >> block_comment_text >> str('*/')
  }

  rule(:line_comment_text) {
    (str("\n").absent? >> any).repeat.as(:comment)
  }

  rule(:line_comment) {
    str('//') >> line_comment_text >> (str("\n").present? | eof)
  }

  rule(:string_text) {
    (str('"').absent? >> str('\\').maybe >> any).repeat
  }

  rule(:string) {
    str('"') >> string_text >> str('"')
  }

  rule(:code_without_strings) {
    (str('"').absent? >> str('/*').absent? >> str('//').absent? >> any).repeat(1)
  }

  rule(:code) {
    (code_without_strings | string).repeat(1).as(:code)
  }

  rule(:code_with_comments) {
    (code | block_comment | line_comment).repeat
  }

  root(:code_with_comments)
end

It will parse your input

     word0
  // line comment
   word1 // line comment
  phrase /* inline comment */ something 
  /* multiline
  comment */

into this AST

[{:code=>"\n   word0\n "@0},
 {:comment=>" line comment"@13},
 {:code=>"\n  word1 "@26},
 {:comment=>" line comment"@37},
 {:code=>"\n phrase "@50},
 {:comment=>" inline comment "@61},
 {:code=>" something \n "@79},
 {:comment=>" multiline\n comment "@94},
 {:code=>"\n"@116}]

To extract everything except the comments, you can do the following:

input = <<-CODE
     word0
  // line comment
   word1 // line comment
  phrase /* inline comment */ something 
  /* multiline
  comment */
CODE

ast = CommentParser.new.parse(input)
puts ast.map{|node| node[:code] }.join

This will output

   word0

  word1
 phrase  something

score 1 · Accepted Answer

Another way to handle comments is to treat them as whitespace. For example:

rule(:space?) do
  space.maybe
end

rule(:space) do
  (block_comment | line_comment | whitespace).repeat(1)
end

rule(:whitespace) do
  match('\s')
end

rule(:block_comment) do
  str('/*') >>
  (str('*/').absent? >> any).repeat(0) >>
  str('*/')
end

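rule(:line_comment) do
  # Consume everything up to the newline; also accept a comment that ends at end of input.
  str('//') >> match('[^\n]').repeat >> (str("\n") | any.absent?)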
end

Then, when you write rules that use whitespace, such as this completely off-the-cuff and probably wrong rule for C,

rule(:assignment_statement) do
  lvalue >> space? >> str('=') >> space? >> rvalue >> str(';')
end

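the comments get "eaten" by the parser without any fuss. Anywhere whitespace can or must appear, comments of any kind are allowed and are treated as whitespace.

This approach is not as well suited to your exact problem, which is recognizing the non-comment text in a C program, but it works very well in a parser that must recognize the full language.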
