haskell - How do I parse C-style comments with the Alex lexer?

Question

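I want to create a lexer for C-style comments. My current approach creates separate tokens for the comment start, the comment end, the middle, and single-line comments: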

%wrapper "monad"

tokens :-
  <0> $white+ ;
  <0> "/*"               { mkL LCommentStart `andBegin` comment }
  <comment> .            { mkL LComment }
  <comment> "*/"         { mkL LCommentEnd `andBegin` 0 }
  <0> "//" .*$           { mkL LSingleLineComment }

data LexemeClass
  = LEOF
  | LCommentStart
  | LComment
  | LCommentEnd
  | LSingleLineComment

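How can I reduce the number of intermediate tokens? For the input /*blabla*/ I get 8 tokens instead of 1!
How can I strip the // part from the single-line comment token?
Is it possible to lex comments without the monad wrapper?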

score 2 · Accepted Answer

Have a look at this:

http://lpaste.net/107377

Test it with something like:

echo "This /* is a */ test" | ./c_comment

It should print:

Right [W "This",CommentStart,CommentBody " is a ",CommentEnd,W "test"]

The key Alex routines you need to use are:

alexGetInput -- gets the current input state
alexSetInput -- sets the current input state
alexGetByte  -- returns the next byte and input state
andBegin     -- return a token and set the current start code

Each of the routines commentBegin, commentEnd and commentBody has the following signature:

AlexInput -> Int -> Alex Lexeme

where Lexeme stands for your token type. The AlexInput parameter has the following form (for the monad wrapper):

(AlexPosn, Char, [Byte], String)

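The Int parameter is the length of the match, and the matched text itself sits at the front of the String field. Most token handlers will therefore have the form: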

handler :: AlexInput -> Int -> Alex Lexeme
handler (pos,_,_,inp) len = ... do something with (take len inp) and pos ...

In general, it seems that handlers can ignore the Char and [Byte] fields.
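To make the handler shape concrete, here is a minimal sketch of my own. The AlexPosn, Byte and AlexInput definitions are stand-ins for what Alex generates with the monad wrapper (check your generated .hs file), and Lexeme with its WordAt constructor is a hypothetical token type, not the one from the paste. The handler is kept pure so the block compiles without Alex; a real handler would return its result in the Alex monad.

```haskell
import Data.Word (Word8)

-- Stand-ins for the types Alex generates for the monad wrapper.
-- Assumption: check the generated .hs file for the real definitions.
data AlexPosn = AlexPn !Int !Int !Int deriving (Eq, Show)
type Byte = Word8
type AlexInput = (AlexPosn, Char, [Byte], String)

-- Hypothetical token type carrying the position and the matched text.
data Lexeme = WordAt AlexPosn String deriving (Eq, Show)

-- Pure variant of the handler shape: only the position and the first
-- len characters of the String field are used; the Char and [Byte]
-- fields are ignored.
wordHandler :: AlexInput -> Int -> Lexeme
wordHandler (pos, _, _, inp) len = WordAt pos (take len inp)
```

For instance, applying wordHandler to the input (AlexPn 0 1 1, '\n', [], "This is") with a match length of 4 keeps only "This" together with the position.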

The handlers commentBegin and commentEnd can ignore both the AlexInput and Int arguments, since they only match fixed-length strings.

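The commentBody handler works by calling alexGetByte to accumulate the comment body until "*/" is found. As far as I know, C comments cannot be nested, so the comment ends at the first occurrence of "*/".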

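Note that the first character of the comment body is in the match0 variable. In fact, my code has a bug: it does not match "/**/" correctly. It should look at match0 to decide whether to start in loop or in loopStar.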
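The two-state idea behind loop and loopStar can be sketched on a plain String, independent of Alex. This is not the code from the paste: scanBody is a hypothetical helper that assumes it is handed the text immediately after "/*", so there is no match0 to special-case and an empty body works. In a real handler each step would be an alexGetByte call, followed by alexSetInput once the end is found.

```haskell
-- Split the text that follows "/*" into (comment body, rest after "*/").
-- Nothing means the comment was never closed.
scanBody :: String -> Maybe (String, String)
scanBody = loop []
  where
    -- loop: the previous character was not '*'.
    loop _   []       = Nothing
    loop acc ('*':cs) = loopStar acc cs
    loop acc (c:cs)   = loop (c : acc) cs

    -- loopStar: the previous character was '*', not yet added to acc.
    loopStar _   []       = Nothing
    loopStar acc ('/':cs) = Just (reverse acc, cs)
    loopStar acc ('*':cs) = loopStar ('*' : acc) cs
    loopStar acc (c:cs)   = loop (c : '*' : acc) cs
```

The accumulator is built in reverse with (:) and flipped once at the end, which avoids repeated appends.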

You can use the same technique to parse "//" style comments, or any token that needs a non-greedy match.
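For the "//" case, a similarly Alex-free sketch: lineComment is a hypothetical helper that expects text starting at "//", drops the two slashes, and stops at the newline without consuming it. Inside a handler the same scan would again go through alexGetByte.

```haskell
-- Given text that starts with "//", return the comment text without
-- the slashes and the remaining input (starting at the newline, if any).
lineComment :: String -> Maybe (String, String)
lineComment ('/':'/':cs) = Just (break (== '\n') cs)
lineComment _            = Nothing
```

This also answers the question about stripping the // part: the token carries only the first component of the pair.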

Another key point is that patterns like $white+ are qualified with a start code:

<0>$white+

This is done so that they are not active while a comment is being processed.

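You can use another wrapper, but note that the structure of the AlexInput type may be different. For the basic wrapper, for example, it is just a 3-tuple: (Char,[Byte],String). Just look at the definition of AlexInput in the generated .hs file.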

One last note: accumulating characters with ++ is, of course, rather inefficient. You probably want to use Text (or ByteString) as the accumulator.
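As a rough illustration of the Text route, the sketch below goes one step further than an accumulator and lets Data.Text.breakOn find the terminator in a single pass. It assumes the text package is available and that the remaining input is already a Text value; splitComment is a hypothetical name, and wiring it into an Alex handler is left out.

```haskell
import qualified Data.Text as T

-- Split the text that follows "/*" at the first "*/".
-- Returns (comment body, rest after "*/"), or Nothing if "*/" is missing.
splitComment :: T.Text -> Maybe (T.Text, T.Text)
splitComment input =
  case T.breakOn close input of
    (_, rest) | T.null rest -> Nothing   -- terminator not found
    (body, rest)            -> Just (body, T.drop (T.length close) rest)
  where
    close = T.pack "*/"
```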
