unicode - Matching Unicode punctuation with LPeg

Question

I am trying to create an LPeg pattern that matches any Unicode punctuation character in UTF-8 encoded input. I came up with the following combination of Selene Unicode and LPeg:

local unicode     = require("unicode")
local lpeg        = require("lpeg")
local any         = lpeg.P(1) -- matches any single byte
local punctuation = lpeg.Cmt(lpeg.Cs(any * any^-3), function(s,i,a)
  local match = unicode.utf8.match(a, "^%p")
  if match == nil then
    return false
  else
    return i+#match
  end
end)

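This seems to work, but it has several problems. It will miss punctuation that is composed of several Unicode code points (if such characters exist), because I only read 4 bytes ahead. It will probably hurt the parser's performance. And it is undefined what the library's match function will do when I give it a string that contains a truncated (runt) UTF-8 character, although it appears to work for now.

I would like to know whether this is a correct approach, or whether there is a better way to achieve what I am trying to do.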

score 3 · Accepted Answer

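The correct way to match a UTF-8 character is shown in an example on the LPeg home page. The first byte of a UTF-8 character determines how many more bytes are part of it: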

local cont = lpeg.R("\128\191") -- continuation byte

local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\223") * cont
           + lpeg.R("\224\239") * cont * cont
           + lpeg.R("\240\244") * cont * cont * cont
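
To make the byte ranges in the pattern above concrete, here is a minimal plain-Lua sketch that maps a lead byte to the total length of its UTF-8 sequence. The helper name utf8_len is hypothetical and only for illustration; it is not part of LPeg or Selene Unicode.

```lua
-- Hypothetical helper: total length in bytes of the UTF-8 sequence
-- that starts with the given lead byte (a number in 0..255).
local function utf8_len(lead)
  if lead <= 127 then
    return 1                       -- ASCII, same as lpeg.R("\0\127")
  elseif lead >= 194 and lead <= 223 then
    return 2                       -- lead byte + 1 continuation byte
  elseif lead >= 224 and lead <= 239 then
    return 3                       -- lead byte + 2 continuation bytes
  elseif lead >= 240 and lead <= 244 then
    return 4                       -- lead byte + 3 continuation bytes
  end
  -- 128..191 are continuation bytes; 192, 193 and 245..255 never
  -- start a valid sequence.
  return nil
end
```

Each branch corresponds to one alternative of the utf8 pattern, so the pattern never reads more bytes than the lead byte announces.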

Building on this utf8 pattern, we can use lpeg.Cmt together with the Selene Unicode match function, similar to what you proposed:

local punctuation = lpeg.Cmt(lpeg.C(utf8), function (s, i, c)
    if unicode.utf8.match(c, "%p") then
        return i
    end
end)

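Note that we return i, which is what Cmt expects:

The given function gets as arguments the entire subject, the current position (after the match of patt), plus any capture values produced by patt. The first value returned by function defines how the match happens. If the call returns a number, the match succeeds and the returned number becomes the new current position.

This means we should return the same number the function receives, that is, the position immediately after the UTF-8 character. Returning i + #match, as in the question, would skip past additional input.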
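
The position semantics can be checked with a small self-contained sketch. It assumes only that LPeg is installed; to avoid depending on Selene Unicode, it swaps in a hypothetical stand-in classifier, is_punct, that knows ASCII punctuation plus two hard-coded multi-byte characters. In real code, unicode.utf8.match(c, "%p") would take its place.

```lua
local lpeg = require("lpeg")

local cont = lpeg.R("\128\191") -- continuation byte
local utf8char = lpeg.R("\0\127")
               + lpeg.R("\194\223") * cont
               + lpeg.R("\224\239") * cont * cont
               + lpeg.R("\240\244") * cont * cont * cont

-- Hypothetical stand-in for unicode.utf8.match(c, "%p").
local extra = {
  ["\194\191"]     = true, -- U+00BF INVERTED QUESTION MARK
  ["\226\128\148"] = true, -- U+2014 EM DASH
}
local function is_punct(c)
  return extra[c] or c:find("^%p$") ~= nil
end

local punctuation = lpeg.Cmt(lpeg.C(utf8char), function(s, i, c)
  if is_punct(c) then
    return i -- i already points just past the matched character
  end
  -- returning nothing makes the match fail
end)

-- With no captures, lpeg.match returns the position after the match.
local after_ascii = lpeg.match(punctuation, "!x")          -- 2
local after_2byte = lpeg.match(punctuation, "\194\191x")   -- 3
local no_match    = lpeg.match(punctuation, "ax")          -- nil
```

Because the function returns no extra values, the Cmt produces no captures, so the results are plain positions: one past the single-byte "!" and one past the two-byte inverted question mark.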
