4

I am currently working on a poker hand history parser as part of my bachelor project. I've been doing some research over the past couple of days and came across a few nice parser generators (of which I chose JavaCC, since the project itself will be coded in Java).

Although the hand history grammar is pretty basic and straightforward, there's an ambiguity problem caused by the set of characters allowed in a player's nickname.

Suppose we have a line in the following format:

Seat 5: myNickname (1500 in chips)

The token myNickname can contain any character, including whitespace. This means that both "(1500 in chip" and "Seat 5:" are valid nicknames, which ultimately leads to an ambiguity problem. There are no restrictions on a player's nickname except for length (4-12 characters).

I need to parse and store several pieces of data along with the player's nickname (e.g. seat position and amount of chips in this particular case), so my question is: what are my options here?

I would love to do it using JavaCC, with something along these lines:

SeatRecord seat() :
{ Token seatPos, nickname, chipStack; }
{
    "Seat" seatPos=<INTEGER> ":" nickname=<NICKNAME> "(" chipStack=<INTEGER> 
    "in chips)"
    {
        return new SeatRecord(seatPos.image, nickname.image, chipStack.image); 
    }
}  

Right now this doesn't work (because of the problem mentioned above).

I also searched around for GLR parsers (which apparently handle ambiguous grammars), but they mostly seem to be abandoned or poorly documented. The exception is Bison, but it doesn't support GLR parsers for Java and might be too complex to work with anyway (aside from the ambiguity problem, the grammar itself is pretty basic, as I mentioned).

Or should I stick to tokenizing the string myself and use indexOf(), lastIndexOf() etc. to parse the data I need? I would go for that only if it were the last remaining option, since it would be too ugly IMHO and I might miss some cases (which would lead to incorrect parsing).


3 Answers

7

If your input format is as simple as you've specified, you can probably get away with a simple regular expression:

^Seat ([0-9]+): (.*) \(([0-9]+) in chips\)$

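In this case the regex engine's NFA resolves your ambiguity, and the parentheses are capturing groups that let you extract the information you are interested in.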
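As a rough sketch of how this could look from Java, using java.util.regex: the class name SeatLineRegex and the parse helper are made up for illustration, and the only change to the pattern is doubling the backslashes for a Java string literal. Because .* is greedy and the pattern is anchored at the end of the line, the chip count is always taken from the last "(N in chips)" on the line, so a nickname such as "Seat 5:" is captured whole.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of JavaCC or of the original project.
final class SeatLineRegex {
    // Same pattern as above; backslashes are doubled for the Java string literal.
    private static final Pattern SEAT_LINE =
            Pattern.compile("^Seat ([0-9]+): (.*) \\(([0-9]+) in chips\\)$");

    /** Returns {seat, nickname, chips}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = SEAT_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // The nickname here is itself a valid "Seat N:" prefix.
        String[] fields = parse("Seat 3: Seat 5: (200 in chips)");
        System.out.println(String.join(" | ", fields));
    }
}
```

The three strings could then be handed to the SeatRecord constructor from the question. The 4-12 character limit is not enforced by this pattern and would need a separate check.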

Answered 2012-06-18T11:35:37.147
2

You have two solutions:

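  • Add some restrictions on names. I can hardly recall any widely used system that accepts nicknames like this. Have users stick to alphanumeric characters and a "_" separator. You could also reserve keywords, such as Seat, so that those words cannot be used as nicknames.
  • You could also build a finite automaton from your grammar and use it for parsing. I think an FSM can handle this kind of ambiguous grammar. Once you have it, you can parse everything you want.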
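One possible reading of the second option, without a parser generator, is a small hand-written scanner. The line has fixed text at both ends, so the seat number can be read from the left and the chip count from the right, and whatever remains in between is the nickname. The sketch below is only an illustration under that assumption; the class name SeatLineScanner and its helpers are hypothetical.

```java
// Hypothetical hand-written scanner: fixed prefix from the left, fixed suffix
// from the right, nickname is whatever is left in the middle.
final class SeatLineScanner {
    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Returns {seat, nickname, chips}, or null if the line is malformed. */
    static String[] parse(String line) {
        final String prefix = "Seat ";
        final String suffix = " in chips)";
        if (!line.startsWith(prefix) || !line.endsWith(suffix)) {
            return null;
        }

        // Left to right: the seat digits, followed by ": ".
        int seatStart = prefix.length();
        int i = seatStart;
        while (i < line.length() && isAsciiDigit(line.charAt(i))) {
            i++;
        }
        if (i == seatStart || !line.startsWith(": ", i)) {
            return null;
        }
        int nickStart = i + 2;

        // Right to left: the chip digits, preceded by " (".
        int chipsEnd = line.length() - suffix.length();
        int j = chipsEnd;
        while (j > nickStart && isAsciiDigit(line.charAt(j - 1))) {
            j--;
        }
        if (j == chipsEnd || j - 2 < nickStart || !line.startsWith(" (", j - 2)) {
            return null;
        }

        return new String[] {
            line.substring(seatStart, i),      // seat number
            line.substring(nickStart, j - 2),  // nickname, may contain anything
            line.substring(j, chipsEnd)        // chip count
        };
    }
}
```

Scanning from both ends means the nickname itself never has to be tokenized, which is what sidesteps the ambiguity.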

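Either way, I think the original design is flawed. Nicknames shouldn't be allowed to draw on such a wide set of characters. Besides, why not use an identifier instead of the name? The names can be stored in a database.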

Answered 2012-06-18T11:31:28.450
2

The grammar for your system could look like this (written as a context-free grammar):

S -> seating nickname chips

seating -> "Seat " number ":"
number -> "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
number -> number number

nickname -> "a" | "b" | "c" ...... | "z" | ...."+" | "?" | number
nickname -> nickname nickname 

chips -> "(" number "in chips)"

Note the rules of the form:

number -> number number

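This basically allows an unbounded repetition. Note that "unbounded" does not mean the rule accepts everything. Together with the single-digit alternatives, the rule above is basically equivalent to the regex (\d+), that is, one or more digits.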
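Since every rule in this grammar only concatenates or repeats, each nonterminal can be written as a regex fragment and the fragments glued together. The following is a sketch of that idea in Java rather than a definitive translation: the class name SeatGrammarRegex, the group names, and the single spaces between the parts are my assumptions based on the sample line, and the 4-12 length bound is taken from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class: one regex fragment per nonterminal of the grammar above.
final class SeatGrammarRegex {
    // number -> digit | number number: one or more digits
    static final String NUMBER = "[0-9]+";
    // seating -> "Seat " number ":"
    static final String SEATING = "Seat (?<seat>" + NUMBER + "):";
    // nickname -> any character, repeated; 4-12 characters as stated in the question
    static final String NICKNAME = "(?<nick>.{4,12})";
    // chips -> "(" number " in chips)"
    static final String CHIPS = "\\((?<chips>" + NUMBER + ") in chips\\)";
    // S -> seating nickname chips, separated by the single spaces of the sample line
    static final Pattern S = Pattern.compile(SEATING + " " + NICKNAME + " " + CHIPS);

    /** Returns {seat, nickname, chips}, or null if the whole line does not match. */
    static String[] parse(String line) {
        Matcher m = S.matcher(line);
        return m.matches()
                ? new String[] { m.group("seat"), m.group("nick"), m.group("chips") }
                : null;
    }
}
```

Keeping one fragment per nonterminal makes it easy to compare the pattern against the grammar rule by rule.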

I find it helpful to write the grammar as a CFG first and then convert it to a regular grammar. More on how to do that here. Good luck!

Answered 2012-06-18T14:47:36.847