4

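Sorry for the vague title, but I really don't know how to describe this problem concisely.

I've created a (more or less) simple domain-specific language that I will use to specify which validation rules apply to different entities (generally forms submitted from a webpage). I've included a sample at the bottom of this post showing what the language looks like.

My problem is that I have no idea how to begin parsing this language into a form I can use (I will be using Python to do the parsing). My goal is to end up with a list of rules/filters (as strings, including arguments, e.g. 'cocoa(99)') that should be applied (in order) to each object/entity (also a string, e.g. 'chocolate', 'chocolate.lindt', etc.).

I'm not sure what technique to start with, or even what techniques exist for problems like this. What do you think is the best way to go about it? I'm not looking for a complete solution, just a general nudge in the right direction.

Thanks.

Sample file of the language: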

# Comments start with the '#' character and last until the end of the line
# Indentation is significant (as in Python)


constant NINETY_NINE = 99       # Defines the constant `NINETY_NINE` to have the value `99`


*:      # Applies to all data
    isYummy             # Everything must be yummy

chocolate:              # To validate, say `validate("chocolate", object)`
    sweet               # chocolate must be sweet (but not necessarily chocolate.*)

    lindt:              # To validate, say `validate("chocolate.lindt", object)`
        tasty           # Applies only to chocolate.lindt (and not to chocolate.lindt.dark, for e.g.)

        *:              # Applies to all data under chocolate.lindt
            smooth      # Could also be written smooth()
            creamy(1)   # Level 1 creamy
        dark:           # dark has no special validation rules
            extraDark:
                melt            # Filter that modifies the object being examined
                c:bitter        # Must be bitter, but only validated on client
                s:cocoa(NINETY_NINE)    # Must contain 99% cocoa, but only validated on server. Note constant
        milk:
            creamy(2)   # Level 2 creamy, overrides creamy(1) of chocolate.lindt.* for chocolate.lindt.milk
            creamy(3)   # Overrides creamy(2) of previous line (all but the last specification of a given rule are ignored)



ruleset food:       # To define a chunk of validation rules that can be expanded from the placeholder `food` (think macro)
    caloriesWithin(10, 2000)        # Unlimited parameters allowed
    edible
    leftovers:      # Nested rules allowed in rulesets
        stale

# Rulesets may be nested and/or include other rulesets in their definition



chocolate:              # Previously defined groups can be re-opened and expanded later
    ferrero:
        hasHazelnut



cake:
    tasty               # Same rule used for different data (see chocolate.lindt)
    isLie
    ruleset food        # Substitutes with rules defined for food; cake.leftovers must now be stale


pasta:
    ruleset food        # pasta.leftovers must also be stale




# Sample use (in JavaScript):

# var choc = {
#   lindt: {
#       cocoa: {
#           percent: 67,
#           mass:    '27g'
#       }
#   }
#   // Objects/groups that are omitted (e.g. ferrero in this example) are not validated and raise no errors
#   // Objects that are not defined in the validation rules do not raise any errors (e.g. cocoa in this example)
# };
# validate('chocolate', choc);

# `validate` called isYummy(choc), sweet(choc), isYummy(choc.lindt), smooth(choc.lindt), creamy(choc.lindt, 1), and tasty(choc.lindt) in that order
# `validate` returned an array of any validation errors that were found

# Order of rule validation for objects:
# The current object is initially the object passed in to the validation function (second argument).
# The entry point in the rule group hierarchy is given by the first argument to the validation function.
# 1. First all rules that apply to all objects (defined using '*') are applied to the current object,
#    starting with the most global rules and ending with the most local ones.
# 2. Then all specific rules for the current object are applied.
# 3. Then a depth-first traversal of the current object is done, repeating steps 1 and 2 with each object found as the current object
# When two rules have equal priority, they are applied in the order they were defined in the file.



# No need to end on blank line

7 Answers

9

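First, if you want to learn about parsing, write your own recursive descent parser. The language you've defined only requires a handful of productions. I suggest using Python's tokenize library to spare yourself the boring task of converting a stream of bytes into a stream of tokens.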
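As a rough illustration of that suggestion, here is a minimal recursive descent sketch built on the standard `tokenize` module. It covers only a subset of the question's language (groups such as `chocolate:` or `*:` and rules with optional arguments) and ignores constants, rulesets, and the `c:`/`s:` prefixes. The `Parser` class and its method names are made up for this example.

```python
import io
import tokenize

SKIP = {tokenize.NL, tokenize.COMMENT}  # blank lines and comments carry no meaning

class Parser:
    def __init__(self, source):
        readline = io.StringIO(source).readline
        self.tokens = [t for t in tokenize.generate_tokens(readline)
                       if t.type not in SKIP]
        self.pos = 0
        self.rules = {}  # dotted path -> list of rule strings, in file order

    def peek(self, offset=0):
        # Clamp to the last token (ENDMARKER) so lookahead never runs off the end.
        return self.tokens[min(self.pos + offset, len(self.tokens) - 1)]

    def take(self, type_=None, string=None):
        tok = self.tokens[self.pos]
        if (type_ is not None and tok.type != type_) or \
           (string is not None and tok.string != string):
            raise SyntaxError("unexpected %r on line %d" % (tok.string, tok.start[0]))
        self.pos += 1
        return tok

    def parse(self):
        # file := item* ENDMARKER
        while self.peek().type != tokenize.ENDMARKER:
            self.item(())
        return self.rules

    def item(self, path):
        # item := group | rule; a group header is "<name> : NEWLINE".
        if self.peek(1).string == ":" and self.peek(2).type == tokenize.NEWLINE:
            self.group(path)
        else:
            self.rule(path)

    def group(self, path):
        # group := (NAME | '*') ':' NEWLINE INDENT item+ DEDENT
        name = self.take().string
        self.take(tokenize.OP, ":")
        self.take(tokenize.NEWLINE)
        self.take(tokenize.INDENT)
        while self.peek().type != tokenize.DEDENT:
            self.item(path + (name,))
        self.take(tokenize.DEDENT)

    def rule(self, path):
        # rule := NAME [ '(' ... ')' ] NEWLINE
        text = self.take(tokenize.NAME).string
        if self.peek().string == "(":
            while self.peek().string != ")":
                text += self.take().string
            text += self.take().string  # the closing parenthesis
        self.take(tokenize.NEWLINE)
        self.rules.setdefault(".".join(path), []).append(text)

SAMPLE = """\
*:      # Applies to all data
    isYummy

chocolate:
    sweet
    lindt:
        tasty
        *:
            smooth
            creamy(1)
"""

result = Parser(SAMPLE).parse()
print(result)
```

Because `tokenize` already emits `INDENT` and `DEDENT` tokens, the parser never has to count spaces itself; each grammar production becomes one method.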

For practical parsing options, read on...

A quick and dirty solution is to use Python itself:

NINETY_NINE = 99       # Defines the constant `NINETY_NINE` to have the value `99`

rules = {
  '*': {     # Applies to all data
    'isYummy': {},      # Everything must be yummy

    'chocolate': {        # To validate, say `validate("chocolate", object)`
      'sweet': {},        # chocolate must be sweet (but not necessarily chocolate.*)

      'lindt': {          # To validate, say `validate("chocolate.lindt", object)`
        'tasty': {},      # Applies only to chocolate.lindt (and not to chocolate.lindt.dark, for e.g.)

        '*': {            # Applies to all data under chocolate.lindt
          'smooth': {},   # Could also be written smooth()
          'creamy': 1     # Level 1 creamy
        },
        # ...
      }
    }
  }
}
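One possible way to consume such a dict, sketched under a few assumptions: a non-empty nested dict is treated as a group, an empty dict as a rule without arguments, and any other value as a single rule argument. `flatten` is a hypothetical helper, not part of the answer, and it relies on dicts preserving insertion order (Python 3.7+).

```python
def flatten(tree, path=()):
    """Yield (dotted_path, rule_string) pairs in definition order."""
    for name, value in tree.items():
        if isinstance(value, dict) and value:
            # Non-empty dict: a nested group, so recurse with a longer path.
            yield from flatten(value, path + (name,))
        elif isinstance(value, dict):
            # Empty dict: a rule without arguments.
            yield ".".join(path), name
        else:
            # Any other value: a rule with one argument, e.g. creamy(1).
            yield ".".join(path), "%s(%r)" % (name, value)

rules = {
    '*': {
        'isYummy': {},
        'chocolate': {
            'sweet': {},
            'lindt': {
                'tasty': {},
                '*': {'smooth': {}, 'creamy': 1},
            },
        },
    },
}

pairs = list(flatten(rules))
for path, rule in pairs:
    print(path, rule)
```

The result is close to what the question asks for: rule strings such as 'creamy(1)' paired with the entity path they apply to.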

There are several ways to pull off this trick; for example, here is a cleaner (albeit somewhat unusual) approach using classes:

class _:
    class isYummy: pass

    class chocolate:
        class sweet: pass

        class lindt:
            class tasty: pass

            class _:
                class smooth: pass
                class creamy: level = 1
# ...
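If the class form is used, the rules can be read back by introspection. The sketch below assumes that a class containing nested classes is a group, that a class without nested classes is a rule, and that `_` stands for `*`. `walk` is a hypothetical helper, and the order of results relies on class bodies preserving definition order (Python 3.6+).

```python
import inspect

class _:
    class isYummy: pass

    class chocolate:
        class sweet: pass

        class lindt:
            class tasty: pass

            class _:
                class smooth: pass
                class creamy: level = 1

def walk(cls, path=("*",)):
    """Yield (dotted_path, rule_name, params) for every leaf class."""
    for name, member in vars(cls).items():
        if not inspect.isclass(member):
            continue  # skip __module__, __doc__, and other non-class entries
        label = "*" if name == "_" else name
        if any(inspect.isclass(v) for v in vars(member).values()):
            # Contains nested classes: treat it as a group and descend.
            yield from walk(member, path + (label,))
        else:
            # Leaf class: a rule; its plain attributes are its parameters.
            params = {k: v for k, v in vars(member).items()
                      if not k.startswith("__")}
            yield ".".join(path), label, params

found = list(walk(_))
for entry in found:
    print(entry)
```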

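As an intermediate step towards a full parser, you can use the "batteries-included" Python parser, which parses Python syntax and returns an AST. The AST is very deep, with lots of (IMO) unnecessary levels. You can filter these down to a much simpler structure by culling any nodes that have only one child. With this approach, you can do something like this: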

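# Note: the `parser` and `symbol` modules were deprecated in Python 3.9 and
# removed in Python 3.10, so this snippet needs Python 3.9 or earlier.
import parser, token, symbol, pprint

# Map numeric token and grammar-symbol IDs to their readable names.
_map = dict(token.tok_name)
_map.update(symbol.sym_name)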

def clean_ast(ast):
    if not isinstance(ast, list):
        return ast
    elif len(ast) == 2: # Elide single-child nodes.
        return clean_ast(ast[1])
    else:
        return [_map[ast[0]]] + [clean_ast(a) for a in ast[1:]]

ast = parser.expr('''{

'*': {     # Applies to all data
  isYummy: _,    # Everything must be yummy

  chocolate: {        # To validate, say `validate("chocolate", object)`
    sweet: _,        # chocolate must be sweet (but not necessarily chocolate.*)

    lindt: {          # To validate, say `validate("chocolate.lindt", object)`
      tasty: _,        # Applies only to chocolate.lindt (and not to chocolate.lindt.dark, for e.g.)

      '*': {            # Applies to all data under chocolate.lindt
        smooth: _,  # Could also be written smooth()
        creamy: 1   # Level 1 creamy
      }
# ...
    }
  }
}

}''').tolist()
pprint.pprint(clean_ast(ast))
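On current Python versions the same idea can be tried with the standard `ast` module, which is a different module from the `parser` module used above and produces a much shallower tree. The following is a sketch, not the answer's code: `to_tree` is a hypothetical helper that accepts only dict displays, bare names, and literals, and rejects everything else, so no code from the input is ever executed.

```python
import ast

def to_tree(node):
    """Convert a restricted expression AST into plain dicts, names, and literals."""
    if isinstance(node, ast.Dict):
        return {to_tree(k): to_tree(v) for k, v in zip(node.keys, node.values)}
    if isinstance(node, ast.Name):
        return node.id        # bare identifiers such as sweet or _
    if isinstance(node, ast.Constant):
        return node.value     # literals such as '*' or 1
    raise ValueError("unsupported syntax: %s" % type(node).__name__)

source = '''{
  '*': {
    isYummy: _,
    chocolate: {
      sweet: _,
      lindt: {
        tasty: _,
        '*': {
          smooth: _,
          creamy: 1
        }
      }
    }
  }
}'''

# mode="eval" parses a single expression; .body is that expression's node.
tree = to_tree(ast.parse(source, mode="eval").body)
print(tree)
```

Whitelisting node types keeps the property the answer highlights: the input is only parsed, never evaluated.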

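This approach does have its limitations. The final AST is still a bit noisy, and the language you define has to be interpretable as valid Python code. For instance, you couldn't support this...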

*:
    isYummy

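...because this syntax doesn't parse as Python code. Its big advantage, however, is that you control the AST conversion, so it is impossible to inject arbitrary Python code.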

answered 2010-01-10T05:58:27.877
5

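Again, this doesn't teach you about parsing, but your format is so close to legal YAML that you might want to simply redefine your language as a subset of YAML and use a standard YAML parser.

answered 2010-03-01T23:48:39.860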
3

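If your goal is to learn about parsing, I highly recommend an OO-style library like PyParsing. Such libraries are not as fast as the more sophisticated ANTLR, lex, and yacc options, but you can start parsing right away.

answered 2010-01-10T07:01:41.150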
2

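As 'Marcelo Cantos' suggested, you can use a Python dict. The benefit is that you don't have to parse anything: you can use the same rules on the server side as a Python dict and on the client side as a JavaScript object, and pass them from server to client or vice versa as JSON.

If you really want to do the parsing yourself, see http://nedbatchelder.com/text/python-parsers.html

But I'm not sure you'll be able to parse an indentation-based language easily.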
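For what it's worth, indentation can be handled with a small stack of open groups. The sketch below is line-based and makes several assumptions: indentation uses spaces only, `#` never appears inside a rule, and only groups and rules are present (no constants or rulesets). `parse_indented` is a hypothetical name.

```python
def parse_indented(text):
    """Return {dotted_path: [rule, ...]} for a groups-and-rules subset of the DSL."""
    rules = {}
    stack = []  # (indent_width, group_name) for each currently open group
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments and trailing spaces
        if not line.strip():
            continue                           # blank or comment-only line
        indent = len(line) - len(line.lstrip(" "))
        # Close every group that is indented at least as deep as this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        body = line.strip()
        if body.endswith(":"):
            stack.append((indent, body[:-1].strip()))  # open a new group
        else:
            path = ".".join(name for _indent, name in stack)
            rules.setdefault(path, []).append(body)
    return rules

sample = """
chocolate:
    sweet               # chocolate must be sweet
    lindt:
        tasty
        *:
            smooth
            creamy(1)
        milk:
            creamy(2)
cake:
    isLie
"""

parsed = parse_indented(sample)
print(parsed)
```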

answered 2010-01-10T06:20:51.853
1

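The example language you've shown is probably too complex to write a simple (and bug-free) parsing function for. I'd suggest reading up on parsing techniques such as recursive descent or table-driven parsing, e.g. LL(1), LL(k), etc.

But that may be too general and/or complex. It might be easier to simplify your rules language to something simple like delimited text.

For example, something like

chocolate:sweet
chocolate.lindt:tasty
chocolate.lindt.*:smooth,creamy(1)

This would be easier to parse and could be done without a formal parser.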
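A delimited format of this kind can be split with a few lines of standard-library code. The following is only a sketch: it assumes one `entity:rules` pair per line and no nested parentheses in rule arguments, and `parse_line` is a hypothetical helper.

```python
import re

# A rule is a name, optionally followed by a parenthesised argument list.
RULE = re.compile(r"\w+(?:\([^)]*\))?")

def parse_line(line):
    """Split 'entity:rule,rule(arg)' into (entity, [rules])."""
    entity, _sep, rule_text = line.partition(":")
    return entity.strip(), RULE.findall(rule_text)

lines = [
    "chocolate:sweet",
    "chocolate.lindt:tasty",
    "chocolate.lindt.*:smooth,creamy(1)",
]

parsed = [parse_line(line) for line in lines]
for entity, rule_list in parsed:
    print(entity, rule_list)
```

A regular expression is used instead of a plain `split(",")` so that commas inside an argument list, as in `caloriesWithin(10, 2000)`, do not break a rule apart.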

answered 2010-01-10T06:23:55.653
0

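There are libraries and tools that make parsing easier. One of the better-known ones is lex/yacc. There is a Python library called "lex" and a tutorial on using it.

answered 2010-01-10T06:18:55.913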
0

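What's the motivation for a customized file structure? Would it be possible to remodel your data into a better-known structure such as XML? If so, you could use one of the many available parsers to parse your file. Using an accepted parsing tool could save you a lot of debugging time, and it may make your file more readable, if that is a consideration.

answered 2010-01-10T07:02:57.197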