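Well, you have gotten yourself off to a decent start. But from here it is easy to get bogged down in the details of parser tweaking, and you could stay in that mode for days. Let's step through your problem, beginning with the original query syntax.
When you start a project like this, write a BNF of the syntax you want to parse. It doesn't have to be very rigorous; in fact, here is a start at one, based on what I can see from your sample: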
word :: Word('a'-'z', 'A'-'Z', '0'-'9', '.-/&§')
field_qualifier :: '[' word+ ']'
search_term :: (word+ | quoted_string) field_qualifier?
and_op :: 'and'
or_op :: 'or'
and_term :: or_term (and_op or_term)*
or_term :: atom (or_op atom)*
atom :: search_term | ('(' and_term ')')
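This is pretty close. There is some possible ambiguity between word and the and_op and or_op expressions, since 'and' and 'or' do match the definition of a word. We'll need to tighten this up at implementation time, to make sure that "cancer or carcinoma or lymphoma or melanoma" gets read as 4 separate search terms separated by 'or's, and not just as one big term (which I think is what your current parser would do). We also get the benefit of recognizing operator precedence. It may not be strictly necessary, but let's go with it for now.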
Converting to pyparsing is simple enough:
LBRACK,RBRACK,LPAREN,RPAREN = map(Suppress,"[]()")
and_op = CaselessKeyword('and')
or_op = CaselessKeyword('or')
word = Word(alphanums + '.-/&')
field_qualifier = LBRACK + OneOrMore(word) + RBRACK
search_term = ((Group(OneOrMore(word)) | quotedString)('search_text') +
               Optional(field_qualifier)('field'))
expr = Forward()
atom = search_term | (LPAREN + expr + RPAREN)
or_term = atom + ZeroOrMore(or_op + atom)
and_term = or_term + ZeroOrMore(and_op + or_term)
expr << and_term
To address the ambiguity of 'or' and 'and', we put a negative lookahead at the beginning of word:
word = ~(and_op | or_op) + Word(alphanums + '.-/&')
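To see what the lookahead buys us, here is a small sketch that compares the two definitions of word on a short input. The names greedy_word and guarded_word are made up for this illustration only; everything else is standard pyparsing.

```python
from pyparsing import CaselessKeyword, OneOrMore, Word, alphanums

and_op = CaselessKeyword('and')
or_op = CaselessKeyword('or')

# Hypothetical names, used only for this comparison
greedy_word = Word(alphanums + '.-/&')
guarded_word = ~(and_op | or_op) + Word(alphanums + '.-/&')

text = "cancer or carcinoma"

# Without the lookahead, 'or' is swallowed as just another word
greedy = OneOrMore(greedy_word).parseString(text).asList()

# With the lookahead, matching stops in front of the operator
guarded = OneOrMore(guarded_word).parseString(text).asList()

# Keywords only match whole words, so 'oregon' is still an ordinary word
whole_words = OneOrMore(guarded_word).parseString("oregon trail").asList()

print(greedy)       # ['cancer', 'or', 'carcinoma']
print(guarded)      # ['cancer']
print(whole_words)  # ['oregon', 'trail']
```

Because the operators are defined with CaselessKeyword rather than a plain literal, words that merely start with "or" or "and" are not rejected by the lookahead.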
To give some structure to the results, wrap the expressions in Group classes:
field_qualifier = Group(LBRACK + OneOrMore(word) + RBRACK)
search_term = Group(Group(OneOrMore(word) | quotedString)('search_text') +
                    Optional(field_qualifier)('field'))
expr = Forward()
atom = search_term | (LPAREN + expr + RPAREN)
or_term = Group(atom + ZeroOrMore(or_op + atom))
and_term = Group(or_term + ZeroOrMore(and_op + or_term))
expr << and_term
Now parsing your sample text with:
res = expr.parseString(test)
from pprint import pprint
pprint(res.asList())
gives:
[[[[[[['"breast neoplasms"'], ['MeSH', 'Terms']],
     'or',
     [['breast', 'cancer'], ['Acknowledgments']],
     'or',
     [['breast', 'cancer'], ['Figure/Table', 'Caption']],
     'or',
     [['breast', 'cancer'], ['Section', 'Title']],
     'or',
     [['breast', 'cancer'], ['Body', '-', 'All', 'Words']],
     'or',
     [['breast', 'cancer'], ['Title']],
     'or',
     [['breast', 'cancer'], ['Abstract']],
     'or',
     [['breast', 'cancer'], ['Journal']]]]],
  'and',
  [[[[['prevention'], ['Acknowledgments']],
     'or',
     [['prevention'], ['Figure/Table', 'Caption']],
     'or',
     [['prevention'], ['Section', 'Title']],
     'or',
     [['prevention'], ['Body', '-', 'All', 'Words']],
     'or',
     [['prevention'], ['Title']],
     'or',
     [['prevention'], ['Abstract']]]]]]]
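Actually, this is pretty similar to the results from your parser. We could now recurse through this structure and build up your new query string, but I prefer to do this using parsed objects, created at parse time by defining classes as token containers instead of Groups, and then adding behavior to the classes to get our desired output. The distinction is that our parsed object token containers can have behavior that is specific to the kind of expression that was parsed.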
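For comparison, here is a rough sketch of what the recursive approach over the nested lists might look like. The helpers is_search_term and to_query are hypothetical names, and the output format (lowercased field, then a colon, then the text) is an assumption for illustration. Note how the code has to guess each node's type from its shape, which is the kind of bookkeeping the class-based approach avoids.

```python
def is_search_term(node):
    """True if node looks like [[word, ...]] or [[word, ...], [field_word, ...]]."""
    return all(
        isinstance(part, list) and all(isinstance(w, str) for w in part)
        for part in node
    )


def to_query(node):
    """Recursively turn a nested-list parse result into a query string."""
    if is_search_term(node):
        text = ' '.join(node[0])
        if len(node) > 1:
            field = ' '.join(f.lower() for f in node[1])
            return '%s: %s' % (field, text)
        return text
    # Otherwise node is [operand, op, operand, op, ...]
    operands = [to_query(child) for child in node[0::2]]
    if len(operands) == 1:
        return operands[0]
    joiner = ' %s ' % node[1].upper()
    return '(%s)' % joiner.join(operands)


# A small hand-written sample in the same shape as the asList() output
nested = [[[[['prevention'], ['Title']],
            'or',
            [['prevention'], ['Abstract']]]]]
query = to_query(nested)
print(query)  # (title: prevention OR abstract: prevention)
```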
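We'll begin with a base abstract class, ParsedObject, that takes the parsed tokens as its initializing structure. We'll also add an abstract method, queryString, which we'll implement in all the deriving classes to create your desired output: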
class ParsedObject(object):
    def __init__(self, tokens):
        self.tokens = tokens
    def queryString(self):
        '''Abstract method to be overridden in subclasses'''
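Now we can derive from this class, and any subclass can be used as a parse action in defining the grammar.
When we do this, the Groups that were added for structure kind of get in our way, so we'll redefine the original parser without them: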
search_term = (Group(OneOrMore(word) | quotedString)('search_text') +
               Optional(field_qualifier)('field'))
atom = search_term | (LPAREN + expr + RPAREN)
or_term = atom + ZeroOrMore(or_op + atom)
and_term = or_term + ZeroOrMore(and_op + or_term)
expr << and_term
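If using a class as a parse action is unfamiliar, here is a minimal sketch of the mechanism on a toy grammar. The Shout class and the phrase expression are invented for this illustration and are not part of the query parser; the point is only that pyparsing calls the class with the matched tokens and puts the resulting instance into the parse results.

```python
from pyparsing import OneOrMore, Word, alphas


class ParsedObject(object):
    def __init__(self, tokens):
        self.tokens = tokens


class Shout(ParsedObject):
    """Hypothetical subclass, only for demonstration."""
    def queryString(self):
        return ' '.join(t.upper() for t in self.tokens)


phrase = OneOrMore(Word(alphas))
# The class itself is the parse action; pyparsing calls Shout(tokens)
phrase.setParseAction(Shout)

result = phrase.parseString("breast cancer")[0]
print(type(result).__name__)   # Shout
print(result.queryString())    # BREAST CANCER
```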
Now we implement the class for search_term, using self.tokens to access the parsed bits found in the input string:
class SearchTerm(ParsedObject):
    def queryString(self):
        text = ' '.join(self.tokens.search_text)
        if self.tokens.field:
            return '%s: %s' % (' '.join(f.lower()
                        for f in self.tokens.field[0]), text)
        else:
            return text
search_term.setParseAction(SearchTerm)
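Next we'll implement the and_term and or_term expressions. Both are binary operators that differ only in the operator string they produce in the output query, so we can define just one class and let each subclass provide a class constant for its operator string: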
class BinaryOperation(ParsedObject):
    def queryString(self):
        joinstr = ' %s ' % self.op
        return joinstr.join(t.queryString() for t in self.tokens[0::2])
class OrOperation(BinaryOperation):
    op = "OR"
class AndOperation(BinaryOperation):
    op = "AND"
or_term.setParseAction(OrOperation)
and_term.setParseAction(AndOperation)
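Note that pyparsing is a little different from traditional parsers: our BinaryOperation will match "a or b or c" as a single expression, not as the nested pairs "(a or b) or c". So we have to rejoin all of the terms using the stepping slice [0::2].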
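As a quick illustration of the stepping slice, here is a sketch on a plain list standing in for the flat token sequence that one or_term match would hold. The variable names are made up for the example.

```python
# Flat token sequence for "a or b or c": operands and operators alternate
tokens = ['a', 'or', 'b', 'or', 'c']

operands = tokens[0::2]    # every second item, starting at index 0
operators = tokens[1::2]   # every second item, starting at index 1

joined = ' OR '.join(operands)

print(operands)   # ['a', 'b', 'c']
print(operators)  # ['or', 'or']
print(joined)     # a OR b OR c
```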
Finally, we add a parse action to reflect any nesting by wrapping all expressions in ()'s:
class Expr(ParsedObject):
    def queryString(self):
        return '(%s)' % self.tokens[0].queryString()
expr.setParseAction(Expr)
For your convenience, here is the entire parser in one copy/pastable block:
from pyparsing import *

LBRACK,RBRACK,LPAREN,RPAREN = map(Suppress,"[]()")
and_op = CaselessKeyword('and')
or_op = CaselessKeyword('or')
# negative lookahead keeps 'and'/'or' from being read as search words
word = ~(and_op | or_op) + Word(alphanums + '.-/&')
field_qualifier = Group(LBRACK + OneOrMore(word) + RBRACK)

search_term = (Group(OneOrMore(word) | quotedString)('search_text') +
               Optional(field_qualifier)('field'))
expr = Forward()
atom = search_term | (LPAREN + expr + RPAREN)
or_term = atom + ZeroOrMore(or_op + atom)
and_term = or_term + ZeroOrMore(and_op + or_term)
expr << and_term

# define classes for parsed structure
class ParsedObject(object):
    def __init__(self, tokens):
        self.tokens = tokens
    def queryString(self):
        '''Abstract method to be overridden in subclasses'''

class SearchTerm(ParsedObject):
    def queryString(self):
        text = ' '.join(self.tokens.search_text)
        if self.tokens.field:
            return '%s: %s' % (' '.join(f.lower()
                        for f in self.tokens.field[0]), text)
        else:
            return text
search_term.setParseAction(SearchTerm)

class BinaryOperation(ParsedObject):
    def queryString(self):
        joinstr = ' %s ' % self.op
        # operands sit at the even positions, operators at the odd ones
        return joinstr.join(t.queryString()
                        for t in self.tokens[0::2])
class OrOperation(BinaryOperation):
    op = "OR"
class AndOperation(BinaryOperation):
    op = "AND"
or_term.setParseAction(OrOperation)
and_term.setParseAction(AndOperation)

class Expr(ParsedObject):
    def queryString(self):
        return '(%s)' % self.tokens[0].queryString()
expr.setParseAction(Expr)

test = """("breast neoplasms"[MeSH Terms] OR breast cancer[Acknowledgments]
OR breast cancer[Figure/Table Caption] OR breast cancer[Section Title]
OR breast cancer[Body - All Words] OR breast cancer[Title]
OR breast cancer[Abstract] OR breast cancer[Journal])
AND (prevention[Acknowledgments] OR prevention[Figure/Table Caption]
OR prevention[Section Title] OR prevention[Body - All Words]
OR prevention[Title] OR prevention[Abstract])"""
res = expr.parseString(test)[0]
print(res.queryString())
Which prints the following:
((mesh terms: "breast neoplasms" OR acknowledgments: breast cancer OR
figure/table caption: breast cancer OR section title: breast cancer OR
body - all words: breast cancer OR title: breast cancer OR
abstract: breast cancer OR journal: breast cancer) AND
(acknowledgments: prevention OR figure/table caption: prevention OR
section title: prevention OR body - all words: prevention OR
title: prevention OR abstract: prevention))
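I'm guessing you'll need to tighten up some of this output. Those lucene tag names look very ambiguous; I was just following your posted sample. But you shouldn't have to change the parser much, just adjust the queryString methods of the attached classes.
As an added exercise to the poster: add support for the NOT boolean operator in your query language.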