I'm not sure a complete list of syntax errors exists anywhere, but here are some that we have dealt with:
1) Encoding issues: special characters such as % and & should not be
passed through as-is, because they can break the whole query.
2) Two asterisks in a row: ** may cause infinite loops or bring the
system to its knees if leading and trailing wildcards are accepted.
A search term that consists of a single asterisk isn't allowed in
our system either.
3) (Optionally) for boolean queries, ensure that opening and closing
brackets match.
4) Strip the punctuation, but do it with care: e.g. if U.S. turns
into US in the query, then to ensure findability (recall matters to
us) we make sure the same happens during tokenization. We also
identify URLs and don't remove punctuation from them.
5) Some errors relate to malformed proximity operators (like near,
~), e.g. we don't allow them to be nested, and we don't allow boolean
operators inside them.
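
To make points 1 to 4 concrete, here is a minimal sketch in Python. It is not our production code, and all names in it (QuerySyntaxError, normalize_query and the helpers) are made up for illustration. It assumes whitespace-separated terms, round brackets for boolean grouping, and that runs of asterisks are collapsed rather than rejected, which is only one possible policy. Proximity operators (point 5) depend too much on your grammar to show here.

```python
import re
from urllib.parse import quote


class QuerySyntaxError(ValueError):
    """Raised when a query breaks the (hypothetical) search grammar."""


# Assumption: a term starting with http://, https:// or www. is a URL.
URL_RE = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)


def collapse_wildcards(query: str) -> str:
    # Point 2: turn runs of asterisks ("**", "***") into a single "*".
    return re.sub(r"\*{2,}", "*", query)


def check_wildcards(query: str) -> None:
    # Point 2: a term that is only an asterisk is not allowed.
    for term in query.split():
        if term.strip("()") == "*":
            raise QuerySyntaxError("a search term cannot be a lone asterisk")


def check_brackets(query: str) -> None:
    # Point 3: every "(" needs a later ")" and vice versa.
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise QuerySyntaxError("closing bracket without an opening one")
    if depth != 0:
        raise QuerySyntaxError("unclosed opening bracket")


def strip_punctuation(term: str) -> str:
    # Point 4: leave URLs alone; elsewhere drop punctuation but keep the
    # characters this sketch treats as operators: * ( ) ~ and double quotes.
    if URL_RE.match(term):
        return term
    return re.sub(r'[^\w\s*()~"]', "", term)


def normalize_query(query: str) -> str:
    query = collapse_wildcards(query)
    check_brackets(query)
    check_wildcards(query)
    return " ".join(strip_punctuation(t) for t in query.split())


def encode_for_url(query: str) -> str:
    # Point 1: percent-encode before placing the query in a URL, so that
    # characters like % and & cannot change the meaning of the request.
    return quote(query, safe="")
```

For example, normalize_query("U.S. econom**") gives "US econom*", while normalize_query("(cats OR dogs") raises QuerySyntaxError. Whatever strip_punctuation does to the query must also be done by the tokenizer on the indexing side, otherwise U.S. in a document will never match US in a query.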
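I would also add that you can control some syntax errors through the syntax you define for your users. That is, don't let them do what you don't want them to do. This also sets up a kind of search contract between your users and your application. It also helps to provide tooltip-like hints that tell users which typical syntax can be used for which purpose.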