I've been trying to work out a regular expression for a Gmail-like search, i.e.:
name:Joe surname:(Foo Bar)
...as in this topic, but with a slight difference: if there is text without a key, it should be split off as well, so that
foo:(hello world) bar:(-{bad things}) some text to search
would return:
foo:(hello world)
bar:(-{bad things})
some text to search
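1. Grab the keywords and their associated text with the following regex (see RegExr):
([a-zA-Z]+:(?:\([^)]+?\)|[^( ]+))
2. Remove them from the search string. Regex.Replace returns a new string, so assign the result:
searchtext = Regex.Replace(searchtext, @"[a-zA-Z]+:(?:\([^)]+?\)|[^( ]+)", "");
3. Collapse the runs of spaces left behind (note the leading space in the pattern :)
searchtext = Regex.Replace(searchtext, @" {2,}", " ");
It is entirely possible to do the space removal inside the regex from step 2, but when working with regexes I prefer to keep each one as clean as possible.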
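Since Java also comes up further down the thread, here is a rough Java sketch of the same three steps. The class name KeywordStripper, the method split and the final trim() are my own additions for illustration, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeywordStripper {
    // Pattern from step 1: a key, a colon, then either a parenthesised value
    // or a run of characters containing neither a space nor '('.
    private static final Pattern KEYWORD =
            Pattern.compile("[a-zA-Z]+:(?:\\([^)]+?\\)|[^( ]+)");

    /** Returns every key:value token, followed by the leftover free text if any. */
    static List<String> split(String searchText) {
        List<String> parts = new ArrayList<>();
        Matcher m = KEYWORD.matcher(searchText);
        while (m.find()) {
            parts.add(m.group());                  // step 1: collect the keyed tokens
        }
        String rest = KEYWORD.matcher(searchText).replaceAll(""); // step 2: strip them
        rest = rest.replaceAll(" {2,}", " ");      // step 3: collapse runs of spaces
        rest = rest.trim();                        // assumption: edge spaces are unwanted
        if (!rest.isEmpty()) {
            parts.add(rest);
        }
        return parts;
    }

    public static void main(String[] args) {
        String query = "foo:(hello world) bar:(-{bad things}) some text to search";
        for (String part : split(query)) {
            System.out.println(part);
        }
    }
}
```

For the query from the question this should give the two keyed tokens followed by "some text to search" as a single entry.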
I would suggest either using the Lucene.NET engine's built-in parser to give you the tokens, or using a grammar and a parser such as GoldParser, Irony or Antlr.
"Name" = 'Spruce Search Grammar'
"Version" = '1.1'
"About" = 'The search grammar for Spruce TFS MVC frontend'
"Start Symbol" = <Query>
! -------------------------------------------------
! Character Sets
! -------------------------------------------------
{Valid} = {All Valid} - ['-'] - ['OR'] - {Whitespace} - [':'] - ["] - ['']
{Quoted} = {All Valid} - ["] - ['']
! -------------------------------------------------
! Terminals
! -------------------------------------------------
AnyChar = {Valid}+
Or = 'OR'
Negate = ['-']
StringLiteral = '' {Quoted}+ '' | '"' {Quoted}+ '"'
! -- Field-specific terms
Project = 'project' ':'
CreatedOn = 'created-on' ':'
ResolvedOn = 'resolved-on' ':'
! -------------------------------------------------
! Rules
! -------------------------------------------------
! The grammar starts below
<Query> ::= <Query> <Keywords> | <Keywords>
<SingleWord> ::= AnyChar
<Keywords> ::= <SingleWord>
             | <QuotedString>
             | <Or>
             | <Negate>
             | <FieldTerms>
<Or> ::= <Or> <SingleWord>
       | Or Negate
       | Or <SingleWord>
       | Or <QuotedString>
<Negate> ::= <Negate> Negate <SingleWord>
           | <Negate> Negate <QuotedString>
           | Negate <SingleWord>
           | Negate <QuotedString>
<QuotedString> ::= StringLiteral
<FieldTerms> ::= <FieldTerms> Project | <FieldTerms> Description | <FieldTerms> State
               | <FieldTerms> Type | <FieldTerms> Area | <FieldTerms> Iteration
               | <FieldTerms> AssignedTo | <FieldTerms> ResolvedBy
               | <FieldTerms> ResolvedOn | <FieldTerms> CreatedOn
               | Project
               | <Description>
               | State
               | Type
               | Area
               | Iteration
               | CreatedBy
               | AssignedTo
               | ResolvedBy
               | CreatedOn
               | ResolvedOn
<Description> ::= <Description> Description | <Description> Description StringLiteral
                | Description | Description StringLiteral
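Your search could then be:
resolved-by:john project:"amazing tfs project"
If you look at the <Keywords> rule, you can see that it expects a single word, an OR, a quoted string, a negation (a NOT) or a field term. The hard part comes when the definition becomes recursive, which you can see in the <Description> part.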
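To make the terminal kinds a bit more concrete, here is a small hand-written lexer in Java that sorts a query into the same categories the grammar names (field term, StringLiteral, Or, Negate, AnyChar). It is only a sketch: it does not use GoldParser or Calitha, the class SearchLexer is made up for illustration, and the field name resolved-by is an assumed spelling (project, created-on and resolved-on are the ones the grammar shows).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SearchLexer {
    // One alternative per terminal kind. Field names are illustrative;
    // "resolved-by" is an assumption, the other three appear in the grammar.
    private static final Pattern TOKEN = Pattern.compile(
              "(?<field>(?:project|created-on|resolved-on|resolved-by):)" // field term
            + "|\"(?<quoted>[^\"]+)\""                                    // StringLiteral
            + "|(?<or>OR)(?!\\S)"                                         // Or, whole word only
            + "|(?<negate>-)"                                             // Negate
            + "|(?<word>[^\\s:\"'-]+)");                                  // AnyChar

    /** Returns "KIND text" entries in input order; characters that fit no kind are skipped. */
    static List<String> lex(String query) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            if (m.group("field") != null) {
                tokens.add("FIELD " + m.group("field"));
            } else if (m.group("quoted") != null) {
                tokens.add("QUOTED " + m.group("quoted"));
            } else if (m.group("or") != null) {
                tokens.add("OR " + m.group("or"));
            } else if (m.group("negate") != null) {
                tokens.add("NEGATE " + m.group("negate"));
            } else {
                tokens.add("WORD " + m.group("word"));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lex("resolved-by:john project:\"amazing tfs project\""));
        System.out.println(lex("-bad OR good"));
    }
}
```

A real parser would then apply the grammar rules to this token stream, which is the part a tool like GoldParser generates for you.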
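The syntax is a BNF-style notation (often loosely called EBNF) that describes the format of your language. You can write something simple in it, such as a search query parser, or an entire computer language. The way GoldParser parses the tokens will restrict you, because it looks ahead for tokens (LALR), so languages such as HTML and wiki syntax will break any grammar you attempt to write, as these formats don't force you to close tags/tokens. Antlr gives you LL(*), which is more forgiving of missing start tags/tokens, but that isn't something you need to worry about for a search query parser.
The code folder with my grammar and C# code can be found in this project.
QueryParser is the class that parses the search string, the grammar file is the .grm file, and the 2 MB file is how GoldParser optimises your grammar, basically creating its own table of possibilities. Calitha is the C# library for GoldParser and is easy enough to implement. Without writing an even larger answer it is hard to describe exactly how it's done, but it is fairly straightforward once you have compiled the grammar, and GoldParser has a very intuitive IDE for writing grammars, plus a large set of existing ones for SQL, C#, Java and, I believe, even Perl regexes.
It isn't a one-hour quick fix like the one you'd get from a regex, more like 2-3 days, but you do learn the "proper" way of parsing.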
You don't need to solve this problem using only one regular expression. You can re-use the answer that you linked to that you indicated would partially work.
The last array element is the only one that needs to be corrected.
Using your example you'd initially get:
"foo:(hello world)",
"bar:(-{bad things}) some text to search"
The last item needs to be split into text up to and including the first closing bracket and text following it. You'd then replace the last item with the text up to and including the bracket and then you'd append the text following it to the array.
"foo:(hello world)",
"bar:(-{bad things})",
"some text to search"
The following pseudo code should explain how this can be done:
array;                              // Array returned when string was split using /\s+(?=\w+:)/
lastPosition = array.length - 1;
lastElem = array[lastPosition];     // May contain text without a key
// Key is followed by an opening bracket
// (check for opening bracket after colon following key)
if ( lastElem.match( /^[^:]*:\(/ ) ) {
    // Need to replace array entry with key and all text up to and including
    // closing bracket.
    // Additional text needs to be added to array.
    maxSplitsAllowed = 1;           // Split once only, giving two pieces
    results = lastElem.split( /\)\s*/ , maxSplitsAllowed );
    // White space following the bracket was included in the match so it
    // wouldn't be at the front of the text without a key
    lastKeyAndText = results[0] + ')';      // Re-append closing bracket
    endingTextWithoutKey = results[1];
    array[lastPosition] = lastKeyAndText;   // Correct array entry for last key
    if ( endingTextWithoutKey != '' )
        array.append( endingTextWithoutKey );   // Append text without key
// Key is not followed by an opening bracket but has text without a key
// (check for white space following characters that aren't white space
// characters)
} else if ( lastElem.match( /^[^:]*:\S*\s/ ) ) {
    // Need to change array entry so that all text before first space
    // becomes the key.
    // Additional text needs to be added to array.
    maxSplitsAllowed = 1;           // Split once only, giving two pieces
    results = lastElem.split( /\s+/ , maxSplitsAllowed );
    lastKeyAndText = results[0];
    endingTextWithoutKey = results[1];
    array[lastPosition] = lastKeyAndText;   // Correct array entry for last key
    array.append( endingTextWithoutKey );   // Append text without key
}
I assumed that brackets are required when white space characters are to be included within text that follows a key.
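For anyone who wants to try this out, below is a minimal Java sketch of the same idea. It is not the only way to write it: the class name LastElementFixer is made up, the key is matched with \w+ (as in the split pattern) rather than [^:]*, and indexOf is used to find the first closing bracket instead of a second split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LastElementFixer {
    static List<String> split(String query) {
        // Split on white space that is directly followed by "key:".
        List<String> parts =
                new ArrayList<>(Arrays.asList(query.trim().split("\\s+(?=\\w+:)")));
        int lastPosition = parts.size() - 1;
        String lastElem = parts.get(lastPosition);   // may carry text without a key
        String keyed = lastElem;
        String trailing = "";

        if (lastElem.matches("(?s)\\w+:\\(.*")) {
            // Key followed by an opening bracket: cut after the first ')'.
            int close = lastElem.indexOf(')');
            if (close >= 0) {
                keyed = lastElem.substring(0, close + 1);
                trailing = lastElem.substring(close + 1).trim();
            }
        } else if (lastElem.matches("(?s)\\w+:\\S*\\s.*")) {
            // Key without brackets: everything before the first white space
            // stays with the key, the rest is text without a key.
            String[] pieces = lastElem.split("\\s+", 2);
            keyed = pieces[0];
            trailing = pieces[1];
        }

        parts.set(lastPosition, keyed);              // correct entry for the last key
        if (!trailing.isEmpty()) {
            parts.add(trailing);                     // append text without a key
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("foo:(hello world) bar:(-{bad things}) some text to search"));
    }
}
```

Like the pseudo code, this only looks at the last element, so free text that appears before or between keys is not separated out.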
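You can do this with a single pattern:
\w+:(?:\([^)]*\)|\S+)|\S+
- \w+: - the key.
- (?:) - followed by...
- \([^)]*\) - parentheses
- | - or
- \S+ - some characters that are not spaces.
- |\S+ - or just match a single word.
Note that this pattern breaks the free text into separate matches, one per word. If you really can't deal with that, you can use something like |(?:\S+(\s+(?!\w*:)[^\s:]+)*) instead of the last |\S+.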
Working example: http://ideone.com/bExFd
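Here is a short Java sketch for comparing the two variants, assuming the basic pattern is \w+:(?:\([^)]*\)|\S+)|\S+ and the alternative swaps the final |\S+ for the longer expression. The class and constant names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SinglePatternTokenizer {
    // key:(...) or key:word, otherwise one match per word.
    static final Pattern TOKEN_PER_WORD =
            Pattern.compile("\\w+:(?:\\([^)]*\\)|\\S+)|\\S+");

    // Same, but consecutive words without a key are kept together in one match.
    static final Pattern TOKEN_JOINED =
            Pattern.compile("\\w+:(?:\\([^)]*\\)|\\S+)|(?:\\S+(\\s+(?!\\w*:)[^\\s:]+)*)");

    /** Collects every match of the given pattern, in order. */
    static List<String> tokens(Pattern pattern, String query) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(query);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String query = "foo:(hello world) bar:(-{bad things}) some text to search";
        System.out.println(tokens(TOKEN_PER_WORD, query));
        System.out.println(tokens(TOKEN_JOINED, query));
    }
}
```

With the first pattern the trailing text comes back as four separate words; with the second it comes back as one match.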
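Taking this further, we can use some advanced features of .NET patterns: they keep all captures of all groups, which is a useful feature for building a full parser. Here I've included some other search features, such as quoted strings and operators (OR, or the range operator ..):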
\A(?>
    \s                              # skip over spaces.
  | (?<Key>\w+):                    # Key:
    (?:                             # followed by:
        \((?<KeyValue>[^)]*)\)      # Parentheses
      |                             # or
        (?<KeyValue>\S+)            # a single word
    )
  | ""(?<Term>[^""]*)""             # quoted term
  | (?<Term>\w+)                    # just a word
  | (?<Invalid>.)                   # Any other character isn't valid
)*\z
You can now easily get all the tokens and their positions (you can also zip the Key and KeyValue captures to pair them up):
Regex queryParser = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);
Match m = queryParser.Match(query); // single match!
// ...
var terms = m.Groups["Term"].Captures;
Working example: http://ideone.com/B7tln
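Java's regex engine does not keep every capture of a repeated group and does not allow two groups with the same name, so the single-match trick does not carry over directly. A rough equivalent, sketched below, is to call find() in a loop and record which named group matched each time. The group names (parenValue, wordValue, quoted, word, invalid) and the Token class are invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QueryTokens {
    /** One token: its kind, its text and where it starts in the query. */
    static final class Token {
        final String kind;
        final String text;
        final int start;

        Token(String kind, String text, int start) {
            this.kind = kind;
            this.text = text;
            this.start = start;
        }

        @Override
        public String toString() {
            return kind + "(" + text + ")@" + start;
        }
    }

    private static final Pattern TOKEN = Pattern.compile(
              "\\s+"                               // skip over spaces
            + "|(?<key>\\w+):"                     // key:
            + "(?:\\((?<parenValue>[^)]*)\\)"      //   value in parentheses
            + "|(?<wordValue>\\S+))"               //   or a single-word value
            + "|\"(?<quoted>[^\"]*)\""             // quoted term
            + "|(?<word>\\w+)"                     // just a word
            + "|(?<invalid>.)");                   // any other character is invalid

    static List<Token> tokenize(String query) {
        List<Token> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            if (m.group("key") != null) {
                String valueGroup = m.group("parenValue") != null ? "parenValue" : "wordValue";
                out.add(new Token("Key", m.group("key"), m.start("key")));
                out.add(new Token("KeyValue", m.group(valueGroup), m.start(valueGroup)));
            } else if (m.group("quoted") != null) {
                out.add(new Token("Term", m.group("quoted"), m.start("quoted")));
            } else if (m.group("word") != null) {
                out.add(new Token("Term", m.group("word"), m.start("word")));
            } else if (m.group("invalid") != null) {
                out.add(new Token("Invalid", m.group("invalid"), m.start("invalid")));
            }
            // A pure white-space match sets no group and is simply skipped.
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("foo:(hello world) bar:x \"a b\" word ?"));
    }
}
```

Each Key is immediately followed by its KeyValue in the returned list, which gives the same pairing as zipping the two capture collections in .NET.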
In Java:
// needs java.util.regex.Pattern and java.util.regex.Matcher
Pattern p = Pattern.compile("(\\w+:(\\(.*?\\))|.+)\\s*");
Matcher m = p.matcher("foo:(hello world) bar:(-{bad things}) some text to search");
while (m.find()) {
    Log.v("REGEX", m.group(1));
}
Logcat output:
05-25 15:21:06.242: V/REGEX(18203): foo:(hello world)
05-25 15:21:08.061: V/REGEX(18203): bar:(-{bad things})
05-25 15:21:09.761: V/REGEX(18203): some text to search