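I'm having some trouble "separating" this data. Although a helper function or similar would be an option, I'd really like to solve this with a regex alone (and process the match groups after matching).

This is (part of) the data I have: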

Belgium
Belgium M_Foo
Belgium A_Bar
Belgium M_FooBar
Belgium S_Whooptee Doo
Belgium Xxx
Belgium S_Foo Bar
United Kingdom
United Kingdom W_Foo-Bar
United Kingdom M_Yay
United Kingdom Xxx
United Kingdom S_Derp
United Kingdom F_Doh Lorem
United Kingdom S_Ipsum Dolor
United States of America L_Foo
Macedonia F.Y.R. Xxx
Macedonia F.Y.R. S_Foo Bar
Cyprus (Greek) M_Foo
Congo (Democratic Republic of)
Congo (Democratic Republic of) Q_Yolo

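Essentially, this is an array of "key/value"-style strings. Each string contains a country name (not normalized, so I can't use hard-coded country names or a lookup; it might also be some string other than a country name), optionally followed by either the keyword Xxx or <random_upcase_char>_<random_text>.

I came up with the following regex: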

^(.+?)(?:\s+(Xxx|[A-Z]_.*)?)

Or, with a small difference in the first match group:

^(.*?)(?:\s+(Xxx|[A-Z]_.*)?)

This works for the first strings, the ones starting with Belgium. For those records it returns the following results:

Group 1     Group 2
================================
Belgium
Belgium     M_Foo
Belgium     A_Bar
Belgium     M_FooBar
Belgium     S_Whooptee Doo
Belgium     Xxx
Belgium     S_Foo Bar

However, the following lines cause trouble:

Group 1     Group 2
================================
United
United
United
United
United
United
United
United
Macedonia
Macedonia
Cyprus
Congo
Congo

I'd like the regex to produce the following instead:

Group 1                         Group 2
================================================
United Kingdom
United Kingdom                  W_Foo-Bar
United Kingdom                  M_Yay
United Kingdom                  Xxx
United Kingdom                  S_Derp
United Kingdom                  F_Doh Lorem
United Kingdom                  S_Ipsum Dolor
United States of America        L_Foo
Macedonia F.Y.R.                Xxx
Macedonia F.Y.R.                S_Foo Bar
Cyprus (Greek)                  M_Foo
Congo (Democratic Republic of)
Congo (Democratic Republic of)  Q_Yolo

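But I can't get the first part to match. I'm fairly sure it has to do with the greedy/non-greedy behavior of the first match group, but after fiddling with it for a while I can't get it to work...

I don't care whether extra/other/more match groups are returned. The regex is intended for use in a .NET C# application (in case you're wondering which "dialect" this is).

Any help would be greatly appreciated.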

4 Answers

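With non-greedy matches, anchoring is sometimes essential. A lazy group stops at the first position where the rest of the pattern can succeed, and without an end anchor that happens right after "United". Anchoring to the end of the line solves the problem. Your regex should be: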

^(.+?)(?:\s+(Xxx|[A-Z]_.*))?$

Note that I've also moved the optional (?) quantifier out one grouping level, so that the whitespace is optional too.
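
The difference the anchor makes can be checked with a small sketch. It uses Python's re module purely for illustration (an assumption on my part; the question targets .NET, but both engines treat the lazy quantifier, \s, and $ the same way for these patterns). The helper name split_with is hypothetical, and each line is matched on its own, as it would be when iterating over a string array.

```python
import re

# Pattern from the question: "?" applies to group 2 only, and there is no "$".
UNANCHORED = re.compile(r"^(.+?)(?:\s+(Xxx|[A-Z]_.*)?)")
# Pattern from this answer: the whole suffix is optional and "$" is required.
ANCHORED = re.compile(r"^(.+?)(?:\s+(Xxx|[A-Z]_.*))?$")


def split_with(pattern, line):
    """Return (group 1, group 2) for a single line, or None if it does not match."""
    m = pattern.match(line)
    return (m.group(1), m.group(2)) if m else None


# A few lines taken from the sample data in the question.
for line in [
    "Belgium M_Foo",
    "United Kingdom W_Foo-Bar",
    "Macedonia F.Y.R. Xxx",
    "Congo (Democratic Republic of)",
]:
    print(line, "->", split_with(UNANCHORED, line), "vs", split_with(ANCHORED, line))
```

Without the anchor, the lazy group is satisfied as soon as one whitespace character follows it, so "United Kingdom W_Foo-Bar" yields "United" with an empty second group. With the anchor, the engine has to keep extending group 1 until the rest of the line is either a valid suffix or empty.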

answered 2013-01-17T10:39:52.250

I managed to get what you want with this regex (run in multiline mode):

^((?:.+?| )+?)(?:\s+(Xxx|[A-Z]_.*)|\s)?$

With your input, it gave me this result:

1: Belgium                  2: 
1: Belgium                  2: M_Foo
1: Belgium                  2: A_Bar
1: Belgium                  2: M_FooBar
1: Belgium                  2: S_Whooptee Doo
1: Belgium                  2: Xxx
1: Belgium                  2: S_Foo Bar
1: United Kingdom           2: 
1: United Kingdom           2: W_Foo-Bar
1: United Kingdom           2: M_Yay
1: United Kingdom           2: Xxx
1: United Kingdom           2: S_Derp
1: United Kingdom           2: F_Doh Lorem
1: United Kingdom           2: S_Ipsum Dolor
1: United States of America 2: L_Foo
1: Macedonia F.Y.R.         2: Xxx
1: Macedonia F.Y.R.         2: S_Foo Bar
1: Cyprus (Greek)           2: M_Foo
answered 2013-01-17T10:44:39.437

Try this (case-insensitive):

^([A-Z]+(?:\s+(?!Xxx)[A-Z]+)*(?:\s+\([^)]+\))?)(?:\s+(Xxx|(?:[-A-Z_.]+(?:\s+[-A-Z_.]+)*)))?$

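It works for almost all of your examples. As the demo output shows, the two "Macedonia F.Y.R." lines are split after "Macedonia", because the dots in "F.Y.R." cannot be matched by the [A-Z]+ words of group 1. Frankly, though, you should delimit your data properly.

Demo: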

$ perl -ne '/^([A-Z]+(?:\s+(?!Xxx)[A-Z]+)*(?:\s+\([^)]+\))?)(?:\s+(Xxx|(?:[-A-Z_.]+(?:\s+[-A-Z_.]+)*)))?$/i and print "MATCH: group 1 is \"$1\", group 2 is \"$2\"\n"' <<EOF
> Belgium
> Belgium M_Foo
> Belgium A_Bar
> Belgium M_FooBar
> Belgium S_Whooptee Doo
> Belgium Xxx
> Belgium S_Foo Bar
> United Kingdom
> United Kingdom W_Foo-Bar
> United Kingdom M_Yay
> United Kingdom Xxx
> United Kingdom S_Derp
> United Kingdom F_Doh Lorem
> United Kingdom S_Ipsum Dolor
> United States of America L_Foo
> Macedonia F.Y.R. Xxx
> Macedonia F.Y.R. S_Foo Bar
> Cyprus (Greek) M_Foo
> Congo (Democratic Republic of)
> Congo (Democratic Republic of) Q_Yolo
> EOF
MATCH: group 1 is "Belgium", group 2 is ""
MATCH: group 1 is "Belgium", group 2 is "M_Foo"
MATCH: group 1 is "Belgium", group 2 is "A_Bar"
MATCH: group 1 is "Belgium", group 2 is "M_FooBar"
MATCH: group 1 is "Belgium", group 2 is "S_Whooptee Doo"
MATCH: group 1 is "Belgium", group 2 is "Xxx"
MATCH: group 1 is "Belgium", group 2 is "S_Foo Bar"
MATCH: group 1 is "United Kingdom", group 2 is ""
MATCH: group 1 is "United Kingdom", group 2 is "W_Foo-Bar"
MATCH: group 1 is "United Kingdom", group 2 is "M_Yay"
MATCH: group 1 is "United Kingdom", group 2 is "Xxx"
MATCH: group 1 is "United Kingdom", group 2 is "S_Derp"
MATCH: group 1 is "United Kingdom", group 2 is "F_Doh Lorem"
MATCH: group 1 is "United Kingdom", group 2 is "S_Ipsum Dolor"
MATCH: group 1 is "United States of America", group 2 is "L_Foo"
MATCH: group 1 is "Macedonia", group 2 is "F.Y.R. Xxx"
MATCH: group 1 is "Macedonia", group 2 is "F.Y.R. S_Foo Bar"
MATCH: group 1 is "Cyprus (Greek)", group 2 is "M_Foo"
MATCH: group 1 is "Congo (Democratic Republic of)", group 2 is ""
MATCH: group 1 is "Congo (Democratic Republic of)", group 2 is "Q_Yolo"
answered 2013-01-17T10:45:29.743

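/(?:^(.+)\s+(Xxx|[A-Z]_.+)$|^(.+)$)/gm will match all of your strings. However, any line that contains only a country is captured in the third group rather than the first, so check for that when you read the results.

Demo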
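
Reading the groups back could look like the sketch below. It is written in Python only as an illustration (the question targets .NET, where the equivalent check would test whether group 1 participated in the match). The function name key_and_value is hypothetical, and the pattern is applied to one line at a time, so the g and m flags are not needed here.

```python
import re

# Same two-branch pattern as above, without the /gm flags because each
# line is matched separately.
TWO_BRANCH = re.compile(r"(?:^(.+)\s+(Xxx|[A-Z]_.+)$|^(.+)$)")


def key_and_value(line):
    """Return (key, value) for one line; value is None for key-only lines."""
    m = TWO_BRANCH.match(line)
    if m is None:
        return None  # e.g. an empty line
    if m.group(1) is not None:
        # First branch matched: key in group 1, value in group 2.
        return m.group(1), m.group(2)
    # Second branch matched: the whole line is the key, held in group 3.
    return m.group(3), None


for line in ["United Kingdom F_Doh Lorem", "Congo (Democratic Republic of)"]:
    print(key_and_value(line))
```

Because group 1 is greedy here, the value starts at the last place in the line where Xxx or an uppercase letter plus underscore follows whitespace. That agrees with the sample data, but a value that itself contains a later "X_" token after a space would be split at that later point.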

answered 2013-01-17T10:46:20.483