7

I'm trying to match a regular expression to textbook definitions that I get from a website. The definition always has the word with a new line followed by the definition. For example:

Zither
 Definition: An instrument of music used in Austria and Germany It has from thirty to forty wires strung across a shallow sounding board which lies horizontally on a table before the performer who uses both hands in playing on it Not to be confounded with the old lute shaped cittern or cithern

In my attempts to get just the word (in this case "Zither") I keep getting the newline character.

I tried both ^(\w+)\s and ^(\S+)\s without much luck. I thought that maybe ^(\S+)$ would work, but that doesn't seem to match the word at all. I've been testing with rubular (http://rubular.com/r/LPEHCnS0ri), which matches all my attempts the way I want, even though Java doesn't.

Here's my snippet

String str = ...; //Here the string is assigned a word and definition taken from the internet like given in the example above.
Pattern rgx = Pattern.compile("^(\\S+)$");
Matcher mtch = rgx.matcher(str);
if (mtch.find()) {
    String result = mtch.group();
    terms.add(new SearchTerm(result, System.nanoTime()));
}

This is easily solved by trimming the resulting string, but that seems like it should be unnecessary if I'm already using a regular expression.

All help is greatly appreciated. Thanks in advance!

4

5 Answers

9

Try using the Pattern.MULTILINE option:

Pattern rgx = Pattern.compile("^(\\S+)$", Pattern.MULTILINE);

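This makes the regex recognize line separators inside the string; without it, ^ and $ match only at the beginning and end of the entire input.

Although it makes no difference for this pattern, note that Matcher.group() returns the entire match, whereas Matcher.group(int) returns the match for the specific capture group (...) identified by the number you pass. Your pattern specifies one capture group, and that is what you want to capture. If you had included \s in the pattern, as in your earlier attempts, Matcher.group() would have included that whitespace in its return value.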
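To make the effect of the flag concrete, here is a minimal sketch that runs the question's pattern with and without Pattern.MULTILINE. The class name MultilineDemo, the helper firstWord, and the shortened sample string are made up for illustration; only Pattern, Matcher and the pattern itself come from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question.
class MultilineDemo {

    // Returns the first line made up only of non-whitespace characters,
    // or null if the pattern does not match with the given flags.
    static String firstWord(String text, int flags) {
        Matcher m = Pattern.compile("^(\\S+)$", flags).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the word + definition text.
        String str = "Zither\n Definition: An instrument of music";

        // Without flags, $ does not match before the newline in the middle.
        System.out.println(firstWord(str, 0));                 // prints "null"

        // With MULTILINE, ^ and $ also match at line boundaries.
        System.out.println(firstWord(str, Pattern.MULTILINE)); // prints "Zither"
    }
}
```

This would also explain the difference from rubular: Ruby treats ^ and $ as line anchors by default, while Java only does so when MULTILINE is set.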

answered 2013-08-15T20:52:45.573
2

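With regular expressions, the first group (group 0) is always the complete matched string. In your case you want group 1, not group 0.

So changing mtch.group() to mtch.group(1) should do the trick: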

 String str = ...; //Here the string is assigned a word and definition taken from the internet like given in the example above.
 Pattern rgx = Pattern.compile("^(\\w+)\\s");
 Matcher mtch = rgx.matcher(str);
 if (mtch.find()) {
     String result = mtch.group(1);
     terms.add(new SearchTerm(result, System.nanoTime()));
 }
answered 2013-08-15T20:56:08.790
2

A late response, but if you are not using Pattern and Matcher, you can enable DOTALL inside the regex string itself with this alternative:

(?s)[Your Expression]

Basically, (?s) tells the dot to match all characters, including line breaks.

Details: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
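As a rough illustration of the inline flag, the sketch below uses only String.matches and String.replaceFirst, so no Pattern or Matcher objects appear in user code. The class InlineFlagDemo, its two helpers, and the sample string are assumptions made up for this example.

```java
// Hypothetical demo class; only the (?s) flag itself comes from the answer.
class InlineFlagDemo {

    // True if the regex matches the whole text (String.matches semantics).
    static boolean wholeTextMatches(String text, String regex) {
        return text.matches(regex);
    }

    // Keeps only the leading word by replacing the entire text with group 1.
    // (?s) lets the trailing .* run across the newline and the definition.
    static String headword(String text) {
        return text.replaceFirst("(?s)^(\\w+)\\s.*", "$1");
    }

    public static void main(String[] args) {
        String str = "Zither\n Definition: An instrument of music";

        // Without (?s) the dot stops at the newline, so the whole text cannot match.
        System.out.println(wholeTextMatches(str, "\\w+.*"));     // prints "false"

        // With (?s) the dot also matches the newline.
        System.out.println(wholeTextMatches(str, "(?s)\\w+.*")); // prints "true"

        System.out.println(headword(str));                       // prints "Zither"
    }
}
```

Note that (?s) changes only what the dot matches. The inline counterpart of Pattern.MULTILINE, which is what changes ^ and $, is (?m).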

answered 2016-06-03T11:06:46.313
1

Just replace:

String result = mtch.group();

with:

String result = mtch.group(1);

This limits your output to the contents of the capture group (e.g. (\\w+)).
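A small sketch of the difference, assuming the ^(\w+)\s pattern from the question; the class GroupDemo and the helper wholeAndGroup are hypothetical names used only here:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing group() with group(1).
class GroupDemo {

    // Returns { whole match, capture group 1 }, or null if nothing matches.
    static String[] wholeAndGroup(String text) {
        Matcher m = Pattern.compile("^(\\w+)\\s").matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(), m.group(1) };
    }

    public static void main(String[] args) {
        String[] r = wholeAndGroup("Zither\n Definition: An instrument of music");

        // group() is the whole match, including the newline consumed by \s.
        System.out.println(r[0].length()); // prints "7"

        // group(1) is only what the parentheses captured.
        System.out.println(r[1].length()); // prints "6"
    }
}
```

The extra character in the whole match is the newline that the question was trimming away.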

answered 2013-08-15T21:01:41.450
0

Try this:

/* The regex pattern: ^(\w+)\r?\n(.*)$ */
private static final Pattern REGEX_PATTERN =
        Pattern.compile("^(\\w+)\\r?\\n(.*)$");

public static void main(String[] args) {
    String input = "Zither\n Definition: An instrument of music";

    System.out.println(
        REGEX_PATTERN.matcher(input).matches()
    );  // prints "true"

    System.out.println(
        REGEX_PATTERN.matcher(input).replaceFirst("$1 = $2")
    );  // prints "Zither =  Definition: An instrument of music"

    System.out.println(
        REGEX_PATTERN.matcher(input).replaceFirst("$1")
    );  // prints "Zither"
}
answered 2013-08-15T20:54:24.513