
I have 40,000 lines and need to split each line into sentences. Right now I'm using a pattern like this:

String patternStr2 = "\\s*[\"']?\\s*([A-Z0-9].*?[\\.\\?!]\\s)['\"]?\\s*";

It handles almost all sentences, but a line like this: U.S. Navy, World War I. gets split into 2 parts: U.S. and Navy, World War I.
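A minimal way to reproduce this is sketched below. The question does not say how the pattern is applied, so the use of Matcher.find() and the names ReproSplit and matchSentences are assumptions made for illustration. The lazy .*? stops at the first dot that is followed by whitespace, and that is the dot ending U.S.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReproSplit {
    // The pattern from the question, unchanged.
    private static final Pattern SENTENCE =
            Pattern.compile("\\s*[\"']?\\s*([A-Z0-9].*?[\\.\\?!]\\s)['\"]?\\s*");

    // Collects group 1 of every match, trimmed.
    // Applying the pattern with find() is an assumption, not stated in the question.
    static List<String> matchSentences(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = SENTENCE.matcher(line);
        while (m.find()) {
            found.add(m.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // Trailing space added so the last sentence satisfies the required \s.
        for (String s : matchSentences("U.S. Navy, World War I. ")) {
            System.out.println(s);
        }
    }
}
```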

Is there a way to handle this?


3 Answers


Why try to match when what you want is to split?

Use the following regex:

(?<!\..)\.(?!.\.)

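Explanation:

  1. (?<!\..): negative lookbehind; checks that there is no dot two characters behind.

  2. \.: matches a dot.

  3. (?!.\.): negative lookahead; checks that there is no dot two characters ahead.

Online demo

Note: I'm not sure how to do this in Java, but I think you should try (?<!\\..)\\.(?!.\\.). Also, don't forget to add the dot back to each split sentence.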
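On the Java side, a minimal sketch of splitting on this regex and adding the dot back might look like the following. The names SplitOnDots and splitSentences are made up for illustration. It assumes every sentence ends in a dot, because split discards the delimiter and the code appends "." to every piece.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SplitOnDots {
    // The regex from this answer, escaped for a Java string literal.
    private static final Pattern SENTENCE_DOT =
            Pattern.compile("(?<!\\..)\\.(?!.\\.)");

    // Splits on sentence-ending dots, then re-appends the dot that split() removed.
    static List<String> splitSentences(String line) {
        List<String> sentences = new ArrayList<>();
        for (String piece : SENTENCE_DOT.split(line)) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed + ".");
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String line = "U.S. Navy, World War I. U.S. Air Force, World War III.";
        for (String sentence : splitSentences(line)) {
            System.out.println(sentence);
        }
    }
}
```

The dots inside U.S. are skipped because each one has another dot two characters before or after it. An abbreviation that ends a sentence (for example "... in the U.S. He left") would not be split by this regex.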

answered 2013-05-16T08:18:51.253

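OK, I don't think you should use a regex for this, but I couldn't resist throwing one in.

If this is hard to understand, let me know and I'll add some comments...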

package test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    // Group 2: the first sentence (matched lazily).
    // Group 5: the rest of the line, present only when a ". " is found whose dot
    // is not preceded by a dot plus a capital letter (as in "U.S.").
    private static final Pattern SENTENCE_DELIMITER = 
            Pattern.compile("((.+?)((?<!\\.[A-Z])(\\.\\s)(.+))?)");

    public static void main(String[] args) {
        String lineWithOneSentence = 
                "U.S. Navy, World War I";
        String lineWithTwoSentences = 
                "U.S. Navy, World War I. U.S. Air Force, World War III.";
        printSentences(lineWithOneSentence);
        printSentences(lineWithTwoSentences);
    }

    // Prints the whole match and the two captured sentences for one line.
    // group(5) is null when the line contains no sentence-ending ". ".
    private static void printSentences(String line) {
        Matcher matcher = SENTENCE_DELIMITER.matcher(line);
        if (matcher.matches()) {
            System.out.println("WHOLE MATCH: " + matcher.group(0));
            System.out.println("FIRST SENTENCE: " + matcher.group(2));
            System.out.println("SECOND SENTENCE: " + matcher.group(5));
        }
    }
}

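The approach here is:

  • use groups
  • use a negative lookbehind on dots followed by whitespace, to make sure they are not preceded by a dot followed by a capital letter (like the ".S" in "U.S.")

This is a bit overkill, and it may become a problem at some point, i.e. if the punctuation in your text is inconsistent.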


Output

WHOLE MATCH: U.S. Navy, World War I
FIRST SENTENCE: U.S. Navy, World War I
SECOND SENTENCE: null
WHOLE MATCH: U.S. Navy, World War I. U.S. Air Force, World War III.
FIRST SENTENCE: U.S. Navy, World War I
SECOND SENTENCE: U.S. Air Force, World War III.
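The program above only separates the first sentence from the rest of the line. As a rough sketch of how the same lookbehind idea could handle any number of sentences, one option is to split on the whitespace that follows a sentence-ending dot. This is a variation, not the code above: the regex is rearranged so that the dot stays with its sentence, and the names SplitAll and split are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class SplitAll {
    // Split on a whitespace character that
    //   - comes right after a dot:                      (?<=\.)
    //   - where that dot does not follow ".<capital>":  (?<!\.[A-Z]\.)
    private static final String SENTENCE_GAP = "(?<!\\.[A-Z]\\.)(?<=\\.)\\s";

    static List<String> split(String line) {
        return Arrays.asList(line.split(SENTENCE_GAP));
    }

    public static void main(String[] args) {
        String line = "U.S. Navy, World War I. U.S. Air Force, World War III. The end.";
        for (String sentence : split(line)) {
            System.out.println(sentence);
        }
    }
}
```

Only the whitespace is consumed by the split, so nothing has to be appended afterwards. The same caveat applies as before: it relies on the text being punctuated consistently.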
answered 2013-05-16T07:44:38.577

String patternStr2 = " (?<!\\..)(?<![A-Z].)[\\.\\?!](?!.\\.)"; Then use the Java Matcher find() method to get all the sentences.
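A minimal sketch of the find() loop, under two assumptions: the leading space in the pattern above is treated as unintended and dropped, and each match is taken to mark the end of a sentence. The names FindSentences and sentences are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindSentences {
    // The pattern from this answer, without the leading space (an assumption).
    private static final Pattern BOUNDARY =
            Pattern.compile("(?<!\\..)(?<![A-Z].)[\\.\\?!](?!.\\.)");

    // Cuts the line after every boundary found by find().
    static List<String> sentences(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = BOUNDARY.matcher(line);
        int start = 0;
        while (m.find()) {
            addIfNotBlank(result, line.substring(start, m.end()));
            start = m.end();
        }
        // Keep whatever follows the last boundary as a final sentence.
        addIfNotBlank(result, line.substring(start));
        return result;
    }

    private static void addIfNotBlank(List<String> out, String piece) {
        String trimmed = piece.trim();
        if (!trimmed.isEmpty()) {
            out.add(trimmed);
        }
    }

    public static void main(String[] args) {
        for (String s : sentences("Is this the U.S. Navy? Yes! It is.")) {
            System.out.println(s);
        }
    }
}
```

Note that (?<![A-Z].) rejects any dot whose second-to-last preceding character is a capital letter, so the final dot of "World War III." is not reported as a boundary. The sketch still returns that sentence because the text after the last boundary is kept.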

answered 2013-06-24T23:42:22.413