Currently, I use the following regular expression to split a document into sentences:
Pattern.compile("(?<=\\w[\\w\\)\\]](?<!Mrs?|Dr|Rev|Mr|Ms|vs|abd|ABD|Abd|resp|St|wt)[\\.\\?\\!\\:\\@]\\s)");
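This almost works. For example, given this string:
"Mary had a little lamb (i.e. lamby pie). Here are its properties: 1. It has four feet 2. It has fleece 3. It is a mammal. It had white fleese. Her father, Mr. Lamb, lives on Mulbery St. in a little white house."
I get the following sentences: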
Mary had a little lamb (i.e. lamby pie).
Here are its properties:
1. It has four feet 2. It has fleece 3. It is a mammal.
It had white fleese.
Her father, Mr. Lamb, lives on Mulbery St. in a little white house.
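For anyone who wants to reproduce the split above, here is a minimal harness. It is only a sketch: the class name CurrentSplitDemo and the helper split are made up for illustration, and the input string is reconstructed from the sentences listed above.

```java
import java.util.regex.Pattern;

public class CurrentSplitDemo {
    // The pattern from the question, unchanged.
    static final Pattern SENTENCE_END = Pattern.compile(
            "(?<=\\w[\\w\\)\\]](?<!Mrs?|Dr|Rev|Mr|Ms|vs|abd|ABD|Abd|resp|St|wt)[\\.\\?\\!\\:\\@]\\s)");

    // Hypothetical helper: splits at every zero-width position the pattern matches.
    static String[] split(String text) {
        return SENTENCE_END.split(text);
    }

    public static void main(String[] args) {
        // Sample input, reconstructed from the output shown in the question.
        String text = "Mary had a little lamb (i.e. lamby pie). Here are its properties: "
                + "1. It has four feet 2. It has fleece 3. It is a mammal. "
                + "It had white fleese. "
                + "Her father, Mr. Lamb, lives on Mulbery St. in a little white house.";
        for (String s : split(text)) {
            System.out.println(s.trim());
        }
    }
}
```

The numbered items stay on one line because the character before each "1." or "2." is a space, so the leading \w in the lookbehind cannot match there.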
However, what I want is:
Mary had a little lamb (i.e. lamby pie).
Here are its properties:
1. It has four feet
2. It has fleece
3. It is a mammal.
It had white fleese.
Her father, Mr. Lamb, lives on Mulbery St. in a little white house.
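Is there a way to do this by changing the existing regular expression?
To accomplish this for now, I first do an initial split and then check each piece for bullets. The following code works, but I am wondering whether there is a more elegant solution: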
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void doHomeMadeSentenceParser(String temp) {
    // First pass: split after sentence-ending punctuation, skipping known abbreviations.
    Pattern p = Pattern
            .compile("(?<=\\w[\\w\\)\\]](?<!Mrs?|Dr|Rev|Mr|Ms|vs|abd|ABD|Abd|resp|St|wt)[\\.\\?\\!\\:\\@]\\s)");
    String[] sentences = p.split(temp);
    List<String> psentences = new ArrayList<>();
    // Second pass: break each piece before every numbered bullet such as "1. " or "2) ".
    Pattern p1 = Pattern.compile("\\b\\d+[.)]\\s");
    for (String sentence : sentences) {
        Matcher matcher = p1.matcher(sentence);
        int bstart = 0;
        while (matcher.find()) {
            // Text between the previous bullet (or the start) and this bullet.
            String bullet = sentence.substring(bstart, matcher.start());
            if (bullet.length() > 0) {
                psentences.add(bullet);
            }
            bstart = matcher.start();
        }
        // Remainder after the last bullet, or the whole piece if no bullet was found.
        psentences.add(sentence.substring(bstart));
    }
    for (String s : psentences) {
        System.out.println(s.trim());
    }
}
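One direction that might fold both passes into a single pattern is to add a second alternative that matches the zero-width position just before a numbered bullet. The following is a sketch, checked only against the sample string above; the class name BulletAwareSplitDemo and the helper split are hypothetical, and the added alternative reuses the bullet pattern from the second pass as a lookahead.

```java
import java.util.regex.Pattern;

public class BulletAwareSplitDemo {
    // Original pattern, plus an alternative that matches right before "<digits>." or "<digits>)".
    static final Pattern SENTENCE_OR_BULLET = Pattern.compile(
            "(?<=\\w[\\w\\)\\]](?<!Mrs?|Dr|Rev|Mr|Ms|vs|abd|ABD|Abd|resp|St|wt)[\\.\\?\\!\\:\\@]\\s)"
            + "|\\b(?=\\d+[.)]\\s)");

    // Hypothetical helper: one split call instead of a split followed by a bullet scan.
    static String[] split(String text) {
        return SENTENCE_OR_BULLET.split(text);
    }

    public static void main(String[] args) {
        // Sample input, reconstructed from the output shown in the question.
        String text = "Mary had a little lamb (i.e. lamby pie). Here are its properties: "
                + "1. It has four feet 2. It has fleece 3. It is a mammal. "
                + "It had white fleese. "
                + "Her father, Mr. Lamb, lives on Mulbery St. in a little white house.";
        for (String s : split(text)) {
            System.out.println(s.trim());
        }
    }
}
```

Like the two-pass version, this treats any number followed by a period and whitespace as a bullet, so a sentence that ends in a number (for example a year) would also be split before that number.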
Thanks in advance for your help.
Elliott