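I need to find the dates in a string, along with their positions. Consider the example string

"The interesting date is 4 days from today and it is 20th july of this year, another date is 18th Feb 1997"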

I need the following output (assuming today is 2013-07-14):
2013-07-18, position 25
2013-07-20, position 56
1997-02-18, position 93

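I have managed to write code that extracts the parts of the string that are recognized as dates. It needs to be enhanced or changed to produce the output above. Any hints or help are appreciated: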

    Properties props = new Properties();
    AnnotationPipeline pipeline = new AnnotationPipeline();
    pipeline.addAnnotator(new PTBTokenizerAnnotator(false));
    pipeline.addAnnotator(new WordsToSentencesAnnotator(false));
    pipeline.addAnnotator(new POSTaggerAnnotator(false));
    pipeline.addAnnotator(new TimeAnnotator("sutime", props));

    Annotation annotation = new Annotation("The interesting date is 4 days from today and it is 20th july of this year, another date is 18th Feb 1997");
    annotation.set(CoreAnnotations.DocDateAnnotation.class, "2013-07-14");
    pipeline.annotate(annotation);
    List<CoreMap> timexAnnsAll = annotation.get(TimeAnnotations.TimexAnnotations.class);
    timexAnnsAll.each(){
        println it
    }

With the code above, I get this output:
4 days from today
20th july of this year
18th Feb 1997

Edit:
I managed to get the date part with the following change:

    timexAnnsAll.each(){it ->
        Timex timex = it.get(TimeAnnotations.TimexAnnotation.class);
        println timex.val + " from : $it"
    }

The output is now:
2013-07-18 from : 4 days from today
2013-07-20 from : 20th july of this year
1997-02-18 from : 18th Feb 1997
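
As a sanity check on the resolved values, the same dates can be reproduced with plain `java.time` arithmetic against the document date. This is only a sketch: the class `ResolvedDateCheck` and its helper methods are hypothetical names for illustration, and it does not show how SUTime resolves expressions internally.

```java
import java.time.LocalDate;

class ResolvedDateCheck {
    // Resolve "N days from today" against an ISO-8601 reference date.
    static String daysFrom(String docDate, int days) {
        return LocalDate.parse(docDate).plusDays(days).toString();
    }

    // Resolve a month and day "of this year" using the reference date's year.
    static String thisYear(String docDate, int month, int day) {
        int year = LocalDate.parse(docDate).getYear();
        return LocalDate.of(year, month, day).toString();
    }

    public static void main(String[] args) {
        String docDate = "2013-07-14"; // same value as DocDateAnnotation in the question
        System.out.println(daysFrom(docDate, 4));     // "4 days from today"
        System.out.println(thisYear(docDate, 7, 20)); // "20th july of this year"
    }
}
```

Four days after 2013-07-14 is 2013-07-18, which agrees with the value reported for the first expression.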

All I need to solve now is finding the position of each date in the original string.

1 Answer

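Each CoreMap in the list returned by `annotation.get(TimeAnnotations.TimexAnnotations.class)` is an Annotation, so you can read its other attributes, such as its list of tokens. Each token stores its character offset information. So you can finish your example like this: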

    List<CoreMap> timexAnnsAll = annotation.get(TimeAnnotations.TimexAnnotations.class);
    for (CoreMap cm : timexAnnsAll) {
      List<CoreLabel> tokens = cm.get(CoreAnnotations.TokensAnnotation.class);
      System.out.println(cm +
              " [from char offset " +
              tokens.get(0).get(CoreAnnotations.CharacterOffsetBeginAnnotation.class) +
              " to " + tokens.get(tokens.size() -1)
              .get(CoreAnnotations.CharacterOffsetEndAnnotation.class) + ']');
      /* -- This shows printing out each token and its character offsets
      for (CoreLabel token : tokens) {
        System.out.println(token +
                ", start: " + token.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class) +
                ", end: " + token.get(CoreAnnotations.CharacterOffsetEndAnnotation.class));
      }
      */
    }

The output is then:

4 days from today [from char offset 24 to 41]
20th july of this year [from char offset 52 to 74]
18th Feb 1997 [from char offset 92 to 105]
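
The offsets above are 0-based, with the begin offset inclusive and the end offset exclusive, so they can be passed straight to `String.substring`. The sketch below uses only the standard library to confirm that each reported span recovers the matched text. The class `OffsetCheck` and its methods are hypothetical names, and the `+ 1` in `report` assumes a 1-based "position" is what the question wants.

```java
class OffsetCheck {
    static final String TEXT =
        "The interesting date is 4 days from today and it is 20th july of this year, another date is 18th Feb 1997";

    // Recover the text covered by [begin, end): begin inclusive, end exclusive.
    static String span(int begin, int end) {
        return TEXT.substring(begin, end);
    }

    // Format a resolved date with its position; adding 1 converts the
    // 0-based offset to a 1-based position (an assumption about the desired format).
    static String report(String value, int begin) {
        return value + ", position " + (begin + 1);
    }

    public static void main(String[] args) {
        System.out.println(span(24, 41));
        System.out.println(span(52, 74));
        System.out.println(span(92, 105));
        System.out.println(report("2013-07-18", 24));
    }
}
```

Searching for a matched phrase with `TEXT.indexOf(...)` gives the same begin offsets for this string, but it becomes ambiguous when the same phrase occurs more than once, which is why the token offsets are the safer source.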
answered 2013-07-14T22:07:07.643