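Background
I'm using Solr 4.0.0. I have indexed the text of a set of sample documents with term vectors enabled, so that I can use the Fast Vector Highlighter: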
<field name="raw_text" type="text_en" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
For highlighting, I use the Break Iterator boundary scanner with SENTENCE boundaries:
<boundaryScanner name="breakIterator" class="solr.highlight.BreakIteratorBoundaryScanner">
<lst name="defaults">
<!-- type should be one of CHARACTER, WORD(default), LINE and SENTENCE -->
<str name="hl.bs.type">SENTENCE</str>
</lst>
</boundaryScanner>
I run a simple query:
http://localhost:8983/solr/documents/select?q=raw_text%3AArtibonite&wt=xml&hl=true&hl.fl=raw_text&hl.useFastVectorHighlighter=true&hl.snippets=100&hl.boundaryScanner=breakIterator
and the highlighting works well:
<response>
...
<result name="response" numFound="5" start="0">
<doc>
<str name="id">-1071691270</str>
<str name="raw_text">
Final Report of the Independent Panel of Experts on the Cholera
Outbreak in Haiti Dr. Alejando Cravioto (Chair) International
Center for Diarrhoeal Disease Research, Dhaka, Bangladesh Dr.
Claudio F. Lanata Instituto de Investigación Nutricional, and
The US Navy Medical Research Unit 6, Lima, Peru Engr. Daniele
S. Lantagne Harvard University... ~SNIP~
</str>
</doc>
<lst name="highlighting">
<lst name="-1071691270">
<arr name="raw_text">
...
<str>
The timeline suggests that the outbreak spread along
the <em>Artibonite</em> River. After establishing that
the cases began in the upper reaches of the Artibonite
River, potential sources of contamination that could have
initiated the outbreak were investigated.
</str>
...
</arr>
</lst>
</lst>
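Question
I would like to send the resulting sentences on for further processing (entity extraction, etc.), but I want to keep track of the start/end offsets of each highlighted sentence within the original (long) text field. Is there a straightforward way to do this?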
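To make the goal concrete, here is a minimal client-side sketch of what I mean by recovering offsets: strip the `<em>` markers from a returned snippet and search for the remaining text in the stored `raw_text` value. The function name `locate_snippet` is made up for illustration and is not part of Solr. It assumes the snippet is an exact substring of the stored field once the tags are removed (no HTML encoding or whitespace normalization), and it only finds the first occurrence at or after `start_from`.

```python
import re

# Default highlight markers seen in the response above.
TAG_RE = re.compile(r"</?em>")


def locate_snippet(raw_text, snippet, start_from=0):
    """Return (start, end) offsets of a highlighted snippet in raw_text, or None.

    Hypothetical helper: assumes the snippet, minus <em> tags and
    surrounding whitespace, occurs verbatim in raw_text.
    """
    plain = TAG_RE.sub("", snippet).strip()
    start = raw_text.find(plain, start_from)
    if start == -1:
        return None
    return start, start + len(plain)


if __name__ == "__main__":
    raw = ("Cases were first reported in October. The timeline suggests that "
           "the outbreak spread along the Artibonite River. Sources were investigated.")
    snip = "The timeline suggests that the outbreak spread along the <em>Artibonite</em> River. "
    span = locate_snippet(raw, snip)
    print(span, repr(raw[span[0]:span[1]]))
```

This breaks down if the same sentence appears more than once in the document, which is why I would prefer offsets reported by the highlighter itself.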
Or would it be better to set hl.fragsize so that the whole field is returned, and then process it and extract the sentences of interest that way?
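For comparison, the whole-field alternative might look roughly like the sketch below: split the full text into sentences on the client, keeping each sentence's offsets, and retain those that mention the query term. Everything here is an assumption for illustration. `sentences_containing` is a hypothetical helper, the regex is a naive splitter that is not equivalent to the `BreakIterator` that Solr uses (it will split wrongly after abbreviations such as "Dr."), and the plain substring test ignores whatever analysis (stemming, etc.) the `text_en` field type applies.

```python
import re

# Naive sentence pattern: a run of non-terminator characters followed by
# any trailing '.', '!' or '?'. An approximation, not Java's BreakIterator.
SENTENCE_RE = re.compile(r"[^.!?]+[.!?]*")


def sentences_containing(raw_text, term):
    """Return a list of (start, end, sentence) for sentences mentioning term.

    Offsets index into raw_text; matching is a case-insensitive substring test.
    """
    hits = []
    needle = term.lower()
    for match in SENTENCE_RE.finditer(raw_text):
        sentence = match.group()
        if needle not in sentence.lower():
            continue
        # Drop surrounding whitespace and shift the start offset accordingly.
        leading = len(sentence) - len(sentence.lstrip())
        trimmed = sentence.strip()
        start = match.start() + leading
        hits.append((start, start + len(trimmed), trimmed))
    return hits


if __name__ == "__main__":
    raw = ("Cases began upstream. The outbreak spread along the "
           "Artibonite River. Sources were investigated.")
    for start, end, text in sentences_containing(raw, "Artibonite"):
        print(start, end, text)
```

The offsets come for free this way, but the sentence boundaries and term matching are reimplemented outside Solr, so they may not agree with what the highlighter would have returned.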