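I have about a thousand Lucene documents that are already indexed, and I want to retrieve the per-document term frequency of every term across all documents. This is how I index them: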
HashMap<Integer, String> documentList = getEachDocumentSeparated();
Analyzer analyzer = new StandardAnalyzer();
Directory index = FSDirectory.open(Paths.get(RESULT_ADDRESS));
IndexWriterConfig config = new IndexWriterConfig(analyzer);
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
IndexWriter w = new IndexWriter(index, config);
FieldType fieldType = new FieldType((TextField.TYPE_STORED));
IndexOptions indexOptions = IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;
fieldType.setIndexOptions(indexOptions);
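for (Map.Entry<Integer, String> pair : documentList.entrySet())
{
    Document doc = new Document();
    Field bodyField = new Field("body", pair.getValue(), fieldType);
    // StringField takes a String value, so the Integer key has to be converted
    doc.add(new StringField("id", String.valueOf(pair.getKey()), Field.Store.YES));
    doc.add(bodyField);
    w.addDocument(doc);
}
// Close the writer so the added documents are committed to the index
w.close();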
For example, I want to end up with a vector like the following:
sterm,1(5),2(10),330(2),500(1),1001(3)
meaning that sterm occurs 5 times in document 1, 10 times in document 2, and so on.
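
To make the target format concrete, here is a minimal sketch in plain Java of the structure I am after: a map from each term to its per-document counts, printed one term per line. It deliberately does not use the Lucene API. The class name TermVectorSketch, the helpers countTerms and formatLine, and the naive split-on-non-alphanumerics tokenization are all assumptions made for illustration; against a real index the counts would presumably come from the postings (for example TermsEnum together with PostingsEnum.freq()) rather than from re-tokenizing the text.

```java
import java.util.Map;
import java.util.TreeMap;

class TermVectorSketch {

    // Builds term -> (document id -> number of occurrences in that document).
    // Hypothetical helper: tokenization here is a crude stand-in for an Analyzer.
    static Map<String, Map<Integer, Integer>> countTerms(Map<Integer, String> docs) {
        Map<String, Map<Integer, Integer>> freqs = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            // Lowercase, then split on anything that is not a letter or digit.
            for (String token : doc.getValue().toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                if (token.isEmpty()) {
                    continue; // a leading separator produces one empty token
                }
                freqs.computeIfAbsent(token, t -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return freqs;
    }

    // Renders one term as "term,docId(freq),docId(freq),..." in ascending docId order.
    static String formatLine(String term, Map<Integer, Integer> perDoc) {
        StringBuilder sb = new StringBuilder(term);
        for (Map.Entry<Integer, Integer> e : perDoc.entrySet()) {
            sb.append(',').append(e.getKey()).append('(').append(e.getValue()).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new TreeMap<>();
        docs.put(1, "sterm alpha sterm");
        docs.put(2, "sterm beta");
        countTerms(docs).forEach((term, perDoc) ->
                System.out.println(formatLine(term, perDoc)));
    }
}
```

The document ids in this sketch are the map keys, which correspond to the stored "id" field above, not to Lucene's internal document numbers.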