php - Removing duplicate documents from a Zend Lucene index

Question

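Actually, the way I build and optimize the index is to create and optimize it one chunk of records at a time, rather than converting all the records in one pass. The problem I now face is that I have ended up with duplicate documents/records in the index. I need to know whether there is any function or code that removes duplicates from the index. Thanks in advance.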

score 2 · Accepted Answer

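You need to delete the record before updating it; that is how Lucene works. You cannot update an existing record in place.

This is how to delete a record: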

// 'data/index' is the index path that Lucene generated
$index = Zend_Search_Lucene::open('data/index');

// 'listing_id' is a field I added when creating the index for the first time;
// $listing_id is the id value of the row I want to delete
$query = new Zend_Search_Lucene_Search_Query_Term(
    new Zend_Search_Lucene_Index_Term($listing_id, 'listing_id')
);
$hits = $index->find($query);
foreach ($hits as $hit) {
    // $hit->id is not listing_id; it is Lucene's internal id of the
    // document that has listing_id = $listing_id
    $index->delete($hit->id);
}

Now you can do the update, which is essentially an insert :). That is how Lucene works.

score 0 · Accepted Answer

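You should have a term that serves as a unique identifier. Then, before adding a document to the index, delete any existing document that has the same identifier.

A duplicate is simply a case where you have more than one document with the same unique ID. So you just need to enumerate all the terms in the unique-id field and look for the terms that match two or more documents. As far as I know, there is no built-in way to do this.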
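The enumeration described above could be sketched roughly as follows. This is a minimal sketch, not a tested cleanup tool: removeDuplicateDocuments is a hypothetical helper name, and it assumes that $index behaves like a Zend_Search_Lucene index (terms(), termDocs(), isDeleted(), delete(), commit()), that the id field was indexed untokenized (for example as a keyword field) so each id is a single term, and that the highest internal document id is the most recently added copy.

```php
<?php
/**
 * Delete all but one document for every value of a unique-id field.
 * Hypothetical helper; returns the number of documents deleted.
 */
function removeDuplicateDocuments($index, $field)
{
    $deleted = 0;

    foreach ($index->terms() as $term) {
        if ($term->field !== $field) {
            continue; // only look at terms of the unique-id field
        }

        // Collect the internal ids of live documents containing this id value.
        $docIds = array();
        foreach ($index->termDocs($term) as $docId) {
            if (!$index->isDeleted($docId)) {
                $docIds[] = $docId;
            }
        }

        if (count($docIds) < 2) {
            continue; // this id value is not duplicated
        }

        // Assumption: the highest internal id is the newest copy, so keep it.
        sort($docIds);
        array_pop($docIds);
        foreach ($docIds as $docId) {
            $index->delete($docId);
            $deleted++;
        }
    }

    if ($deleted > 0) {
        $index->commit(); // make the deletions visible to later searches
    }

    return $deleted;
}
```

It would be called as removeDuplicateDocuments(Zend_Search_Lucene::open('data/index'), 'listing_id'). Note that terms() returns every term in the index as one array, so on a large index this pass may be slow and memory-hungry; deleting by id before each insert avoids needing it at all.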

score 0 · Accepted Answer

Don't forget to call $index->commit() before adding any new data. That was the reason $index->find($query) kept returning my duplicate data.

$index = Zend_Search_Lucene::open('/lucene/index');
$query = new Zend_Search_Lucene_Search_Query_Term(new Zend_Search_Lucene_Index_Term($id, 'key'));

$hits = $index->find($query);
foreach ($hits as $hit) {
    // $hit->id is not the key; it is Lucene's internal id of the document that has key = $id
    $index->delete($hit->id);
}
$index->commit(); // apply the deletions before indexing new data

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::keyword('key', $id));
$doc->addField(Zend_Search_Lucene_Field::text('user', $user, 'utf-8'));
$index->addDocument($doc); // add the fresh document to the index
