solr - Apache Solr 4 - index does not grow after the first commit

Question

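I wrote my own plugin for Apache Nutch 2.2.1 to crawl images, videos, and podcasts from selected sites (my seed list contains 180 URLs). The plugin stores this metadata in HBase, and now I want to save it to the index (Solr). There is a lot of metadata to index (web pages + images + videos + podcasts).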

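I run the whole process with the Nutch script bin/crawl (inject, generate, fetch, parse... and finally solrindex and dedup), but I have a problem. The first time I run the script, about 6000 documents are stored in the index (roughly 3700 image documents, 1700 web page documents, and the rest are videos and podcasts). That is fine...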

But...

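When I run the script a second time, a third time, and so on, the number of documents in the index does not increase (it stays at 6000), while the number of rows stored in the HBase table keeps growing (it is now 97383 rows)...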

Does anyone know where the problem is? I have been fighting this for a long time and I am out of ideas... In case it helps, here is my solrconfig.xml: http://pastebin.com/uxMW2nuq and here is my nutch-site.xml: http://pastebin.com/4bj1wdmT

When I look at the Solr log, I see:

SEVERE: auto commit error...:java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit 
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2668) 
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2834) 
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2814) 
        at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:529) 
        at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216) 
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) 
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) 
        at java.util.concurrent.FutureTask.run(FutureTask.java:166) 
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) 
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292) 
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
        at java.lang.Thread.run(Thread.java:722)

score 1 · Accepted Answer

Have you tried using a lower value for autocommit? Try committing every 100 documents so that Solr does not hold too much uncommitted data in memory.
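
As a rough illustration, a lower autocommit threshold might look like the fragment below in solrconfig.xml. This is only a sketch: it assumes the standard `<autoCommit>` block inside `<updateHandler>` as shipped with Solr 4, and the `maxTime` and `openSearcher` values are illustrative assumptions, not taken from the poster's configuration.

```xml
<!-- Sketch only: goes inside the <config> element of solrconfig.xml. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- Hard commit after every 100 added documents, as suggested above. -->
    <maxDocs>100</maxDocs>
    <!-- Assumed value: also commit at most 15 seconds after the first uncommitted add. -->
    <maxTime>15000</maxTime>
    <!-- Assumed value: flush to disk without opening a new searcher on each commit. -->
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```

If only an `<autoCommit>` block needs changing in an existing `<updateHandler>`, edit that block in place rather than adding a second `<updateHandler>` element, then restart or reload the core so the new setting takes effect.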
